Stochastic approximation

Stochastic approximation methods are a family of iterative methods typically used for root-finding problems or for optimization problems. The recursive update rules of stochastic approximation methods can be used, among other things, for solving linear systems when the collected data is corrupted by noise, or for approximating extreme values of functions which cannot be computed directly, but only estimated via noisy observations.

In a nutshell, stochastic approximation algorithms deal with a function of the form ${\textstyle f(\theta )=\operatorname {E} _{\xi }[F(\theta ,\xi )]}$, which is the expected value of a function depending on a random variable ${\textstyle \xi }$. The goal is to recover properties of such a function ${\textstyle f}$ without evaluating it directly. Instead, stochastic approximation algorithms use random samples of ${\textstyle F(\theta ,\xi )}$ to efficiently approximate properties of ${\textstyle f}$ such as zeros or extrema.

Recently, stochastic approximation has found extensive applications in statistics and machine learning, especially in settings with big data. These applications range from stochastic optimization methods and algorithms to online forms of the EM algorithm, reinforcement learning via temporal differences, and deep learning, among others.^[1] Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory.^[2]

The earliest, and prototypical, algorithms of this kind are the Robbins–Monro and Kiefer–Wolfowitz algorithms, introduced in 1951 and 1952 respectively.

Robbins–Monro algorithm

The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro,^[3] presented a methodology for solving a root-finding problem in which the function is represented as an expected value. Assume that we have a function ${\textstyle M(\theta )}$ and a constant ${\textstyle \alpha }$ such that the equation ${\textstyle M(\theta )=\alpha }$ has a unique root at ${\textstyle \theta ^{*}}$. It is assumed that while we cannot directly observe the function ${\textstyle M(\theta )}$, we can instead obtain measurements of the random variable ${\textstyle N(\theta )}$, where ${\textstyle \operatorname {E} [N(\theta )]=M(\theta )}$. The algorithm then generates iterates of the form:

$$\theta _{n+1}=\theta _{n}-a_{n}(N(\theta _{n})-\alpha )$$

Here, $a_{1},a_{2},\dots$ is a sequence of positive step sizes. Robbins and Monro proved^[3]^{, Theorem 2} that $\theta _{n}$ converges in $L^{2}$ (and hence also in probability) to $\theta ^{*}$, and Blum^[4] later proved that the convergence in fact holds with probability one, provided that:

${\textstyle N(\theta )}$ is uniformly bounded,
${\textstyle M(\theta )}$ is nondecreasing,
${\textstyle M'(\theta ^{*})}$ exists and is positive, and
The sequence ${\textstyle a_{n}}$ satisfies the following requirements:

$$\sum _{n=1}^{\infty }a_{n}=\infty \quad {\mbox{ and }}\quad \sum _{n=1}^{\infty }a_{n}^{2}<\infty$$

A particular sequence of steps that satisfies these conditions, suggested by Robbins and Monro, has the form ${\textstyle a_{n}=a/n}$ for ${\textstyle a>0}$. Other sequences are possible, but in order to average out the noise in ${\textstyle N(\theta )}$, the conditions above must be met.
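The recursion and the step size rule ${\textstyle a_{n}=a/n}$ can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the test function ${\textstyle M(\theta )=2\theta +1}$, the uniform noise, and the parameter values are all assumptions chosen for the example. The noise is bounded, but the measurement itself is only bounded on bounded sets, so the example does not match the assumptions of the theorem exactly.

```python
import random


def robbins_monro(noisy_measurement, alpha, theta0, a=1.0, n_steps=10_000):
    """Iterate theta_{n+1} = theta_n - a_n * (N(theta_n) - alpha) with a_n = a / n."""
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = a / n  # sum of a_n diverges, sum of a_n**2 converges
        theta = theta - a_n * (noisy_measurement(theta) - alpha)
    return theta


# Hypothetical test problem: M(theta) = 2 * theta + 1, observed with bounded noise.
rng = random.Random(0)


def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + rng.uniform(-1.0, 1.0)


# Solve M(theta) = 5, whose unique root is theta* = 2.
estimate = robbins_monro(noisy_measurement, alpha=5.0, theta0=0.0)
print(estimate)
```

With these settings the printed estimate should lie close to the root ${\textstyle \theta ^{*}=2}$; the exact value depends on the random seed.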

Complexity results

Suppose the Robbins–Monro algorithm is used to minimize a function ${\textstyle f(\theta )}$ over a set ${\textstyle \Theta }$ by finding a root of its gradient. If ${\textstyle f(\theta )}$ is twice continuously differentiable and strongly convex, and the minimizer of ${\textstyle f(\theta )}$ belongs to the interior of ${\textstyle \Theta }$, then the algorithm achieves the asymptotically optimal convergence rate with respect to the objective function, namely ${\textstyle \operatorname {E} [f(\theta _{n})-f^{*}]=O(1/n)}$, where ${\textstyle f^{*}}$ is the minimal value of ${\textstyle f(\theta )}$ over ${\textstyle \theta \in \Theta }$.^[5]^[6]
Conversely, in the general convex case, where both smoothness and strong convexity are lacking, Nemirovski and Yudin^[7] have shown that the asymptotically optimal convergence rate with respect to the objective function values is ${\textstyle O(1/{\sqrt {n}})}$. They have also proven that this rate cannot be improved.
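The ${\textstyle O(1/n)}$ rate can be checked by hand in a simple special case. The setting below is an illustrative assumption, not taken from the cited references: ${\textstyle f(\theta )=\theta ^{2}/2}$, so that ${\textstyle M(\theta )=f'(\theta )=\theta }$, ${\textstyle \alpha =0}$, ${\textstyle \theta ^{*}=0}$ and ${\textstyle f^{*}=0}$, with observations ${\textstyle N(\theta _{n})=\theta _{n}+\varepsilon _{n}}$, where the ${\textstyle \varepsilon _{n}}$ are independent with mean zero and variance ${\textstyle \sigma ^{2}}$, and step sizes ${\textstyle a_{n}=1/n}$. Unrolling the recursion by induction, starting from ${\textstyle \theta _{2}=-\varepsilon _{1}}$, gives:

```latex
\theta_{n+1} = \theta_n - \tfrac{1}{n}\,(\theta_n + \varepsilon_n)
             = \Bigl(1 - \tfrac{1}{n}\Bigr)\theta_n - \tfrac{1}{n}\,\varepsilon_n
\quad\Longrightarrow\quad
\theta_{n+1} = -\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i ,
\qquad
\operatorname{E}[f(\theta_{n+1})] - f^{*}
  = \tfrac{1}{2}\operatorname{E}[\theta_{n+1}^{2}]
  = \frac{\sigma^{2}}{2n} .
```

In this case the iterate is simply the negative sample mean of the noise, so the expected objective gap decays exactly like ${\textstyle 1/n}$ regardless of the starting point.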

Subsequent developments and Polyak–Ruppert averaging

While the Robbins–Monro algorithm is theoretically able to achieve ${\textstyle O(1/n)}$ under the assumption of twice continuous differentiability and strong convexity, it can perform quite poorly in practice. This is primarily because the algorithm is very sensitive to the choice of the step size sequence, and the supposedly asymptotically optimal step size policy can be quite harmful in the early iterations.^[6]^[8]
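The sensitivity to the step size can be seen even without noise. The sketch below is a toy illustration under assumed settings (the function name, the linear ${\textstyle M(\theta )=c\,\theta }$, and the two values of ${\textstyle c}$ are hypothetical): it runs the recursion with ${\textstyle a_{n}=a/n}$ and ${\textstyle \alpha =0}$, so each step multiplies the iterate by ${\textstyle 1-ac/n}$.

```python
def run_noise_free(curvature, a, theta0=1.0, n_steps=1000):
    """Robbins-Monro with a_n = a / n on M(theta) = curvature * theta, no noise."""
    theta = theta0
    trajectory = [theta]
    for n in range(1, n_steps + 1):
        theta = theta - (a / n) * curvature * theta
        trajectory.append(theta)
    return trajectory


# Step too small for the slope: the iterate shrinks only like n ** (-0.1).
slow = run_noise_free(curvature=0.1, a=1.0)

# Step too large for the slope: the first few iterates overshoot the root
# before the 1 / n decay of the step size brings them back.
overshoot = run_noise_free(curvature=10.5, a=1.0)

print(abs(slow[-1]))                    # still far from the root at 0
print(max(abs(t) for t in overshoot))   # peak distance from the root
```

In the first run the iterate is still a sizeable fraction of its starting value after 1000 steps. In the second run the early iterates move more than 100 times farther from the root than the starting point before they contract.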

Chung (1954)^[9] and Fabian (1968)^[10] showed that the optimal convergence rate ${\textstyle O(1/{\sqrt {n}})}$ for the iterates is achieved with ${\textstyle a_{n}=\nabla ^{2}f(\theta ^{*})^{-1}/n}$ (or ${\textstyle a_{n}={\frac {1}{nM'(\theta ^{*})}}}$). Lai and Robbins^[11]^[12] designed adaptive procedures to estimate ${\textstyle M'(\theta ^{*})}$ such that ${\textstyle \theta _{n}}$ has minimal asymptotic variance. However, applying such optimal methods requires a great deal of a priori information, which is hard to obtain in most situations. To overcome this shortfall, Polyak (1990)^[13] and Ruppert (1988)^[14] independently developed a new optimal algorithm based on the idea of averaging the trajectories. Polyak and Juditsky^[15] also presented a method of accelerating Robbins–Monro for linear and non-linear root-searching problems through the use of longer steps and averaging of the iterates. The algorithm has the following structure:

$$\theta _{n+1}-\theta _{n}=a_{n}(\alpha -N(\theta _{n})),\qquad {\bar {\theta }}_{n}={\frac {1}{n}}\sum _{i=0}^{n-1}\theta _{i}$$

The convergence of ${\textstyle {\bar {\theta }}_{n}}$ to the unique root ${\textstyle \theta ^{*}}$ relies on the condition that the step sequence ${\textstyle \{a_{n}\}}$ decreases sufficiently slowly. That is,

A1) $$a_{n}\rightarrow 0,\qquad {\frac {a_{n}-a_{n+1}}{a_{n}}}=o(a_{n})$$
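A minimal Python sketch of the averaged scheme follows. Everything specific in it is an assumption made for illustration: the function names, the test function ${\textstyle M(\theta )=2\theta +1}$ with uniform noise, and the step sizes ${\textstyle a_{n}=a\,n^{-\gamma }}$ with ${\textstyle \gamma =0.7}$, which satisfy A1 because ${\textstyle (a_{n}-a_{n+1})/a_{n}}$ behaves like ${\textstyle \gamma /n}$ and ${\textstyle \gamma <1}$. For convenience the step sizes are indexed from 1 rather than 0.

```python
import random


def polyak_ruppert(noisy_measurement, alpha, theta0, a=1.0, gamma=0.7, n_steps=20_000):
    """Return (last iterate, average of the iterates theta_0 .. theta_{n_steps - 1})."""
    theta = theta0
    theta_bar = 0.0
    for n in range(1, n_steps + 1):
        # theta currently holds theta_{n-1}; fold it into the running mean,
        # so theta_bar becomes the average of theta_0 .. theta_{n-1}.
        theta_bar += (theta - theta_bar) / n
        a_n = a / n ** gamma  # decays more slowly than a / n
        theta = theta + a_n * (alpha - noisy_measurement(theta))
    return theta, theta_bar


# Hypothetical test problem: M(theta) = 2 * theta + 1, root of M(theta) = 5 is 2.
rng = random.Random(1)


def noisy_measurement(theta):
    return 2.0 * theta + 1.0 + rng.uniform(-1.0, 1.0)


last_iterate, averaged = polyak_ruppert(noisy_measurement, alpha=5.0, theta0=0.0)
print(last_iterate, averaged)
```

The running mean is updated incrementally so that no trajectory needs to be stored. In runs of this kind the averaged value typically sits closer to the root than the last iterate, since the longer steps leave more noise in the individual iterates.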

[1] Toulis, Panos; Airoldi, Edoardo (2015). "Scalable estimation strategies based on stochastic approximations: classical results and new insights". Statistics and Computing. 25 (4): 781–795. doi:10.1007/s11222-015-9560-y. PMC 4484776. PMID 26139959.

[2] Le Ny, Jerome. "Introduction to Stochastic Approximation Algorithms" (PDF). Polytechnique Montreal. Teaching Notes. Retrieved 16 November 2016.

[3] Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.

[4] Blum, Julius R. (1954-06-01). "Approximation Methods which Converge with Probability one". The Annals of Mathematical Statistics. 25 (2): 382–386. doi:10.1214/aoms/1177728794. ISSN 0003-4851.

[5] Sacks, J. (1958). "Asymptotic Distribution of Stochastic Approximation Procedures". The Annals of Mathematical Statistics. 29 (2): 373–405. doi:10.1214/aoms/1177706619. JSTOR 2237335.

[6] Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. (2009). "Robust Stochastic Approximation Approach to Stochastic Programming". SIAM Journal on Optimization. 19 (4): 1574. doi:10.1137/070704277.

[7] Nemirovski, A.; Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Intersci. Ser. Discrete Math. 15. New York: John Wiley.

[8] Spall, J. C. (2003). Introduction to Stochastic Search and Optimization: Estimation, Simulation and Control. Hoboken, NJ: John Wiley.

[9] Chung, K. L. (1954-09-01). "On a Stochastic Approximation Method". The Annals of Mathematical Statistics. 25 (3): 463–483. doi:10.1214/aoms/1177728716. ISSN 0003-4851.

[10] Fabian, Vaclav (1968-08-01). "On Asymptotic Normality in Stochastic Approximation". The Annals of Mathematical Statistics. 39 (4): 1327–1332. doi:10.1214/aoms/1177698258. ISSN 0003-4851.

[11] Lai, T. L.; Robbins, Herbert (1979-11-01). "Adaptive Design and Stochastic Approximation". The Annals of Statistics. 7 (6): 1196–1221. doi:10.1214/aos/1176344840. ISSN 0090-5364.

[12] Lai, Tze Leung; Robbins, Herbert (1981-09-01). "Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes". Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 56 (3): 329–360. doi:10.1007/BF00536178. ISSN 0044-3719. S2CID 122109044.

[13] Polyak, B. T. (1990-01-01). "New stochastic approximation type procedures" (in Russian). 7 (7).

[14] Ruppert, D. "Efficient estimators from a slowly converging Robbins–Monro process".

[15] Polyak, B. T.; Juditsky, A. B. (1992). "Acceleration of Stochastic Approximation by Averaging". SIAM Journal on Control and Optimization. 30 (4): 838. doi:10.1137/0330046.

[16] Nemirovski, A.; Yudin, D. (1978). "On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions". Dokl. Akad. Nauk SSR 2939 (in Russian); Soviet Math. Dokl. 19 (in English).

[17] Kushner, Harold; Yin, G. George (2003-07-17). Stochastic Approximation and Recursive Algorithms and Applications. Springer. ISBN 9780387008943.

[18] Bouleau, N.; Lepingle, D. (1994). Numerical Methods for Stochastic Processes. New York: John Wiley. ISBN 9780471546412.

[19] Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462. doi:10.1214/aoms/1177729392.

[20] Spall, J. C. (2000). "Adaptive stochastic approximation by the simultaneous perturbation method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982.

[21] Kushner, H. J.; Yin, G. G. (1997). Stochastic Approximation Algorithms and Applications. doi:10.1007/978-1-4899-2696-8. ISBN 978-1-4899-2698-2.

[22] Nevel'son, Mikhail Borisovich; Has'minskiĭ, Rafail Zalmanovich (1973, 1976). Stochastic Approximation and Recursive Estimation. Translated by Israel Program for Scientific Translations and B. Silver. Providence, RI: American Mathematical Society. ISBN 0-8218-1597-0.

[23] Martin, R.; Masreliez, C. (1975). "Robust estimation via stochastic approximation". IEEE Transactions on Information Theory. 21 (3): 263. doi:10.1109/TIT.1975.1055386.

[24] Dvoretzky, Aryeh (1956-01-01). "On Stochastic Approximation". The Regents of the University of California.
