User:VirtualVistas/sandbox
Original Article: Stochastic Gradient Descent
Added section: [NOTE: put section towards start of article] [TODO: graphic]
History
In 1951, Herbert Robbins and Sutton Monro introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.[1] Building on this work one year later, Jack Kiefer and Jacob Wolfowitz published an optimization algorithm closely related to stochastic gradient descent, using finite differences to approximate the gradient.[2] Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, the first application of stochastic gradient descent to neural networks.[3]
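The Kiefer–Wolfowitz procedure replaces the exact gradient with a finite-difference estimate. In one dimension, with illustrative gain sequences a_n and c_n (both decreasing toward zero), the update can be sketched as:

```latex
x_{n+1} = x_n + a_n \cdot \frac{f(x_n + c_n) - f(x_n - c_n)}{2 c_n}
```

where each evaluation of f is a noisy measurement; only function values, never gradients, are required.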
Backpropagation was first described in 1986, with stochastic gradient descent used to efficiently optimize the parameters of neural networks with multiple hidden layers. Soon after, another improvement followed: mini-batch gradient descent, in which small batches of samples are substituted for single samples. In 1997, the practical performance benefits from vectorization achievable with such small batches were first explored,[4] paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent against those of full-batch gradient descent.[5]
Momentum had already been introduced by the 1980s and was combined with SGD in 1986.[6] However, these optimization techniques assumed constant hyperparameters, i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches applying SGD with a per-parameter learning rate were introduced, with AdaGrad (for "Adaptive Gradient") in 2011[7] and RMSprop (for "Root Mean Square Propagation") in 2012.[8] In 2014, Adam (for "Adaptive Moment Estimation") was published, combining the adaptive approach of RMSprop with momentum; many variants and refinements of Adam followed, such as AdamW and Adamax.[9][10]
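The distinction can be made concrete with the update rules themselves; the notation below (learning rate η, momentum coefficient γ, gradient g_t, small constant ε) is illustrative. SGD with momentum uses the same fixed η and γ for every parameter:

```latex
v_{t+1} = \gamma v_t - \eta \, g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1}
```

whereas AdaGrad scales the step for each parameter i by the accumulated squared gradients, giving a per-parameter effective learning rate:

```latex
G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2, \qquad
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \, g_{t,i}
```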
Within machine learning, approaches to optimization in 2023 are dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,[11] as of 2023 largely include only Adam-derived optimizers, along with predecessors of Adam such as RMSprop and classic SGD. PyTorch also partially supports L-BFGS, a line-search method, but only for single-device setups without parameter groups. In 2023, Lion, a sign-based stochastic gradient descent algorithm, was introduced; its authors reported that it outperformed existing Adam-based optimizers on a range of benchmarks.[12][13][14]
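In outline, Lion keeps a momentum-like average of past gradients but applies only the sign of the resulting direction, so every parameter moves by the same magnitude η per step (notation illustrative; λ is a decoupled weight-decay coefficient):

```latex
c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
\theta_t = \theta_{t-1} - \eta \left( \operatorname{sign}(c_t) + \lambda \theta_{t-1} \right), \qquad
m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t
```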
- ^ Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
- ^ Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462–466. doi:10.1214/aoms/1177729392.
- ^ Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain". Psychological Review. 65 (6): 386–408. doi:10.1037/h0042519.
- ^ Bilmes, Jeff; Asanovic, Krste; Chin, Chee-Whye; Demmel, James (April 1997). "Using PHiPAC to speed error back-propagation learning". 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 5. Munich, Germany: IEEE. pp. 4153–4156. doi:10.1109/ICASSP.1997.604861.
- ^ "Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling". ieeexplore.ieee.org. Retrieved 2023-10-02.
- ^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0. ISSN 1476-4687.
- ^ Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.
- ^ Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.
- ^ Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980 [cs.LG].
- ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
- ^ Nguyen, Giang; Dlugolinsky, Stefan; Bobák, Martin; Tran, Viet; García, Álvaro; Heredia, Ignacio; Malík, Peter; Hluchý, Ladislav (19 January 2019). "Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey" (PDF). Artificial Intelligence Review.
- ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
- ^ "Module: tf.keras.optimizers | TensorFlow v2.14.0". TensorFlow. Retrieved 2023-10-02.
- ^ Chen, Xiangning; et al. (2023). "Symbolic Discovery of Optimization Algorithms". arXiv:2302.06675 [cs.LG].