User:VirtualVistas/sandbox

From Wikipedia, the free encyclopedia

Original Article: Stochastic Gradient Descent

Added section: [NOTE: put section towards start of article] [TODO: graphic]

History

In 1951, Herbert Robbins and Sutton Monro introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.[1] Building on this work a year later, Jack Kiefer and Jacob Wolfowitz published an optimization algorithm closely related to stochastic gradient descent, approximating the gradient with finite differences.[2] Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, marking the first application of stochastic gradient descent to neural networks.[3]

Backpropagation was popularized in 1986, allowing stochastic gradient descent to efficiently optimize the parameters of neural networks with multiple hidden layers. Soon after, another improvement followed: mini-batch gradient descent, in which gradients are computed over small batches of data rather than single samples. In 1997, the practical performance benefits of the vectorization achievable with such small batches were first explored,[4] paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent with those of full-batch gradient descent.[5]
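
For illustration, with parameters w, learning rate \eta, and a mini-batch B drawn from the training set (notation chosen here for clarity rather than taken from the cited sources), the mini-batch update can be sketched as

  w \leftarrow w - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla_w L_i(w),

where L_i is the loss on the i-th example. Taking |B| = 1 recovers classic stochastic gradient descent, while letting B be the entire training set recovers full-batch gradient descent.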

Momentum had already been introduced by the 1980s and was combined with SGD optimization in 1986.[6] However, these optimization techniques assumed constant hyperparameters, i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive variants of SGD with per-parameter learning rates were introduced: AdaGrad (for "Adaptive Gradient") in 2011[7] and RMSprop (for "Root Mean Square Propagation") in 2012.[8] In 2014, Adam (for "Adaptive Moment Estimation") was published, combining the adaptive learning rates of RMSprop with momentum; many variants and extensions of Adam were subsequently developed, such as AdamW and Adamax.[9][10]
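
As a rough sketch in standard textbook notation (not drawn verbatim from the cited papers), SGD with momentum keeps a velocity v governed by a momentum parameter \gamma,

  v \leftarrow \gamma v - \eta \nabla_w L(w), \qquad w \leftarrow w + v,

whereas adaptive methods such as AdaGrad and RMSprop divide the learning rate by a per-parameter running statistic of past squared gradients. Adam combines the two ideas by tracking exponential moving averages m and s of the gradient and its elementwise square (bias-correction terms omitted here for brevity):

  m \leftarrow \beta_1 m + (1-\beta_1)\,\nabla_w L(w), \qquad s \leftarrow \beta_2 s + (1-\beta_2)\,(\nabla_w L(w))^2, \qquad w \leftarrow w - \eta\,\frac{m}{\sqrt{s} + \epsilon}.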

As of 2023, optimization in machine learning is dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,[11] largely include only Adam-derived optimizers, together with Adam's predecessors such as RMSprop and classic SGD. PyTorch also partially supports L-BFGS, a quasi-Newton method, but only for single-device setups without parameter groups. In 2023, LION, a sign-based stochastic gradient descent algorithm, was introduced; its authors report that it outperforms existing Adam-based optimizers on a range of tasks.[12][13][14]
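
To illustrate the sign-based update (a simplified sketch of the published rule, ignoring weight decay and with hyperparameter symbols chosen here for clarity), LION maintains a momentum buffer m and steps each parameter by a magnitude determined only by a sign:

  w \leftarrow w - \eta\,\operatorname{sign}\!\big(\beta_1 m + (1-\beta_1)\,\nabla_w L(w)\big), \qquad m \leftarrow \beta_2 m + (1-\beta_2)\,\nabla_w L(w),

so, unlike Adam-style updates, the per-coordinate step size does not depend on the magnitude of the gradient.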

  1. ^ Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
  2. ^ Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462–466. doi:10.1214/aoms/1177729392.
  3. ^ Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain". Psychological Review. 65 (6): 386–408. doi:10.1037/h0042519.
  4. ^ Bilmes, Jeff; Asanovic, Krste; Chin, Chee-Whye; Demmel, James (April 1997). "Using PHiPAC to speed error back-propagation learning". 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Munich, Germany: IEEE. Vol. 5, pp. 4153–4156. doi:10.1109/ICASSP.1997.604861.
  5. ^ "Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling". ieeexplore.ieee.org. Retrieved 2023-10-02.
  6. ^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0. ISSN 1476-4687.
  7. ^ Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.
  8. ^ Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.
  9. ^ Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980 [cs.LG].
  10. ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
  11. ^ Nguyen, Giang; Dlugolinsky, Stefan; Bobák, Martin; Tran, Viet; García, Álvaro; Heredia, Ignacio; Malík, Peter; Hluchý, Ladislav (19 January 2019). "Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey" (PDF). Artificial Intelligence Review.
  12. ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
  13. ^ "Module: tf.keras.optimizers | TensorFlow v2.14.0". TensorFlow. Retrieved 2023-10-02.
  14. ^ Chen, Xiangning; Liang, Chen; Huang, Da; et al. (2023). "Symbolic Discovery of Optimization Algorithms". arXiv:2302.06675 [cs.LG].