User:VirtualVistas/sandbox
Original Article: Stochastic Gradient Descent
Added section: [NOTE: put section towards start of article] [TODO: graphic]
History
In 1951, Herbert Robbins and Sutton Monro introduced the earliest stochastic approximation methods, preceding stochastic gradient descent.[1] Building on this work one year later, Jack Kiefer and Jacob Wolfowitz published an optimization algorithm closely related to stochastic gradient descent, using finite differences to approximate the gradient.[2] Later in the 1950s, Frank Rosenblatt used SGD to optimize his perceptron model, the first application of stochastic gradient descent to neural networks.[3]
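The Kiefer–Wolfowitz procedure replaces the exact gradient with a finite-difference estimate. In one dimension, with illustrative gain sequences a_n and c_n (both decreasing toward zero), the update can be sketched as:

```latex
x_{n+1} = x_n + a_n \cdot \frac{f(x_n + c_n) - f(x_n - c_n)}{2 c_n}
```

where each evaluation of f is a noisy measurement; only function values, never gradients, are required.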
Backpropagation was first described in 1986, with stochastic gradient descent used to efficiently optimize the parameters of neural networks with multiple hidden layers. Soon after, another improvement followed: mini-batch gradient descent, in which small batches of samples are substituted for single samples. In 1997, the practical performance benefits from vectorization achievable with such small batches were first explored,[4] paving the way for efficient optimization in machine learning. As of 2023, this mini-batch approach remains the norm for training neural networks, balancing the benefits of stochastic gradient descent against those of full-batch gradient descent.[5]
Momentum had already been introduced by the 1980s and was combined with SGD in 1986.[6] However, these optimization techniques assumed constant hyperparameters, i.e. a fixed learning rate and momentum parameter. In the 2010s, adaptive approaches applying SGD with a per-parameter learning rate were introduced, with AdaGrad (for "Adaptive Gradient") in 2011[7] and RMSprop (for "Root Mean Square Propagation") in 2012.[8] In 2014, Adam (for "Adaptive Moment Estimation") was published, combining the adaptive approach of RMSprop with momentum; many variants and refinements of Adam followed, such as AdamW and Adamax.[9][10]
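The distinction can be made concrete with the update rules themselves; the notation below (learning rate η, momentum coefficient γ, gradient g_t, small constant ε) is illustrative. SGD with momentum uses the same fixed η and γ for every parameter:

```latex
v_{t+1} = \gamma v_t - \eta \, g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1}
```

whereas AdaGrad scales the step for each parameter i by the accumulated squared gradients, giving a per-parameter effective learning rate:

```latex
G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^2, \qquad
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon} \, g_{t,i}
```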
Within machine learning, approaches to optimization in 2023 are dominated by Adam-derived optimizers. TensorFlow and PyTorch, by far the most popular machine learning libraries,[11] as of 2023 largely include only Adam-derived optimizers, along with predecessors of Adam such as RMSprop and classic SGD. PyTorch also partially supports L-BFGS, a line-search method, but only for single-device setups without parameter groups. In 2023, Lion, a sign-based stochastic gradient descent algorithm, was introduced; its authors reported that it outperformed existing Adam-based optimizers on a range of benchmarks.[12][13][14]
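In outline, Lion keeps a momentum-like average of past gradients but applies only the sign of the resulting direction, so every parameter moves by the same magnitude η per step (notation illustrative; λ is a decoupled weight-decay coefficient):

```latex
c_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad
\theta_t = \theta_{t-1} - \eta \left( \operatorname{sign}(c_t) + \lambda \theta_{t-1} \right), \qquad
m_t = \beta_2 m_{t-1} + (1 - \beta_2) g_t
```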
- ^ Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
- ^ Kiefer, J.; Wolfowitz, J. (1952). "Stochastic Estimation of the Maximum of a Regression Function". The Annals of Mathematical Statistics. 23 (3): 462–466. doi:10.1214/aoms/1177729392.
- ^ Rosenblatt, F. (1958). "The perceptron: A probabilistic model for information storage and organization in the brain". Psychological Review. 65 (6): 386–408. doi:10.1037/h0042519.
- ^ Bilmes, Jeff; Asanovic, Krste; Chin, Chee-Whye; Demmel, James (April 1997). "Using PHiPAC to speed error back-propagation learning". 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 5. Munich, Germany: IEEE. pp. 4153–4156. doi:10.1109/ICASSP.1997.604861.
- ^ "Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling". ieeexplore.ieee.org. Retrieved 2023-10-02.
- ^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. doi:10.1038/323533a0. ISSN 1476-4687.
- ^ Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.
- ^ Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.
- ^ Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization". arXiv:1412.6980 [cs.LG].
- ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
- ^ Nguyen, Giang; Dlugolinsky, Stefan; Bobák, Martin; Tran, Viet; García, Álvaro; Heredia, Ignacio; Malík, Peter; Hluchý, Ladislav (19 January 2019). "Machine Learning and Deep Learning frameworks and libraries for large-scale data mining: a survey" (PDF). Artificial Intelligence Review.
- ^ "torch.optim — PyTorch 2.0 documentation". pytorch.org. Retrieved 2023-10-02.
- ^ "Module: tf.keras.optimizers | TensorFlow v2.14.0". TensorFlow. Retrieved 2023-10-02.
- ^ Chen, Xiangning; et al. (2023). "Symbolic Discovery of Optimization Algorithms". arXiv:2302.06675 [cs.LG].