Deep learning is a set of algorithms in machine learning that attempt to learn in multiple levels, corresponding to different levels of abstraction. It typically uses artificial neural networks. The levels in these learned statistical models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.
Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to learn them.
Some of the most successful deep learning methods involve artificial neural networks. Deep learning neural networks date back at least to the 1980 Neocognitron by Kunihiko Fukushima. Ronan Collobert has said that "deep learning is just a buzzword for neural nets".
The term "deep learning" gained traction in the mid-2000s after a publication by Geoffrey Hinton and Ruslan Salakhutdinov showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then using supervised backpropagation for fine-tuning. In 1992, Jürgen Schmidhuber had already implemented a very similar idea for the more general case of unsupervised deep hierarchies of recurrent neural networks, and also experimentally shown its benefits for speeding up supervised learning.
Deep neural architectures, however, are older and date back at least to the deep Neocognitron (1980) of Kunihiko Fukushima, who introduced convolutional neural networks partially trained by unsupervised learning. Combinations of such convolutional architectures and supervised learning are frequently used in current successful systems. Yann LeCun et al. (1989) were the first to apply backpropagation to such architectures.
Although the backpropagation algorithm had been available for training neural networks since 1974, it was often considered too slow for practical use, due to the so-called vanishing gradient problem analyzed in 1991 by Schmidhuber's student Sepp Hochreiter (more details in the section on artificial neural networks below). As a result, neural networks fell out of favor in practical machine learning and simpler models such as support vector machines (SVMs) dominated much of the field in the 1990s and 2000s. However, SVM learning is essentially a linear process, while neural network learning can be highly non-linear. In 2010 it was shown by Ciresan et al. that plain backpropagation in deep non-linear networks can outperform all previous techniques on the famous MNIST handwritten digit benchmark, without unsupervised pretraining.
Advances in hardware have been an important enabling factor for the resurgence of neural networks and the advent of deep learning, in particular the availability of powerful and inexpensive graphics processing units (GPUs) also suitable for general-purpose computing. GPUs are highly suited for the kind of "number crunching" involved in machine learning, and have been shown to speed up training algorithms by orders of magnitude, bringing running times of weeks back to days.
Deep learning is often presented as a step towards realising strong AI and has attracted the attention of such thinkers as Ray Kurzweil, who was hired by Google to do deep learning research. Gary Marcus has expressed skepticism of deep learning's capabilities, noting that:
"Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning."
Deep learning algorithms are based on distributed representations, a notion that was introduced with connectionism in the 1980s. The underlying assumption behind distributed representations is that the observed data were generated by the interactions of many factors (not all known to the observer), and that what is learned about a particular factor from some configurations of the other factors can often generalize to other, unseen configurations of the factors. Deep learning adds the assumption (seen as a prior about the unknown, data-generating process) that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition: higher-level representations are obtained by transforming or generating lower-level representations. The relationships between these factors can be viewed as similar to the relationships between entries in a dictionary or in Wikipedia, although these factors can be numerical (e.g., the position of the face in the image) or categorical (e.g., is it a human face?), whereas entries in a dictionary are purely symbolic. The appropriate number of levels and the structure that relates these factors is something that a deep learning algorithm is also expected to discover from examples.
Deep learning algorithms often involve other important ideas that correspond to broad a priori beliefs about these unknown underlying factors. An important prior regarding a supervised learning task of interest (e.g., given an input image, predicting the presence of a face and the identity of the person) is that among the factors that explain the variations observed in the inputs (e.g., images), some are relevant to the prediction of interest. This is a special case of the semi-supervised learning setup, which allows a learner to exploit large quantities of unlabeled data (e.g., images for which the presence of a face and the identity of the person, if any, are not known).
Many deep learning algorithms are actually framed as unsupervised learning, e.g., using many examples of natural images to discover good representations of them. Because most of these learning algorithms do not require labels, they can leverage large amounts of unlabeled data, even when the data cannot be associated with labels of the immediate tasks of interest.
Deep learning in artificial neural networks
Some of the most successful deep learning methods involve artificial neural networks. Deep learning neural networks date back at least to the 1980 Neocognitron by Kunihiko Fukushima. The Neocognitron was inspired by the 1959 biological model proposed by Nobel laureates David H. Hubel and Torsten Wiesel, who found two types of cells in the primary visual cortex: simple cells and complex cells. Many artificial neural networks can be viewed as cascading models of cell types inspired by these biological observations. Fukushima's model introduced convolutional neural networks partially trained by unsupervised learning. Yann LeCun et al. (1989) applied supervised backpropagation to such architectures.
With the advent of the back-propagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter's diploma thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem," which not only affects many-layered feedforward networks, but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. As errors propagate from layer to layer, they shrink exponentially with the number of layers.
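The exponential shrinkage described above can be illustrated with a minimal sketch. It assumes a deliberately simplified network: a chain of one-unit sigmoid layers that all share one weight. The function name `backprop_gradient_chain` and its parameters are hypothetical and chosen for illustration; this is not Hochreiter's analysis, only a numerical demonstration of the same effect.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def backprop_gradient_chain(num_layers, weight=1.0, x=0.5):
    """Return |d output / d input| for a chain of one-unit sigmoid layers.

    Assumption: every layer has a single unit and the same weight.
    """
    # Forward pass: store the activation of each layer.
    activations = []
    a = x
    for _ in range(num_layers):
        a = sigmoid(weight * a)
        activations.append(a)

    # Backward pass: the chain rule multiplies one local derivative per layer.
    # For s = sigmoid(w * z), ds/dz = w * s * (1 - s), which is at most w / 4.
    grad = 1.0
    for s in reversed(activations):
        grad *= weight * s * (1.0 - s)
    return abs(grad)


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        print(depth, backprop_gradient_chain(depth))
```

Because the sigmoid derivative never exceeds 1/4, with a weight of 1 each additional layer scales the backpropagated error by at most 1/4, so the gradient reaching the first layer decays at least geometrically with depth.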
To overcome this problem, several methods were proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992) pre-trained one level at a time through unsupervised learning, fine-tuned through backpropagation. Here each level learns a compressed representation of the observations that is fed to the next level.
Another method is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber. In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers, by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.
Other methods also use unsupervised pre-training to structure a neural network, making it first learn generally useful feature detectors. Then the network is trained further by supervised back-propagation to classify labeled data. The deep model of Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine (Smolensky, 1986) to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. Hinton reports that his models are effective feature extractors over high-dimensional, structured data.
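The greedy, one-layer-at-a-time idea can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of Hinton et al.: it trains each binary restricted Boltzmann machine with one-step contrastive divergence (CD-1), a common training rule that the text above does not name, and it omits the supervised fine-tuning stage. The class `RBM`, the function `pretrain_stack`, and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Binary restricted Boltzmann machine (illustrative, pure Python)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_v = [0.0] * n_visible
        self.b_h = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_h[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_h))]

    def visible_probs(self, h):
        return [sigmoid(self.b_v[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_v))]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden activity driven by the data.
        h0_p = self.hidden_probs(v0)
        h0 = self.sample(h0_p)
        # Negative phase: one reconstruction step (the "1" in CD-1).
        v1_p = self.visible_probs(h0)
        h1_p = self.hidden_probs(v1_p)
        for i in range(len(v0)):
            for j in range(len(h0_p)):
                self.W[i][j] += lr * (v0[i] * h0_p[j] - v1_p[i] * h1_p[j])
        for i in range(len(v0)):
            self.b_v[i] += lr * (v0[i] - v1_p[i])
        for j in range(len(h0_p)):
            self.b_h[j] += lr * (h0_p[j] - h1_p[j])


def pretrain_stack(data, layer_sizes, epochs=50, seed=0):
    """Train one RBM per layer; each layer's hidden probabilities feed the next."""
    rng = random.Random(seed)
    rbms = []
    layer_input = data
    n_in = len(data[0])
    for n_hidden in layer_sizes:
        rbm = RBM(n_in, n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_update(v)
        rbms.append(rbm)
        # The learned features become the "data" for the next layer.
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
        n_in = n_hidden
    return rbms


if __name__ == "__main__":
    toy_data = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1],
                [1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1]]
    stack = pretrain_stack(toy_data, layer_sizes=[3, 2])
    print([len(rbm.b_h) for rbm in stack])
```

In a full system, the weights learned this way would initialize a feedforward network that is then fine-tuned with supervised backpropagation on labeled data.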
The Google Brain team led by Andrew Ng and Jeff Dean created a neural network that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos. 
Other methods rely on the sheer processing power of modern computers, in particular, GPUs. In 2010 it was shown by Dan Ciresan and colleagues in Jürgen Schmidhuber's group at the Swiss AI Lab IDSIA that despite the above-mentioned "vanishing gradient problem," the superior processing power of GPUs makes plain back-propagation feasible for deep feedforward neural networks with many layers. The method outperformed all other machine learning techniques on the old, famous MNIST handwritten digits problem of Yann LeCun and colleagues at NYU.
As of 2011, the state of the art in deep learning feedforward networks alternates convolutional layers and max-pooling layers, topped by several pure classification layers. Training is usually done without any unsupervised pre-training. Since 2011, GPU-based implementations of this approach have won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge, and others.
Such supervised deep learning methods were also the first artificial pattern recognizers to achieve human-competitive performance on certain tasks.
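The alternation of convolutional and max-pooling layers mentioned above can be sketched for a single channel as follows. This is a minimal illustration, not any of the contest-winning GPU implementations; the function names `conv2d_valid` and `max_pool2d`, the toy image, and the edge-detecting kernel are assumptions made for the example. As is conventional in neural network code, the "convolution" is computed as a sliding dot product without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [[max(feature_map[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


if __name__ == "__main__":
    # Toy 5x5 image containing a vertical bar two pixels wide.
    image = [[0, 0, 1, 1, 0] for _ in range(5)]
    # Hypothetical kernel that responds to dark-to-bright vertical edges.
    kernel = [[-1, 1],
              [-1, 1]]
    features = conv2d_valid(image, kernel)  # 4x4 feature map
    pooled = max_pool2d(features, size=2)   # 2x2 after pooling
    print(features)
    print(pooled)
```

In a deep network these two steps would be repeated several times, usually with many kernels per layer and a nonlinearity between them, before the final classification layers.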
Deep learning in the human brain
Computational deep learning (see above) is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s. The most approachable summary of this work is Elman et al.'s 1996 book "Rethinking Innateness" (see also: Shrager and Johnson, 1996; Quartz and Sejnowski, 1997). As these theories were also instantiated in computational models, they are technical predecessors of purely computationally motivated deep learning models. In these models, various proposed learning dynamics in the brain (e.g., a wave of neurotrophic growth factor) conspire to support the self-organization of just the sort of inter-related neural networks utilized in the later, purely computational deep learning models. Such networks appear analogous to one way of understanding the neocortex: as a hierarchy of filters in which each layer captures some of the information in the operating environment and then passes the remainder, as well as a modified base signal, to other layers further up the hierarchy. The result of this process is a self-organizing stack of transducers, well tuned to their operating environment. As described in The New York Times in 1995: "...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature."
The importance of deep learning with respect to the evolution and development of human cognition did not escape the attention of these researchers. One aspect of human development that distinguishes us from our nearest primate neighbors may be changes in the timing of development. Among primates, the human brain remains relatively plastic until late in the post-natal period, whereas the brains of our closest relatives are more completely formed by birth. Thus, humans have greater access to the complex experiences afforded by being out in the world during the most formative period of brain development. This may enable us to "tune in" to rapidly changing features of the environment that other animals, more constrained by evolutionary structuring of their brains, are unable to take account of. To the extent that these changes are reflected in similar timing changes in the hypothesized wave of cortical development, they may also lead to changes in the extraction of information from the stimulus environment during the early self-organization of the brain. Of course, along with this flexibility comes an extended period of immaturity, during which we are dependent upon our caretakers and our community for both support and training. The theory of deep learning therefore sees the coevolution of culture and cognition as a fundamental condition of human evolution.
- Bengio, Y. (2009). Learning Deep Architectures for AI. Now Publishers.
- K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193–202, 1980.
- Ronan Collobert (May 6, 2011). "Deep Learning for Efficient Discriminative Parsing". videolectures.net. Ca. 7:45.
- Marcus, Gary (25 November 2012). "Is "Deep Learning" a Revolution in Artificial Intelligence?". The New Yorker. Retrieved 10 May 2013.
- Schmidhuber, Jürgen; Learning complex, extended sequences using the principle of history compression, Neural Computation, 4(2):234-242, 1992
- Schmidhuber, Jürgen; My First Deep Learning System of 1991 + Deep Learning Timeline 1962-2013, http://www.idsia.ch/~juergen/firstdeeplearner.html
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541-551, 1989.
- Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974
- D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207–3220, 2010.
- Rajat Raina; Anand Madhavan; Andrew Y. Ng (2009). "Large-scale Deep Unsupervised Learning using Graphics Processors". Proc. 26th Int'l Conf. on Machine Learning.
- Daniela Hernandez (7 May 2013). "The Man Behind the Google Brain: Andrew Ng and the Quest for the New AI". Wired. Retrieved 10 May 2013.
- Steven Levy (25 April 2013). "How Ray Kurzweil Will Help Google Make the Ultimate AI Brain". Wired. Retrieved 10 May 2013.
- M Riesenhuber, T Poggio. Hierarchical models of object recognition in cortex. Nature neuroscience, 1999(11) 1019-1025.
- S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f. Informatik, Technische Univ. Munich, 1991. Advisor: J. Schmidhuber
- S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
- Hochreiter, Sepp; and Schmidhuber, Jürgen; Long Short-Term Memory, Neural Computation, 9(8):1735–1780, 1997
- Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552
- A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
- Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation. Lecture Notes in Computer Science 2766. Springer.
- Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory.". In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. 1. pp. 194–281.
- Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets". Neural Computation 18 (7): 1527–1554. doi:10.1162/neco.2006.18.7.1527. PMID 16764513.
- http://www.scholarpedia.org/article/Deep_belief_networks
- John Markoff (2012). "How Many Computers to Identify a Cat? 16,000.". New York Times.
- Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning".
- D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011.
- Martinez, H., Bengio, Y., & Yannakakis, G. N. (2013). Learning Deep Physiological Models of Affect. IEEE Computational Intelligence, 8(2), 20.
- D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks, 2012.
- D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe, 2012.
- D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012.
- P. E. Utgoff and D. J. Stracuzzi (2002) Many-layered learning. Neural Computation 14, 2497–2529.
- J. Elman, et al. (1996) Rethinking Innateness. MIT Press.
- J Shrager, MH Johnson (1996) Dynamic plasticity influences the emergence of function in a simple cortical array. Neural Networks 9 (7), 1119–1129
- SR Quartz, TJ Sejnowski (1997) The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences 20 (4):537–556
- S. Blakeslee (1995, Aug, 29) In brain's early growth, timetable may be critical. The New York Times, Science Section, pp. B5–B6
- Bufill, E.; Agustí, J.; Blesa, R. (2011). "Human neoteny revisited: The case of synaptic plasticity". American Journal of Human Biology 23 (6): 729–739. doi:10.1002/ajhb.21225. PMID 21957070.
- Shrager, J., Johnson, M. H. (1995). Timing in the development of cortical function: A computational approach. In B. Julesz & I. Kovacs (Eds.), Maturational windows and adult cortical plasticity. New York: Addison–Wesley.
- deeplearning.net Deep Learning web site of the Machine Learning Laboratory (LISA) of the University of Montreal (Yoshua Bengio's lab)
- deeplearning.it Deep Learning web site of the Swiss AI Lab IDSIA (Jürgen Schmidhuber's lab)
- Recent Developments in Deep Learning, video by Geoff Hinton
- 2012 Kurzweil AI Interview with Jürgen Schmidhuber
- The Gigaom guide to deep learning: Who’s doing it, and why it matters