Convolutional neural network

In computer science, a convolutional neural network is a type of feed-forward artificial neural network where the individual neurons (or "nodes") share weights with one another and are tiled in such a way that they respond to overlapping regions in the visual field.[1] Convolutional networks were inspired by biological processes[2] and are variations of multilayer perceptrons which are designed to reduce the number of free parameters in the model and provide robustness to local translations.[3] They are widely used models for image recognition.[2][4][5][6][7]

Typically a convolutional network will have a layer of convolution followed by a layer of subsampling, or pooling. In the convolution layer, each node first computes a weighted sum of the inputs in its field of view, then applies a pointwise nonlinearity to that sum, which allows the network to approximate arbitrary functions.[8] In the pooling layer, each node takes as its own value the maximum of its inputs. Effectively the network applies several filters to produce feature maps, then, within each small region of a feature map, keeps only the strongest response to the input.
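
As a concrete illustration, the following minimal NumPy sketch (an illustrative toy, not taken from the cited sources; the 8x8 input, 3x3 filter, and tanh nonlinearity are arbitrary choices) computes one filter's feature map and then applies non-overlapping 2x2 max pooling:

    import numpy as np

    def conv2d(image, kernel):
        # Valid 2-D convolution: each output value is the weighted sum of
        # the input pixels in that node's field of view.
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    def max_pool(feature_map, size=2):
        # Non-overlapping pooling: each output node takes the maximum of
        # a size x size region of the feature map.
        h, w = feature_map.shape
        trimmed = feature_map[:h - h % size, :w - w % size]
        blocks = trimmed.reshape(h // size, size, w // size, size)
        return blocks.max(axis=(1, 3))

    image = np.random.rand(8, 8)                  # toy grayscale input
    kernel = np.random.rand(3, 3)                 # shared weights of one filter
    feature_map = np.tanh(conv2d(image, kernel))  # weighted sum + pointwise nonlinearity
    pooled = max_pool(feature_map)                # 6x6 feature map -> 3x3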

Figure: A convolutional layer and a pooling layer applied to an image. Each feature map is a set of nodes with shared weights.

Overview

The core idea behind neural networks is pattern recognition. As the network learns, each node learns a function of its inputs that is useful to the later layers, and the output layer learns a function which minimizes the error. Each of these functions acts as a template for detecting a pattern in that node's input. In the specific case of convolutional neural networks, the underlying assumption is that these templates should be independent of the pixel location at which the pattern occurs; that is, the identity of patterns in an image should be location invariant.

The key distinguishing features of convolutional neural networks are:

  1. Local connectivity: Each node in the network is connected to a relatively small number of the nodes in the layer below it which are "close" to it in the network connectivity graph. Typically this results in a pyramidal shape to the network, with nodes at the bottom having a very small "field of view" on the input, or "receptive field", while nodes higher up, aggregating the outputs of several nodes below them, receive information from much more of the input data. This allows the network to first create good representations of a small part of the input (e.g. a few pixels), then assemble representations of larger areas of the input (e.g. objects) from those low-level representations.
  2. Shared weights: Also known as "tied weights". Instead of having a node with unique weights at each location within a layer, this technique essentially places a copy of the same node, with the same weights, at each location in the layer. When the network is run, it thus computes the same function at every point in the input, and so will, for example, detect a face in an image regardless of where in the image the face occurs. Because of this, the network tolerates translations of the input image, as illustrated in the sketch after this list.
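
As a small illustration of weight sharing (a hypothetical toy example reusing the conv2d sketch above), a filter whose weights match a pattern responds equally strongly wherever that pattern appears, and the location of the peak response simply follows the location of the pattern:

    import numpy as np

    pattern = np.random.rand(3, 3)
    img_a = np.zeros((8, 8)); img_a[0:3, 0:3] = pattern  # pattern at top-left
    img_b = np.zeros((8, 8)); img_b[4:7, 4:7] = pattern  # same pattern, shifted

    resp_a = conv2d(img_a, pattern)  # conv2d as defined in the sketch above
    resp_b = conv2d(img_b, pattern)

    # The shared filter fires with the same strength at both locations ...
    assert np.isclose(resp_a.max(), resp_b.max())
    # ... and the peak response moves with the pattern.
    assert np.unravel_index(resp_a.argmax(), resp_a.shape) == (0, 0)
    assert np.unravel_index(resp_b.argmax(), resp_b.shape) == (4, 4)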

Together, these properties allow convolutional neural networks to converge very rapidly to good representations of the input at each layer, then combine those representations into larger ones. Weight sharing also reduces the memory required to run the network.

Along with these signature convolutional layers, convolutional neural networks may include fully-connected layers and local or global pooling layers, which combine the outputs of neuron clusters for dimensionality reduction.

History

Convolutional neural networks were originally inspired by the work of Hubel and Wiesel,[9][10] who proposed a model of vision based on their work in the cat and monkey visual cortices. In their model, there are two kinds of cells in the visual cortex, simple and complex. Each simple cell responds to a particular template in a very small section of the visual field, while each complex cell combines the responses of several simple cells over a larger field, producing a response that is locally invariant to translations.

Convolutional neural networks were introduced in a 1980 paper by Kunihiko Fukushima,[10] who named his model the Neocognitron. However, they did not see widespread application until Yann LeCun's system for handwritten digit recognition, LeNet.[11] Their design was improved in 1998 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner,[12] generalized in 2003 by Sven Behnke,[13] and simplified in the same year by Patrice Simard, David Steinkraus, and John C. Platt.[14] In 1999 Riesenhuber and Poggio published HMAX,[15] a hierarchical max-pooling model of primate vision which led to the widespread use of pooling in convolutional nets. In 2006 several publications described new ways to train convolutional neural networks more efficiently, allowing networks with more layers to be trained.[16][17][18]

In 2011, convolutional networks were refined by Dan Ciresan et al. and implemented on a GPU with impressive performance results.[19] Fast GPU-based convolutional network implementations opened the door to record-breaking results on many benchmarks. In 2012, Dan Ciresan et al. significantly improved upon the best performance in the literature for multiple image databases, including the MNIST database, the NORB database, the HWDB1.0 dataset (Chinese characters), and the CIFAR10 dataset (60,000 labeled 32x32 RGB images).[20] That same year, Alex Krizhevsky et al. won the ImageNet Large Scale Visual Recognition Challenge by a wide margin, with a top-5 error rate roughly 11 percentage points lower than that of the second-best entry; convolutional network implementations have dominated the contest since.

Performance

Convolutional neural networks are often used in image recognition systems. They have achieved an error rate of 0.23 percent on the MNIST database, which as of February 2012 was the lowest achieved on the database.[21] Another paper on using convolutional neural networks for image classification reported that the learning process of convolutional neural networks was "surprisingly fast"; in the same paper, the best published results at the time were achieved on the MNIST and NORB databases.[19]

When applied to facial recognition, convolutional networks contributed to a large decrease in error rate.[22] Another paper reported a 97.6 percent recognition rate on "5,600 still images of more than 10 subjects".[23] Convolutional neural networks have also been used to assess video quality objectively; after manual training, the resulting system had a very low root mean square error.[24]

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object classification and detection, with millions of images and hundreds of object classes. In 2012 convolutional neural networks beat out other techniques, with a top-5 classification error rate of 15.3%, versus 26.2% achieved by the runner-up. Performance of convolutional neural networks on the ImageNet tests is now close to that of humans,[25][26] with the GoogLeNet submission achieving a top-5 classification error rate of 6.7% in 2014.[7]

The best algorithms still struggle with objects that are small or thin, such as a small ant on a stem of a flower or a person holding a quill in their hand. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon with modern digital cameras.[26] By contrast, such images rarely trouble humans, who instead tend to struggle with other issues: for example, humans are not good at classifying objects into fine-grained categories such as a particular breed of dog or species of bird, whereas convolutional neural networks handle this with ease.

Advantages

The design of convolutional neural networks gives them three major advantages over other forms of feedforward neural networks: faster training, reduced memory requirements, and reduced overfitting (that is, built-in regularization).

Training speed

By sharing weights across many nodes within a layer, convolutional neural networks in effect employ transfer learning across the spatially distributed nodes of the network. The error signal from a particular patch of the training data, which would otherwise only allow learning in the nodes at that location in the network, transfers to every tied node in the layer. Accordingly, if the items of interest are uniformly distributed in image location throughout the training set, there will be a speedup in training proportional to the degree of convolution (i.e. the number of nodes which share each weight).
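
In gradient terms (a standard identity for tied weights, stated here for illustration rather than taken from the cited sources): if w^(1), ..., w^(P) are the copies of a single shared weight w at the P positions in a layer, the gradient of the error E with respect to w is the sum of the per-position gradients,

    \frac{\partial E}{\partial w} = \sum_{p=1}^{P} \frac{\partial E}{\partial w^{(p)}},

so the error signal at every position contributes to each update of the shared weight.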

Memory requirements

The sharing of weights across nodes in each layer means that only a small set of distinct weights must be maintained in memory: since only one weight vector is required for each filter in a layer, the memory required to store that layer's weights is decreased by a factor of the number of nodes sharing each filter.

A modern deep convolutional neural network, such as the Krizhevsky ImageNet network,[27] may have more than half a million neurons and a total of 60 million parameters. Without shared weights, the number of parameters (and accordingly, the memory required to store them) would be drastically higher. Memory requirements are a major bottleneck and often restrict the size of network which can be trained on given hardware.[27] As such, decreasing the memory footprint of a network allows the training of larger, more powerful networks.
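
A back-of-the-envelope sketch of how much sharing saves (with hypothetical layer sizes, not the actual Krizhevsky architecture):

    # One convolutional layer: 64 filters of size 5x5 over 3 input channels,
    # applied at every position of a 224x224 grid.
    filters, kh, kw, channels = 64, 5, 5, 3
    positions = 224 * 224

    shared = filters * kh * kw * channels  # one weight vector per filter: 4,800
    unshared = shared * positions          # a separate copy per position: ~241 million

    print(shared, unshared)                # 4800 240844800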

Limitations

Convolutional neural networks have been criticized for their location invariance. Because pooling is designed to discard positional data about the input, it loses information about spatial relationships that is important to certain kinds of tasks, such as face detection.[28]

Convolutional networks share the limitations of other kinds of feedforward neural networks: they cannot handle arbitrarily-sized inputs (e.g. for language models), cannot use additional processing time to refine their predictions without additional data, and provide little insight into their intermediate representations (that is, explaining the inner workings of large neural networks can be very difficult).

Convolutional networks require a great deal of labeled data to learn (hence benchmarks like the million-image ImageNet) and are unable to generalize from small amounts of labeled data.

Network design

When sharing weights between subsets of nodes within a layer, each set of nodes with shared weights can be viewed as a distinct "filter" over the input data. As such, most successful convolutional neural network designs, for example LeNet, use several filters in each layer, each of which has its own set of weights and in effect performs a different transformation of the input data.[2] The idea is that, given a set of filters in a layer, training will drive them to approximate the maximally informative (with respect to the error function) set of uniformly-applied transformations of the input.

In convolutional neural networks for image recognition tasks, the most common design (used in such networks as LeNet[2] and the Krizhevsky ImageNet network[27]) is roughly as follows: the first layers of the network (realistically, as many as can be computationally afforded) are alternating layers of convolution and max-pooling; these are followed by one or more dense (fully-connected) layers, the last of which has the same dimensionality as the number of classes; and finally a classifier, which may be a linear classifier, a softmax, or something else.
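
A sketch of this design in the modern PyTorch library (used here purely for illustration; the layer sizes are hypothetical and do not reproduce LeNet or the Krizhevsky network):

    import torch.nn as nn

    # For 32x32 RGB inputs and 10 classes: alternating convolution and
    # max-pooling layers, then fully-connected layers, then a classifier.
    net = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 14x14
        nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 5x5
        nn.Flatten(),
        nn.Linear(32 * 5 * 5, 120), nn.ReLU(),  # fully-connected layer
        nn.Linear(120, 10),                     # one output per class
        nn.Softmax(dim=1),                      # classifier layer
    )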

The pooling layers serve to reduce the dimensionality of the data, extracting the single highest response from any of the nodes in a region and using it to represent the entire region. The fully-connected layers allow the network to aggregate data globally from all regions of the image, which lower layers could not do. The classifier layer turns the response from the top layer into the actual classification of the image.

Using a softmax activation function as the classifier layer of the network allows the outputs to be interpreted as posterior probabilities, giving a confidence measure on each classification which is useful for analysis.
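
Concretely (a standard definition), the softmax maps the top layer's responses z_1, ..., z_K, one per class, to positive values that sum to one:

    P(\text{class } i \mid x) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}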

Some time-delay neural networks use an architecture very similar to that of convolutional neural networks, with the "tiling" of neuron outputs carried out in timed stages rather than across spatial regions; this can likewise be useful for image analysis tasks.[24]

Regularization methods

Empirical

Dropout

One common form of regularization used with convolutional neural networks is dropout,[29] which "consists of setting to zero the output of each hidden neuron with probability 0.5".[27] That is, for any given execution of the network, a random set of half the nodes will not participate in either forward prediction or backpropagation. As a result, each training example is run on a random network architecture, but all of those architectures share weights. This technique seems to reduce complex co-adaptations between nodes, leading them to learn more robust features which generalize better to new data. Dropout has been shown to improve the performance of neural networks on tasks in vision, speech recognition, document classification, and computational biology.[30]
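
A minimal sketch of the technique (the "inverted dropout" variant, which scales surviving activations by 1/(1-p) during training so that no rescaling is needed at test time; the original formulation instead halved the weights at test time):

    import numpy as np

    def dropout(activations, p=0.5, rng=np.random.default_rng()):
        # Zero each hidden activation with probability p (training only);
        # scaling by 1/(1-p) keeps the expected activation unchanged.
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)

    hidden = np.random.rand(4, 8)   # a batch of hidden-layer outputs
    print(dropout(hidden))          # about half the entries are now zero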

Artificial data

Since the degree of overfitting of a model is determined by both its power and the amount of training it receives, providing a convolutional network with extra training examples can reduce overfitting. Since these networks are usually already trained with all available data, one approach is to either generate new examples from scratch (if possible) or perturb the existing training samples to create new ones. For example, input images could be asymmetrically cropped by a few percent to create new examples with the same label as the original.[27]
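
A minimal sketch of this kind of augmentation (the sizes are hypothetical; Krizhevsky et al. extracted 224x224 patches from 256x256 training images[27]):

    import numpy as np

    def random_crop(image, crop_h, crop_w, rng=np.random.default_rng()):
        # Cut out a randomly positioned sub-image; its label is unchanged.
        h, w = image.shape[:2]
        top = rng.integers(0, h - crop_h + 1)
        left = rng.integers(0, w - crop_w + 1)
        return image[top:top + crop_h, left:left + crop_w]

    image = np.random.rand(256, 256, 3)          # one training image
    new_example = random_crop(image, 224, 224)   # a new example, same label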

Explicit

Network size

The simplest way to prevent overfitting of a network is simply to limit the number of hidden units and free parameters (connections) in the network.[31] This restricts the predictive power of the network directly, reducing the complexity of the function it can compute on the data, and thus limits the amount of overfitting. Restricting the number of parameters in this way is analogous to penalizing the "zero norm" of the weight vector (the count of its nonzero entries).

Weight decay

A simple form of added regularizer is weight decay, which adds an additional error term, proportional to the sum of the absolute values of the weights (L1 norm) or to the squared magnitude of the weight vector (L2 norm), to the error at each node. L2 weight decay is equivalent to placing a zero-mean Gaussian prior over the weight vector (the L1 penalty corresponds to a Laplacian prior). The level of acceptable model complexity can be reduced by increasing the proportionality constant, thus increasing the penalty for large weight vectors.[31]
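
Written out (standard definitions), with E the unregularized error, w the weight vector, and \lambda the proportionality constant:

    E_{\mathrm{L1}} = E + \lambda \sum_i |w_i| \qquad \text{or} \qquad E_{\mathrm{L2}} = E + \lambda \sum_i w_i^2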

Industry use

The first widespread use of convolutional neural networks in industry was a handwriting recognition system developed by Yann LeCun which was brought to market in several forms, including a bank check optical character recognition system.[12][32]

In recent years there have been several deployments of convolutional network-based technologies, primarily in speech recognition and vision tasks.

Speech

As of 2012, the voice recognition features included in Google's Android mobile operating system are powered by a convolutional neural network,[33][34] and in the same year Microsoft deployed a new version of their Microsoft Audio Video Indexing Service based on convolutional neural networks.[35]

Vision

Google has used convolutional neural networks for image recognition in several of their products. Google Street View uses them both for privacy protection, detecting faces, license plates, and other identifying information in order to obscure them;[6] and for house number identification,[36][5] to better align imagery with street addresses. In 2013 Google released a photo search feature in Google+ Photos which used convolutional neural networks to index photos by their contents.[4]

Other

Spotify has tested a music recommendation service using convolutional neural networks to learn from snippets of songs and provide recommendations based on which songs the network judges to be similar to those the user likes.[37]

Implementations

Complete systems

  • Torch7, a Lua library used by Facebook AI, DeepMind, and Google Brain[38]
  • Caffe, a fast GPU-based implementation with Python bindings
  • Pylearn2, a comprehensive Python machine learning library

Component libraries

  • cuda-convnet and its successor, cuda-convnet2: Alex Krizhevsky's GPU-based CNN library
  • cudamat, a Python library for GPU-based matrix computations.
  • cuDNN, a library provided by NVIDIA for running CNNs on their GPUs
  • MatConvNet, a Matlab toolbox for building CNNs
  • DeepLearnToolbox, a Matlab toolbox with several flavors of deep networks

References

  1. ^ LeCun, Yann (1990). "Handwritten Digit Recognition with a Back-Propagation Network" (PDF). Advances in Neural Information Processing Systems (NIPS 1989).
  2. ^ a b c d "Convolutional Neural Networks (LeNet) - DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.
  3. ^ LeCun, Yann. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.
  4. ^ a b "Improving Photo Search: A Step Across the Semantic Gap". 12 June 2013.
  5. ^ a b Goodfellow, Ian (14 April 2014). "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks". arXiv:1312.6082 [cs.CV].
  6. ^ a b Frome, Andrea; Cheung, German; Abdulkader, Ahmad; Zennaro, Marco; Bo Wu; Bissacco, Alessandro; Adam, Hartwig; Neven, Hartmut; Vincent, Luc (September 2009). "Large-scale privacy protection in Google Street View". 2009 IEEE 12th International Conference on Computer Vision. pp. 2373–2380. doi:10.1109/ICCV.2009.5459413. ISBN 978-1-4244-4420-5. S2CID 1964985.
  7. ^ a b "Results of ILSVRC2014". ImageNet Large Scale Visual Recognition Challenge.
  8. ^ Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function" (PDF). Mathematics of Control, Signals, and Systems. 2 (4): 303–314. doi:10.1007/BF02551274. S2CID 3958369.
  9. ^ Hubel, David; Wiesel, Torsten (1968). "Receptive fields and functional architecture of monkey striate cortex". Journal of Physiology. 195: 215–243. doi:10.1113/jphysiol.1968.sp008455. S2CID 7136759. Retrieved 6 December 2014.
  10. ^ a b Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
  11. ^ LeCun, Yann (1989). "Backpropagation Applied to Handwritten Zip Code Recognition". Neural Computation. 1 (4): 541–551. doi:10.1162/neco.1989.1.4.541. S2CID 41312633. Retrieved 6 December 2014.
  12. ^ a b LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. doi:10.1109/5.726791. S2CID 14542261. Retrieved 16 November 2013.
  13. ^ S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume 2766 of Lecture Notes in Computer Science. Springer, 2003.
  14. ^ Simard, Patrice, David Steinkraus, and John C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis." In ICDAR, vol. 3, pp. 958-962. 2003.
  15. ^ Riesenhuber, Maximilian; Poggio, Tomaso (1999). "Hierarchical models of object recognition in cortex" (PDF). Nature Neuroscience.
  16. ^ Hinton, GE; Osindero, S; Teh, YW (Jul 2006). "A fast learning algorithm for deep belief nets". Neural Computation. 18 (7): 1527–54. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  17. ^ Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems: 153–160.
  18. ^ Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient Learning of Sparse Representations with an Energy-Based Model" (PDF). Advances in Neural Information Processing Systems.
  19. ^ a b Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two. 2: 1237–1242. Retrieved 17 November 2013.
  20. ^ Ciresan, Dan (13 February 2012). "Multi-column Deep Neural Networks for Image Classification". arXiv:1202.2745 [cs.CV].
  21. ^ "The MNIST Database".
  22. ^ Lawrence, Steve; C. Lee Giles; Ah Chung Tsoi; Andrew D. Back (1997). "Face Recognition: A Convolutional Neural Network Approach". Neural Networks, IEEE Transactions on. 8 (1): 98–113. CiteSeerX 10.1.1.92.5813. doi:10.1109/72.554195.
  23. ^ Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks. 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. Retrieved 17 November 2013.
  24. ^ a b Le Callet, Patrick; Christian Viard-Gaudin; Dominique Barba (2006). "A Convolutional Neural Network Approach for Objective Video Quality Assessment" (PDF). IEEE Transactions on Neural Networks. 17 (5): 1316–1327. doi:10.1109/TNN.2006.879766. PMID 17001990. S2CID 221185563. Retrieved 17 November 2013.
  25. ^ O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge", 2014.
  26. ^ a b Karpathy, Andrej (2 September 2014). "What I learned from competing against a ConvNet on ImageNet".
  27. ^ a b c d e Krizhevsky, Alex (2012). "ImageNet Classification with Deep Convolutional Neural Networks". Advances in Neural Information Processing Systems.
  28. ^ "Reddit AMA: Geoff Hinton".
  29. ^ Hinton, Geoffrey (3 July 2012). "Improving neural networks by preventing co-adaptation of feature detectors". arXiv:1207.0580 [cs.NE].
  30. ^ Srivastava, Nitish; Hinton, Geoffrey (June 2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (PDF). Journal of Machine Learning Research.
  31. ^ a b Srihari, Sargur. "Regularization in Neural Networks" (PDF).
  32. ^ LeCun, Yann. "Yann LeCun's Research".
  33. ^ "Speech Recognition and Deep Learning". Google Research Blog. 6 August 2012.
  34. ^ Hinton, Geoffrey (2012). "Deep Neural Networks for Acoustic Modeling in Speech Recognition". Signal Processing Magazine. doi:10.1109/MSP.2012.2205597. S2CID 206485943.
  35. ^ "Deep-Neural-Network Speech Recognition Debuts". Inside Microsoft Research. 14 June 2012.
  36. ^ "How Google Cracked House Number Identification in Street View". 6 January 2014.
  37. ^ "Spotify intern dreams up better music recommendations through deep learning". VentureBeat. 6 August 2014.
  38. ^ "Reddit AMA: Yann LeCun".
