# Residual neural network

Canonical form of a residual neural network: a skip connection jumps over layer ${\textstyle \ell -1}$, carrying the activation from layer ${\textstyle \ell -2}$ forward.

A residual neural network (ResNet)[1] is an artificial neural network (ANN). It is a gateless or open-gated variant of the HighwayNet,[2] the first working very deep feedforward neural network, with hundreds of layers, much deeper than previous neural networks. Skip connections or shortcuts are used to jump over some layers (HighwayNets may also learn the skip weights themselves through an additional weight matrix for their gates). Typical ResNet models are implemented with double- or triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. Models with several parallel skips are referred to as DenseNets.[3] In the context of residual neural networks, a non-residual network may be described as a plain network.
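The double-layer skip described above can be sketched as a small NumPy function (an illustrative sketch only: batch normalization is omitted, and the function and parameter names are not from any particular library):

```python
import numpy as np

def relu(x):
    """Rectified linear unit, the nonlinearity used in typical ResNet blocks."""
    return np.maximum(x, 0.0)

def residual_block(x, W1, b1, W2, b2):
    """Double-layer residual block: ReLU(W2 @ ReLU(W1 @ x + b1) + b2 + x).

    The identity skip (+ x) is added before the final nonlinearity, so the
    input and output widths must match. Batch normalization is omitted here.
    """
    h = relu(W1 @ x + b1)           # first (skipped-over) layer
    return relu(W2 @ h + b2 + x)    # second layer plus identity shortcut
```

With all weights and biases zero, the block reduces to `relu(x)`: the shortcut alone carries the signal, which is what makes very deep stacks of such blocks trainable.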

As in the case of Long Short-Term Memory recurrent neural networks,[4] there are two main reasons to add skip connections: to avoid the problem of vanishing gradients,[5] yielding networks that are easier to optimize, in which the gating mechanisms facilitate information flow across many layers ("information highways"),[6][7] and to mitigate the degradation (accuracy saturation) problem, in which adding more layers to a suitably deep model leads to higher training error.[1] During training, the weights adapt to mute the upstream layer and amplify the previously skipped layer. In the simplest case, only the weights for the adjacent layer's connection are adapted, with no explicit weights for the upstream layer. This works best when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection (i.e., a HighwayNet should be used).

Skipping effectively simplifies the network, using fewer layers in the initial training stages. This speeds learning by reducing the impact of vanishing gradients,[5] as there are fewer layers to propagate through. The network then gradually restores the skipped layers as it learns the feature space. Towards the end of training, when all layers are expanded, it stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space. This makes it more vulnerable to perturbations that cause it to leave the manifold, and necessitates extra training data to recover.

A residual neural network was used to win the ImageNet[8] 2015 competition,[1] and has become the most cited neural network of the 21st century.[9]

## Forward propagation

Given a weight matrix ${\textstyle W^{\ell -1,\ell }}$ for connection weights from layer ${\textstyle \ell -1}$ to ${\textstyle \ell }$, and a weight matrix ${\textstyle W^{\ell -2,\ell }}$ for connection weights from layer ${\textstyle \ell -2}$ to ${\textstyle \ell }$, the forward propagation through the activation function is (as in HighwayNets)

${\displaystyle {\begin{aligned}a^{\ell }&:=\mathbf {g} (W^{\ell -1,\ell }\cdot a^{\ell -1}+b^{\ell }+W^{\ell -2,\ell }\cdot a^{\ell -2})\\&:=\mathbf {g} (Z^{\ell }+W^{\ell -2,\ell }\cdot a^{\ell -2})\end{aligned}}}$

where

${\textstyle a^{\ell }}$ are the activations (outputs) of neurons in layer ${\textstyle \ell }$,
${\textstyle \mathbf {g} }$ is the activation function for layer ${\textstyle \ell }$,
${\textstyle W^{\ell -1,\ell }}$ is the weight matrix for neurons between layer ${\textstyle \ell -1}$ and ${\textstyle \ell }$, and
${\textstyle Z^{\ell }=W^{\ell -1,\ell }\cdot a^{\ell -1}+b^{\ell }}$.
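The forward propagation above translates directly into a short NumPy sketch (illustrative names; ReLU is assumed for ${\textstyle \mathbf {g} }$, though the formula holds for any activation function):

```python
import numpy as np

def g(x):
    """Activation function for layer l (ReLU chosen for illustration)."""
    return np.maximum(x, 0.0)

def forward_highway(a_prev, a_skip, W_prev, W_skip, b):
    """a^l = g(W^{l-1,l} @ a^{l-1} + b^l + W^{l-2,l} @ a^{l-2})."""
    Z = W_prev @ a_prev + b         # Z^l, the ordinary pre-activation
    return g(Z + W_skip @ a_skip)   # add the weighted skip term
```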

If the number of vertices on layer ${\textstyle \ell -2}$ equals the number of vertices on layer ${\textstyle \ell }$ and if ${\textstyle W^{\ell -2,\ell }}$ is the identity matrix, then forward propagation through the activation function simplifies to ${\displaystyle a^{\ell }:=\mathbf {g} (Z^{\ell }+a^{\ell -2}).}$ In this case, the connection between layers ${\textstyle \ell -2}$ and ${\textstyle \ell }$ is called an identity block.
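Under the identity-matrix assumption the skip term reduces to simply adding ${\textstyle a^{\ell -2}}$, so the identity block needs no skip weights at all. A minimal NumPy sketch (illustrative names only):

```python
import numpy as np

def g(x):
    """Activation function for layer l (ReLU chosen for illustration)."""
    return np.maximum(x, 0.0)

def identity_block(a_prev, a_skip, W, b):
    """a^l = g(Z^l + a^{l-2}): the skip weights are fixed to the identity.

    Valid only when layers l-2 and l have the same width, so that
    a_skip can be added to Z^l elementwise.
    """
    Z = W @ a_prev + b    # Z^l = W^{l-1,l} @ a^{l-1} + b^l
    return g(Z + a_skip)  # identity shortcut: no W^{l-2,l} to learn
```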

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer and successively connect to later layers. In the general case this is expressed as (as in DenseNets)

${\displaystyle a^{\ell }:=\mathbf {g} \left(Z^{\ell }+\sum _{k=2}^{K}W^{\ell -k,\ell }\cdot a^{\ell -k}\right)}$.
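The multi-skip sum above can be sketched by looping over the skip paths (an illustrative sketch; `skip_pairs` is a hypothetical parameter name holding one `(W, a)` pair per skip connection):

```python
import numpy as np

def g(x):
    """Activation function for layer l (ReLU chosen for illustration)."""
    return np.maximum(x, 0.0)

def dense_forward(a_prev, b, W_prev, skip_pairs):
    """a^l = g(Z^l + sum_{k=2}^{K} W^{l-k,l} @ a^{l-k}).

    skip_pairs is a list of (W^{l-k,l}, a^{l-k}) tuples, one per skip path.
    With an empty list this reduces to a plain layer, g(Z^l).
    """
    Z = W_prev @ a_prev + b
    for W_skip, a_skip in skip_pairs:
        Z = Z + W_skip @ a_skip   # accumulate each weighted skip term
    return g(Z)
```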

## Backward propagation

During backpropagation learning, the update for the normal path is

${\displaystyle \Delta w^{\ell -1,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -1,\ell }}}=-\eta a^{\ell -1}\cdot \delta ^{\ell }}$

and for the skip paths (nearly identical)

${\displaystyle \Delta w^{\ell -2,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -2,\ell }}}=-\eta a^{\ell -2}\cdot \delta ^{\ell }}$.

In both cases

${\textstyle \eta }$ is a learning rate (${\textstyle \eta >0}$),
${\textstyle \delta ^{\ell }}$ is the error signal of neurons at layer ${\textstyle \ell }$, and
${\textstyle a^{\ell }}$ is the activation of neurons at layer ${\textstyle \ell }$.

If the skip path has fixed weights (e.g. the identity matrix, as above), then they are not updated. If the skip-path weights can be updated, they follow an ordinary backpropagation update rule.

In the general case there can be ${\textstyle K}$ skip path weight matrices, thus

${\displaystyle \Delta w^{\ell -k,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -k,\ell }}}=-\eta a^{\ell -k}\cdot \delta ^{\ell }}$

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
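Because the per-path rules differ only in which activation they use, the merged step can be sketched in a few lines of NumPy (illustrative names; the gradient is written as an outer product so that each update matches its weight matrix's shape):

```python
import numpy as np

def update_weights(weights, activations, delta, eta):
    """Apply the update rule to every path's weight matrix in one step.

    weights[k] has shape (n_l, n_source) and activations[k] is the
    source activation a^{l-k} for that path; delta is the error signal
    delta^l at layer l, and eta > 0 is the learning rate. Each update is
    W - eta * outer(delta, a), the matrix form of -eta * a^{l-k} * delta^l.
    """
    return [W - eta * np.outer(delta, a)
            for W, a in zip(weights, activations)]
```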

## References

1. ^ a b c He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
2. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015-05-02). "Highway Networks". arXiv:1505.00387 [cs.LG].
3. ^ Huang, Gao; Liu, Zhuang; Van Der Maaten, Laurens; Weinberger, Kilian Q. (2017). Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, HI: IEEE. pp. 2261–2269. arXiv:1608.06993. doi:10.1109/CVPR.2017.243. ISBN 978-1-5386-0457-1.
4. ^ Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
5. ^ a b Hochreiter, Sepp (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science, advisor: J. Schmidhuber.
6. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015-05-02). "Highway Networks". arXiv:1505.00387 [cs.LG].
7. ^ Srivastava, Rupesh K; Greff, Klaus; Schmidhuber, Juergen (2015). "Training Very Deep Networks". Advances in Neural Information Processing Systems 28. Curran Associates, Inc. 28: 2377–2385.
8. ^ Deng, Jia; Dong, Wei; Socher, Richard; Li, Li-Jia; Li, Kai; Fei-Fei, Li (2009). "Imagenet: A large-scale hierarchical image database". CVPR.
9. ^ Schmidhuber, Jürgen (2021). "The most cited neural networks all build on work done in my labs". AI Blog. IDSIA, Switzerland. Retrieved 2022-04-30.