# Residual neural network

Canonical form of a residual neural network: layer ${\textstyle \ell -1}$ is skipped over by the activation from layer ${\textstyle \ell -2}$.

A residual neural network is an artificial neural network (ANN) of a kind that builds on constructs known from pyramidal cells in the cerebral cortex[citation needed]. Residual neural networks do this by utilizing skip connections, or shortcuts, to jump over some layers. In the simplest case, as in ResNets, a skip bypasses only a single layer.[1] With an additional weight matrix to learn the skip weights, the model is referred to as a HighwayNet.[2] With several parallel skips, it is referred to as a DenseNet.[3] In this context, a non-residual neural network is described as a plain network.

A reconstruction of a pyramidal cell. Soma and dendrites are labeled in red, axon arbor in blue. (1) Soma, (2) Basal dendrite, (3) Apical dendrite, (4) Axon, (5) Collateral axon.

The brain has structures similar to residual nets: cortical layer VI neurons get input from layer I, skipping over the intermediary layers[citation needed]. In the pyramidal cell figure, this compares to signals from the (3) apical dendrite skipping over layers, while the (2) basal dendrite collects signals from the previous and/or same layer.[note 1][4] Similar structures exist for other layers.[5] It is not clear how many layers in the cerebral cortex compare to layers in an artificial neural network, nor whether every area of the cerebral cortex exhibits the same structure, but over large areas they appear quite similar.

One motivation for skipping over layers in ANNs is to avoid the problem of vanishing gradients, by reusing activations from a previous layer until the layer next to the current one has learned its weights. During training, the weights adapt to mute the upstream layer and amplify the layer next to the current one. In the simplest case, only the weights for the connection to the adjacent layer are adapted, with no explicit weights for the upstream layer. This usually works well when a single nonlinear layer is stepped over, or when the intermediate layers are all linear. If not, then an explicit weight matrix should be learned for the skipped connection.

The intuition for why this works is that the network effectively collapses into fewer layers in the initial phase, which makes it easier to learn, and then gradually expands the layers as it learns more of the feature space. During later learning, when all layers are expanded, the network stays closer to the manifold and thus learns faster. A neural network without residual parts explores more of the feature space, which makes it more vulnerable to small perturbations that cause it to leave the manifold altogether and require extra training data to recover.

## Forward propagation

For single skips, the layers may be indexed either as ${\textstyle \ell -2}$ to ${\textstyle \ell }$ or as ${\textstyle \ell }$ to ${\textstyle \ell +2}$. (Script ${\textstyle \ell }$ is used here for clarity; it is usually written as a simple l.) The two indexing systems are convenient when describing skips as going backward or forward. As the signal flows forward through the network, it is easier to describe the skip as reaching ${\textstyle \ell +k}$ from a given layer, but as a learning rule (backpropagation) it is easier to describe which activation layer you reuse, ${\textstyle \ell -k}$, where ${\textstyle k-1}$ is the number of layers skipped.

If there is a weight matrix ${\textstyle W^{\ell -1,\ell }}$ for connection weights from layer ${\textstyle \ell -1}$ to ${\textstyle \ell }$, and a weight matrix ${\textstyle W^{\ell -2,\ell }}$ for connection weights from layer ${\textstyle \ell -2}$ to ${\textstyle \ell }$, then the forward propagation through the activation function would be (aka HighwayNets)

${\displaystyle {\begin{aligned}a^{\ell }&:=\mathbf {g} (W^{\ell -1,\ell }\cdot a^{\ell -1}+b^{\ell }+W^{\ell -2,\ell }\cdot a^{\ell -2})\\&:=\mathbf {g} (Z^{\ell }+W^{\ell -2,\ell }\cdot a^{\ell -2})\end{aligned}}}$

where we have

${\textstyle a^{\ell }}$ the activations (outputs) of neurons in layer ${\textstyle \ell }$,
${\textstyle \mathbf {g} }$ the activation function for layer ${\textstyle \ell }$,
${\textstyle W^{\ell -1,\ell }}$ the weight matrix for neurons between layer ${\textstyle \ell -1}$ and ${\textstyle \ell }$,
${\textstyle b^{\ell }}$ the bias vector for layer ${\textstyle \ell }$, and
${\textstyle Z^{\ell }=W^{\ell -1,\ell }\cdot a^{\ell -1}+b^{\ell }}$ the weighted input to layer ${\textstyle \ell }$.
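The HighwayNet-style forward pass above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the layer sizes are arbitrary, and ReLU stands in for the unspecified activation ${\textstyle \mathbf {g} }$.

```python
import numpy as np

def relu(z):
    """Element-wise activation, standing in for g."""
    return np.maximum(z, 0.0)

# Hypothetical layer sizes: layer l-2 has 4 units, l-1 has 5, l has 3.
rng = np.random.default_rng(0)
a_lm2 = rng.standard_normal(4)        # activations a^{l-2}
a_lm1 = rng.standard_normal(5)        # activations a^{l-1}
W_lm1 = rng.standard_normal((3, 5))   # W^{l-1,l}: normal-path weights
W_lm2 = rng.standard_normal((3, 4))   # W^{l-2,l}: learned skip weights
b = np.zeros(3)                       # bias b^l

Z = W_lm1 @ a_lm1 + b                 # Z^l = W^{l-1,l} . a^{l-1} + b^l
a_l = relu(Z + W_lm2 @ a_lm2)         # a^l = g(Z^l + W^{l-2,l} . a^{l-2})
```

Because the skip has its own weight matrix, the two source layers may have different widths; the matrices simply map both onto the size of layer ${\textstyle \ell }$.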

If there is no explicit matrix ${\textstyle W^{\ell -2,\ell }}$ (aka ResNets), then the forward propagation through the activation function simplifies to

${\displaystyle a^{\ell }:=\mathbf {g} (Z^{\ell }+a^{\ell -2})}$

Another way to formulate this is to substitute an identity matrix for ${\textstyle W^{\ell -2,\ell }}$, but that is only valid when the dimensions match. This is somewhat confusingly called an identity block, which means that the activations from layer ${\textstyle \ell -2}$ are passed to layer ${\textstyle \ell }$ without a weight matrix.
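A sketch of one such identity block, again with ReLU as a stand-in activation and hypothetical layer shapes. Note the dimension requirement: the skipped-from layer ${\textstyle \ell -2}$ must have the same width as layer ${\textstyle \ell }$, since the activations are added with no weight matrix in between.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_step(a_lm2, W_lm1, W_l, b_lm1, b_l):
    """One identity-skip block: two weighted layers, then add a^{l-2} back in.
    Requires layer l to have the same width as a^{l-2} (dimensions must match)."""
    a_lm1 = relu(W_lm1 @ a_lm2 + b_lm1)   # ordinary layer l-1
    Z = W_l @ a_lm1 + b_l                 # Z^l
    return relu(Z + a_lm2)                # a^l = g(Z^l + a^{l-2})
```

When the widths differ, the identity shortcut cannot be used directly, and an explicit (possibly non-square) skip matrix is needed instead.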

In the cerebral cortex such forward skips are done for several layers. Usually all forward skips start from the same layer, and successively connect to later layers. In the general case this will be expressed as (aka DenseNets)

${\displaystyle a^{\ell }:=\mathbf {g} \left(Z^{\ell }+\sum _{k=2}^{K}W^{\ell -k,\ell }\cdot a^{\ell -k}\right)}$
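The summed-skip formula can be sketched as follows, assuming each skip has its own learned matrix ${\textstyle W^{\ell -k,\ell }}$; the function names and shapes are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense_forward(prev_acts, skip_weights, W, b):
    """DenseNet-style step (hypothetical helper).
    prev_acts[0] is a^{l-1}; prev_acts[k-1] is a^{l-k} for k = 2..K.
    skip_weights[k-2] is the skip matrix W^{l-k,l}."""
    Z = W @ prev_acts[0] + b   # Z^l from the normal path
    # sum_{k=2}^{K} W^{l-k,l} . a^{l-k}
    skip = sum(Wk @ ak for Wk, ak in zip(skip_weights, prev_acts[1:]))
    return relu(Z + skip)
```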

## Backward propagation

During backpropagation learning, the update for the normal path is

${\displaystyle \Delta w^{\ell -1,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -1,\ell }}}=-\eta a^{\ell -1}\cdot \delta ^{\ell }}$

and for the skip paths (note that they are close to identical)

${\displaystyle \Delta w^{\ell -2,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -2,\ell }}}=-\eta a^{\ell -2}\cdot \delta ^{\ell }}$

In both cases we have

${\textstyle \eta }$ a learning rate (${\textstyle \eta >0}$),
${\textstyle \delta ^{\ell }}$ the error signal of neurons at layer ${\textstyle \ell }$, and
${\textstyle a^{\ell -k}}$ the activations of neurons at layer ${\textstyle \ell -k}$, the layer the connection starts from

If the skip path has fixed weights (e.g. the identity matrix, see above), then they will not be updated. If they can be updated, then the rule will be an ordinary backprop update rule.

In the general case there can be ${\textstyle K}$ skip path weight matrices, thus

${\displaystyle \Delta w^{\ell -k,\ell }:=-\eta {\frac {\partial E^{\ell }}{\partial w^{\ell -k,\ell }}}=-\eta a^{\ell -k}\cdot \delta ^{\ell }}$

As the learning rules are similar, the weight matrices can be merged and learned in the same step.
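The merge can be seen concretely in NumPy: stacking the weight matrices side by side and concatenating the source activations gives exactly the same result as applying the rule per path. The shapes and learning rate below are arbitrary, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
delta = rng.standard_normal(3)        # error signal delta^l (3 units in layer l)
a_lm1 = rng.standard_normal(5)        # a^{l-1}
a_lm2 = rng.standard_normal(4)        # a^{l-2}
W1 = rng.standard_normal((3, 5))      # W^{l-1,l}
W2 = rng.standard_normal((3, 4))      # W^{l-2,l}
eta = 0.1                             # learning rate, eta > 0

# Separate updates: Delta w^{l-k,l} = -eta * delta^l (a^{l-k})^T
W1_new = W1 - eta * np.outer(delta, a_lm1)
W2_new = W2 - eta * np.outer(delta, a_lm2)

# Merged: stack the matrices, concatenate the activations, apply the rule once.
W_merged = np.hstack([W1, W2])
a_merged = np.concatenate([a_lm1, a_lm2])
W_merged_new = W_merged - eta * np.outer(delta, a_merged)
```

The merged update equals the two separate updates stacked side by side, since the outer product distributes over the concatenated activation vector.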

## Notes

1. ^ Some research indicates that there are additional structures here, so this explanation is somewhat simplified.


## References

1. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015-12-10). "Deep Residual Learning for Image Recognition". arXiv:1512.03385 [cs.CV].
2. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2015-05-02). "Highway Networks". arXiv:1505.00387 [cs.LG].
3. ^ Huang, Gao; Liu, Zhuang; Weinberger, Kilian Q.; van der Maaten, Laurens (2016-08-24). "Densely Connected Convolutional Networks". arXiv:1608.06993 [cs.CV].
4. ^ Winterer, Jochen; Maier, Nikolaus; Wozny, Christian; Beed, Prateep; Breustedt, Jörg; Evangelista, Roberta; Peng, Yangfan; D’Albis, Tiziano; Kempter, Richard (2017). "Excitatory Microcircuits within Superficial Layers of the Medial Entorhinal Cortex". Cell Reports. 19 (6): 1110–1116. doi:10.1016/j.celrep.2017.04.041. PMID 28494861.
5. ^ Fitzpatrick, David (1996-05-01). "The Functional Organization of Local Circuits in Visual Cortex: Insights from the Study of Tree Shrew Striate Cortex". Cerebral Cortex. 6 (3): 329–341. doi:10.1093/cercor/6.3.329. ISSN 1047-3211.