# Batch normalization

Batch normalization (also known as batch norm) is a method used to make artificial neural networks faster and more stable through normalization of the input layer by re-centering and re-scaling.[1][2] It was proposed by Sergey Ioffe and Christian Szegedy in 2015.[3]

While the effect of batch normalization is evident, the reasons behind its effectiveness remain under discussion. It was believed that it can mitigate the problem of internal covariate shift, where parameter initialization and changes in the distribution of the inputs of each layer affect the learning rate of the network.[3] Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves the performance.[4] However, at initialization, batch normalization in fact induces severe gradient explosion in deep networks, which is only alleviated by skip connections in residual networks.[5] Others sustain that batch normalization achieves length-direction decoupling, and thereby accelerates neural networks.[6]

After batch norm, many other in-layer normalization methods have been introduced, such as instance normalization, layer normalization, group normalization.

## Motivation: The phenomenon of internal covariate shift

Each layer of a neural network has inputs with a corresponding distribution, which is affected during the training process by the randomness in the parameter initialization and the randomness in the input data. The effect of these sources of randomness on the distribution of the inputs to internal layers during training is described as internal covariate shift. Although a clear-cut precise definition seems to be missing, the phenomenon observed in experiments is the change on means and variances of the inputs to internal layers during training.

Batch normalization was initially proposed to mitigate internal covariate shift.[3] During the training stage of networks, as the parameters of the preceding layers change, the distribution of inputs to the current layer changes accordingly, such that the current layer needs to constantly readjust to new distributions. This problem is especially severe for deep networks, because small changes in shallower hidden layers will be amplified as they propagate within the network, resulting in significant shift in deeper hidden layers. Therefore, the method of batch normalization is proposed to reduce these unwanted shifts to speed up training and to produce more reliable models.

Besides reducing internal covariate shift, batch normalization is believed to introduce many other benefits. With this additional operation, the network can use higher learning rate without vanishing or exploding gradients. Furthermore, batch normalization seems to have a regularizing effect such that the network improves its generalization properties, and it is thus unnecessary to use dropout to mitigate overfitting. It has been observed also that with batch norm the network becomes more robust to different initialization schemes and learning rates.

## Procedures[3]

### Batch Normalizing Transform

In a neural network, batch normalization is achieved through a normalization step that fixes the means and variances of each layer's inputs. Ideally, the normalization would be conducted over the entire training set, but to use this step jointly with stochastic optimization methods, it is impractical to use the global information. Thus, normalization is restrained to each mini-batch in the training process.

Use B to denote a mini-batch of size m of the entire training set. The empirical mean and variance of B could thus be denoted as

${\displaystyle \mu _{B}={\frac {1}{m}}\sum _{i=1}^{m}x_{i}}$, and ${\displaystyle \sigma _{B}^{2}={\frac {1}{m}}\sum _{i=1}^{m}(x_{i}-\mu _{B})^{2}}$.

For a layer of the network with d-dimensional input, ${\displaystyle x=(x^{(1)},...,x^{(d)})}$, each dimension of its input is then normalized (i.e. re-centered and re-scaled) separately,

${\displaystyle {\hat {x}}_{i}^{(k)}={\frac {x_{i}^{(k)}-\mu _{B}^{(k)}}{\sqrt {\sigma _{B}^{(k)^{2}}+\epsilon }}}}$, where ${\displaystyle k\in [1,d]}$ and ${\displaystyle i\in [1,m]}$; ${\displaystyle \mu _{B}^{(k)}}$ and ${\displaystyle \sigma _{B}^{(k)^{2}}}$ are the per-dimension mean and variance, respectively.

${\displaystyle \epsilon }$ is added in the denominator for numerical stability and is an arbitrarily small constant. The resulting normalized activation ${\displaystyle {\hat {x}}^{(k)}}$have zero mean and unit variance, if ${\displaystyle \epsilon }$ is not taken into account. To restore the representation power of the network, a transformation step then follows as

${\displaystyle y_{i}^{(k)}=\gamma ^{(k)}{\hat {x}}_{i}^{(k)}+\beta ^{(k)}}$,

where the parameters ${\displaystyle \gamma ^{(k)}}$ and ${\displaystyle \beta ^{(k)}}$ are subsequently learned in the optimization process.

Formally, the operation that implements batch normalization is a transform ${\displaystyle BN_{\gamma ^{(k)},\beta ^{(k)}}:x_{1...m}^{(k)}\rightarrow y_{1...m}^{(k)}}$ called the Batch Normalizing transform. The output of the BN transform ${\displaystyle y^{(k)}=BN_{\gamma ^{(k)},\beta ^{(k)}}(x^{(k)})}$ is then passed to other network layers, while the normalized output ${\displaystyle {\hat {x}}_{i}^{(k)}}$ remains internal to the current layer.

### Backpropagation

The described BN transform is a differentiable operation, and the gradient of the loss l with respect to the different parameters can be computed directly with chain rule.

Specifically, ${\displaystyle {\frac {\partial l}{\partial y_{i}^{(k)}}}}$ depends on the choice of activation function, and the gradient against other parameters could be expressed as a function of ${\displaystyle {\frac {\partial l}{\partial y_{i}^{(k)}}}}$:

${\displaystyle {\frac {\partial l}{\partial {\hat {x}}_{i}^{(k)}}}={\frac {\partial l}{\partial y_{i}^{(k)}}}\gamma ^{(k)}}$,

${\displaystyle {\frac {\partial l}{\partial \gamma ^{(k)}}}=\sum _{i=1}^{m}{\frac {\partial l}{\partial y_{i}^{(k)}}}{\hat {x}}_{i}^{(k)}}$, ${\displaystyle {\frac {\partial l}{\partial \beta ^{(k)}}}=\sum _{i=1}^{m}{\frac {\partial l}{\partial y_{i}^{(k)}}}}$,
${\displaystyle {\frac {\partial l}{\partial \sigma _{B}^{(k)^{2}}}}=\sum _{i=1}^{m}{\frac {\partial l}{\partial y_{i}^{(k)}}}(x_{i}^{(k)}-\mu _{B}^{(k)})\left(-{\frac {\gamma ^{(k)}}{2}}(\sigma _{B}^{(k)^{2}}+\epsilon )^{-3/2}\right)}$, ${\displaystyle {\frac {\partial l}{\partial \mu _{B}^{(k)}}}=\sum _{i=1}^{m}{\frac {\partial l}{\partial y_{i}^{(k)}}}{\frac {-\gamma ^{(k)}}{\sqrt {\sigma _{B}^{(k)^{2}}+\epsilon }}}+{\frac {\partial l}{\partial \sigma _{B}^{(k)^{2}}}}{\frac {1}{m}}\sum _{i=1}^{m}(-2)\cdot (x_{i}^{(k)}-\mu _{B}^{(k)})}$,

and ${\displaystyle {\frac {\partial l}{\partial x_{i}^{(k)}}}={\frac {\partial l}{\partial {\hat {x}}_{i}^{(k)}}}{\frac {1}{\sqrt {\sigma _{B}^{(k)^{2}}+\epsilon }}}+{\frac {\partial l}{\partial \sigma _{B}^{(k)^{2}}}}{\frac {2(x_{i}^{(k)}-\mu _{B}^{(k)})}{m}}+{\frac {\partial l}{\partial \mu _{B}^{(k)}}}{\frac {1}{m}}}$.

### Inference with Batch-Normalized Networks

During the training stage, the normalization steps depend on the mini-batches to ensure efficient and reliable training. However, in the inference stage, this dependence is not useful any more. Instead, the normalization step in this stage is computed with the population statistics such that the output could depend on the input in a deterministic manner. The population mean, ${\displaystyle E[x^{(k)}]}$, and variance, ${\displaystyle \operatorname {Var} [x^{(k)}]}$, are computed as:

${\displaystyle E[x^{(k)}]=E_{B}[\mu _{B}^{(k)}]}$, and ${\displaystyle \operatorname {Var} [x^{(k)}]={\frac {m}{m-1}}E_{B}[\sigma _{B}^{(k)^{2}}]}$.

The population statistics thus is a complete representation of the mini-batches.

The BN transform in the inference step thus becomes

${\displaystyle y^{(k)}=BN_{\gamma ^{(k)},\beta ^{(k)}}^{\text{inf}}(x^{(k)})={\frac {\gamma ^{(k)}}{\sqrt {\operatorname {Var} [x^{(k)}]+\epsilon }}}x^{(k)}+{\Bigg (}\beta ^{(k)}-{\frac {\gamma ^{(k)}E[x^{(k)}]}{\sqrt {\operatorname {Var} [x^{(k)}]+\epsilon }}}{\Bigg )}}$,

where ${\displaystyle y^{(k)}}$ is passed on to future layers instead of ${\displaystyle x^{(k)}}$. Since the parameters are fixed in this transformation, the batch normalization procedure is essentially applying a linear transform to the activation.

## Understanding Batch Normalization

Although batch normalization has become a popular method due to its strengths, the working mechanism of the method is not yet well-understood. Scholars show that internal covariate shift is not reduced significantly by batch normalization, despite of common belief.[4] Some scholars attribute the good performance to smoothing the objective function, while others propose that length-direction decoupling is the reason behind its effectiveness.[4][6]

### Batch Normalization and Internal Covariate Shift[4]

The correlation between batch normalization and internal covariate shift is widely accepted but was not supported by experimental results. Scholars recently show with experiments that the hypothesized relationship is not an accurate one. Rather, the enhanced accuracy with the batch normalization layer seems to be independent of internal covariate shift.

#### Adding Covariate Shift to Batch Normalization Layers

To understand if there is any correlation between reducing covariate shift and improving performance, an experiment is performed to elucidate the relationship. Specifically, three models are trained and compared: a standard VGG network without batch normalization, a VGG network with batch normalization layers, and a VGG network with batch normalization layers and random noise. In the third model, the noise has non-zero mean and non-unit variance, and is generated at random for each layer. It is then added after the batch normalization layers to deliberately introduce covariate shift into activation.

With these three models, two observations are made. First, the third, noisy model has less stable distributions at all layers compared with the other two models due to the extra noise layer. Despite of the noise, the training accuracy of the second and the third model are similar, and are both higher than that of the first model. While the internal covariate shifts are larger at all levels, the model with batch normalization still performs better than the standard VGG model. It could thus be concluded that internal covariate shift might not be the contributing factor of the performance of batch normalization.

#### Measuring Internal Covariate Shift with and without Batch Normalization Layers

Since it is hypothesized that batch normalization layers could reduce internal covariate shift, an experiment is set up to measure quantitatively how much covariate shift is reduced. First, the notion of internal covariate shift needs to be defined mathematically. Specifically, to quantify the adjustment that a layer's parameters make in response to updates in previous layers, the correlation between the gradients of the loss before and after all previous layers are updated is measured, since gradients could capture the shifts from the first-order training method. If the shift introduced by the changes in previous layers is small, then the correlation between the gradients would be close to 1.

The correlation between the gradients are computed for four models: a standard VGG network, a VGG network with batch normalization layers, a 25-layer deep linear network (DLN) trained with full-batch gradient descent, and a DLN network with batch normalization layers. Interestingly, it is shown that the standard VGG and DLN models both have higher correlations of gradients compared with their counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.

### Smoothness of the Optimization Landscape[4]

Some scholars proposed and proved that batch normalization could introduce greater Lipschitzness into the loss and the gradient during training, and that this improved smoothness could explain its great performance. These effects can be observed by comparing VGG networks trained with and without batch normalization, and is also consistent among other networks, such as linear deep networks. Specifically, it is observed that the loss changes less, and that the gradients of the loss have smaller magnitudes and are more Lipschitz. Moreover, the batch normalized models are compared with models with different normalization techniques. Specifically, these normalization methods work by first fixing the first order moment of activation, and then normalizing it by the average of the ${\displaystyle l_{p}}$ norm. These methods thus have larger distributional shift, but smoother landscape. Evidently, these models yield similar performance as batch normalized models. This two-way relationship could thus indicate that smoothness of the optimization landscape could be a contributing factor to the superior performance of batch normalization.

Besides analyzing this correlation experimentally, theoretical analysis is also provided for verification that batch normalization could result in a smoother landscape. Consider two identical networks, one contains batch normalization layers and the other doesn't, the behaviors of these two networks are then compared. Denote the loss functions as ${\displaystyle L}$ and ${\displaystyle {\hat {L}}}$, respectively. Let the input to both networks be ${\displaystyle x}$, and the output be ${\displaystyle y}$, for which ${\displaystyle y=Wx}$, where ${\displaystyle W}$ is the layer weights. For the second network, ${\displaystyle y}$ additionally goes through a batch normalization layer. Denote the normalized activation as ${\displaystyle {\hat {y}}}$, which has zero mean and unit variance. Let the transformed activation be ${\displaystyle z=\gamma {\hat {y}}+\beta }$, and suppose ${\displaystyle \gamma }$ and ${\displaystyle \beta }$ are constants. Finally, denote the standard deviation over a mini-batch ${\displaystyle {\hat {y_{j}}}\in \mathbb {R} ^{m}}$ as ${\displaystyle \sigma _{j}}$.

First, it can be shown that the gradient magnitude of a batch normalized network, ${\displaystyle ||\triangledown _{y_{i}}{\hat {L}}||}$, is bounded, with the bound expressed as

${\displaystyle ||\triangledown _{y_{i}}{\hat {L}}||^{2}\leq {\frac {\gamma ^{2}}{\sigma _{j}^{2}}}{\Bigg (}||\triangledown _{y_{i}}L||^{2}-{\frac {1}{m}}\langle 1,\triangledown _{y_{i}}L\rangle ^{2}-{\frac {1}{m}}\langle \triangledown _{y_{i}}L,{\hat {y}}_{j}\rangle ^{2}{\bigg )}}$.

Since the gradient magnitude represents the Lipschitzness of the loss, this relationship indicates that a batch normalized network could achieve greater Lipschitzness comparatively. Notice that the bound gets tighter when the gradient ${\displaystyle \triangledown _{y_{i}}{\hat {L}}}$ correlates with the activation ${\displaystyle {\hat {y_{i}}}}$, which is a common phenomena. The scaling of ${\displaystyle {\frac {\gamma ^{2}}{\sigma _{j}^{2}}}}$ is also significant, since the variance is often large.

Secondly, the quadratic form of the loss Hessian with respect to activation in the gradient direction can be bounded as

${\displaystyle (\triangledown _{y_{j}}{\hat {L}})^{T}{\frac {\partial {\hat {L}}}{\partial y_{j}\partial y_{j}}}(\triangledown _{y_{j}}{\hat {L}})\leq {\frac {\gamma ^{2}}{\sigma ^{2}}}{\bigg (}{\frac {\partial {\hat {L}}}{\partial y_{j}}}{\bigg )}^{T}{\bigg (}{\frac {\partial L}{\partial y_{j}\partial y_{j}}}{\bigg )}{\bigg (}{\frac {\partial {\hat {L}}}{\partial y_{j}}}{\bigg )}-{\frac {\gamma }{m\sigma ^{2}}}\langle \triangledown _{y_{j}}L,{\hat {y_{j}}}\rangle {\bigg |}{\bigg |}{\frac {\partial {\hat {L}}}{\partial y_{j}}}{\bigg |}{\bigg |}^{2}}$.

The scaling of ${\displaystyle {\frac {\gamma ^{2}}{\sigma _{j}^{2}}}}$ indicates that the loss Hessian is resilient to the mini-batch variance, whereas the second term on the right hand side suggests that it becomes smoother when the Hessian and the inner product are non-negative. If the loss is locally convex, then the Hessian is positive semi-definite, while the inner product is positive if ${\displaystyle {\hat {g_{j}}}}$ is in the direction towards the minimum of the loss. It could thus be concluded from this inequality that the gradient generally becomes more predictive with the batch normalization layer.

It then follows to translate the bounds related to the loss with respect to the normalized activation to a bound on the loss with respect to the network weights:

${\displaystyle {\hat {g_{j}}}\leq {\frac {\gamma ^{2}}{\sigma _{j}^{2}}}(g_{j}^{2}-m\mu _{g_{j}}^{2}-\lambda ^{2}\langle \triangledown _{y_{j}}L,{\hat {y}}_{j}\rangle ^{2})}$, where ${\displaystyle g_{j}=max_{||X||\leq \lambda }||\triangledown _{W}L||^{2}}$ and ${\displaystyle {\hat {g}}_{j}=max_{||X||\leq \lambda }||\triangledown _{W}{\hat {L}}||^{2}}$.

In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization with the following inequality:

${\displaystyle ||W_{0}-{\hat {W}}^{*}||^{2}\leq ||W_{0}-W^{*}||^{2}-{\frac {1}{||W^{*}||^{2}}}(||W^{*}||^{2}-\langle W^{*},W_{0}\rangle )^{2}}$, where ${\displaystyle W^{*}}$ and ${\displaystyle {\hat {W}}^{*}}$ are the local optimal weights for the two networks, respectively.

Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points. It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.[6]

### Counterintuitive Roughness of the Optimization Landscape at Initialization[5]

Even though batchnorm was originally introduced to alleviate gradient vanishing or explosion problems, a deep batchnorm network in fact suffers from gradient explosion at initialization time, no matter what it uses for nonlinearity. Thus the optimization landscape is very far from smooth for a randomly initialized, deep batchnorm network. More precisely, if the network has ${\displaystyle L}$ layers, then the gradient of the first layer weights has norm ${\displaystyle >c\lambda ^{L}}$ for some ${\displaystyle \lambda >1,c>0}$ depending only on the nonlinearity. For any fixed nonlinearity, ${\displaystyle \lambda }$ decreases as the batch size increases. For example, for ReLU, ${\displaystyle \lambda }$ decreases to ${\displaystyle \pi /(\pi -1)\approx 1.467}$ as the batch size tends to infinity. Practically, this means deep batchnorm networks are untrainable. This is only relieved by skip connections in the fashion of residual networks.

This gradient explosion on the surface contradicts the smoothness property explained in the previous section, but in fact they are consistent. The previous section studies the effect of inserting a single batchnorm in a network, while the gradient explosion depends on stacking batchnorms typical of modern deep neural networks.

### Length-Direction Decoupling[6]

It is argued that the success of batch normalization could be at least partially credited to the length-direction decoupling effect that the method provides.

By interpreting the batch normalization procedure as the reparametrization of weight space, it could be shown that the length and the direction of the weights are separated after the procedure, and they could thus be trained separately. For a particular neural network unit with input ${\displaystyle x}$ and weight vector ${\displaystyle w}$, denote its output as ${\displaystyle f(w)=E_{x}[\phi (x^{T}w)]}$, where ${\displaystyle \phi }$ is the activation function, and denote ${\displaystyle S=E[xx^{T}]}$. Assume that ${\displaystyle E[x]=0}$, and that the spectrum of the matrix ${\displaystyle S}$ is bounded as ${\displaystyle 0<\mu =\lambda _{min}(S)}$, ${\displaystyle L=\lambda _{max}(S)<\infty }$, such that ${\displaystyle S}$ is symmetric positive definite. Adding batch normalization to this unit thus results in

${\displaystyle f_{BN}(w,\gamma ,\beta )=E_{x}[\phi (BN(x^{T}w))]=E_{x}{\bigg [}\phi {\bigg (}\gamma ({\frac {x^{T}w-E_{x}[x^{T}w]}{var_{x}[x^{T}w]^{1/2}}})+\beta {\bigg )}{\bigg ]}}$, by definition.

The variance term can be simplified such that ${\displaystyle var_{x}[x^{T}w]=w^{T}Sw}$. Assume that ${\displaystyle x}$ has zero mean and ${\displaystyle \beta }$ can be omitted, then it follows that

${\displaystyle f_{BN}(w,\gamma )=E_{x}{\bigg [}\phi {\bigg (}\gamma {\frac {x^{T}w}{(w^{T}Sw)^{1/2}}}{\bigg )}{\bigg ]}}$, where ${\displaystyle (w^{T}Sw)^{\frac {1}{2}}}$ is the induced norm of ${\displaystyle S}$, ${\displaystyle ||w||_{s}}$.

Hence, it could be concluded that ${\displaystyle f_{BN}(w,\gamma )=E_{x}[\phi (x^{T}{\tilde {w}})]}$, where ${\displaystyle {\tilde {w}}=\gamma {\frac {w}{||w||_{s}}}}$, and ${\displaystyle \gamma }$ and ${\displaystyle w}$ accounts for its length and direction separately. This property could then be used to prove the faster convergence of problems with batch normalization.

#### Linear Convergence of the Least-Square Problem with Batch Normalization

With the reparametrization interpretation, it could then be proved that applying batch normalization to the ordinary least squares problem achieves a linear convergence rate in gradient descent, which is faster than the regular gradient descent with only sub-linear convergence.

Denote the objective of minimizing an ordinary least squares problem as

${\displaystyle min_{{\tilde {w}}\in R^{d}}f_{OLS}({\tilde {w}})=min_{{\tilde {w}}\in R^{d}}(E_{x,y}[(y-x^{T}{\tilde {w}})^{2}])=min_{{\tilde {w}}\in R^{d}}(2u^{T}{\tilde {w}}+{\tilde {w}}^{T}S{\tilde {w}})}$, where ${\displaystyle u=E[-yx]}$.

Since ${\displaystyle {\tilde {w}}=\gamma {\frac {w}{||w||_{s}}}}$, the objective thus becomes

${\displaystyle min_{w\in R^{d}\backslash \{0\},\gamma \in R}f_{OLS}(w,\gamma )=min_{w\in R^{d}\backslash \{0\},\gamma \in R}{\bigg (}2\gamma {\frac {u^{T}w}{||w||_{S}+\gamma ^{2}}}{\bigg )}}$, where 0 is excluded to avoid 0 in the denominator.

Since the objective is convex with respect to ${\displaystyle \gamma }$, its optimal value could be calculated by setting the partial derivative of the objective against ${\displaystyle \gamma }$ to 0. The objective could be further simplified to be

${\displaystyle min_{w\in R^{d}\backslash \{0\}}\rho (w)=min_{w\in R^{d}\backslash \{0\}}{\bigg (}-{\frac {w^{T}uu^{T}w}{w^{T}Sw}}{\bigg )}}$.

Note that this objective is a form of the generalized Rayleigh quotient

${\displaystyle {\tilde {\rho }}(w)={\frac {w^{T}Bw}{w^{T}Aw}}}$, where ${\displaystyle B\in R^{d\times d}}$ is a symmetric matrix and ${\displaystyle A\in R^{d\times d}}$ is a symmetric positive definite matrix.

It is proven that the gradient descent convergence rate of the generalized Rayleigh quotient is

${\displaystyle {\frac {\lambda _{1}-\rho (w_{t+1})}{\rho (w_{t+1}-\lambda _{2})}}\leq {\bigg (}1-{\frac {\lambda _{1}-\lambda _{2}}{\lambda _{1}-\lambda _{min}}}{\bigg )}^{2t}{\frac {\lambda _{1}-\rho (w_{t})}{\rho (w_{t})-\lambda _{2}}}}$, where ${\displaystyle \lambda _{1}}$ is the largest eigenvalue of ${\displaystyle B}$, ${\displaystyle \lambda _{2}}$ is the second largest eigenvalue of ${\displaystyle B}$, and ${\displaystyle \lambda _{min}}$ is the smallest eigenvalue of ${\displaystyle B}$.[7]

In our case, ${\displaystyle B=uu^{T}}$is a rank one matrix, and the convergence result can be simplified accordingly. Specifically, consider gradient descent steps of the form ${\displaystyle w_{t+1}=w_{t}-\eta _{t}\triangledown \rho (w_{t})}$ with step size ${\displaystyle \eta _{t}={\frac {w_{t}^{T}Sw_{t}}{2L|\rho (w_{t})|}}}$, and starting from ${\displaystyle \rho (w_{0})\neq 0}$, then

${\displaystyle \rho (w_{t})-\rho (w^{*})\leq {\bigg (}1-{\frac {\mu }{L}}{\bigg )}^{2t}(\rho (w_{0})-\rho (w^{*}))}$.

#### Linear Convergence of the Learning Halfspace Problem with Batch Normalization

The problem of learning halfspaces refers to the training of the Perceptron, which is the simplest form of neural network. The optimization problem in this case is

${\displaystyle min_{{\tilde {w}}\in R^{d}}f_{LH}({\tilde {w}})=E_{y,x}[\phi (z^{T}{\tilde {w}})]}$, where ${\displaystyle z=-yx}$ and ${\displaystyle \phi }$ is an arbitrary loss function.

Suppose that ${\displaystyle \phi }$ is infinitely differentiable and has a bounded derivative. Assume that the objective function ${\displaystyle f_{LH}}$ is ${\displaystyle \zeta }$-smooth, and that a solution ${\displaystyle \alpha ^{*}=argmin_{\alpha }||\triangledown f(\alpha w)||^{2}}$ exists and is bounded such that ${\displaystyle -\infty <\alpha ^{*}<\infty }$. Also assume ${\displaystyle z}$ is a multivariate normal random variable. With the Gaussian assumption, it can be shown that all critical points lie on the same line, for any choice of loss function ${\displaystyle \phi }$. Specifically, the gradient of ${\displaystyle f_{LH}}$ could be represented as

${\displaystyle \triangledown _{\tilde {w}}f_{LH}({\tilde {w}})=c_{1}({\tilde {w}})u+c_{2}({\tilde {w}})S{\tilde {w}}}$, where ${\displaystyle c_{1}({\tilde {w}})=E_{z}[\phi ^{(1)}(z^{T}{\tilde {w}})]-E_{z}[\phi ^{(2)}(z^{T}{\tilde {w}})](u^{T}{\tilde {w}})}$, ${\displaystyle c_{2}({\tilde {w}})=E_{z}[\phi ^{(2)}(z^{T}{\tilde {w}})]}$, and ${\displaystyle \phi ^{(i)}}$ is the ${\displaystyle i}$-th derivative of ${\displaystyle \phi }$.

By setting the gradient to 0, it thus follows that the bounded critical points ${\displaystyle {\tilde {w}}_{*}}$ can be expressed as ${\displaystyle {\tilde {w}}_{*}=g_{*}S^{-1}u}$, where ${\displaystyle g_{*}}$ depends on ${\displaystyle {\tilde {w}}_{*}}$ and ${\displaystyle \phi }$. Combining this global property with length-direction decoupling, it could thus be proved that this optimization problem converges linearly.

First, a variation of gradient descent with batch normalization, Gradient Descent in Normalized Parameterization (GDNP), is designed for the objective function ${\displaystyle min_{w\in R^{d}\backslash \{0\},\gamma \in R}f_{LH}(w,\gamma )}$, such that the direction and length of the weights are updated separately. Denote the stopping criterion of GDNP as

${\displaystyle h(w_{t},\gamma _{t})=E_{z}[\phi '(z^{T}{\tilde {w}}_{t})](u^{T}w_{t})-E_{z}[\phi ''(z^{T}{\tilde {w}}_{t})](u^{T}w_{t})^{2}}$.

Let the step size be

${\displaystyle s_{t}=s(w_{t},\gamma _{t})=-{\frac {||w_{t}||_{S}^{3}}{Lg_{t}h(w_{t},\gamma _{t})}}}$.

For each step, if ${\displaystyle h(w_{t},\gamma _{t})\neq 0}$, then update the direction as

${\displaystyle w_{t+1}=w_{t}-s_{t}\triangledown _{w}f(w_{t},\gamma _{t})}$.

Then update the length according to

${\displaystyle \gamma _{t}=Bisection(T_{s},f,w_{t})}$, where ${\displaystyle Bisection()}$ is the classical bisection algorithm, and ${\displaystyle T_{s}}$ is the total iterations ran in the bisection step.

Denote the total number of iterations as ${\displaystyle T_{d}}$, then the final output of GDNP is

${\displaystyle {\tilde {w}}_{T_{d}}=\gamma _{T_{d}}{\frac {w_{T_{d}}}{||w_{T_{d}}||_{S}}}}$.

The GDNP algorithm thus slightly modifies the batch normalization step for the ease of mathematical analysis.

It can be shown that in GDNP, the partial derivative of ${\displaystyle f_{LH}}$against the length component converges to zero at a linear rate, such that

${\displaystyle (\partial _{\gamma }f_{LH}(w_{t},a_{t}^{(T_{s})})^{2}\leq {\frac {2^{-T_{s}}\zeta |b_{t}^{(0)}-a_{t}^{(0)}|}{\mu ^{2}}}}$, where ${\displaystyle a_{t}^{(0)}}$ and ${\displaystyle b_{t}^{0}}$ are the two starting points of the bisection algorithm on the left and on the right, correspondingly.

Further, for each iteration, the norm of the gradient of ${\displaystyle f_{LH}}$ with respect to ${\displaystyle w}$ converges linearly, such that

${\displaystyle ||w_{t}||_{S}^{2}||\triangledown f_{LH}(w_{t},g_{t})||_{S^{-1}}^{2}\leq {\bigg (}1-{\frac {\mu }{L}}{\bigg )}^{2t}\Phi ^{2}\gamma _{t}^{2}(\rho (w_{0})-\rho ^{*})}$.

Combining these two inequalities, a bound could thus be obtained for the gradient with respect to ${\displaystyle {\tilde {w}}_{T_{d}}}$:

${\displaystyle ||\triangledown _{\tilde {w}}f({\tilde {w}}_{T_{d}})||^{2}\leq {\bigg (}1-{\frac {\mu }{L}}{\bigg )}^{2T_{d}}\Phi ^{2}(\rho (w_{0})-\rho ^{*})+{\frac {2^{-T_{s}}\zeta |b_{t}^{(0)}-a_{t}^{(0)}|}{\mu ^{2}}}}$, such that the algorithm is guaranteed to converge linearly.

Although the proof stands on the assumption of Gaussian input, it is also shown in experiments that GDNP could accelerate optimization without this constraint.

#### Linear Convergence of Neural Networks with Batch Normalization

Consider a multilayer perceptron (MLP) with one hidden layer and ${\displaystyle m}$ hidden units with mapping from input ${\displaystyle x\in R^{d}}$ to a scalar output described as

${\displaystyle F_{x}({\tilde {W}},\Theta )=\sum _{i=1}^{m}\theta _{i}\phi (x^{T}{\tilde {w}}^{(i)})}$, where ${\displaystyle {\tilde {w}}^{(i)}}$ and ${\displaystyle \theta _{i}}$ are the input and output weights of unit ${\displaystyle i}$ correspondingly, and ${\displaystyle \phi }$ is the activation function and is assumed to be a tanh function.

The input and output weights could then be optimized with

${\displaystyle min_{{\tilde {W}},\Theta }(f_{NN}({\tilde {W}},\Theta )=E_{y,x}[l(-yF_{x}({\tilde {W}},\Theta ))])}$, where ${\displaystyle l}$ is a loss function, ${\displaystyle {\tilde {W}}=\{{\tilde {w}}^{(1)},...,{\tilde {w}}^{(m)}\}}$, and ${\displaystyle \Theta =\{\theta ^{(1)},...,\theta ^{(m)}\}}$.

Consider fixed ${\displaystyle \Theta }$ and optimizing only ${\displaystyle {\tilde {W}}}$, it can be shown that the critical points of ${\displaystyle f_{NN}({\tilde {W}})}$ of a particular hidden unit ${\displaystyle i}$, ${\displaystyle {\hat {w}}^{(i)}}$, all align along one line depending on incoming information into the hidden layer, such that

${\displaystyle {\hat {w}}^{(i)}={\hat {c}}^{(i)}S^{-1}u}$, where ${\displaystyle {\hat {c}}^{(i)}\in R}$ is a scalar, ${\displaystyle i=1,...,m}$.

This result could be proved by setting the gradient of ${\displaystyle f_{NN}}$ to zero and solving the system of equations.

Apply the GDNP algorithm to this optimization problem by alternating optimization over the different hidden units. Specifically, for each hidden unit, run GDNP to find the optimal ${\displaystyle W}$ and ${\displaystyle \gamma }$. With the same choice of stopping criterion and stepsize, it follows that

${\displaystyle ||\triangledown _{{\tilde {w}}^{(i)}}f({\tilde {w}}_{t}^{(i)})||_{S^{-1}}^{2}\leq {\bigg (}1-{\frac {\mu }{L}}{\bigg )}^{2t}C(\rho (w_{0})-\rho ^{*})+{\frac {2^{-T_{s}^{(i)}}\zeta |b_{t}^{(0)}-a_{t}^{(0)}|}{\mu ^{2}}}}$.

Since the parameters of each hidden unit converge linearly, the whole optimization problem has a linear rate of convergence.

## References

1. ^ "Glossary of Deep Learning: Batch Normalisation". medium.com. 2017-06-27. Retrieved 24 April 2018.
2. ^ "Batch normalization in Neural Networks". towardsdatascience.com. 2017-10-20. Retrieved 24 April 2018.
3. ^ a b c d Ioffe, Sergey; Szegedy, Christian (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". arXiv:1502.03167 [cs.LG].
4. Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018-05-29). "How Does Batch Normalization Help Optimization?". arXiv:1805.11604 [stat.ML].
5. ^ a b Yang, Greg; Pennington, Jeffrey; Rao, Vinay; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2019). "A Mean Field Theory of Batch Normalization". arXiv:1902.08129 [cs.NE].
6. ^ a b c d Kohler, Jonas; Daneshmand, Hadi; Lucchi, Aurelien; Zhou, Ming; Neymeyr, Klaus; Hofmann, Thomas (2018-05-27). "Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization". arXiv:1805.10694 [stat.ML].
7. ^ Knyazev, Neymeyr, A.V., K. (2003). "A geometric theory for preconditioned inverse iteration III: A short and sharp convergence estimate for generalized eigenvalue problems". Linear Algebra and Its Applications. 358 (1–3): 95–114. doi:10.1016/S0024-3795(01)00461-X.
• Ioffe, Sergey; Szegedy, Christian (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", ICML'15: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, July 2015 Pages 448–456