# Dilution (neural networks)


Dilution and dropout (also called DropConnect) are regularization techniques for reducing overfitting in artificial neural networks. By preventing complex co-adaptations on the training data, they provide an efficient way of performing model averaging with neural networks. The term dilution refers to the thinning of the weights. The term dropout refers to randomly "dropping out", or omitting, units (both hidden and visible) during the training process of a neural network. Both the thinning of weights and the dropping out of units trigger the same type of regularization, and the term dropout is often used even when it is weights, rather than units, that are removed.

## Types and uses

Dropout and the stochastic delta rule (SDR) are generally used to add damping noise to a network's connections or nodes.

These techniques are also sometimes referred to as random pruning of weights, but pruning is usually a non-recurring, one-way operation. In SDR, by contrast, each weight is represented as a probability distribution with a mean and a standard deviation. Both parameters are modified through gradient descent, with the standard deviations collapsing toward zero through simulated annealing; the network therefore converges to a single network containing only the mean values, with all variances at zero. Dropout is a special case of this type of search and regularization. The output from a layer of linear nodes in an artificial neural net can be described as

$y_{i}=\sum _{j}w_{ij}x_{j}$ (1)

• $y_{i}$ – output from node $i$
• $w_{ij}$ – real weight before dilution, also called the Hebb connection strength
• $x_{j}$ – input from node $j$

This can be written in vector notation as

$\mathbf {y} =\mathbf {W} \mathbf {x}$ (2)

• $\mathbf {y}$ – output vector
• $\mathbf {W}$ – weight matrix
• $\mathbf {x}$ – input vector

Equations (1) and (2) are used in the subsequent sections.
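Equation (2) is a single matrix–vector product; a minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

def linear_layer(W, x):
    """Equation (2): y = W x for a layer of linear nodes.

    Each entry y[i] = sum_j W[i, j] * x[j], matching equation (1).
    """
    return W @ x

W = np.array([[1.0, 2.0],
              [0.5, -1.0]])   # weight matrix (Hebb connection strengths)
x = np.array([3.0, 1.0])      # input vector

y = linear_layer(W, x)        # → [5.0, 0.5]
```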

## Stochastic Delta Rule

In SDR, each weight is modeled as a probability distribution with a mean and a standard deviation.

During learning, both the mean and the standard deviation are updated using the partial derivative of the error with respect to each parameter. The algorithm thus injects weight noise that is adaptive: as the standard deviations decay toward zero over learning, the network converges to a single network of mean values. Depending on the prediction error, this adaptively removes weight connections while injecting adaptive noise into the network.
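The update described above can be sketched as follows, assuming Gaussian weight distributions and a simple multiplicative annealing factor; the hyperparameter names and the exact update forms here are illustrative, not taken from the original SDR paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each weight is a distribution: a mean mu and a standard deviation sigma.
mu = rng.normal(0.0, 0.1, size=(3, 4))
sigma = np.full((3, 4), 0.05)

def sdr_forward(x):
    """Sample a concrete weight matrix from the per-weight Gaussians,
    then apply equation (2) with the sampled weights."""
    W = mu + sigma * rng.standard_normal(mu.shape)
    return W @ x, W

def sdr_update(grad_W, lr=0.01, anneal=0.99):
    """Update mean and standard deviation from the weight gradient,
    then shrink sigma so the network anneals toward its mean weights."""
    global mu, sigma
    mu -= lr * grad_W              # usual gradient step on the means
    sigma += lr * np.abs(grad_W)   # more noise where the error gradient is large
    sigma *= anneal                # simulated-annealing decay toward zero
```

As `sigma` collapses to zero, `sdr_forward` reduces to the deterministic layer of equation (2) with weights `mu`, which is the single converged network described above.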

## Dropout

Dropout is a special case of this weight-noise scheme: instead of perturbing individual weights, it removes a whole row of the weight matrix in equation (2), not just a single random weight, which is equivalent to omitting an entire node.
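Removing a row of $\mathbf{W}$ zeroes the corresponding output node. A common "inverted dropout" sketch of this is below; the rescaling by `1/keep_prob` is standard practice so the expected output is unchanged, and is an assumption here rather than part of the equations above:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_layer(W, x, keep_prob=0.8, training=True):
    """Dropout on the output nodes: each entry of W x is kept with
    probability keep_prob and zeroed otherwise, equivalent to removing
    that row of W. Rescaling by 1/keep_prob preserves the expectation."""
    y = W @ x
    if not training:
        return y                              # no noise at test time
    mask = rng.random(y.shape) < keep_prob    # Bernoulli keep-mask per node
    return (mask * y) / keep_prob

W = np.array([[1.0, 2.0],
              [0.5, -1.0]])
x = np.array([3.0, 1.0])
y_train = dropout_layer(W, x)                 # some nodes randomly zeroed
y_eval = dropout_layer(W, x, training=False)  # plain W @ x
```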

See https://direct.mit.edu/neco/article/32/5/1018/95589/The-Stochastic-Delta-Rule-Faster-and-More-Accurate for a proof that dropout is a special case of SDR.