User:AI456/sandbox
Derivation
Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is (the \tfrac{1}{2} term is added to cancel the exponent when differentiating):

    E = \tfrac{1}{2}(t - y)^2

where
    E = the squared error
    t = target output
    y = actual output of the output neuron[note 1]
Therefore the error E depends on the output y. However, the output depends on the weighted sum of all its inputs:

    y = \sum_{i=1}^{n} w_i x_i

where
    n = the number of input units to the neuron
    w_i = the i-th weight
    x_i = the i-th input value to the neuron
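As a quick sketch of the weighted sum (the weight and input values below are made up for illustration):

```python
# Weighted sum of a neuron's inputs: sum over i of w_i * x_i.
# The weights and inputs are illustrative values, not taken from the text.
w = [0.5, -0.25, 1.0]  # w_i: the i-th weight
x = [2.0, 4.0, 0.5]    # x_i: the i-th input value

net = sum(w_i * x_i for w_i, x_i in zip(w, x))
print(net)  # 0.5*2.0 + (-0.25)*4.0 + 1.0*0.5 = 0.5
```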
The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of the inputs). In general, a non-linear, differentiable activation function, \varphi, is used. Thus, more correctly:

    y = \varphi(net) = \varphi\!\left(\sum_{i=1}^{n} w_i x_i\right)
This lays the groundwork for calculating the partial derivative of the error with respect to a weight w_i using the chain rule:

    \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial net} \cdot \frac{\partial net}{\partial w_i}

where
    \frac{\partial E}{\partial w_i} = how the error changes when the weights are changed
    \frac{\partial E}{\partial y} = how the error changes when the output is changed
    \frac{\partial y}{\partial net} = how the output changes when the weighted sum changes
    \frac{\partial net}{\partial w_i} = how the weighted sum changes as the weights change
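The chain-rule factorization can be checked numerically with a finite difference. The sketch below assumes the logistic activation introduced later in this section; all weight, input, and target values are made up for illustration:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.4, -0.6]  # illustrative weights
x = [1.0, 2.0]   # illustrative inputs
t = 1.0          # illustrative target output

def error(weights):
    # E = 1/2 * (t - y)^2 with y = logistic(net)
    net = sum(wi * xi for wi, xi in zip(weights, x))
    return 0.5 * (t - logistic(net)) ** 2

# Analytic chain-rule product for w_0: (dE/dy) * (dy/dnet) * (dnet/dw_0)
net = sum(wi * xi for wi, xi in zip(w, x))
y = logistic(net)
analytic = (y - t) * y * (1.0 - y) * x[0]

# Central finite difference on w_0
h = 1e-6
numeric = (error([w[0] + h, w[1]]) - error([w[0] - h, w[1]])) / (2 * h)

print(abs(analytic - numeric))  # should be very small
```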
Since the weighted sum net is just the sum over all products w_i x_i, the partial derivative of the sum with respect to a weight w_i is just the corresponding input x_i. Similarly, the partial derivative of the sum with respect to an input value x_i is just the weight w_i:

    \frac{\partial net}{\partial w_i} = x_i, \qquad \frac{\partial net}{\partial x_i} = w_i
The derivative of the output y with respect to the weighted sum net is simply the derivative of the activation function \varphi:

    \frac{\partial y}{\partial net} = \varphi'(net)
This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:

    \varphi(z) = \frac{1}{1 + e^{-z}}
which has a nice derivative of:

    \varphi'(z) = \varphi(z)\,(1 - \varphi(z))
For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output with respect to the weighted sum is the same as the derivative of the logistic function:

    \frac{\partial y}{\partial net} = y\,(1 - y)
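The identity \varphi'(z) = \varphi(z)(1 - \varphi(z)) can be verified numerically (the test point z is arbitrary):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Compare a central finite difference against the closed-form derivative.
z = 0.3  # arbitrary test point
h = 1e-6
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)
identity = logistic(z) * (1.0 - logistic(z))
print(abs(numeric - identity) < 1e-8)  # True
```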
Finally, the derivative of the error with respect to the output y is:

    \frac{\partial E}{\partial y} = \frac{\partial}{\partial y}\,\tfrac{1}{2}(t - y)^2 = y - t
Putting it all together:

    \frac{\partial E}{\partial w_i} = (y - t)\, y\,(1 - y)\, x_i
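The combined gradient formula can be sketched directly in code (the weight, input, and target values are made up for illustration):

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.4, -0.6]  # illustrative weights
x = [1.0, 2.0]   # illustrative inputs
t = 1.0          # illustrative target output

net = sum(wi * xi for wi, xi in zip(w, x))
y = logistic(net)

# dE/dw_i = (y - t) * y * (1 - y) * x_i for each weight
grad = [(y - t) * y * (1.0 - y) * xi for xi in x]
print(grad)
```

Since y < t here, every component of the gradient is negative, so gradient descent (which steps against the gradient) increases the weighted sum, pushing y toward the target.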
If one were to use a different activation function, the only difference would be that the y(1 - y) term is replaced by the derivative of the newly chosen activation function.
To update the weight w_i using gradient descent, one must choose a learning rate, \alpha. The change in weight after learning would then be the product of the learning rate and the gradient, multiplied by -1 so that the weight moves in the direction that decreases the error:

    \Delta w_i = -\alpha \frac{\partial E}{\partial w_i} = \alpha\,(t - y)\, y\,(1 - y)\, x_i
For a linear neuron, the derivative of the activation function is 1, which yields:

    \Delta w_i = \alpha\,(t - y)\, x_i
This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In backpropagation as in perceptron learning, when the output y matches the desired output t, the change in weight \Delta w_i is zero, which is exactly what is desired.
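The delta-rule update above can be sketched as a small training loop for a linear neuron (the learning rate, initial weights, and tiny data set are illustrative; the data is realizable by y = x_0 - x_1, so the loop converges to those weights):

```python
# Delta rule: w_i <- w_i + alpha * (t - y) * x_i for a linear neuron
# whose output is y = sum over i of w_i * x_i.
alpha = 0.1
w = [0.0, 0.0]

# Tiny illustrative training set consistent with target weights [1, -1].
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]

for _ in range(100):
    for x, t in data:
        y = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x)]

print(w)  # approaches [1.0, -1.0]
```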
Limitations and Improvements
The result may converge to a local minimum
The "hill climbing" strategy of gradient descent is guaranteed to work only if there is a single minimum. However, the error surface often has many local minima and maxima. If the starting point of the gradient descent happens to be somewhere between a local maximum and a local minimum, then going down the direction with the most negative gradient will lead to that local minimum rather than the global one.
Learning is slow when error surface is elongated
Solution: Scale the inputs to have zero mean over the training set
Consider the following training examples:

    (101, 101) -> 2
    (101, 99) -> 0
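The suggested rescaling can be sketched as follows, centering each input feature to zero mean over the two training examples above:

```python
# Inputs from the example above; centering subtracts the per-feature mean
# over the training set so each input feature has zero mean.
X = [[101.0, 101.0], [101.0, 99.0]]
targets = [2.0, 0.0]

n = len(X)
means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
X_centered = [[xij - means[j] for j, xij in enumerate(row)] for row in X]

print(means)       # [101.0, 100.0]
print(X_centered)  # [[0.0, 1.0], [0.0, -1.0]]
```

After centering, the examples differ only in the second feature, so the error surface is no longer stretched along a direction set by the large common offset of 101.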
Notes
- ^ There can be multiple output neurons; however, backpropagation treats each in isolation when calculating the gradient. Therefore, in the rest of the derivation, only one output neuron is considered.