Backpropagation through time

"BPTT" redirects here. For the running events originally known as the Bushy Park Time Trial, see parkrun.

Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks, such as Elman networks. The algorithm was independently derived by numerous researchers.[1][2][3]

Algorithm

The training data for BPTT should be an ordered sequence of input-output pairs, $\langle \mathbf{a}_0, \mathbf{y}_0\rangle, \langle \mathbf{a}_1, \mathbf{y}_1\rangle, \langle \mathbf{a}_2, \mathbf{y}_2\rangle, \ldots, \langle \mathbf{a}_{n-1}, \mathbf{y}_{n-1}\rangle$, where n is the length of the training sequence. An initial value must be specified for the hidden state $\mathbf{x}_0$; typically, a vector of all zeros is used for this purpose.
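For concreteness, a minimal sketch of this setup in Python with NumPy (the dimensions and the random data are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    input_dim, state_dim, output_dim = 4, 8, 2
    n = 10  # length of the training sequence

    # An ordered sequence of n input-output pairs <a_t, y_t>.
    a = [rng.normal(size=input_dim) for _ in range(n)]
    y = [rng.normal(size=output_dim) for _ in range(n)]

    # The initial hidden state x_0 is typically the zero vector.
    x0 = np.zeros(state_dim)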

BPTT begins by unfolding the recurrent neural network through time. The recurrent network contains two feed-forward neural networks: f, which maps the current state and input to the next state, and g, which maps the final state to an output. When the network is unfolded through time, the unfolded network contains k instances of f and one instance of g. For example, unfolding to a depth of k = 3 chains three copies of f, one per time-step, followed by a single copy of g.
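To make the unfolding concrete, the sketch below fixes simple forms for f and g (a tanh state transition and a linear read-out; these particular choices are assumptions for illustration, not part of BPTT itself) and chains k copies of f followed by a single g:

    import numpy as np

    def f(x, a, W_x, W_a, b):
        # The state-transition network f: next context from (context, input).
        return np.tanh(W_x @ x + W_a @ a + b)

    def g(x, W_o, b_o):
        # The output network g, applied once to the final unfolded state.
        return W_o @ x + b_o

    def unfolded_forward(x0, inputs, W_x, W_a, b, W_o, b_o):
        # One instance of f per time-step; all instances share the same weights.
        x = x0
        for a_t in inputs:       # len(inputs) == k, e.g. k = 3 as in the example
            x = f(x, a_t, W_x, W_a, b)
        return g(x, W_o, b_o)    # a single instance of g produces the prediction

With inputs = [a[t], a[t+1], a[t+2]], this reproduces the depth-3 unfolding described above.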

Training then proceeds in a manner similar to training a feed-forward neural network with backpropagation, except that the training patterns are visited in sequential order. Each training pattern consists of $\langle \mathbf{x}_t, \mathbf{a}_t, \mathbf{a}_{t+1}, \mathbf{a}_{t+2}, \ldots, \mathbf{a}_{t+k-1}, \mathbf{y}_{t+k}\rangle$. (All of the inputs for k time-steps are needed because the unfolded network consumes one input at each unfolded level.) After a pattern is presented for training, the weight updates computed in each instance of f ($f_1, f_2, \ldots, f_k$) are summed together and then applied to all instances of f, so that every copy continues to share the same weights.

Pseudo-code

Pseudo-code for BPTT:

Back_Propagation_Through_Time(a, y)   // a[t] is the input at time t. y[t] is the output
    Unfold the network to contain k instances of f
    do until stopping criterion is met:
        x = the zero-magnitude vector   // x is the current context
        for t from 0 to n - k:          // t is time. n is the length of the training sequence
            Set the network inputs to x, a[t], a[t+1], ..., a[t+k-1]
            p = forward-propagate the inputs over the whole unfolded network
            e = y[t+k] - p              // error = target - prediction
            Back-propagate the error, e, back across the whole unfolded network
            Sum the weight changes in the k instances of f together
            Update all the weights in f and g
            x = f(x, a[t])              // compute the context for the next time-step
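
A runnable counterpart to the pseudo-code, continuing the illustrative assumptions above (tanh transition f, linear output g, squared error, plain gradient descent; the function name, shapes, and hyper-parameters are hypothetical, not part of the original algorithm statement):

    import numpy as np

    def bptt_train(a, y, k, state_dim, lr=0.01, epochs=100, seed=0):
        n, input_dim = len(a), a[0].shape[0]
        output_dim = y[0].shape[0]
        rng = np.random.default_rng(seed)
        W_x = rng.normal(scale=0.1, size=(state_dim, state_dim))   # f: state -> state
        W_a = rng.normal(scale=0.1, size=(state_dim, input_dim))   # f: input -> state
        b = np.zeros(state_dim)
        W_o = rng.normal(scale=0.1, size=(output_dim, state_dim))  # g: state -> output
        b_o = np.zeros(output_dim)

        for _ in range(epochs):                  # "until stopping criterion is met"
            x = np.zeros(state_dim)              # x is the current context
            for t in range(n - k):
                # Forward-propagate over the unfolded network: k instances of f, one g.
                states = [x]
                for i in range(k):
                    states.append(np.tanh(W_x @ states[-1] + W_a @ a[t + i] + b))
                p = W_o @ states[-1] + b_o
                e = y[t + k] - p                 # error = target - prediction

                # Back-propagate the error across the whole unfolded network.
                dp = -e                          # gradient of 0.5*||e||^2 w.r.t. p
                dW_o, db_o = np.outer(dp, states[-1]), dp
                dW_x = np.zeros_like(W_x)
                dW_a = np.zeros_like(W_a)
                db = np.zeros_like(b)
                dx = W_o.T @ dp
                for i in reversed(range(k)):
                    ds = dx * (1.0 - states[i + 1] ** 2)   # back through tanh
                    dW_x += np.outer(ds, states[i])        # summing the changes over
                    dW_a += np.outer(ds, a[t + i])         # the k instances of f
                    db += ds
                    dx = W_x.T @ ds

                # Update all the weights in f and g.
                W_x -= lr * dW_x; W_a -= lr * dW_a; b -= lr * db
                W_o -= lr * dW_o; b_o -= lr * db_o

                x = np.tanh(W_x @ x + W_a @ a[t] + b)      # context for next time-step
        return W_x, W_a, b, W_o, b_o

The accumulation of dW_x, dW_a, and db inside the backward loop is exactly the "sum the weight changes in the k instances of f" step: because the instances share weights, their gradient contributions add before a single update is applied.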


BPTT tends to be significantly faster for training recurrent neural networks than general-purpose optimization techniques such as evolutionary optimization.[4]