Backpropagation through time

From Wikipedia, the free encyclopedia

Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.[1][2][3]

Algorithm

Figure: BPTT unfolds a recurrent neural network through time.

The training data for BPTT should be an ordered sequence of input-output pairs, ⟨a_0, y_0⟩, ⟨a_1, y_1⟩, …, ⟨a_{n-1}, y_{n-1}⟩. An initial value must be specified for the context vector x_0. Typically, a vector of all zeros is used for this purpose.
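
To make the data layout concrete, here is a minimal sketch in Python/NumPy; the dimensions and random values are purely illustrative assumptions, not part of the article.

    import numpy as np

    # A toy ordered training sequence of input-output pairs (a[t], y[t]).
    # Hypothetical sizes: n time-steps, 3-dimensional inputs, 2-dimensional outputs.
    n, input_dim, output_dim, state_dim = 10, 3, 2, 4
    a = np.random.randn(n, input_dim)    # inputs a[0], ..., a[n-1]
    y = np.random.randn(n, output_dim)   # target outputs y[0], ..., y[n-1]

    # The initial context x_0 is typically the zero vector.
    x0 = np.zeros(state_dim)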

BPTT begins by unfolding a recurrent neural network through time as shown in this figure. This recurrent neural network contains two feed-forward neural networks, f and g. When the network is unfolded through time, the unfolded network contains k instances of f and one instance of g. In the example shown, the network has been unfolded to a depth of k=3.
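
Continuing the sketch above, the unfolded forward pass can be written as follows (assuming, for illustration only, that f is a simple tanh recurrence and g is a linear readout; the article does not fix these forms):

    def f(x, a_t, Wx, Wa):
        # One instance of the recurrent network f: next context from the current context and input.
        return np.tanh(Wx @ x + Wa @ a_t)

    def g(x, Wg):
        # The output network g: a prediction read out of the final context.
        return Wg @ x

    def forward_unfolded(x, a_window, Wx, Wa, Wg):
        # k instances of f (one per input in the window of length k), then one instance of g.
        for a_t in a_window:
            x = f(x, a_t, Wx, Wa)
        return g(x, Wg)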

Training then proceeds in a manner similar to training a feed-forward neural network with backpropagation, except that the training patterns are visited in sequential order. Each training pattern consists of ⟨x_t, a_t, a_{t+1}, …, a_{t+k-1}, y_{t+k}⟩. (All k time-steps of input are needed because the unfolded network contains an input at each unfolded level.) After a pattern is presented for training, the weight updates computed in each instance of f (f_1, f_2, …, f_k) are summed together, then applied to all instances of f, so that every instance keeps identical weights. The zero vector is typically used to initialize x_0.
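
Written out, if w is a shared weight of f and Δw^{(i)} denotes the change computed for it in the i-th unfolded instance, the update that is actually applied is the sum over the k instances:

    \Delta w = \sum_{i=1}^{k} \Delta w^{(i)}, \qquad w \leftarrow w + \Delta w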

Pseudo-code

Pseudo-code for BPTT:

Back_Propagation_Through_Time(a, y)   // a[t] is the input at time t; y[t] is the target output
    Unfold the network to contain k instances of f
    do until stopping criterion is met:
        x = the zero vector;          // x is the current context
        for t from 0 to n - k         // t is time. n is the length of the training sequence
            Set the network inputs to x, a[t], a[t+1], ..., a[t+k-1]
            p = forward-propagate the inputs over the whole unfolded network
            e = y[t+k] - p;           // error = target - prediction
            Back-propagate the error, e, back across the whole unfolded network
            Sum the weight changes in the k instances of f together.
            Update all the weights in f and g.
            x = f(x, a[t]);           // compute the context for the next time-step
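
The following is one possible concrete rendering of this pseudo-code in Python/NumPy, continuing the sketch above. The tanh cell, linear readout g, squared-error loss, fixed learning rate, and epoch-based stopping criterion are all illustrative assumptions; only the overall BPTT structure comes from the pseudo-code.

    def bptt_train(a, y, k=3, state_dim=4, lr=0.01, epochs=100, seed=0):
        rng = np.random.default_rng(seed)
        n, input_dim = a.shape
        output_dim = y.shape[1]
        # Shared weights of f (used by all k unfolded instances) and of g.
        Wx = rng.normal(scale=0.1, size=(state_dim, state_dim))
        Wa = rng.normal(scale=0.1, size=(state_dim, input_dim))
        Wg = rng.normal(scale=0.1, size=(output_dim, state_dim))

        for _ in range(epochs):                      # stand-in for "until stopping criterion is met"
            x = np.zeros(state_dim)                  # x is the current context
            for t in range(n - k):                   # t is time; n is the length of the sequence
                # Forward-propagate over the whole unfolded network: k instances of f, then g.
                xs = [x]
                for i in range(k):
                    xs.append(np.tanh(Wx @ xs[-1] + Wa @ a[t + i]))
                p = Wg @ xs[-1]                      # prediction
                e = y[t + k] - p                     # error = target - prediction

                # Back-propagate the error across the unfolded network, summing the
                # weight changes from the k instances of f into dWx and dWa.
                dWg = np.outer(e, xs[-1])
                dWx = np.zeros_like(Wx)
                dWa = np.zeros_like(Wa)
                dx = Wg.T @ e                        # error signal at the final context
                for i in reversed(range(k)):
                    dz = dx * (1.0 - xs[i + 1] ** 2) # back through the tanh nonlinearity
                    dWx += np.outer(dz, xs[i])
                    dWa += np.outer(dz, a[t + i])
                    dx = Wx.T @ dz
                # Update all the weights in f and g.
                Wx += lr * dWx
                Wa += lr * dWa
                Wg += lr * dWg

                x = np.tanh(Wx @ x + Wa @ a[t])      # context for the next time-step
        return Wx, Wa, Wg

With the toy data from the earlier sketch, a call such as bptt_train(a, y, k=3, state_dim=4) returns the trained weights.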

Advantages

BPTT tends to be significantly faster for training recurrent neural networks than general-purpose optimization techniques such as evolutionary optimization.[4]

Disadvantages

BPTT has difficulty with local optima. With recurrent neural networks, local optima are a much more significant problem than they are with feed-forward neural networks.[5] The recurrent feedback in such networks tends to create chaotic responses in the error surface, causing local optima to occur frequently and in very poor locations on the error surface.

References

  1. ^ Mozer, M. C. (1995). "A Focused Backpropagation Algorithm for Temporal Pattern Recognition". In Chauvin, Y.; Rumelhart, D. (eds.). Hillsdale, NJ: Lawrence Erlbaum Associates. pp. 137–169.
  2. ^ Robinson, A. J.; Fallside, F. (1987). The utility driven dynamic error propagation network (Technical report). Cambridge University, Engineering Department. CUED/F-INFENG/TR.1.
  3. ^ Werbos, Paul J. (1988). "Generalization of backpropagation with application to a recurrent gas market model". Neural Networks. 1 (4): 339–356. doi:10.1016/0893-6080(88)90007-X.
  4. ^ Sjöberg, Jonas; Zhang, Qinghua; Ljung, Lennart; Benveniste, Albert; Deylon, Bernard; Glorennec, Pierre-Yves; Hjalmarsson, Hakan; Juditsky, Anatoli (1995). "Nonlinear Black-Box Modeling in System Identification: a Unified Overview". Automatica. 31: 1691–1724. doi:10.1016/0005-1098(95)00120-8.
  5. ^ Cuéllar, M. P.; Delgado, M.; Pegalajar, M. C. (2006). "An Application of Non-linear Programming to Train Recurrent Neural Networks in Time Series Prediction Problems". Enterprise Information Systems VII. Springer Netherlands: 95–102. doi:10.1007/978-1-4020-5347-4_11. ISBN 978-1-4020-5323-8.