Bidirectional long short-term memory is an artificial neural network architecture used for deep learning.



Artificial intelligence has been an object of study since the origin of computing, but it was during the second half of the 20th century that processing power became more easily accessible and computer-based research became commonplace. The term "machine learning", used as early as 1959 by IBM researcher Arthur Samuel,[1] currently encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence). The first neural network, the perceptron, was introduced in 1957 by Frank Rosenblatt.[2] This machine attempted to recognize pictures and classify them into categories. It consisted of a network of "input neurons" and "output neurons"; each input neuron was connected to every output neuron, with "weights" (set with potentiometers) determining the strength of each connection's influence on output.[3] The architecture of Rosenblatt's perceptron is what would now be referred to as a fully connected single-layer feed-forward neural network (FFNN). Since then, many innovations have occurred, the most significant being the development of deep learning models in which one or more "layers" of neurons exist between the input and output.[4][5]
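The fully connected single-layer arrangement described above can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of Rosenblatt's hardware: the function name, the weight values, the bias terms and the step activation are all assumptions chosen for the example.

```python
def perceptron_forward(inputs, weights, biases):
    """Fully connected single layer: every input feeds every output unit."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        # Weighted sum of all inputs arriving at this output unit.
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        # Step activation (an assumption): the unit fires (1) or stays silent (0).
        outputs.append(1 if total > 0 else 0)
    return outputs


# Hypothetical example: 3 input "pixels" and 2 output categories.
pixels = [1.0, 0.0, 1.0]
weights = [
    [0.5, -0.2, 0.3],   # connections into output unit 0
    [-0.4, 0.9, -0.1],  # connections into output unit 1
]
biases = [-0.5, 0.0]
result = perceptron_forward(pixels, weights, biases)
```

Each row of `weights` plays the role of one bank of potentiometers: it holds the strength of every input's connection to a single output neuron.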

Neural networks are typically initialized with random weights and "trained" to give consistently correct output for a known dataset (the "training set"). Training uses backpropagation to perform gradient descent: the chain rule is applied layer by layer to compute how the error on a given input/output example changes with each weight, and every weight in the network is then adjusted slightly in the direction that reduces that error.[5][4] In traditional feed-forward neural networks (like Rosenblatt's perceptron), each layer processes output from the previous layer only. Information does not flow backwards, which means that the network's structure contains no "cycles".[4] In contrast, a recurrent neural network (RNN) has at least one "cycle" of activation flow, where neurons can be activated by neurons in subsequent layers.[4]
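As a minimal sketch of this training loop, the example below fits a single linear neuron by gradient descent. With only one neuron the chain rule has a single step, so this shows the weight-update idea rather than full multi-layer backpropagation. The function name, learning rate, epoch count and toy data are assumptions made for illustration.

```python
import random


def train_linear_neuron(examples, learning_rate=0.1, epochs=200, seed=0):
    """Fit prediction = w * x + b to (x, target) pairs by gradient descent."""
    rng = random.Random(seed)
    # Random initial weights, as described in the text.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, target in examples:
            prediction = w * x + b
            error = prediction - target
            # Gradients of the loss 0.5 * error**2 with respect to w and b.
            grad_w = error * x
            grad_b = error
            # Step against the gradient to reduce the error.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Toy training set generated from target = 2 * x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_linear_neuron(examples)
```

After training, `w` and `b` should lie close to 2 and 1. In a deep network the same update is applied to every weight, with the gradients obtained by propagating the error backwards through each layer.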

RNNs, unlike FFNNs, are suited to processing sequential data, since they can produce different output for the same input depending on previous activation states. That is to say, a text-prediction model using recurrence could process the string "The dog ran out of the house, down the street, loudly" and produce "barking", while producing "meowing" for the same input sequence featuring "cat" in the place of "dog". Achieving the same output from a purely feed-forward neural network, on the other hand, would require separate activation pathways to be trained for each sentence in its entirety.[6][7]
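The dependence on earlier items can be illustrated with a one-unit recurrent network. This is a toy sketch: the numeric sequences merely stand in for the "dog" and "cat" sentences, and the weights `w_in` and `w_rec` are arbitrary assumptions rather than trained values.

```python
import math


def rnn_step(x, h_prev, w_in=0.8, w_rec=1.5):
    """One recurrent step: the new state mixes the input with the old state."""
    return math.tanh(w_in * x + w_rec * h_prev)


def run_rnn(sequence):
    """Feed a sequence through the unit and return the final hidden state."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h


# The two sequences differ only in their first item (standing in for
# "dog" versus "cat"); their final items are identical.
dog_like = [1.0, 0.0, 0.5]
cat_like = [-1.0, 0.0, 0.5]
h_dog = run_rnn(dog_like)
h_cat = run_rnn(cat_like)

# A feed-forward unit that sees only the final item cannot tell them apart.
ff_dog = math.tanh(0.8 * dog_like[-1])
ff_cat = math.tanh(0.8 * cat_like[-1])
```

Here `h_dog` ends up positive and `h_cat` negative even though the last input is the same, because the hidden state carries a trace of the first item forward. The feed-forward values `ff_dog` and `ff_cat` are identical.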

However, RNNs and FFNNs are both vulnerable to the "vanishing gradient problem": since gradients (stored as numbers of finite precision) must be backpropagated over every layer of a model (or, in an RNN, over every time step) to train it, a model with a large number of layers tends to see gradients "vanish" to zero or "explode" to infinity before getting all the way across. To resolve this problem, long short-term memory (LSTM) models were introduced by Sepp Hochreiter and Jürgen Schmidhuber between 1995 and 1997, featuring a novel architecture of multiple distinct "cells" with "input", "output" and "forget" gates.[8][9][10] LSTMs would find use in a variety of tasks that RNNs performed poorly at, like learning fine distinctions between rhythmic pattern sequences.[11]
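The following scalar sketch shows how the three gates interact in one cell. It follows the commonly used formulation of an LSTM step, not necessarily the exact equations of the cited papers. The parameter names (`wi`, `ui`, `bi` and so on) and the bias values are assumptions chosen so that the cell retains its contents.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell with input, forget and output gates."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old state, admit part of the new
    h = o * math.tanh(c)     # expose part of the cell state as output
    return h, c


# Hypothetical parameters: all weights zero, biases set so that the forget
# gate stays nearly open and the input gate nearly closed.
names = ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")
params = {name: 0.0 for name in names}
params["bf"] = 10.0
params["bi"] = -10.0

h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)

# For comparison: a signal repeatedly scaled by 0.5 vanishes over 100 steps.
plain = 1.0
for _ in range(100):
    plain *= 0.5
```

With the forget gate close to 1, the cell state `c` is still near its starting value after 100 steps, whereas the repeatedly scaled value `plain` has shrunk to almost nothing. This additive update of the cell state is what lets information, and the gradients used to train the network, survive across long sequences.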

While LSTMs proved useful for a variety of applications, like handwriting recognition,[12] they remained limited in their ability to process context; a unidirectional RNN or LSTM's output can only be influenced by previous sequence items.[6] Similar to how the history of the Roman Empire is contextualized by its decline, earlier items in a sequence of images or words tend to take on different meanings based on later items. One example is the following sentence:

He loved his bird more than anything, and cared for it well, and was very distraught to find it had a broken propeller.

Here, "bird" is used as a slang term for an airplane, but this only becomes apparent upon parsing the last word ("propeller"). While a human reading this sentence can update their interpretation of the first part after reading the second, a unidirectional neural network (whether feed-forward, recurrent, or LSTM) cannot.[6] To provide this capability, bidirectional LSTMs were created. Bidirectional RNNs were first described in 1997 by Schuster and Paliwal as an extension of RNNs.[13]
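The bidirectional idea can be sketched by running one recurrent pass from left to right and a second from right to left, then pairing the two states at each position. For brevity the sketch uses a plain tanh recurrent unit rather than a full LSTM cell; the function names and weight values are assumptions made for illustration.

```python
import math


def rnn_states(sequence, w_in, w_rec):
    """Return the hidden state after each item of the sequence."""
    h, states = 0.0, []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(sequence, fwd=(0.8, 0.5), bwd=(0.6, 0.7)):
    """Pair a left-to-right state with a right-to-left state at each position."""
    forward = rnn_states(sequence, *fwd)
    # Run the second pass over the reversed sequence, then flip the results
    # so that they line up with the original positions.
    backward = rnn_states(sequence[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


# Two sequences that differ only in their last item (compare "propeller").
a = bidirectional_states([1.0, 0.5, -1.0])
b = bidirectional_states([1.0, 0.5, 1.0])
```

At position 0 the forward states of `a` and `b` are identical, since the forward pass has seen only the first item, while the backward states differ because they have already absorbed the last item. A bidirectional LSTM applies the same pairing with LSTM cells in place of the simple unit used here.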

Bidirectional algorithms have long been used in domains outside of deep learning. In 2011, the state of the art in part-of-speech (POS) tagging consisted of classifiers trained on windows of text, whose output was then fed into bidirectional decoding algorithms during inference. Collobert et al. cited examples of high-performance POS tagging systems in which this bidirectionality was implemented through dependency networks and Viterbi decoders.[14]

In a 2005 paper, Graves et al. used bidirectional LSTMs for improved phoneme classification and recognition.[19]

In a 2007 paper, Liwicki et al. used bidirectional LSTMs for on-line handwriting recognition.

In a 2013 paper, Graves et al. used deep bidirectional LSTMs for hybrid speech recognition.[20]

