# Transformer (machine learning model)

The Transformer is a deep machine learning model introduced in 2017, used primarily in the field of natural language processing (NLP).[1] Like recurrent neural networks (RNNs), Transformers are designed to handle ordered sequences of data, such as natural language, for various tasks such as machine translation and text summarization. However, unlike RNNs, Transformers do not require that the sequence be processed in order. So, if the data in question is natural language, the Transformer does not need to process the beginning of a sentence before it processes the end. Due to this feature, the Transformer allows for much more parallelization than RNNs during training.[1]

Since their introduction, Transformers have become the basic building block of most state-of-the-art architectures in NLP, replacing gated recurrent neural network models such as the long short-term memory (LSTM) in many cases. Because the Transformer architecture allows more parallelization during training, it has enabled training on much more data than was previously possible. This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) and GPT-2, which are trained on huge amounts of general language data before being released and can then be fine-tuned for specific language tasks.[2][3]

## Background

The primary motivations for the development of the Transformer were to reduce path lengths, thereby making long-range dependencies in sequences easier to learn, and to allow for more parallelization than previous models such as recurrent neural networks and convolutional neural networks (CNNs).[1][4] While the path length between two positions grows linearly with their distance in RNNs and logarithmically in CNNs, it is constant in a Transformer, which greatly facilitates training. Additionally, the Transformer relies less on processing the input sequentially, a requirement that limits parallelization in CNNs and especially in RNNs.

### Recurrent Neural Networks

Various architectures of recurrent neural networks have been successful in performing tasks relating to sequences, due to their ability to process inputs of arbitrary length, rather than being restricted to fixed length inputs like those of multilayer perceptrons.[5] RNNs operate by processing inputs sequentially and retaining a hidden vector between iterations which is constantly used and modified throughout the sequence. An RNN can be interpreted as a program which uses a hidden vector as its memory.[5] Theoretically, RNNs are able to model arbitrarily complicated programs due to their Turing-completeness, just as perceptrons are able to model arbitrarily complicated functions, but in practice, RNNs suffer from a variant of the vanishing gradient problem, making it extremely difficult for them to learn long-range dependencies in sequences.[5][6][7]
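The recurrence described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, assuming a plain (ungated) RNN with a tanh nonlinearity and randomly initialized weights; the names `rnn_step` and `run_rnn` are hypothetical and not taken from any library.

```python
import math
import random

def rnn_step(h, x, W_h, W_x):
    """One recurrence: h_new = tanh(W_h @ h + W_x @ x)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(W_h[i], h))
            + sum(w * v for w, v in zip(W_x[i], x))
        )
        for i in range(len(h))
    ]

def run_rnn(sequence, hidden_size, seed=0):
    """Process a sequence one element at a time and return the final hidden vector."""
    rng = random.Random(seed)
    input_size = len(sequence[0])
    # Assumption: random weights stand in for trained parameters.
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    for x in sequence:
        # Strictly sequential: step t needs the hidden vector from step t - 1.
        h = rnn_step(h, x, W_h, W_x)
    return h

# The same network accepts sequences of any length and always yields
# a hidden vector of the same size.
short_state = run_rnn([[1.0, 0.0], [0.0, 1.0]], hidden_size=4)
long_state = run_rnn([[1.0, 0.0]] * 50, hidden_size=4)
```

The loop in `run_rnn` cannot be split across time steps, which is the source of the parallelization difficulty discussed in this section.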

Long short-term memory (LSTM) models, a variant of RNNs, attempt to alleviate this issue by choosing very specific operations to perform in the recurrent portion of the network.[6][8] LSTMs make use of a cell state which only passes through linear operations in the recurrent portion, allowing information to pass through relatively unchanged with each iteration.[6]
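The effect of the linear cell-state path can be illustrated with a small sketch. It assumes the standard LSTM update $c_t = f_t \odot c_{t-1} + i_t \odot g_t$ and, for simplicity, fixes the gate values by hand; in a real LSTM the gates are computed from the input and the previous hidden state. The function name `cell_state_update` is hypothetical.

```python
import math

def cell_state_update(c_prev, forget_gate, input_gate, candidate):
    """Elementwise LSTM cell-state update: c = f * c_prev + i * g."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, forget_gate, input_gate, candidate)]

initial = [0.7, -1.2, 0.3]

# With the forget gate fully open (1.0) and the input gate closed (0.0),
# the cell state is carried through 100 steps unchanged.
c = list(initial)
for _ in range(100):
    c = cell_state_update(c, forget_gate=[1.0] * 3, input_gate=[0.0] * 3, candidate=[0.5] * 3)

# For contrast, repeatedly squashing a value through tanh (as a plain
# recurrence would) shrinks it toward zero.
h = list(initial)
for _ in range(100):
    h = [math.tanh(v) for v in h]
```

After the loops, `c` still equals `initial`, while every entry of `h` has a smaller magnitude than it started with.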

Although RNNs can in principle handle long-range dependencies in arbitrary-length sequences, they still struggle to learn them in practice, and they are not easy to parallelize due to the sequential nature of their computation.[4]

### Convolutional Neural Networks

Autoregressive convolutional neural networks can handle long-range dependencies through dilated convolutions such that the path length is logarithmic.[4][9] CNNs also allow for parallelization within layers.
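The logarithmic path length can be checked with a short calculation. The sketch below assumes a kernel size of 2 and dilations that double at every layer (1, 2, 4, ...), in the style of WaveNet;[9] the function names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions visible to one output of a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def layers_to_cover(sequence_length, kernel_size=2):
    """Layers needed, with dilation doubling per layer, to see the whole sequence."""
    dilations = []
    while receptive_field(kernel_size, dilations) < sequence_length:
        dilations.append(2 ** len(dilations))
    return len(dilations)

# Four layers with dilations 1, 2, 4, 8 cover 16 positions.
field = receptive_field(2, [1, 2, 4, 8])

# Covering 1024 positions takes 10 layers, i.e. log2(1024).
depth = layers_to_cover(1024)
```

Under these assumptions the receptive field doubles with each layer, so the number of layers between two positions grows with the logarithm of their distance.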

## Architecture

The Transformer consists of two main components: a set of encoders chained together and a set of decoders chained together. The function of each encoder is to process its input vectors to generate what are known as encodings, which contain information about the parts of the inputs which are relevant to each other. It passes its set of generated encodings to the next encoder as inputs. Each decoder does the opposite, taking all the encodings and processing them, using their incorporated contextual information to generate an output sequence.[10] To achieve this, each encoder and decoder makes use of an attention mechanism, which for each input, weighs the relevance of every input and draws information from them accordingly when producing the output.[11] Each decoder also has an additional attention mechanism which draws information from the outputs of previous decoders, before the decoder draws information from the encodings. Both the encoders and decoders have a final feed-forward neural network for additional processing of the outputs, and also contain residual connections and layer normalization steps.[11]

### Encoder

Each encoder consists of two major components: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism takes in a set of input encodings from the previous encoder and weighs their relevance to each other to generate a set of output encodings. The feed-forward neural network then further processes each output encoding individually. These output encodings are finally passed to the next encoder as its input, as well as to the decoders.
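The feed-forward step, together with the residual connection and layer normalization mentioned earlier, can be sketched as follows. This is an illustration under stated assumptions: a two-layer network with a ReLU activation, normalization applied after the residual addition as in the original paper,[1] no learned gain or bias in the normalization, and hand-picked weights. All names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between; each row of W1/W2 is one output unit."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def feed_forward_sublayer(encodings, W1, b1, W2, b2):
    """Apply the same network to every position independently, then add and normalize."""
    return [
        layer_norm([a + f for a, f in zip(x, feed_forward(x, W1, b1, W2, b2))])
        for x in encodings
    ]

# Assumption: tiny hand-picked weights (model size 2, hidden size 3).
W1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
W2, b2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]], [0.0, 0.0]

encodings = [[1.0, -2.0], [0.5, 3.0]]
outputs = feed_forward_sublayer(encodings, W1, b1, W2, b2)
```

Because the network is applied position by position, the result for one encoding does not depend on which other encodings are present; mixing information across positions is left to the attention mechanism.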

The first encoder takes positional information and embeddings of the input sequence as its input, rather than encodings. The positional information is necessary for the Transformer to make use of the order of the sequence, because no other part of the Transformer is sensitive to that order.[1]
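One way to supply positional information is the fixed sinusoidal scheme used in the original paper,[1] where even dimensions use a sine and odd dimensions a cosine of the position scaled by a dimension-dependent wavelength. The sketch below assumes that scheme and assumes the result is simply added to the embeddings; the function names are hypothetical, and learned positional embeddings are an alternative not shown here.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: PE[pos][2i] = sin(pos / 10000^(2i/d)), PE[pos][2i+1] = cos(same)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_position(embeddings):
    """Add the positional table elementwise to a list of embedding vectors."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]

pe = positional_encoding(3, 4)

# Two identical token embeddings become distinguishable once position is added.
same_token = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
with_position = add_position(same_token)
```

Without this step, the two identical embeddings in `same_token` would be indistinguishable to the rest of the model regardless of where they appear in the sequence.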

### Decoder

Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders.[1][11]

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. Because the Transformer should not use the current or future output to predict an output, the output sequence must be partially masked to prevent this reverse information flow.[1] The last decoder is followed by a final linear transformation and softmax layer, which produce the desired output probabilities.
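The masking can be sketched as a lower-triangular pattern applied to the attention scores before the softmax. The example assumes the convention from the original paper in which the decoder input is the output sequence shifted right by one position,[1] so that allowing position $i$ to attend to positions $j \le i$ of the shifted input exposes only earlier outputs. The function names are hypothetical.

```python
import math

def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    out = []
    for row, allowed in zip(scores, mask):
        kept = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in kept]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# With equal scores everywhere, each position spreads its attention
# evenly over itself and the positions before it.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
# weights[1] is [0.5, 0.5, 0.0, 0.0]
```

Setting a masked score to negative infinity makes its exponential exactly zero, so later positions contribute nothing to the weighted sum.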

### Scaled Dot-Product Attention

The purpose of an attention mechanism is to use a set of encodings to incorporate context into a sequence.[11] An attention mechanism takes a set of queries and a set of key-value pairs in order to generate an output, where the queries are generated by linear transformations of the vectors in the sequence, and the keys and values are generated by linear transformations of the encodings.[1][12] Self-attention refers to the situation where the queries, keys, and values are all created using encodings of the sequence.[11] Scaled dot-product attention multiplies the queries by the keys, divides by a scaling factor and takes the softmax, and finally uses the result to compute a weighted sum of the values corresponding to the keys, generating an encoding of the original sequence.[1] If $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are the matrices containing the queries, keys, and values respectively and $n$ is the dimension of the queries and keys, then the output of this attention mechanism is:[1][12]

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{n}}\right)\mathbf{V}$$
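The formula can be written out directly. The sketch below uses plain Python lists so that it has no dependencies, and it assumes the queries, keys, and values have already been produced by their linear transformations; the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(n)) V, where n is the dimension of the queries and keys."""
    n = len(K[0])
    output = []
    for q in Q:
        # One score per key: dot product of the query with that key, scaled.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(n) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 6.0]]
# Identical keys receive equal weights, so the result is the mean
# of the two value vectors: [[3.0, 3.0]]
result = scaled_dot_product_attention(Q, K, V)
```

When one key matches the query much better than the others, its weight approaches 1 and the output approaches the corresponding value vector.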

In practice, it is useful to implement several attention mechanisms in parallel and then combine the resulting encodings together in a process called multi-head attention.
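A multi-head variant can be sketched by running several attention mechanisms on separately projected copies of the input and concatenating the results. This example assumes self-attention, per-head projection matrices of width `d_model / num_heads`, and a final output projection, following the original paper;[1] random matrices stand in for learned weights, and all names are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(n)) V."""
    n = len(K[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(n) for k in K] for q in Q]
    return matmul([softmax(row) for row in scores], V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, seed=0):
    """Self-attention over X with num_heads heads run independently and then combined."""
    d_model = len(X[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    rng = random.Random(seed)  # assumption: random weights stand in for trained ones
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (random_matrix(d_model, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads' outputs position by position, then project back.
    concatenated = [sum((head[t] for head in heads), []) for t in range(len(X))]
    W_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concatenated, W_o)

X = [[1.0, 0.0, 0.5, -0.5], [0.0, 1.0, -1.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
out = multi_head_attention(X, num_heads=2)
```

Each head in the loop is independent of the others, so the heads can be computed in parallel; the output has the same shape as the input, which lets layers be stacked.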

## Training

Transformers typically undergo semi-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a much larger dataset than fine-tuning, due to the restricted availability of labeled training data. Tasks for pretraining and fine-tuning commonly include:

## Applications

The Transformer finds most of its applications in the field of natural language processing (NLP), for example the tasks of machine translation and time series prediction.[14] Many pretrained models such as GPT-2, BERT, XLNet, and RoBERTa demonstrate the ability of Transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications.[2][3][15] These may include:

## References

1. Polosukhin, Illia; Kaiser, Lukasz; Gomez, Aidan N.; Jones, Llion; Uszkoreit, Jakob; Parmar, Niki; Shazeer, Noam; Vaswani, Ashish (2017-06-12). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
2. "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. Retrieved 2019-08-25.
3. "Better Language Models and Their Implications". OpenAI. 2019-02-14. Retrieved 2019-08-25.
4. Chromiak, Michał (2017-09-12). "The Transformer – Attention is all you need". Michał Chromiak's blog. Retrieved 2019-10-20.
5. "The Unreasonable Effectiveness of Recurrent Neural Networks". karpathy.github.io. Retrieved 2019-10-20.
6. "Understanding LSTM Networks -- colah's blog". colah.github.io. Retrieved 2019-10-20.
7. Bengio, Y.; Simard, P.; Frasconi, P. (March 1994). "Learning long-term dependencies with gradient descent is difficult". IEEE Transactions on Neural Networks. 5 (2): 157–166. doi:10.1109/72.279181. ISSN 1045-9227. PMID 18267787.
8. Hochreiter, Sepp; Schmidhuber, Jürgen (November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667.
9. Oord, Aaron van den; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499 [cs.SD].
10. "Sequence Modeling with Neural Networks (Part 2): Attention Models". Indico. 2016-04-18. Retrieved 2019-10-15.
11. Alammar, Jay. "The Illustrated Transformer". jalammar.github.io. Retrieved 2019-10-15.
12. "Attention? Attention!". Lil'Log. 2018-06-24. Retrieved 2019-10-15.
13. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355. arXiv:1804.07461. Bibcode:2018arXiv180407461W. doi:10.18653/v1/w18-5446.
14. Allard, Maxime (2019-07-01). "What is a Transformer?". Medium. Retrieved 2019-10-21.
15. Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2019-06-19). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". OCLC 1106350082.
16. Monsters, Data (2017-09-26). "10 Applications of Artificial Neural Networks in Natural Language Processing". Medium. Retrieved 2019-10-21.