# Gated recurrent unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU is like a long short-term memory (LSTM) with a forget gate,[2] but has fewer parameters than LSTM, as it lacks an output gate.[3] GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.[4][5] GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.[6][7]

## Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.[8]

The operator ${\displaystyle \odot }$ denotes the Hadamard product in the following.

### Fully gated unit

*Gated Recurrent Unit, fully gated version*

Initially, for ${\displaystyle t=0}$, the output vector is ${\displaystyle h_{0}=0}$.

$${\displaystyle {\begin{aligned}z_{t}&=\sigma _{g}(W_{z}x_{t}+U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma _{g}(W_{r}x_{t}+U_{r}h_{t-1}+b_{r})\\{\hat {h}}_{t}&=\phi _{h}(W_{h}x_{t}+U_{h}(r_{t}\odot h_{t-1})+b_{h})\\h_{t}&=z_{t}\odot {\hat {h}}_{t}+(1-z_{t})\odot h_{t-1}\end{aligned}}}$$

Variables

• ${\displaystyle x_{t}}$: input vector
• ${\displaystyle h_{t}}$: output vector
• ${\displaystyle {\hat {h}}_{t}}$: candidate activation vector
• ${\displaystyle z_{t}}$: update gate vector
• ${\displaystyle r_{t}}$: reset gate vector
• ${\displaystyle W}$ and ${\displaystyle U}$: parameter matrices
• ${\displaystyle b}$: parameter vector
• ${\displaystyle \sigma _{g}}$: activation function; the original uses the sigmoid function
• ${\displaystyle \phi _{h}}$: activation function; the original uses the hyperbolic tangent

Alternative activation functions are possible, provided that ${\displaystyle \sigma _{g}(x)\in [0,1]}$.
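As a minimal sketch of the equations above, one step of the fully gated unit can be written in NumPy. The parameter shapes and sizes here are hypothetical, chosen only for illustration; they are not part of the original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One step of the fully gated GRU; p maps a gate name to its (W, U, b)."""
    z = sigmoid(p["z"][0] @ x_t + p["z"][1] @ h_prev + p["z"][2])            # update gate
    r = sigmoid(p["r"][0] @ x_t + p["r"][1] @ h_prev + p["r"][2])            # reset gate
    h_hat = np.tanh(p["h"][0] @ x_t + p["h"][1] @ (r * h_prev) + p["h"][2])  # candidate
    return z * h_hat + (1 - z) * h_prev                                      # new hidden state

# Hypothetical sizes for illustration: input size 2, hidden size 3.
rng = np.random.default_rng(0)
d_in, d_h = 2, 3
params = {g: (rng.normal(size=(d_h, d_in)),   # W
              rng.normal(size=(d_h, d_h)),    # U
              np.zeros(d_h))                  # b
          for g in ("z", "r", "h")}
h = np.zeros(d_h)                              # h_0 = 0
h = gru_step(rng.normal(size=d_in), h, params)
```

Because ${\displaystyle h_{0}=0}$ and the candidate is a tanh output, every component of the first hidden state lies strictly inside ${\displaystyle (-1,1)}$.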


Alternate forms can be created by changing ${\displaystyle z_{t}}$ and ${\displaystyle r_{t}}$:[9]

• Type 1: each gate depends only on the previous hidden state and the bias.
$${\displaystyle {\begin{aligned}z_{t}&=\sigma _{g}(U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma _{g}(U_{r}h_{t-1}+b_{r})\end{aligned}}}$$
• Type 2: each gate depends only on the previous hidden state.
$${\displaystyle {\begin{aligned}z_{t}&=\sigma _{g}(U_{z}h_{t-1})\\r_{t}&=\sigma _{g}(U_{r}h_{t-1})\end{aligned}}}$$
• Type 3: each gate is computed using only the bias.
$${\displaystyle {\begin{aligned}z_{t}&=\sigma _{g}(b_{z})\\r_{t}&=\sigma _{g}(b_{r})\end{aligned}}}$$
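The three variants change only the gate computation; the rest of the unit is unchanged. A NumPy sketch (with hypothetical shapes, for illustration only) makes the progressive simplification explicit:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes for illustration: input size 2, hidden size 3.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))   # input weights (unused by Types 1-3)
U = rng.normal(size=(3, 3))   # recurrent weights
b = rng.normal(size=3)        # bias
x_t, h_prev = rng.normal(size=2), rng.normal(size=3)

gate_full  = sigmoid(W @ x_t + U @ h_prev + b)  # full GRU gate
gate_type1 = sigmoid(U @ h_prev + b)            # Type 1: previous state + bias
gate_type2 = sigmoid(U @ h_prev)                # Type 2: previous state only
gate_type3 = sigmoid(b)                         # Type 3: bias only (constant over time)
```

Note that a Type 3 gate is the same vector at every time step, since it depends on no time-varying input.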

### Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also changes the equation for the output vector:[10]

$${\displaystyle {\begin{aligned}f_{t}&=\sigma _{g}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\{\hat {h}}_{t}&=\phi _{h}(W_{h}x_{t}+U_{h}(f_{t}\odot h_{t-1})+b_{h})\\h_{t}&=(1-f_{t})\odot h_{t-1}+f_{t}\odot {\hat {h}}_{t}\end{aligned}}}$$

Variables

• ${\displaystyle x_{t}}$: input vector
• ${\displaystyle h_{t}}$: output vector
• ${\displaystyle {\hat {h}}_{t}}$: candidate activation vector
• ${\displaystyle f_{t}}$: forget vector
• ${\displaystyle W}$ and ${\displaystyle U}$: parameter matrices
• ${\displaystyle b}$: parameter vector
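A minimal NumPy sketch of one MGU step, following the equations above (the parameter shapes and sizes are hypothetical, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, Wf, Uf, bf, Wh, Uh, bh):
    """One step of the minimal gated unit: a single forget gate f replaces z and r."""
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)            # forget gate
    h_hat = np.tanh(Wh @ x_t + Uh @ (f * h_prev) + bh)  # candidate activation
    return (1 - f) * h_prev + f * h_hat                 # new hidden state

# Hypothetical sizes for illustration: input size 2, hidden size 3.
rng = np.random.default_rng(2)
d_in, d_h = 2, 3
h = mgu_step(rng.normal(size=d_in), np.zeros(d_h),
             rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
             rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
```

Compared with the fully gated unit, this uses one `(W, U, b)` triple fewer, which is the source of the parameter saving.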

### Content Adaptive Recurrent Unit (CARU)

*The complete CARU architecture. The direction of data flow is indicated by arrows, involved functions by yellow rectangles, and the various gates (operations) by blue circles.*

The Content Adaptive Recurrent Unit (CARU) is a variant of the GRU, introduced in 2020 by Ka-Hou Chan et al.[11] The CARU retains the update gate of the GRU but introduces a content-adaptive gate in place of the reset gate. CARU is designed to alleviate the long-term dependency problem of RNN models. It was found to give a slight performance improvement on NLP tasks while also having fewer parameters than the GRU.[12]

In the following equations, lowercase variables represent vectors, and ${\displaystyle \left[W;B\right]}$ denotes the trainable parameters of a linear layer, consisting of a weight matrix and a bias vector. Initially, for ${\displaystyle t=0}$, CARU directly returns ${\displaystyle h^{(1)}\gets W_{vn}v^{(0)}+B_{vn}}$; then, for ${\displaystyle t>0}$, the complete architecture is given by:

$${\displaystyle {\begin{aligned}x^{(t)}&={W_{vn}}v^{(t)}+{B_{vn}}\\n^{(t)}&=\phi (({W_{hn}}h^{(t)}+{B_{hn}})+x^{(t)})\\z^{(t)}&=\sigma ({W_{hz}}h^{(t)}+{B_{hz}}+{W_{vz}}v^{(t)}+{B_{vz}})\\l^{(t)}&=\sigma (x^{(t)})\odot z^{(t)}\\h^{(t+1)}&=(1-l^{(t)})\odot h^{(t)}+l^{(t)}\odot n^{(t)}\end{aligned}}}$$

Note that ${\displaystyle t\gets t+1}$ is applied at the end of each recurrent loop. The operator ${\displaystyle \odot }$ denotes the Hadamard product; ${\displaystyle \sigma }$ and ${\displaystyle \phi }$ denote the sigmoid and hyperbolic-tangent activation functions, respectively.

#### Variables

• ${\displaystyle x^{(t)}}$: the current word ${\displaystyle v^{(t)}}$ is first projected into ${\displaystyle x^{(t)}}$ as the input feature. This result is used in the next hidden state and is passed to the content-adaptive gate.
• ${\displaystyle n^{(t)}}$: compared to the GRU, the reset gate has been removed; the parameters related to ${\displaystyle h^{(t)}}$ and ${\displaystyle x^{(t)}}$ are simply combined to produce the new hidden state ${\displaystyle n^{(t)}}$.
• ${\displaystyle z^{(t)}}$: the same as the update gate in the GRU, used for the transition of the hidden state.
• ${\displaystyle l^{(t)}}$: a Hadamard product combines the update gate with the weight of the current feature. This gate is named the content-adaptive gate; it influences the amount of gradual transition, rather than diluting the current hidden state.
• ${\displaystyle h^{(t+1)}}$: the next hidden state, a combination of ${\displaystyle h^{(t)}}$ and ${\displaystyle n^{(t)}}$.
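The CARU equations above can be sketched in NumPy as follows. This is only an illustrative transcription of the formulas, with hypothetical parameter shapes and random weights, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def caru_step(v_t, h_t, p):
    """One CARU step, following the equations above; p maps a name to its (W, B)."""
    x = p["vn"][0] @ v_t + p["vn"][1]                # project word into feature x
    n = np.tanh(p["hn"][0] @ h_t + p["hn"][1] + x)   # content-state (no reset gate)
    z = sigmoid(p["hz"][0] @ h_t + p["hz"][1]
                + p["vz"][0] @ v_t + p["vz"][1])     # content-weight (update gate)
    l = sigmoid(x) * z                               # content-adaptive gate
    return (1 - l) * h_t + l * n                     # next hidden state

# Hypothetical sizes for illustration: word-vector size 2, hidden size 3.
rng = np.random.default_rng(3)
d_v, d_h = 2, 3
p = {"vn": (rng.normal(size=(d_h, d_v)), np.zeros(d_h)),
     "hn": (rng.normal(size=(d_h, d_h)), np.zeros(d_h)),
     "hz": (rng.normal(size=(d_h, d_h)), np.zeros(d_h)),
     "vz": (rng.normal(size=(d_h, d_v)), np.zeros(d_h))}
v0 = rng.normal(size=d_v)
h = p["vn"][0] @ v0 + p["vn"][1]           # h^(1) for t = 0
h = caru_step(rng.normal(size=d_v), h, p)  # one step for t > 0
```

Note how the update `(1 - l) ⊙ h + l ⊙ n` mirrors the GRU, but the interpolation weight `l` is modulated per-component by `σ(x)`, the weight of the current word.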

#### Data Flow

Another feature of CARU is that it weights the hidden state according to the current word and introduces a content-adaptive gate, instead of a reset gate, to alleviate the dependence on long-term content. CARU processes three trunks of data flow:

• content-state: produces the new hidden state ${\displaystyle n^{(t)}}$ via a linear layer; this part is equivalent to a simple RNN.
• word-weight: produces the weight ${\displaystyle \sigma (x^{(t)})}$ of the current word. It plays a role similar to a GRU reset gate, but is based only on the current word rather than the entire content. More specifically, it can be viewed as a tagging task that connects the weight to the word's part of speech.
• content-weight: produces the weight ${\displaystyle z^{(t)}}$ of the current content; its form is the same as a GRU update gate, with the purpose of overcoming long-term dependence.

In contrast to the GRU, CARU does not process these data flows separately; instead, it dispatches the word-weight to the content-adaptive gate and multiplies it with the content-weight. In this way, the content-adaptive gate takes both the word and the content into account.

## References

1. ^ Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, Dzmitry; Bengio, Yoshua (2014). "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches". arXiv:1409.1259.
2. ^ Felix Gers; Jürgen Schmidhuber; Fred Cummins (1999). "Learning to Forget: Continual Prediction with LSTM". Proc. ICANN'99, IEE, London. 1999: 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
3. ^ "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.
4. ^ Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv:1803.10225. doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
5. ^ Su, Yuanhang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv:1803.01686. doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
6. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
7. ^ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
8. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
9. ^ Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv:1701.05923 [cs.NE].
10. ^ Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv:1701.03452 [cs.NE].
11. ^ Chan, Ka-Hou; Ke, Wei; Im, Sio-Kei (2020), Yang, Haiqin; Pasupa, Kitsuchart; Leung, Andrew Chi-Sing; Kwok, James T. (eds.), "CARU: A Content-Adaptive Recurrent Unit for the Transition of Hidden State in NLP", Neural Information Processing, Cham: Springer International Publishing, vol. 12532, pp. 693–703, doi:10.1007/978-3-030-63830-6_58, ISBN 978-3-030-63829-0, S2CID 227075832, retrieved 2022-02-18
12. ^ Ke, Wei; Chan, Ka-Hou (2021-11-30). "A Multilayer CARU Framework to Obtain Probability Distribution for Paragraph-Based Sentiment Analysis". Applied Sciences. 11 (23): 11344. doi:10.3390/app112311344. ISSN 2076-3417.