# Gated recurrent unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. A GRU resembles a long short-term memory (LSTM) unit with a forget gate, but it has fewer parameters than an LSTM because it lacks an output gate. The performance of GRUs on certain tasks in polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM. GRUs have also been shown to perform better on certain smaller and less frequent datasets.

## Architecture

There are several variations on the fully gated unit, in which gating is done using the previous hidden state and the bias in various combinations, as well as a simplified form called the minimal gated unit.

The operator $\odot$ denotes the Hadamard product in the following.

### Fully gated unit

Initially, for $t=0$, the output vector is $h_{0}=0$.

$$
\begin{aligned}
z_{t}&=\sigma _{g}(W_{z}x_{t}+U_{z}h_{t-1}+b_{z})\\
r_{t}&=\sigma _{g}(W_{r}x_{t}+U_{r}h_{t-1}+b_{r})\\
{\hat {h}}_{t}&=\phi _{h}(W_{h}x_{t}+U_{h}(r_{t}\odot h_{t-1})+b_{h})\\
h_{t}&=z_{t}\odot {\hat {h}}_{t}+(1-z_{t})\odot h_{t-1}
\end{aligned}
$$

Variables:

• $x_{t}$ : input vector
• $h_{t}$ : output vector
• ${\hat {h}}_{t}$ : candidate activation vector
• $z_{t}$ : update gate vector
• $r_{t}$ : reset gate vector
• $W$ , $U$ and $b$ : parameter matrices and vector
• $\sigma _{g}$ : gate activation function; a sigmoid function in the original formulation
• $\phi _{h}$ : candidate activation function; a hyperbolic tangent in the original formulation

Alternative activation functions are possible, provided that $\sigma _{g}(x)\in [0,1]$ .
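The equations of the fully gated unit can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it uses only the standard library, represents vectors as lists and matrices as lists of rows, and the names `gru_step`, `affine` and `init_params`, the parameter dictionary keys, and the chosen dimensions are assumptions made for this example.

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid, used here as the gate activation sigma_g."""
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One time step of the fully gated unit; p holds W*, U*, b* per gate."""
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r_t ⊙ h_{t-1}
    h_hat = [math.tanh(a) for a in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # h_t = z_t ⊙ ĥ_t + (1 - z_t) ⊙ h_{t-1}
    return [zi * ci + (1.0 - zi) * hi for zi, ci, hi in zip(z, h_hat, h_prev)]


def init_params(n_in, n_hidden, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in "zrh":
        p["W" + g] = mat(n_hidden, n_in)
        p["U" + g] = mat(n_hidden, n_hidden)
        p["b" + g] = [0.0] * n_hidden
    return p


params = init_params(n_in=3, n_hidden=2)
h = [0.0, 0.0]  # h_0 = 0
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h = gru_step(x, h, params)
```

Because $h_{0}=0$ and each step forms a convex combination of the previous state and a hyperbolic tangent output, every component of `h` stays strictly between -1 and 1 in this sketch.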

Alternate forms can be created by changing $z_{t}$ and $r_{t}$:

• Type 1, each gate depends only on the previous hidden state and the bias.

$$
\begin{aligned}
z_{t}&=\sigma _{g}(U_{z}h_{t-1}+b_{z})\\
r_{t}&=\sigma _{g}(U_{r}h_{t-1}+b_{r})
\end{aligned}
$$

• Type 2, each gate depends only on the previous hidden state.

$$
\begin{aligned}
z_{t}&=\sigma _{g}(U_{z}h_{t-1})\\
r_{t}&=\sigma _{g}(U_{r}h_{t-1})
\end{aligned}
$$

• Type 3, each gate is computed using only the bias.

$$
\begin{aligned}
z_{t}&=\sigma _{g}(b_{z})\\
r_{t}&=\sigma _{g}(b_{r})
\end{aligned}
$$

### Minimal gated unit

The minimal gated unit is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:

$$
\begin{aligned}
f_{t}&=\sigma _{g}(W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\
{\hat {h}}_{t}&=\phi _{h}(W_{h}x_{t}+U_{h}(f_{t}\odot h_{t-1})+b_{h})\\
h_{t}&=(1-f_{t})\odot h_{t-1}+f_{t}\odot {\hat {h}}_{t}
\end{aligned}
$$

Variables:

• $x_{t}$ : input vector
• $h_{t}$ : output vector
• ${\hat {h}}_{t}$ : candidate activation vector
• $f_{t}$ : forget vector
• $W$ , $U$ and $b$ : parameter matrices and vector

## Content-adaptive recurrent unit

Figure: The complete CARU architecture. The direction of data flow is indicated by arrows, involved functions are indicated by yellow rectangles, and various gates (operations) are indicated by blue circles.

The Content-Adaptive Recurrent Unit (CARU) is a variant of the GRU, introduced in 2020 by Ka-Hou Chan et al. Like the GRU, the CARU contains an update gate, but it introduces a content-adaptive gate in place of the reset gate. CARU is designed to alleviate the long-term dependence problem of RNN models. It was found to give a slight performance improvement on NLP tasks while using fewer parameters than the GRU.

In the following equations, lowercase variables represent vectors and $\left[W;B\right]$ denotes the trainable parameters of a linear layer, consisting of weights and biases, respectively. Initially, for $t=0$, CARU directly returns $h^{(1)}\gets W_{vn}v^{(0)}+B_{vn}$. Next, for $t>0$, the complete architecture is given by:

$$
\begin{aligned}
x^{(t)}&={W_{vn}}v^{(t)}+{B_{vn}}\\
n^{(t)}&=\phi (({W_{hn}}h^{(t)}+{B_{hn}})+x^{(t)})\\
z^{(t)}&=\sigma ({W_{hz}}h^{(t)}+{B_{hz}}+{W_{vz}}v^{(t)}+{B_{vz}})\\
l^{(t)}&=\sigma (x^{(t)})\odot z^{(t)}\\
h^{(t+1)}&=(1-l^{(t)})\odot h^{(t)}+l^{(t)}\odot n^{(t)}
\end{aligned}
$$

Note that $t\gets t+1$ is applied at the end of each recurrent loop. The operator $\odot$ denotes the Hadamard product, and $\sigma$ and $\phi$ denote the sigmoid and hyperbolic tangent activation functions, respectively.

### Variables

• $x^{(t)}$ : The current word $v^{(t)}$ is first projected into $x^{(t)}$, the input feature. This result is used in computing the next hidden state and is passed to the content-adaptive gate.
• $n^{(t)}$ : Compared to GRU, the reset gate has been removed. The parameters related to $h^{(t)}$ and $x^{(t)}$ are simply combined to produce a new hidden state $n^{(t)}$.
• $z^{(t)}$ : This is the same as the update gate in GRU and controls the transition of the hidden state.
• $l^{(t)}$ : A Hadamard product combines the update gate with the weight of the current feature. This gate is called the content-adaptive gate; it influences the amount of gradual transition, rather than diluting the current hidden state.
• $h^{(t+1)}$ : The next hidden state is a combination of $h^{(t)}$ and $n^{(t)}$, weighted by $l^{(t)}$.
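The CARU recurrence can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Python standard library, the names `caru_step`, `linear` and `init_caru_params` and the dictionary keys are hypothetical, and passing `None` as the hidden state is simply this example's way of marking the $t=0$ case.

```python
import math
import random


def sigmoid(a):
    """Logistic sigmoid (sigma in the equations)."""
    return 1.0 / (1.0 + math.exp(-a))


def linear(W, B, u):
    """Linear layer [W; B] applied to a vector u: W u + B."""
    return [sum(w * ui for w, ui in zip(row, u)) + b for row, b in zip(W, B)]


def caru_step(v, h, p):
    """One CARU step; h is None at t = 0 (an assumption of this sketch)."""
    x = linear(p["Wvn"], p["Bvn"], v)  # x = W_vn v + B_vn
    if h is None:
        return x  # h^(1) <- W_vn v^(0) + B_vn
    hn = linear(p["Whn"], p["Bhn"], h)
    n = [math.tanh(a + b) for a, b in zip(hn, x)]  # new hidden state n
    hz = linear(p["Whz"], p["Bhz"], h)
    vz = linear(p["Wvz"], p["Bvz"], v)
    z = [sigmoid(a + b) for a, b in zip(hz, vz)]  # update gate z
    l = [sigmoid(xi) * zi for xi, zi in zip(x, z)]  # l = sigma(x) ⊙ z
    # h^(t+1) = (1 - l) ⊙ h + l ⊙ n
    return [(1.0 - li) * hi + li * ni for li, hi, ni in zip(l, h, n)]


def init_caru_params(n_in, n_hidden, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for name, cols in (("vn", n_in), ("hn", n_hidden), ("hz", n_hidden), ("vz", n_in)):
        p["W" + name] = mat(n_hidden, cols)
        p["B" + name] = [0.0] * n_hidden
    return p


params = init_caru_params(n_in=3, n_hidden=2)
h = None
for v in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0]):
    h = caru_step(v, h, params)
```

Note that $x^{(t)}$ is added to a projection of $h^{(t)}$, so in this sketch $W_{vn}$ maps the input directly to the hidden dimension. The sketch uses four linear layers, compared with the six weight matrices of the fully gated unit.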

### Data Flow

Another feature of CARU is that it weights the hidden state according to the current word through the content-adaptive gate, instead of using a reset gate, in order to alleviate the dependence on long-term content. CARU processes three trunks of data flow:

• content-state: It produces a new hidden state $n^{(t)}$ through a linear layer; this part is equivalent to a simple RNN.
• word-weight: It produces the weight $\sigma (x^{(t)})$ of the current word. It plays a role similar to a GRU reset gate, but is based only on the current word rather than the entire content. More specifically, it can be viewed as a tagging task that relates the weight to parts of speech.
• content-weight: It produces the weight $z^{(t)}$ of the current content. Its form is the same as a GRU update gate, but its purpose is to overcome the long-term dependence problem.

In contrast to GRU, CARU does not process these data flows in the same way; instead, it dispatches the word-weight to the content-adaptive gate and multiplies it by the content-weight. In this way, the content-adaptive gate takes both the word and the content into account.
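As a small worked illustration of how the two weights interact, consider a single hidden dimension with values chosen here purely for illustration (they are not taken from the source): $h^{(t)}=0.8$, $n^{(t)}=-0.4$ and content-weight $z^{(t)}=0.5$. For a word with a high word-weight, $\sigma (x^{(t)})=0.9$, and a word with a low word-weight, $\sigma (x^{(t)})=0.1$, the CARU equations give:

$$
\begin{aligned}
\text{high word-weight:}\quad l^{(t)}&=0.9\times 0.5=0.45, & h^{(t+1)}&=0.55\times 0.8+0.45\times (-0.4)=0.26\\
\text{low word-weight:}\quad l^{(t)}&=0.1\times 0.5=0.05, & h^{(t+1)}&=0.95\times 0.8+0.05\times (-0.4)=0.74
\end{aligned}
$$

With the same content-weight, the low-weight word changes the hidden state only slightly, while the high-weight word moves it substantially toward $n^{(t)}$.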