Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014)
These are my notes from the paper Neural Machine Translation by Jointly Learning to Align and Translate (2014) by Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio.
Overview
This paper proposed an improvement to the RNN Encoder-Decoder ^{1} network architecture, introducing an "attention mechanism" to the decoder, which significantly improved performance on longer sentences. The concept of attention went on to become extremely influential in Machine Learning.
At the time, neural networks had emerged as a promising approach to machine translation. Researchers were aiming for an end-to-end translation model, in contrast to the state-of-the-art statistical phrase-based translation methods, which involved many individually trained components. The RNN Encoder-Decoder approach encodes an input sentence into a fixed-length context vector; a decoder then outputs a translation using the context vector. The encoder and decoder are jointly trained on a dataset of text pairs, where the goal is to maximise the probability of the target given the input.
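Written out, and assuming a training set of $N$ sentence pairs $(x_n, y_n)$ and model parameters $\theta$ (notation borrowed from Cho et al. ^{1} rather than from these notes), the joint training objective is roughly:
$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$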
However, this approach struggles with longer sentences, as the encoder has to drop information to compress the sentence into a fixed-length context vector.
The authors proposed modifying the encoder to output a sequence with one hidden representation per input word, then adding a search mechanism to the decoder, allowing it to find the most relevant information in the input sequence to predict each word in the output sequence.
They likened the modification to the human notion of "attention", calling it an Attention Mechanism. Though not the first Machine Learning paper to propose applying human-like attention to model architectures ^{2}, this approach was very influential in NLP, leading to a lot of research that eventually converged on an entirely attention-based architecture called the Transformer.
Architecture Details
The authors propose the RNNsearch model: an Encoder / Decoder model with an attention mechanism. For comparison, they train RNNencdec, which follows the standard RNN Encoder / Decoder architecture ^{2}, with the encoder returning a fixed-length context vector.
To demonstrate the ability to handle longer sequences, they train each model twice:
 First, with sentences of length up to 30 words: RNNencdec-30, RNNsearch-30
 Next, with sentences of length up to 50 words: RNNencdec-50, RNNsearch-50
Encoder
For the encoder, they use a bidirectional RNN built from Gated Recurrent Units (GRU).
Each input token is passed through an embedding layer to give $x_i$. A forward GRU and a backward GRU then each produce a hidden state per token, and the two states are concatenated into a single "annotation", $h_i$.
The idea is for each annotation to summarise both the preceding and the following words, giving the attention mechanism the richest possible representation of each position.
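A minimal sketch of how the annotations are assembled. It assumes a plain tanh recurrent cell as a stand-in for the GRU and uses random, untrained weights; the helper names (`rnn_step`, `random_matrix`, `encode`) and the toy dimensions are illustrative and do not come from the paper.

```python
import math
import random


def rnn_step(x, h, W, U):
    """One tanh recurrent step, h' = tanh(W x + U h). A stand-in for a GRU cell."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W[r], x))
            + sum(u * hi for u, hi in zip(U[r], h))
        )
        for r in range(len(h))
    ]


def random_matrix(rows, cols, rng):
    """Small random weights; hypothetical initialisation for the sketch only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(embeddings, n, rng):
    """Return one annotation per token: forward state concatenated with backward state."""
    m = len(embeddings[0])
    W_f, U_f = random_matrix(n, m, rng), random_matrix(n, n, rng)
    W_b, U_b = random_matrix(n, m, rng), random_matrix(n, n, rng)

    forward, h = [], [0.0] * n
    for x in embeddings:              # read the sentence left to right
        h = rnn_step(x, h, W_f, U_f)
        forward.append(h)

    backward, h = [], [0.0] * n
    for x in reversed(embeddings):    # read the sentence right to left
        h = rnn_step(x, h, W_b, U_b)
        backward.append(h)
    backward.reverse()                # realign backward states with source positions

    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]  # 5 tokens, m = 4
annotations = encode(embeddings, n=3, rng=rng)
```

Each of the 5 annotations has length 6 here: 3 values from the forward pass and 3 from the backward pass.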
Figure 1: The graphical illustration of the proposed model trying to generate the $t$-th target word $y_t$ given a source sentence ($x_1, x_2, \ldots, x_T$)
Decoder
For the decoder, they use a unidirectional Gated Recurrent Unit.
The initial hidden state $s_0$ is computed by an initialisation layer, which comprises a linear layer followed by a tanh activation function.
$s_0 = \tanh \left( W_s \overleftarrow{h}_1 \right)$ where $W_s \in \mathbb{R}^{n \times n}$.
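As a rough illustration of the formula above, the initialisation is a single matrix-vector product followed by tanh. The 2x2 weights and the backward state below are made-up toy values, and `initial_decoder_state` is a hypothetical helper name.

```python
import math


def initial_decoder_state(h_backward_first, W_s):
    """s_0 = tanh(W_s h), where h is the backward state of the first source token."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_backward_first)))
        for row in W_s
    ]


W_s = [[0.5, 0.0], [0.0, -0.5]]   # hypothetical weights (n = 2)
h_b1 = [1.0, 2.0]                 # hypothetical backward state of the first token
s_0 = initial_decoder_state(h_b1, W_s)
```

With these values, `s_0` is `[tanh(0.5), tanh(-1.0)]`.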
For each prediction step, they calculate the word probability as:
$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$
Where
 $y_{i-1}$ is the embedding of the token from the previous step.
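 $s_i$ is the decoder's hidden state at step $i$, computed from the previous state, the previous token and the context vector: $s_i = f(s_{i-1}, y_{i-1}, c_i)$.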
 $c_i$ is the context vector.
The context vector, $c_i$, is calculated at each step as a weighted sum of the encoder annotations:
$c_i = \sum\limits_{j=1}^{T_x}\alpha_{ij}h_j$
The weights, $\alpha_{ij}$, are calculated by the alignment (Attention) model.
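The weighted sum itself is simple once the weights are known. A small sketch with made-up annotations and weights (`context_vector` is an illustrative name, and the numbers are toy values chosen to be easy to check by hand):

```python
def context_vector(alphas, annotations):
    """c_i = sum_j alpha_ij * h_j, computed one dimension at a time."""
    dim = len(annotations[0])
    return [
        sum(a * h[d] for a, h in zip(alphas, annotations))
        for d in range(dim)
    ]


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy annotations h_1..h_3
alphas = [0.5, 0.25, 0.25]                           # toy attention weights, summing to 1
c_i = context_vector(alphas, annotations)            # [0.75, 0.5]
```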
Alignment Model (Attention)
The alignment scores are calculated by combining a projection of the decoder's previous state and a projection of the encoder output, then applying a tanh activation followed by a linear combination with another weight vector.
$e_{ij} = v_a^{T} \tanh(W_a s_{i-1} + U_a h_j)$
This function gives us a score for each token in the input sequence. Finally, we apply a softmax to convert the scores into a probability distribution over the input positions:
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}$
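To make the two steps concrete, here is a small sketch of the scoring function followed by a softmax that yields the weights $\alpha_{ij}$. The identity matrices, the all-ones $v_a$ and the toy annotations are assumptions picked so the result is easy to reason about; a trained model would learn $W_a$, $U_a$ and $v_a$. The helper names are illustrative.

```python
import math


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def alignment_scores(s_prev, annotations, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), one score per source position j."""
    projected_state = matvec(W_a, s_prev)  # does not depend on j, so compute once
    scores = []
    for h_j in annotations:
        projected_h = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))
    return scores


def softmax(scores):
    """Normalise scores into weights; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [0.0, 0.0]                                     # toy previous decoder state
annotations = [[1.0, 1.0], [0.0, 0.0], [-1.0, -1.0]]    # toy encoder annotations
scores = alignment_scores(s_prev, annotations, identity, identity, [1.0, 1.0])
alphas = softmax(scores)
```

With these toy inputs the first source position receives the largest weight and the last the smallest, and the weights sum to 1.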
Maxout
The output network, which returns the probabilities for each word, uses a single Maxout hidden layer before the final softmax. A Maxout layer projects the input vector onto multiple "buckets" and keeps only the maximum value from each bucket, which gives a learned, piecewise-linear non-linearity. Maxout was originally proposed as a companion to dropout, but it is an activation function rather than a regulariser in its own right.
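The bucket-and-max step can be sketched in a few lines. The bucket size of 2 is an assumption here (my reading of the paper's appendix is that adjacent units are paired), and the input values are made up; in the real model they would come from a learned linear projection.

```python
def maxout(pre_activations, bucket_size=2):
    """Split the vector into consecutive buckets and keep the max of each."""
    if len(pre_activations) % bucket_size != 0:
        raise ValueError("vector length must be a multiple of bucket_size")
    return [
        max(pre_activations[i:i + bucket_size])
        for i in range(0, len(pre_activations), bucket_size)
    ]


t_tilde = [0.3, -1.2, 2.0, 0.5, -0.7, -0.1]  # toy output of a linear projection
t = maxout(t_tilde)                           # [0.3, 2.0, -0.1]
```

The output is half the length of the input, one value per bucket.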
Training Params
 Algorithm: Stochastic Gradient Descent (SGD)
 Optimiser: Adadelta (Zeiler, 2012)
 Batch Size: 80 sentences
 Training Duration: Approximately 5 days per model
 Inference method: Beam search
 Task: English-to-French translation
 Dataset: bilingual, parallel corpora provided by ACL WMT '14.
 Word count: 850M words (reduced to 348M)
 Components:
 Europarl (61M words)
 News Commentary (5.5M)
 UN (421M)
 two crawled corpora of 90M and 272.5M words, respectively
 Metric: BLEU Score.
 Tokeniser: from the open-source machine translation package Moses. They shortlist the most frequent 30k words and map everything else to [UNK].
 Comparisons: They compare RNNsearch with a standard RNN Encoder-Decoder (RNNencdec) and with Moses, the state-of-the-art translation package.
 Test Set: They evaluate on newstest2014 from WMT '14, which contains 3003 sentences not present in the training data.
 Valid Set: They concatenate newstest2012 and newstest2013.
 Initialisation: Orthogonal for recurrent weights, Gaussian ($0, 0.01^2$) for feedforward weights, zeros for biases.
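The vocabulary shortlist mentioned in the list above can be illustrated with a small sketch. This is not the Moses tokeniser or the authors' preprocessing code; `build_shortlist` and `apply_shortlist` are hypothetical helpers, and the three-sentence corpus with a shortlist of 3 stands in for the real data and the 30k limit.

```python
from collections import Counter

UNK = "[UNK]"


def build_shortlist(tokenised_sentences, size):
    """Keep the `size` most frequent tokens across the corpus."""
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}


def apply_shortlist(sentence, shortlist):
    """Replace every token outside the shortlist with the unknown marker."""
    return [tok if tok in shortlist else UNK for tok in sentence]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
shortlist = build_shortlist(corpus, size=3)            # {"the", "cat", "sat"}
mapped = apply_shortlist(["the", "dog", "ran"], shortlist)
```

Here `mapped` is `["the", "[UNK]", "[UNK]"]`, since "dog" and "ran" each occur only once and fall outside the shortlist.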
Results
They report results both on the full test set and on only those sentences that contain no unknown tokens.
With unknown tokens excluded, RNNsearch-50 achieved a BLEU score of 34.16, significantly outperforming RNNencdec-50, which scored 26.71. Training RNNsearch-50 to convergence (RNNsearch-50*) raised this to 36.15, beating the state-of-the-art Moses system (35.63). However, when sentences with unknown tokens are included, the model falls behind Moses (28.45 versus 33.30).
RNNsearch was much better than RNNencdec on longer sentences.
Figure 2: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences.
| Model | All | No UNK |
|---|---|---|
| RNNencdec-30 | 13.93 | 24.19 |
| RNNsearch-30 | 21.50 | 31.44 |
| RNNencdec-50 | 17.82 | 26.71 |
| RNNsearch-50 | 26.75 | 34.16 |
| RNNsearch-50* | 28.45 | 36.15 |
| Moses | 33.30 | 35.63 |

*Note: RNNsearch-50* was trained much longer, until the performance on the development set stopped improving.
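For a quick read of the table, the gaps between models can be computed directly from the reported scores. The dictionary below only restates the table; `gain` is an illustrative helper name.

```python
# BLEU scores copied from the table above.
bleu = {
    "RNNencdec-30": {"all": 13.93, "no_unk": 24.19},
    "RNNsearch-30": {"all": 21.50, "no_unk": 31.44},
    "RNNencdec-50": {"all": 17.82, "no_unk": 26.71},
    "RNNsearch-50": {"all": 26.75, "no_unk": 34.16},
    "RNNsearch-50*": {"all": 28.45, "no_unk": 36.15},
    "Moses": {"all": 33.30, "no_unk": 35.63},
}


def gain(model, baseline, column):
    """BLEU difference between two models on one column, rounded to 2 decimals."""
    return round(bleu[model][column] - bleu[baseline][column], 2)


attention_gain = gain("RNNsearch-50", "RNNencdec-50", "all")   # about 8.93 BLEU
vs_moses_no_unk = gain("RNNsearch-50*", "Moses", "no_unk")     # about 0.52 BLEU
vs_moses_all = gain("RNNsearch-50*", "Moses", "all")           # negative: Moses is ahead
```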
Interpreting Attention
One benefit of calculating attention weights for each output word is that they are interpretable, allowing us to visualise word alignments.
Figure 3: Four sample alignments that were found by RNNsearch-50.
As we can see, words are typically aligned to similarly positioned words in the source sentence, but not always.
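One simple way to read an alignment off the attention weights is to take, for each output word, the source word with the largest weight. The sketch below does that on a made-up attention matrix; the weights are invented for illustration and are not taken from the paper's figures, and `hard_alignment` is a hypothetical helper.

```python
def hard_alignment(attention, source_tokens, target_tokens):
    """Pair each target word with the source word that has the largest weight."""
    pairs = []
    for target, weights in zip(target_tokens, attention):
        j = max(range(len(weights)), key=weights.__getitem__)
        pairs.append((target, source_tokens[j]))
    return pairs


source = ["the", "European", "Economic", "Area"]
target = ["la", "zone", "économique", "européenne"]
# One row per target word, one column per source word (toy values, rows sum to 1).
attention = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.80],
    [0.05, 0.15, 0.75, 0.05],
    [0.05, 0.80, 0.10, 0.05],
]
pairs = hard_alignment(attention, source, target)
```

In this toy example the alignment is not monotonic: "zone" pairs with "Area" and "européenne" with "European", reflecting the different adjective and noun order in the two languages.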

1. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv. https://arxiv.org/abs/1406.1078

2. Brauwers, G., & Frasincar, F. (2023). A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3279–3298. https://doi.org/10.1109/tkde.2021.3126456