Attention is All You Need
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t.
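This recurrence is exactly what makes RNNs sequential: h_t cannot be computed until h_{t-1} is known. A minimal sketch of a vanilla RNN forward pass (the tanh update rule here is one common choice, not something the paper prescribes) makes the bottleneck visible:

```python
import numpy as np

def rnn_forward(inputs, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b).
    Each step depends on the previous hidden state, so the loop
    over positions cannot be parallelized."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:  # strictly one position at a time
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return states
```

No matter how many processors are available, the t-th iteration must wait for the (t-1)-th, which is the limitation the next paragraph describes.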
The sequential nature of RNNs prevents parallelization across input positions. Transformers were proposed to overcome this limitation: they rely entirely on the attention mechanism rather than recurrence, which makes it possible to process all input positions in parallel. The attention mechanism allows for the "modeling of dependencies without regard to their distance in the input and output sequences". The transformer architecture was proposed in the "Attention is All You Need" paper.
I will try to provide a high-level explanation of the transformer model. For a more detailed study, please refer to the paper here.
Flow of input and output
The left sub-part is the encoder and the right sub-part is the decoder. Suppose we are translating a sentence from English to French. Unlike an RNN, a transformer sees the whole sentence at once. Each word in the sentence is first embedded as a vector in an embedding space where similar words are grouped together. Each word has its own vector representation. However, the same word used in different positions in a sentence can have different meanings. To take this into account, positional encoding, the process of generating vectors that carry information about the distance between words, is used. The vectorized input from the input embedding, together with its context from the positional encoding, is passed as input to the encoder block.
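The paper's positional encoding uses sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (the sequence length and model dimension below are illustrative values):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

# One encoding row per position; each row is added to that
# position's word embedding before entering the encoder.
pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

Because each dimension varies at a different frequency, the difference between two positions' encodings carries relative-distance information, which is the "context" the paragraph above refers to.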
The encoder block consists of two sub-blocks: multi-head attention and a feed-forward network. The attention block determines what part of the sentence needs to be focused on. For a detailed understanding of the attention mechanism, please visit this blog. The attentional vectors generated by the attention block are normalized and sent as input to the feed-forward network. The purpose of the feed-forward network is to transform the attentional vectors into a format acceptable as input to the next encoder or decoder block.
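The core operation inside each attention head is the paper's scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch for a single head (multi-head attention runs several of these in parallel on learned projections and concatenates the results):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V, weights
```

The weights matrix is what "determines what part of the sentence needs to be focused on": row t holds how much position t attends to every other position.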
The transformed attentional vectors from the encoder are sent as input to the decoder. During training, the decoder also receives as input the vectorized form of the output sentence (in our case, the French sentence) and its context. These vectors also pass through an attention block.
The attentional vectors of the output sentence, together with the attentional vectors of the input sentence, are then passed through another attention block. The resulting vectors are normalized, passed through a feed-forward network, and then linearly transformed, followed by a softmax function. The output of the softmax function is the probabilities of the translation. For an explanation of transformers with a specific example, refer to this video. This is another blog which illustratively explains transformers.
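The final softmax simply turns the linear layer's scores into a probability distribution over the target vocabulary. A toy sketch, where the four-word French vocabulary and the logit values are made up purely for illustration:

```python
import numpy as np

def softmax(x):
    """Convert raw scores into probabilities that sum to 1."""
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical tiny vocabulary and decoder scores for one output position.
vocab = ["le", "chat", "est", "noir"]
logits = np.array([2.0, 0.5, 0.1, -1.0])  # output of the final linear layer
probs = softmax(logits)
print(vocab[int(np.argmax(probs))])  # → "le", the most probable word
```

At each decoding step, the word with the highest probability (or one chosen by a search procedure such as beam search) becomes the next word of the translation.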
The paper presents a novel architecture that allows for parallelization of the input and is also much faster than RNNs. Currently, transformers are predominantly used as language models. It would be interesting to see them applied to other modalities. As mentioned in this paper, while parallelization makes the model computationally efficient, is it able to fully exploit the sequential nature of data? Also, is parallelism really being achieved, or do transformers just have larger window sizes than LSTMs can support?