Transformers Explained—Part I


Transformers have revolutionized the world of Deep Learning, well beyond NLP applications. They have become foundational models and deserve an in-depth explanation, so here it is: Transformers Explained, part I!

We have talked about transformers a few times here and there in other deep dives or feature articles. However, they represent such a pivotal innovation in the world of Natural Language Processing and Deep Learning that they deserve an in-depth explanation. In this first part of our series, Transformers Explained, we are going to look at transformers at a high level and try to understand why they are so popular. So go ahead, get comfortable, and let’s dive in!

Transformers – Explaining Why

In the realm of deep learning, the Transformer is a relatively new concept (Vaswani et al. 2017). Transformers are presently focused on NLP, but new research has looked at advanced image processing using similar networks (Qi et al. 2020). The more natural way in which transformers handle sentences is the main reason they have become so central to natural language processing (NLP).
 
The sequence-to-sequence encoder-decoder architecture serves as the basis for sequence transduction tasks. In essence, the idea is to encode the entire input sequence at once and use that encoding as the foundation for generating the target (decoded) sequence.
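
To make the idea concrete, here is the encoder-decoder principle written as a tiny Python sketch. The function names `encode` and `decode` are placeholders made up for illustration; they stand in for whatever networks you plug in (RNNs in the pre-Transformer days, Transformers today).

```python
# A bird's-eye view of sequence transduction with an encoder-decoder.
# encode() and decode() are placeholders for the actual networks.

def transduce(source_sequence, encode, decode):
    encoding = encode(source_sequence)   # the whole input sequence, encoded at once
    target_sequence = decode(encoding)   # the target built from that single encoding
    return target_sequence

# Toy usage with dummy "networks", just to show the data flow:
print(transduce("the cat sat", lambda s: s.split(), lambda e: " ".join(reversed(e))))
```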
 

Before Transformers

Before Transformers entered the scene, RNNs in all their flavors, such as LSTMs and GRUs, were the standard architecture for all NLP applications.

In the conventional sequence-to-sequence paradigm, separate RNNs are used for the encoder and the decoder. The encoded sequence is the hidden state of the encoder RNN. Using this encoding and (typically) word-level generative modeling, conventional sequence-to-sequence models generate the target sequence. Since squeezing the whole input into a single encoding makes it difficult to maintain context over longer sequences, the well-known Attention mechanism was added as a final layer. Its role is to “pay attention” to the crucial words in the sequence, those that contribute most to the generation of the target sequence. Each word in the input sequence is given a different attention score, according to how much it influences the generation of the target sequence.
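
As a rough sketch of this pre-Transformer recipe, here is a minimal PyTorch model. The sizes are illustrative and the attention is a simple dot-product variant, not the exact formulation of any particular paper: an RNN encoder, an RNN decoder, and an attention layer that scores every encoder state at every decoding step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch of the pre-Transformer recipe: RNN encoder + RNN decoder,
# with a simple dot-product attention over the encoder states.
# Dimensions and the attention variant are illustrative choices.

VOCAB, EMB, HID = 1000, 64, 64

class Seq2SeqWithAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB, EMB)
        self.tgt_embed = nn.Embedding(VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(2 * HID, VOCAB)

    def forward(self, src, tgt):
        enc_states, enc_hidden = self.encoder(self.src_embed(src))     # (B, S, HID)
        dec_states, _ = self.decoder(self.tgt_embed(tgt), enc_hidden)  # (B, T, HID)

        # Attention: each decoder step scores every encoder state,
        # then takes a weighted sum ("pays attention" to important words).
        scores = torch.bmm(dec_states, enc_states.transpose(1, 2))     # (B, T, S)
        weights = F.softmax(scores, dim=-1)                            # attention scores
        context = torch.bmm(weights, enc_states)                       # (B, T, HID)

        return self.out(torch.cat([dec_states, context], dim=-1))      # word-level logits

model = Seq2SeqWithAttention()
src = torch.randint(0, VOCAB, (1, 7))   # toy source sentence: 7 token ids
tgt = torch.randint(0, VOCAB, (1, 5))   # toy target prefix: 5 token ids
print(model(src, tgt).shape)            # torch.Size([1, 5, 1000])
```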

RNNs have Their Own Problems

These architectures presented two main issues:

  1. Longer sentences were hard to encode without losing meaning. Because of the nature of the RNN, encoding dependencies between words that were far apart in the sentence was challenging. This is due to the vanishing gradient problem.
  2. The model processes the input sequence one word at a time. This means that until all the computation for time step t1 is finished, we cannot start computing for time step t2 (as the short sketch after this list illustrates). In other words, the computation is time-consuming, which is problematic for both training and inference.
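
The sequential bottleneck from point 2 is easy to see in code: the hidden state at step t is a function of the hidden state at step t-1, so the loop below cannot be parallelized across time steps. A schematic sketch with a PyTorch GRU cell and made-up sizes:

```python
import torch
import torch.nn as nn

# Why RNNs are sequential: h_t depends on h_{t-1}, so the time steps
# below must run one after the other. Sizes are illustrative.

cell = nn.GRUCell(input_size=64, hidden_size=64)
sentence = torch.randn(10, 64)   # 10 word embeddings, one per time step
h = torch.zeros(64)

for x_t in sentence:             # step t2 cannot start before step t1 ends
    h = cell(x_t.unsqueeze(0), h.unsqueeze(0)).squeeze(0)

print(h.shape)  # torch.Size([64]) -- the final hidden state, i.e. the encoding
```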

(Figure: RNNs vs Transformers encoding approach)

Transformers Explained – Why

The Transformer architecture addresses both of these issues. It completely abandons RNNs and relies solely on the strengths of Attention.
Transformers process every word in the sequence in parallel, which considerably speeds up computation.
It also doesn’t matter how close or far apart two words are in the input sequence: dependencies between nearby words and between distant words are handled equally well.
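
To give a flavor of what “processing every word in parallel” means, here is a minimal PyTorch sketch of scaled dot-product self-attention (names and sizes are illustrative, and the proper formulation with learned projections is for part II). Every word attends to every other word in one batch of matrix products, with no loop over time steps and no special role for distance.

```python
import torch
import torch.nn.functional as F

# A minimal sketch of scaled dot-product self-attention (illustrative sizes).
# All words are processed at once: no loop over time steps, just matrix products.

torch.manual_seed(0)
words = torch.randn(7, 64)               # 7 word vectors of a toy sentence

# In a real Transformer, queries/keys/values come from learned projections;
# here we reuse the word vectors directly to keep the sketch short.
Q, K, V = words, words, words

scores = Q @ K.T / (K.shape[-1] ** 0.5)  # every word scored against every word,
                                         # near or far apart alike
weights = F.softmax(scores, dim=-1)      # attention weights, one row per word
output = weights @ V                     # each word's new, context-aware vector

print(weights.shape, output.shape)       # torch.Size([7, 7]) torch.Size([7, 64])
```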

Explaining Transformers to my Mum

Ok, so now we have seen why transformers are so popular. It is now time to look at how they work. However, transformers are complex yet elegantly simple models, and it’s hard to do them justice in a few words without forgetting any detail. So for the moment, I’ll explain transformers at a very high level, pretty much as I would explain them to my mum. So here we go.

For the moment, let’s just imagine a transformer as a big box. This big box takes in some information, like a phrase or a picture, and gives you back something in return, depending on what type of box you have. Some of these boxes take a picture and give you back a description of said picture; others take a paragraph and give you back a summary. Some answer questions, and some paraphrase sentences. Others yet take a phrase and give you back a picture.

Now, this is not a magic box. There is a really simple yet powerful mechanism behind all this: the Attention mechanism. It allows the Transformer to take in all the information at once (all the words in a sentence, or all the pixels in a picture, for example) and decide which pieces are the most important ones in order to correctly decode the information. Exactly like humans do when they are talking, translating, studying, and so on!
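
If you want a tiny numerical taste of “deciding which words matter most”, here is a toy example with completely made-up relevance scores, just for intuition:

```python
import torch

# Toy illustration only: made-up relevance scores for the word "it"
# looking at the other words of "The cat sat because it was tired".
words  = ["The", "cat", "sat", "because", "it", "was", "tired"]
scores = torch.tensor([0.1, 3.0, 0.5, 0.2, 0.4, 0.3, 1.5])

weights = torch.softmax(scores, dim=0)   # turn scores into attention weights
for word, w in zip(words, weights):
    print(f"{word:>8}: {w:.2f}")         # "cat" gets the biggest share of attention
```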

In the next episode of this saga, we are going to learn how Attention is used in a Transformer. We are going to get technical, but don’t worry, it’s going to be fun!