Transformers Explained

August 22, 2022

Since their introduction in 2017, Transformers have revolutionized the field of natural language processing. Before Transformers, recurrent neural networks (RNNs), and LSTMs in particular, were the state of the art. A key reason Transformers consistently outperform them is that a recurrent model reads a sentence one token at a time, typically from left to right, so each word is interpreted before the words that follow it have been seen. For example, suppose we had the following sentences:

  • On the river bank
  • On the bank of the river

A left-to-right LSTM or RNN wouldn’t realize that, in the second sentence, the word “bank” refers to the land beside a stream of water and not a financial institution, because it reads “bank” before it reaches “river”. In contrast, a Transformer is able to handle this scenario because it doesn’t read the words one after the other. Rather, it accepts the entire sentence at once.

The architecture described in the paper Attention Is All You Need consists of an encoder and decoder.

Transformer model for language understanding

Input Embeddings

Transformers do not accept raw text as input. Thus, like we do for other models, we generate the word embeddings for the input sequence.
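
As a rough illustration, an embedding layer can be thought of as a lookup table indexed by token id. The sketch below assumes a made-up five-word vocabulary and a random, untrained table (the names vocab, embedding_table and embed are hypothetical); in a real model the table is learned during training.

```python
import numpy as np

# Hypothetical toy vocabulary; real models use tens of thousands of (sub)word tokens.
vocab = {"on": 0, "the": 1, "river": 2, "bank": 3, "of": 4}
d_model = 8  # embedding size, chosen arbitrarily for this example

rng = np.random.default_rng(0)
# One row per token. Random here; learned during training in practice.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(sentence):
    """Map a whitespace-separated sentence to a (seq_len, d_model) array."""
    token_ids = [vocab[word] for word in sentence.lower().split()]
    return embedding_table[token_ids]

embedded = embed("On the bank of the river")
print(embedded.shape)  # (6, 8)
```

Note that both occurrences of "the" receive exactly the same vector, which is why position has to be encoded separately.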


Positional Encoding

Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to one another. However, the embeddings do not encode the relative position of the tokens in a sentence.

As the name implies, positional encoding encodes the position of the words in the sequence.


The formula for calculating the positional encoding is:

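PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here pos is the position of the token in the sequence, i indexes the pairs of embedding dimensions, and d_model is the size of the embedding.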

Positional encoding works because absolute position is less important than relative position. For instance, we don’t need to know that the word “good” is at index 6 and the word “looks” is at index 5. It’s sufficient to remember that the word “good” tends to follow the word “looks”.

Here’s a plot generated using a sequence length of 100 and embedding space of 512 dimensions:


For the first dimension, a value of 1 indicates an odd position and a value of 0 indicates an even position. For the (d/2)-th dimension, a value of 1 tells us the word is in the second half of the sentence, and a value of 0 tells us it’s in the first half. The model can use this information to determine the relative position of the tokens.
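
The sinusoidal encoding from Attention Is All You Need can be sketched in NumPy roughly as follows. This is a minimal version that assumes the interleaved layout from the paper (sine on even dimensions, cosine on odd ones); the function name positional_encoding is our own, and the sequence length and embedding size match the plot described above.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares the frequency 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

pe = positional_encoding(seq_len=100, d_model=512)
print(pe.shape)  # (100, 512)
```

The low dimensions oscillate quickly from one position to the next, while the high dimensions change very slowly across the sequence.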

Encoder Input

After adding the positional encoding to the embedding vector, tokens will be closer to each other based on the similarity of their meaning and their position in the sentence.



The Encoder’s job is to map all input sequences into an abstract continuous representation that holds the learned information (i.e. how words relate to one another).


Scaled Dot-Product Attention

After feeding the query, key, and value vectors through a linear layer, we calculate the dot product of the query and key vectors. The values in the resulting matrix determine how much attention should be paid to the other words in the sequence given the current word. In other words, each word (row) will have an attention score for every other word (column) in the sequence.


The dot product is divided by the square root of the depth. This is done because, for large values of depth, the dot product grows large in magnitude, pushing the softmax function into regions where its gradients are very small, which makes learning difficult.
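
A quick numerical check makes the effect of the scaling visible. The sketch below assumes, purely for illustration, that the query and key entries are independent with zero mean and unit variance; the helper dot_product_std is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(depth, n_samples=2000):
    """Empirical standard deviation of q.k for random unit-variance vectors."""
    q = rng.normal(size=(n_samples, depth))
    k = rng.normal(size=(n_samples, depth))
    return np.sum(q * k, axis=1).std()

for depth in (4, 64, 512):
    raw = dot_product_std(depth)
    scaled = raw / np.sqrt(depth)
    print(f"depth={depth:4d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(depth))={scaled:.2f}")
```

Under these assumptions the spread of the raw dot product grows roughly like the square root of the depth, while the scaled version stays close to 1 regardless of depth.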

Once the values have been scaled, we apply a softmax function to each row to obtain values between 0 and 1 that sum to 1.


Finally, we multiply the resulting matrix by the value vectors, so that the output for each word is a weighted sum of the values.
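
Putting the steps together, scaled dot-product attention might look like the following NumPy sketch. It assumes a single sequence with no batch dimension and no masking, and uses random arrays as stand-ins for the projected queries, keys and values; the function names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k: (seq_len, depth); v: (seq_len, depth_v)."""
    depth = q.shape[-1]
    scores = q @ k.T / np.sqrt(depth)    # (seq_len, seq_len) scaled attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, depth = 4, 8
q = rng.normal(size=(seq_len, depth))
k = rng.normal(size=(seq_len, depth))
v = rng.normal(size=(seq_len, depth))

output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```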


Multi-Headed Attention

Instead of a single attention head, Q, K, and V are split into multiple heads. This allows the model to jointly attend to information from different representation subspaces at different positions.

For example, given the word “the”, the first head will give more attention to the word “bank” whereas the second head will give more attention to the word “river”.


It’s important to note that after the split each head has a reduced dimensionality. Thus, the total computation cost is the same as a single head attention with full dimensionality.

The attention output for each head is concatenated and put through a Dense layer.
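
The split, per-head attention, concatenation and final Dense layer can be sketched as below. This is a simplified version under several assumptions: a single sequence without a batch dimension, no biases, no masking, and random weight matrices (w_q, w_k, w_v, w_o are hypothetical names) in place of learned ones.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_model // num_heads)."""
    seq_len, d_model = x.shape
    return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # linear projections
    q, k, v = (split_heads(t, num_heads) for t in (q, k, v))
    depth = q.shape[-1]                                  # reduced per-head dimensionality
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(depth)   # (heads, seq_len, seq_len)
    heads = softmax(scores) @ v                          # (heads, seq_len, depth)
    # Concatenate the heads back into (seq_len, d_model), then apply the output layer.
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], -1)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (5, 16)
```

Each head here works with vectors of size 4 instead of 16, which is why the overall cost stays comparable to a single full-size head.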

The Residual Connections, Layer Normalization, and Feed Forward Network

The input to the attention block (for the first layer, the original positional input embedding) is added to the multi-headed attention output vector. This is known as a residual connection. Each sub-layer has a residual connection around it, followed by a layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.


Finally, the output passes through a point-wise feed-forward network.
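
A minimal sketch of these pieces is shown below. It assumes a layer normalization without the learned scale and shift parameters, a ReLU activation inside the feed-forward network, and a random array standing in for the attention output; all names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Point-wise feed-forward network: two linear layers with a ReLU in between."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_sublayers(x, attention_output, w1, b1, w2, b2):
    # Residual connection around the attention sub-layer, then layer normalization.
    h = layer_norm(x + attention_output)
    # Residual connection around the feed-forward sub-layer, then layer normalization.
    return layer_norm(h + feed_forward(h, w1, b1, w2, b2))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
x = rng.normal(size=(seq_len, d_model))
attention_output = rng.normal(size=(seq_len, d_model))  # stand-in for the attention result
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = encoder_sublayers(x, attention_output, w1, b1, w2, b2)
print(out.shape)  # (5, 16)
```

The feed-forward network is applied to every position independently, which is what "point-wise" refers to.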



The decoder’s job is to generate text. The decoder has similar hidden layers to the encoder. However, unlike the encoder, the decoder’s output is sent to a softmax layer in order to compute the probability of the next word in the sequence.

Decoder Input Embeddings & Positional Encoding

The decoder is autoregressive, meaning that it predicts future values based on previous values. To be exact, the decoder predicts the next token in the sequence by looking at the encoder’s output and self-attending to its own previous output. Just like we did with the encoder, we add the positional encodings to the word embedding to capture the position of the tokens in the sentence.



Since the decoder is trying to generate the sequence word by word, a look-ahead mask is used to indicate which entries should not be used. For example, when predicting the third token in the sentence, only the previous tokens, that is, the first and second tokens, should be used.
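
One common way to build such a mask is with an upper-triangular matrix. The sketch below assumes the convention that 1 marks a position to hide, and applies the mask by adding a large negative number to the scores before the softmax; the helper names are our own.

```python
import numpy as np

def look_ahead_mask(size):
    """1 marks a future position that must be hidden, 0 a position that may be used."""
    return np.triu(np.ones((size, size)), k=1)

def masked_softmax(scores, mask):
    # A very large negative score becomes zero after the softmax.
    scores = scores + mask * -1e9
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

mask = look_ahead_mask(4)
print(mask)
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]

# With all-zero scores, the unmasked positions share the attention equally.
weights = masked_softmax(np.zeros((4, 4)), mask)
print(weights[1])  # the row used to predict the third token
```

The second row places a weight of 0.5 on each of the first two tokens and 0 on everything after them, which matches the example of predicting the third token.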



As mentioned previously, the output of the hidden layers goes through a final softmax layer. If we have a vocabulary of 10,000 words, then the output of the classifier will be a vector of length 10,000 where the value at each index is the probability that the word associated with that index is the next word in the sequence.


We take the word with the highest probability and append it to the sequence that is fed back into the decoder at the next decoding step.
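
This pick-the-most-likely-word loop is often called greedy decoding. Here is a minimal sketch of it, assuming a hypothetical toy_logits function in place of a trained decoder and a made-up vocabulary of six token ids.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def greedy_decode(next_token_logits, start_id, end_id, max_len):
    """Repeatedly pick the most probable next token and feed it back in."""
    sequence = [start_id]
    for _ in range(max_len):
        probs = softmax(next_token_logits(sequence))  # one probability per vocabulary entry
        next_id = int(np.argmax(probs))
        sequence.append(next_id)
        if next_id == end_id:
            break
    return sequence

# Hypothetical stand-in for a trained decoder: it always favours (last token + 1).
VOCAB_SIZE = 6

def toy_logits(sequence):
    logits = np.zeros(VOCAB_SIZE)
    logits[(sequence[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits

print(greedy_decode(toy_logits, start_id=0, end_id=4, max_len=10))  # [0, 1, 2, 3, 4]
```

Greedy decoding is only one possible strategy; alternatives such as sampling or beam search use the same probability vector differently.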


The Google Transformer model for language understanding tutorial already does an excellent job of demonstrating how to code a Transformer from scratch using TensorFlow Keras. Thus, we will instead see how we can download and make use of one of the pre-trained models.

To begin, we install and import the required libraries.

! pip install -q -U "tensorflow-text==2.8.*" tf-models-official==2.7.0
import tensorflow as tf  
import tensorflow_hub as hub  
import tensorflow_text as text  
from official.nlp import optimization  # to create AdamW optimizer
import matplotlib.pyplot as plt
import os  
import shutil

We download the IMDB dataset using the Keras utility function.

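dataset = tf.keras.utils.get_file('aclImdb_v1.tar.gz', '...',  # dataset URL is missing from the source
                                  untar=True, cache_dir='.',
                                  cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(dataset_dir, 'train')
# Remove the unlabeled reviews so that only the pos and neg folders remain.
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)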

We create training, validation and testing datasets from the input data.

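AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
# The call arguments were lost in the source. The 80/20 split is inferred from
# the 625 steps per epoch in the training log (625 * 32 = 20,000 reviews).
raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2, subset='training', seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2, subset='validation', seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir, 'test'), batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)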

We print a few records to get a better sense of what we’re working with.

for text_batch, label_batch in train_ds.take(1):  
  for i in range(3):  
    print(f'Review: {text_batch.numpy()[i]}')  
    label = label_batch.numpy()[i]  
    print(f'Label : {label} ({class_names[label]})')
Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs....'  
Label : 0 (neg)  
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into complicated situations, and so does the perspective of the viewer...."  
Label : 0 (neg)  
Review: b'Great documentary about the lives of NY firefighters during the worst terrorist attack of all time....'  
Label : 1 (pos)

We will download and use the pre-trained BERT models from TensorFlow Hub.

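tfhub_handle_encoder = '...'     # TensorFlow Hub encoder URL is missing from the source
tfhub_handle_preprocess = '...'  # TensorFlow Hub preprocessing URL is missing from the source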

The pre-processing model takes a sentence and tokenizes it. Notice how it also pads the sequence to a fixed length of 128 tokens, so that every input to the BERT model has the same shape.

bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)  
text_test = ['what a great movie!']  
text_preprocessed = bert_preprocess_model(text_test)
print(f'Keys       : {list(text_preprocessed.keys())}')  
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')  
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')  
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')  
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')
Keys       : ['input_mask', 'input_type_ids', 'input_word_ids']  
Shape      : (1, 128)  
Word Ids   : [ 101 2054 1037 2307 3185  999  102    0    0    0    0    0]  
Input Mask : [1 1 1 1 1 1 1 0 0 0 0 0]  
Type Ids   : [0 0 0 0 0 0 0 0 0 0 0 0]
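
The relationship between the three outputs can be sketched in plain Python. This only illustrates the padding logic and is not the preprocessing model's actual implementation; the pad_to_length helper is hypothetical, it assumes the input is no longer than the target length, and the token ids are copied from the output above.

```python
def pad_to_length(token_ids, seq_length, pad_id=0):
    """Pad a list of token ids and build the matching input mask and type ids."""
    padding = [pad_id] * (seq_length - len(token_ids))
    word_ids = token_ids + padding
    input_mask = [1] * len(token_ids) + [0] * len(padding)  # 1 = real token, 0 = padding
    type_ids = [0] * seq_length                             # single-sentence input: all zeros
    return word_ids, input_mask, type_ids

# Token ids printed above for 'what a great movie!', including the start and end markers.
tokens = [101, 2054, 1037, 2307, 3185, 999, 102]
word_ids, input_mask, type_ids = pad_to_length(tokens, seq_length=128)
print(word_ids[:12])
print(input_mask[:12])
print(type_ids[:12])
```

The input mask lets the encoder ignore the padded positions when computing attention.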

We define a function to build our classifier model. We add a dense layer on top of the encoder’s pooled output that returns a single logit. Once a sigmoid is applied, this becomes a value ranging from 0 to 1, where a value of 1 implies that the review is positive and a value of 0 implies that the review is negative.

def build_classifier_model():  
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')  
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')  
  encoder_inputs = preprocessing_layer(text_input)  
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')  
  outputs = encoder(encoder_inputs)  
  net = outputs['pooled_output']  
  net = tf.keras.layers.Dropout(0.1)(net)  
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)  
  return tf.keras.Model(text_input, net)

We call the function and examine the model layers in closer detail. We ensure the parameters are trainable since we want to fine tune the model.

classifier_model = build_classifier_model()  

We define a number of hyperparameters such as the number of epochs, steps and the learning rate.

epochs = 5  
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()  # number of batches in the training set
num_train_steps = steps_per_epoch * epochs  
num_warmup_steps = int(0.1*num_train_steps)
init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr, num_train_steps=num_train_steps, num_warmup_steps=num_warmup_steps, optimizer_type='adamw')
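
To make the warmup idea concrete, here is a rough sketch of a schedule with linear warmup followed by linear decay to zero. This is an assumption about the general shape of such schedules, not a copy of what optimization.create_optimizer does internally; the learning_rate_at helper is hypothetical, and the step counts are taken from the training log (625 steps per epoch).

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to init_lr, then linear decay to zero (illustrative only)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return init_lr * remaining / (num_train_steps - num_warmup_steps)

# 625 steps per epoch for 5 epochs, with 10% of the steps used for warmup.
num_train_steps = 625 * 5
num_warmup_steps = int(0.1 * num_train_steps)
for step in (0, num_warmup_steps // 2, num_warmup_steps, num_train_steps):
    print(step, learning_rate_at(step, 3e-5, num_train_steps, num_warmup_steps))
```

Starting from a small learning rate is a common way to avoid disturbing the pre-trained weights too much during the first updates.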

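We compile the model using binary cross-entropy for the loss function and AdamW for the optimizer. Because the classifier layer outputs a raw logit, the loss is computed with from_logits=True.

loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
metrics = tf.keras.metrics.BinaryAccuracy()
classifier_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)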

We train the model.

history = classifier_model.fit(x=train_ds, validation_data=val_ds, epochs=epochs)
Epoch 1/5  
625/625 [==============================] - 137s 203ms/step - loss: 0.5083 - binary_accuracy: 0.7452 - val_loss: 0.3831 - val_binary_accuracy: 0.8364  
Epoch 2/5  
625/625 [==============================] - 122s 195ms/step - loss: 0.3284 - binary_accuracy: 0.8520 - val_loss: 0.3700 - val_binary_accuracy: 0.8450  
Epoch 3/5  
625/625 [==============================] - 121s 194ms/step - loss: 0.2530 - binary_accuracy: 0.8949 - val_loss: 0.3833 - val_binary_accuracy: 0.8522  
Epoch 4/5  
625/625 [==============================] - 121s 194ms/step - loss: 0.1967 - binary_accuracy: 0.9232 - val_loss: 0.4424 - val_binary_accuracy: 0.8534  
Epoch 5/5  
625/625 [==============================] - 121s 193ms/step - loss: 0.1612 - binary_accuracy: 0.9385 - val_loss: 0.4716 - val_binary_accuracy: 0.8504

We evaluate the accuracy of our model on the testing dataset.

loss, accuracy = classifier_model.evaluate(test_ds)
print(f'Loss: {loss}')  
print(f'Accuracy: {accuracy}')
Loss: 0.4483765959739685  
Accuracy: 0.8543199896812439

We plot the loss and accuracy of our model over time.

history_dict = history.history
acc = history_dict['binary_accuracy']  
val_acc = history_dict['val_binary_accuracy']  
loss = history_dict['loss']  
val_loss = history_dict['val_loss']
epochs = range(1, len(acc) + 1)  
fig = plt.figure(figsize=(10, 6))  
plt.subplot(2, 1, 1)  
plt.plot(epochs, loss, 'r', label='Training loss')  
plt.plot(epochs, val_loss, 'b', label='Validation loss')  
plt.title('Training and validation loss')  
plt.subplot(2, 1, 2)  
plt.plot(epochs, acc, 'r', label='Training acc')  
plt.plot(epochs, val_acc, 'b', label='Validation acc')  
plt.title('Training and validation accuracy')  
plt.legend(loc='lower right')

For the sake of understanding, we perform inference on a few examples.

examples = [  
    'this is such an amazing movie!',  
    'The movie was great!',  
    'The movie was meh.',  
    'The movie was okish.',  
    'The movie was terrible...'
]
results = tf.sigmoid(classifier_model(tf.constant(examples)))
result_for_printing = \  
    [f'input: {examples[i]:<30} : score: {results[i][0]:.6f}'  
                         for i in range(len(examples))]  
print(*result_for_printing, sep='\n')
input: this is such an amazing movie! : score: 0.999392  
input: The movie was great!           : score: 0.991764  
input: The movie was meh.             : score: 0.515988  
input: The movie was okish.           : score: 0.009715  
input: The movie was terrible...      : score: 0.001295

As we can see, the model does a pretty good job of classifying the sentences as either positive or negative. That being said, “okish” probably should have been closer to “meh” than “terrible”.
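
To turn these scores into class labels, one simple option is to apply a cut-off. The sketch below assumes a threshold of 0.5 and a hypothetical to_label helper; the scores are copied from the output above.

```python
# Sigmoid scores copied from the inference output above.
scores = {
    'this is such an amazing movie!': 0.999392,
    'The movie was great!': 0.991764,
    'The movie was meh.': 0.515988,
    'The movie was okish.': 0.009715,
    'The movie was terrible...': 0.001295,
}

def to_label(score, threshold=0.5):
    """Map a sigmoid score to a class name; the 0.5 cut-off is an assumption."""
    return 'pos' if score >= threshold else 'neg'

for sentence, score in scores.items():
    print(f'{sentence:<30} -> {to_label(score)}')
```

With this cut-off, “meh” lands just barely on the positive side, which is a reminder that scores near 0.5 carry little confidence either way.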

Written by Cory Maklin