In this post, I will present a detailed example of the math used inside the transformer model so that you get a good idea of how the model works. To make the post clear, I’ll simplify a lot of things. We’ll be doing quite a bit of manual calculations, so we’ll reduce the dimensionality of the model. For example, instead of embeddings of 512 values, we use embeddings of 4 values. This will make the calculations easier to understand. We use arbitrary vectors and matrices, but you can choose your own values if you want.
As you’ll see, the math of the model isn’t that complicated. The complexity comes from the number of steps and parameters. Before reading this article, I recommend reading the post The Illustrated Transformer [Russian translation on Habr] (or reading the two side by side). It’s a great post that explains the Transformer model in an intuitive (and visual!) way, so I won’t re-explain what is already covered there. My goal is to explain how the transformer model works, not what it is. If you want to go deeper, check out the well-known paper Attention Is All You Need [Russian translation on Habr: parts one and two].
Prerequisites
Basic knowledge of linear algebra is required to understand the article; Basically, we’ll be doing simple matrix multiplications, so you don’t have to be an expert. In addition, knowledge of the basics of machine learning and deep learning will be useful.
What is covered in the article?

- A complete example of the mathematical calculations that take place in the Transformer model during inference
- An explanation of the attention mechanisms
- An explanation of residual connections and layer normalization
- Code to scale the model up
Our goal will be to use the transformer model as a translation tool, so that we can pass input data to the model and expect it to generate a translation. For example, we can transmit “Hello World” in English and expect to receive “Hola Mundo” in Spanish at the exit.
Let’s take a look at the scary Transformer diagram (fear not, you’ll soon figure it out!):
The original transformer model consists of two parts: an encoder and a decoder. The encoder handles the “comprehension” or “understanding” of the input text, while the decoder generates the output text. Let’s take a look at the encoder.
Encoder
The purpose of the encoder is to produce a rich embedding representation of the input text. This embedding captures semantic information about the input and is passed to the decoder to generate the output text. The encoder consists of a stack of N layers. Before we move on to the layers, we need to understand how to pass words (or tokens) into the model.
Note
The term “embedding” gets used a lot. First, we’ll create an embedding that serves as the input to the encoder. The encoder also outputs embeddings (sometimes called hidden states). The decoder receives embeddings too! The whole point of an embedding is to represent a token as a vector.
0. Tokenization
Machine learning models work with numbers, not text, so we need to turn the input text into numbers. That’s exactly what tokenization does! For example, we can split the text “Hello World” into two tokens: “Hello” and “World”. We can also break it down into characters: “H”, “e”, “l”, “l”, “o”, “ ”, “W”, “o”, “r”, “l”, “d”. The choice of tokenization scheme depends on the data we work with.
Word tokenization (breaking text into words) requires a very large amount of dictionary (all possible tokens). In it, words like “dog” and “dogs” or “run” and “running” will be different tokens. A vocabulary of characters would require less space but would have less meaning (it can be useful in languages like Chinese, where each character contains more information).
Progress has moved toward subword tokenization. It’s a middle ground between word-level and character-level tokenization: we split words into subwords. For example, the word “tokenization” can be broken down into “token” and “ization”. How do we decide where to split words? That’s part of training the tokenizer, a statistical process that identifies the subwords best suited to a particular dataset. It’s deterministic (unlike the training of a machine learning model).
In this article, I’ll use word-level tokenization for simplicity. Our goal is to translate “Hello World” from English to Spanish. We’ll split “Hello World” into the tokens “Hello” and “World”. For example, “Hello” can be token 1 and “World” token 2.
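As a sketch, the word-level scheme above can be written in a few lines (the vocabulary and token IDs here are made up for illustration; real tokenizers learn their vocabulary from a training corpus):

```python
# A toy word-level tokenizer. The vocabulary and the token IDs
# below are made up for illustration.
vocab = {"Hello": 1, "World": 2}

def tokenize(text):
    # Split on whitespace and map each word to its ID.
    return [vocab[word] for word in text.split()]

print(tokenize("Hello World"))  # [1, 2]
```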
1. Text Embedding
While we could pass the token IDs (i.e. 1 and 2) to the model, these numbers carry no meaning. We need to turn them into vectors (lists of numbers). That’s exactly what embedding does! Token embeddings map a token ID to a fixed-length vector that captures the token’s semantic meaning. This creates interesting properties: similar tokens will have similar embeddings (in other words, computing the cosine similarity between two embeddings gives us a good measure of how similar the tokens are).
It’s worth noting that the token-to-embedding mapping is learned by the model. Although we could use pre-trained embeddings like word2vec or GloVe, transformer models learn these embeddings during training. This is a big advantage because the model can learn the best token representation for the task at hand. For example, a model might learn that “dog” and “dogs” should have similar embeddings.
All embeddings in a model have the same size. The transformer from the original paper used size 512, but to keep our calculations manageable, we’ll reduce it to 4. I’ll assign arbitrary values to each token (as mentioned above, this mapping is normally learned by the model).
“Hello” → [1, 2, 3, 4]
“World” → [2, 3, 4, 5]
Note
After this article was published, many readers asked about the embeddings above. I was lazy and simply wrote down numbers that are convenient to calculate with. In practice, these values would be learned by the model. I’ve updated the post to make that clearer.
We can measure the similarity of these vectors using cosine similarity, which would be suspiciously high for the vectors above. In practice, an embedding vector would look more like [0.071, 0.344, 0.12, 0.026, …, 0.008].
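To make the cosine-similarity point concrete, here is the calculation for the two toy vectors above:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

hello = np.array([1, 2, 3, 4])
world = np.array([2, 3, 4, 5])
print(cosine_similarity(hello, world))  # ≈ 0.9938, suspiciously high
```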
We can represent our input as a single matrix: [[1, 2, 3, 4], [2, 3, 4, 5]].
Note
Although we can treat the two embeddings as two separate vectors, it’s easier to work with them as a single matrix because we’ll be multiplying the matrices later on.
2. Positional Encoding
The individual embeddings in the matrix contain no information about the position of words in the sentence, so we need to add it. This is done by adding a positional encoding to the embedding.
There are different ways to obtain it: we can use learned embeddings or a fixed vector. The original paper uses a fixed vector, because its authors saw almost no difference between the two approaches (see section 3.5 of the paper). We’ll use a fixed vector too. Sine and cosine have a wave-like pattern and repeat; using them, each position in the sequence receives a unique yet consistent positional encoding, and their periodicity helps the model learn patterns such as proximity and distance between elements. The paper uses the following functions:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
The idea is to alternate sine and cosine for each value in the embedding (even indices use sine, odd indices use cosine). Let’s calculate them for our example!
For “Hello” (position 0):

- i = 0 (even): PE(0,0) = sin(0 / 10000^(0 / 4)) = sin(0) = 0
- i = 1 (odd): PE(0,1) = cos(0 / 10000^(2*1 / 4)) = cos(0) = 1
- i = 2 (even): PE(0,2) = sin(0 / 10000^(2*2 / 4)) = sin(0) = 0
- i = 3 (odd): PE(0,3) = cos(0 / 10000^(2*3 / 4)) = cos(0) = 1
For “World” (position 1):

- i = 0 (even): PE(1,0) = sin(1 / 10000^(0 / 4)) = sin(1 / 10000^0) = sin(1) ≈ 0.84
- i = 1 (odd): PE(1,1) = cos(1 / 10000^(2*1 / 4)) = cos(1 / 10000^0.5) ≈ cos(0.01) ≈ 0.99
- i = 2 (even): PE(1,2) = sin(1 / 10000^(2*2 / 4)) = sin(1 / 10000^1) ≈ 0
- i = 3 (odd): PE(1,3) = cos(1 / 10000^(2*3 / 4)) = cos(1 / 10000^1.5) ≈ 1
As a result, we get the following:

- “Hello” → [0, 1, 0, 1]
- “World” → [0.84, 0.99, 0, 1]
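The hand calculations above can be checked with a small NumPy function. Note that it follows the same convention as the calculations above (for dimension index i, the exponent is 2i/d):

```python
import numpy as np

def positional_encoding(pos, d=4):
    # Even dimensions use sine, odd dimensions use cosine,
    # matching the hand calculations above.
    pe = np.zeros(d)
    for i in range(d):
        angle = pos / 10000 ** (2 * i / d)
        pe[i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

print(positional_encoding(0))  # [0. 1. 0. 1.]
print(positional_encoding(1))  # cos(0.01) ≈ 0.99995, truncated to 0.99 in the text
```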
Note that these encodings have the same dimension as the original embedding.
Note
We use sine and cosine as in the original paper, but there are other ways to do this. For example, the very popular transformer BERT uses trainable positional embeddings.
3. Adding the Positional Encoding and the Embedding
Now we add the positional encoding to the embedding by summing the two vectors.
“Hello” = [1, 2, 3, 4] + [0, 1, 0, 1] = [1, 3, 3, 5]
“World” = [2, 3, 4, 5] + [0.84, 0.99, 0, 1] = [2.84, 3.99, 4, 6]
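Putting the pieces together, the encoder’s input matrix is just the sum of the token embeddings and their positional encodings:

```python
import numpy as np

# Token embeddings (arbitrary values from above) plus the
# positional encodings we just calculated.
token_embeddings = np.array([[1, 2, 3, 4], [2, 3, 4, 5]], dtype=float)
pos_encoding = np.array([[0, 1, 0, 1], [0.84, 0.99, 0, 1]])

E = token_embeddings + pos_encoding
print(E)
# [[1.   3.   3.   5.  ]
#  [2.84 3.99 4.   6.  ]]
```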
That is, our new matrix, which will be the input to the encoder, looks like this:
If you look at the diagram from the research paper, you’ll see that we have just covered its lower-left part (embedding + positional encoding).
4. Self-Attention
4.1 Defining matrices
Now let’s introduce the concept of multi-head attention. Attention is a mechanism that allows the model to focus on specific parts of the input. Multi-head attention lets the model jointly attend to information from different representation subspaces. To do this, it uses several attention heads, each with its own K, V, and Q matrices.
Let’s use two attention heads in our example. We’ll fill these matrices with arbitrary values. Each matrix is 4×3, so each one projects the four-dimensional embeddings into three-dimensional keys (K), values (V), and queries (Q). This reduces the dimensionality of the attention mechanism, which helps manage computational complexity. Note that making the attention dimension too small can hurt model accuracy. Let’s use the following values (chosen arbitrarily):
For the first head
For the second head
4.2 Calculating Keys, Queries, and Values
To get the keys, queries, and values, we multiply the input embeddings by the weight matrices.
Calculating Keys
We don’t actually have to calculate all of this by hand; it would be too tedious. Let’s cheat and use NumPy instead.
First, let’s define the matrices
import numpy as np
WK1 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]])
WV1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 0]])
WQ1 = np.array([[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
WK2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 1, 0]])
WV2 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]])
WQ2 = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]])
And let’s make sure that there are no errors in the above calculations.
embedding = np.array([[1, 3, 3, 5], [2.84, 3.99, 4, 6]])
K1 = embedding @ WK1
K1
array([[4.  , 8.  , 4.  ],
       [6.84, 9.99, 6.84]])
Great! Now let’s get the values and queries.
Calculating Values
V1 = embedding @ WV1
V1
array([[6.  , 6.  , 4.  ],
       [7.99, 8.84, 6.84]])
Calculating Queries
Q1 = embedding @ WQ1
Q1
array([[8.  , 3.  , 3.  ],
       [9.99, 3.99, 4.  ]])
Let’s skip the second head for now and focus on the final result of the first head. We’ll come back to the second head later.
4.3 Calculating Attention
Calculating the attention scores takes a few steps:

- Compute the dot product of the query with each key
- Divide the result by the square root of the key vector dimension
- Apply the softmax function to obtain the attention weights
- Multiply each value vector by the attention weights
4.3.1 Dot product of query and each key
To compute the attention score for “Hello”, we take the dot product of q1 with each key vector (k1 and k2).

In matrix form, this is Q1 multiplied by the transpose of K1.
I could have made a mistake somewhere, so let’s double-check with Python.
scores1 = Q1 @ K1.T
scores1
array([[ 68.    , 105.21  ],
       [ 87.88  , 135.5517]])
4.3.2 Dividing by the square root of the key vector dimension
Next, we divide the scores by the square root of the dimension (d) of the keys (here it’s 3, but in the paper it was 64). Why? For large values of d, the dot product grows too large (after all, we’re summing the products of many numbers, which leads to big values). And big values are bad! We’ll talk more about this shortly.
scores1 = scores1 / np.sqrt(3)
scores1
array([[39.2598183 , 60.74302182],
       [50.73754166, 78.26081048]])
4.3.3 Using the softmax function
Next, we apply softmax to normalize the scores so that they are all positive and sum to 1.
What is softmax?
Softmax is a function that takes a vector of values and returns a vector of values between 0 and 1 that sum to 1. It’s a convenient way to turn scores into probabilities. It’s defined as follows:

softmax(x_i) = e^(x_i) / Σ_j e^(x_j)

Don’t be intimidated by this formula; it’s actually quite simple. For example, the softmax of the vector [1, 2, 3] is approximately [0.09, 0.24, 0.67]. As you can see, all the values are positive and sum to 1.
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
scores1 = softmax(scores1)
scores1
array([[4.67695573e-10, 1.00000000e+00],
       [1.11377182e-12, 1.00000000e+00]])
4.3.4 Multiplying the Value Matrix by the Attention Weights
Next, we multiply by the value matrix
attention1 = scores1 @ V1
attention1
array([[7.99, 8.84, 6.84],
       [7.99, 8.84, 6.84]])
Let’s combine 4.3.1, 4.3.2, 4.3.3, and 4.3.4 into a single matrix formula (this is from section 3.2.1 of the paper):

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

Yes, that’s it! All the calculations we did can be reduced to the attention formula above! Let’s turn it into code.
def attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T
    scores = scores / np.sqrt(3)
    scores = softmax(scores)
    scores = scores @ V
    return scores
attention(embedding, WQ1, WK1, WV1)
array([[7.99, 8.84, 6.84],
       [7.99, 8.84, 6.84]])
Great, we got the same values as above. Let’s use this code to get the attention scores for the second attention head:
attention2 = attention(embedding, WQ2, WK2, WV2)
attention2
array([[8.84, 3.99, 7.99],
       [8.84, 3.99, 7.99]])
If you’re wondering how the attention ended up identical for the two embeddings, it’s because softmax pushes the values to 0 and 1. See:
softmax(((embedding @ WQ2) @ (embedding @ WK2).T) / np.sqrt(3))
array([[1.10613872e-14, 1.00000000e+00],
       [4.95934510e-20, 1.00000000e+00]])
This is caused by poor matrix initialization and small vector sizes. Large differences between the pre-softmax values only get amplified by softmax, pushing one value close to 1 and the others to 0. In practice, our initial embedding matrix values were too large, which led to large values for the keys, values, and queries that only grew as we multiplied them.
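To see the saturation numerically, here is a small illustration with made-up scores (the softmax is the same as above, written for a single 1-D vector):

```python
import numpy as np

def softmax_1d(x):
    # Same softmax as above, for a single 1-D vector.
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Moderate scores: a meaningful distribution.
print(softmax_1d(np.array([2.0, 4.0])))    # ≈ [0.119, 0.881]
# The same scores 10x larger: softmax saturates to ~[0, 1].
print(softmax_1d(np.array([20.0, 40.0])))  # ≈ [2.1e-09, 1.0]
```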
Remember how we divided by the square root of the key dimension? This is why: otherwise the dot-product values would be too large and produce saturated values after softmax. In this case, however, it clearly wasn’t enough, given our small dimensions! As a quick hack, we can scale the values down by something larger than the square root of three. Let’s redefine the attention function, dividing by 30 instead. In the long run this is a bad solution, but it will give us distinct attention scores. We’ll come back to a better solution later.
def attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T
    scores = scores / 30  # we just changed this
    scores = softmax(scores)
    scores = scores @ V
    return scores
attention1 = attention(embedding, WQ1, WK1, WV1)
attention1
array([[7.54348784, 8.20276657, 6.20276657],
       [7.65266185, 8.35857269, 6.35857269]])
attention2 = attention(embedding, WQ2, WK2, WV2)
attention2
array([[8.45589591, 3.85610456, 7.72085664],
       [8.63740591, 3.91937741, 7.84804146]])
4.3.5 The Heads’ Attention Output
The next encoder layer expects a single matrix as input, not two. The first step is to concatenate the outputs of the two heads (see section 3.2.2 of the paper).
attentions = np.concatenate([attention1, attention2], axis=1)
attentions
array([[7.54348784, 8.20276657, 6.20276657, 8.45589591, 3.85610456,
        7.72085664],
       [7.65266185, 8.35857269, 6.35857269, 8.63740591, 3.91937741,
        7.84804146]])
We then multiply this concatenated matrix by a weight matrix to get the final output of the attention layer. The model learns this weight matrix too! Its dimensions ensure we get back to the same dimensionality as the embedding (4 in this case).
# Just arbitrary values
W = np.array(
    [
        [0.79445237, 0.1081456, 0.27411536, 0.78394531],
        [0.29081936, 0.36187258, 0.32312791, 0.48530339],
        [0.36702934, 0.76471963, 0.88058366, 1.73713022],
        [0.02305587, 0.64315981, 0.68306653, 1.25393866],
        [0.29077448, 0.04121674, 0.01509932, 0.13149906],
        [0.57451867, 0.08895355, 0.02190485, 0.24535932],
    ]
)
Z = attentions @ W
Z
array([[ 11.46394285, 13.18016471, 11.59340253, 17.04387829],
       [ 11.62608573, 13.47454936, 11.87126395, 17.4926367 ]])
All of this can be combined into a single picture from The Illustrated Transformer.
5. Feed-Forward Layer
5.1 Simple Feed-Forward Layer
After the self-attention layer, the encoder has a feed-forward neural network (FFN). It’s a simple network with two linear transformations and a ReLU activation in between. The Illustrated Transformer doesn’t go into detail about it, so I’ll briefly explain this layer. The purpose of the FFN is to process and transform the representation produced by the attention mechanism. The flow usually looks like this (see section 3.3 of the paper):

- First linear layer: it typically expands the dimensionality of the input. For example, if the input dimension is 512, the output dimension might be 2048. This allows the model to learn more complex functions. In our simple example with dimension 4, we’ll expand to 8.
- ReLU activation: a non-linear activation function. It’s a simple function that returns 0 if the input is negative and the input itself if it’s positive. It allows the model to learn non-linear functions. The calculation is: ReLU(x) = max(0, x)
- Second linear layer: the opposite of the first linear layer. It brings the dimensionality back to the original size. In our example, from 8 back to 4.
All of this can be written as: FFN(Z) = ReLU(Z·W1 + b1)·W2 + b2

As a reminder, the input to this layer is the Z we computed in the self-attention step; its values are shown above. Now let’s set arbitrary values for the weight matrices and bias vectors. I’ll do this in code, but if you have the patience, you could set them by hand!
W1 = np.random.randn(4, 8)
W2 = np.random.randn(8, 4)
b1 = np.random.randn(8)
b2 = np.random.randn(4)
Now, let’s write down the forward pass function
def relu(x):
    return np.maximum(0, x)

def feed_forward(Z, W1, b1, W2, b2):
    return relu(Z.dot(W1) + b1).dot(W2) + b2
output_encoder = feed_forward(Z, W1, b1, W2, b2)
output_encoder
array([[ 3.24115016,  9.7901049 , 29.42555675, 19.93135286],
       [ 3.40199463,  9.87245924, 30.05715408, 20.05271018]])
5.2 Putting It All Together: A Random Encoder
Now let’s write code that combines multi-head attention and the feed-forward network into an encoder block.
Note
The code is optimized for understanding and educational purposes, not for performance, don’t judge too harshly!
d_embedding = 4
d_key = d_value = d_query = 3
d_feed_forward = 8
n_attention_heads = 2
def attention(x, WQ, WK, WV):
    K = x @ WK
    V = x @ WV
    Q = x @ WQ
    scores = Q @ K.T
    scores = scores / np.sqrt(d_key)
    scores = softmax(scores)
    scores = scores @ V
    return scores

def multi_head_attention(x, WQs, WKs, WVs):
    attentions = np.concatenate(
        [attention(x, WQ, WK, WV) for WQ, WK, WV in zip(WQs, WKs, WVs)], axis=1
    )
    W = np.random.randn(n_attention_heads * d_value, d_embedding)
    return attentions @ W

def feed_forward(Z, W1, b1, W2, b2):
    return relu(Z.dot(W1) + b1).dot(W2) + b2

def encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2):
    Z = multi_head_attention(x, WQs, WKs, WVs)
    Z = feed_forward(Z, W1, b1, W2, b2)
    return Z

def random_encoder_block(x):
    WQs = [
        np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
    ]
    WKs = [
        np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
    ]
    WVs = [
        np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
    ]
    W1 = np.random.randn(d_embedding, d_feed_forward)
    b1 = np.random.randn(d_feed_forward)
    W2 = np.random.randn(d_feed_forward, d_embedding)
    b2 = np.random.randn(d_embedding)
    return encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2)
Recall our input matrix E, which combines the embeddings with the positional encoding.
embedding
array([[1.  , 3.  , 3.  , 5.  ],
       [2.84, 3.99, 4.  , 6.  ]])
Now let’s pass this on to our function random_encoder_block
random_encoder_block(embedding)
array([[ 71.76537515, 131.43316885,  13.2938131 ,   4.26831998],
       [ 72.04253781, 131.84091347,  13.3385937 ,   4.32872015]])
Great! That was just one encoder block. The paper uses six of them. The output of one encoder is passed to the next, and so on:
def encoder(x, n=6):
    for _ in range(n):
        x = random_encoder_block(x)
    return x
encoder(embedding)
/tmp/ipykernel_11906/1045810361.py:2: RuntimeWarning: overflow encountered in exp
return np.exp(x)/np.sum(np.exp(x),axis=1, keepdims=True)
/tmp/ipykernel_11906/1045810361.py:2: RuntimeWarning: invalid value encountered in divide
return np.exp(x)/np.sum(np.exp(x),axis=1, keepdims=True)
array([[nan, nan, nan, nan],
       [nan, nan, nan, nan]])
5.3 Residual Connections and Layer Normalization
Oops! We’re getting NaNs! It looks like our values are too large; when passed to the next encoder they grow even larger and “explode”. This problem of exploding values often occurs when training models. For example, during backpropagation (the technique used to train models), gradients can become too large and “explode”; this is called the exploding-gradients problem. Without normalization, small changes in the input at early layers get amplified in later layers. This is a common problem in deep neural networks. There are two techniques that mitigate it: residual connections and layer normalization (they’re briefly mentioned in section 3.1 of the paper).

- Residual connections simply add the layer’s input to its output. For example, we add the initial embedding to the attention output. Residual connections mitigate the vanishing-gradient problem. The intuition: if the gradient is too small, adding the input to the output makes the gradient larger. The math is very simple: output = x + Layer(x). That’s it! We’ll do this for the attention output and for the feed-forward layer output.

- Layer normalization is a technique for normalizing a layer’s inputs. It normalizes across the embedding dimension. The intuition is that we want the layer’s inputs to have a mean of 0 and a standard deviation of 1, which helps gradient flow. At first glance the math doesn’t look so simple:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
Let’s go over each parameter:

- μ is the mean of the embedding
- σ is the standard deviation of the embedding
- ε is a small number that avoids division by zero; if the standard deviation happens to be zero, this little epsilon saves us
- γ and β are learnable parameters that control the scaling and shifting steps
Unlike batch normalization (don’t worry if you don’t know what that is), layer normalization normalizes across the embedding dimension; this means each embedding is unaffected by the other samples in the batch. The idea is that we want the layer’s inputs to have a mean of 0 and a standard deviation of 1.
Why do we add the learnable parameters γ and β? Because we don’t want to lose the representational power of the layer. If we only normalize the inputs, we might lose some information. The learnable parameters let the model learn how to scale and shift the normalized values.
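As a sketch, here is what layer normalization looks like with γ and β included; initialized to ones and zeros, they leave the normalized values unchanged:

```python
import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-6):
    # Normalize across the embedding dimension (axis=1),
    # then scale by gamma and shift by beta.
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return gamma * (x - mean) / (std + epsilon) + beta

x = np.array([[1.0, 3.0, 3.0, 5.0]])
gamma = np.ones(4)   # learnable scale, initialized to 1
beta = np.zeros(4)   # learnable shift, initialized to 0
print(layer_norm(x, gamma, beta))  # each row now has mean ≈ 0 and std ≈ 1
```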
By putting these equations together, we get the equation for the entire encoder:
Let’s check it out with our example! Let’s take the previous E and Z values:
Now let’s compute the layer normalization; the process can be divided into three steps:

- Compute the mean and variance of each embedding.
- Normalize by subtracting the row’s mean and dividing by the square root of the row’s variance (plus a small epsilon to avoid division by zero).
- Scale and shift by multiplying by gamma and adding beta.
5.3.1 Mean and Variance
For the first embedding
The same can be done for the second embedding. Let’s skip the calculations themselves and show only the result.
Let’s check with Python
(embedding + Z).mean(axis=1, keepdims=True)
array([[4.58837567],
       [3.59559107]])
(embedding + Z).std(axis=1, keepdims=True)
array([[ 9.92061529],
       [10.50653019]])
Great! Now let’s normalize.
5.3.2 Normalization
To normalize, we subtract the mean from each value in the embedding and divide by the standard deviation. Epsilon is a very small value, such as 0.00001. To simplify, we’ll assume γ = 1 and β = 0.
For the second embedding, we won’t do the calculations manually; let’s verify them with code. We’ll also redefine the encoder_block function with the following change:
def layer_norm(x, epsilon=1e-6):
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + epsilon)

def encoder_block(x, WQs, WKs, WVs, W1, b1, W2, b2):
    Z = multi_head_attention(x, WQs, WKs, WVs)
    Z = layer_norm(Z + x)
    output = feed_forward(Z, W1, b1, W2, b2)
    return layer_norm(output + Z)
layer_norm(Z + embedding)
array([[ 1.71887693, -0.56365339, -0.40370747, -0.75151608],
       [ 1.71909039, -0.56050453, -0.40695381, -0.75163205]])
It worked! Now let’s try running the embedding through six encoders again.
def encoder(x, n=6):
    for _ in range(n):
        x = random_encoder_block(x)
    return x
encoder(embedding)
array([[-0.335849  , -1.44504571,  1.21698183,  0.56391289],
       [-0.33583947, -1.44504861,  1.21698606,  0.56390202]])
Great! The values are reasonable and there are no NaNs! The idea of the encoder stack is that it outputs a continuous representation z that captures the meaning of the input sequence. This representation is then passed to the decoder, which generates the output sequence one token at a time.
Before we get into the decoder, let’s take a look at the image from Jay’s awesome post:
Each of the elements on the left side should be clear to you by now! Impressive, right? Now let’s move on to the decoder.
Decoder
Most of what we learned about the encoder applies to the decoder as well! The decoder has two self-attention layers: one over its own input and one over the encoder output. The decoder also has a feed-forward layer. Let’s go over each part in order.
The decoder block receives two inputs: the encoder output and the generated output sequence. The encoder output is a representation of the input sequence. During inference, the generated output sequence starts with a special start-of-sequence token (SOS). During training, the target output sequence is the actual output sequence shifted by one position. This will become clearer soon!
Given the encoder-generated embedding and the SOS token, the decoder generates the next token of the sequence, i.e. “hola”. The decoder is autoregressive: it takes the previously generated tokens and generates the next token.

- Iteration 1: input is SOS; output is “hola”
- Iteration 2: input is SOS + “hola”; output is “mundo”
- Iteration 3: input is SOS + “hola” + “mundo”; output is EOS
Here, SOS is the start-of-sequence token and EOS is the end-of-sequence token. Once the decoder generates the EOS token, it stops. It generates one token at a time. Note that every iteration uses the embedding generated by the encoder.
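The loop above can be sketched as follows. Here decode_step is a hypothetical stand-in for a full decoder forward pass, scripted just to show the control flow:

```python
# A sketch of the autoregressive loop. `decode_step` is a hypothetical
# stand-in for a full decoder pass that returns the next token given
# the encoder output and the tokens generated so far.
def decode_step(encoder_output, tokens):
    # Stand-in: a canned translation, purely to illustrate control flow.
    script = {0: "hola", 1: "mundo", 2: "EOS"}
    return script[len(tokens) - 1]

def generate(encoder_output, max_tokens=10):
    tokens = ["SOS"]
    for _ in range(max_tokens):
        next_token = decode_step(encoder_output, tokens)
        tokens.append(next_token)
        if next_token == "EOS":  # stop once the end token appears
            break
    return tokens

print(generate(None))  # ['SOS', 'hola', 'mundo', 'EOS']
```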
Note
This autoregressive structure makes the decoder slow. The encoder can produce its embedding in a single forward pass, while the decoder needs many forward passes. This is one reason why architectures that use only an encoder (such as BERT or sentence-similarity models) are much faster than architectures that rely on a decoder (such as GPT-2 or BART).
Let’s go through each stage! Like the encoder, the decoder consists of a stack of decoder blocks. The decoder block is a bit more complex than the encoder block. Its overall structure is:

1. Masked self-attention layer
2. Residual connection and layer normalization
3. Encoder-decoder attention layer
4. Residual connection and layer normalization
5. Feed-forward layer
6. Residual connection and layer normalization
We’re already familiar with all the math of points 1, 2, 4, 5, and 6. Looking at the right side of the image below, you’ll see that you already know all of these blocks:
1. Text Embedding
The decoder’s first step is to embed the input tokens. The input token is SOS, so we embed it, using the same embedding dimension as in the encoder. Suppose the embedding vector for SOS looks like this:
2. Positional Encoding
Now we add the positional encoding to the embedding, just as we did in the encoder. Since SOS occupies the same position that “Hello” did, its positional encoding is the same as before:

- i = 0 (even): PE(0,0) = sin(0 / 10000^(0 / 4)) = sin(0) = 0
- i = 1 (odd): PE(0,1) = cos(0 / 10000^(2*1 / 4)) = cos(0) = 1
- i = 2 (even): PE(0,2) = sin(0 / 10000^(2*2 / 4)) = sin(0) = 0
- i = 3 (odd): PE(0,3) = cos(0 / 10000^(2*3 / 4)) = cos(0) = 1
3. Adding the Positional Encoding and the Embedding
The positional encoding is added to the embedding by summing the two vectors:
4. Self-Attention
The first stage in the decoder block is the self-attention mechanism. Luckily, we already have code for this and can simply reuse it!
d_embedding = 4
n_attention_heads = 2
E = np.array([[1, 1, 0, 1]])
WQs = [np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)]
WKs = [np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)]
WVs = [np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)]
Z_self_attention = multi_head_attention(E, WQs, WKs, WVs)
Z_self_attention
array([[ 2.19334924, 10.61851198, 4.50089666, 2.76366551]])
Note
From the inference point of view everything is quite simple, but training brings some complications. During training we use unlabeled data: just a large pile of text, often collected by scraping the web. The encoder’s purpose is to capture all of the information in the input, while the decoder’s task is to predict the most likely next token. This means the decoder may only use the tokens generated so far (it can’t cheat and peek at future tokens).
Because of this, we use masked self-attention: we mask the tokens that haven’t been generated yet by setting their attention scores to −inf. This is described in the paper (section 3.2.3.1). We’ll skip it for now, but it’s important to remember that the decoder is a bit more involved during training.
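As a sketch with made-up scores, the mask is an upper-triangular matrix of very large negative numbers (standing in for −inf) added to the scores before softmax:

```python
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

# Made-up attention scores for a 3-token sequence; -1e9 stands in for -inf.
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])
# Upper triangle (future positions) gets -1e9, the rest gets 0.
mask = np.where(np.triu(np.ones((3, 3)), k=1) == 1, -1e9, 0.0)
weights = softmax(scores + mask)
print(weights.round(2))
# Each row can only attend to itself and earlier positions:
# row 0 puts all its weight on position 0, row 1 on positions 0-1, etc.
```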
5. Residual Connections and Layer Normalization
There’s nothing mysterious here: we just add the input to the self-attention output and apply layer normalization. The code is the same as above.
Z_self_attention = layer_norm(Z_self_attention + E)
Z_self_attention
array([[ 0.17236212,  1.54684892, -1.0828824 , -0.63632864]])
6. Encoder-Decoder Attention
This part is new! If you’ve been wondering where the encoder-generated embeddings go, now is their time to shine!
Let’s assume the output of the encoder is the following matrix: [[1.5, 1.0, 0.8, 1.5], [1.0, 1.0, 0.5, 1.0]].
In self-attention, we compute the queries, keys, and values from the input embeddings.

In encoder-decoder attention, we compute the queries from the previous decoder layer, and the keys and values from the encoder output! All the calculations stay the same as before; the only difference is which embeddings the queries come from. Let’s look at the code:
def encoder_decoder_attention(encoder_output, attention_input, WQ, WK, WV):
    # The main difference is in the next three lines!
    K = encoder_output @ WK  # Note that we now pass in the encoder output!
    V = encoder_output @ WV  # Note that we now pass in the encoder output!
    Q = attention_input @ WQ  # Same as in self-attention
    # The rest stays the same
    scores = Q @ K.T
    scores = scores / np.sqrt(d_key)
    scores = softmax(scores)
    scores = scores @ V
    return scores
def multi_head_encoder_decoder_attention(
    encoder_output, attention_input, WQs, WKs, WVs
):
    # Note that we now pass in the encoder output!
    attentions = np.concatenate(
        [
            encoder_decoder_attention(
                encoder_output, attention_input, WQ, WK, WV
            )
            for WQ, WK, WV in zip(WQs, WKs, WVs)
        ],
        axis=1,
    )
    W = np.random.randn(n_attention_heads * d_value, d_embedding)
    return attentions @ W
WQs = [np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)]
WKs = [np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)]
WVs = [np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)]
encoder_output = np.array([[1.5, 1.0, 0.8, 1.5], [1.0, 1.0, 0.5, 1.0]])
Z_encoder_decoder = multi_head_encoder_decoder_attention(
encoder_output, Z_self_attention, WQs, WKs, WVs
)
Z_encoder_decoder
array([[ 1.57651431, 4.92489307, 0.08644448, 0.46776051]])
It worked! You might be wondering, "why are we doing this?" The point is that we want the decoder to focus on the relevant parts of the input text (e.g. "hello world"). Encoder-decoder attention lets every position in the decoder attend over all positions in the input sequence. This is very useful for tasks like translation, where the decoder needs to focus on the relevant parts of the input. By learning to generate the correct output tokens, the decoder learns which parts of the input sequence to attend to. It's a very powerful mechanism!
7. Residual Connections and Layer Normalization
Everything is the same as before!
Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z_self_attention)
Z_encoder_decoder
array([[0.44406723, 1.6552893 , 0.19984632, 1.01137575]])
8. Feedforward Layer
And it's the same here! Afterwards, we again apply a residual connection and layer normalization.
W1 = np.random.randn(4, 8)
W2 = np.random.randn(8, 4)
b1 = np.random.randn(8)
b2 = np.random.randn(4)
output = layer_norm(feed_forward(Z_encoder_decoder, W1, b1, W2, b2) + Z_encoder_decoder)
output
array([[0.97650182, 0.81470137, 2.79122044, 3.39192873]])
9. Combining Everything: A Random Decoder
Let's write the code for a single decoder block. The most important change is that we now have an additional attention mechanism.
d_embedding = 4
d_key = d_value = d_query = 3
d_feed_forward = 8
n_attention_heads = 2
encoder_output = np.array([[1.5, 1.0, 0.8, 1.5], [1.0, 1.0, 0.5, 1.0]])
def decoder_block(
    x,
    encoder_output,
    WQs_self_attention, WKs_self_attention, WVs_self_attention,
    WQs_ed_attention, WKs_ed_attention, WVs_ed_attention,
    W1, b1, W2, b2,
):
    # Same as before
    Z = multi_head_attention(
        x, WQs_self_attention, WKs_self_attention, WVs_self_attention
    )
    Z = layer_norm(Z + x)
    # The main difference is in the next three lines!
    Z_encoder_decoder = multi_head_encoder_decoder_attention(
        encoder_output, Z, WQs_ed_attention, WKs_ed_attention, WVs_ed_attention
    )
    Z_encoder_decoder = layer_norm(Z_encoder_decoder + Z)
    # Same as before
    output = feed_forward(Z_encoder_decoder, W1, b1, W2, b2)
    return layer_norm(output + Z_encoder_decoder)
def random_decoder_block(x, encoder_output):
    # Just some random initializations
    WQs_self_attention = [
        np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
    ]
    WKs_self_attention = [
        np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
    ]
    WVs_self_attention = [
        np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
    ]
    WQs_ed_attention = [
        np.random.randn(d_embedding, d_query) for _ in range(n_attention_heads)
    ]
    WKs_ed_attention = [
        np.random.randn(d_embedding, d_key) for _ in range(n_attention_heads)
    ]
    WVs_ed_attention = [
        np.random.randn(d_embedding, d_value) for _ in range(n_attention_heads)
    ]
    W1 = np.random.randn(d_embedding, d_feed_forward)
    b1 = np.random.randn(d_feed_forward)
    W2 = np.random.randn(d_feed_forward, d_embedding)
    b2 = np.random.randn(d_embedding)
    return decoder_block(
        x, encoder_output,
        WQs_self_attention, WKs_self_attention, WVs_self_attention,
        WQs_ed_attention, WKs_ed_attention, WVs_ed_attention,
        W1, b1, W2, b2,
    )
def decoder(x, encoder_output, n=6):
    for _ in range(n):
        x = random_decoder_block(x, encoder_output)
    return x
decoder(E, encoder_output)
array([[ 0.25919176, 1.49913566, 1.14331487, 0.61501256],
[ 0.25956188, 1.49896896, 1.14336934, 0.61516151]])
Output Sequence Generation
We already have all the parts we need! Let’s generate an output sequence.

We have an encoder, which receives the input sequence and produces an enriched representation of it. It consists of a stack of encoder blocks.

We have a decoder, which receives the encoder output and the tokens generated so far, and produces the output sequence. It consists of a stack of decoder blocks.
How do we go from the decoder output to a word? We need to add a final linear layer and a softmax layer on top of the decoder. The whole algorithm looks like this:

Encoder processing: the encoder receives the input sequence and produces a contextualized representation of the entire sentence using a stack of encoder blocks.

Decoder initialization: the decoding process begins with the embedding of the SOS (Start of Sequence) token, together with the encoder output.

Decoder Operation: The decoder uses the output of the encoder and embeddings of all previously generated tokens to create a new list of embeddings.

Linear layer for logits: a linear layer is applied to the last output embedding of the decoder to produce logits, the raw predictions for the next token.

Softmax for probabilities: these logits are then passed through a softmax layer, converting them into a probability distribution over potential next tokens.

Iterative Token Generation: This process is repeated, and at each stage, the decoder generates the next token based on the cumulative embeddings of the previously generated tokens and the original output of the encoder.

Sequence completion: these generation steps continue until an EOS (End of Sequence) token is produced or a preset maximum sequence length is reached.
This is described in section 3.4 of the paper.
1. Linear Layer
The linear layer is a simple linear transformation. It receives the decoder output and converts it into a vector of size vocab_size, the size of the vocabulary. For example, if we had a vocabulary of 10,000 words, the linear layer would convert the decoder output into a vector of size 10,000, containing a score for each word being the next word in the sequence. For simplicity, let's start with a 10-word vocabulary and assume the first decoder output is a very simple vector: [1, 0, 1, 0]. We use a random weight matrix of size decoder_output_size x vocab_size and a random bias vector of size vocab_size.
def linear(x, W, b):
    return np.dot(x, W) + b
x = linear([1, 0, 1, 0], np.random.randn(4, 10), np.random.randn(10))
x
array([ 0.06900542, 1.81351091, 1.3122958 , 0.33197364, 2.54767851,
1.55188231, 0.82907169, 0.85910931, 0.32982856, 1.26792439])
Note
What do we use as the input to the linear layer? The decoder outputs one embedding for each token in the sequence, and the input to the linear layer is the last embedding generated. The last embedding encodes information about the entire sequence up to that point; in other words, it contains everything needed to generate the next token.
2. Softmax
These values are called logits, but they're not easy to interpret directly. To get probabilities, we can use the softmax function.
softmax(x)
array([[0.01602618, 0.06261303, 0.38162024, 0.03087794, 0.0102383 ,
0.00446011, 0.01777314, 0.00068275, 0.46780959, 0.00789871]])
And that's how we got the probabilities! Suppose the vocabulary looks like this:
We can see that the probabilities are as follows:

hello: 0.01602618

mundo: 0.06261303

world: 0.38162024

how: 0.03087794

?: 0.0102383

EOS: 0.00446011

SOS: 0.01777314

a: 0.00068275

hola: 0.46780959

c: 0.00789871
From this we can see that the most likely next token is "hola". Always choosing the most likely token is called greedy decoding. This isn't always the best approach, as it can lead to suboptimal results, but we won't delve into generation techniques for now. If you want to learn more about them, check out the awesome post.
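To make the contrast concrete, here is a small sketch (using the probabilities above, rounded, and an arbitrary random seed) comparing greedy decoding with sampling, one common alternative:

```python
import numpy as np

vocabulary = ["hello", "mundo", "world", "how", "?",
              "EOS", "SOS", "a", "hola", "c"]
probs = np.array([0.0160, 0.0626, 0.3816, 0.0309, 0.0102,
                  0.0045, 0.0178, 0.0007, 0.4678, 0.0079])
probs = probs / probs.sum()  # renormalize after rounding

# Greedy decoding: always take the most likely token
greedy_token = vocabulary[int(np.argmax(probs))]  # "hola"

# Sampling: draw the next token from the distribution instead
rng = np.random.default_rng(seed=42)
sampled_token = vocabulary[rng.choice(len(probs), p=probs)]
```

Greedy decoding always returns "hola" here, while sampling occasionally picks other tokens in proportion to their probabilities, which can help avoid repetitive or locally optimal outputs.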
3. The Random Transformer: Encoder and Decoder Together
Let's write the complete code. First we define a vocabulary that maps words to their initial embeddings. Note that in a real model these embeddings are also learned during training, but for now we use random values.
vocabulary = [
"hello",
"mundo",
"world",
"how",
"?",
"EOS",
"SOS",
"a",
"hola",
"c",
]
embedding_reps = np.random.randn(10, 4)
vocabulary_embeddings = {
word: embedding_reps[i] for i, word in enumerate(vocabulary)
}
vocabulary_embeddings
{'hello': array([0.32106406, 2.09332588, 0.77994069, 0.92639774]),
'mundo': array([0.59563791, 0.63389256, 1.70663692, 0.99495115]),
'world': array([ 1.35581862, 0.0323546 , 2.76696887, 0.83069982]),
'how': array([0.52975474, 0.94439644, 0.80073818, 1.50135518]),
'?': array([0.88116833, 0.13995055, 2.01827674, 0.52554391]),
'EOS': array([1.12207024, 1.40905796, 1.22231714, 0.02267638]),
'SOS': array([0.60624082, 0.67560165, 0.77152125, 0.63472247]),
'a': array([ 1.67622229, 0.20319309, 0.18324905, 0.24258774]),
'hola': array([ 1.07809402, 0.83846408, 0.33448976, 0.28995976]),
'c': array([ 0.65643157, 0.24935726, 0.80839751, 1.87156293])}
Now let's write a generate method that autoregressively generates tokens.
def generate(input_sequence, max_iters=3):
    # First we encode the inputs into embeddings
    # For simplicity we skip the positional encoding step
    embedded_inputs = [
        vocabulary_embeddings[token] for token in input_sequence
    ]
    print("Embedding representation (encoder input)", embedded_inputs)
    # Then we generate the encoder representation
    encoder_output = encoder(embedded_inputs)
    print("Embedding generated by encoder (encoder output)", encoder_output)
    # We initialize the decoder output with the embedding of the start token
    sequence_embeddings = [vocabulary_embeddings["SOS"]]
    output = "SOS"
    # Random matrices for the linear layer
    W_linear = np.random.randn(d_embedding, len(vocabulary))
    b_linear = np.random.randn(len(vocabulary))
    # We limit the number of decoding steps to avoid overly long sequences without an EOS
    for i in range(max_iters):
        # Decoder step
        decoder_output = decoder(sequence_embeddings, encoder_output)
        # We only use the last output for prediction
        logits = linear(decoder_output[-1], W_linear, b_linear)
        # We wrap the logits in a list because softmax expects batches / a 2D array
        probs = softmax([logits])
        # We get the most likely next token
        next_token = vocabulary[np.argmax(probs)]
        sequence_embeddings.append(vocabulary_embeddings[next_token])
        output += " " + next_token
        print(
            "Iteration", i,
            "next token", next_token,
            "with probability of", np.max(probs),
        )
        # If the next token is the end token, we return the sequence
        if next_token == "EOS":
            return output
    return output, sequence_embeddings
Let’s run the code!
generate(["hello", "world"])
Embedding representation (encoder input) [array([0.32106406, 2.09332588, 0.77994069, 0.92639774]), array([ 1.35581862, 0.0323546 , 2.76696887, 0.83069982])]
Embedding generated by encoder (encoder output) [[ 1.14747807 1.5941759 0.36847675 0.07822107]
[ 1.14747705 1.59417696 0.36847441 0.07822551]]
Iteration 0 next token hola with probability of 0.4327111653266739
Iteration 1 next token mundo with probability of 0.4411354383451089
Iteration 2 next token world with probability of 0.4746898792307499
('SOS hola mundo world',
[array([0.60624082, 0.67560165, 0.77152125, 0.63472247]),
array([ 1.07809402, 0.83846408, 0.33448976, 0.28995976]),
array([0.59563791, 0.63389256, 1.70663692, 0.99495115]),
array([ 1.35581862, 0.0323546 , 2.76696887, 0.83069982])])
Great, we got the tokens "hola", "mundo", and "world". It's a wrong translation, but that was to be expected, since we used random weights.
I encourage you to take another look at the full encoder-decoder architecture diagram from the paper:
Conclusion
I hope you found this post interesting and informative! We've covered a lot of ground. But is that all? In fact, almost! Many tricks get added to the architectures of newer Transformers, but the foundation of the Transformer is exactly as we described it. Depending on the task, you can also use only the encoder or only the decoder. For example, for tasks that require understanding, such as classification, you can use a stack of encoders with a linear layer on top. For tasks that require generation, such as translation, you can use both encoder and decoder stacks. And for free-form generation, as in ChatGPT or Mistral, you can use only a stack of decoders.
Of course, we’ve simplified a lot of things. Let’s take a quick look at what the numbers were in the Transformer research paper:

Embedding dimension: 512 (4 in our example)

Number of encoders: 6 (6 in our example)

Number of decoders: 6 (6 in our example)

Feedforward dimension: 2048 (8 in our example)

Number of Attention Heads: 8 (2 in our example)

Attention dimension: 64 (3 in our example)
We've covered a lot of topics, and it's quite remarkable that by scaling up these computations and training cleverly, we can achieve impressive results. We didn't cover training in this post, because the goal was to understand the computations in an off-the-shelf model, but I hope this gives you a solid foundation for moving on to training!
You can also find a more formal document with calculations in this PDF.
Exercises
Here are some exercises to practice understanding a transformer.

What is the purpose of positional encoding?

What is the difference between self-attention and encoder-decoder attention?

What happens if the dimension of attention is too small? And if it’s too big?

Briefly describe the structure of the feedforward layer.

Why is a decoder slower than an encoder?

What is the purpose of residual connections and layer normalization?

How do you go from decoder output to probabilities?

Why can always choosing the most likely next token cause problems?
Resources

The Illustrated Transformer [Russian translation on Habr]

Attention Is All You Need [Russian translation on Habr: parts one and two]