I was prompted to write this note by the almost complete absence of Russian-language material on efficient implementations of the algorithm for optimal prefix encoding of an alphabet with minimal redundancy, known after its creator as the Huffman algorithm. This algorithm is used in one form or another in many compression standards and programs for a wide variety of data.
The Classic Huffman Algorithm
A good description of Huffman’s algorithm can be found in books [1,2]. Therefore, I will quote almost verbatim the information from the section devoted to the description of the Huffman algorithm in the book [1].
The algorithm belongs to the group of "greedy" algorithms and uses only the frequencies of occurrence of characters in the input data block. It maps more frequent characters of the input stream to shorter bit strings and, conversely, rare characters to longer ones. Collecting the statistics requires two passes over the input block (single-pass adaptive versions of the algorithm also exist).
Characteristics of the Huffman algorithm [1]:

Compression ratios: 8 (best), 1.5 (average), 1 (worst)

Time symmetry: 2:1 (due to the fact that it requires two passes through the compressed data array)

Features: One of the few algorithms that does not increase the size of the source data in the worst case (except for the need to store the transcoding table with the file).
Basic Steps of the Huffman Code Compression Algorithm

Collection of statistical information for the subsequent construction of variable-length code tables

Construction of variable-length codes based on the collected statistical information

Data encoding (compression) using constructed codes
The compression algorithm described above requires that, in order to uniquely reconstruct the correspondence between the encoded characters and their bit codes, additional information be stored and transmitted together with the encoded data.
It should be noted that in some cases it is possible to use a fixed table (or set of tables) known in advance to both the encoder and the decoder. Alternatively, the table can be built adaptively during compression and decompression. In these cases there is no need to store and transmit additional information, nor to collect statistics in advance (step 1).
In what follows, we will consider only the two-pass scheme with explicit transmission of additional information about the correspondence between the encoded characters and their bit codes.
First Data Pass: Gathering Statistical Information
Let’s assume that the input characters are bytes. In this case, the collection of statistical information will consist of counting the number of occurrences of identical bytes in the input data block.
Constructing Variable-Length Codes Based on the Collected Statistical Information
Huffman’s algorithm is based on the creation of a binary tree [3].
Initially, all nodes are considered leaves (leaf nodes), which represent a symbol (in this case, a byte) and the number of times it occurs. To build a binary tree, a priority queue is used, where the node with the lowest frequency is assigned the highest priority.
The algorithm will include the following steps:
Step 1. Place all leaves with a non-zero occurrence count into a priority queue (i.e., order all leaves by ascending occurrence count)
Step 2. As long as there is more than one node in the queue, perform the following actions:
Step 2a. Remove from the queue the two nodes with the highest priority (the lowest occurrence counts)
Step 2b. Create a new node whose children are the nodes selected in step 2a; the occurrence count of the new node (its weight) equals the sum of the occurrence counts of the two nodes selected in step 2a
Step 2c. Add the node created in step 2b to the priority queue.
The only node that remains in the priority queue at the end of the algorithm will be the root node for the built binary tree.
To reconstruct the code words, we follow the edges from the root of the resulting tree to each leaf, assigning bit "1" to an edge going left and "0" to an edge going right (or vice versa).
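The steps above can be sketched in a few lines of Python. The helper below is illustrative only (a real implementation would use the array-based approach described later); ties in the priority queue may be broken differently than in Figure 1, but the total encoded size is the same for any optimal tree:

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Build a Huffman tree with a priority queue and return each
    symbol's code length (its depth in the tree).  Ties may be broken
    differently than in Figure 1, but the total cost is identical."""
    counter = itertools.count()   # tie-breaker so trees are never compared
    heap = [(w, next(counter), i) for i, w in enumerate(freqs) if w > 0]
    heapq.heapify(heap)
    while len(heap) > 1:                                  # step 2
        w1, _, t1 = heapq.heappop(heap)                   # step 2a: two lightest nodes
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(counter), (t1, t2)))  # steps 2b, 2c
    depths = {}
    def walk(tree, d):
        if isinstance(tree, tuple):                       # internal node
            walk(tree[0], d + 1)
            walk(tree[1], d + 1)
        else:                                             # leaf: symbol index
            depths[tree] = d
    walk(heap[0][2], 0)
    return [depths[i] for i, w in enumerate(freqs) if w > 0]

freqs = [20, 17, 6, 3, 2, 2, 2, 1, 1, 1]  # occurrence counts from Table 1 (A..J)
lengths = huffman_code_lengths(freqs)
total_bits = sum(f * l for f, l in zip(freqs, lengths))
```

For the counts of Table 1 the total encoded size is 140 bits regardless of how ties are resolved, and the most frequent character ("A") always receives a 1-bit code.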
An example of how the algorithm described above works is shown in Figure 1.
Figure 1. Example of how the Huffman algorithm works
Table 1. Constructed Huffman codes for the example shown in Figure 1.
Input Symbol | Number of Appearances | Bit-Code Length | Bit Code
"A" | 20 | 1 | 0
"B" | 17 | 2 | 10
"C" | 6 | 4 | 1111
"D" | 3 | 4 | 1100
"E" | 2 | 5 | 11100
"F" | 2 | 5 | 11011
"G" | 2 | 5 | 11010
"H" | 1 | 6 | 111010
"I" | 1 | 7 | 1110111
"J" | 1 | 7 | 1110110
The above description of the algorithm gives the impression that the implementation of code building should be pointer-based and require additional memory to store the addresses of the internal nodes of the binary tree. An example of such an implementation can be found, for instance, in the article [3]. For learning purposes such an implementation is fine, but no more than that. It becomes quite sad when such an implementation is used in all seriousness in real software.
Building Minimum-Redundancy Codes with Efficient Use of RAM
Basic ideas of the algorithm
Before we get to this efficient algorithm for building minimum-redundancy codes, we have to dive a little deeper into coding theory [4].
Up to this point, we’ve assumed that the result of code building is to assign a bitcode to each input character. The mapping of bitcodes to each input character defines the encoding scheme.
For the encoding to be uniquely decodable, the encoding scheme must have the prefix property: no bit code used to encode one character may be a prefix of the bit code used to encode any other input character [1].
Let l_i be the length of the bit code encoding the i-th character of the input alphabet of n characters. Then a necessary condition for an encoding scheme with the prefix property is the inequality

∑_{i=1..n} 2^(−l_i) ≤ 1.

In the literature [4, 5] this is called the Kraft–McMillan inequality.
At the same time, for the encoding scheme to have minimal redundancy, the bit codes must be constructed so that the expected code length ∑_i p_i·l_i is minimal (here p_i is the probability of appearance of the i-th character of the input alphabet).
It should be noted again that an encoding scheme whose set of code lengths satisfies the Kraft–McMillan inequality does not automatically have the prefix property. For example, for a two-character alphabet encoded with the bit codes "0" and "00", the Kraft–McMillan inequality holds: 2^(−1) + 2^(−2) = 0.75 ≤ 1. But these bit codes obviously do not have the prefix property. However, given a set of code lengths for which the Kraft–McMillan inequality holds, we can always construct bit codes of exactly these lengths that do have the prefix property [4, 5].
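The left-hand side of the Kraft–McMillan inequality is easy to compute directly; the trivial helper below (an illustration, not part of the original algorithm) makes the point of the example concrete:

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft-McMillan inequality for a list of
    code lengths."""
    return sum(2.0 ** -l for l in lengths)
```

For the two-code example above, kraft_sum([1, 2]) gives 0.75 ≤ 1 even though the codes "0" and "00" are not prefix-free, illustrating that the inequality is necessary but not sufficient for a particular set of codes.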
We will return to the algorithm for constructing bit codes from their lengths later; for now, let us note that the problem of constructing minimum-redundancy codes reduces to finding the code lengths for the characters of the input alphabet from the collected statistical information.
Following the article [6], we will describe an algorithm for building codes with minimal redundancy that does not require additional memory (“inplace”) for its operation.
First of all, let’s look at the main ideas of the algorithm. It is based on the following two observations.
The first observation is that we can store the leaves of the binary tree separately from the internal nodes that are formed during the merge process (steps 2a and 2b) of the Huffman algorithm described above. Thus we have two priority queues. Moreover, because the internal nodes are produced in order of non-decreasing weight, and the leaves are sorted in step 1, the priority queues become ordinary lists.
In this case, the weight of any node of the tree is needed only until that node is processed. At any given moment, therefore, it suffices to store at most n weights. By the end of the merging process only one weight remains: the weight of the root.
The second observation is that we do not need to store pointers to parent nodes (parent indexes) in both lists. If the depth of each internal node is known, we can reconstruct the depth of each leaf (since the code lengths can be assumed to be non-increasing). For example, for the tree shown in Figure 2, with internal-node depths (shown in orange) of [3, 3, 2, 1, 0], the leaf depths (shown in green) are [4, 4, 4, 4, 2, 1].
Figure 2. Example of a tree with depths
At the beginning of the merge process there are no parent pointers (parent indexes) in either list. At the end of it we will have n − 1 parent pointers (parent indexes), all of which belong to internal nodes.
The total number of weights (stored for nodes not yet processed) and parent pointers (stored for internal nodes already processed) never exceeds n. That is, the parent indexes and the weights can be stored together in the input array A of n elements.
Thus, the input array A is at the same time the working array of the algorithm. At the beginning, A contains the occurrence counts (frequencies) of the n characters of the input alphabet in the input data stream. Then the contents of the array change: at different stages of the algorithm, the elements of A store occurrence counts, the weights of internal nodes, parent indexes, the depths of internal nodes, and finally the leaf depths (the code lengths for each character).
Main steps of the algorithm
Let’s move on to the direct description of the “inplace” algorithm for building codes with minimal redundancy.
This algorithm includes the following three main steps:

Stage 1. Formation of internal nodes.
Input: array A containing the occurrence counts (frequencies) of the n characters of the input alphabet, sorted in ascending order.
Output: array A containing the following data:
A[0..n−3] – pointers to parent nodes (parent indexes)
A[n−2] – weight of the root node
A[n−1] – not used.
1 s ← 0
2 r ← 0
3 for t ← 0 to n−2
4 do ⇨ select the first child node
5    if (s > n−1) or (r < t and A[r] < A[s])
6       then ⇨ select an internal node
7            A[t] ← A[r]
8            A[r] ← t+1
9            r ← r+1
10      else ⇨ select a leaf
11           A[t] ← A[s]
12           s ← s+1
13   ⇨ select the second child node
14   if (s > n−1) or (r < t and A[r] < A[s])
15      then ⇨ select an internal node
16           A[t] ← A[t] + A[r]
17           A[r] ← t+1
18           r ← r+1
19      else ⇨ select a leaf
20           A[t] ← A[t] + A[s]
21           s ← s+1
An important note on the pseudocode above: the "or" and "and" operations are evaluated with short-circuit semantics (as, for example, in the C/C++ programming languages).
The essence of the first stage is as follows: at each iteration of the loop, the next internal node with the lowest weight (if one exists) is compared with the next leaf with the lowest weight (if one exists), and the lighter of the two is chosen (lines 5 – 12); its weight goes into the newly created tree node. In the second stage, the resulting parent indexes are traversed from right to left, and the depth of each internal node is computed from the depth of its parent. In the third stage, the depths of the internal nodes are converted into leaf depths (the code lengths for each character). To do this, the array is scanned from right to left with one pointer over the internal nodes and another pointer over the element that will receive the code length of the next leaf of the tree.
An illustration of Stage 3 for our example is shown in the figure below:
Figure 4. Example of the third stage of the algorithm
Here, the resulting code lengths are highlighted in blue.
To sum up, the algorithm described above consists of three stages, each of which makes a single pass over the array (one pass in the forward direction and two in the reverse direction).
The algorithm runs in O(n) time and uses O(1) additional memory, provided that the input array of occurrence counts (frequencies) of the characters of the input alphabet is presorted.
Once again, I would like to emphasize that the algorithm does not require explicit tree construction with pointers (additional dynamic memory allocation for the generated nodes of the tree), which in practice would increase the time required to build the variable-length codes.
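The three stages can be expressed compactly in Python. The sketch below follows the reference implementation from [6], with one index convention changed: parent indexes are stored as t rather than t+1. The input array must contain the occurrence counts sorted in ascending order; on return, the same array holds the code lengths:

```python
def compute_code_lengths(A):
    """In-place computation of minimum-redundancy code lengths
    (after Moffat & Katajainen [6]).  A must hold occurrence counts
    sorted in ascending order; on return A[i] is the code length of
    the i-th symbol.  Requires len(A) >= 2."""
    n = len(A)
    s = 0  # index of the next unprocessed leaf
    r = 0  # index of the next unprocessed internal node
    # Stage 1: merging; internal node t is built in A[t]
    for t in range(n - 1):
        for second in (False, True):
            # choose an internal node if no leaves remain, or if the
            # lightest internal node is lighter than the lightest leaf
            if s > n - 1 or (r < t and A[r] < A[s]):
                w = A[r]
                A[r] = t          # parent index overwrites the weight
                r += 1
            else:
                w = A[s]
                s += 1
            A[t] = A[t] + w if second else w
    # Stage 2: right-to-left pass turns parent indexes into node depths
    A[n - 2] = 0                  # the root has depth 0
    for t in range(n - 3, -1, -1):
        A[t] = A[A[t]] + 1
    # Stage 3: convert internal-node depths into leaf depths
    avail, used, depth = 1, 0, 0
    r, t = n - 2, n - 1
    while avail > 0:
        while r >= 0 and A[r] == depth:
            used += 1             # internal nodes found at this depth
            r -= 1
        while avail > used:       # remaining slots at this depth are leaves
            A[t] = depth
            t -= 1
            avail -= 1
        avail, depth, used = 2 * used, depth + 1, 0
    return A
```

Run on the sorted counts of our example, [1, 1, 1, 2, 2, 2, 3, 6, 17, 20], it produces the code lengths [6, 6, 5, 5, 5, 5, 5, 4, 2, 1] that appear in Table 2.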
Constructing Variable-Length Codes
Now that we have the code lengths for each character, let’s construct bitcodes for them.
So, l_i is the length of the bit code of the i-th character of the input alphabet. Let us denote by m_k the number of characters of the input alphabet whose code length equals k, for k = 1, …, L, where L is the maximum code length produced by the algorithm described above.

Then the code of length k for the j-th character (j = 0, 1, …, m_k − 1) among the set of characters whose code length equals k can be obtained by the formula

c_{k,j} = base_k + j,

where base_k = (∑_{t=k+1..L} m_t · 2^(L−t)) >> (L − k).
Input: array A containing the code lengths of each of the n characters of the input alphabet (characters with equal code lengths grouped together, longest lengths first).
Output: array B, where each element B[i] is an integer containing the bit representation of the i-th code in its A[i] low-order bits.
1  ⇨ count the number of characters with identical code lengths
2  for i ← 0 to n−1
3  do m[i] ← 0
4  for i ← 0 to n−1
5  do m[A[i]] ← m[A[i]] + 1
6  ⇨ compute the base value for each code length
7  s ← 0
8  for k ← L downto 1
9  do Base[k] ← s >> (L − k)
10    s ← s + (m[k] << (L − k))
11 ⇨ compute the codes for each character of the input alphabet
12 p ← 0
13 for i ← 0 to n−1
14 do if p ≠ A[i]
15       then j ← 0
16            p ← A[i]
17    B[i] ← j + Base[A[i]]
18    j ← j + 1
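The pseudocode above translates almost line for line into Python. A sketch, under the same assumption that equal lengths are grouped together with the longest codes first, as in Table 2:

```python
def canonical_codes(lengths):
    """Assign canonical code values from code lengths (the base-value
    scheme above).  Equal lengths must be grouped together, longest
    first, as in Table 2."""
    L = max(lengths)
    m = [0] * (L + 1)          # m[k] = number of codes of length k
    for l in lengths:
        m[l] += 1
    base = [0] * (L + 1)
    s = 0
    for k in range(L, 0, -1):  # compute base for every code length
        base[k] = s >> (L - k)
        s += m[k] << (L - k)
    codes, p, j = [], 0, 0
    for l in lengths:          # j numbers the symbols inside a length group
        if p != l:
            p, j = l, 0
        codes.append(base[l] + j)
        j += 1
    return codes
```

For the lengths [6, 6, 5, 5, 5, 5, 5, 4, 2, 1] of our example it returns the code values [0, 1, 1, 2, 3, 4, 5, 3, 1, 1], matching the "Bit code" column of Table 2.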
Let’s construct variablelength codes for our example.
Table 2. An example of building variablelength codes.
Input Symbol | Bit-Code Length | base | j | Bit Code
"H" | 6 | 0 | 0 | 000000
"I" | 6 |   | 1 | 000001
"J" | 5 | 1 | 0 | 00001
"E" | 5 |   | 1 | 00010
"F" | 5 |   | 2 | 00011
"G" | 5 |   | 3 | 00100
"D" | 5 |   | 4 | 00101
"C" | 4 | 3 | 0 | 0011
"B" | 2 | 1 | 0 | 01
"A" | 1 | 1 | 0 | 1
Pay attention to the form of the resulting codes: codes of equal length form an ascending sequence of integers. Such codes are called canonical codes [5, 7] and allow fast encoding and decoding.
In addition, knowing only the lengths of the codes, you can quickly build canonical codes when decoding. This makes it possible to compactly present the additional information that needs to be passed to the decoder in order to unambiguously reconstruct the correspondence table between the encoded characters and the variablelength codes used for this purpose (recoding table).
For example, additional information might have the following structure:
In the English-language literature this additional information is called the "prelude", and its compact representation is an interesting problem in its own right, which I will not go into here. For those interested, I leave a link to the article [8], which discusses this issue in detail.
An attentive reader may notice that the variablelength codes presented in Table 2 and the Huffman codes in Table 1 do not coincide either in the lengths of the codes or in their bit representations (see Table 3).
Table 3. Comparison of the results of building minimum-redundancy codes.
Input Symbol | Huffman Code Length | Huffman Bit Code | Canonical Code Length | Canonical Bit Code
"H" | 6 | 111010 | 6 | 000000
"I" | 7 | 1110111 | 6 | 000001
"J" | 7 | 1110110 | 5 | 00001
"E" | 5 | 11100 | 5 | 00010
"F" | 5 | 11011 | 5 | 00011
"G" | 5 | 11010 | 5 | 00100
"D" | 4 | 1100 | 5 | 00101
"C" | 4 | 1111 | 4 | 0011
"B" | 2 | 10 | 2 | 01
"A" | 1 | 0 | 1 | 1
Applying the Huffman algorithm will always produce code with minimal redundancy, but not every possible code with minimal redundancy can be obtained by applying the Huffman algorithm. Therefore, the title of this article is, strictly speaking, incorrect.
For both the canonical and the Huffman codes shown in Table 3, it is easy to check that the Kraft–McMillan inequality holds (the sum equals 1 in both cases) and that the average code length is the same: 140/55 ≈ 2.55 bits per character.
Data Encoding/Decoding with Variable Length Codes
Now let’s focus on the issue of data encoding (compression) using constructed canonical codes.
To do this, let us return to our example. Let the bit-code length of each character be stored in the array CodeLen[], and the bit representation of each code in the array CodeWord[]. Each element of CodeWord[] is an unsigned integer in which the code bits are right-aligned (Table 4).
Table 4. Representation of variablelength codes for encoding.
Input Symbol | CodeLen[] | Bit Code | CodeWord[]
"H" | 6 | 000000 | 0
"I" | 6 | 000001 | 1
"J" | 5 | 00001 | 1
"E" | 5 | 00010 | 2
"F" | 5 | 00011 | 3
"G" | 5 | 00100 | 4
"D" | 5 | 00101 | 5
"C" | 4 | 0011 | 3
"B" | 2 | 01 | 1
"A" | 1 | 1 | 1
Then the procedure for encoding a data block D of n characters with the constructed variable-length codes looks like this:
1 for i ← 0 to n−1
2 do ⇨ take the next character from the input data array D
3    s ← D[i]
4    l ← CodeLen[s]
5    c ← CodeWord[s]
6    PutBits(c, l)
The encoding procedure above uses the procedure PutBits(c, l), which writes the l least significant bits of the unsigned integer c to the output data stream.
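A minimal sketch of this encoding loop in Python, with the code tables taken from Table 4 and PutBits(c, l) modelled by appending the bits as text (a string of '0'/'1' characters stands in for the output bit stream):

```python
# code tables from Table 4 (right-aligned code bits)
CodeLen = {"A": 1, "B": 2, "C": 4, "D": 5, "E": 5,
           "F": 5, "G": 5, "H": 6, "I": 6, "J": 5}
CodeWord = {"A": 1, "B": 1, "C": 3, "D": 5, "E": 2,
            "F": 3, "G": 4, "H": 0, "I": 1, "J": 1}

def encode(data):
    """Encode a string of input characters; PutBits(c, l) is modelled
    by appending the l least significant bits of c as text."""
    out = []
    for s in data:
        l, c = CodeLen[s], CodeWord[s]
        out.append(format(c, "0{}b".format(l)))  # PutBits(c, l)
    return "".join(out)
```

For instance, the input "ABC" becomes the bit string 1 + 01 + 0011.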
To perform decoding, we transform the code table as follows. First, we form an array L_code[], each element of which is an unsigned L-bit integer formed by aligning the bit code with the left (most significant) bit boundary; here L is the maximum code length.
For our example, the array L_code[] is presented in Table 5.
Table 5. Representation of variablelength codes during decoding.
Input Symbol | CodeLen[] | Bit Code | L_code[]
"H" | 6 | 000000 | 0
"I" | 6 | 000001 | 1
"J" | 5 | 00001 | 2
"E" | 5 | 00010 | 4
"F" | 5 | 00011 | 6
"G" | 5 | 00100 | 8
"D" | 5 | 00101 | 10
"C" | 4 | 0011 | 12
"B" | 2 | 01 | 16
"A" | 1 | 1 | 32
Next, from the array L_code[] we select, in order, the first element for each distinct code length and form the array FirstCode[] (Table 6). If there are no codes of some length (in our example, of length 3), the value for the next smaller code length is used.
Table 6. An auxiliary array for decoding.
Bit-Code Length | FirstCode[]
6 | 0
5 | 2
4 | 12
3 | 16
2 | 16
1 | 32
0 | 64
The value FirstCode[0] is needed for the correct operation of the code-length search when decoding the next encoded character. We will return to this algorithm a little later; for now, here is the formula for computing FirstCode[0]:

FirstCode[0] = FirstCode[l_min] + (1 << (L − l_min)),

where l_min is the minimum code length and "<<" is the left bit-shift operation.
Next, we form the data structure DecTable[][], which for our example looks like this:
Here, offset is the ordinal index of a character among the characters having the same code length.
The decoding procedure will look like this:
1 for i ← 0 to n−1
2 do ⇨ read L bits from the encoded data stream
3    Buffer ← ShowBits(L)
4    ⇨ determine the code length for the next character s
5    for l ← L downto 1
6    do if FirstCode[l] ≤ Buffer < FirstCode[l−1]
7          then goto line 8
8    ⇨ decode character s from the code length found
9    offset ← (Buffer − FirstCode[l]) >> (L − l)
10   s ← DecTable[l][offset]
11   ⇨ remove l bits from the encoded data stream
12   RemoveBits(l)
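The same decoding loop can be sketched in Python, with FirstCode and DecTable taken from Table 6 and ShowBits/RemoveBits modelled with string slicing over a '0'/'1' string (the function name `decode` is illustrative):

```python
L = 6
# FirstCode[l] for l = 0..6, as in Table 6
FirstCode = [64, 32, 16, 16, 12, 2, 0]
# DecTable[l][offset]: symbols in offset order for each code length
DecTable = {6: "HI", 5: "JEFGD", 4: "C", 2: "B", 1: "A"}

def decode(bits, count):
    """Decode `count` characters from a '0'/'1' string; ShowBits and
    RemoveBits are modelled with string slicing (illustrative only)."""
    out, pos = [], 0
    for _ in range(count):
        buf = int(bits[pos:pos + L].ljust(L, "0"), 2)   # ShowBits(L)
        for l in range(L, 0, -1):                       # linear search (lines 5-7)
            if FirstCode[l] <= buf < FirstCode[l - 1]:
                break
        offset = (buf - FirstCode[l]) >> (L - l)
        out.append(DecTable[l][offset])
        pos += l                                        # RemoveBits(l)
    return "".join(out)
```

Decoding the bit string 1010011 produced for "ABC" above recovers "ABC"; empty intervals such as length 3 (FirstCode[3] = FirstCode[2]) are never selected by the search.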
The decoding procedure above can be sped up by replacing the linear search for the code length (lines 5 – 7) with a binary search:
1 for i ← 0 to n−1
2 do ⇨ read L bits from the encoded data stream
3    Buffer ← ShowBits(L)
4    ⇨ determine the code length for the next character s
5    left ← 0, right ← L + 1
6    while (right − left) > 1
7    do
8       middle ← (right + left) / 2
9       if FirstCode[middle] ≤ Buffer
10         then
11            right ← middle
12         else
13            left ← middle
14   ⇨ decode character s from the code length found
15   offset ← (Buffer − FirstCode[right]) >> (L − right)
16   s ← DecTable[right][offset]
17   ⇨ remove right bits from the encoded data stream
18   RemoveBits(right)
Returning to the description of the encoding procedure, note that the total number of bits of the encoded data need not be a multiple of 8. This means that when writing the encoded data to memory or to a file on disk, we have to pad it with extra bits (e.g., zero bits) to obtain a whole number of bytes.
However, these extra bits must be ignored during decoding. The decoding procedure described above assumes that we transmit the original data size (the length n of the unencoded data) together with the encoded information.
It should be noted that this is not the only way to ensure correct decoding. Another option is to add one more special character to the input alphabet (let us call it "EndOfData") with an assumed occurrence count of 1. A prefix code is then built for EndOfData along with the others; we place this code at the end of the encoded data, and during decoding its appearance stops the decoding procedure. This method of stopping decoding has one drawback, which we will discuss later.
Implementation issues
Now I propose to move on to the practical plane and consider the problems of implementing the previously described algorithms for real computing systems.
When describing the procedure for constructing bit codes from the found lengths, we assumed that the integers B[i] used to store the bit representations of the codes have a sufficient bit width for this.
Since we initially agreed that the input characters are bytes, the input alphabet consists of 256 characters. The characters of such an alphabet may be distributed so unevenly that constructing prefix codes yields two codes of 255 bits each. In practice, other coding methods, such as arithmetic coding, are used for such distributions [1].
In the worst case, then, we must be able to store 255-bit codes.
Further, the bit width of integers is determined by the word size of the computing system. For the laptop on which I wrote this text, it is 64 bits. To work with bit codes longer than 64 bits, special algorithms implementing the basic operations on such codes would be required, which in most cases slows the encoding algorithms down catastrophically.
To solve this problem, when implementing an encoding algorithm, we can sacrifice the compression ratio to some extent in exchange for providing a high encoding/decoding speed. Let’s look at two main approaches to solving this problem.
Apply an algorithm to limit the maximum length of code.
The input to this algorithm is the array A containing the code lengths of each of the n characters of the input alphabet.
1  ⇨ count the number of characters with identical code lengths
2  for i ← 0 to n−1
3  do m[i] ← 0
4  for i ← 0 to n−1
5  do m[A[i]] ← m[A[i]] + 1
6  ⇨ recompute the number of codes so that no length exceeds the required L_restrict
7  for i ← L downto L_restrict+1
8  do while m[i] > 0
9     do
10       j ← i − 1
11       do
12          j ← j − 1
13       while m[j] ≤ 0
14       m[i] ← m[i] − 2
15       m[i−1] ← m[i−1] + 1
16       m[j+1] ← m[j+1] + 2
17       m[j] ← m[j] − 1
18 ⇨ reassign code lengths to the characters of the input alphabet
19 n ← 0
20 for i ← L_restrict downto 1
21 do
22    k ← m[i]
23    while k > 0
24    do
25       A[n] ← i
26       n ← n + 1
27       k ← k − 1
The above algorithm for limiting the maximum code length consists of 3 steps [9]. In the first step (lines 1 – 5), the number of codes of each length k = 1, …, L is counted, where L is the maximum code length. The result is written to the array m[] (just as at the beginning of the procedure for generating bit codes from their lengths).
In the second step (lines 6 – 17), the numbers of codes are recomputed so that the code lengths of all characters of the input alphabet become less than or equal to the required value L_restrict.
Since two characters of the input alphabet (two leaves of the binary tree) share a single parent vertex, at each iteration of the inner loop (lines 8 – 17) we take a pair of characters whose code length is i > L_restrict; removing this pair turns its parent into a leaf with code length i − 1. We then look for a character with a code length j less than i − 1 and attach the removed pair beneath it, so that this character descends to depth j + 1 and one character of the pair takes the neighbouring slot at that depth. This sounds rather confusing, so let us look at this stage for our example, assuming L_restrict = 4.
The algorithm operates only on the numbers of characters having each code length, so while it runs there is no need to bind each leaf to a specific character of the input alphabet. The binding is redone in the third step (lines 18 – 27).
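A Python sketch of the three steps, run on the code lengths from our example with L_restrict = 4 (the function name and the list-based bookkeeping are illustrative):

```python
def limit_lengths(A, L_restrict):
    """Limit code lengths to L_restrict (a sketch of the three-step
    procedure above, after [9]).  A holds the code lengths; returns
    the adjusted lengths, longest first."""
    L = max(A)
    m = [0] * (L + 1)
    for l in A:                          # step 1: count codes per length
        m[l] += 1
    for i in range(L, L_restrict, -1):   # step 2: eliminate lengths > L_restrict
        while m[i] > 0:
            j = i - 1
            while True:                  # find the next shorter non-empty length
                j -= 1
                if m[j] > 0:
                    break
            m[i] -= 2                    # a pair of leaves at depth i is removed,
            m[i - 1] += 1                # their parent becomes a leaf,
            m[j + 1] += 2                # and the pair re-attaches below depth j
            m[j] -= 1
    out = []                             # step 3: reassign lengths to symbols
    for i in range(L_restrict, 0, -1):
        out.extend([i] * m[i])
    return out
```

The resulting set of lengths still satisfies the Kraft–McMillan inequality, so prefix codes of these lengths can again be built with the canonical-code procedure.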
The correct operation of the above algorithm is guaranteed by the condition

L_restrict ≥ ⌈log2 n⌉,

where n is the size of the input alphabet.
In essence, this constraint is the minimum possible height of a binary tree whose number of leaves equals the number of characters of the input alphabet. For our example, such a tree would look like this:
It would be impossible to apply the described algorithm to such a tree, since at some step there would be no leaves left that could be replaced by a parent vertex for two leaves with code length 4.
Use escape encoding.
Since the rare characters of the input alphabet are assigned longer codes than the frequently occurring characters, one way to limit the maximum length of the resulting codes is to encode each character from a group of rare characters as a pair of codes: the code of an escape pseudo-character followed by a fixed-length code.
Here the escape pseudo-character is assigned a prefix code based on an occurrence count equal to the sum of the occurrence counts of the characters in the rare group, and the fixed-length code uniquely identifies a character within this group. In the literature, this scheme is called escape coding (coding using an escape pseudo-character).
It should be noted that escape coding is used when the size of the input alphabet is large enough that constructing variable-length codes for every character is guaranteed to produce excessively long codes.
For example, the JPEG standard requires the encoding of integers from a certain range, and each value is encoded by a pair of codes: a prefix code that encodes the length of the run of zero values preceding the next non-zero value, together with the number of bits required to uniquely decode that non-zero value; and an additional code whose length equals that number of bits. A more detailed description of this encoding scheme can be found in [10]. The English there is quite simple, and the author explains the essence of the algorithms used in the JPEG standard in plain words.
Let us return to our problem of limiting the maximum code length. We replace the characters whose code lengths exceed the required value L_restrict with pairs of codes: a variable-length escape code followed by a fixed-length code. Since we agreed that the input characters are bytes, the size of the fixed-length code can be 8 bits.
Let us look at this procedure for limiting the maximum code length for our example. Assume, as in the previous case, that L_restrict = 4. Then the encoding table looks like this:
Table 7. An encoding scheme using an escape character.
Input Symbol | CodeLen[] | CodeWord[] | FixedCodeWord[]
"H" | 4 | 0011 | 111
"I" | 4 | 0011 | 110
"J" | 4 | 0011 | 101
"E" | 4 | 0011 | 100
"F" | 4 | 0011 | 011
"G" | 4 | 0011 | 010
"D" | 4 | 0011 | 001
"C" | 4 | 0011 | 000
"B" | 2 | 01 | –
"A" | 1 | 1 | –
Given the number of characters encoded via the escape character (eight), the length of the fixed code is 3 bits, which is enough to uniquely decode every character of the group.
The procedure for encoding a data block D of n characters will look like this:
1 for i ← 0 to n−1
2 do ⇨ take the next character from the input data array D
3    s ← D[i]
4    if s ∈ the group of characters encoded via the escape character
5       then ⇨ put the code corresponding to the escape character into the output bit stream
6            l ← CodeLen[s]
7            c ← CodeWord[s]
8            PutBits(c, l)
9            ⇨ put the fixed-length code into the output bit stream
10           l ← FixedCodeLen
11           c ← FixedCodeWord[s]
12           PutBits(c, l)
13      else ⇨ encode the character directly with its variable-length code
14           l ← CodeLen[s]
15           c ← CodeWord[s]
16           PutBits(c, l)
Here, FixedCodeLen is the predefined length of the fixedlength code.
The decoding procedure follows naturally from the encoding procedure: as soon as the code corresponding to the escape character appears in the input encoded stream, FixedCodeLen more bits are read to uniquely decode the character of the input alphabet.
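A sketch of this encoding procedure in Python, with the codes of Table 7 hard-coded and the output bit stream modelled as a '0'/'1' string instead of PutBits calls:

```python
# codes from Table 7; the escape pseudo-character's code is "0011"
VarCode = {"A": "1", "B": "01"}
FixedCode = {"H": "111", "I": "110", "J": "101", "E": "100",
             "F": "011", "G": "010", "D": "001", "C": "000"}
ESC = "0011"

def encode_escape(data):
    """Encode a string with the escape scheme of Table 7; the output
    bit stream is modelled as a '0'/'1' string."""
    out = []
    for s in data:
        if s in FixedCode:        # rare character: escape code + fixed code
            out.append(ESC + FixedCode[s])
        else:                     # frequent character: its own prefix code
            out.append(VarCode[s])
    return "".join(out)
```

For example, "H" is emitted as the 4-bit escape code 0011 followed by the 3-bit fixed code 111.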
The encoding scheme above trades some compression for a limited maximum code length and, with it, faster encoding/decoding. To increase the compression ratio without sacrificing speed, more than one escape character can be used, and the input data can be preprocessed before encoding. But I would rather not go deeper into this topic here, since in my opinion it deserves a separate note, or even a whole series of articles.
In conclusion, it should be noted that limiting the maximum length of the code can help make decoding fast enough.
Fast decoding
As a reminder, L is the maximum code length. For decoding we can then build a one-dimensional lookup table of 2^L entries, which speeds up the decoding procedure.
For our example, such a onedimensional table would look like this:
Table 8. An example of a onedimensional mapping table for fast decoding.
L-bit Code | Actual Code Length | Input Symbol
000000 | 6 | "H"
000001 | 6 | "I"
000010 | 5 | "J"
000011 | 5 | "J"
000100 | 5 | "E"
000101 | 5 | "E"
000110 | 5 | "F"
000111 | 5 | "F"
001000 | 5 | "G"
001001 | 5 | "G"
001010 | 5 | "D"
001011 | 5 | "D"
001100 | 4 | "C"
001101 | 4 | "C"
001110 | 4 | "C"
001111 | 4 | "C"
010000 | 2 | "B"
010001 | 2 | "B"
… | … | …
011110 | 2 | "B"
011111 | 2 | "B"
100000 | 1 | "A"
… | … | …
111111 | 1 | "A"
The L-bit code read from the stream serves as the address in the one-dimensional lookup table, from which the actual code length and the decoded symbol are obtained.
In this case, the decoding procedure is as follows:
1 for i ← 0 to n−1
2 do ⇨ read L bits from the encoded data stream
3    Buffer ← ShowBits(L)
4    ⇨ determine the code length and decode the next character
5    s ← FastDecSymbolTable[Buffer]
6    l ← FastDecLenTable[Buffer]
7    ⇨ remove l bits from the encoded data stream
8    RemoveBits(l)
Here n is the length of the unencoded data, and FastDecSymbolTable and FastDecLenTable are one-dimensional arrays containing the decoded character and the corresponding code length.
In this way, the computational complexity of decoding one character is reduced from O(log L) (for decoding with binary search) to O(1).
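A sketch of building the lookup tables of Table 8 and the fast decoding loop in Python (bit-stream I/O is again modelled on a '0'/'1' string):

```python
L = 6
# canonical codes from Table 2: (symbol, code length, code value)
codes = [("H", 6, 0), ("I", 6, 1), ("J", 5, 1), ("E", 5, 2), ("F", 5, 3),
         ("G", 5, 4), ("D", 5, 5), ("C", 4, 3), ("B", 2, 1), ("A", 1, 1)]

# fill the 2^L-entry tables: every L-bit value starting with a given
# code maps to that code's symbol and actual length (Table 8)
FastDecSymbolTable = [None] * (1 << L)
FastDecLenTable = [0] * (1 << L)
for sym, l, c in codes:
    lo = c << (L - l)                    # first L-bit value with this prefix
    for buf in range(lo, lo + (1 << (L - l))):
        FastDecSymbolTable[buf] = sym
        FastDecLenTable[buf] = l

def decode_fast(bits, count):
    """Decode `count` characters with one table lookup per character;
    ShowBits/RemoveBits are modelled with string slicing."""
    out, pos = [], 0
    for _ in range(count):
        buf = int(bits[pos:pos + L].ljust(L, "0"), 2)  # ShowBits(L)
        out.append(FastDecSymbolTable[buf])
        pos += FastDecLenTable[buf]                    # RemoveBits(l)
    return "".join(out)
```

Every one of the 2^L table entries is covered exactly once, because the intervals of L-bit values beginning with the different prefix codes partition the whole range.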
The size of the FastDecSymbolTable and FastDecLenTable arrays is 2^L entries each. If L is large enough, these arrays become huge; in the worst case they will not fit into the RAM of the computing system, and even if they do, the actual decoding speed will suffer from constant accesses to RAM. The fast decoding procedure above therefore makes sense only if the FastDecSymbolTable and FastDecLenTable arrays fit into the processor cache.
In conclusion, I would like to note that such important issues as the applicability of data coding using variablelength codes, probability modeling in data coding, as well as general issues of the theory of entropy coding have been left out of the scope of this article. I consciously tried not to touch on these topics, but to focus on the algorithm for building variablelength codes and its practical implementation.
I would like to thank Ivan Solomatin for his help and valuable advice in the process of preparing this publication.
Alexei Fartukov
Head of Samsung Lab
Literature

D. Vatolin, A. Ratushnyak, M. Smirnov, V. Yukin. Data Compression Methods. The Internals of Archivers, Image and Video Compression. Dialog-MEPhI, 2003.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein. Algorithms: Construction and Analysis, 3rd Edition. Williams – 2013.

Huffman compression algorithm. Translated by MaxRokatansky. OTUS Blog. https://habr.com/ru/company/otus/blog/497566/ – 2020.

Yablonskii S. V. Introduction to Discrete Mathematics. Moscow, Nauka Publ., 1986.

A. Moffat, T. C. Bell and I. H. Witten. Lossless Compression for Text and Images. International Journal of High Speed Electronics and Systems, Vol. 08, No. 01, pp. 179–231 (https://doi.org/10.1142/S0129156497000068), 1997.

A. Moffat and J. Katajainen. InPlace Calculation of MinimumRedundancy Codes. WADS ’95: Proceedings of the 4th International Workshop on Algorithms and Data Structures, pp. 393–402 – 1995.

A. Moffat. Huffman Coding. ACM Computing Surveys. Vol. 52 Issue 4, pp. 1–35 ( https://doi.org/10.1145/3342555) – 2020.

A. Turpin and A. Moffat. Housekeeping for prefix coding. IEEE Transactions on Communications, vol. 48, no. 4, pp. 622–628, April 2000, doi: 10.1109/26.843129.

T.81: Information technology – Digital compression and coding of continuous-tone still images – Requirements and guidelines (09/92). https://www.w3.org/Graphics/JPEG/itut81.pdf

Cristi Cuturicu. A note about the JPEG decoding algorithm. http://www.opennet.ru/docs/formats/jpeg.txt – 1999.