How to determine text encoding using statistics
People speak countless different languages. Not only are these languages mutually incompatible, they are also remarkably hard to transpile at runtime. Unfortunately, every attempt at standardization has failed.
At least there is someone to blame for this state of affairs: God. After all, it was he who forced humanity to speak different languages over an ancient dispute about a construction project.
Humanity, however, has only itself to blame for the trouble computers have communicating with each other.
And one of the biggest problems is also the simplest: computers have never agreed on how to write letters in binary.
How letters are written in binary code
Let’s take the Latin character “A” as an example. In the American Standard Code for Information Interchange, or ASCII, it was assigned the number 65. Unicode inherited this numbering, except that in Unicode the number 65 is written in hexadecimal: U+0041. Such an entry is called a “code point” (codepoint).
Everything is quite simple so far; at least there is general consensus on which number represents “A”. But computers can’t store decimal numbers directly; they store only binary.
In the most popular character encoding, UTF-8, character number 65 (“A”) is written like this:
01000001
Only the second and the last bits are “on” (set to one). The second bit represents 64, and the last represents 1; together they add up to 65. Very simple.
Another popular encoding is UTF-16, used mainly in the world of Windows, Java and JavaScript. In UTF-16, the number 65 is written as follows:
01000001 00000000
Almost the same, except that UTF-16 uses (at least) two full bytes per character. The number 65 needs no extra bits, so the second byte stays empty.
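A quick way to check these byte sequences yourself is Python’s built-in codecs. The `as_bits` helper is just for pretty-printing and is not from any library:

```python
# Render bytes as the 8-bit binary groups used in this article.
def as_bits(data: bytes) -> str:
    return " ".join(f"{b:08b}" for b in data)

print(as_bits("A".encode("utf-8")))      # 01000001
print(as_bits("A".encode("utf-16-le")))  # 01000001 00000000
```

Here `utf-16-le` is the little-endian variant, which matches the byte order shown above; plain `utf-16` would prepend a byte order mark.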
What about other encodings? Here are just a few of the most popular:
- Win-1252 – a non-Unicode encoding, used where Western European languages are written
- KOI8 – a non-Unicode encoding, used where Cyrillic script is written
- GB18030 – a Unicode encoding, but used mainly in mainland China
- Big5 – unrelated to Unicode, widely used where traditional Chinese characters are written
- Shift_JIS – not Unicode, used in Japan
All these encodings inherit the letters of ASCII, so in all of them “A” is written like this:
01000001
Exactly the same as in UTF-8.
Very convenient. This is why the basic Western European alphabet remains readable even when the rest of a document turns into garbled chaos. Many popular encodings (UTF-16 being the exception) agree with ASCII, at least for the Latin alphabet.
So far so good. But let’s now look at a more complex symbol: the euro sign, €. The Unicode Consortium assigned it the number 8364 (U+20AC).
In UTF-8, the number 8364 is represented as follows:
11100010 10000010 10101100
Note that in UTF-8 it takes up three bytes. UTF-8 is a “variable-length” character encoding: the higher the Unicode number, the more bytes are required. (This is true of UTF-16 as well, but it comes up less often.)
However, in UTF-16 the number 8364 is encoded completely differently:
10101100 00100000
Win-1252 does not follow the Unicode standard. In it, the euro sign has the number 128, and 128 is written like this:
10000000
That is, a single “on” bit, worth exactly 128.
And this is where the problems begin. As soon as we leave the calm streets of the English alphabet, encodings quickly become chaotic.
The € sign cannot be represented in KOI8 at all.
In GB18030 the € character is encoded as follows:
10100010 11100011
In Big5 the € symbol looks like this:
10100011 11100001
In Shift JIS it is:
10000000 00111111
Completely different and completely incompatible. If we blindly assume UTF-8, we get complete nonsense.
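You can reproduce this zoo with Python’s standard codecs. A sketch, with the caveat that the exact bytes (and whether the encode succeeds at all) depend on the codec variant your Python ships; some legacy codecs simply refuse the character:

```python
# as_bits as defined earlier: bytes rendered as 8-bit binary groups.
def as_bits(data: bytes) -> str:
    return " ".join(f"{b:08b}" for b in data)

for codec in ["utf-8", "utf-16-le", "cp1252", "gb18030", "big5", "shift_jis", "koi8-r"]:
    try:
        print(f"{codec:>9}  {as_bits('€'.encode(codec))}")
    except UnicodeEncodeError:
        print(f"{codec:>9}  cannot represent €")  # e.g. KOI8-R has no euro sign
```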
How can I determine which encoding is being used?
Some formats specify the encoding themselves; JSON, for example, requires UTF-8. This makes life much easier: if you know the data is JSON, it should be encoded in UTF-8.
In other cases, the encoding can be passed along separately. HTTP lets you put it in the Content-Type header:

Content-Type: text/html; charset=ISO-8859-1
And some formats have internal ways to specify the encoding. For example, some text files have a header:
# -*- encoding: utf-16be -*-
However, this is a bit of a chicken-and-egg problem: to find the header, we first have to read the file somehow.
But what if the data carries no label at all? Or the label is wrong? As we’ll see below, that happens quite often.
CSV files, in particular, have no internal way of declaring their encoding. You can’t put a comment in them – there is no room for one, and most CSV readers would fail to parse the file anyway. And many popular tools that work with CSV files (MS Excel among them) do not use UTF-8.
What then? The answer is statistics.
Determining encoding using statistics
There are two basic strategies for determining the encoding of an unlabelled string of text:
- At the byte level
- At the character level
Most implementations start at the byte level and optionally continue at the character level.
▍ Byte-level heuristics
At the byte level everything is quite simple: just look at the bytes and see whether they look like a particular character encoding.
For example, as I said above, UTF-16 uses (most often) two bytes per character. For Latin-script text (English, say) that means lots of “empty” second bytes. Conveniently, markup languages lean heavily on ASCII characters (<, >, [, ] and so on), even when the document itself is not in a Latin script. If a string of text contains many empty second bytes, the chances are it is UTF-16.
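As a minimal sketch of that heuristic (the function name and the 40% threshold are illustrative choices, not taken from any particular library):

```python
def looks_like_utf16_le(data: bytes, threshold: float = 0.4) -> bool:
    """Guess little-endian UTF-16 by counting zero "second" bytes."""
    if len(data) < 2:
        return False
    high_bytes = data[1::2]  # every second byte of each 16-bit code unit
    return high_bytes.count(0) / len(high_bytes) >= threshold

print(looks_like_utf16_le("<!doctype html>".encode("utf-16-le")))  # True
print(looks_like_utf16_le(b"<!doctype html>"))                     # False
```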
There are other signs as well. Imagine that you are a web browser and a link takes the user to a file whose first two bytes look like this:
00111100 00100001
If this were UTF-16, these two bytes would form the character ℼ, called “double-struck small pi”, number 8508 (U+213C) in Unicode. How often is that the first character of an HTML file?
Or is it far more likely the two-character sequence <! in UTF-8 encoding, with the next bytes spelling doctype> or the rest of the standard boilerplate of an HTML document?
Another clue is specific marker bytes. Terrible as it may sound, UTF-16 comes in two variants: one writes the two bytes of each character in one order, the other in the opposite order (big-endian and little-endian). To help tell them apart, the UTF-16 standard defines a byte order mark (BOM) that can be placed at the start of the text stream to indicate which variant is in use. This pair of bytes rarely occurs in other encodings, and almost never at the very beginning, so it is a good clue to what follows.
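In code, the BOM check is a simple prefix test. A sketch (the function name and the set of codecs handled are illustrative):

```python
def sniff_bom(data: bytes):
    """Return a codec name if the text starts with a byte order mark."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"       # little-endian UTF-16
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"       # big-endian UTF-16
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"       # UTF-8 with an (optional) BOM
    return None                  # no BOM: fall back to other heuristics

print(sniff_bom("hello".encode("utf-16")))  # utf-16-le on little-endian machines
```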
So the bytes alone can tell us quite a lot about the encoding. If they identify UTF-8 or UTF-16 unambiguously, our job is done.
▍ Character-level heuristics
Difficulties arise with single-byte encodings that are not Unicode. It is hard to tell Win-1252 from KOI8, for example, because both use the eighth bit that ASCII leaves unused – just for different things.
How can you tell them apart? Frequency analysis. We decode the document as if it were, say, KOI8, look at the letters that come out, and ask: “Is this really a typical distribution of letters for a Cyrillic document?”
Here’s the basic algorithm:
- Exclude every encoding already ruled out by the byte-level heuristics
- For each remaining candidate encoding X:
  - Decode the input as if it were encoded in X
  - Compare the character frequencies in the result with a known frequency table
  - Optionally, also compare pairs of letters (for example, qu) with a known frequency table
  - If the match is good enough, return X
- Otherwise, return an error
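Here is a toy version of the character-level step. The frequency table is a tiny, hypothetical excerpt, and the scoring is deliberately crude; a real detector would use full per-language tables learned from a large corpus and a proper distance measure:

```python
from collections import Counter

# Hypothetical excerpt of Russian letter frequencies (illustrative values).
RUSSIAN_FREQ = {"о": 0.110, "е": 0.085, "а": 0.080, "и": 0.074, "н": 0.067}

def frequency_score(text: str) -> float:
    """Crude fit score: how much of the text is made of expected common letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    return sum(RUSSIAN_FREQ.get(ch, 0.0) * n for ch, n in counts.items()) / len(letters)

def guess_encoding(data: bytes, candidates=("koi8-r", "cp1252")) -> str:
    best, best_score = None, 0.0
    for codec in candidates:
        try:
            decoded = data.decode(codec)
        except UnicodeDecodeError:
            continue  # ruled out: not even decodable as this codec
        score = frequency_score(decoded)
        if score > best_score:
            best, best_score = codec, score
    if best is None:
        raise ValueError("no candidate encoding fits well enough")
    return best

print(guess_encoding("привет, мир".encode("koi8-r")))  # koi8-r
```

Decoded as cp1252, the same bytes come out as accented Latin letters that score zero against the Cyrillic table, so KOI8-R wins.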
This approach can often also tell you what language a document is written in – it is how web browsers know when to offer the “Translate this page?” dialog.
Does it really work?
People are generally wary of heuristics, but the answer is yes: it works, and surprisingly well – much better than simply assuming the text is UTF-8 (which is, after all, the baseline to beat).
We probably shouldn’t be surprised that statistics works here. Statistics has a long record of success with language, from the first effective spam filters to many other things.
Heuristics are also important because people misunderstand encodings.
It might seem logical that exporting a sheet to CSV in the latest version of MS Excel would give you UTF-8, or perhaps UTF-16. But you would be wrong: by default, in most configurations, Excel saves CSV in Win-1252.
Win-1252 is a single-byte, non-Unicode encoding. It is an extension of ASCII that stuffs characters for almost every Western European language into the 128 slots freed up by ASCII’s unused eighth bit. The average Excel user has never heard of it – if they have heard of character encodings at all. In much wisdom is much sorrow.
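The consequences are easy to demonstrate. Below, “café” saved the Excel way (Win-1252) breaks a naive UTF-8 reader, while the reverse mistake silently produces mojibake:

```python
win1252_bytes = "café".encode("cp1252")        # b'caf\xe9'

try:
    win1252_bytes.decode("utf-8")              # 0xE9 is not valid UTF-8 here
except UnicodeDecodeError as err:
    print("UTF-8 decode failed:", err)

# The opposite mistake does not fail, it just mangles the text:
print("café".encode("utf-8").decode("cp1252"))  # cafÃ©
```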
Additional sources
Most encoding-detection code probably works on principles laid down by Netscape in the early 2000s; an article describing the approach survives in the Mozilla archive.
I have the clear impression that automatic text encoding detection is a special case of Postel’s law: “be conservative in what you do, be liberal in what you accept from others.” I always took Postel’s law to be true, but more and more doubts keep creeping in. Perhaps the automatic encoding detection in my csvbase should be made part of the user interface, rather than just a pre-selected item in a drop-down list.