Exploring Unicode

Recently, I worked on building a .env parser for Zig. One approach I considered was to iterate through each character and check whether it was a delimiter. I would later learn that this does not quite work for parsing a .env file properly.

Thinking I needed to iterate through each character to parse the files, I looked into Unicode, since I wanted to capture every value no matter which language or emoji the file used. That led to the idea of iterating not through each byte (u8) but through each perceived character (grapheme).

This is not an exhaustive list of what Unicode does and how it works. These are notes on a few high-level pieces I learned about, probably enough for most day-to-day cases and to dig deeper if needed.

What is Unicode?

Unicode is a standard, maintained by the Unicode Consortium, that assigns a numerical value to the characters of the world’s writing systems. It also defines character properties and text algorithms.

This is useful because it gives us a single way to organise characters, combinations, symbols and emoji, so that text in any language can be handled.

Fundamentally computers only deal with numbers. Before Unicode, there were many different and conflicting standards for different writing systems. One of the most widely known is ASCII, which maps 128 values to characters and in practice has no way of supporting text like this:

用户名

Unicode was presented as a solution to consolidate text representation into a single standard that essentially can represent all writing systems with space for additional symbols such as emoji and perhaps something else in the future.

What is a character?

Typically we think of a character as A, 4 or 字 or even 👍. A multi-symbol sequence like 14 is two characters, as you’d expect.

In Unicode, this user-perceived character is called a grapheme (more formally, a grapheme cluster).

To understand how this works, let’s check the length of the 👍 and 👍🏼 emoji:

irb(main):001> "👍".length
=> 1
irb(main):002> "👍🏼".length
=> 2

Note
The example is in Ruby because it counts code points for string length.

As you might have expected, the first result is 1: you see one “character” and the string has a length of 1. However, the emoji with a skin tone variant has a length of 2. So what happened? We see a single 👍🏼 and yet we get 2 as a result.

The emoji we see is one user perceived character, a Grapheme. However, the program counted 2 code points.

To summarise, Unicode text is a sequence of graphemes, each composed of one or more code points. Implementations vary: depending on the language, the length or count method on a string might count bytes, 16-bit code units (as in JavaScript), code points or graphemes.
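As a rough illustration of those different counts, here is a small Ruby sketch that measures the same string three ways. The variable names are mine, and the escapes stand for the thumbs up emoji (U+1F44D) followed by a skin tone modifier (U+1F3FC).

```ruby
# U+1F44D is the thumbs up emoji, U+1F3FC is a skin tone modifier.
thumbs_up_with_tone = "\u{1F44D}\u{1F3FC}"

byte_count       = thumbs_up_with_tone.bytesize                  # UTF-8 bytes
code_point_count = thumbs_up_with_tone.length                    # code points
grapheme_count   = thumbs_up_with_tone.grapheme_clusters.length  # user-perceived characters

puts "bytes: #{byte_count}, code points: #{code_point_count}, graphemes: #{grapheme_count}"
```

The three numbers differ because each method looks at a different layer of the same text.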

Encodings

The Unicode Standard essentially maps characters to numbers called code points, but a code point does not directly say how the character is stored in bytes. This differs from ASCII, where the letter A maps to 65 and is stored as the single byte 01000001.

The standard defines a table: each code point is a number assigned to a character. Alongside the table, each code point also carries a set of properties, such as whether it’s a letter, a combining mark, or an emoji modifier. These properties describe how it behaves and combines with others.
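To make the table and the properties a little more concrete, here is a minimal Ruby sketch. It assumes Ruby's regular expression property classes (\p{L} for letters, \p{M} for marks) are a fair stand-in for the properties described above. They expose only a small part of what the standard defines, and the variable names are only illustrative.

```ruby
# A code point is just a number; String#ord returns it for the first character.
code_point_a = "A".ord
binary_a     = code_point_a.to_s(2).rjust(8, "0")

thumbs_up       = "\u{1F44D}"
thumbs_up_label = format("U+%04X", thumbs_up.ord)  # conventional hexadecimal notation

puts "A -> #{code_point_a} (#{binary_a})"
puts "thumbs up -> #{thumbs_up.ord} (#{thumbs_up_label})"

# Some properties can be queried through \p{...} in regular expressions.
letter         = "A"
combining_mark = "\u0301"  # COMBINING ACUTE ACCENT

letter_is_letter = letter.match?(/\p{L}/)
mark_is_mark     = combining_mark.match?(/\p{M}/)
mark_is_letter   = combining_mark.match?(/\p{L}/)

puts "letter? #{letter_is_letter}, mark? #{mark_is_mark}, mark is a letter? #{mark_is_letter}"
```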

Unicode defines encodings that convert these numbers into bytes, because bytes are what computers use to handle information.

The most widely used and arguably the one worth knowing about for now is UTF-8 (Unicode Transformation Format, 8-bit) but there are two others which I’ll briefly mention later on.

UTF-8 is how we arrange the character data into bytes. We know a grapheme can have multiple code points. Each code point is encoded as 1 to 4 bytes, and in UTF-8 each byte is a code unit.

In theory to compose a grapheme, there can be a code point with 2 bytes, another with 4, another with 2 and so on. So graphemes can occupy many bytes.

However, how do we know where a code point begins and ends? In UTF-8 the leading bits of the first byte give the length of the sequence: 0 for 1 byte, 110 for 2 bytes, 1110 for 3 bytes and 11110 for 4 bytes. Every continuation byte starts with 10.
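The prefix rule can be sketched in a few lines of Ruby. The helper name utf8_sequence_length is hypothetical, and it only inspects the leading bits. It does not validate the bytes the way a real decoder would.

```ruby
# Returns how many bytes a UTF-8 sequence occupies, judged only by its first byte.
# Returns nil for a continuation byte (10xxxxxx) or any byte that cannot start a sequence.
def utf8_sequence_length(first_byte)
  if    first_byte >> 7 == 0b0     then 1
  elsif first_byte >> 5 == 0b110   then 2
  elsif first_byte >> 4 == 0b1110  then 3
  elsif first_byte >> 3 == 0b11110 then 4
  end
end

# "A", "é" (U+00E9), "字" (U+5B57) and the thumbs up emoji (U+1F44D)
["A", "\u00E9", "\u5B57", "\u{1F44D}"].each do |char|
  bits = char.bytes.map { |byte| byte.to_s(2).rjust(8, "0") }.join(" ")
  puts "#{utf8_sequence_length(char.bytes.first)} byte(s): #{bits}"
end
```

Printing the bits makes the pattern visible: one length prefix on the first byte, then 10 on every byte that follows.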

But how do we know to combine the thumbs up symbol with the skin tone one into a single grapheme? The bytes alone do not say: nothing in the encoding marks where a grapheme starts or ends.

For that, Unicode publishes the Unicode Character Database (UCD), which assigns properties to every code point, such as whether it’s an emoji modifier, and a text segmentation algorithm (Unicode Standard Annex #29) that consults those properties to determine where each grapheme begins and ends.
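Ruby ships its own implementation of grapheme segmentation, so a short sketch can show the result without implementing the algorithm. Which version of the Unicode data it uses depends on the Ruby release, so newer emoji sequences may split differently on older versions. The sample string is made up for illustration.

```ruby
# "a", thumbs up + skin tone modifier, then "e" + combining acute accent
text = "a\u{1F44D}\u{1F3FC}e\u0301"

text.each_grapheme_cluster do |cluster|
  code_points = cluster.codepoints.map { |cp| format("U+%04X", cp) }
  puts "#{cluster.inspect} -> #{code_points.join(' ')}"
end

clusters = text.grapheme_clusters
puts "#{text.length} code points, #{clusters.length} graphemes"
```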

Here’s the diagram with the previous emoji example to explain further:

[Figure: diagram showing the parts of Unicode for this example: a grapheme, its code points, and their hexadecimal and binary representations]

Like before, we have what we perceive as a single character, a grapheme which in this case has two code points. The code points are numbers often represented in hexadecimal, and here are encoded in UTF-8.

Another benefit of UTF-8 is that it is backward compatible with ASCII: the 128 ASCII characters are encoded as the same single bytes, which is really useful when reading legacy documents.
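A quick way to check the ASCII compatibility claim is to encode the same ASCII-only text both ways and compare the bytes. This is only a sketch with a made-up sample string.

```ruby
ascii_text = "KEY=value".encode("US-ASCII")
utf8_text  = "KEY=value".encode("UTF-8")

same_bytes = ascii_text.bytes == utf8_text.bytes
puts "identical bytes: #{same_bytes}"

# Bytes written as ASCII can be relabelled as UTF-8 without any conversion.
relabelled = ascii_text.dup.force_encoding("UTF-8")
puts "valid UTF-8: #{relabelled.valid_encoding?}"
```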

In short, UTF-8’s encoding tells us where code points begin and end by reading its bytes. However, to understand what constitutes a grapheme the UCD dataset and a segmentation algorithm are required.

Other Encodings

Aside from UTF-8 there are UTF-16 and UTF-32, but because of their trade-offs they are not as widely adopted for files and data interchange.

UTF-16 is less common but far from unpopular: Windows, Java, JavaScript, and .NET all use it internally.

It works in units of 16 bits. A 16-bit unit can only hold numbers up to 65,535, but code points go all the way up to 1,114,111. The 👍 emoji alone is code point 128,077. Characters with numbers that don’t fit in one unit get split across two, which is called a surrogate pair.
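The split into a surrogate pair is simple arithmetic, sketched below in Ruby. The helper name surrogate_pair is hypothetical and assumes its input is above U+FFFF. The last line cross-checks the result against Ruby's own UTF-16 encoder.

```ruby
# Splits a code point above U+FFFF into a UTF-16 surrogate pair.
def surrogate_pair(code_point)
  offset = code_point - 0x10000       # leaves a value of at most 20 bits
  high   = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
  low    = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
  [high, low]
end

thumbs_up = "\u{1F44D}"
pair = surrogate_pair(thumbs_up.ord)
puts pair.map { |unit| format("0x%04X", unit) }.join(" ")

# "n*" unpacks the encoded bytes as big-endian 16-bit units.
encoder_units = thumbs_up.encode("UTF-16BE").unpack("n*")
puts "matches Ruby's encoder: #{pair == encoder_units}"
```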

Because nearly everything fits in one unit, UTF-16 looks fixed-width when it isn’t. Code written assuming one unit per character passes testing and then breaks the first time an emoji shows up: in JavaScript, "👍".length is 2 because it counts 16-bit units. Separately, each 16-bit unit spans two bytes and hardware disagrees on which byte to store first, so UTF-16 files commonly start with a byte order mark (BOM) that declares the order. UTF-8 avoids the problem entirely since its units are single bytes.
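Both points can be seen from Ruby by encoding into the explicit little-endian and big-endian variants of UTF-16. This is a sketch, and the variable names are only illustrative.

```ruby
thumbs_up = "\u{1F44D}"

# Two bytes per 16-bit unit, so dividing by 2 gives the unit count JavaScript would report.
code_units = thumbs_up.encode("UTF-16LE").bytesize / 2
puts "UTF-16 code units: #{code_units}"

# The same letter, stored with the two possible byte orders.
little_endian = "A".encode("UTF-16LE").bytes
big_endian    = "A".encode("UTF-16BE").bytes
puts "LE: #{little_endian.inspect}, BE: #{big_endian.inspect}"

# The BOM is the code point U+FEFF; its byte order reveals the order of the rest of the file.
bom_little = "\uFEFF".encode("UTF-16LE").bytes
bom_big    = "\uFEFF".encode("UTF-16BE").bytes
puts "BOM LE: #{bom_little.inspect}, BOM BE: #{bom_big.inspect}"
```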

UTF-32 guarantees that every code point takes exactly 4 bytes, even if that means storing more data than necessary. The simplicity this buys stops at the code point layer: as with the other encodings, a grapheme can still span several code points, so finding graphemes still requires segmentation.
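To compare the storage cost of the three encodings, here is a small Ruby sketch. The helper name encoded_sizes is hypothetical. It uses the little-endian variants so that no BOM is counted.

```ruby
# Returns the number of bytes the text occupies in each encoding.
def encoded_sizes(text)
  %w[UTF-8 UTF-16LE UTF-32LE].map { |name| [name, text.encode(name).bytesize] }.to_h
end

samples = {
  "A"         => "A",
  "U+5B57"    => "\u5B57",     # 字
  "thumbs up" => "\u{1F44D}"
}

samples.each do |label, text|
  puts "#{label}: #{encoded_sizes(text).inspect}"
end

# Fixed width per code point does not mean fixed width per grapheme.
with_tone = "\u{1F44D}\u{1F3FC}"
puts "UTF-32 bytes for one grapheme: #{encoded_sizes(with_tone)['UTF-32LE']}"
```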

From what I understand, UTF-8 is the preferred option because every ASCII file ever written was already valid UTF-8, and every byte-oriented tool (C strings, Unix pipes, network protocols, existing parsers) kept working unchanged. Adopting UTF-16 meant converting the world; adopting UTF-8 meant converting nothing.

Conclusion

Unicode is a standard to map characters into numerical values. The UTF-8 encoding organises those values into bytes. With an additional database and segmentation algorithms we can extract graphemes.

This allows us to use a universal set of data to represent pretty much any character we like. Unicode’s code space is exactly 1,114,112 code points, of which over 150,000 are currently assigned with plenty of room for more.

This ended up being a side quest that stemmed from the parser I was building. It turned out iterating through graphemes was simply not useful for this parsing case: all the parser cares about is finding delimiters. The delimiters (=, # and newlines) are ASCII characters, which UTF-8 stores as single, unchanged bytes. Everything between delimiters can be copied through as-is, and since no byte of a multi-byte character ever looks like an ASCII byte, keys and values always come out complete regardless of the symbols used (emoji, Chinese, Latin, etc).
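The idea can be sketched in Ruby (not the Zig code from the actual parser). The helper parse_env_sketch is hypothetical and deliberately ignores quoting, escapes, export prefixes and inline comments. It only shows that scanning bytes for ASCII delimiters leaves multi-byte keys and values intact.

```ruby
# Hypothetical helper: splits "KEY=VALUE" lines by scanning bytes, not graphemes.
def parse_env_sketch(text)
  equals = "=".ord  # 0x3D
  hash   = "#".ord  # 0x23
  pairs  = {}

  text.each_line do |raw_line|
    line  = raw_line.chomp
    bytes = line.bytes
    next if bytes.empty? || bytes.first == hash  # blank line or comment

    split_at = bytes.index(equals)               # byte offset of the first "="
    next if split_at.nil?

    key   = line.byteslice(0, split_at)
    value = line.byteslice(split_at + 1, line.bytesize - split_at - 1)
    pairs[key] = value
  end

  pairs
end

input  = "用户名=张伟\n# a comment\nEMOJI=\u{1F44D}\u{1F3FC}\n"
parsed = parse_env_sketch(input)
puts parsed.inspect

# Every byte of a multi-byte character is 0x80 or above, so none can equal "=" or "#".
all_high = "用户名\u{1F44D}\u{1F3FC}".bytes.all? { |byte| byte >= 0x80 }
puts "no ASCII look-alikes: #{all_high}"
```

Slicing at the byte offset of an ASCII delimiter never cuts through a multi-byte character, which is why the keys and values stay valid UTF-8.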

Either way, it is an interesting topic and I might attempt to implement the segmentation algorithm at some point.

References & further reading

Davis, M., & Chapman, C. (Eds.). (2025). Unicode Standard Annex #29: Unicode text segmentation. The Unicode Consortium. https://www.unicode.org/reports/tr29/

Microsoft. (2024). Using byte order marks. Microsoft Learn. https://learn.microsoft.com/en-us/windows/win32/intl/using-byte-order-marks

The Unicode Consortium. (n.d.). UTF-8, UTF-16, UTF-32 & BOM [FAQ]. https://www.unicode.org/faq/utf_bom.html

The Unicode Consortium. (n.d.). Unicode Character Database. https://www.unicode.org/ucd/

The Unicode Consortium. (2025). Chapter 2: General structure. In The Unicode Standard, Version 17.0.0. https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-2/

The Unicode Consortium. (n.d.). What is Unicode? https://www.unicode.org/standard/WhatIsUnicode.html

Yergeau, F. (2003). UTF-8, a transformation format of ISO 10646 (RFC 3629). Internet Engineering Task Force. https://www.rfc-editor.org/rfc/rfc3629

UTF-8 vs UTF-16: Comparing Unicode encodings. (n.d.). Character.Codes. https://www.character.codes/learn/utf8-vs-utf16