`bytes`: The Lesser-Known Python Built-In Sequence • And Understanding UTF-8 Encoding
The `bytes` data type looks a bit like a string, but it isn't a string. Let's explore it and also look at the main Unicode encoding, UTF-8
We all have topics we feel we ought to know but never bothered learning properly. Usually, whenever I need to enter the name of a character encoding, I just type in "UTF-8" since I know it's the most common one, and then move on.
But what's UTF-8? Sounds boring…but recently, I came across this topic in my meanderings and found it fascinating. This subject is a bit more computer-science-y than the normal higher-level stuff I write about. But don't worry, I'll take good care of you as you make your way through this article.
The Lesser-Known Built-In Sequence: bytes
First, let's look at a built-in type you may not have heard of. If I ask you to name a built-in immutable sequence, you'd probably suggest a tuple or a string. But there's another one: `bytes`.
We'll get to the general meaning of "byte" in computing in a second. But first, let's focus on the data type called `bytes`. And the best way to explain `bytes` is to compare it with a string.
Let's create a string and a `bytes` object. I'm adding the suffix `_bytes` to the variable name for the `bytes` object to make it clear it's different from the string (Code Block #1).
The `bytes` literal looks very similar to the string one. You add b in front of the quotes. You're familiar with adding f in front of single or double (or triple) quotes to create an f-string. But this is different. An f-string is still a string. But the notation `b" "` creates an object with a different data type. It creates a `bytes` object (Code Block #2).
And when you display the string and `bytes` objects, they also look similar (Code Block #3).
It's just the b before the quotes that's different. However, this won't always be the case, as you'll see in examples later in this article.
They're both sequences. So you can find their length using `len()` (Code Block #4).
They're the same length...in this case. You'll see later that this isn't always so.
Are you bored? So far, `str` and `bytes` look very similar. "What's the point?" you may be thinking.
So, let's look at the differences. They're both sequences, so you can use the square brackets notation to access the first element in each object (Code Block #5).
The first element in the string `text` is the single-letter string `"H"`. But the first element in the `bytes` object `text_bytes` is the integer 72. Why 72? We'll get to this in the next section.
You can view all elements individually by converting the objects into lists (Code Block #6).
A `bytes` object is an immutable sequence of integers. But not just any integers. The elements in a `bytes` object are integers in the range 0 to 255. These are the numbers that can be represented by eight bits, where each bit is either 0 or 1. The largest 8-bit number is the number that has eight 1's in binary (Code Block #7).
The number `0b11111111` is a binary number, as shown by the `0b` at the beginning. It has eight bits that are all 1, and it's equivalent to 255 in decimal. Let's display these eight bits in an image to introduce a graphic I'll use later in this article:
And eight bits make one byte. Therefore, each element in a `bytes` object represents an integer between 0 and 255, which is an 8-bit number, which is one byte. Each element represents one byte of data.
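Here's a short sketch to make the "immutable sequence of integers between 0 and 255" idea concrete. The variable names are mine and only there for illustration. It builds a `bytes` object from a list of integers, then shows what happens with a value that doesn't fit in one byte and with an attempt to change an element:

```python
# Build a bytes object directly from integers in the range 0 to 255
hello_bytes = bytes([72, 101, 108, 108, 111])
print(hello_bytes)  # b'Hello'

# A value that needs more than eight bits is rejected
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True

# bytes is immutable, so item assignment fails
try:
    hello_bytes[0] = 74
    assignment_rejected = False
except TypeError:
    assignment_rejected = True
print(assignment_rejected)  # True
```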
Let's Start with ASCII
So, what are those numbers you get when you convert a string to a `bytes` object? They're the ASCII character codes.
Everything in a computer is 0's and 1's.
How many times have you heard this phrase and then ignored it? It's not really relevant when coding in a high-level language like Python…most of the time.
Therefore, text characters must also be represented by 0's and 1's, and ASCII is one of the encodings used to translate characters into numbers. And numbers can be represented in binary.
Let's look at the integers in `text_bytes` again (Code Block #8).
The ASCII code for uppercase H is 72. The code for lowercase e is 101. Each character in the original string is converted to its ASCII code. You can find the ASCII codes for all the characters here. ASCII is a 7-bit encoding. Therefore, it contains 128 characters, but not all of them are printable characters. Several ASCII codes are now obsolete.
And since there are only 128 ASCII characters, and their codes range from 0 to 127, they all fit within one byte of data (which has eight bits).
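If you want to check these codes yourself, here's a small sketch using the built-in `ord()` function. Strictly speaking, `ord()` returns the Unicode value of a character, but for these characters it's the same number as the ASCII code. The variable names are my own choice:

```python
text = "Hello Python!"

# Look up the code for each character in the string
codes = [ord(character) for character in text]
print(codes)

# These are the same integers you get from the bytes object
print(codes == list(text.encode("ascii")))  # True

# Every code is below 128, so each one fits in seven bits
print(all(code < 128 for code in codes))  # True

# The code for "H" shown as an 8-bit binary number
print(format(ord("H"), "08b"))  # 01001000
```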
But ASCII is Not Enough
The printable ASCII characters include:
Basic punctuation marks
The digits from 0 to 9
The uppercase and lowercase Latin letters in the English alphabet
That's quite limited. We need many other characters for modern and international texts. I offer a phrase from my native Maltese as evidence, which includes several characters not present in ASCII:
Għandna bżonn ħafna iżjed ittri.
Unicode is the standard used to represent the wider set of characters—currently, it includes nearly 150,000 characters! The first 128 Unicode characters match the ASCII characters. But how can we represent the remaining ones? There are only 128 more numbers left in a single byte (128-255), so we need more bytes. And we need to find a way of representing the Unicode codes using one or more bytes.
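Here's a quick sketch that uses the Maltese phrase to show where ASCII runs out. I'm writing the non-ASCII letters as `\u` escapes so the example doesn't depend on how your editor stores them. That's my choice for this sketch and not something you need in everyday code:

```python
# "Għandna bżonn ħafna iżjed ittri." with ħ and ż written as escapes
phrase = "G\u0127andna b\u017conn \u0127afna i\u017cjed ittri."
print(phrase)

# The phrase contains characters outside ASCII
print(phrase.isascii())  # False

# Collect the characters whose Unicode value is above 127
non_ascii = {
    character: ord(character) for character in phrase if ord(character) > 127
}
print(non_ascii)  # {'ħ': 295, 'ż': 380}

# Trying to use ASCII for this phrase fails
try:
    phrase.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```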
The most common encoding for Unicode characters is UTF-8.
Let's convert some strings with non-ASCII characters to `bytes` using the UTF-8 encoding. You can't use the `bytes` literal (putting b in front of quotes) with non-ASCII characters. But there are other options (Code Block #9).
You can use either the `bytes()` constructor or the string method `.encode()`. Both return the same `bytes` object. The calls in this example both have the encoding `"utf-8"` included as an argument. However, the `.encode()` method doesn't need an encoding argument if you want to use UTF-8, which is the default. The string method gives more flexibility, so it's the preferred option.
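Here's a sketch that checks the claims in the last few paragraphs. Since a `bytes` literal with a non-ASCII character is a syntax error, I'm passing the source text to `compile()` so the error can be caught while the code runs. That trick, and the variable names, are just for this illustration:

```python
word = "Caf\u00e9"  # "Café", with é written as an escape

# A bytes literal containing é is rejected when Python compiles it
literal_source = 'b"' + word + '"'
try:
    compile(literal_source, "<example>", "eval")
    literal_rejected = False
except SyntaxError:
    literal_rejected = True
print(literal_rejected)  # True

# The constructor and the method give the same result,
# and .encode() uses UTF-8 when you don't pass an argument
from_constructor = bytes(word, "utf-8")
from_method = word.encode()
print(from_constructor == from_method)  # True

# A bytes literal is fine if you write the non-ASCII bytes as escapes
print(from_method == b"Caf\xc3\xa9")  # True
```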
Let's look at the `text_bytes` object (Code Block #10).
The first three characters within the quotes are `Caf`. These are ASCII characters, and therefore, they're displayed directly. However, the lowercase é, which is e with an acute accent, isn't an ASCII character.
The rest of the `bytes` object shows `\xc3\xa9`. This part of the output shows two bytes. The first byte is `\xc3`, which represents the hexadecimal number c3. This is 195 in decimal. The second byte, `\xa9`, represents the hexadecimal number a9, which is 169 in decimal.
If you need a short refresher on hexadecimal and binary numbers, go to the appendix near the end of the article.
Therefore, the four-letter word Café is converted to a `bytes` object of length five (Code Block #11).
You can also display all the integers in `text_bytes` by converting to a list (Code Block #12).
The first three bytes represent the first three characters C, a, and f. These are single bytes containing the characters' ASCII codes, which each fit within a single byte. The last two bytes combined represent the last character, é.
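If you'd like to see the hexadecimal and decimal views of these five bytes side by side, here's a small sketch. It uses the `.hex()` method and the `bytes.fromhex()` class method, which the article hasn't needed so far, so treat it as an optional extra:

```python
text_bytes = "Caf\u00e9".encode("utf-8")  # "Café"

# All five bytes as two-digit hexadecimal numbers
print(text_bytes.hex(" "))  # 43 61 66 c3 a9

# The same bytes as decimal integers
print(list(text_bytes))  # [67, 97, 102, 195, 169]

# Hexadecimal literals are ordinary integers in Python
print(0xc3, 0xa9)  # 195 169

# You can also go from hexadecimal digits back to a bytes object
print(bytes.fromhex("c3a9"))  # b'\xc3\xa9'
```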
But what do the integers 195 and 169 represent?
The UTF-8 Encoding Rules
So, back to 1's and 0's. The accented letter é is represented by two bytes. The value of the first of these bytes is c3 in hexadecimal, which is 195 in decimal. Let's convert this to binary (Code Block #13).
Here's how to understand a byte in UTF-8 encoded data. Look for the first 0 value, starting from the highest bit:
The bits up to and including this first 0 contain information about the role of this byte in the set of bytes making up a character. So, we'll treat these three bits differently from the rest:
There are two 1's ahead of the first 0. This 110 pattern indicates that this byte is the first in a group of two bytes that combine to represent a single Unicode character. To summarise, the starting 110 pattern in this 8-bit binary number tells us the following:
This is the first of a group of bytes that represent a single character.
There are two bytes in this group of bytes.
Therefore, we already know that the next byte also belongs to the same character. We'll get back to this byte later.
Let's look at the second byte, which is a9 in hexadecimal or 169 in decimal. Here's the same number in binary (Code Block #14).
As with the previous byte, you can look for the first 0 and group the bits up to and including the first zero together:
The leading 10 pattern indicates that this byte is not the start of a new character but a continuation. It's part of a group of bytes but it's not the first byte in the group. And we already know that this group of bytes started with the previous byte, which has a value of 195. This leading 10 pattern confirms this.
Right, so some of the bits in each of the two bytes give us information about how to group bytes together. So far, we know these two bytes belong together to represent a single character.
But how do we figure out what character this is?
Let's look at the remaining bits in each byte, which I'm showing using orange in the diagram below. These are the bits that carry data about the character itself:
Let's create a single binary number by combining the data bits from the first byte with the data bits from the second byte. The data bits in the first byte (shown in orange) are 00011, and the data bits from the second byte are 101001. Combining them gives 00011101001. And what's this number in decimal? Code Block #15 shows it's 233.
Here's a list of some of the Unicode characters. If you look up the character with a decimal value of 233, you'll find it's the character é. You can also find the character corresponding to a Unicode value directly in Python (Code Block #16).
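The steps for é can also be sketched with bit masks and shifts, so Python does the bookkeeping for you. This is one way of writing it, with variable names I've made up for the example. The masks keep only the data bits, and the shift makes room for the six data bits from the second byte:

```python
first_byte = 0xC3   # 0b11000011
second_byte = 0xA9  # 0b10101001

# Keep the five data bits that follow the leading 110 pattern
first_data = first_byte & 0b00011111
print(format(first_data, "05b"))  # 00011

# Keep the six data bits that follow the leading 10 pattern
second_data = second_byte & 0b00111111
print(format(second_data, "06b"))  # 101001

# Shift the first group left by six places and append the second group
code_point = (first_data << 6) | second_data
print(format(code_point, "011b"))  # 00011101001
print(code_point)  # 233
print(chr(code_point))  # é
```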
你好 • Another Example
In 2020, when we all spent plenty of time at home, I started learning Mandarin, so I now have the Pinyin keyboard installed on my computer–I need to find time to pick up my Mandarin studies again. But since I can easily write 你好, I'll use this as an example. I hope you're all just as fluent in Mandarin!
You can copy and paste these characters if you can't write them easily:
你
好
Let's convert this two-character string into `bytes` using UTF-8 (Code Block #17).
There are six bytes in `greeting_bytes`, even though there are only two characters in the string. You can tell the bytes apart since each one starts with `\x`, which indicates an escape sequence representing a hexadecimal number.
Let's start with the first byte, which has the value e4. You can convert this to decimal and binary (Code Block #18).
The leading pattern up to the first 0 is 1110. This tells us this byte is the start of a group of three bytes since there are three 1's.
The first character is represented by the first three bytes out of the six in `greeting_bytes`. Let's look at all of these first three bytes, which have values e4, bd, and a0 in hexadecimal:
The second and third bytes start with 10, confirming they're not the beginning of a new character but part of the group of three bytes representing a single character. When you combine the data bits (shown in orange) in all three bytes, you get the following (Code Block #19).
And 20320 is the Unicode value for the character 你.
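Here's a sketch of the same three-byte calculation using masks and shifts, applied to both characters in the greeting. The helper function `combine_three_bytes()` is hypothetical and only exists for this example. It assumes it's given a well-formed group of three bytes:

```python
greeting_bytes = b"\xe4\xbd\xa0\xe5\xa5\xbd"

def combine_three_bytes(first, second, third):
    """Combine the data bits from a well-formed group of three bytes."""
    return (
        ((first & 0b00001111) << 12)    # four data bits after 1110
        | ((second & 0b00111111) << 6)  # six data bits after 10
        | (third & 0b00111111)          # six data bits after 10
    )

# The first three bytes represent the first character
first_code_point = combine_three_bytes(*greeting_bytes[:3])
print(first_code_point)  # 20320
print(chr(first_code_point))  # 你

# The last three bytes represent the second character
second_code_point = combine_three_bytes(*greeting_bytes[3:])
print(second_code_point)  # 22909
print(chr(second_code_point))  # 好
```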
You can also have characters that are represented by four bytes. Here's the snake (is it a python?) emoji (Code Block #20).
You can copy and paste this emoji if you can't easily find it on your computer (I rarely use emojis, so I need to look up where to find them on my computer every time!).
🐍
The pattern of bits up to the first 0 is 11110, which shows that this is the first byte in a group of four since there are four 1's.
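The article stops at the first byte of the snake emoji, but you can carry on and combine the data bits from all four bytes in the same way as before. Here's a sketch that does this with a loop. The variable names are my own:

```python
snake_bytes = b"\xf0\x9f\x90\x8d"

# Separate the first byte from the three continuation bytes
first, *continuations = snake_bytes

# The first byte has three data bits after the leading 11110 pattern
code_point = first & 0b00000111

# Each continuation byte adds six more data bits
for byte in continuations:
    code_point = (code_point << 6) | (byte & 0b00111111)

print(code_point)  # 128013
print(hex(code_point))  # 0x1f40d
print(chr(code_point))  # the snake emoji
```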
We've seen bytes with leading patterns of 10, 110, 1110, and 11110. Those starting with 10 show they're continuation bytes, and the rest show they're the start of a group of bytes. There's one more type of byte in the UTF-8 encoding: when the first 0 is the highest bit (the leftmost one), as in Code Block #21.
Bytes that start with 0 are not part of a group of bytes. They are single bytes that represent a character without the need of further bytes. They represent the ASCII characters.
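To bring all five leading patterns together, here's a minimal sketch of a decoder written by hand. The function `decode_utf8()` is hypothetical and only for illustration. It assumes the bytes are well-formed UTF-8 and doesn't perform the validation that Python's own `.decode()` method does, so use the built-in method in real code:

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes by hand (illustration only)."""
    characters = []
    index = 0
    while index < len(data):
        first = data[index]
        # The leading pattern gives the group length and the data bits
        if first >> 7 == 0b0:
            length, code_point = 1, first
        elif first >> 5 == 0b110:
            length, code_point = 2, first & 0b00011111
        elif first >> 4 == 0b1110:
            length, code_point = 3, first & 0b00001111
        elif first >> 3 == 0b11110:
            length, code_point = 4, first & 0b00000111
        else:
            # For example, a continuation byte where a group should start
            raise ValueError(f"unexpected leading byte {first:#04x}")
        # Each continuation byte contributes six data bits
        for continuation in data[index + 1 : index + length]:
            code_point = (code_point << 6) | (continuation & 0b00111111)
        characters.append(chr(code_point))
        index += length
    return "".join(characters)

# Café, 你好, and the snake emoji, written with escapes
sample = "Caf\u00e9 \u4f60\u597d \U0001f40d"
decoded = decode_utf8(sample.encode("utf-8"))
print(decoded)
print(decoded == sample)  # True
```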
Final Words
For most applications you'll ever work on, you won't need to worry about how UTF-8 or any other encoding works behind the scenes. But I like to know stuff, and I found this interesting. Hopefully, you have, too.
ASCII and UTF-8 aren't the only encodings available. There are others. But Wikipedia tells me that over 98% of all web pages use UTF-8. And if Wikipedia says so, it must be right, no?
Code in this article uses Python 3.12
Appendix: A Refresher on Hexadecimal and Binary Numbers
We use the decimal numeral system for most things in our everyday lives. We also refer to this as base 10. There are ten digits, 0 to 9, in this system. When we run out of digits, which happens when we reach 9, we start again from 1 but write it in front of the right-most digit: 10. Since the 1 is in the second position from the right, it represents ten rather than one. And it represents a hundred if it's in the third position from the right: 100.
You know all this very well, of course. But it's less obvious when we deal with other bases.
Hexadecimal is base 16, which means there are 16 digits instead of 10. We can use the digits 0 to 9 to represent the first ten digits in hexadecimal. But there's a problem: how do we represent the eleventh, twelfth, thirteenth, fourteenth, fifteenth, and sixteenth digits? In decimal, we write these values with two digits, but in hexadecimal each one needs a single symbol. So, we use the letters A to F (uppercase or lowercase, it doesn't matter):
A is ten
B is eleven
C is twelve
D is thirteen
E is fourteen
F is fifteen
I'm writing numbers as words to avoid confusion between decimal and hexadecimal.
Therefore, we can represent sixteen numbers with a single digit in hexadecimal (zero to fifteen, or 0 to F).
And what do we do if we need the next number but we've run out of digits? Same as in decimal, we start again from 1 but place it in the second position from the right: 10. In hexadecimal, this number is not ten but sixteen (Code Block #22).
You can start a number with `0x` in Python to show it's hexadecimal. Continuing the count, seventeen is 11, eighteen is 12, and so on until 1F, which is thirty-one. The next hexadecimal number is 20, which is thirty-two (two times sixteen).
The largest two-digit hexadecimal number is FF. The rightmost F represents fifteen. The second-from-right F (the leftmost one) is fifteen times sixteen, which is two hundred and forty (240 in decimal). Therefore, and I'll use decimal numbers now, we have 240 + 15, which is 255. Two digits in hexadecimal represent 256 numbers (from 00 to FF, which is 0 to 255 in decimal).
These are the numbers we can represent with a single byte, which is eight bits.
And that leads to binary.
Binary follows the same pattern as decimal and hexadecimal but is in base 2. There are only two digits: 0 and 1. When we run out of digits, which happens after just two numbers, 0 and 1, we start again but shift the digit to the left. So, 10 in binary is two, and 11 is three. But we've run out of digits again, so the next one is 100, which is four, and then 101, which is five, and so on.
Therefore, the largest three-digit binary number is 111. The rightmost 1 represents one. The second-from-right 1 (the middle one) represents one times two, which is two. The leftmost digit represents one times four, which is four. Therefore, 111 in binary is four plus two plus one, which is seven.
One byte contains eight bits. Therefore, the largest 8-bit number is 11111111, which is:
(1 x 128) + (1 x 64) + (1 x 32) + (1 x 16) + (1 x 8) + (1 x 4) + (1 x 2) + (1 x 1)
Add all those values, and you get 255.
The numbers you multiply each digit by are powers of two since binary is base 2. This is the same as when you multiply digits in a decimal number by values that increase in multiples of ten: 316 is (3 x 100) + (1 x 10) + (6 x 1).
You can represent the numbers between 0 and 255 using eight bits in binary. This is the same range as 00 to FF in hexadecimal. This is why it's common to see bytes represented as hexadecimal numbers.
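You can check the conversions in this appendix directly in Python. Here's a short sketch using the built-in `int()`, `hex()`, and `bin()` functions and format specifiers:

```python
# From hexadecimal or binary digits to an integer
print(int("FF", 16))  # 255
print(int("11111111", 2))  # 255

# From an integer to hexadecimal and binary
print(hex(255), bin(255))  # 0xff 0b11111111

# Two hexadecimal digits and eight bits for the same byte
print(f"{195:02x} {195:08b}")  # c3 11000011

# 1F is thirty-one and 20 is thirty-two
print(0x1F, 0x20)  # 31 32

# Each position is worth a power of the base
print(3 * 10**2 + 1 * 10**1 + 6 * 10**0)  # 316
eight_ones = sum(2**position for position in range(8))
print(eight_ones)  # 255
```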
Appendix: Code Blocks
Code Block #1
text = "Hello Python!"
text_bytes = b"Hello Python!"
Code Block #2
# ...
type(text)
# <class 'str'>
type(text_bytes)
# <class 'bytes'>
Code Block #3
# ...
text
# 'Hello Python!'
text_bytes
# b'Hello Python!'
Code Block #4
# ...
len(text)
# 13
len(text_bytes)
# 13
Code Block #5
# ...
text[0]
# 'H'
text_bytes[0]
# 72
Code Block #6
# ...
list(text)
# ['H', 'e', 'l', 'l', 'o', ' ', 'P', 'y', 't', 'h', 'o', 'n', '!']
list(text_bytes)
# [72, 101, 108, 108, 111, 32, 80, 121, 116, 104, 111, 110, 33]
Code Block #7
0b11111111
# 255
Code Block #8
# ...
list(text_bytes)
# [72, 101, 108, 108, 111, 32, 80, 121, 116, 104, 111, 110, 33]
Code Block #9
# ...
text_bytes = bytes("Café", "utf-8")
text_bytes
# b'Caf\xc3\xa9'
text_bytes = "Café".encode("utf-8")
text_bytes
# b'Caf\xc3\xa9'
Code Block #10
# ...
text_bytes
# b'Caf\xc3\xa9'
Code Block #11
# ...
len("Café")
# 4
len(text_bytes)
# 5
Code Block #12
list(text_bytes)
# [67, 97, 102, 195, 169]
Code Block #13
bin(195)
# '0b11000011'
Code Block #14
bin(169)
# '0b10101001'
Code Block #15
0b00011101001
# 233
Code Block #16
chr(233)
# 'é'
Code Block #17
greeting = "你好"
len(greeting)
# 2
greeting_bytes = greeting.encode()
greeting_bytes
# b'\xe4\xbd\xa0\xe5\xa5\xbd'
len(greeting_bytes)
# 6
Code Block #18
0xe4
# 228
bin(0xe4)
# '0b11100100'
Code Block #19
0b0100111101100000
# 20320
chr(20320)
# '你'
Code Block #20
"🐍".encode()
# b'\xf0\x9f\x90\x8d'
0xf0
# 240
bin(0xf0)
# '0b11110000'
Code Block #21
0b01000001
# 65
chr(65)
# 'A'
Code Block #22
0x10
# 16