"It works!" is The Most Dangerous Phrase in Programming • Understanding `itertools.groupby()`

"I'll skip reading the docs. I don't need them" is another one! Follow this real-life saga (!) while I was using `itertools.groupby()`

Apr 24, 2024

Be honest, you've also done this. You need a function you never used and you think to yourself:

I can figure this out. I don't need to read boring documentation.

You start tapping furiously on your keyboard, waking up the neighbours if you're using a mechanical one. And hey presto, your lips curl up into a faint smile:

It works!

The most dangerous two words in programming.

My Sorry Tale With `itertools.groupby()`

I was preparing for a course I was recording and I wanted to use itertools.groupby() as an example. I'm familiar with other functions* with the same name in other modules, such as the one in Pandas. Since the functions have the same name, surely they must be similar, right? (rookie mistake #1) So, I felt I could skip reading the docs (rookie mistake #2) and just figure things out in the REPL.

[Note for the pedants: itertools.groupby() is not a function but a class. However, we use it as a function. What matters is that it's 'callable'. This is Python–duck typing rules!]

I was using this list as an example:

All code blocks are available in text format at the end of this article • #1

As a simple example to show how groupby() works, I thought of grouping these names based on their length. As I always do in these cases, I started by calling the function and assigning whatever it returns to a variable name:

But what arguments does this function need? No, no, I'm not going to the documentation page. In my IDE, I can press ⌘P when the cursor is within the parentheses to see the function's signature:

groupby(iterable, key=None)

The first parameter name is iterable. So, I can pass my list as the first argument since that’s an iterable. The second parameter is key, which is a parameter name that pops up in other places from time to time. I'll try to pass the function name len as the second argument:

The groupby() function takes each element in names and passes it to len(). And groupby() uses the value that len() returns to create the group. I can't see the data in the REPL output, but I guess the output is an iterator. This makes sense, since functions in the itertools module return iterators rather than data structures containing the data. Let's cast this into a list:

The list contains five tuples and each tuple starts with a number. I look back at the list of names I used and I notice that the first name (mine) has seven letters, the second has three, and so on. It seems reasonable to assume that the first value in each tuple is the length of the names. This value is what I'm using to group the names.

So far, so good, I'm thinking.

But the second value in each tuple doesn't show the data. It seems that another iterator is returned within each tuple.

Therefore, itertools.groupby() returns an iterator, and each item yielded by that iterator is a tuple, which contains yet another iterator. Let's show the values for each group:

A reminder that once iterators are used up by fetching all their items, they're exhausted. Therefore, you can't use them again. So, you call itertools.groupby() again to recreate the iterator output.

And since each item yielded by the iterator output is a tuple containing two values, you can unpack the values in the for loop by assigning them to two variable names: word_length and groups. We've seen earlier that groups is another iterator. So, you cast it into a list to print it out.

Seven is linked to "Stephen", which has seven letters. Three is linked to "Bob". Four has two names, "Jane" and "Mary". Five is linked to "James" and six to "Ishaan".

This is the correct result. It's the result I was looking for–creating groups based on the number of letters in each name.

It Works!

(rookie mistake #3 – the biggest one. This is probably also rookie mistake #4, #5, and #6, and maybe more.)

"It Works!" • Famous Last Words

Right, does this bring me to the end of this article then? I experimented with a function I felt I understood. I couldn't be bothered with the docs. And I still got the result I expected.

Goodbye.

But wait a minute…

My extensive experience, knowledge, and professionalism (also known as "sheer luck") led me to read the documentation anyway. (I happen to have been talking about using the documentation to explore functions from itertools in the course I was recording.)

The first line in the documentation for itertools.groupby() is the following:

Make an iterator that returns consecutive keys and groups from the iterable.
from itertools.groupby()

The word "consecutive" is conspicuous in that first sentence. Why is it there?

A bit further down, there's this sentence:

Generally, the iterable needs to already be sorted on the same key function.
from itertools.groupby()

Why? As it happens, my memory had been jogged, as they say, by this point. But let's read on. In the next paragraph in the docs there's this:

It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function).
from itertools.groupby()

What does this mean? Let's break it down:

groupby() starts with the first item in the iterable–of course it does. In this example, this is the string "Stephen". This string is passed to len(), which is the function passed to the key parameter. The value that len("Stephen") returns is used to create a group. Therefore, after dealing with the first element in the iterable, you have one group that has one element. The group is identified by the number 7, which is the number of letters in "Stephen".
- This group will remain open, accepting new members, until another group is created.
groupby() moves on to the second item in names, which is "Bob". When this string is passed to len(), you get a different number, 3. Therefore, this item doesn't fit in the previous group, which is for names of length 7. The group of seven-letter names is closed and a new group is started, identified using the number 3. It starts with one element, "Bob".
Next, the function moves to the third item, "Jane". The group of names with three letters is closed since "Jane" doesn't contain three letters. A new group is started again. This one is for names with length 4 and includes "Jane".
The next item in the list is "Mary". The len() function returns 4 for this string. This matches the group that's currently open and accepting new members! Therefore, "Mary" is added to the group already containing "Jane".
groupby() then moves on to "James" and creates a new group with names of length 5.
And finally, "Ishaan" creates yet another group.

However, there are only two names with the same number of letters and they are, coincidentally, next to each other in the list. But what would happen if we had names of the same length that aren't consecutive? Can the groups that have been closed be "reopened"?

Let's add another name to the list of names. "Max" is now the final name in the list:

But this isn't right. There are now two groups of three-letter names. That's not what I intended. I want all three-letter names to be grouped together.

The documentation tried to warn us about this with the sentence: "It generates a break or new group every time the value of the key function changes". Once a group is closed, it's closed!

Luckily, sorting the list using the same key function before you call groupby() is all you need:

And "Bob" and "Max" are now in the same group!

Don't…

Don't assume functions with the same name in different modules are identical. (Try randint() from the random module and randint() from numpy.random.)
Don't shout out "It works!" from the rooftops too early.
Don't skip the "read the docs" step.
…and don't forget the stuff you learnt in the past. I made the same mistake last year with groupby() and I actually wrote an article about the `key` parameter that included itertools.groupby(), but I forgot what I wrote! I must be getting old…

[In case you’re wondering, the course I was recording is called Pythonic Loops. If you’re already a member of The Python Coding Place, you may have already started going through it. If you’re not yet a member, what are you waiting for?! Joking aside, just ask if you have questions about membership. I’ll always reply…]

Code in this article uses Python 3.12

You can also read my other article published today on Real Python about lazy evaluation in Python, which is relevant to this post as it deals with iterators among other things: What's Lazy Evaluation in Python?

Stop Stack

#59

The Python Coding Book is available (Ebook and paperback). This is the First Edition, which follows from the "Zeroth" Edition that has been available online for a while—Just ask Google for "python book"!
And if you read the book already, I'd appreciate a review on Amazon. These things matter so much for individual authors!

Join The Python Coding Place

If you read my articles often, and perhaps my posts on social media, too, you've heard me talk about The Python Coding Place several times. But you haven't heard me talk a lot about is Codetoday Unlimited, a platform for teenagers to learn to code in Python. The beginner levels are free so everyone can start their Python journey. If you have teenage daughters or sons, or a bit younger, too, or nephews and nieces, or neighbours' children, or any teenager you know, really, send them to Codetoday Unlimited so they can start learning Python or take their Python to the next level if they've already covered some of the basics.
Each article is the result of years of experience and many hours of work. Hope you enjoy each one and find them useful. If you're in a position to do so, you can support this Substack further with a paid subscription. In addition to supporting this work, you'll get access to the full archive of articles. Alternatively, if you become a member of The Python Coding Place, you'll get access to all articles on The Stack as part of that membership. Of course, there's plenty more at The Place, too.

Appendix: Code Blocks

Code Block #1

names = ["Stephen", "Bob", "Jane", "Mary", "James", "Ishaan"]

Code Block #2

import itertools
# # Partial line - I haven't finished writing this
output = itertools.groupby()

Code Block #3

output = itertools.groupby(names, key=len)
output
# <itertools.groupby object at 0x103e220c0>

Code Block #4

list(output)
# [(7, <itertools._grouper object at 0x103e02800>), 
#  (3, <itertools._grouper object at 0x103e02830>), 
#  (4, <itertools._grouper object at 0x103e031f0>), 
#  (5, <itertools._grouper object at 0x103e03040>), 
#  (6, <itertools._grouper object at 0x103e024d0>)]

Code Block #5

# We need to recreate the iterator returned by 'groupby()'
output = itertools.groupby(names, key=len)
for word_length, groups in output:
    print(word_length, list(groups))
   
# 7 ['Stephen']
# 3 ['Bob']
# 4 ['Jane', 'Mary']
# 5 ['James']
# 6 ['Ishaan']

Code Block #6

names = ["Stephen", "Bob", "Jane", "Mary", "James", "Ishaan", "Max"]
output = itertools.groupby(names, key=len)
for word_length, groups in output:
    print(word_length, list(groups))
   
# 7 ['Stephen']
# 3 ['Bob']
# 4 ['Jane', 'Mary']
# 5 ['James']
# 6 ['Ishaan']
# 3 ['Max']

Code Block #7

sorted_names = sorted(names, key=len)
output = itertools.groupby(sorted_names, key=len)
for word_length, groups in output:
    print(word_length, list(groups))
    
# 3 ['Bob', 'Max']
# 4 ['Jane', 'Mary']
# 5 ['James']
# 6 ['Ishaan']
# 7 ['Stephen']