Why Can't I Just Use A List? • Understanding NumPy's `ndarray` (A NumPy for Numpties article)
From Python's native lists to NumPy's `ndarray` data type, with a glimpse at the built-in `array`. Why do we need all these similar data structures?
To understand NumPy, you need to understand the `ndarray` type. This is required. But it's not sufficient. Still, it's the place to start.
When I introduced NumPy for Numpties, I called it a loose series. My intention is not to write a mini-course or a well-structured series of articles that follow nicely from each other. I have other places for that type of content, including courses coming soon on The Python Coding Place.
Instead, I'll write about topics in no particular order. NumPy for Numpties will meander through NumPy topics, not quite aimlessly, but almost.
But I need a starting point. And it would feel rather strange if I didn't start with the most important building block in the NumPy library. So, say hello to the `ndarray` data type!
Not So Fast. Let's Talk About Lists First
If you're already familiar with NumPy, did you get the joke in the subheading? If you're not, you'll get it by the end of this article.
The best place to start to understand NumPy's primary data structure is with another data structure you already know: Python's built-in list.
Let's look at some of the main characteristics of the list. A list is a data structure that contains other items. A Python list can contain any number of items, which can be instances of any data type. You can even mix and match data types within a single list:
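Here's a minimal sketch of what such lists could look like. The names and values are made up for illustration:

```python
# Hypothetical example values; any objects would do
numbers = [2, 4, 6, 8]
names = ["Anna", "Ben", "Cleo"]
mixed = [1, 3.14, True, "hello", (2, 3), [4, 5]]

# Show the data type of each item in the mixed list
item_types = [type(item).__name__ for item in mixed]
print(item_types)
```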
The first two lists contain objects of the same data types within each list. But the third list contains all sorts of objects. In this example, the list contains an integer, a float, a Boolean, a string, a tuple, and another list, in that order.
When we get to NumPy's `ndarray`, we'll see there's a difference between lists and NumPy `ndarray` objects when dealing with elements of different data types. But let's not speed ahead.
Another important aspect of lists is that they're iterable. You can read more about iterables in this article from a few months ago, but all you need to know here is that you can use them in a `for` loop to go through each item one at a time. You can also do this with NumPy `ndarray` objects. However, as you'll see later, you often won't want to use an `ndarray` in a `for` loop. But I'm sprinting ahead again.
Lists are mutable. You can add, remove or replace items in a list without creating a new list. NumPy arrays are also mutable types.
Finally, at least for this short summary, lists are sequences. This means you can index them using the square bracket notation and use an integer index within the square brackets:
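A short sketch of indexing a list, using a made-up list of names:

```python
names = ["Anna", "Ben", "Cleo", "Dev"]  # hypothetical values

first = names[0]   # indexing starts at 0
last = names[-1]   # negative indices count from the end
print(first, last)
```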
This is sort of true for NumPy arrays, too.
So, in summary, we'll focus on the following list characteristics:
They are containers that can hold any type of object, including a mixture of data types within the same list
They are iterable
They are mutable
They are sequences
A Note on Code Snippets
By the way, before I move on. I asked you about your preference for including code in my articles in my last post. 87% of those who voted in the poll chose the image snippets with syntax highlighting and proper formatting.
I agree. Being able to read the code clearly and easily is important for following an article about coding. Substack's native option fails miserably on this count. The drawback of the image snippets is that you can't copy-paste. But frankly, you shouldn't be copy-pasting anyway; you should be typing in the code yourselves. And I link to a GitHub gist when the code is longer than a few lines, anyway.
So, I'll stick with this option. It's a lot more work for me this way. I thought I should let you know so you can appreciate the code snippets even more! I'll go the extra mile for my readers…
A Brief Look At Arrays, But Not Those Arrays, The Other Ones…
Before we move on to NumPy's `ndarray`, let's look at a lesser-known data structure that's part of the standard library. This means you don't need to install any additional modules.
Let's briefly look at the `array` data type in the `array` module. These arrays look like lists, but one important difference is that they cannot contain items with different data types. All items in an `array` must have the same data type.
Let's create an `array` and explore it:
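A minimal sketch of this exploration, assuming a small array of integers. The values are invented, and `"i"` is the typecode for a signed C int:

```python
from array import array

# The first argument is the typecode, the second the initial items
numbers = array("i", [2, 4, 6, 8])

numbers.append(10)        # add an item at the end
removed = numbers.pop()   # remove and return the last item
numbers.reverse()         # reverse the array in place
print(numbers)

# Items of a different data type are rejected
try:
    numbers.append(3.14)
except TypeError as err:
    print(f"TypeError: {err}")
```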
You need to state the type of data stored in the array using the first argument in the constructor `array()`. This typecode represents the C data type, but let's not get too distracted by this.
You use methods similar to ones you normally use on lists, like `.append()`, `.pop()`, and `.reverse()`. Note that these are not the list methods. They're `array` methods with the same names. You won't find all the list methods replicated in `array`, but I'll let you explore what's available and what isn't.
So, what's the point of `array` structures if they're similar to lists but more restrictive and more of a hassle to create?
Here's the answer:
Using an `array` data type will save memory if you have large amounts of data of the same type that you want to store in a list-like (or array-like) structure.
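One rough way to see the saving is with `sys.getsizeof()`. This is a sketch assuming a million integers. Note that for the list, `sys.getsizeof()` only counts the list's own references, not the integer objects they point to, so the real gap is larger still. The exact numbers depend on your platform and on how the list was built:

```python
import sys
from array import array

numbers_list = list(range(1_000_000))
numbers_array = array("i", numbers_list)  # "i" is a signed C int

list_size = sys.getsizeof(numbers_list)
array_size = sys.getsizeof(numbers_array)
print(f"list:  {list_size} bytes")
print(f"array: {array_size} bytes")
```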
Incidentally, those coming to Python from other languages often refer to lists as arrays. Don't do that. Array has a different meaning to list in Python.
But when you say array in Python, you often don't mean `array` from the `array` module. Most people's go-to array data structure is NumPy's `ndarray`.
The Pièce De Résistance: NumPy's ndarray
NumPy is a third-party library you'll need to install before using it. Use `pip install numpy` in your terminal. Or you can use whatever package manager you prefer.
By the way, the nd stands for n-dimensional, since `ndarray` objects can have one or more dimensions. More on this soon.
Let's look at the four list characteristics we discussed earlier.
`ndarray` objects can contain many items. This shouldn't be too surprising:
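A minimal sketch, assuming a list of integers converted with `np.array()`. The values are made up:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8])  # hypothetical values
print(repr(numbers))   # how the REPL displays the object
print(type(numbers))   # the class is numpy.ndarray
```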
A word of caution: when NumPy displays the `ndarray` object, it uses the term `array([ ... ])` even though the data type is called `ndarray`. Don't confuse this with the `array` class in the `array` module we discussed earlier.
Are `ndarray` objects only for numbers?
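You can check with a list of strings. This sketch uses made-up names, the longest of which has four letters:

```python
import numpy as np

names = np.array(["Anna", "Ben", "Cleo"])  # hypothetical names
print(repr(names))
print(names.dtype)
```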
No, they're not. Here, you converted a list of strings into a NumPy `ndarray`. Note that when you display the `ndarray`, the `dtype` attribute is shown too. This shows the data type of the array's elements. In this case, the `dtype` represents a Unicode string of up to four characters (since the longest names are four letters long).
How about mixing data types?
No, but this is a weird list indeed. It includes other data structures within it. Let's try a simpler "weird" list:
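A sketch of the simpler case, assuming you drop the tuple and the nested list and keep the integer, float, Boolean, and string:

```python
import numpy as np

simpler = np.array([1, 3.14, True, "hello"])
print(repr(simpler))
print(simpler.dtype.kind)  # "U" stands for Unicode string
```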
This works. But it's not quite what you were expecting, perhaps. Note that all the elements are converted to strings. You can see this from the single quotes around all the values, even those that were integers, floats, and Booleans in the original list. And the `dtype` contains a `U`, which means all items are Unicode strings.
Let's get rid of the string from the original list and keep the integer, float, and Boolean:
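A minimal sketch with the same assumed values, minus the string:

```python
import numpy as np

numbers = np.array([1, 3.14, True])
print(repr(numbers))
print(numbers.dtype)
```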
The first item in the list is the integer `1`, but note how it's displayed as `1.` in the `ndarray`. The point after the number indicates it's a float. Therefore, the integer is converted to a float. This didn't happen in the first example in this section since all the numbers were integers in that first example. However, there's also a float in this latest example, and NumPy converts every numeric object into a float.
But how about the Boolean `True`? This is also converted to a float. Note how it's displayed as `1.` as well. That's because the Boolean data type is a subclass of the integer class, and `True` is equal to `1`:
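You can confirm this in plain Python, with no NumPy needed:

```python
is_subclass = issubclass(bool, int)  # bool inherits from int
is_equal = True == 1                 # True compares equal to 1
total = True + True                  # Booleans behave like integers in arithmetic

print(is_subclass, is_equal, total)
```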
You may have spotted that NumPy sometimes shows the `dtype` attribute when displaying the `ndarray`, and sometimes it doesn't. NumPy will only show the `dtype` when the array contains non-default data types, such as strings. Recall that NumPy stands for Numerical Python. NumPy arrays are primarily meant for numerical data types.
You can have Boolean data in a NumPy `ndarray` as long as the array only contains Booleans:
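A short sketch with made-up Boolean values:

```python
import numpy as np

flags = np.array([True, False, True])
print(repr(flags))
print(flags.dtype)
```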
Let's move to the next list characteristic and explore what happens in NumPy.
`ndarray` objects are sequence-like. Lists are sequences. This means you can use integers within square brackets to fetch an item from the list based on its position in the list. The integer is the index showing the item's position, and we often call this process indexing.
You can do the same with NumPy `ndarray` objects:
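A minimal sketch of indexing and slicing a one-dimensional array. The values are invented:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 10])

first = numbers[0]      # first item
last = numbers[-1]      # last item
middle = numbers[1:4]   # a slice, from index 1 up to but excluding 4
print(first, last, middle)
```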
You can also slice a sequence. The last example in the code above confirms you can also slice a NumPy `ndarray`.
So, a NumPy array is a sequence, right? Well, not technically, no. Even though you can use NumPy `ndarray` objects like sequences, as you saw above, you can also put other objects in the square brackets, not just integers (for indexing) and slices (for slicing).
Let's look at a couple of quick examples. First, let's create a two-dimensional array. All the NumPy arrays so far have been one-dimensional:
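A sketch of one such array, assuming a made-up 3x3 grid built from a list of lists:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(grid.ndim)    # number of dimensions
print(grid.shape)   # (rows, columns)
```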
You can use the multiline formatting option to create this array if it's easier to visualise—in Python's REPL/Console environment, press enter when you have an unmatched open parenthesis or open square bracket to move to the next line:
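The same hypothetical 3x3 grid, written across several lines so that the rows line up:

```python
import numpy as np

grid = np.array(
    [
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9],
    ]
)
print(grid)
```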
If you use a single index on this array, you'll fetch an entire row. But you can use two integers to fetch a single item directly:
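A minimal sketch of both forms of indexing, again using a made-up 3x3 grid:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = grid[2]        # a single index fetches the whole third row
item = grid[2, 1]    # two integers: third row, second column
print(row, item)
```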
Therefore, you can use multiple integers to choose the row and column. You may not realise this, but you're using a tuple in this case since `2, 1` creates a tuple:
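You can check this by assigning the index to a variable first. This sketch uses a hypothetical 3x3 grid:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

index = 2, 1             # the comma creates a tuple, no parentheses needed
print(type(index))
print(grid[index])       # same item as grid[2, 1]
```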
Let's look at one more example:
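A sketch of this kind of expression, assuming a made-up array of six numbers and a list of six Booleans:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 10, 12])

# True in the first, fourth, and last positions
selected = numbers[[True, False, False, True, False, True]]
print(selected)
```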
Let's break down the last expression. There's an open square bracket after the array name and a matching close square bracket at the end of the line. Inside these brackets is a list. This list contains Booleans. There's `True` in the first, fourth, and last positions in the list, and `False` in the other positions. The result is an array containing the first, fourth, and last items in the original array.
Therefore, you can also use lists of Booleans within the square brackets. You could also use another `ndarray` of Booleans instead of the list.
Sequences can only accept integers or slices in the square brackets. Integers are used for indexing and slices for slicing. But `ndarray` objects can be used with tuples, lists or other arrays, in addition to integers and slices within the square brackets. Therefore, NumPy arrays are not technically sequences:
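One possible check, assuming the abstract base class `collections.abc.Sequence` is a fair stand-in for "technically a sequence":

```python
from collections.abc import Sequence

import numpy as np

numbers_list = [2, 4, 6, 8]
numbers_np = np.array(numbers_list)

list_is_sequence = isinstance(numbers_list, Sequence)
array_is_sequence = isinstance(numbers_np, Sequence)
print(list_is_sequence, array_is_sequence)
```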
But you can use them like you use sequences…
`ndarray` objects are mutable. I won't dwell on this:
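A minimal sketch, replacing one element of a made-up array and comparing the object's identity before and after:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8])

id_before = id(numbers)
numbers[0] = 100          # replace the first element in place
id_after = id(numbers)

print(numbers)
print(id_before == id_after)
```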
You also check the object's id to confirm it's the same object before and after replacing one of its elements. Therefore, you can mutate a NumPy array.
`ndarray` objects are iterable. Yes, but there's more to this point.
You can use NumPy `ndarray` objects within a `for` loop like you'd use a list. I won't show you this. You can try it out yourself if you prefer. But the main benefits of having an iterable can often be replaced with another feature unique to NumPy arrays. We'll explore this in the next section.
If You're Using `for` Loops, You're Doing NumPy Wrong!
This will be a topic for a future NumPy for Numpties article, but I'll write a short preview here.
Let's start with a demo example. You have a list of numbers, and you want to multiply each one by `3`:
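A first attempt might look like this sketch, using a made-up list:

```python
numbers = [2, 4, 6]

result = numbers * 3   # looks plausible, but does something else
print(result)
```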
This won't work. Multiplying the whole list by `3` repeats the list three times. It doesn't multiply each element of the list by `3`. For this, you'd need to use a `for` loop (or a list comprehension, but I won't show list comps in this article):
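A sketch of the loop version with the same made-up values:

```python
numbers = [2, 4, 6]

tripled = []
for number in numbers:
    tripled.append(number * 3)   # multiply one item at a time

print(tripled)
```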
That's better. But let's see how you'd do this with NumPy:
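A minimal sketch of the NumPy equivalent, assuming the same made-up numbers:

```python
import numpy as np

numbers = np.array([2, 4, 6])

tripled = numbers * 3   # applies to every element, no loop needed
print(tripled)
```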
Ah! Now, we can multiply the whole array by `3`. NumPy treats this as an element-by-element multiplication. The same occurs for other operations. This is called vectorisation.
Here's another example:
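This sketch compares a made-up array with `5`:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 10])

is_large = numbers > 5   # one comparison per element
print(is_large)
```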
Try this with a list, and you'll see it won't work. But the greater than operator on a NumPy array checks each item in the array to see if it's greater than `5`. It returns a new array with Boolean values.
And do you remember how we can use arrays of Booleans within the square brackets when fetching items from an array?
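A sketch that puts the comparison directly inside the square brackets, using a made-up array:

```python
import numpy as np

numbers = np.array([2, 4, 6, 8, 10])

filtered = numbers[numbers > 5]   # keep only items where the test is True
print(filtered)
```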
The expression within the square brackets is what you used in the previous example. It returns a NumPy `ndarray` of Booleans. And this, in turn, is used to filter the original array. The result is a new array containing only the elements greater than `5`.
I'll explore all of these further in future articles.
But I'll finish this one with an answer to a question you may have already asked. Why go through all this trouble? Sure, it's more convenient to write our code as it saves a few lines of code each time. But that's not the only advantage.
I'll use an example from the chapter on NumPy in The Python Coding Book to demonstrate the main advantage. In this example, I'll use a script instead of the REPL/Console. You'll create a million random temperatures in ºC and convert them to ºF:
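Here's a sketch of what such a script could look like. The function names and the temperature range are my assumptions for illustration, not necessarily those used in the book:

```python
import random

import numpy as np

# A million random temperatures in ºC (the range is an assumption)
temperatures = [random.uniform(-20, 40) for _ in range(1_000_000)]
temperatures_np = np.array(temperatures)


def convert_to_fahrenheit(celsius_values):
    """Convert a list of ºC values to ºF using a classic for loop."""
    result = []
    for value in celsius_values:
        result.append(value * 9 / 5 + 32)
    return result


def convert_to_fahrenheit_np(celsius_values: np.ndarray) -> np.ndarray:
    """Convert an array of ºC values to ºF in one vectorised step."""
    return celsius_values * 9 / 5 + 32
```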
You create a list of temperatures and its NumPy `ndarray` equivalent. You also define two functions. The first converts the list into a new list with the converted temperatures. This function uses a `for` loop.
The second function accepts a NumPy `ndarray`, which I'm showing using type hints in this example, and it uses NumPy's vectorised approach.
Let's time these two functions using the `timeit` module:
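A self-contained sketch of such a timing script. The function names are hypothetical, and I'm assuming fewer temperatures than the million described above so that it runs quickly. Set `N_TEMPERATURES` to `1_000_000` to match the article:

```python
import random
import timeit

import numpy as np

# Assumption: a smaller data set than the article's, to keep this quick
N_TEMPERATURES = 100_000
temperatures = [random.uniform(-20, 40) for _ in range(N_TEMPERATURES)]
temperatures_np = np.array(temperatures)


def convert_with_loop(celsius_values):
    """Convert a list of ºC values to ºF using a classic for loop."""
    result = []
    for value in celsius_values:
        result.append(value * 9 / 5 + 32)
    return result


def convert_with_numpy(celsius_values: np.ndarray) -> np.ndarray:
    """Convert an array of ºC values to ºF in one vectorised step."""
    return celsius_values * 9 / 5 + 32


# Call each function 100 times; timeit returns the total time in seconds
list_time = timeit.timeit(lambda: convert_with_loop(temperatures), number=100)
numpy_time = timeit.timeit(lambda: convert_with_numpy(temperatures_np), number=100)

print('Using the "classic" for loop method with a list:')
print(list_time)
print("Using NumPy:")
print(numpy_time)
```

Passing a `lambda` to `timeit.timeit()` is just one option. You could also pass the statement as a string along with a `globals` argument.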
You run each function 100 times within the `timeit.timeit()` calls. And here are the results:
Using the "classic" for loop method with a list:
4.818278833998193
Using NumPy:
0.06725112500134856
The NumPy version is fast. Much faster than the version using a list and a `for` loop. The actual times you'll get will vary depending on your computer setup and the Python version you're using, but you'll always find the NumPy version to be significantly faster. In my version, it's over 70 times faster.
You can try using list comprehensions. It will be faster than the classic `for` loop option, but not by much, especially in the newer Python versions. NumPy speeds things up by performing many of its operations in C rather than Python.
Final Words
This is a good place to end this article. I'll refer back to this NumPy `ndarray` primer in future NumPy for Numpties articles as a reference point.
In summary, NumPy's `ndarray` looks and feels like a list at first glance but turns out to be rather different as you dive deeper. It's a container that can only hold items of the same data type, unlike lists, which can hold a mixture of data types in a single list. NumPy's arrays are also mutable. They can be used like sequences even though they're technically not sequences. And they're iterable, but we rarely use them in `for` loops. Instead, we use their vectorised operations, which perform operations on each element in the array. This process is much quicker than using native Python operations on lists.
See you in the next NumPy for Numpties articles, or before that for other non-NumPy articles.
PS: I’ll add a couple of extra points I couldn’t fit in here as comments to the article once I publish it.
Code in this article uses Python 3.12
Stop Stack
#46
Stats on the Stack
Age: 8 months, 3 weeks, and 6 days old
Number of articles: 46
Total subscribers: 1,753
On the Paid tier: 78
Each article is the result of years of experience and many hours of work. Hope you enjoy each one and find them useful. If you're in a position to do so, you can support this Substack further with a paid subscription. In addition to supporting this work, you'll get access to the full archive of articles and some paid-only articles. Alternatively, if you become a member of The Python Coding Place, you'll get access to all articles on The Stack as part of that membership.
A couple of extra points I couldn't fit in in the main article without making it unnecessarily longer.
I've used `np.array()` to create an `ndarray` object in the article. You may also see a similar function called `np.asarray()`. The first one of these, `np.array()`, creates a new `ndarray` by default, whereas `np.asarray()` will only create a new array if the argument isn't already an `ndarray` object.
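A quick sketch of the difference, using a made-up array and the `is` operator to check whether you get the same object back:

```python
import numpy as np

original = np.array([2, 4, 6])

copied = np.array(original)     # a new ndarray with the same values
same = np.asarray(original)     # already an ndarray, so no new array

print(copied is original)
print(same is original)
```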
Secondly, you recall when I introduced the `array` data type from the `array` module? The `array` object took up significantly less memory than its corresponding list. The list was 8448728 bytes while the `array` was 4000080 bytes.
How about the NumPy `ndarray`? If you convert the list you used when comparing lists and `array.array` objects into a NumPy array:
`numbers_np = np.array(numbers_list)`
and then find its size in the same way, using `sys.getsizeof()`, you'll find that the `ndarray` you created is 8000112 bytes, not too far off from the size of the list.
However, if you check the `dtype` of this `ndarray`, you'll see it's:
dtype('int64')
But if you don't need 64-bit integers, you can opt for integer types that take up less memory, such as int32 or even int16 in this case:
```
>>> numbers_np = np.array(numbers_list, dtype="int32")
>>> sys.getsizeof(numbers_np)
4000112
>>> numbers_np = np.array(numbers_list, dtype="int16")
>>> sys.getsizeof(numbers_np)
2000112
```
"You can use the multiline formatting option to create this array if it's easier to visualise—in Python's REPL/Console environment"
I'd been wondering what all those ellipses on your code snippets were about!