I know a lot of people who dread mathematics. As someone who studied a health-science-related course at university, I know many people who chose the health sciences just to avoid maths. Still, for my final research project, I had to study a lot of statistics to make sense of all the data I gathered and draw some kind of inference from it. It seems you can’t escape maths at all!
Why NumPy is awesome compared to normal Python lists
NumPy, short for Numerical Python, is a Python library that helps users perform numerical operations on large arrays and matrices. You can apply an operation to every item in an array or matrix at once, without writing a loop. This makes NumPy much faster than a normal Python list.
There are a couple of things that set NumPy apart from lists. First of all, a list can store items of different data types, but in a NumPy array, every item has the same data type. That uniformity is what makes operating on each element fast and efficient. Imagine trying to multiply all the elements of a list by 2: if you wanted only numeric data as your output, you would have to check whether there were any strings in the list, which would definitely slow you down. (Yes, there are heterogeneous NumPy arrays, but they are not widely used.) With NumPy, you wouldn’t even need a loop. You just multiply by 2!
import numpy as np
a = np.array([1,2,3])
double_array = a * 2
print(double_array)  # [2 4 6]
But then again, you might ask: what is the difference if the list holds only numeric data? Another big advantage NumPy has over regular lists is that it stores its data in contiguous memory. That means the values sit one after the other in the computer’s memory. A Python list, however, stores pointers. A pointer is just a special variable that holds the address of another variable: the list itself is an array of pointers, and each pointer leads to an object that could live anywhere in memory. So to work with the values, the computer has to follow each pointer to wherever its object happens to be stored, and those objects could be scattered all over the computer memory (there is also the chance that they end up back to back [contiguous]. You’d never know!). While this flexibility has its advantages, it doesn’t help as much when you are talking about speed. Although not a perfect analogy, think of it as some kind of treasure hunt: with a list, every element is an address you have to chase down before you can use the value. Would you complete the hunt faster with this method, or if all the treasure were laid out in a row the moment you found the first piece?
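If you want to see this difference for yourself, a rough benchmark like the one below usually shows the vectorised multiply winning by a wide margin. This is only a sketch: the exact timings will vary with your machine, Python version and NumPy version.

```python
import timeit

# Shared setup: a million numbers as a plain list and as a NumPy array
setup = """
import numpy as np
py_list = list(range(1_000_000))
np_array = np.arange(1_000_000)
"""

# Multiply every element by 2, ten times each way
list_time = timeit.timeit("[x * 2 for x in py_list]", setup=setup, number=10)
numpy_time = timeit.timeit("np_array * 2", setup=setup, number=10)

print(f"list comprehension: {list_time:.4f}s")
print(f"numpy multiply:     {numpy_time:.4f}s")
```

On a typical machine the NumPy version comes out many times faster, precisely because it loops over contiguous values in compiled code instead of chasing pointers in Python.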
Illustrating NumPy with a practical example
Let’s solve a simple problem that should help sell you on the power of this library.
Imagine we want to write a function that returns a dictionary containing the mean, standard deviation, variance, minimum, maximum and sum along the first and second axes of a 3x3 matrix, as well as for the flattened matrix.
To do this, we would start by importing numpy as usual.
import numpy as np
(Importing it as np is the conventional alias. You can still import numpy however you like, though.)
Note that we start by validating the input, raising a ValueError unless a list with exactly 9 elements is passed to the function. Then we create three arrays with the np.array function, splitting the 9-item list into three slices of 3 elements each. A 3x3 matrix is just 3 arrays in one, so combining these three arrays with another call to np.array creates the 3x3 matrix.
import numpy as np

def calculate(list_):
    # A 3x3 matrix needs exactly nine values
    if len(list_) != 9:
        raise ValueError("List must contain nine numbers.")
    a = np.array(list_[0:3])
    b = np.array(list_[3:6])
    c = np.array(list_[6:9])
    matrix = np.array([a, b, c])
Next, let’s create an empty dictionary that will later be populated with the different statistical calculations.
    dict_ = {}
NumPy has built-in methods for statistical calculations. The ones we need for this project are the mean(), std(), var(), min(), max() and sum() methods. As you can probably guess, these calculate everything we need. The axis can be passed to each of them as an argument, and the flatten() method can be used to get the flattened matrix.
You get a flattened matrix when you turn a matrix into a regular one-dimensional list. For example:
Matrix = [[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
Flattened matrix = [1, 2, 3, 4, 5, 6, 7, 8, 9]
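Here is a quick sketch of these methods in action on that same matrix, showing what the axis argument actually selects:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix.flatten().tolist())    # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(matrix.sum(axis=0).tolist())  # sums down the columns: [12, 15, 18]
print(matrix.sum(axis=1).tolist())  # sums across the rows: [6, 15, 24]
print(matrix.sum())                 # sum of everything: 45
```

The same axis=0 / axis=1 pattern applies to mean(), std(), var(), min() and max().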
Each result is assigned as a value to a dictionary key. Finally, we can use the .tolist() method to convert every NumPy result to a regular Python list (or number, for the flattened results), so the function returns a dictionary of plain Python values.
    dict_["mean"] = [np.mean(matrix, axis=0).tolist(), np.mean(matrix, axis=1).tolist(), np.mean(matrix.flatten()).tolist()]
    dict_["standard deviation"] = [np.std(matrix, axis=0).tolist(), np.std(matrix, axis=1).tolist(), np.std(matrix.flatten()).tolist()]
    dict_["variance"] = [np.var(matrix, axis=0).tolist(), np.var(matrix, axis=1).tolist(), np.var(matrix.flatten()).tolist()]
    dict_["min"] = [np.min(matrix, axis=0).tolist(), np.min(matrix, axis=1).tolist(), np.min(matrix.flatten()).tolist()]
    dict_["max"] = [np.max(matrix, axis=0).tolist(), np.max(matrix, axis=1).tolist(), np.max(matrix.flatten()).tolist()]
    dict_["sum"] = [np.sum(matrix, axis=0).tolist(), np.sum(matrix, axis=1).tolist(), np.sum(matrix.flatten()).tolist()]
    return dict_
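Putting all the pieces together, the complete function and a quick test run might look like this (the exact dictionary layout is one way to follow the description above, with each key holding the axis 0 result, the axis 1 result, then the flattened result):

```python
import numpy as np

def calculate(list_):
    # A 3x3 matrix needs exactly nine values
    if len(list_) != 9:
        raise ValueError("List must contain nine numbers.")
    a = np.array(list_[0:3])
    b = np.array(list_[3:6])
    c = np.array(list_[6:9])
    matrix = np.array([a, b, c])

    dict_ = {}
    dict_["mean"] = [np.mean(matrix, axis=0).tolist(), np.mean(matrix, axis=1).tolist(), np.mean(matrix.flatten()).tolist()]
    dict_["standard deviation"] = [np.std(matrix, axis=0).tolist(), np.std(matrix, axis=1).tolist(), np.std(matrix.flatten()).tolist()]
    dict_["variance"] = [np.var(matrix, axis=0).tolist(), np.var(matrix, axis=1).tolist(), np.var(matrix.flatten()).tolist()]
    dict_["min"] = [np.min(matrix, axis=0).tolist(), np.min(matrix, axis=1).tolist(), np.min(matrix.flatten()).tolist()]
    dict_["max"] = [np.max(matrix, axis=0).tolist(), np.max(matrix, axis=1).tolist(), np.max(matrix.flatten()).tolist()]
    dict_["sum"] = [np.sum(matrix, axis=0).tolist(), np.sum(matrix, axis=1).tolist(), np.sum(matrix.flatten()).tolist()]
    return dict_

result = calculate([0, 1, 2, 3, 4, 5, 6, 7, 8])
print(result["mean"])  # [[3.0, 4.0, 5.0], [1.0, 4.0, 7.0], 4.0]
print(result["sum"])   # [[9, 12, 15], [3, 12, 21], 36]
```

Passing a list of any other length raises the ValueError, so callers find out immediately when their input can’t form a 3x3 matrix.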
I hope this small project showed you the power of NumPy for performing calculations on matrices and arrays. Just as it handled this small data set, NumPy is equally effective on much larger ones.