Numpy, the foundation of data science

Numpy is the cornerstone of the data science ecosystem in Python. All data manipulation, modeling, and visualization libraries rely, directly or indirectly, on Numpy. It is therefore essential to review some concepts of this package before moving forward.

Tutorial
Manipulation
Author

Lino Galiana

Published

2024-07-10


1 Introduction

This chapter serves as an introduction to Numpy to ensure that the basics of vector calculations with Python are mastered. The first part of the chapter presents small exercises to practice some basic functions of Numpy. The end of the chapter presents more in-depth practical exercises using Numpy.

It is recommended to regularly refer to the numpy cheatsheet and the official documentation if you have any doubts about a function.

In this chapter, we will adhere to the convention of importing Numpy as follows:

import numpy as np

We will also set the seed of the random number generator to obtain reproducible results:

np.random.seed(12345)
Note

The authors of numpy now recommend using generators via the default_rng() function rather than simply using numpy.random.

To stay consistent with the code commonly found online, we still use np.random.seed, but this may change in the future.
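
For readers curious about the generator-based approach mentioned in the note, here is a minimal sketch. The seed value and the variable names are arbitrary choices made for illustration; the rest of the chapter does not depend on this block.

```python
import numpy as np

# Create an independent generator with a fixed seed (12345 is arbitrary)
rng = np.random.default_rng(12345)

u = rng.uniform(size=3)                  # three draws from U(0, 1)
g = rng.normal(loc=0, scale=1, size=3)   # three draws from N(0, 1)

# A second generator built with the same seed reproduces the same draws
rng_again = np.random.default_rng(12345)
u_again = rng_again.uniform(size=3)
print(np.array_equal(u, u_again))
```

Unlike np.random.seed, which sets a global state, each generator carries its own state, which makes it easier to reason about reproducibility in larger programs.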

2 Concept of array

In the world of data science, as will be discussed in more depth in the upcoming chapters, the central object is the two-dimensional data table. The first dimension corresponds to rows and the second to columns. If we only consider one dimension, we refer to a variable (a column) of our data table. It is therefore natural to link data tables to the mathematical objects of matrices and vectors.

NumPy (Numerical Python) is the foundational building block for handling lists of numbers or strings as matrices. It provides this type of object, along with the associated standardized operations, neither of which exists in the base Python language.

The central object of NumPy is the array, which is a multidimensional data table. A Numpy array can be one-dimensional and considered as a vector (1d-array), two-dimensional and considered as a matrix (2d-array), or, more generally, take the form of a multidimensional object (Nd-array), a sort of nested table.

Simple arrays (one- or two-dimensional) are easy to represent and cover most Numpy use cases. We will discover in the next chapter on Pandas that, in practice, we usually don’t use Numpy directly since it is a low-level library. A Pandas DataFrame is built from a collection of one-dimensional arrays (the variables of the table), which allows operations that are consistent with the variable type (and optimized for it). Some Numpy knowledge is useful for understanding the logic of vector manipulation, which makes data processing more readable, efficient, and reliable.

Compared to a list,

  • an array can only contain one type of data (integer, string, etc.).
  • operations implemented by Numpy will be more efficient and require less memory.
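
The two differences with lists can be observed on a small example. This is only an illustrative sketch; the variable names are ours.

```python
import numpy as np

mixed_list = [1, 2.5, 3]

arr = np.array(mixed_list)    # integers are upcast to a common float type
text = np.array([1, "a"])     # mixing numbers and text yields strings

print(arr.dtype)              # a floating-point type
print(text.dtype.kind)        # "U" stands for unicode strings

# A list needs an explicit loop, an array is processed in one vectorized call
doubled_list = [2 * v for v in mixed_list]
doubled_arr = 2 * arr
```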

Geographical data will constitute a slightly more complex construction than a traditional DataFrame. The geographical dimension takes the form of a deeper table, at least two-dimensional (coordinates of a point). However, geographical data manipulation libraries will handle this increased complexity.

2.1 Creating an array

We can create an array in several ways. To create an array from a list, simply use the np.array function:

np.array([1, 2, 5])
array([1, 2, 5])

It is possible to add a dtype argument to constrain the array type:

np.array([["a", "z", "e"], ["r", "t"], ["y"]], dtype="object")
array([list(['a', 'z', 'e']), list(['r', 't']), list(['y'])], dtype=object)

There are also practical functions for creating arrays:

  • Logical sequences: np.arange (sequence) or np.linspace (linear interpolation between two bounds)
  • Constant arrays: array filled with zeros, ones, or a desired number: np.zeros, np.ones, or np.full
  • Random sequences: random number generation functions: np.random.uniform, np.random.normal, etc.
  • Identity matrix: np.eye

This gives, for logical sequences:

np.arange(0, 10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(0, 10, 3)
array([0, 3, 6, 9])
np.linspace(0, 1, 5)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

For an array initialized to 0:

np.zeros(10, dtype=int)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

or initialized to 1:

np.ones((3, 5), dtype=float)
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

or even initialized to 3.14:

np.full((3, 5), 3.14)
array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

Finally, to create the matrix \(I_3\):

np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
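
Random sequences, listed earlier, follow the same pattern. The sketch below uses arbitrary bounds, shapes, and parameters chosen for illustration. Note that the scale argument of np.random.normal is a standard deviation, not a variance.

```python
import numpy as np

np.random.seed(12345)

# 2 x 3 array of draws from a uniform distribution on [0, 10)
u = np.random.uniform(low=0, high=10, size=(2, 3))

# 5 draws from a normal distribution with mean 1 and standard deviation 0.5
n = np.random.normal(loc=1.0, scale=0.5, size=5)

print(u.shape, n.shape)
```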
Exercise 1

Generate:

  • \(X\) a random variable, 1000 repetitions of a \(U(0,1)\) distribution
  • \(Y\) a random variable, 1000 repetitions of a normal distribution with zero mean and variance equal to 2
  • Verify the variance of \(Y\) with np.var

3 Indexing and slicing

3.1 Logic illustrated with a one-dimensional array

The simplest structure is the one-dimensional array:

x = np.arange(10)
print(x)
[0 1 2 3 4 5 6 7 8 9]

Indexing in this case is similar to that of a list:

  • The first element is at index 0
  • The nth element is accessible at position \(n-1\)

The logic for accessing elements is as follows:

x[start:stop:step]

With a one-dimensional array, the slicing operation (keeping a slice of the array) is very simple. For example, to keep the first K elements of an array, you would do:

x[:K]

The stop index is excluded, so this returns the elements at positions 0 to K-1. The K\(^{th}\) element itself is selected using:

x[K - 1]

To select only one element, you would do:

x = np.arange(10)
x[2]
2

The syntax for selecting particular indices from a list also works with arrays.
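
A few selections illustrating the start:stop:step logic and index lists. The indices are arbitrary examples and differ from those of the exercise below.

```python
import numpy as np

x = np.arange(10)

every_other = x[2:8:2]     # from index 2 (included) to 8 (excluded), step 2
last_three = x[-3:]        # negative indices count from the end
reversed_x = x[::-1]       # a negative step walks the array backwards
picked = x[[1, 4, 8]]      # a list of indices selects those positions

print(every_other, last_three, picked)
```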

Exercise 2

Take x = np.arange(10) and…

  • Select elements 0, 3, 5 from x
  • Select even elements
  • Select all elements except the first
  • Select the first 5 elements

3.2 Regarding performance

A key element in the performance of Numpy compared to lists is that slicing an array does not return a copy of the selected elements (a copy that would cost memory and time) but simply a view of them.

When it is necessary to make a copy, for example to avoid altering the underlying array, you can use the copy method:

x_sub_copy = x[:2].copy()
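
The practical consequence of views can be checked directly: modifying a slice modifies the original array, while modifying a copy does not. The variable names below are illustrative.

```python
import numpy as np

x = np.arange(10)

view = x[:3]           # a view: shares memory with x
view[0] = 99           # this also changes x[0]

sub_copy = x[:3].copy()
sub_copy[1] = -1       # x[1] is left untouched

print(x[:3])
print(np.shares_memory(x, view), np.shares_memory(x, sub_copy))
```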

3.3 Logical filters

It is also possible, and more practical, to select data based on logical conditions (an operation called a boolean mask). This functionality will mainly be used to perform data filtering operations.

For simple comparison operations, logical comparators may be sufficient. These comparisons also work on multidimensional arrays thanks to broadcasting, which we will discuss later:

x = np.arange(10)
x2 = np.array([[-1, 1, -2], [-3, 2, 0]])
print(x)
print(x2)
[0 1 2 3 4 5 6 7 8 9]
[[-1  1 -2]
 [-3  2  0]]
x == 2
x2 < 0
array([[ True, False,  True],
       [ True, False, False]])

To select the observations that satisfy a logical condition, pass the boolean array inside the square brackets, as you would for regular slicing.
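
As a small sketch of this mechanism, reusing the arrays x and x2 defined above with thresholds chosen arbitrarily:

```python
import numpy as np

x = np.arange(10)
x2 = np.array([[-1, 1, -2], [-3, 2, 0]])

mask = x > 6                       # boolean array with the same shape as x
large = x[mask]                    # keeps only the elements where mask is True

negatives = x2[x2 < 0]             # on a 2d-array, the result is flattened
middle = x[(x > 2) & (x < 6)]      # combine conditions with & and parentheses

print(large, negatives, middle)
```

Conditions are combined with & (and) and | (or) rather than the Python keywords, and each condition needs its own parentheses.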

Exercise 3

Given

x = np.random.normal(size=10000)
  1. Keep only the values whose absolute value is greater than 1.96
  2. Count the number of values greater than 1.96 in absolute value and their proportion in the whole set
  3. Sum the absolute values of all observations greater (in absolute value) than 1.96 and relate them to the sum of the values of x (in absolute value)

Whenever possible, it is recommended to use numpy’s logical functions, which are optimized and handle dimensions well. Among them are:

  • count_nonzero;
  • isnan;
  • any or all, especially with the axis argument;
  • np.array_equal to check element-by-element equality.
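
The functions listed above can be tried on small hand-written arrays. The arrays a and b below are illustrative and are not the ones used in the exercise.

```python
import numpy as np

a = np.array([[1, 0, -2], [0, 0, 3]])
b = np.array([1.0, np.nan, 3.0])

n_nonzero = np.count_nonzero(a)          # number of entries different from 0
missing = np.isnan(b)                    # True where the value is NaN

any_pos_by_col = (a > 0).any(axis=0)     # one result per column
any_pos_by_row = (a > 0).any(axis=1)     # one result per row
all_pos_by_row = (a > 0).all(axis=1)

same = np.array_equal(a, a.copy())       # True if shapes and values match

print(n_nonzero, missing, any_pos_by_col, any_pos_by_row)
```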

Given

x = np.random.normal(0, size=(3, 4))

a multidimensional array and

y = np.array([np.nan, 0, 1])

a one-dimensional array with a missing value.

Exercise 4
  1. Use count_nonzero on y
  2. Use isnan on y and count the number of non-NaN values
  3. Check if x has at least one positive value in its entirety, by rows and then by columns.
Hint

Take a look at the axis parameter by researching online. For example, here.

4 Manipulating an array

4.1 Manipulation functions

Numpy provides standardized methods and functions for modifying an array. Here are some of them:

| Operation | Implementation |
|---|---|
| Flatten an array | x.flatten() (method) |
| Transpose an array | x.T (attribute) or np.transpose(x) (function) |
| Append elements to the end | np.append(x, [1,2]) |
| Insert elements at a given position (at positions 1 and 2) | np.insert(x, [1,2], 3) |
| Delete elements (at positions 0 and 3) | np.delete(x, [0,3]) |
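
The operations listed above can be tried on small arrays. This is an illustrative sketch with arbitrary values; note that these functions return new arrays and leave the original unchanged.

```python
import numpy as np

m = np.array([[0, 1], [2, 3]])
v = np.arange(4)                       # array([0, 1, 2, 3])

flat = m.flatten()                     # 2d-array turned into a 1d-array
transposed = m.T                       # rows and columns swapped

appended = np.append(v, [1, 2])        # values added at the end
inserted = np.insert(v, [1, 2], 3)     # 3 inserted before original indices 1 and 2
deleted = np.delete(v, [0, 3])         # elements at indices 0 and 3 removed

print(appended, inserted, deleted)
```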

To combine arrays, you can use, depending on the case, np.concatenate, np.vstack, or the np.r_ object (row-wise concatenation), and np.hstack, np.column_stack, or the np.c_ object (column-wise concatenation).
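
A short sketch comparing these ways of combining arrays on two one-dimensional arrays. The arrays a and b are arbitrary examples; checking the resulting shapes is the quickest way to see how each function behaves.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

end_to_end = np.concatenate([a, b])    # shape (6,)
stacked_rows = np.vstack([a, b])       # shape (2, 3): one array per row
side_by_side = np.hstack([a, b])       # shape (6,) for 1d inputs
as_columns = np.column_stack([a, b])   # shape (3, 2): one array per column

# np.r_ and np.c_ are indexed with square brackets, not called like functions
with_r = np.r_[a, b]                   # shape (6,)
with_c = np.c_[a, b]                   # shape (3, 2)

print(stacked_rows.shape, as_columns.shape)
```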

x = np.random.normal(size=10)

To sort an array, use np.sort:

x = np.array([7, 2, 3, 1, 6, 5, 4])

np.sort(x)
array([1, 2, 3, 4, 5, 6, 7])

If you want to perform a partial reordering to find the k smallest values in an array without sorting them, use partition:

np.partition(x, 3)
array([2, 1, 3, 4, 6, 5, 7])
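
The guarantee offered by partition can be verified as follows: the first k entries are the k smallest values, in no particular order. The sketch also uses np.argpartition, which returns indices instead of values.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

part = np.partition(x, k)
smallest = part[:k]                    # the k smallest values, unordered

idx = np.argpartition(x, k)[:k]        # positions of the k smallest values in x

print(sorted(smallest.tolist()))
print(sorted(x[idx].tolist()))
```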

4.2 Statistics on an array

For classical descriptive statistics, Numpy offers a number of already implemented functions, which can be combined with the axis argument.

x = np.random.normal(0, size=(3, 4))
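
To see what the axis argument does, it helps to use a small deterministic array rather than random draws. In this illustrative sketch, axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

overall_mean = m.mean()                # a single number for the whole array
mean_by_col = m.mean(axis=0)           # shape (3,): one mean per column
mean_by_row = m.mean(axis=1)           # shape (2,): one mean per row
max_by_row = m.max(axis=1)
median_by_col = np.median(m, axis=0)

print(overall_mean, mean_by_col, mean_by_row)
```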
Exercise 5
  1. Sum all the elements of an array, then the elements by row and the elements by column. Check that the results are consistent.
  2. Write a function statdesc that returns the following values: mean, median, standard deviation, minimum and maximum. Apply it to x, experimenting with the axis argument.

5 Broadcasting

Broadcasting refers to a set of rules for applying operations to arrays of different dimensions. In practice, it generally consists of applying a single operation to all members of a numpy array.

The difference can be understood from the following example. Broadcasting allows the scalar 5 to be treated as a one-dimensional array of length 3, so that a + 5 gives the same result as a + b:

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

a + b
a + 5
array([5, 6, 7])

Broadcasting can be very practical for efficiently performing operations on data with a complex structure. For more details, visit here or here.
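
Broadcasting also applies between arrays of different shapes, not only between an array and a scalar. The sketch below, with arbitrary values, centers the columns of a matrix and combines a column vector with a row vector.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Shape (3,) is stretched along the rows to match shape (2, 3)
col_means = M.mean(axis=0)
centered = M - col_means

# Shape (2, 1) combined with shape (3,) gives shape (2, 3)
col = np.arange(2)[:, np.newaxis]
row = np.arange(3)
grid = col + row

print(centered)
print(grid)
```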

5.1 Application: programming your own k-nearest neighbors

Exercise (a bit more challenging)
  1. Create X, a two-dimensional array (i.e., a matrix) with 10 rows and 2 columns. The numbers in the array are random.
  2. Import the matplotlib.pyplot module as plt. Use plt.scatter to plot the data as a scatter plot.
  3. Construct a 10x10 matrix storing, at element \((i,j)\), the Euclidean distance between points \(X[i,]\) and \(X[j,]\). To do this, you will need to work with dimensions by creating nested arrays using np.newaxis:
  • First, use X1 = X[:, np.newaxis, :] to transform the matrix into a nested array. Check the dimensions.
  • Create X2 of dimension (1, 10, 2) using the same logic.
  • Deduce, for each point, the distance with other points for each coordinate. Square this distance.
  • At this stage, you should have an array of dimension (10, 10, 2). The reduction to a matrix is obtained by summing over the last axis. Check the help of np.sum on how to sum over the last axis.
  • Finally, apply the square root to obtain a proper Euclidean distance.
  4. Verify that the diagonal elements are zero (distance of a point to itself…).
  5. Now, sort for each point the points with the most similar values. Use np.argsort to get the ranking of the closest points for each row.
  6. We are interested in the k-nearest neighbors. For now, set k=2. Use argpartition to reorder each row so that the 2 closest neighbors of each point come first, followed by the rest of the row.
  7. Use the code snippet below to graphically represent the nearest neighbors.
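
Before tackling the two-dimensional case, the np.newaxis trick can be explored on a simpler, one-dimensional problem. This sketch is not the solution to the exercise: it computes pairwise absolute differences between three arbitrary numbers.

```python
import numpy as np

v = np.array([1.0, 4.0, 6.0])

# Shapes (3, 1) and (1, 3) broadcast to a (3, 3) table of differences
diff = v[:, np.newaxis] - v[np.newaxis, :]
dist = np.abs(diff)

# For each row, indices of the other values ordered from closest to farthest
order = np.argsort(dist, axis=1)

print(dist)
print(order)
```

Each point is its own closest neighbor (distance zero), which is why the first column of order is simply the row index.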
A hint for graphically representing the nearest neighbors
plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
K = 2

for i in range(X.shape[0]):
    for j in nearest_partition[i, : K + 1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color="black")

For question 2, you should get a graph that looks like this:

The result for question 7 is:

Did I invent this challenging exercise? Not at all, it comes from the book Python Data Science Handbook. But if I had told you this immediately, would you have tried to answer the questions?

Moreover, it would not be a good idea to generalize this algorithm to large datasets. The complexity of our approach is \(O(N^2)\). The algorithm implemented by Scikit-Learn is \(O(N \log N)\).

Additionally, computing distance matrices using the power of GPUs (graphics cards) would be faster. In this regard, the library faiss, or dedicated frameworks for computing distances between high-dimensional vectors such as ChromaDB, offer much better performance than Numpy for this specific problem.

6 Additional Exercises

Google became famous thanks to its PageRank algorithm. Starting from the links between websites, this algorithm assigns each site an importance score that measures its centrality in the network. The objective of this exercise is to use Numpy to implement such an algorithm from an adjacency matrix that links the sites together.

Understanding the principle of the PageRank algorithm

  1. Create the following matrix with numpy. Call it M:

\[ \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 1 & 0 \end{bmatrix} \]

  2. To visually represent this minimalist web, convert the matrix to a networkx object (a library specialized in network analysis) and use the draw function of this package.

This is the transpose of the adjacency matrix that links the sites together. For example, site 1 (first column) is referenced by sites 2 and 3. It only references site 5.

  3. Using the English Wikipedia page on PageRank, test the algorithm on your matrix.

Site 1 is quite central because it is referenced twice. Site 5 is also central since it is referenced by site 1.

array([[0.25419178],
       [0.13803151],
       [0.13803151],
       [0.20599017],
       [0.26375504]])
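
A possible way to obtain scores of this kind is a simple power iteration, sketched below. It is written for this chapter rather than copied from the Wikipedia page: the function name, the damping factor d=0.85, the uniform starting vector, and the fixed number of iterations are assumptions. The code reads M[i, j] as the weight of the link going from site j to site i, so that each column sums to 1.

```python
import numpy as np

# M[i, j]: weight of the link from site j to site i (columns sum to 1)
M = np.array([
    [0, 0, 0, 0, 1],
    [0.5, 0, 0, 0, 0],
    [0.5, 0, 0, 0, 0],
    [0, 1, 0.5, 0, 0],
    [0, 0, 0.5, 1, 0],
])


def pagerank(M, d=0.85, n_iter=200):
    """Power iteration sketch; d is the damping factor (assumed value)."""
    n = M.shape[0]
    v = np.full(n, 1 / n)            # start from a uniform score vector
    M_hat = d * M + (1 - d) / n      # damped transition matrix
    for _ in range(n_iter):
        v = M_hat @ v                # one step of the power iteration
    return v


scores = pagerank(M)
print(scores)
```

The function returns a one-dimensional array whose entries sum to 1, whereas the output shown above is a column vector.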

Additional information

Environment the files have been tested on:

Latest built version: 2024-07-10

Python version used:

'3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.12.2
aiohttp 3.9.3
aioitertools 0.11.0
aiosignal 1.3.1
alembic 1.13.1
aniso8601 9.0.1
annotated-types 0.7.0
appdirs 1.4.4
archspec 0.2.3
astroid 3.1.0
asttokens 2.4.1
attrs 23.2.0
Babel 2.15.0
bcrypt 4.1.2
beautifulsoup4 4.12.3
black 24.4.2
blinker 1.7.0
blis 0.7.11
bokeh 3.4.0
boltons 23.1.1
boto3 1.34.51
botocore 1.34.51
branca 0.7.1
Brotli 1.1.0
cachetools 5.3.3
cartiflette 0.0.2
Cartopy 0.23.0
catalogue 2.0.10
cattrs 23.2.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.18.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.3.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
confection 0.1.5
contextily 1.6.0
contourpy 1.2.1
cryptography 42.0.5
cycler 0.12.1
cymem 2.0.8
cytoolz 0.12.3
dask 2024.4.1
dask-expr 1.0.10
debugpy 1.8.1
decorator 5.1.1
dill 0.3.8
distributed 2024.4.1
distro 1.9.0
docker 7.0.0
duckdb 0.10.1
en-core-web-sm 3.7.1
entrypoints 0.4
et-xmlfile 1.1.0
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
fiona 1.9.6
flake8 7.0.0
Flask 3.0.2
folium 0.16.0
fontawesomefree 6.5.1
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2023.12.2
GDAL 3.8.4
gensim 4.3.2
geographiclib 2.0
geopandas 0.12.2
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.29.0
graphene 3.3
graphql-core 3.2.3
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.10.0
greenlet 3.0.3
gunicorn 21.2.0
htmltools 0.5.2
hvac 2.1.0
idna 3.6
imageio 2.34.2
importlib_metadata 7.1.0
importlib_resources 6.4.0
inflate64 1.0.0
ipykernel 6.29.3
ipython 8.22.2
ipywidgets 8.1.2
isort 5.13.2
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonpatch 1.33
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter-cache 1.0.0
jupyter_client 8.6.1
jupyter_core 5.7.2
jupyterlab_widgets 3.0.10
kaleido 0.2.1
kiwisolver 1.4.5
kubernetes 29.0.0
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
libmambapy 1.5.7
llvmlite 0.42.0
locket 1.0.0
lxml 5.2.2
lz4 4.3.3
Mako 1.3.2
mamba 1.5.7
mapclassify 2.6.1
marisa-trie 1.2.0
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
matplotlib-inline 0.1.6
mccabe 0.7.0
mdurl 0.1.2
menuinst 2.0.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.11.3
mlflow-skinny 2.11.3
msgpack 1.0.7
multidict 6.0.5
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.10
mypy 1.9.0
mypy-extensions 1.0.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.8.1
numba 0.59.1
numpy 1.26.4
oauthlib 3.2.2
opencv-python-headless 4.9.0.80
openpyxl 3.1.5
OWSLib 0.28.1
packaging 23.2
pandas 2.2.1
paramiko 3.4.0
parso 0.8.4
partd 1.4.1
pathspec 0.12.1
patsy 0.5.6
Pebble 5.0.7
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.3.0
pip 24.0
pkgutil_resolve_name 1.3.10
platformdirs 4.2.0
plotly 5.19.0
plotnine 0.13.6
pluggy 1.4.0
polars 0.20.31
preshed 3.0.9
prometheus_client 0.20.0
prometheus-flask-exporter 0.23.0
prompt-toolkit 3.0.42
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
py7zr 0.20.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybcj 1.0.2
pycodestyle 2.11.1
pycosat 0.6.6
pycparser 2.21
pycryptodomex 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pyflakes 3.2.0
Pygments 2.17.2
PyJWT 2.8.0
pylint 3.1.0
PyNaCl 1.5.0
pynsee 0.1.7
pyOpenSSL 24.0.0
pyparsing 3.1.2
pyppmd 1.1.0
pyproj 3.6.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.1
pyzmq 25.1.2
pyzstd 0.16.0
QtPy 2.4.1
querystring-parser 1.2.4
rasterio 1.3.10
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
requests-cache 1.2.1
requests-oauthlib 2.0.0
rich 13.7.1
rpds-py 0.18.0
rsa 4.9
Rtree 1.2.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.10.1
scikit-image 0.24.0
scikit-learn 1.4.1.post1
scipy 1.13.0
seaborn 0.13.2
setuptools 69.2.0
shapely 2.0.3
shellingham 1.5.4
six 1.16.0
smart_open 7.0.4
smmap 5.0.0
snuggs 1.4.7
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.7.5
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.29
sqlparse 0.4.4
srsly 2.4.8
stack-data 0.6.2
statsmodels 0.14.1
tabulate 0.9.0
tblib 3.0.0
tenacity 8.2.3
texttable 1.7.0
thinc 8.2.5
threadpoolctl 3.4.0
tifffile 2024.7.2
tomli 2.0.1
tomlkit 0.12.4
toolz 0.12.1
topojson 1.9
tornado 6.4
tqdm 4.66.2
traitlets 5.14.2
truststore 0.8.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2024.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.18
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.1
websocket-client 1.7.0
Werkzeug 3.0.2
wheel 0.43.0
widgetsnbextension 4.0.10
wordcloud 1.9.3
wrapt 1.16.0
xgboost 2.0.3
xlrd 2.0.1
xyzservices 2024.4.0
yarl 1.9.4
yellowbrick 1.5
zict 3.0.0
zipp 3.17.0
zstandard 0.22.0

View file history

| SHA | Date | Author | Description |
|---|---|---|---|
| 6bf883d | 2024-07-08 15:09:21 | Lino Galiana | Rename files (#518) |
| 56b6442 | 2024-07-08 15:05:57 | Lino Galiana | Version anglaise du chapitre numpy (#516) |

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.