Python pour la data science

Lino Galiana

doi:10.5281/zenodo.8229676

1 Introduction

This chapter serves as an introduction to Numpy to ensure that the basics of vector calculations with Python are mastered. The first part of the chapter presents small exercises to practice some basic functions of Numpy. The end of the chapter presents more in-depth practical exercises using Numpy.

It is recommended to regularly refer to the numpy cheatsheet and the official documentation if you have any doubts about a function.

In this chapter, we will adhere to the convention of importing Numpy as follows:

import numpy as np

We will also set the seed of the random number generator to obtain reproducible results:

import numpy as np
rng = np.random.default_rng(seed=12345)

Caution

Historically, random numbers were generated using the module-level functions of the numpy.random package (for example np.random.normal). However, the authors of Numpy now recommend using generators instead. The examples in this tutorial adopt this practice.

2 Concept of array

In the world of data science, as will be discussed in more depth in the upcoming chapters, the central object is the two-dimensional data table. The first dimension corresponds to rows and the second to columns. If we only consider one dimension, we refer to a variable (a column) of our data table. It is therefore natural to link data tables to the mathematical objects of matrices and vectors.

NumPy (Numerical Python) is the foundational library for handling lists of numbers or strings as vectors and matrices. It provides this type of object, along with the associated standardized operations, which do not exist in the base Python language.

The central object of NumPy is the array, which is a multidimensional data table. A Numpy array can be one-dimensional and considered as a vector (1d-array), two-dimensional and considered as a matrix (2d-array), or, more generally, take the form of a multidimensional object (Nd-array), a sort of nested table.

Simple arrays (one or two-dimensional) are easy to represent and cover most of the use cases related to Numpy. We will discover in the next chapter on Pandas that, in practice, we usually don’t use Numpy directly since it is a low-level library. A Pandas DataFrame is constructed from a collection of one-dimensional arrays (the variables of the table), which makes it possible to perform operations that are consistent with, and optimized for, the type of each variable. Having some Numpy knowledge is useful for understanding the logic of vector manipulation, making data processing more readable, efficient, and reliable.

Compared to a list:

an array can only contain one type of data (integer, string, etc.);
operations implemented by Numpy are more efficient and require less memory.
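
As a minimal sketch of the first point (the variable names are illustrative), when a list mixes types, np.array converts every element to a single common type, which can be checked through the dtype attribute:

```python
import numpy as np

mixed_numbers = np.array([1, 2.5, 3])   # integers and a float
mixed_text = np.array([1, "a"])         # an integer and a string

# Every element is converted to one common type
print(mixed_numbers.dtype)       # float64
print(mixed_text.dtype.kind)     # 'U', i.e. unicode strings
print(mixed_text[0])             # the integer 1 has become the string '1'
```

A list, by contrast, would keep the integer and the string as two objects of different types.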

Geographical data will constitute a slightly more complex construction than a traditional DataFrame. The geographical dimension takes the form of a deeper table, at least two-dimensional (coordinates of a point). However, geographical data manipulation libraries will handle this increased complexity.

2.1 Creating an array

We can create an array in several ways. To create an array from a list, simply use the np.array function:

np.array([1,2,5])

array([1, 2, 5])

It is possible to add a dtype argument to constrain the array type:

np.array([["a","z","e"],["r","t"],["y"]], dtype="object")

array([list(['a', 'z', 'e']), list(['r', 't']), list(['y'])], dtype=object)

There are also convenient functions for creating arrays:

Logical sequences: np.arange (evenly spaced values with a given step) or np.linspace (a given number of evenly spaced values between two bounds);
Constant arrays: filled with zeros, ones, or any chosen value: np.zeros, np.ones, or np.full;
Random sequences: random number generation methods such as rng.uniform, rng.normal, etc., where rng is a random number generator;
Identity matrix: np.eye.

This gives, for logical sequences:

np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.arange(0,10,3)

array([0, 3, 6, 9])

np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

For an array initialized to 0:

np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

or initialized to 1:

np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

or even initialized to 3.14:

np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

Finally, to create the matrix \(I_3\):

np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
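
The examples above cover the deterministic constructors. For random sequences, a minimal sketch could look as follows, assuming a generator created as in the introduction; the sizes and parameters are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

u = rng.uniform(low=0, high=1, size=5)   # 5 draws from U(0, 1)
z = rng.normal(loc=0, scale=1, size=5)   # 5 draws from N(0, 1)

# Note that scale is the standard deviation, not the variance
print(u.shape, z.shape)
```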

Exercise 1

Generate:

\(X\) a random variable, 1000 repetitions of a \(U(0,1)\) distribution
\(Y\) a random variable, 1000 repetitions of a normal distribution with zero mean and variance equal to 2
Verify the variance of \(Y\) with np.var

3 Indexing and slicing

3.1 Logic illustrated with a one-dimensional array

The simplest structure is the one-dimensional array:

x = np.arange(10)
print(x)

[0 1 2 3 4 5 6 7 8 9]

Indexing in this case is similar to that of a list:

The first element is at index 0
The nth element is accessible at index \(n-1\)

The logic for accessing elements is as follows:

x[start:stop:step]

With a one-dimensional array, the slicing operation (keeping a slice of the array) is very simple. For example, to keep the first K elements of an array, you would do:

x[:K]

The stop index is excluded, so this returns the elements at indices 0 to K-1. The K\(^{th}\) element itself is selected using:

x[K-1]

To select only one element, you would do:

x = np.arange(10)
x[2]

np.int64(2)

The syntax for selecting particular indices from a list also works with arrays.
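
To make the start:stop:step logic concrete, here is a short sketch on the same array; the particular bounds are arbitrary:

```python
import numpy as np

x = np.arange(10)

print(x[2:8])     # [2 3 4 5 6 7]: the stop index 8 is excluded
print(x[1:8:3])   # [1 4 7]: every third element starting at index 1
print(x[-3:])     # [7 8 9]: the last three elements
print(x[::-1])    # [9 8 7 6 5 4 3 2 1 0]: reversed order
```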

Exercise 2

Take x = np.arange(10) and…

Select elements 0, 3, 5 from x
Select even elements
Select all elements except the first
Select the first 5 elements

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

The same logic applies to multidimensional arrays. Indexing then takes place at several levels. Take, for example, a 2-dimensional array (a matrix of sorts):

If we want to select the 2nd row, 3rd column (the element with value 6), we do

np.int32(6)

Now, to select a complete column (e.g. the 2nd), we specify it with the second index (index 1 in Python since indexing starts from 0) and use : on the first dimension (a shortened version of 0:N) so that all rows are kept:

x[:, 1]
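
The two-dimensional array used in this passage is not shown. As a sketch, assume a 3×3 matrix holding the integers 1 to 9, a hypothetical choice that is consistent with the value 6 found at the 2nd row and 3rd column:

```python
import numpy as np

# Hypothetical matrix, chosen so that row 2 / column 3 holds the value 6
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(x[1, 2])    # 6: 2nd row (index 1), 3rd column (index 2)
print(x[:, 1])    # [2 5 8]: all rows, 2nd column
print(x[0, :])    # [1 2 3]: 1st row, all columns
```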


The principle generalizes to nested arrays, although it becomes more complex. Fortunately, these are objects we rarely manipulate directly, as most of our numerical data are flat tables (a value, the observation, sits at the intersection of a row, the individual, and a column, the variable).



3.2 Regarding performance

A key element in the performance of Numpy compared to lists, when it comes to slicing, is that slicing an array does not return a copy of the selected elements (a copy that costs memory and time) but simply a view of them.

When it is necessary to make a copy, for example to avoid altering the underlying array, you can use the copy method:

x_sub_copy = x[:2, :2].copy()
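
The difference between a view and a copy can be sketched as follows, on a small array chosen for illustration; np.shares_memory is used here only to confirm which objects point to the same data:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

view = x[:, :2]            # slicing returns a view on the same memory
view[0, 0] = 100           # modifying the view also modifies x
print(x[0, 0])             # 100

x_copy = x[:, :2].copy()   # an independent copy
x_copy[0, 0] = -1          # x is left untouched
print(x[0, 0])             # 100

print(np.shares_memory(x, view), np.shares_memory(x, x_copy))  # True False
```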

It is also possible, and more practical, to select data based on logical conditions (an operation called a boolean mask). This functionality will mainly be used to perform data filtering operations.

For simple comparison operations, logical comparators may be sufficient. These comparisons also work on multidimensional arrays thanks to broadcasting, which we will discuss later:

x = np.arange(10)
x2 = np.array([[-1,1,-2],[-3,2,0]])
print(x)
print(x2)

[0 1 2 3 4 5 6 7 8 9]
[[-1  1 -2]
 [-3  2  0]]

x==2
x2<0

array([[ True, False,  True],
       [ True, False, False]])

To select the observations that satisfy a logical condition, pass the boolean array as an index: numpy indexing accepts logical conditions directly.
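
As a short sketch using the array x2 defined above (the name mask is illustrative):

```python
import numpy as np

x2 = np.array([[-1, 1, -2], [-3, 2, 0]])

mask = x2 < 0        # boolean array with the same shape as x2
print(x2[mask])      # [-1 -2 -3]: selected values, returned as a 1-d array
print(mask.sum())    # 3: number of values satisfying the condition
```

Summing a boolean array counts the True values, since True is treated as 1 and False as 0.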

Exercise 3

Given

x = rng.normal(size=10000)

Keep only the values whose absolute value is greater than 1.96
Count the number of values greater than 1.96 in absolute value and their proportion in the whole set
Sum the absolute values of all observations greater (in absolute value) than 1.96 and relate them to the sum of the values of x (in absolute value)

Whenever possible, it is recommended to use numpy’s logical functions, which are optimized and handle multiple dimensions well. Among them are:

count_nonzero;
isnan;
any or all, especially with the axis argument;
np.array_equal to check element-by-element equality.

Let’s create x a multidimensional array and y a one-dimensional array with a missing value.

# Assuming rng has been created beforehand
x = rng.normal(0, size=(3, 4))
y = np.array([np.nan, 0, 1])
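
A possible way to try these functions on x and y is sketched below. The conditions tested (such as x > 0) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=12345)
x = rng.normal(0, size=(3, 4))
y = np.array([np.nan, 0, 1])

print(np.count_nonzero(x > 0))       # number of positive values in x
print(np.isnan(y))                   # [ True False False]
print(np.any(x > 0, axis=0))         # one boolean per column
print(np.all(x > 0, axis=1))         # one boolean per row
print(np.array_equal(x, x.copy()))   # True: same shape and same elements
```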

4 Manipulating an array

4.1 Manipulation functions

Numpy provides standardized methods and functions for modifying an array. The table below shows some of them:

Operation	Implementation
Flatten an array	`x.flatten()` (method)
Transpose an array	`x.T` (attribute) or `np.transpose(x)` (function)
Append elements to the end	`np.append(x, [1,2])`
Insert elements at a given position (at positions 1 and 2)	`np.insert(x, [1,2], 3)`
Delete elements (at positions 0 and 3)	`np.delete(x, [0,3])`
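
The operations in the table can be tried on small arrays. The arrays x and m below are hypothetical examples:

```python
import numpy as np

x = np.array([10, 20, 30, 40])
m = np.array([[1, 2], [3, 4]])

print(m.flatten())               # [1 2 3 4]
print(m.T)                       # [[1 3] [2 4]]
print(np.append(x, [1, 2]))      # [10 20 30 40  1  2]
print(np.insert(x, [1, 2], 3))   # [10  3 20  3 30 40]: 3 inserted before indices 1 and 2
print(np.delete(x, [0, 3]))      # [20 30]
print(x)                         # [10 20 30 40]: x itself is unchanged
```

These functions return new arrays rather than modifying x in place.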

To combine arrays, you can use, depending on the case, the functions np.concatenate, np.vstack or the indexing object np.r_ (row-wise concatenation), and np.hstack, np.column_stack or the indexing object np.c_ (column-wise concatenation).

x = rng.normal(size = 10)
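
The differences between these tools are easiest to see on two small one-dimensional arrays. The arrays a and b below are hypothetical examples:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.concatenate([a, b]))     # [1 2 3 4 5 6]
print(np.r_[a, b])                # [1 2 3 4 5 6]
print(np.hstack([a, b]))          # [1 2 3 4 5 6]
print(np.vstack([a, b]))          # shape (2, 3): a and b become rows
print(np.column_stack([a, b]))    # shape (3, 2): a and b become columns
print(np.c_[a, b])                # shape (3, 2), same as column_stack here
```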

To sort an array, use np.sort:

x = np.array([7, 2, 3, 1, 6, 5, 4])

np.sort(x)

array([1, 2, 3, 4, 5, 6, 7])

If you want to perform a partial reordering to find the k smallest values in an array without fully sorting it, use np.partition. The k smallest values are placed in the first k positions, in no guaranteed order:

np.partition(x, 3)

array([1, 2, 3, 4, 5, 6, 7])
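
Since only the split between the k smallest values and the rest is guaranteed, one way to check the result is to sort the first k entries afterwards. The sketch below also shows np.argpartition, which returns indices instead of values:

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])
k = 3

part = np.partition(x, k)
# The first k values are the k smallest, in no guaranteed order
print(np.sort(part[:k]))           # [1 2 3]

# argpartition returns the corresponding indices instead of the values
idx = np.argpartition(x, k)[:k]
print(np.sort(x[idx]))             # [1 2 3]
```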

For classical descriptive statistics, Numpy offers a number of already implemented functions, which can be combined with the axis argument.

x = rng.normal(0, size=(3, 4))
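
The role of the axis argument can be sketched on a small deterministic matrix (m is a hypothetical example): axis=0 aggregates over rows and returns one value per column, axis=1 returns one value per row.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.mean())          # 3.5: over all elements
print(m.mean(axis=0))    # [2.5 3.5 4.5]: one value per column
print(m.mean(axis=1))    # [2. 5.]: one value per row
print(m.max(axis=0))     # [4 5 6]
```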

Exercise 5

Sum all the elements of an array, the elements by row, and the elements by column. Verify the consistency.
Write a function statdesc to return the following values: mean, median, standard deviation, minimum, and maximum. Apply it to x using the axis argument.

5 Broadcasting

Broadcasting refers to a set of rules for applying operations to arrays of different dimensions. In practice, it generally consists of applying a single operation to all members of a numpy array.

The difference can be understood from the following example. Broadcasting allows the scalar 5 to be treated as a one-dimensional array of length 3, so that both operations below give the same result:

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])

a + b
a + 5

array([5, 6, 7])

Broadcasting can be very practical for efficiently performing operations on data with a complex structure. For more details, refer to the broadcasting section of the official NumPy documentation.
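
Broadcasting also applies between arrays of different shapes, not only between an array and a scalar. A minimal sketch, with hypothetical arrays m, row and col:

```python
import numpy as np

m = np.ones((3, 3))          # shape (3, 3)
row = np.array([0, 1, 2])    # shape (3,)
col = row[:, np.newaxis]     # shape (3, 1)

print(m + row)               # row is added to every row of m
print(m + col)               # col is added to every column of m
print((row + col).shape)     # (3, 3): both operands are stretched
```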

5.1 Application: programming your own k-nearest neighbors

Exercise 6 (a bit more challenging)

Create X, a two-dimensional array (i.e., a matrix) with 10 rows and 2 columns. The numbers in the array are random.
Import the matplotlib.pyplot module as plt. Use plt.scatter to plot the data as a scatter plot.
Construct a 10x10 matrix storing, at element \((i,j)\), the Euclidean distance between points \(X[i,]\) and \(X[j,]\). To do this, you will need to work with dimensions by creating nested arrays using np.newaxis:
1. First, use X1 = X[:, np.newaxis, :] to transform the matrix into a nested array. Check the dimensions.
2. Create X2 of dimension (1, 10, 2) using the same logic.
3. Deduce, for each point, the distance with other points for each coordinate. Square this distance.
4. At this stage, you should have an array of dimension (10, 10, 2). The reduction to a matrix is obtained by summing over the last axis. Check the help of np.sum on how to sum over the last axis.
5. Finally, apply the square root to obtain a proper Euclidean distance.
Verify that the diagonal elements are zero (distance of a point to itself…).
Now, for each point, rank the other points from nearest to farthest. Use np.argsort to get the ranking of the closest points for each row.
We are interested in the k-nearest neighbors. For now, set k=2. Use argpartition to reorder each row so that the 2 closest neighbors of each point come first, followed by the rest of the row.
Use the code snippet below to graphically represent the nearest neighbors.
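
Without giving the solution away, the effect of np.newaxis can be sketched on a simpler case: pairwise absolute differences between the values of a one-dimensional array. The array a is a hypothetical example and is not part of the exercise.

```python
import numpy as np

a = np.array([1.0, 4.0, 6.0])      # hypothetical data, 3 values

a_col = a[:, np.newaxis]           # shape (3, 1)
a_row = a[np.newaxis, :]           # shape (1, 3)

diff = np.abs(a_col - a_row)       # broadcast to shape (3, 3)
print(diff.shape)                  # (3, 3)
print(np.diag(diff))               # [0. 0. 0.]: each value compared with itself
print(np.argsort(diff, axis=1))    # per row, indices ordered from closest to farthest
```

The exercise follows the same pattern with one extra axis for the two coordinates.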

A hint for graphically representing the nearest neighbors

plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
K = 2

for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')

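The result of question 7 is a scatter plot in which each point is joined to its nearest neighbors (figure not reproduced here).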

Did I invent this challenging exercise? Not at all, it comes from the book Python Data Science Handbook. But if I had told you this immediately, would you have tried to answer the questions?

Moreover, it would not be a good idea to generalize this algorithm to large datasets. The complexity of our approach is \(O(N^2)\). The algorithm implemented by Scikit-Learn is \(O(N \log N)\).

Additionally, computing distance matrices on GPUs (graphics cards) would be faster. In this regard, the faiss library, or dedicated frameworks for computing distances between high-dimensional vectors such as ChromaDB, offer much better performance than Numpy for this specific problem.

6 Additional Exercises

Google became famous thanks to its PageRank algorithm. Starting from the links between websites, this algorithm assigns each site an importance score that measures its centrality in the network. The objective of this exercise is to use Numpy to implement such an algorithm from an adjacency matrix that links the sites together.

Understanding the principle of the PageRank algorithm

Create the following matrix with Numpy. Call it M:

\[ \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 1 & 0 \end{bmatrix} \]

To visualize this minimalist web, convert it to a networkx object (a library specialized in network analysis) and use the draw function of this package.

This is the transpose of the adjacency matrix that links the sites together. For example, site 1 (first column) is referenced by sites 2 and 3. It only references site 5.

Using the English Wikipedia page on PageRank, test the algorithm on your matrix.

Site 1 is quite central because it is referenced twice. Site 5 is also central since it is referenced by site 1.

array([[0.25419178],
       [0.13803151],
       [0.13803151],
       [0.20599017],
       [0.26375504]])
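
For reference, a power-iteration sketch of PageRank is given below. It is not necessarily the code behind the output above: the helper name pagerank, the damping factor of 0.85, the uniform starting vector and the 100 iterations are all assumptions. It treats M as column-stochastic, meaning that each column sums to 1.

```python
import numpy as np

M = np.array([
    [0,   0, 0,   0, 1],
    [0.5, 0, 0,   0, 0],
    [0.5, 0, 0,   0, 0],
    [0,   1, 0.5, 0, 0],
    [0,   0, 0.5, 1, 0],
])

def pagerank(M, d=0.85, n_iter=100):
    """Power iteration on a column-stochastic matrix M (hypothetical helper)."""
    n = M.shape[0]
    v = np.full((n, 1), 1 / n)       # start from a uniform score vector
    M_hat = d * M + (1 - d) / n      # damping: random jump with probability 1 - d
    for _ in range(n_iter):
        v = M_hat @ v                # one step of the power iteration
    return v

scores = pagerank(M)
print(scores.round(4))
```

Under these assumptions, the scores sum to 1 and agree with the values printed above to several decimal places.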

Additional information

Environment in which this chapter was built and tested:

Latest built version: 2025-07-29

Python version used:

'3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]'

Package	Version
matplotlib	3.10.3
networkx	3.4.2
numpy	2.2.6

View file history

SHA	Date	Author	Description
1884aef7	2025-07-28 13:47:11	lgaliana	Teste une extension différente
7006f605	2025-07-28 14:20:47	Lino Galiana	Une première PR qui gère plein de bugs détectés par Nicolas (#630)
94648290	2025-07-22 18:57:48	Lino Galiana	Fix boxes now that it is better supported by jupyter (#628)
91431fa2	2025-06-09 17:08:00	Lino Galiana	Improve homepage hero banner (#612)
488780a4	2024-09-25 14:32:16	Lino Galiana	Change badge (#556)
4640e6da	2024-09-18 11:53:05	linogaliana	corrections
88b030e8	2024-08-08 17:45:56	Lino Galiana	Replace by English metadata when relevant (#535)
580cba77	2024-08-07 18:59:35	Lino Galiana	Multilingual version as quarto profile (#533)
72f42bb7	2024-07-25 19:06:38	Lino Galiana	Language message on notebooks (#529)
195dc9e9	2024-07-25 11:59:19	linogaliana	Switch language button
6bf883d9	2024-07-08 15:09:21	Lino Galiana	Rename files (#518)
56b6442d	2024-07-08 15:05:57	Lino Galiana	Version anglaise du chapitre numpy (#516)
065b0abd	2024-07-08 11:19:43	Lino Galiana	Nouveaux callout dans la partie manipulation (#513)
d75641d7	2024-04-22 18:59:01	Lino Galiana	Editorialisation des chapitres de manipulation de données (#491)
005d89b8	2023-12-20 17:23:04	Lino Galiana	Finalise l’affichage des statistiques Git (#478)
16842200	2023-12-02 12:06:40	Antoine Palazzolo	Première partie de relecture de fin du cours (#467)
1f23de28	2023-12-01 17:25:36	Lino Galiana	Stockage des images sur S3 (#466)
a06a2689	2023-11-23 18:23:28	Antoine Palazzolo	2ème relectures chapitres ML (#457)
889a71ba	2023-11-10 11:40:51	Antoine Palazzolo	Modification TP 3 (#443)
a7711832	2023-10-09 11:27:45	Antoine Palazzolo	Relecture TD2 par Antoine (#418)
a63319ad	2023-10-04 15:29:04	Lino Galiana	Correction du TP numpy (#419)
e8d0062d	2023-09-26 15:54:49	Kim A	Relecture KA 25/09/2023 (#412)
154f09e4	2023-09-26 14:59:11	Antoine Palazzolo	Des typos corrigées par Antoine (#411)
a8f90c2f	2023-08-28 09:26:12	Lino Galiana	Update featured paths (#396)
80823022	2023-08-25 17:48:36	Lino Galiana	Mise à jour des scripts de construction des notebooks (#395)
3bdf3b06	2023-08-25 11:23:02	Lino Galiana	Simplification de la structure 🤓 (#393)
9e1e6e41	2023-07-20 02:27:22	Lino Galiana	Change launch script (#379)
130ed717	2023-07-18 19:37:11	Lino Galiana	Restructure les titres (#374)
ef28fefd	2023-07-07 08:14:42	Lino Galiana	Listing pour la première partie (#369)
f21a24d3	2023-07-02 10:58:15	Lino Galiana	Pipeline Quarto & Pages 🚀 (#365)
7e15843a	2023-02-13 18:57:28	Lino Galiana	from_numpy_array no longer in networkx 3.0 (#353)
a408cc96	2023-02-01 09:07:27	Lino Galiana	Ajoute bouton suggérer modification (#347)
3c880d59	2022-12-27 17:34:59	Lino Galiana	Chapitre regex + Change les boites dans plusieurs chapitres (#339)
e2b53ac9	2022-09-28 17:09:31	Lino Galiana	Retouche les chapitres pandas (#287)
d068cb6d	2022-09-24 14:58:07	Lino Galiana	Corrections avec echo true (#279)
b2d48237	2022-09-21 17:36:29	Lino Galiana	Relec KA 21/09 (#273)
a56dd451	2022-09-20 15:27:56	Lino Galiana	Fix SSPCloud links (#270)
f10815b5	2022-08-25 16:00:03	Lino Galiana	Notebooks should now look more beautiful (#260)
494a85ae	2022-08-05 14:49:56	Lino Galiana	Images featured ✨ (#252)
d201e3cd	2022-08-03 15:50:34	Lino Galiana	Pimp la homepage ✨ (#249)
1ca1a8a7	2022-05-31 11:44:23	Lino Galiana	Retour du chapitre API (#228)
4fc58e52	2022-05-25 18:29:25	Lino Galiana	Change deployment on SSP Cloud with new filesystem organization (#227)
12965bac	2022-05-25 15:53:27	Lino Galiana	:launch: Bascule vers quarto (#226)
9c71d6e7	2022-03-08 10:34:26	Lino Galiana	Plus d’éléments sur S3 (#218)
6777f038	2021-10-29 09:38:09	Lino Galiana	Notebooks corrections (#171)
2a8809fb	2021-10-27 12:05:34	Lino Galiana	Simplification des hooks pour gagner en flexibilité et clarté (#166)
26ea709d	2021-09-27 19:11:00	Lino Galiana	Règle quelques problèmes np (#154)
2fa78c9f	2021-09-27 11:24:19	Lino Galiana	Relecture de la partie numpy/pandas (#152)
85ba1194	2021-09-16 11:27:56	Lino Galiana	Relectures des TP KA avant 1er cours (#142)
2e4d5862	2021-09-02 12:03:39	Lino Galiana	Simplify badges generation (#130)
2f7b52d9	2021-07-20 17:37:03	Lino Galiana	Improve notebooks automatic creation (#120)
80877d20	2021-06-28 11:34:24	Lino Galiana	Ajout d’un exercice de NLP à partir openfood database (#98)
6729a724	2021-06-22 18:07:05	Lino Galiana	Mise à jour badge onyxia (#115)
4cdb759c	2021-05-12 10:37:23	Lino Galiana	:sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc	2021-04-30 21:44:04	Lino Galiana	🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
0a0d0348	2021-03-26 20:16:22	Lino Galiana	Ajout d’une section sur S3 (#97)
6d010fa2	2020-09-29 18:45:34	Lino Galiana	Simplifie l’arborescence du site, partie 1 (#57)
66f9f87a	2020-09-24 19:23:04	Lino Galiana	Introduction des figures générées par python dans le site (#52)
edca3916	2020-09-21 19:31:02	Lino Galiana	Change np.is_nan to np.isnan
f9f00cc0	2020-09-15 21:05:54	Lino Galiana	enlève quelques TO DO
4677769b	2020-09-15 18:19:24	Lino Galiana	Nettoyage des coquilles pour premiers TP (#37)
d48e68fa	2020-09-08 18:35:07	Lino Galiana	Continuer la partie pandas (#13)
913047d3	2020-09-08 14:44:41	Lino Galiana	Harmonisation des niveaux de titre (#17)
c452b832	2020-07-28 17:32:06	Lino Galiana	TP Numpy (#9)
200b6c1f	2020-07-27 12:50:33	Lino Galiana	Encore une coquille
5041b280	2020-07-27 12:44:10	Lino Galiana	Une coquille à cause d’un bloc jupyter
e8db4cf0	2020-07-24 12:56:38	Lino Galiana	modif des markdown
b24a1fe7	2020-07-23 18:20:09	Lino Galiana	Add notebook
4f8f1caa	2020-07-23 18:19:28	Lino Galiana	fix typo
434d20e8	2020-07-23 18:18:46	Lino Galiana	Essai de yaml header
5ac02efd	2020-07-23 18:05:12	Lino Galiana	Essai de md généré avec jupytext

Citation

BibTeX citation:

@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}

For attribution, please cite this work as:

Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.