Python pour la data science

Lino Galiana

doi:10.5281/zenodo.8229676

If you want to try the examples in this tutorial:

What you will learn in this chapter

Get to know the matplotlib and seaborn ecosystems, and learn how to build charts by progressively layering elements.
Explore the modern plotnine library — a Python implementation of R’s ggplot2 — which provides a powerful grammar of graphics for constructing visualizations.
Understand how to create interactive, web-based visualizations using the plotly and altair packages.
Learn about the key principles of effective data visualization, including the trade-offs involved in delivering a clear message and the limitations of some traditional chart types.

This chapter focuses on data visualization and presents a classic task for data scientists and data engineers: constructing figures that populate an analytical dashboard, providing a retrospective view of a phenomenon of interest.

To illustrate this, we are going to reproduce several charts available on the City of Paris’ open data portal . Since these charts do not always adhere to best practices in data visualization, we will at times modify them in order to make the represented information more accessible.

The ability to construct effective and engaging data visualizations is an essential skill for any data scientist or researcher. To improve the quality of such visualizations, it is advisable to follow the recommendations offered by specialists in dataviz and graphic semiology.

High-quality visualizations, such as those produced by the New York Times, depend not only on the use of appropriate tools (for example, JavaScript libraries), but also on adherence to established principles of representation that allow the message of a visualization to be grasped within seconds.

Because it is not easy to convey complex information to an audience in a clear and synthetic manner, it is important to consider both the reception of a visualization and the principal messages it is intended to communicate. This presentation by Eric Mauvière illustrates, through numerous examples, how visualization choices influence the effectiveness of the message delivered.

Among other resources that I have found particularly useful, this blog post by datawrapper , a reference in the field of visualization, is especially insightful. This blog post by Albert Rapp also demonstrates how to progressively construct an effective visualization and is well worth revisiting periodically. Finally, among the sites that merit frequent consultation, the resources available on Andrew Heiss’s blog are of considerable value.

Several major families of graphical representations can be distinguished: visualizations of distributions for a single variable, visualizations of relationships between multiple variables, and maps that allow one or more variables to be represented in space.

Each of these families encompasses a variety of specific figure types. For example, depending on the nature of the phenomenon, visualizations of relationships may take the form of a time series (the evolution of a variable over time), a scatter plot (the correlation between two variables), or a bar chart (illustrating the relative relationship between the values of one variable in relation to another), among others.

Rather than attempting to provide an exhaustive catalogue of possible visualizations, this chapter and the next will present a selection designed to encourage further analysis prior to the implementation of any modeling. This chapter focuses on traditional visualizations, while the following chapter is devoted to cartography. Together, these chapters aim to provide an initial framework for synthesizing the information contained in a dataset.

The subsequent step is to advance the work of communication and synthesis through outputs that may take diverse forms, such as reports, scientific publications, articles, presentations, interactive applications, websites, or notebooks such as those used in this course. The general principle remains the same regardless of the medium, and is of particular interest to data scientists when the task involves intensive use of data and requires a reproducible output. A chapter dedicated to this topic may be added to the course in the future¹.

Use an interactive interface to visualize graphics

For visualization chapters, it is highly recommended to use Python via an interactive interface such as a notebook Jupyter (via VSCode or Jupyter for example, see the notebook presentation chapter ).

This makes it possible to view the graphics immediately below each code cell, to adjust them easily, and to test modifications in real time. Conversely, if scripts are run from a conventional console (e.g., by writing to a .py file and executing line by line with MAJ+,ENTREE in VSCode), the graphics will not be displayed in a popup window_ requiring additional commands to save them, before opening the exports manually and being able to correct the code if necessary. This makes for a more laborious learning experience.

Data

This chapter is based on the bicycle passage count data from Parisian measurement points, published on the open data website of the City of Paris.

The analysis of recent historical data has been made easier by the availability of datasets in the Parquet format, a modern alternative that is more efficient and convenient than CSV. Further information on this format can be found in the resources cited in the paragraph dedicated to it in the final chapter of the section on data manipulation .

Code pour importer les données à partir du format Parquet

import os
import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/comptage-velo-donnees-compteurs.parquet"
# problem with https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis

filename = 'comptage_velo_donnees_compteurs.parquet'


# DOWNLOAD FILE --------------------------------

# Perform the HTTP request and stream the download
response = requests.get(url, stream=True)

if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)

    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))

        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")

# READ FILE AND CONVERT TO PANDAS --------------------------

query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

	id_compteur	nom_compteur	id	sum_counts	date
0	100003098-101003098	106 avenue Denfert Rochereau NE-SO	100003098	36	2024-01-01 03:00:00+00:00
1	100003098-101003098	106 avenue Denfert Rochereau NE-SO	100003098	27	2024-01-01 04:00:00+00:00
2	100003098-101003098	106 avenue Denfert Rochereau NE-SO	100003098	10	2024-01-01 06:00:00+00:00

To import the graphical libraries we will use in this chapter, execute

import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *

Warning

Importing libraries in the form from package import * is not a very good practice.

However, for a package like plotnine, many of whose functions we’ll be using, it would be a bit tedious to import functions on a case-by-case basis. What’s more, it allows us to reuse the ggplot R library code examples, which are plentiful on the Internet with visual demonstrations, almost as they are. from package import * is the Python equivalent of the library(package) practice in R.

Since we will regularly recreate variations of the same figure, we will create variables for the axis labels and the title:

title="The 10 bikemeters with the highest hourly average"
xaxis="Meter name"
yaxis="Hourly average"

1 A first figure with `Pandas`’ `Matplotlib` API

Trying to produce a perfect visualization on the first attempt is unrealistic. It is much more practical to gradually improve a graphical representation to progressively highlight structural effects in a dataset.

We will begin by visualizing the distribution of bicycle counts at the main measurement stations. To do this, we will quickly create a barplot and then improve it step by step.

In this section, we will reproduce the first two charts from the data analysis page : The 10 counters with the highest hourly average and The 10 counters that recorded the most bicycles. The numerical values of the charts may differ from those on the webpage, which is expected, as we are not necessarily working with data as up-to-date as that online.

1.1 Understanding the Basics of `matplotlib`

matplotlib dates back to the early 2000s and emerged as a Python alternative for creating charts, similar to Matlab, a proprietary numerical computation software. Thus, matplotlib is quite an old library, predating the rise of Python in the data processing ecosystem. This is reflected in its design, which may not always feel intuitive to those familiar with the modern data science ecosystem. Fortunately, many libraries build upon matplotlib to provide syntax more familiar to data scientists.

matplotlib primarily offers two levels of abstraction: the figure and the axes. The figure is essentially the “canvas” that contains one or more axes, where the charts are placed. Depending on the situation, you might need to modify figure or axis parameters, which makes chart creation highly flexible but also potentially confusing, as it’s not always clear which abstraction level to modify². As shown in Figure 1.1, every element of a figure is customizable.

Figure 1.1: Understanding the Anatomy of a `matplotlib` Figure (Source: Official Documentation)

In practice, there are two ways to create and update your figure, depending on your preference:

The explicit approach, inheriting an object-oriented programming logic, where Figure and Axes objects are created and updated directly.
The implicit approach, based on the pyplot interface, which uses a series of functions to update implicitly created objects.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

# Note that even in the OO-style, we use `.pyplot.figure` to create the Figure.
fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')
ax.plot(x, x, label='linear')  # Plot some data on the Axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the Axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the Axes.
ax.set_ylabel('y label')  # Add a y-label to the Axes.
ax.set_title("Simple Plot")  # Add a title to the Axes.
ax.legend()  # Add a legend.

Source: Official matplotlib Documentation

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

plt.figure(figsize=(5, 2.7), layout='constrained')
plt.plot(x, x, label='linear')  # Plot some data on the (implicit) Axes.
plt.plot(x, x**2, label='quadratic')  # etc.
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()

Source: Official matplotlib Documentation

These elements are the minimum required to understand the logic of matplotlib. To become more comfortable with these concepts, repeated practice is essential.

1.2 Discovering `matplotlib` through `Pandas`

It’s often handy to produce a graph quickly, without necessarily worrying too much about style, but to get a quick idea of the statistical distribution of your data. For this, the integration of basic graphical functions in Pandas is handy: you can directly apply a few instructions to a DataFrame and it will produce a matplotlib figure.

The aim of Exercise 1 is to discover these instructions and how the result can quickly be reworked for visual descriptive statistics.

Exercise 1: Create an initial plot

The data includes several dimensions that can be analyzed statistically. We’ll start by focusing on the volume of passage at various counting stations.

Since our goal is to summarize the information in our dataset, we first need to perform some ad hoc aggregations to create a readable plot.

Retain the ten stations with the highest average. To get an ordered plot from largest to smallest using Pandas plot methods, the data must be sorted from smallest to largest (yes, it’s odd but that’s how it works…). Sort the data accordingly.
Initially, without worrying about styling or aesthetics, create the structure of a barplot (bar chart) as seen on the data analysis page.
To prepare for the second figure, retain only the 10 stations that recorded the highest total number of bicycles.
As in question 2, create a barplot to replicate figure 2 from the Paris open data portal.

The top 10 stations from question 1 are those with the highest average bicycle traffic. These reordered data allow for creating a clear visualization highlighting the busiest stations.

	sum_counts
nom_compteur
72 boulevard Voltaire NO-SE	159.539148
27 quai de la Tournelle SE-NO	166.927660
Quai d'Orsay E-O	178.842743
35 boulevard de Ménilmontant NO-SE	180.364565
Totem 64 Rue de Rivoli Totem 64 Rue de Rivoli Vélos E-O	190.852164

Figure 1.2, displays the data in a basic barplot. While it conveys the essential information, it lacks aesthetic layout, harmonious colors, and clear annotations, which are necessary to improve readability and visual impact.

Figure 1 (click here to mask)

Figure 1.2: First draft for ‘The 10 meters with the highest hourly average’

Figure 2 without styling (click here to mask):

Figure 1.3: First draft of the figure ‘The 10 counters that recorded the most bicycles’

Our visualization starts to communicate a concise message about the nature of the data. In this case, the intended message is the relative hierarchy of station usage.

Nevertheless, several issues remain. Some elements are problematic (for example, labels), others are inconsistent (such as axis titles), and still others are missing altogether (including the title of the graph). This figure remains somewhat unfinished.

Since the graphs produced by Pandas are based on the highly flexible logic of matplotlib, they can be customized extensively. However, this often requires considerable effort, as the matplotlib grammar is neither as standardized nor as intuitive as that of ggplot in R. For those wishing to remain within the matplotlib ecosystem, it is generally preferable to use seaborn directly, as it provides several ready-to-use options. Alternatively, one can turn, as we shall do here, to the plotnine ecosystem, which offers the standardized ggplot syntax for modifying the various elements of a figure.

2 Using `seaborn` directly

2.1 Understanding `seaborn` in a Few Lines

seaborn is a high-level interface built on top of matplotlib. This package provides a set of features to create matplotlib figures or axes directly from a function with numerous arguments. If further customization is needed, matplotlib functionalities can be used to update the figure, whether through the implicit or explicit approaches described earlier.

As with matplotlib, the same figure can be created in multiple ways in seaborn. seaborn inherits the figure-axes duality from matplotlib, requiring frequent adjustments at either level. The main characteristic of seaborn is its standardized entry points, such as seaborn.relplot or seaborn.catplot, and its input logic based on DataFrame, whereas matplotlib is structured around Numpy arrays. However, it is important to be aware that seaborn suffers from the same limitations as matplotlib, particularly the unintuitive nature of the customisation elements, which, if not found in the arguments, can be a headache to implement.

The figure now conveys a message, but it is still not very readable. There are several ways to create a barplot in seaborn. The two main ones are:

sns.catplot
sns.barplot

For this exercise, we suggest using sns.catplot. It is a common entry point for plotting graphs of a discretized variable.

2.2 Reproduction of the previous example with `seaborn`

We will simply reproduce Figure 1.2 with seaborn. To do this, here is the code needed to have a ready-to-use DataFrame:

Code

df1 = (
    df
    .groupby('nom_compteur')
    .agg({'sum_counts': "mean"})
    .sort_values('sum_counts', ascending = False)
    .head(10)
    .sort_values('sum_counts')
)


df1 = df1.reset_index().sort_values("sum_counts", ascending = False)

df1.head()

	nom_compteur	sum_counts
9	Totem 73 boulevard de Sébastopol S-N	327.091037
8	Totem 73 boulevard de Sébastopol N-S	232.560270
7	Totem 64 Rue de Rivoli Totem 64 Rue de Rivoli ...	230.304195
6	102 boulevard de Magenta SE-NO	219.405306
5	89 boulevard de Magenta NO-SE	217.406990

Exercise 2: reproduce the first figure with seaborn

Redraw the previous graph using the catplot function from seaborn. To control the size of the graph, you can use the height and aspect arguments.
Add axis titles and a title to the graph.
Even if it does not add any information, try colouring the x axis red, as in the figure on the open data portal. You can predefine a style with sns.set_style('ticks', {'xtick.color': 'red'}).

At the end of question 1, i.e. using seaborn to reproduce a minimal barplot, we obtain Figure 2.1. This is already a little cleaner than the previous version (Figure 1.2) and may already be sufficient for exploratory work.

At the end of the exercise, we obtain a figure close to the one we are trying to reproduce. The main difference is that ours does not include numerical values.

This shows that Boulevard de Sébastopol is the most traveled, which won’t surprise you if you cycle in Paris. However, if you’re not familiar with Parisian geography, this will provide little information for you. You’ll need an additional graphical representation: a map! We will cover this in a future chapter.

3 And here enters `Plotnine`, a pythonic grammar of graphics

plotnine is the newcomer to the Python visualization ecosystem. This library is developed by Posit, the company behind the RStudio editor and the tidyverse ecosystem, which is central to the R language. This library aims to bring the logic of ggplot to Python, meaning a standardized, readable, and flexible grammar of graphics inspired by Wilkinson (2011).

In this approach, a chart is viewed as a succession of layers that, when combined, create the final figure. This principle is not inherently different from that of matplotlib. However, the grammar used by plotnine is far more intuitive and standardized, offering much more autonomy for modifying a chart.

With plotnine, there is no longer a dual figure-axis entry point. As illustrated in the slides below:

A figure is initialized
Layers are updated, a very general abstraction level that applies to the data represented, axis scales, colors, etc.
Finally, aesthetics can be adjusted by modifying axis labels, legend labels, titles, etc.

We will need hierarchical data to have bars ordered in a consistent manner:

df1["nom_compteur"] = pd.Categorical(
    df1["nom_compteur"],
    categories = df1["nom_compteur"][::-1],
    ordered=True
)

Exercise 4: Reproduce the First Figure with plotnine

This is the same exercise as Exercise 2. The objective is to create this figure with plotnine.

For this exercise, we offer a step-by-step guided correction to illustrate the logic behind the grammar of graphs.

3.1 The plot grid: `ggplot()`

The first step in any figure is to define the object of the graph, i.e. the data that will be visually represented. This is done using the ggplot statement with the following parameters:

The DataFrame, the first parameter of any call to ggplot.
The main variable aesthetic parameters - which are inserted into aes (aesthetics) - which will be common to the different layers. In this case, we only have the axes to declare, but depending on the nature of the graph, we could have other aesthetics whose behaviour would be controlled by a variable in our dataset: colour, point size, curve width, transparency, etc.

ggplot(df1, aes(x="nom_compteur", y="sum_counts"))

This gives us the structure of the graph into which all subsequent elements will be inserted. Regarding the chosen \(x\) and \(y\), this declaration will define a vertical bar plot. We will then see that we are going to reverse the axes to make it more readable, but that will come later.

3.2 Add geometries: `geom_*`

Graphical layers are defined by the geom_ family of functions according to an additive logic (hence the +). These are controlled on two levels:

In the parameters defined by aes, either at the global level (ggplot) or at the level specific to the geometry in question (in the call to geom_)
In the constant parameters that apply uniformly to the layer, defined as constant parameters

(
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_bar(stat="identity", fill="red")
)

You can add several successive layers. For example, the numerical values displayed to provide context can be created using geom_text, whose positioning on the figure is managed by the same parameters as the other layers:

df1["text"] = df1["sum_counts"].round().astype(int).astype(str)

(
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_bar(stat="identity", fill="red") +
1    geom_text(aes(label = "text"), position=position_nudge(y=30))
)

1: This position parameter is unnecessary, even annoying, right now. But we will use it later to shift the label (see plotnine documentation ) when we have reversed the axes.

Figure 3.3: La seconde couche de géométrie (texte)

The harmonisation of visual element declarations enabled by the graphics grammar is achieved using geom_* geometries. It is therefore logical that their behaviour should also be controlled in a standardised manner, using another family of functions: scale_ (scale_x_discrete, scale_x_continuous, scale_color_discrete, etc.).

Thus, each aesthetic (x, y, colour, fill, size, etc.) can be finely tuned in a systematic way via its own scale (scale_*). This offers almost total control over the visual translation of the data.

The functions of the coord_* family, which modify the coordinate system, can also be included in this category. In this case, we will use coord_flip to obtain a vertical bar chart.

(
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_bar(stat="identity", fill="red") +
    geom_text(aes(label = "text"), position=position_nudge(y=30)) +
    scale_y_continuous(expand=(0, 40)) +
    coord_flip()
)

Here, there are few parameters to modify since our scales already suit us well (we don’t have to use log to compress the scale, apply a colour palette, etc.). We’ll just enlarge the \(x\) axis a little so we can enter our numerical values. As before, when swapping coordinates with coord_flip, the axis in question is \(y\), so we’ll play around with scale_y_continuous.

3.3 Labels and themes

The final declaration of our figure is done using the formal elements that are labels (axes, titles, reading notes, etc.) and the theme (preconfigured through the theme_ family or customised with the parameters of the theme function). Before that, let’s reduce the size of our labels by \(y\)

import textwrap

def wrap_label(s, width=30):
    return '\n'.join(textwrap.wrap(s, width=width))

df1["nom_compteur"] = df1["nom_compteur"].apply(wrap_label)

We can now customise our figure:

p = ( 
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_bar(stat="identity", fill="red") +
    geom_text(aes(label = "text"), position=position_nudge(y=30)) +
    scale_y_continuous(expand=(0, 40)) +
    coord_flip() +
    labs(
        title=title,
        x=xaxis,
        y=yaxis
    ) +
    theme(
        panel_background=element_rect(fill="white"),
        line=element_line(color="white"),
        axis_text_x=element_text(angle=45, hjust=1, color="red"),
        axis_title_x=element_text(color="red"),
    )
)

p

Figure 3.5

Although brief, this introduction to the world of ggplot graphics grammar shows just how intuitive - once you understand its logic - and powerful it is.

Caution

To effectively contextualize time-based data, it’s standard practice to use dates along the x-axis. To maintain readability, avoid overloading the axis with too much detail such as showing every single day when months would suffice.

Rotating text vertically to squeeze more labels onto the axis isn’t a great solution - it mostly just gives your reader a sore neck. It’s often better to reduce the number of labels and, if needed, add annotations for particularly important dates.

4 Visualisations alternatives

So far, we have conscientiously reproduced the visualisations offered on the Paris open data dashboard. But we may want to convey the same information using different visualisations:

Lollipop charts are very similar to bar charts, but the visual information is a little more efficient: instead of a thick bar to represent the values, there is a thinner line, which can help to really perceive the scales of magnitude in the data.
Since we need to contextualise the figure with the exact values – while waiting to discover the world of interactivity – why not use a table and insert graphs into it? Tables are not a bad communication medium; on the contrary, if they offer hierarchical visual information, they can be very useful!

Bar charts (barplot) are extremely common, likely due to the legacy of Excel, where these charts can be created with just a couple of clicks. However, in terms of conveying a message, they are far from perfect. For example, the bars take up a lot of visual space, which can obscure the intended message about relationships between observations.

From a semiological perspective, that is, in terms of the effectiveness of conveying a message, lollipop charts are preferable: they convey the same information but with fewer visual elements that might clutter understanding.

Lollipop charts are not perfect either but are slightly more effective at conveying the message. To learn more about alternatives to bar charts, Eric Mauvière’s talk for the public statistics data scientists network, whose main message is “Unstack your figures”, is worth exploring (available on ssphub.netlify.app/ ).

With plotnine, it is not too complicated to create a lollipop chart. All you need are two geometries:

The stick of the lollipop is created with a geom_segment;
The tip of the lollipop is created with a geom_point.

p = (
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_segment(aes(x="nom_compteur", xend="nom_compteur", y=0, yend="sum_counts"), size=1) +
    geom_point(color="white", fill="red", size=6, stroke=1, shape="o") +
    coord_flip() +
    labs(
        title=title,
        x=xaxis,
        y=yaxis
    ) +
    theme_minimal()
)

p

Figure 4.1

p = (
    ggplot(df1, aes(x="nom_compteur", y="sum_counts")) +
    geom_segment(
        aes(x="nom_compteur", xend="nom_compteur", y=0, yend="sum_counts"), size=1, color = "white"
    ) +
    geom_point(color="white", fill="red", size=6, stroke=1, shape="o") +
    coord_flip() +
    labs(
        title=title,
        x=xaxis,
        y=yaxis
    ) +
    theme_minimal() +
    theme(
        plot_background=element_rect(fill="black"),
        panel_background=element_rect(fill="black"),
        line=element_line(color="black"),
        axis_text_x=element_text(color="white"),
        axis_title_x=element_text(color="white"),
        text=element_text(color="white"),
        plot_title=element_text(ha="left")
    )
)

p

Figure 4.2

This alternative representation provides a clearer picture of the difference between the most frequently used counter and the others.

The lollipop chart is a fairly standard representation in biostatistics and economics for representing odds ratios derived from logistic modelling. In this case, the lines are generally used to represent the size of the confidence interval in this literature.

A variant of lollipop charts to represent odds ratios (Galiana et al. 2022)

A variant of the lollipop chart, popularised in particular by datawrapper , also allows intervals to be represented: the range plot. It allows both the hierarchy between observations and the amplitude of a phenomenon to be represented.

![Example of a range plot by Eric Mauvière (ssphub.netlify.app/ ).] (https://ssphub.netlify.app/talk/2024-02-29-mauviere/mauviere.png)

4.1 A stylised table

Tables are a good medium for communicating precise values. But without the addition of contextual elements, such as colour intensities or figures, they are of little help in visually perceiving discrepancies or orders of magnitude.

Thanks to the richness of the HTML format, which allows lightweight graphics to be inserted directly into cells, it is possible to combine numerical precision with visual readability. This gives us the best of both worlds.

We have previously used the great_tables package to represent aggregated statistics. Here, we will use it to integrate a lollipop chart into a table, allowing immediate reading of values while maintaining their accuracy.

We will take this opportunity to clean up the text to be displayed by removing duplicate labels and isolating the direction.

Code

df1["direction"] = df1["nom_compteur"].str.extract(
    r"([A-Z]{1,3}-[A-Z]{1,3})$"
)
df1["nom_compteur"] = df1["nom_compteur"].str.replace(
    r"([A-Z]{1,3}-[A-Z]{1,3})$", "", regex=True
)

def deduplicate_label(label):
    parts = label.split()
    mid = len(parts) // 2
    for i in range(1, mid + 1):
        if parts[:i] == parts[i:2*i]:
            return ' '.join(parts[i:])
    return label

df1["nom_compteur"] = df1["nom_compteur"].apply(deduplicate_label)
df1["nom_compteur"] = df1['nom_compteur'].str.replace("(Vélos|Totem)", "", regex=True)

df1.head()

	nom_compteur	sum_counts	text	direction
9	73 boulevard de\nSébastopol	327.091037	327	S-N
8	73 boulevard de\nSébastopol	232.560270	233	N-S
7	64 Rue de Rivoli	230.304195	230	O-E
6	102 boulevard de Magenta	219.405306	219	SE-NO
5	89 boulevard de Magenta	217.406990	217	NO-SE

We will also create an intermediate column to create a colourful summary visualisation allowing us to see the counters on several lines.

Code

import matplotlib.pyplot as plt

df1["nom_compteur_temp"] = df1["nom_compteur"]

# Discrete colormap
categories = df1["nom_compteur_temp"].unique()
cmap = plt.get_cmap("Dark2")

# Create mapping from label to color hex
colors = {cat: cmap(i / max(len(categories) - 1, 1)) for i, cat in enumerate(categories)}
colors = {k: plt.matplotlib.colors.to_hex(v) for k, v in colors.items()}

# Function to return colored cell
def create_color_cell(label: str) -> str:
    color = colors.get(label, "#ccc")
    return f"""
    <div style="
        width: 20px;
        height: 20px;
        background-color: {color};
        border-radius: 3px;
        margin: auto;
    "></div>
    """

We will create a dictionary to rename our columns in a more intelligible form than the variable names, as well as a variable for referencing the source.

columns_mapping = {
    "nom_compteur": "Location",
    "direction": "",
    "text": "",
    "sum_counts": "",
    "nom_compteur_temp": ""
}
source_note = "**Source**: Vélib counters on the [Paris open data page](https://opendata.paris.fr/explore/dataset/comptage-velo-donnees-compteurs/dataviz/?disjunctive.id_compteur&disjunctive.nom_compteur&disjunctive.id&disjunctive.name)"

Code

from great_tables import *

df_table = df1.loc[:, ["nom_compteur_temp","nom_compteur", "direction", "text", "sum_counts"]]

gt = (
    GT(
        df_table
    )
    .fmt(fns=create_color_cell, columns="nom_compteur_temp")
    .fmt_nanoplot(columns="sum_counts")
    .tab_spanner(
        label="Compteur",
        columns=["nom_compteur_temp", "nom_compteur", "direction"]
    )
    .tab_spanner(
        label="Moyenne horaire",
        columns=["text", "sum_counts"]
    )
    .cols_label(
        **columns_mapping
    )
    .cols_width(
        cases={
            "nom_compteur_temp": "10%",
            "nom_compteur": "40%",
            "direction": "10%",
            "text": "10%"
        }
    )
    .tab_style(
        style=[
            style.text(size = "small")
        ],
        locations = loc.body(df_table.columns.tolist())
    )
    .tab_style(
        style=[
            style.text(weight="bold")
        ],
        locations=loc.body(columns="text")
    )
    .tab_source_note(
        md(source_note)
    )
)

To make it look better on a black background, you can add a few specific settings for this purpose.

gt_dark = gt.tab_options(
    table_background_color="black",
    heading_background_color="black"
)
gt_dark

Table 4.1: Visualisation alternative sous forme de table

Compteur			Moyenne horaire
	Location
	73 boulevard de Sébastopol	S-N	327
	73 boulevard de Sébastopol	N-S	233
	64 Rue de Rivoli	O-E	230
	102 boulevard de Magenta	SE-NO	219
	89 boulevard de Magenta	NO-SE	217
	64 Rue de Rivoli	E-O	191
	35 boulevard de Ménilmontant	NO-SE	180
	Quai d'Orsay	E-O	179
	27 quai de la Tournelle	SE-NO	167
	72 boulevard Voltaire	NO-SE	160
Source: Vélib counters on the Paris open data page

5 Reactive charts with `Javascript` wrappers

Important

A tooltip is a text that appears when hovering over an element in a chart on a computer, or when tapping on it on a smartphone. It adds an extra layer of information through interactivity and can be a useful way to declutter the main message of a visualization.

That said, like any element of a chart, a tooltip requires thoughtful design to be effective. The default tooltips provided by visualization libraries are rarely sufficient. You need to consider what message the tooltip should convey as a textual complement to the visual data shown in the chart.

Again, we won’t go into detail here - this topic alone could fill an entire data visualization course - but it is important to keep in mind when designing interactive charts.

Another important topic we won’t cover here is responsiveness: the ability of a visualization (or a website more generally) to display clearly and function properly across different screen sizes. Designing for multiple devices is challenging but essential, especially given that a growing share of web traffic now comes from smartphones.

In addition, accessibility is another crucial consideration in interactive visualizations. For instance, around 8% of men have some form of color vision deficiency, most commonly difficulty perceiving green (about 6%) or red (about 2%).

In short, ye who enter to data visualization, abandon all hope. While the tools themselves may be easy to use, the needs they must meet are often complex.

5.1 Ecosystem available from `Python`

Static figures created with matplotlib or plotnine are fixed and thus have the disadvantage of not allowing interaction with the viewer. All the information must be contained in the figure, which can make it difficult to read. If the figure is well-made with multiple levels of information, it can still work well.

However, thanks to web technologies, it is simpler to offer visualizations with multiple levels. A first level of information, the quick glance, may be enough to grasp the main messages of the visualization. Then, a more deliberate behavior of seeking secondary information can provide further insights. Reactive visualizations, now the standard in the dataviz world, allow for this approach: the viewer can hover over the visualization to find additional information (e.g., exact values) or click to display complementary details.

These visualizations rely on the same triptych as the entire web ecosystem: HTML, CSS, and JavaScript. Python users will not directly manipulate these languages, which require a certain level of expertise. Instead, they use libraries that automatically generate all the necessary HTML, CSS, and JavaScript code to create the figure.

Several Javascript ecosystems are made available to developers through Python. The two main libraries are Plotly, associated with the Javascript ecosystem of the same name, and Altair, associated with the Vega and Altair ecosystems in Javascript³. To allow Python users to explore the emerging Javascript library Observable Plot, French research engineer Julien Barnier developed pyobsplot, a Python library enabling the use of this ecosystem from Python.

Interactivity should not just be a gimmick that adds no readability or even worsens it. It is rare to rely solely on the figure as produced without further work to make it effective.

5.2 The `Plotly` library

The Plotly package is a wrapper for the Javascript library Plotly.js, allowing for the creation and manipulation of graphical objects very flexibly to produce interactive objects without the need for Javascript.

The recommended entry point is the plotly.express module (documentation here), which provides an intuitive approach for creating charts that can be modified post hoc if needed (e.g., to customize axes).

Displaying Figures Created with Plotly

In a standard Jupyter notebook, the following lines of code allow the output of a Plotly command to be displayed under a code block:

For JupyterLab, the jupyterlab-plotly extension is required:

!jupyter labextension install jupyterlab-plotly

5.3 Replicating the Previous Example with `Plotly`

The following modules will be required to create charts with plotly:

import plotly
import plotly.express as px

Exercise 7: A Barplot with Plotly

The goal is to recreate the first red bar chart using Plotly.

Create the chart using the appropriate function from plotly.express and…
- Do not use the default theme but one with a white background to achieve a result similar to that on the open-data site.
- Use the color_discrete_sequence argument for the red color.
- Remember to label the axes.
Modify the hover text.
Choose a white or a dark theme and use appropriate options.

(a)

(b)

Figure 5.1

5.4 The `altair` library

For this example, we will recreate our previous figure.

Like ggplot/plotnine, Vega is a graphics ecosystem designed to implement the grammar of graphics from Wilkinson (2011). The syntax of Vega is therefore based on a declarative principle: a construction is declared through layers and progressive data transformations.

Originally, Vega was based on a JSON syntax, hence its strong connection to Javascript. However, there is a Python API that allows for creating these types of interactive figures natively in Python. To understand the logic of constructing an altair code, here is how to replicate the previous figure:

View the architecture of an Altair figure

import altair as alt

fig_altair = (
    alt.Chart(df1.reset_index())
    .mark_bar(color='steelblue') 
    .encode(
        x=alt.X('sum_counts:Q', title=xaxis),
        y=alt.Y('nom_compteur:N', sort='-x', title=yaxis),
        tooltip=[
            alt.Tooltip('nom_compteur:N', title=xaxis),
            alt.Tooltip('sum_counts:Q', title=yaxis)
        ]
    ).properties(
        title=title
    ).configure_view(
        strokeOpacity=0
    )
)

fig_altair.interactive()

References

Galiana, Lino, Olivier Meslin, Noémie Courtejoie, and Simon Delage. 2022. “Caractéristiques Socio-économiques Des Individus Aux Formes sévères de Covid-19 Au Fil Des Vagues épidémiques.”

Wilkinson, Leland. 2011. “The Grammar of Graphics.” In Handbook of Computational Statistics: Concepts and Methods, 375–414. Springer.

Informations additionnelles

Python environment

This site was built automatically through a Github action using the Quarto reproducible publishing software (version 1.8.25).

The environment used to obtain the results is reproducible via uv. The pyproject.toml file used to build this environment is available on the linogaliana/python-datascientist repository

pyproject.toml

[project]
name = "python-datascientist"
version = "0.1.0"
description = "Source code for Lino Galiana's Python for data science course"
readme = "README.md"
requires-python = ">=3.12,<3.13"
dependencies = [
    "altair==5.4.1",
    "black==24.8.0",
    "cartiflette",
    "contextily==1.6.2",
    "duckdb>=0.10.1",
    "folium>=0.19.6",
    "geoplot==0.5.1",
    "graphviz==0.20.3",
    "great-tables==0.12.0",
    "ipykernel>=6.29.5",
    "jupyter>=1.1.1",
    "jupyter-cache==1.0.0",
    "kaleido==0.2.1",
    "langchain-community==0.3.9",
    "loguru==0.7.3",
    "markdown>=3.8",
    "nbclient==0.10.0",
    "nbformat==5.10.4",
    "nltk>=3.9.1",
    "pip>=25.1.1",
    "plotly>=6.1.2",
    "plotnine>=0.15",
    "polars==1.8.2",
    "pyarrow==17.0.0",
    "pynsee==0.1.8",
    "python-dotenv==1.0.1",
    "pywaffle==1.1.1",
    "requests>=2.32.3",
    "scikit-image==0.24.0",
    "scipy==1.13.0",
    "spacy==3.8.4",
    "webdriver-manager==4.0.2",
    "wordcloud==1.9.3",
    "xlrd==2.0.1",
    "yellowbrick==1.5",
]

[tool.uv.sources]
cartiflette = { git = "https://github.com/inseefrlab/cartiflette" }

[dependency-groups]
dev = [
    "nb-clean>=4.0.1",
]

To use exactly the same environment (version of Python and packages), please refer to the documentation for uv.

File history

md`This file has been modified __${table_commit.length}__ times since its creation on ${creation_string} (last modified on ${last_modification_string})`

html`<div>${git_history_table}</div>`

html`<div>${git_history_plot}</div>`

SHA	Date	Author	Description
794ce14a	2025-09-15 16:21:42	lgaliana	retouche quelques abstracts
8dafcf38	2025-08-24 22:04:46	Lino Galiana	Deuxième vague de correctifs (#646)
9e03c9b2	2025-08-23 18:57:11	Lino Galiana	Premiers correctifs du chapitre de visualisation (#645)
131ccd4c	2025-08-22 19:42:29	Lino Galiana	Chapitre de visualisation: plus de place pour la grammaire des graphiques (#643)
73043ee7	2025-08-20 14:50:30	Lino Galiana	retire l’historique inutile des données velib (#638)
ff22c636	2025-08-12 09:33:59	lgaliana	Improving data visualisation chapter
2326ae94	2025-08-11 15:00:21	lgaliana	eval: true dans le chapitre des graphiques pour résoudre #626
7006f605	2025-07-28 14:20:47	Lino Galiana	Une première PR qui gère plein de bugs détectés par Nicolas (#630)
99ab48b0	2025-07-25 18:50:15	Lino Galiana	Utilisation des callout classiques pour les box notes and co (#629)
94648290	2025-07-22 18:57:48	Lino Galiana	Fix boxes now that it is better supported by jupyter (#628)
91431fa2	2025-06-09 17:08:00	Lino Galiana	Improve homepage hero banner (#612)
2f96f636	2025-01-29 19:49:36	Lino Galiana	Tweak callout for colab engine (#591)
1b184cba	2025-01-24 18:07:32	lgaliana	Traduction 🇬🇧 du chapitre 1 de la partie dataviz
e66fee04	2024-12-23 15:12:18	Lino Galiana	Fix errors in generated notebooks (#583)
cbe6459f	2024-11-12 07:24:15	lgaliana	Revoir quelques abstracts
9cf2bde5	2024-10-18 15:49:47	lgaliana	Reconstruction complète du chapitre de cartographie
c9a3f963	2024-09-24 15:18:59	Lino Galiana	Finir la reprise du chapitre matplotlib (#555)
46f038a4	2024-09-23 15:28:36	Lino Galiana	Mise à jour du premier chapitre sur les figures (#553)
59f5803d	2024-09-22 16:41:46	Lino Galiana	Update bike count source data for visualisation tutorial (#552)
06d003a1	2024-04-23 10:09:22	Lino Galiana	Continue la restructuration des sous-parties (#492)
005d89b8	2023-12-20 17:23:04	Lino Galiana	Finalise l’affichage des statistiques Git (#478)
3fba6124	2023-12-17 18:16:42	Lino Galiana	Remove some badges from python (#476)
cf91965e	2023-12-02 13:15:18	linogaliana	href in dataviz chapter
1f23de28	2023-12-01 17:25:36	Lino Galiana	Stockage des images sur S3 (#466)
09654c71	2023-11-14 15:16:44	Antoine Palazzolo	Suggestions Git & Visualisation (#449)
889a71ba	2023-11-10 11:40:51	Antoine Palazzolo	Modification TP 3 (#443)
df01f019	2023-10-10 15:55:04	Lino Galiana	Menus automatisés (#432)
a7711832	2023-10-09 11:27:45	Antoine Palazzolo	Relecture TD2 par Antoine (#418)
154f09e4	2023-09-26 14:59:11	Antoine Palazzolo	Des typos corrigées par Antoine (#411)
057dae1b	2023-09-20 16:28:46	Lino Galiana	Chapitre visualisation (#406)
1d0780ca	2023-09-18 14:49:59	Lino Galiana	Problème rendu chapitre matplotlib (#405)
a8f90c2f	2023-08-28 09:26:12	Lino Galiana	Update featured paths (#396)
3bdf3b06	2023-08-25 11:23:02	Lino Galiana	Simplification de la structure 🤓 (#393)
78ea2cbd	2023-07-20 20:27:31	Lino Galiana	Change titles levels (#381)
8df7cb22	2023-07-20 17:16:03	linogaliana	Change link
f0c583c0	2023-07-07 14:12:22	Lino Galiana	Images viz (#371)
f21a24d3	2023-07-02 10:58:15	Lino Galiana	Pipeline Quarto & Pages 🚀 (#365)
f2e89224	2023-06-12 14:54:20	Lino Galiana	Remove spoiler shortcode (#364)
2dc82e7b	2022-10-18 22:46:47	Lino Galiana	Relec Kim (visualisation + API) (#302)
03babc6c	2022-10-03 16:53:47	Lino Galiana	Parler des règles de la dataviz (#291)
89c10c32	2022-08-25 08:30:22	Lino Galiana	Adaptation du shortcode spoiler en notebook (#257)
494a85ae	2022-08-05 14:49:56	Lino Galiana	Images featured ✨ (#252)
d201e3cd	2022-08-03 15:50:34	Lino Galiana	Pimp la homepage ✨ (#249)
2812ef40	2022-07-07 15:58:58	Lino Galiana	Petite viz sympa des prenoms (#242)
a4e24263	2022-06-16 19:34:18	Lino Galiana	Improve style (#238)
02ed1e25	2022-06-09 19:06:05	Lino Galiana	Règle problème plotly (#235)
299cff3d	2022-06-08 13:19:03	Lino Galiana	Problème code JS suite (#233)
5698e303	2022-06-03 18:28:37	Lino Galiana	Finalise widget (#232)
7b9f27be	2022-06-03 17:05:15	Lino Galiana	Essaie régler les problèmes widgets JS (#231)
12965bac	2022-05-25 15:53:27	Lino Galiana	:launch: Bascule vers quarto (#226)
9c71d6e7	2022-03-08 10:34:26	Lino Galiana	Plus d’éléments sur S3 (#218)
4f675284	2021-12-12 08:37:21	Lino Galiana	Improve website appareance (#194)
66a52761	2021-11-23 16:13:20	Lino Galiana	Relecture partie visualisation (#181)
2a8809fb	2021-10-27 12:05:34	Lino Galiana	Simplification des hooks pour gagner en flexibilité et clarté (#166)
2f4d3905	2021-09-02 15:12:29	Lino Galiana	Utilise un shortcode github (#131)
2e4d5862	2021-09-02 12:03:39	Lino Galiana	Simplify badges generation (#130)
80877d20	2021-06-28 11:34:24	Lino Galiana	Ajout d’un exercice de NLP à partir openfood database (#98)
6729a724	2021-06-22 18:07:05	Lino Galiana	Mise à jour badge onyxia (#115)
4cdb759c	2021-05-12 10:37:23	Lino Galiana	:sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc	2021-04-30 21:44:04	Lino Galiana	🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
0a0d0348	2021-03-26 20:16:22	Lino Galiana	Ajout d’une section sur S3 (#97)
a5b7c990	2020-10-05 15:07:09	Lino Galiana	Donne lien vers données compteurs
18be8f43	2020-10-01 17:08:53	Lino Galiana	Intégration de box inspirées du thème pydata sphinx (#58)
5ac3cbee	2020-09-28 18:59:24	Lino Galiana	Continue la partie graphiques (#54)
94f39ecc	2020-09-24 21:25:32	Lino Galiana	quelques mots sur vizu

creation = d3.min(
  table_commit.map(d => new Date(d.Date))
)

last_modification = d3.max(
  table_commit.map(d => new Date(d.Date))
)

creation_string = creation.toLocaleString("fr", {
  "day": "numeric",
  "month": "long",
  "year": "numeric"
})

last_modification_string = last_modification.toLocaleString("fr", {
  "day": "numeric",
  "month": "long",
  "year": "numeric"
})

git_history_table = Inputs.table(
  table_commit,
  {
    format: {
      SHA: x => md`[${x}](${github_repo}/commit/${x})`,
      Description: x => md`${replacePullRequestPattern(x, github_repo)}`,
      /*Date: x => x.toLocaleString("fr", {
        "month": "numeric",
        "day": "numeric",
        "year": "numeric"
        })
      */
    }
  }
)

git_history_plot = Plot.plot({
  marks: [
    Plot.ruleY([0], {stroke: "royalblue"}),
    Plot.dot(
          table_commit,
          Plot.pointerX({x: (d) => new Date(d.date), y: 0, stroke: "red"})),
    Plot.dot(table_commit, {x: (d) => new Date(d.Date), y: 0, fill: "royalblue"})
  ]
})

function replacePullRequestPattern(inputString, githubRepo) {
    // Use a regular expression to match the pattern #digit
    var pattern = /#(\d+)/g;

    // Replace the pattern with ${github_repo}/pull/#digit
    var replacedString = inputString.replace(pattern, '[#$1](' + githubRepo + '/pull/$1)');

    return replacedString;
}

github_repo = "https://github.com/linogaliana/python-datascientist"

table_commit = {

// Get the HTML table by its class name
var table = document.querySelector('.commit-table');

// Check if the table exists
if (table) {
    // Initialize an array to store the table data
    var dataArray = [];

    // Extract headers from the first row
    var headers = [];
    for (var i = 0; i < table.rows[0].cells.length; i++) {
        headers.push(table.rows[0].cells[i].textContent.trim());
    }

    // Iterate through the rows, starting from the second row
    for (var i = 1; i < table.rows.length; i++) {
        var row = table.rows[i];
        var rowData = {};

        // Iterate through the cells in the row
        for (var j = 0; j < row.cells.length; j++) {
            // Use headers as keys and cell content as values
            rowData[headers[j]] = row.cells[j].textContent.trim();
        }

        // Push the rowData object to the dataArray
        dataArray.push(rowData);
    }
  }

  return dataArray

}

// Get the element with class 'git-details'
{
  var gitDetails = document.querySelector('.commit-table');

  // Check if the element exists
  if (gitDetails) {
      // Hide the element
      gitDetails.style.display = 'none';
  }
}

Plot = require('@observablehq/plot@0.6.12/dist/plot.umd.min.js')

Back to top

Footnotes

This forthcoming chapter will be structured around the Quarto ecosystem. In the meantime, readers are encouraged to consult the exemplary documentation available for this ecosystem and to experiment with it directly, as this remains the most effective way to learn.↩︎
Thankfully, with a vast amount of online code using matplotlib, code assistants like ChatGPT or Github Copilot are invaluable for creating charts based on instructions.↩︎
The names of these libraries are inspired by the Summer Triangle constellation, of which Vega and Altair are two members.↩︎

Citation

BibTeX citation:

@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}

For attribution, please cite this work as:

Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.

Data

1 A first figure with Pandas’ Matplotlib API

1.1 Understanding the Basics of matplotlib

1.2 Discovering matplotlib through Pandas

2 Using seaborn directly

2.1 Understanding seaborn in a Few Lines

2.2 Reproduction of the previous example with seaborn

3 And here enters Plotnine, a pythonic grammar of graphics

3.1 The plot grid: ggplot()

3.2 Add geometries: geom_*

3.3 Labels and themes

4 Visualisations alternatives

4.1 A stylised table

5 Reactive charts with Javascript wrappers

5.1 Ecosystem available from Python

5.2 The Plotly library

5.3 Replicating the Previous Example with Plotly

5.4 The altair library

References

Informations additionnelles

Footnotes

Citation

1 A first figure with `Pandas`’ `Matplotlib` API

1.1 Understanding the Basics of `matplotlib`

1.2 Discovering `matplotlib` through `Pandas`

2 Using `seaborn` directly

2.1 Understanding `seaborn` in a Few Lines

2.2 Reproduction of the previous example with `seaborn`

3 And here enters `Plotnine`, a pythonic grammar of graphics

3.1 The plot grid: `ggplot()`

3.2 Add geometries: `geom_*`

5 Reactive charts with `Javascript` wrappers

5.1 Ecosystem available from `Python`

5.2 The `Plotly` library

5.3 Replicating the Previous Example with `Plotly`

5.4 The `altair` library