Building graphics with Python

An essential part of the data scientist’s job is to be able to synthesize information into powerful graphical representations. This chapter looks at the challenges of representing data with Python and at the ecosystem available for doing so. It also opens the door to interactive data visualization with Plotly.

Visualization
Exercise
Author

Lino Galiana

Published

2025-03-19

Skills at the End of This Chapter
  • Discover the matplotlib and seaborn ecosystems for constructing charts through the successive enrichment of layers.
  • Explore the modern plotnine ecosystem, a Python implementation of the R package ggplot2 for this type of representation, which offers a powerful syntax for building data visualizations through its grammar of graphics.
  • Understand the concept of interactive HTML (web format) visualizations through the plotly and altair packages.
  • Learn the challenges of graphical representation, the trade-offs needed to convey a clear message, and the limitations of certain traditional representations.

The practice of data visualization in this course will involve replicating charts found on the open data page of the City of Paris here or proposing alternatives using the same data.

The goal of this chapter is not to provide a comprehensive inventory of charts that can be created with Python. That would be long, somewhat tedious, and unnecessary, as websites like python-graph-gallery.com/ already excel at showcasing a wide variety of examples. Instead, the objective is to illustrate, through practice, some key challenges and opportunities related to using the main graphical libraries in Python.

We can distinguish several major families of visualizations: representations of distributions specific to a single variable, representations of relationships between multiple variables, and maps that allow spatial representation of one or more variables.

These families themselves branch into various types of figures. For instance, depending on the nature of the phenomenon, relationship representations may take the form of a time series (evolution of a variable over time), a scatter plot (correlation between two variables), or a bar chart (highlighting the relative values of one variable in relation to another), among others.

Rather than an exhaustive inventory of possible visualizations, this chapter and the next will present some visualizations that may inspire further analysis before implementing a form of modeling. This chapter focuses on traditional visualizations, while the next chapter is dedicated to cartography. Together, these two chapters aim to provide the initial tools for synthesizing the information present in a dataset.

The next step is to deepen the work of communication and synthesis through various forms of output, such as reports, scientific publications or articles, presentations, interactive applications, websites, or notebooks like those provided in this course. The general principle is the same regardless of the medium and is particularly relevant for data scientists working with intensive data analysis. This will be the subject of a future chapter in this course.

Important

Being able to create interesting data visualizations is a necessary skill for any data scientist or researcher. To improve the quality of these visualizations, it is recommended to follow certain advice from dataviz specialists on graphical semiology.

Good data visualizations, like those from the New York Times, rely not only on appropriate tools (such as JavaScript libraries) but also on certain design principles that allow the message of a visualization to be understood in just a few seconds.

This blog post is a resource worth consulting regularly. This blog post by Albert Rapp clearly demonstrates how to gradually build a good data visualization.

Note

If you are interested in R, a very similar version of this practical work is available in this introductory R course for ENS Ulm.

1 Data

This chapter is based on the bicycle passage count data from Parisian measurement points, published on the open data website of the City of Paris.

The use of recent historical data has been greatly facilitated by the availability of data in the Parquet format, a modern format more practical than CSV. For more information about this format, you can refer to the resources mentioned in the section dedicated to it in the advanced chapter.

Code to import the data from the Parquet format
import os
import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/comptage-velo-donnees-compteurs.parquet"
# problem with https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis

filename = 'comptage_velo_donnees_compteurs.parquet'


# DOWNLOAD FILE --------------------------------

if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)

    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))

        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")

# READ FILE AND CONVERT TO PANDAS --------------------------

query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

df.head(3)

2 Initial Graphical Productions with Pandas’ Matplotlib API

Trying to produce a perfect visualization on the first attempt is unrealistic. It is much more practical to gradually improve a graphical representation to progressively highlight structural effects in a dataset.

We will begin by visualizing the distribution of bicycle counts at the main measurement stations. To do this, we will quickly create a barplot and then improve it step by step.

In this section, we will reproduce the first two charts from the data analysis page: The 10 counters with the highest hourly average and The 10 counters that recorded the most bicycles. The numerical values of the charts may differ from those on the webpage, which is expected, as we are not necessarily working with data as up-to-date as that online.

To import the graphical libraries we will use in this chapter, execute

import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *

Importing libraries in the form from package import * is not a very good practice. However, for a package like plotnine, many of whose functions we’ll be using, it would be tedious to import them one by one. Moreover, it allows us to reuse, almost as they are, the many ggplot R code examples available on the Internet with visual demonstrations. from package import * is the Python equivalent of the library(package) idiom in R.

2.1 Understanding the Basics of matplotlib

matplotlib dates back to the early 2000s and emerged as a Python alternative for creating charts, similar to Matlab, a proprietary numerical computation software. Thus, matplotlib is quite an old library, predating the rise of Python in the data processing ecosystem. This is reflected in its design, which may not always feel intuitive to those familiar with the modern data science ecosystem. Fortunately, many libraries build upon matplotlib to provide syntax more familiar to data scientists.

matplotlib primarily offers two levels of abstraction: the figure and the axes. The figure is essentially the “canvas” that contains one or more axes, where the charts are placed. Depending on the situation, you might need to modify figure or axis parameters, which makes chart creation highly flexible but also potentially confusing, as it’s not always clear which abstraction level to modify. As shown in Figure 2.1, every element of a figure is customizable.

Figure 2.1: Understanding the Anatomy of a matplotlib Figure (Source: Official Documentation)

In practice, there are two ways to create and update your figure, depending on your preference:

  • The explicit approach, inheriting an object-oriented programming logic, where Figure and Axes objects are created and updated directly.
  • The implicit approach, based on the pyplot interface, which uses a series of functions to update implicitly created objects.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

# Note that even in the OO-style, we use `.pyplot.figure` to create the Figure.
fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')
ax.plot(x, x, label='linear')  # Plot some data on the Axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the Axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the Axes.
ax.set_ylabel('y label')  # Add a y-label to the Axes.
ax.set_title("Simple Plot")  # Add a title to the Axes.
ax.legend()  # Add a legend.

Source: Official matplotlib Documentation

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

plt.figure(figsize=(5, 2.7), layout='constrained')
plt.plot(x, x, label='linear')  # Plot some data on the (implicit) Axes.
plt.plot(x, x**2, label='quadratic')  # etc.
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()

Source: Official matplotlib Documentation

These elements are the minimum required to understand the logic of matplotlib. To become more comfortable with these concepts, repeated practice is essential.

2.2 Discovering matplotlib through Pandas

Exercise 1: Create an Initial Plot

The data includes several dimensions that can be analyzed statistically. We’ll start by focusing on the volume of passage at various counting stations.

Since our goal is to summarize the information in our dataset, we first need to perform some ad hoc aggregations to create a readable plot.

  1. Retain the ten stations with the highest average. To get an ordered plot from largest to smallest using Pandas plot methods, the data must be sorted from smallest to largest (yes, it’s odd but that’s how it works…). Sort the data accordingly.

  2. Initially, without worrying about styling or aesthetics, create the structure of a barplot (bar chart) as seen on the data analysis page.

  3. To prepare for the second figure, retain only the 10 stations that recorded the highest total number of bicycles.

  4. As in question 2, create a barplot to replicate figure 2 from the Paris open data portal.

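To illustrate the sorting quirk mentioned in question 1, here is a minimal sketch with pandas’ plot API on hypothetical toy data (the station names and values are made up, not the Paris counters): sorting in ascending order places the largest bar on top of a horizontal bar chart, because barh draws the first row at the bottom.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use

# Hypothetical toy data (not the Paris counter dataset)
df_demo = pd.DataFrame(
    {"sum_counts": [300, 150, 220]},
    index=["Sébastopol", "Rivoli", "Austerlitz"],
)

# Sort ascending: barh draws rows bottom-to-top,
# so ascending order puts the largest bar at the top
ax = df_demo["sum_counts"].sort_values().plot(kind="barh")
```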
The top 10 stations from question 1 are those with the highest average bicycle traffic. These reordered data allow for creating a clear visualization highlighting the busiest stations.

Figure 1, without any styling, displays the data in a basic barplot. While it conveys the essential information, it lacks aesthetic layout, harmonious colors, and clear annotations, which are necessary to improve readability and visual impact.

Figure 2 without styling:

We are starting to create something that conveys a synthetic message about the nature of the data. However, several issues remain (e.g., labels), as well as elements that are either incorrect (axis titles, etc.) or missing (graph title…).

Since the charts produced by Pandas follow the highly flexible logic of matplotlib, they can be customized. However, this often requires significant effort, and the matplotlib grammar is not as standardized as ggplot in R. If you wish to remain in the matplotlib ecosystem, it is better to use seaborn directly, which provides ready-to-use arguments. Alternatively, you can switch to the plotnine ecosystem, which offers a standardized syntax for modifying elements.

3 Using seaborn Directly

3.1 Understanding seaborn in a Few Lines

seaborn is a high-level interface built on top of matplotlib. This package provides a set of features to create matplotlib figures or axes directly from a function with numerous arguments. If further customization is needed, matplotlib functionalities can be used to update the figure, whether through the implicit or explicit approaches described earlier.

As with matplotlib, the same figure can be created in multiple ways in seaborn. seaborn inherits the figure-axes duality from matplotlib, requiring frequent adjustments at either level. The main characteristic of seaborn is its standardized entry points, such as seaborn.relplot or seaborn.catplot, and its input logic based on DataFrame, whereas matplotlib is structured around Numpy arrays.
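
To see this DataFrame-oriented logic in action, here is a minimal sketch of a figure-level seaborn entry point on hypothetical toy data (the column names and values are made up for illustration):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import seaborn as sns

# Hypothetical toy data (not the Paris counter dataset)
df_demo = pd.DataFrame({
    "station": ["A", "B", "C"],
    "count": [300, 220, 150],
})

# Figure-level entry point: catplot creates the Figure and Axes itself
g = sns.catplot(data=df_demo, x="count", y="station", kind="bar", height=3, aspect=2)
g.set_axis_labels("Average hourly count", "")
```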

The figure now conveys a message, but it is still not very readable. There are several ways to create a barplot in seaborn. The two main ones are:

  • sns.catplot
  • sns.barplot

For this exercise, we suggest using sns.catplot. It is a common entry point for plotting graphs of a discretized variable.

3.2 The bar chart (barplot)

Exercise 2: Reproduce the First Figure with Seaborn
  1. Reset the index of the dataframes df1 and df2 to obtain a column ‘Nom du compteur’. Reorder the data in descending order to obtain a correctly ordered graph with seaborn.

  2. Redo the previous graph using seaborn’s catplot function. To control the size of the graph, you can use the height and aspect arguments.

  3. Add axis titles and the graph title for the first graph.

  4. Try coloring the x axis in red. You can pre-define a style with sns.set_style("ticks", {"xtick.color": "red"}).

At the end of question 2, that is, by using seaborn to minimally reproduce a barplot, we get:

After some aesthetic adjustments, at the end of questions 3 and 4, we get a figure close to that of the Paris open data portal.

The additional parameters proposed in question 4 ultimately allow us to obtain the figure

This shows that Boulevard de Sébastopol is the most traveled, which won’t surprise you if you cycle in Paris. However, if you’re not familiar with Parisian geography, this will provide little information for you. You’ll need an additional graphical representation: a map! We will cover this in a future chapter.

Exercise 2b: Reproducing the Figure “The 10 Counters That Recorded the Most Bicycles”

Following the gradual approach of Exercise 2, recreate the chart The 10 Counters That Recorded the Most Bicycles using seaborn.

3.3 An Alternative to the Barplot: the Lollipop Chart

Bar charts (barplot) are extremely common, likely due to the legacy of Excel, where these charts can be created with just a couple of clicks. However, in terms of conveying a message, they are far from perfect. For example, the bars take up a lot of visual space, which can obscure the intended message about relationships between observations.

From a semiological perspective, that is, in terms of the effectiveness of conveying a message, lollipop charts are preferable: they convey the same information but with fewer visual elements that might clutter understanding.

Lollipop charts are not perfect either but are slightly more effective at conveying the message. To learn more about alternatives to bar charts, Eric Mauvière’s talk for the public statistics data scientists network, whose main message is “Unstack your figures”, is worth exploring (available on ssphub.netlify.app/).
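
Before the exercise, here is a minimal sketch of the lollipop idea with plain matplotlib on hypothetical toy data (names and values are made up): a thin horizontal line for the “stick” and a single point for the “candy”.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

# Hypothetical toy data (not the Paris counter dataset)
df_demo = pd.DataFrame(
    {"station": ["A", "B", "C"], "count": [300, 220, 150]}
).sort_values("count")

fig, ax = plt.subplots()
# The "stick": a thin grey line from 0 to the value
ax.hlines(y=df_demo["station"], xmin=0, xmax=df_demo["count"], color="grey")
# The "candy": one marker per station at the value
ax.plot(df_demo["count"], df_demo["station"], "o", color="red")
ax.set_xlabel("Average hourly count")
```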

Exercise 3 (optional): Reproduce Figure 2 with a lollipop chart

Following the gradual approach of Exercise 2, redo the graph The 10 counters that have recorded the most bikes.

💡 Don’t hesitate to consult python-graph-gallery.com/ or ask ChatGPT for help.

4 The Same Figure with Plotnine

plotnine is the newcomer to the Python visualization ecosystem. It brings the logic of ggplot2, the flagship visualization package of the R tidyverse ecosystem, to Python: a standardized, readable, and flexible grammar of graphics inspired by Wilkinson (2012).

The mindset of ggplot2 users when they discover plotnine

In this approach, a chart is viewed as a succession of layers that, when combined, create the final figure. This principle is not inherently different from that of matplotlib. However, the grammar used by plotnine is far more intuitive and standardized, offering much more autonomy for modifying a chart.

The logic of ggplot (and plotnine) by Lisa (2021), image itself borrowed from Field (2012)

With plotnine, there is no longer a dual figure-axis entry point. As illustrated in the slides below:

  1. A figure is initialized
  2. Layers are updated, a very general abstraction level that applies to the data represented, axis scales, colors, etc.
  3. Finally, aesthetics can be adjusted by modifying axis labels, legend labels, titles, etc.

Scroll slides below or click here to display slides full screen.

Exercise 4: Reproduce the First Figure with plotnine

This is the same exercise as Exercise 2. The objective is to create this figure with plotnine.

5 Initial Temporal Aggregations

We will now focus on the temporal dimension of our dataset using two approaches:

  • A bar chart summarizing the information in our dataset on a monthly basis;
  • Informative series on temporal dynamics, which will be the subject of the next section.

Before that, we will enhance this data to include a longer history, particularly encompassing the Covid period in our dataset. This is interesting due to the unique traffic dynamics during this time (sudden halt, strong recovery, etc.).

See the code to obtain a longer history of data
import requests
import zipfile
import io
import os
from pathlib import Path
import pandas as pd
import geopandas as gpd

list_useful_columns = [
        "Identifiant du compteur", "Nom du compteur",
        "Identifiant du site de comptage",
        "Nom du site de comptage",
        "Comptage horaire",
        "Date et heure de comptage"
    ]


# GENERIC FUNCTION TO RETRIEVE DATA -------------------------


def download_unzip_and_read(url, extract_to='.', list_useful_columns=list_useful_columns):
    """
    Downloads a zip file from the specified URL, extracts its contents, and reads the CSV file based on the filename pattern in the URL.

    Parameters:
    - url (str): The URL of the zip file to download.
    - extract_to (str): The directory where the contents of the zip file should be extracted.

    Returns:
    - df (DataFrame): The loaded pandas DataFrame from the extracted CSV file.
    """
    try:
        # Extract the file pattern from the URL (last path component)
        file_pattern = url.rstrip('/').split('/')[-1]


        # Send a GET request to the specified URL to download the file
        response = requests.get(url)
        response.raise_for_status()  # Ensure we get a successful response

        # Create a ZipFile object from the downloaded bytes
        with zipfile.ZipFile(io.BytesIO(response.content)) as z:
            # Extract all the contents to the specified directory
            z.extractall(path=extract_to)
            print(f"Extracted all files to {os.path.abspath(extract_to)}")

        dir_extract_to = Path(extract_to)

        # Look for the file matching the pattern
        csv_filename = [
            f.name for f in dir_extract_to.iterdir() if f.suffix == '.csv'
        ]

        if not csv_filename:
            print(f"No file matching pattern '{file_pattern}' found.")
            return None

        # Read the first matching CSV file into a pandas DataFrame
        csv_path = dir_extract_to / csv_filename[0]
        print(f"Reading file: {csv_path}")
        df = pd.read_csv(csv_path, sep=";")

        # CONVERT TO GEOPANDAS
        df[['latitude', 'longitude']] = df['Coordonnées géographiques'].str.split(',', expand=True)
        df['latitude'] = pd.to_numeric(df['latitude'])
        df['longitude'] = pd.to_numeric(df['longitude'])
        gdf = gpd.GeoDataFrame(
            df, geometry=gpd.points_from_xy(df.longitude, df.latitude)
        )

        # CONVERT TO TIMESTAMP
        df["Date et heure de comptage"] = (
            df["Date et heure de comptage"]
            .astype(str)
            .str.replace(r'\+.*', '', regex=True)
        )
        df["Date et heure de comptage"] = pd.to_datetime(
            df["Date et heure de comptage"],
            format="%Y-%m-%dT%H:%M:%S",
            errors="coerce"
        )
        # Keep only the useful columns (the geometry built above is not needed downstream)
        gdf = df.loc[
            :, list_useful_columns
        ]
        return gdf

    except requests.exceptions.RequestException as e:
        print(f"Error: The downloaded file has not been found: {e}")
        return None
    except zipfile.BadZipFile as e:
        print(f"Error: The downloaded file is not a valid zip file: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def read_historical_bike_data(year):
    dataset = "comptage_velo_donnees_compteurs"
    url_comptage = f"https://opendata.paris.fr/api/datasets/1.0/comptage-velo-historique-donnees-compteurs/attachments/{year}_{dataset}_csv_zip/"
    df_comptage = download_unzip_and_read(
        url_comptage, extract_to=f'./extracted_files_{year}'
    )
    if (df_comptage is None):
        url_comptage_alternative = url_comptage.replace("_csv_zip", "_zip")
        df_comptage = download_unzip_and_read(url_comptage_alternative, extract_to=f'./extracted_files_{year}')
    return df_comptage


# IMPORT HISTORICAL DATA -----------------------------

historical_bike_data = pd.concat(
    [read_historical_bike_data(year) for year in range(2018, 2024)]
)

rename_columns_dict = {
    "Identifiant du compteur": "id_compteur",
    "Nom du compteur": "nom_compteur",
    "Identifiant du site de comptage": "id",
    "Nom du site de comptage": "nom_site",
    "Comptage horaire": "sum_counts",
    "Date et heure de comptage": "date"
}


historical_bike_data = historical_bike_data.rename(
    columns=rename_columns_dict
)


# IMPORT LATEST MONTHS ----------------

import os
import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = 'https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis'
filename = 'comptage_velo_donnees_compteurs.parquet'


# DOWNLOAD FILE --------------------------------

if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)

    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))

        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='B',
                unit_scale=True,
                unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")


# READ FILE AND CONVERT TO PANDAS
query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

df.head(3)


# PUT THEM TOGETHER ----------------------------

historical_bike_data['date'] = (
    historical_bike_data['date']
    .dt.tz_localize(None)
)

df["date"] = df["date"].dt.tz_localize(None)

historical_bike_data = (
    historical_bike_data
    .loc[historical_bike_data["date"] < df["date"].min()]
)

df = pd.concat(
    [historical_bike_data, df]
)

To begin, let us reproduce the third figure, which is, once again, a barplot. Here, from a semiological perspective, it is not justified to use a barplot; a simple time series would suffice to provide similar information.

The first question in the next exercise involves an initial encounter with temporal data through a fairly common time series operation: changing the format of a date to allow aggregation at a broader time step.

Exercise 5: Monthly Counts Barplot
  1. Create a month variable whose format follows, for example, the 2019-08 scheme using the correct option of the dt.to_period method.

  2. Apply the previous tips to gradually build and improve a graph to obtain a figure similar to the 3rd production on the Parisian open data page. Create this figure first from early 2022 and then over the entire period of our history.

  3. Optional Question: Represent the same information in the form of a lollipop.
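
The date-reformatting step of question 1 can be sketched as follows on hypothetical timestamps (the dates are made up; this is not the exercise solution):

```python
import pandas as pd

# Hypothetical timestamps
dates = pd.Series(
    pd.to_datetime(["2019-08-14 06:00", "2019-08-20 18:00", "2019-09-01 09:00"])
)

# "M" aggregates each timestamp to a monthly period such as 2019-08
month = dates.dt.to_period("M")

# The coarser time step then allows monthly aggregation
counts_by_month = month.value_counts().sort_index()
```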

The figure with data from early 2022 will look like this if it was created with plotnine:

With seaborn, it will look more like this:

If you prefer to represent this as a lollipop chart:

Finally, over the entire period, the series will look more like this:

6 First Time Series

It is more common to represent data with a temporal dimension as a series rather than stacked bars.

Exercise 6: Time Series of Daily Counts
  1. Create a day variable that converts the timestamp into a daily format like 2021-05-01 using dt.date.
  2. Reproduce the figure from the open data page.
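
The conversion in question 1 can be sketched on hypothetical hourly timestamps (the dates are made up; this is not the exercise solution):

```python
import pandas as pd

# Hypothetical hourly timestamps
ts = pd.Series(
    pd.to_datetime(["2021-05-01 07:00", "2021-05-01 18:30", "2021-05-02 08:15"])
)

day = ts.dt.date                       # datetime.date objects such as 2021-05-01
daily_counts = ts.groupby(day).size()  # number of observations per day
```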

7 Reactive Charts with Javascript Libraries

7.1 The Ecosystem Available from Python

Static figures created with matplotlib or plotnine are fixed and thus have the disadvantage of not allowing interaction with the viewer. All the information must be contained in the figure, which can make it difficult to read. If the figure is well-made with multiple levels of information, it can still work well.

However, thanks to web technologies, it is simpler to offer visualizations with multiple levels. A first level of information, the quick glance, may be enough to grasp the main messages of the visualization. Then, a more deliberate behavior of seeking secondary information can provide further insights. Reactive visualizations, now the standard in the dataviz world, allow for this approach: the viewer can hover over the visualization to find additional information (e.g., exact values) or click to display complementary details.

These visualizations rely on the same triptych as the entire web ecosystem: HTML, CSS, and JavaScript. Python users will not directly manipulate these languages, which require a certain level of expertise. Instead, they use libraries that automatically generate all the necessary HTML, CSS, and JavaScript code to create the figure.

Several Javascript ecosystems are made available to developers through Python. The two main libraries are Plotly, associated with the Javascript ecosystem of the same name, and Altair, associated with the Vega and Vega-Lite ecosystems in Javascript. To allow Python users to explore the emerging Javascript library Observable Plot, French research engineer Julien Barnier developed pyobsplot, a Python library enabling the use of this ecosystem from Python.

Interactivity should not just be a gimmick that adds no readability or even worsens it. It is rare to rely solely on the figure as produced without further work to make it effective.

7.1.1 The Plotly Library

The Plotly package is a wrapper for the Javascript library Plotly.js, allowing for the creation and manipulation of graphical objects very flexibly to produce interactive objects without the need for Javascript.

The recommended entry point is the plotly.express module (documentation here), which provides an intuitive approach for creating charts that can be modified post hoc if needed (e.g., to customize axes).

Displaying Figures Created with Plotly

In a standard Jupyter notebook, the following lines of code allow the output of a Plotly command to be displayed under a code block:

from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
For JupyterLab, the jupyterlab-plotly extension is required:

!jupyter labextension install jupyterlab-plotly

7.2 Replicating the Previous Example with Plotly

The following modules will be required to create charts with plotly:

import plotly
import plotly.express as px
Exercise 7: A Barplot with Plotly

The goal is to recreate the first red bar chart using Plotly.

  1. Create the chart using the appropriate function from plotly.express and…
    • Do not use the default theme but one with a white background to achieve a result similar to that on the open-data site.
    • Use the color_discrete_sequence argument for the red color.
    • Remember to label the axes.
    • Consider the text color for the lower axis.
  2. Try another theme with a dark background. For colors, group the three highest values together and separate the others.

The first question allows the creation of the following chart:

Whereas with the dark theme (question 2), we get:

7.3 The altair Library

For this example, we will recreate our previous figure.

Like ggplot/plotnine, Vega is a graphics ecosystem designed to implement the grammar of graphics from Wilkinson (2012). The syntax of Vega is therefore based on a declarative principle: a construction is declared through layers and progressive data transformations.

Originally, Vega was based on a JSON syntax, hence its strong connection to Javascript. However, there is a Python API that allows for creating these types of interactive figures natively in Python. To understand the logic of constructing an altair code, here is how to replicate the previous figure:

View the architecture of an Altair figure
import altair as alt

color_scale = alt.Scale(domain=[True, False], range=['green', 'red'])

fig2 = (
    alt.Chart(df1)
    .mark_bar()
    .encode(
        x=alt.X('average(sum_counts):Q', title='Average count per hour over the selected period'),
        y=alt.Y('nom_compteur:N', sort='-x', title=''),
        color=alt.Color('top:N', scale=color_scale, legend=alt.Legend(title="Top")),
        tooltip=[
            alt.Tooltip('nom_compteur:N', title='Counter Name'),
            alt.Tooltip('sum_counts:Q', title='Hourly Average')
        ]
    ).properties(
        title='The 10 counters with the highest hourly average'
    ).configure_view(
        strokeOpacity=0
    )
)

fig2.interactive()

  1. First, the dataframe to be used is declared, similar to ggplot(df) in plotnine. Then, the desired chart type is specified (in this case, a bar chart, mark_bar in Altair’s grammar).
  2. The main layer is defined using encode. This can accept either simple column names or more complex constructors, as shown here.
  3. A constructor is defined for the x-axis, both to manage value scaling and its parameters (e.g., labels). Here, the x-axis is defined as a continuous value (:Q), the average of sum_counts for each \(y\) value. This average is not strictly necessary; we could have used sum_counts:Q or even sum_counts, but this illustrates data transformations in altair.
  4. The tooltip adds interactivity to the chart.
  5. Properties are specified at the end of the declaration to finalize the chart.

Additional information

The environment files for this website have been tested on the configuration below.

Python version used:

Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0

View file history

SHA Date Author Description
2f96f636 2025-01-29 19:49:36 Lino Galiana Tweak callout for colab engine (#591)
1b184cba 2025-01-24 18:07:32 lgaliana Traduction 🇬🇧 du chapitre 1 de la partie dataviz
e66fee04 2024-12-23 15:12:18 Lino Galiana Fix errors in generated notebooks (#583)
cbe6459f 2024-11-12 07:24:15 lgaliana Revoir quelques abstracts
9cf2bde5 2024-10-18 15:49:47 lgaliana Reconstruction complète du chapitre de cartographie
c9a3f963 2024-09-24 15:18:59 Lino Galiana Finir la reprise du chapitre matplotlib (#555)
46f038a4 2024-09-23 15:28:36 Lino Galiana Mise à jour du premier chapitre sur les figures (#553)
59f5803d 2024-09-22 16:41:46 Lino Galiana Update bike count source data for visualisation tutorial (#552)
06d003a1 2024-04-23 10:09:22 Lino Galiana Continue la restructuration des sous-parties (#492)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
3fba6124 2023-12-17 18:16:42 Lino Galiana Remove some badges from python (#476)
cf91965e 2023-12-02 13:15:18 linogaliana href in dataviz chapter
1f23de28 2023-12-01 17:25:36 Lino Galiana Stockage des images sur S3 (#466)
09654c71 2023-11-14 15:16:44 Antoine Palazzolo Suggestions Git & Visualisation (#449)
889a71ba 2023-11-10 11:40:51 Antoine Palazzolo Modification TP 3 (#443)
df01f019 2023-10-10 15:55:04 Lino Galiana Menus automatisés (#432)
a7711832 2023-10-09 11:27:45 Antoine Palazzolo Relecture TD2 par Antoine (#418)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
057dae1b 2023-09-20 16:28:46 Lino Galiana Chapitre visualisation (#406)
1d0780ca 2023-09-18 14:49:59 Lino Galiana Problème rendu chapitre matplotlib (#405)
a8f90c2f 2023-08-28 09:26:12 Lino Galiana Update featured paths (#396)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
78ea2cbd 2023-07-20 20:27:31 Lino Galiana Change titles levels (#381)
8df7cb22 2023-07-20 17:16:03 linogaliana Change link
f0c583c0 2023-07-07 14:12:22 Lino Galiana Images viz (#371)
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
f2e89224 2023-06-12 14:54:20 Lino Galiana Remove spoiler shortcode (#364)
2dc82e7b 2022-10-18 22:46:47 Lino Galiana Relec Kim (visualisation + API) (#302)
03babc6c 2022-10-03 16:53:47 Lino Galiana Parler des règles de la dataviz (#291)
89c10c32 2022-08-25 08:30:22 Lino Galiana Adaptation du shortcode spoiler en notebook (#257)
494a85ae 2022-08-05 14:49:56 Lino Galiana Images featured ✨ (#252)
d201e3cd 2022-08-03 15:50:34 Lino Galiana Pimp la homepage ✨ (#249)
2812ef40 2022-07-07 15:58:58 Lino Galiana Petite viz sympa des prenoms (#242)
a4e24263 2022-06-16 19:34:18 Lino Galiana Improve style (#238)
02ed1e25 2022-06-09 19:06:05 Lino Galiana Règle problème plotly (#235)
299cff3d 2022-06-08 13:19:03 Lino Galiana Problème code JS suite (#233)
5698e303 2022-06-03 18:28:37 Lino Galiana Finalise widget (#232)
7b9f27be 2022-06-03 17:05:15 Lino Galiana Essaie régler les problèmes widgets JS (#231)
12965bac 2022-05-25 15:53:27 Lino Galiana :launch: Bascule vers quarto (#226)
9c71d6e7 2022-03-08 10:34:26 Lino Galiana Plus d’éléments sur S3 (#218)
4f675284 2021-12-12 08:37:21 Lino Galiana Improve website appareance (#194)
66a52761 2021-11-23 16:13:20 Lino Galiana Relecture partie visualisation (#181)
2a8809fb 2021-10-27 12:05:34 Lino Galiana Simplification des hooks pour gagner en flexibilité et clarté (#166)
2f4d3905 2021-09-02 15:12:29 Lino Galiana Utilise un shortcode github (#131)
2e4d5862 2021-09-02 12:03:39 Lino Galiana Simplify badges generation (#130)
80877d20 2021-06-28 11:34:24 Lino Galiana Ajout d’un exercice de NLP à partir openfood database (#98)
6729a724 2021-06-22 18:07:05 Lino Galiana Mise à jour badge onyxia (#115)
4cdb759c 2021-05-12 10:37:23 Lino Galiana :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc 2021-04-30 21:44:04 Lino Galiana 🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
0a0d0348 2021-03-26 20:16:22 Lino Galiana Ajout d’une section sur S3 (#97)
a5b7c990 2020-10-05 15:07:09 Lino Galiana Donne lien vers données compteurs
18be8f43 2020-10-01 17:08:53 Lino Galiana Intégration de box inspirées du thème pydata sphinx (#58)
5ac3cbee 2020-09-28 18:59:24 Lino Galiana Continue la partie graphiques (#54)
94f39ecc 2020-09-24 21:25:32 Lino Galiana quelques mots sur vizu

References

Field, A. 2012. Discovering Statistics Using R. Sage.
DeBruine, Lisa. 2021. “psyTeachR Book Template.” https://github.com/psyteachr/template/.
Wilkinson, Leland. 2012. The Grammar of Graphics. Springer.

Footnotes

  1. This chapter will be built around the Quarto ecosystem. In the meantime, you can consult the excellent documentation of this ecosystem and practice, which is the best way to learn.↩︎

  2. Thankfully, with a vast amount of online code using matplotlib, code assistants like ChatGPT or GitHub Copilot are invaluable for creating charts based on instructions.↩︎

  3. The names of these libraries are inspired by the Summer Triangle constellation, of which Vega and Altair are two members.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.