An essential part of the data scientist's job is synthesizing information into powerful graphical representations. This chapter looks at the challenges of data representation with Python and the ecosystem available for this task. It also opens the door to interactive data representation with Plotly.
Discover the matplotlib and seaborn ecosystems for constructing charts through the successive enrichment of layers.
Explore the modern plotnine ecosystem, a Python implementation of the R package ggplot2, which offers a powerful syntax for building data visualizations through its grammar of graphics.
Understand the concept of interactive HTML (web format) visualizations through the plotly and altair packages.
Learn the challenges of graphical representation, the trade-offs needed to convey a clear message, and the limitations of certain traditional representations.
The practice of data visualization in this course will involve replicating charts found on the open data page of the City of Paris, or proposing alternatives using the same data.
The goal of this chapter is not to provide a comprehensive inventory of charts that can be created with Python. That would be long, somewhat tedious, and unnecessary, as websites like python-graph-gallery.com/ already excel at showcasing a wide variety of examples. Instead, the objective is to illustrate, through practice, some key challenges and opportunities related to using the main graphical libraries in Python.
We can distinguish several major families of visualizations: representations of distributions specific to a single variable, representations of relationships between multiple variables, and maps that allow spatial representation of one or more variables.
These families themselves branch into various types of figures. For instance, depending on the nature of the phenomenon, relationship representations may take the form of a time series (evolution of a variable over time), a scatter plot (correlation between two variables), or a bar chart (highlighting the relative values of one variable in relation to another), among others.
Rather than an exhaustive inventory of possible visualizations, this chapter and the next will present some visualizations that may inspire further analysis before implementing a form of modeling. This chapter focuses on traditional visualizations, while the next chapter is dedicated to cartography. Together, these two chapters aim to provide the initial tools for synthesizing the information present in a dataset.
The next step is to deepen the work of communication and synthesis through various forms of output, such as reports, scientific publications or articles, presentations, interactive applications, websites, or notebooks like those provided in this course. The general principle is the same regardless of the medium and is particularly relevant for data scientists working with intensive data analysis. This will be the subject of a future chapter in this course1.
Important
Being able to create interesting data visualizations is a necessary skill for any data scientist or researcher. To improve the quality of these visualizations, it is recommended to follow certain advice from dataviz specialists on graphical semiology.
Good data visualizations, like those from the New York Times, rely not only on appropriate tools (such as JavaScript libraries) but also on certain design principles that allow the message of a visualization to be understood in just a few seconds.
Blog posts on graphical semiology are resources worth consulting regularly. This blog post by Albert Rapp, for example, clearly demonstrates how to gradually build a good data visualization.
This chapter is based on the bicycle passage count data from Parisian measurement points, published on the open data website of the City of Paris.
The use of recent historical data has been greatly facilitated by the availability of data in the Parquet format, a modern format more practical than CSV. For more information about this format, you can refer to the resources mentioned in the section dedicated to it in the advanced chapter.
Code to import the data from the Parquet format
import os

import requests
from tqdm import tqdm
import pandas as pd
import duckdb

url = "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/comptage-velo-donnees-compteurs.parquet"
# problem with https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis
filename = 'comptage_velo_donnees_compteurs.parquet'

# DOWNLOAD FILE --------------------------------
if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)
    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))
        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
            desc=filename,
            total=total_size,
            unit='B',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")

# READ FILE AND CONVERT TO PANDAS --------------------------
query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()
df.head(3)
2 Initial Graphical Productions with Pandas’ Matplotlib API
Trying to produce a perfect visualization on the first attempt is unrealistic. It is much more practical to gradually improve a graphical representation to progressively highlight structural effects in a dataset.
We will begin by visualizing the distribution of bicycle counts at the main measurement stations. To do this, we will quickly create a barplot and then improve it step by step.
In this section, we will reproduce the first two charts from the data analysis page: The 10 counters with the highest hourly average and The 10 counters that recorded the most bicycles. The numerical values of the charts may differ from those on the webpage, which is expected, as we are not necessarily working with data as up-to-date as that online.
To import the graphical libraries we will use in this chapter, execute:

import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *  # (1)

(1) Importing libraries in the form from package import * is not a very good practice. However, for a package like plotnine, many of whose functions we will be using, it would be a bit tedious to import functions on a case-by-case basis. What's more, it allows us to reuse code examples written for the R library ggplot2, which are plentiful on the Internet with visual demonstrations, almost as they are. from package import * is the Python equivalent of the library(package) practice in R.
2.1 Understanding the Basics of matplotlib
matplotlib dates back to the early 2000s and emerged as a Python alternative for creating charts, similar to Matlab, a proprietary numerical computation software. Thus, matplotlib is quite an old library, predating the rise of Python in the data processing ecosystem. This is reflected in its design, which may not always feel intuitive to those familiar with the modern data science ecosystem. Fortunately, many libraries build upon matplotlib to provide syntax more familiar to data scientists.
matplotlib primarily offers two levels of abstraction: the figure and the axes. The figure is essentially the “canvas” that contains one or more axes, where the charts are placed. Depending on the situation, you might need to modify figure or axis parameters, which makes chart creation highly flexible but also potentially confusing, as it’s not always clear which abstraction level to modify2. As shown in Figure 2.1, every element of a figure is customizable.
Figure 2.1: Understanding the Anatomy of a matplotlib Figure (Source: Official Documentation)
In practice, there are two ways to create and update your figure, depending on your preference:
The explicit approach, inheriting an object-oriented programming logic, where Figure and Axes objects are created and updated directly.
The implicit approach, based on the pyplot interface, which uses a series of functions to update implicitly created objects.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

# Note that even in the OO-style, we use `.pyplot.figure` to create the Figure.
fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')
ax.plot(x, x, label='linear')  # Plot some data on the Axes.
ax.plot(x, x**2, label='quadratic')  # Plot more data on the Axes...
ax.plot(x, x**3, label='cubic')  # ... and some more.
ax.set_xlabel('x label')  # Add an x-label to the Axes.
ax.set_ylabel('y label')  # Add a y-label to the Axes.
ax.set_title("Simple Plot")  # Add a title to the Axes.
ax.legend()  # Add a legend.
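For comparison, the implicit approach produces the same figure through module-level pyplot functions, which update an implicitly created "current" figure behind the scenes. Here is a minimal sketch mirroring the explicit example above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2, 100)  # Sample data.

# Implicit (pyplot) style: each call updates the current figure and axes
plt.figure(figsize=(5, 2.7), layout='constrained')
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()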
These elements are the minimum required to understand the logic of matplotlib. To become more comfortable with these concepts, repeated practice is essential.
2.2 Discovering matplotlib through Pandas
Exercise 1: Create an Initial Plot
The data includes several dimensions that can be analyzed statistically. We’ll start by focusing on the volume of passage at various counting stations.
Since our goal is to summarize the information in our dataset, we first need to perform some ad hoc aggregations to create a readable plot.
1. Retain the ten stations with the highest average (see the sketch below). To get an ordered plot from largest to smallest using Pandas plot methods, the data must be sorted from smallest to largest (yes, it's odd, but that's how it works...). Sort the data accordingly.
2. Initially, without worrying about styling or aesthetics, create the structure of a barplot (bar chart) as seen on the data analysis page.
3. To prepare for the second figure, retain only the 10 stations that recorded the highest total number of bicycles.
4. As in question 2, create a barplot to replicate figure 2 from the Paris open data portal.
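A minimal sketch of these aggregation and plotting steps, assuming df has the nom_compteur and sum_counts columns created by the import code above:

# Average hourly count per station, keeping the ten highest
df1 = (
    df.groupby('nom_compteur')
    .agg({'sum_counts': 'mean'})
    .sort_values('sum_counts', ascending=False)
    .head(10)
)

# Pandas draws bars in index order, so sort ascending
# to get the largest bar on top of a horizontal barplot
df1.sort_values('sum_counts').plot(kind='barh')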
The top 10 stations from question 1 are those with the highest average bicycle traffic. These reordered data allow for creating a clear visualization highlighting the busiest stations.
Figure 1, without any styling, displays the data in a basic barplot. While it conveys the essential information, it lacks aesthetic layout, harmonious colors, and clear annotations, which are necessary to improve readability and visual impact.
Figure 2 without styling:
We are starting to create something that conveys a synthetic message about the nature of the data. However, several issues remain (e.g., labels), as well as elements that are either incorrect (axis titles, etc.) or missing (graph title…).
Since the charts produced by Pandas follow the highly flexible logic of matplotlib, they can be customized. However, this often requires significant effort, and the matplotlib grammar is not as standardized as ggplot in R. If you wish to remain in the matplotlib ecosystem, it is better to use seaborn directly, which provides ready-to-use arguments. Alternatively, you can switch to the plotnine ecosystem, which offers a standardized syntax for modifying elements.
3 Using seaborn Directly
3.1 Understanding seaborn in a Few Lines
seaborn is a high-level interface built on top of matplotlib. This package provides a set of features to create matplotlib figures or axes directly from a function with numerous arguments. If further customization is needed, matplotlib functionalities can be used to update the figure, whether through the implicit or explicit approaches described earlier.
As with matplotlib, the same figure can be created in multiple ways in seaborn. seaborn inherits the figure-axes duality from matplotlib, requiring frequent adjustments at either level. The main characteristic of seaborn is its standardized entry points, such as seaborn.relplot or seaborn.catplot, and its input logic based on DataFrames, whereas matplotlib is structured around NumPy arrays.
The figure now conveys a message, but it is still not very readable. There are several ways to create a barplot in seaborn. The two main ones are:
sns.catplot
sns.barplot
For this exercise, we suggest using sns.catplot. It is a common entry point for plotting graphs of a discretized variable.
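For instance, a minimal call could look like the following sketch (assuming the df1 dataframe built in the previous exercise, with its index reset):

# kind="bar" makes catplot produce a barplot;
# height and aspect control the figure size
g = sns.catplot(
    data=df1,
    x='sum_counts', y='nom_compteur',
    kind='bar',
    height=5, aspect=2,
    color='red'
)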
3.2 The bar chart (barplot)
Exercise 2: Reproduce the First Figure with Seaborn
1. Reset the index of the dataframes df1 and df2 to have a column 'Nom du compteur'. Reorder the data in descending order to obtain a correctly ordered graph with seaborn.
2. Redo the previous graph using seaborn's catplot function. To control the size of the graph, you can use the height and aspect arguments.
3. Add axis titles and the graph title for the first graph.
4. Try coloring the x axis in red. You can pre-define a style with sns.set_style("ticks", {"xtick.color": "red"}).
At the end of question 2, that is, by using seaborn to minimally reproduce a barplot, we get:
After some aesthetic adjustments, at the end of questions 3 and 4, we get a figure close to that of the Paris open data portal.
The additional parameters proposed in question 4 ultimately allow us to obtain the figure
This shows that Boulevard de Sébastopol is the most traveled, which won’t surprise you if you cycle in Paris. However, if you’re not familiar with Parisian geography, this will provide little information for you. You’ll need an additional graphical representation: a map! We will cover this in a future chapter.
Exercise 2b: Reproducing the Figure “The 10 Counters That Recorded the Most Bicycles”
Following the gradual approach of Exercise 2, recreate the chart The 10 Counters That Recorded the Most Bicycles using seaborn.
3.3 An Alternative to the Barplot: the Lollipop Chart
Bar charts (barplot) are extremely common, likely due to the legacy of Excel, where these charts can be created with just a couple of clicks. However, in terms of conveying a message, they are far from perfect. For example, the bars take up a lot of visual space, which can obscure the intended message about relationships between observations.
From a semiological perspective, that is, in terms of the effectiveness of conveying a message, lollipop charts are preferable: they convey the same information but with fewer visual elements that might clutter understanding.
Lollipop charts are not perfect either but are slightly more effective at conveying the message. To learn more about alternatives to bar charts, Eric Mauvière’s talk for the public statistics data scientists network, whose main message is “Unstack your figures”, is worth exploring (available on ssphub.netlify.app/).
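matplotlib has no dedicated lollipop function, but one can be assembled from two layers: hlines for the sticks and a marker plot for the heads. A minimal sketch, assuming the df1 dataframe used previously:

import numpy as np
import matplotlib.pyplot as plt

y_pos = np.arange(len(df1))

fig, ax = plt.subplots(figsize=(8, 4))
# The "stick": a thin line from 0 to the value
ax.hlines(y=y_pos, xmin=0, xmax=df1['sum_counts'], color='grey', alpha=0.6)
# The "candy": a marker at the end of each stick
ax.plot(df1['sum_counts'], y_pos, 'o', color='red')
ax.set_yticks(y_pos)
ax.set_yticklabels(df1['nom_compteur'])
ax.set_xlabel('Total count over the period')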
Exercise 3 (optional): Reproduce Figure 2 with a lollipop chart
Following the gradual approach of Exercise 2, redo the graph The 10 counters that have recorded the most bikes.
4 The Grammar of Graphics with plotnine

plotnine is the newcomer to the Python visualization ecosystem. This library is developed by Posit, the company behind the RStudio editor and the tidyverse ecosystem, which is central to the R language. It aims to bring the logic of ggplot to Python, that is, a standardized, readable, and flexible grammar of graphics inspired by Wilkinson (2012).
The mindset of ggplot2 users when they discover plotnine
In this approach, a chart is viewed as a succession of layers that, when combined, create the final figure. This principle is not inherently different from that of matplotlib. However, the grammar used by plotnine is far more intuitive and standardized, offering much more autonomy for modifying a chart.
The logic of ggplot (and plotnine) by Lisa (2021), image itself borrowed from Field (2012)
With plotnine, there is no longer a dual figure-axis entry point. As illustrated in the slides below:
A figure is initialized
Layers are updated, a very general abstraction level that applies to the data represented, axis scales, colors, etc.
Finally, aesthetics can be adjusted by modifying axis labels, legend labels, titles, etc.
Scroll through the slides below or click here to display them in full screen.
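In code, this layered logic reads almost like a sentence. A minimal plotnine sketch, assuming the df1 dataframe used in the previous exercises:

(
    ggplot(df1)  # initialize the figure from a dataframe
    + geom_bar(aes(x='nom_compteur', y='sum_counts'), stat='identity', fill='red')  # data layer
    + coord_flip()  # coordinate layer: horizontal bars
    + labs(x='', y='Average hourly count', title='The 10 busiest counters')  # aesthetics
)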
Exercise 4: Reproduce the First Figure with plotnine
This is the same exercise as Exercise 2. The objective is to create this figure with plotnine.
5 Initial Temporal Aggregations
We will now focus on the temporal dimension of our dataset using two approaches:
A bar chart summarizing the information in our dataset on a monthly basis;
Informative series on temporal dynamics, which will be the subject of the next section.
Before that, we will enhance this data to include a longer history, particularly encompassing the Covid period in our dataset. This is interesting due to the unique traffic dynamics during this time (sudden halt, strong recovery, etc.).
View the code to obtain a longer data history
import requests
import zipfile
import io
import os
from pathlib import Path

import duckdb
import pandas as pd
import geopandas as gpd
from tqdm import tqdm

list_useful_columns = [
    "Identifiant du compteur", "Nom du compteur",
    "Identifiant du site de comptage", "Nom du site de comptage",
    "Comptage horaire", "Date et heure de comptage"
]


# GENERIC FUNCTION TO RETRIEVE DATA -------------------------

def download_unzip_and_read(url, extract_to='.', list_useful_columns=list_useful_columns):
    """
    Downloads a zip file from the specified URL, extracts its contents,
    and reads the CSV file based on the filename pattern in the URL.

    Parameters:
    - url (str): The URL of the zip file to download.
    - extract_to (str): The directory where the contents of the zip file should be extracted.

    Returns:
    - df (DataFrame): The loaded pandas DataFrame from the extracted CSV file.
    """
    try:
        # Extract the file pattern from the URL (filename without the extension)
        file_pattern = url.split('/')[-1].replace('_zip/', '')

        # Send a GET request to the specified URL to download the file
        response = requests.get(url)
        response.raise_for_status()  # Ensure we get a successful response

        # Create a ZipFile object from the downloaded bytes
        with zipfile.ZipFile(io.BytesIO(response.content)) as z:
            # Extract all the contents to the specified directory
            z.extractall(path=extract_to)
            print(f"Extracted all files to {os.path.abspath(extract_to)}")

        dir_extract_to = Path(extract_to)
        # dir_extract_to = Path(f"./{file_pattern}/")

        # Look for the file matching the pattern
        csv_filename = [
            f.name for f in dir_extract_to.iterdir() if f.suffix == '.csv'
        ]
        if not csv_filename:
            print(f"No file matching pattern '{file_pattern}' found.")
            return None

        # Read the first matching CSV file into a pandas DataFrame
        csv_path = os.path.join(dir_extract_to.name, csv_filename[0])
        print(f"Reading file: {csv_path}")
        df = pd.read_csv(csv_path, sep=";")

        # CONVERT TO GEOPANDAS
        df[['latitude', 'longitude']] = df['Coordonnées géographiques'].str.split(',', expand=True)
        df['latitude'] = pd.to_numeric(df['latitude'])
        df['longitude'] = pd.to_numeric(df['longitude'])
        gdf = gpd.GeoDataFrame(
            df, geometry=gpd.points_from_xy(df.longitude, df.latitude)
        )

        # CONVERT TO TIMESTAMP
        df["Date et heure de comptage"] = (
            df["Date et heure de comptage"]
            .astype(str)
            .str.replace(r'\+.*', '', regex=True)
        )
        df["Date et heure de comptage"] = pd.to_datetime(
            df["Date et heure de comptage"],
            format="%Y-%m-%dT%H:%M:%S",
            errors="coerce"
        )
        gdf = df.loc[:, list_useful_columns]
        return gdf

    except requests.exceptions.RequestException as e:
        print(f"Error: The downloaded file has not been found: {e}")
        return None
    except zipfile.BadZipFile as e:
        print(f"Error: The downloaded file is not a valid zip file: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


def read_historical_bike_data(year):
    dataset = "comptage_velo_donnees_compteurs"
    url_comptage = f"https://opendata.paris.fr/api/datasets/1.0/comptage-velo-historique-donnees-compteurs/attachments/{year}_{dataset}_csv_zip/"
    df_comptage = download_unzip_and_read(
        url_comptage, extract_to=f'./extracted_files_{year}'
    )
    if df_comptage is None:
        url_comptage_alternative = url_comptage.replace("_csv_zip", "_zip")
        df_comptage = download_unzip_and_read(
            url_comptage_alternative, extract_to=f'./extracted_files_{year}'
        )
    return df_comptage


# IMPORT HISTORICAL DATA -----------------------------
historical_bike_data = pd.concat(
    [read_historical_bike_data(year) for year in range(2018, 2024)]
)

rename_columns_dict = {
    "Identifiant du compteur": "id_compteur",
    "Nom du compteur": "nom_compteur",
    "Identifiant du site de comptage": "id",
    "Nom du site de comptage": "nom_site",
    "Comptage horaire": "sum_counts",
    "Date et heure de comptage": "date"
}
historical_bike_data = historical_bike_data.rename(columns=rename_columns_dict)

# IMPORT LATEST MONTHS ----------------
url = 'https://opendata.paris.fr/api/explore/v2.1/catalog/datasets/comptage-velo-donnees-compteurs/exports/parquet?lang=fr&timezone=Europe%2FParis'
filename = 'comptage_velo_donnees_compteurs.parquet'

# DOWNLOAD FILE --------------------------------
if not os.path.exists(filename):
    # Perform the HTTP request and stream the download
    response = requests.get(url, stream=True)
    # Check if the request was successful
    if response.status_code == 200:
        # Get the total size of the file from the headers
        total_size = int(response.headers.get('content-length', 0))
        # Open the file in write-binary mode and use tqdm to show progress
        with open(filename, 'wb') as file, tqdm(
            desc=filename,
            total=total_size,
            unit='B',
            unit_scale=True,
            unit_divisor=1024,
        ) as bar:
            # Write the file in chunks
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive chunks
                    file.write(chunk)
                    bar.update(len(chunk))
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"The file '{filename}' already exists.")

# READ FILE AND CONVERT TO PANDAS --------------------------
query = """
SELECT id_compteur, nom_compteur, id, sum_counts, date
FROM read_parquet('comptage_velo_donnees_compteurs.parquet')
"""

# READ WITH DUCKDB AND CONVERT TO PANDAS
df = duckdb.sql(query).df()

# PUT THEM TOGETHER ----------------------------
historical_bike_data['date'] = (
    historical_bike_data['date']
    .dt.tz_localize(None)
)
df["date"] = df["date"].dt.tz_localize(None)

historical_bike_data = (
    historical_bike_data
    .loc[historical_bike_data["date"] < df["date"].min()]
)

df = pd.concat([historical_bike_data, df])
To begin, let us reproduce the third figure, which is, once again, a barplot. Here, from a semiological perspective, it is not justified to use a barplot; a simple time series would suffice to provide similar information.
The first question in the next exercise involves an initial encounter with temporal data through a fairly common time series operation: changing the format of a date to allow aggregation at a broader time step.
Exercise 5: Monthly Counts Barplot
1. Create a month variable whose format follows, for example, the 2019-08 scheme, using the correct option of the dt.to_period method (see the sketch below).
2. Apply the previous tips to gradually build and improve a graph, obtaining a figure similar to the 3rd production on the Parisian open data page. Create this figure first from early 2022 onwards and then over the entire period of our history.
3. Optional question: represent the same information in the form of a lollipop chart.
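A sketch of the date manipulation for question 1, assuming the date and sum_counts columns produced by the import code:

# Truncate the timestamp to a monthly period, e.g. 2019-08
df['month'] = df['date'].dt.to_period('M')

# Aggregate the counts at the monthly level
monthly = df.groupby('month', as_index=False)['sum_counts'].sum()
monthly['month'] = monthly['month'].astype(str)  # plain strings are easier to plot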
The figure with data from early 2022 will look like this if it was created with plotnine:
With seaborn, it will look more like this:
If you prefer to represent this as a lollipop chart:
Finally, over the entire period, the series will look more like this:
6 First Time Series
It is more common to represent data with a temporal dimension as a series rather than stacked bars.
Exercise 6: A First Time Series
1. Create a day variable that converts the timestamp into a daily format such as 2021-05-01, using dt.date (note that dt.day would return only the day-of-month number). See the sketch below.
2. Reproduce the figure from the open data page.
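A sketch of this aggregation, with the same assumed columns as before:

# Daily format, e.g. 2021-05-01
df['day'] = df['date'].dt.date

# Sum the counts per day and draw the series
daily = df.groupby('day')['sum_counts'].sum()
daily.plot(figsize=(10, 4))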
7 Reactive Charts with Javascript Libraries
7.1 The Ecosystem Available from Python
Static figures created with matplotlib or plotnine are fixed and thus have the disadvantage of not allowing interaction with the viewer. All the information must be contained in the figure, which can make it difficult to read. If the figure is well-made with multiple levels of information, it can still work well.
However, thanks to web technologies, it is simpler to offer visualizations with multiple levels. A first level of information, the quick glance, may be enough to grasp the main messages of the visualization. Then, a more deliberate behavior of seeking secondary information can provide further insights. Reactive visualizations, now the standard in the dataviz world, allow for this approach: the viewer can hover over the visualization to find additional information (e.g., exact values) or click to display complementary details.
These visualizations rely on the same triptych as the entire web ecosystem: HTML, CSS, and JavaScript. Python users will not directly manipulate these languages, which require a certain level of expertise. Instead, they use libraries that automatically generate all the necessary HTML, CSS, and JavaScript code to create the figure.
Several Javascript ecosystems are made available to developers through Python. The two main libraries are Plotly, associated with the Javascript ecosystem of the same name, and Altair, associated with the Vega and Altair ecosystems in Javascript3. To allow Python users to explore the emerging Javascript library Observable Plot, French research engineer Julien Barnier developed pyobsplot, a Python library enabling the use of this ecosystem from Python.
Interactivity should not just be a gimmick that adds no readability or even worsens it. It is rare to rely solely on the figure as produced without further work to make it effective.
7.1.1 The Plotly Library
The Plotly package is a wrapper for the Javascript library Plotly.js, allowing for the creation and manipulation of graphical objects very flexibly to produce interactive objects without the need for Javascript.
The recommended entry point is the plotly.express module (documentation here), which provides an intuitive approach for creating charts that can be modified post hoc if needed (e.g., to customize axes).
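For instance, a bar chart can be declared in a single call and adjusted afterwards. This sketch assumes the df1 dataframe used earlier; the column names are those of this chapter's data:

import plotly.express as px

fig = px.bar(
    df1,
    x='sum_counts', y='nom_compteur',
    orientation='h',
    template='plotly_white',          # white-background theme
    color_discrete_sequence=['red']   # single-color bars
)
fig.update_layout(title='The 10 counters with the highest hourly average')
fig.show()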
Displaying Figures Created with Plotly
In a standard Jupyter notebook, the following lines of code allow the output of a Plotly command to be displayed under a code block:
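A commonly used snippet relies on the offline mode (recent versions of plotly often render inline without it, so treat this as a fallback):

from plotly.offline import init_notebook_mode

# Enable inline rendering of Plotly figures in the notebook
init_notebook_mode(connected=True)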
For JupyterLab, the jupyterlab-plotly extension is required:
!jupyter labextension install jupyterlab-plotly
7.2 Replicating the Previous Example with Plotly
The following modules will be required to create charts with plotly:
import plotly
import plotly.express as px
Exercise 7: A Barplot with Plotly
The goal is to recreate the first red bar chart using Plotly.
1. Create the chart using the appropriate function from plotly.express and:
- do not use the default theme but one with a white background, to achieve a result similar to that on the open data site;
- use the color_discrete_sequence argument for the red color;
- remember to label the axes;
- consider the text color for the lower axis.
2. Try another theme with a dark background. For the colors, group the three highest values together and separate the others.
The first question allows the creation of the following chart:
Whereas with the dark theme (question 2), we get:
7.3 The altair Library
For this example, we will recreate our previous figure.
Like ggplot/plotnine, Vega is a graphics ecosystem designed to implement the grammar of graphics from Wilkinson (2012). The syntax of Vega is therefore based on a declarative principle: a construction is declared through layers and progressive data transformations.
Originally, Vega was based on a JSON syntax, hence its strong connection to Javascript. However, there is a Python API that allows for creating these types of interactive figures natively in Python. To understand the logic of constructing an altair code, here is how to replicate the previous figure:
View the architecture of an Altair figure
import altair as alt

color_scale = alt.Scale(domain=[True, False], range=['green', 'red'])

fig2 = (
    alt.Chart(df1)    # (1)
    .mark_bar()       # (1)
    .encode(          # (2)
        x=alt.X(      # (3)
            'average(sum_counts):Q',
            title='Average count per hour over the selected period'
        ),
        y=alt.Y('nom_compteur:N', sort='-x', title=''),
        color=alt.Color('top:N', scale=color_scale, legend=alt.Legend(title="Top")),
        tooltip=[     # (4)
            alt.Tooltip('nom_compteur:N', title='Counter Name'),
            alt.Tooltip('sum_counts:Q', title='Hourly Average')
        ]
    )
    .properties(      # (5)
        title='The 10 counters with the highest hourly average'
    )
    .configure_view(
        strokeOpacity=0
    )
)

fig2.interactive()
(1) First, the dataframe to be used is declared, similar to ggplot(df) in plotnine. Then, the desired chart type is specified (in this case, a bar chart, mark_bar in Altair's grammar).
(2) The main layer is defined using encode. This can accept either simple column names or more complex constructors, as shown here.
(3) A constructor is defined for the x-axis, both to manage value scaling and its parameters (e.g., labels). Here, the x-axis is defined as a continuous value (:Q), the average of sum_counts for each y value. This average is not strictly necessary; we could have used sum_counts:Q or even sum_counts, but this illustrates data transformations in altair.
(4) The tooltip adds interactivity to the chart.
(5) Properties are specified at the end of the declaration to finalize the chart.
Additional information

The environment this chapter has been tested on is listed below.

Python version used:
affine==2.4.0
aiobotocore==2.21.1
aiohappyeyeballs==2.6.1
aiohttp==3.11.13
aioitertools==0.12.0
aiosignal==1.3.2
alembic==1.13.3
altair==5.4.1
aniso8601==9.0.1
annotated-types==0.7.0
anyio==4.8.0
appdirs==1.4.4
archspec==0.2.3
asttokens==2.4.1
attrs==25.3.0
babel==2.17.0
bcrypt==4.2.0
beautifulsoup4==4.12.3
black==24.8.0
blinker==1.8.2
blis==1.2.0
bokeh==3.5.2
boltons==24.0.0
boto3==1.37.1
botocore==1.37.1
branca==0.7.2
Brotli==1.1.0
bs4==0.0.2
cachetools==5.5.0
cartiflette==0.0.2
Cartopy==0.24.1
catalogue==2.0.10
cattrs==24.1.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
chromedriver-autoinstaller==0.6.4
click==8.1.8
click-plugins==1.1.1
cligj==0.7.2
cloudpathlib==0.21.0
cloudpickle==3.0.0
colorama==0.4.6
comm==0.2.2
commonmark==0.9.1
conda==24.9.1
conda-libmamba-solver==24.7.0
conda-package-handling==2.3.0
conda_package_streaming==0.10.0
confection==0.1.5
contextily==1.6.2
contourpy==1.3.1
cryptography==43.0.1
cycler==0.12.1
cymem==2.0.11
cytoolz==1.0.0
dask==2024.9.1
dask-expr==1.1.15
databricks-sdk==0.33.0
dataclasses-json==0.6.7
debugpy==1.8.6
decorator==5.1.1
Deprecated==1.2.14
diskcache==5.6.3
distributed==2024.9.1
distro==1.9.0
docker==7.1.0
duckdb==1.2.1
en_core_web_sm==3.8.0
entrypoints==0.4
et_xmlfile==2.0.0
exceptiongroup==1.2.2
executing==2.1.0
fastexcel==0.11.6
fastjsonschema==2.21.1
fiona==1.10.1
Flask==3.0.3
folium==0.17.0
fontawesomefree==6.6.0
fonttools==4.56.0
fr_core_news_sm==3.8.0
frozendict==2.4.4
frozenlist==1.5.0
fsspec==2023.12.2
geographiclib==2.0
geopandas==1.0.1
geoplot==0.5.1
geopy==2.4.1
gitdb==4.0.11
GitPython==3.1.43
google-auth==2.35.0
graphene==3.3
graphql-core==3.2.4
graphql-relay==3.2.0
graphviz==0.20.3
great-tables==0.12.0
greenlet==3.1.1
gunicorn==22.0.0
h11==0.14.0
h2==4.1.0
hpack==4.0.0
htmltools==0.6.0
httpcore==1.0.7
httpx==0.28.1
httpx-sse==0.4.0
hyperframe==6.0.1
idna==3.10
imageio==2.37.0
importlib_metadata==8.6.1
importlib_resources==6.5.2
inflate64==1.0.1
ipykernel==6.29.5
ipython==8.28.0
itsdangerous==2.2.0
jedi==0.19.1
Jinja2==3.1.6
jmespath==1.0.1
joblib==1.4.2
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-cache==1.0.0
jupyter_client==8.6.3
jupyter_core==5.7.2
kaleido==0.2.1
kiwisolver==1.4.8
langchain==0.3.20
langchain-community==0.3.9
langchain-core==0.3.45
langchain-text-splitters==0.3.6
langcodes==3.5.0
langsmith==0.1.147
language_data==1.3.0
lazy_loader==0.4
libmambapy==1.5.9
locket==1.0.0
loguru==0.7.3
lxml==5.3.1
lz4==4.3.3
Mako==1.3.5
mamba==1.5.9
mapclassify==2.8.1
marisa-trie==1.2.1
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==3.0.2
marshmallow==3.26.1
matplotlib==3.10.1
matplotlib-inline==0.1.7
mdurl==0.1.2
menuinst==2.1.2
mercantile==1.2.1
mizani==0.11.4
mlflow==2.16.2
mlflow-skinny==2.16.2
msgpack==1.1.0
multidict==6.1.0
multivolumefile==0.2.3
munkres==1.1.4
murmurhash==1.0.12
mypy-extensions==1.0.0
narwhals==1.30.0
nbclient==0.10.0
nbformat==5.10.4
nest_asyncio==1.6.0
networkx==3.4.2
nltk==3.9.1
numpy==2.2.3
opencv-python-headless==4.10.0.84
openpyxl==3.1.5
opentelemetry-api==1.16.0
opentelemetry-sdk==1.16.0
opentelemetry-semantic-conventions==0.37b0
orjson==3.10.15
outcome==1.3.0.post0
OWSLib==0.28.1
packaging==24.2
pandas==2.2.3
paramiko==3.5.0
parso==0.8.4
partd==1.4.2
pathspec==0.12.1
patsy==1.0.1
Pebble==5.1.0
pexpect==4.9.0
pickleshare==0.7.5
pillow==11.1.0
pip==24.2
platformdirs==4.3.6
plotly==5.24.1
plotnine==0.13.6
pluggy==1.5.0
polars==1.8.2
preshed==3.0.9
prometheus_client==0.21.0
prometheus_flask_exporter==0.23.1
prompt_toolkit==3.0.48
propcache==0.3.0
protobuf==4.25.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py7zr==0.20.8
pyarrow==17.0.0
pyarrow-hotfix==0.6
pyasn1==0.6.1
pyasn1_modules==0.4.1
pybcj==1.0.3
pycosat==0.6.6
pycparser==2.22
pycryptodomex==3.21.0
pydantic==2.10.6
pydantic_core==2.27.2
pydantic-settings==2.8.1
Pygments==2.19.1
PyNaCl==1.5.0
pynsee==0.1.8
pyogrio==0.10.0
pyOpenSSL==24.2.1
pyparsing==3.2.1
pyppmd==1.1.1
pyproj==3.7.1
pyshp==2.3.1
PySocks==1.7.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-magic==0.4.27
pytz==2025.1
pyu2f==0.1.5
pywaffle==1.1.1
PyYAML==6.0.2
pyzmq==26.3.0
pyzstd==0.16.2
querystring_parser==1.2.4
rasterio==1.4.3
referencing==0.36.2
regex==2024.9.11
requests==2.32.3
requests-cache==1.2.1
requests-toolbelt==1.0.0
retrying==1.3.4
rich==13.9.4
rpds-py==0.23.1
rsa==4.9
rtree==1.4.0
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
s3fs==2023.12.2
s3transfer==0.11.3
scikit-image==0.24.0
scikit-learn==1.6.1
scipy==1.13.0
seaborn==0.13.2
selenium==4.29.0
setuptools==76.0.0
shapely==2.0.7
shellingham==1.5.4
six==1.17.0
smart-open==7.1.0
smmap==5.0.0
sniffio==1.3.1
sortedcontainers==2.4.0
soupsieve==2.5
spacy==3.8.4
spacy-legacy==3.0.12
spacy-loggers==1.0.5
SQLAlchemy==2.0.39
sqlparse==0.5.1
srsly==2.5.1
stack-data==0.6.2
statsmodels==0.14.4
tabulate==0.9.0
tblib==3.0.0
tenacity==9.0.0
texttable==1.7.0
thinc==8.3.4
threadpoolctl==3.6.0
tifffile==2025.3.13
toolz==1.0.0
topojson==1.9
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
trio==0.29.0
trio-websocket==0.12.2
truststore==0.9.2
typer==0.15.2
typing_extensions==4.12.2
typing-inspect==0.9.0
tzdata==2025.1
Unidecode==1.3.8
url-normalize==1.4.3
urllib3==1.26.20
uv==0.6.8
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
webdriver-manager==4.0.2
websocket-client==1.8.0
Werkzeug==3.0.4
wheel==0.44.0
wordcloud==1.9.3
wrapt==1.17.2
wsproto==1.2.0
xgboost==2.1.1
xlrd==2.0.1
xyzservices==2025.1.0
yarl==1.18.3
yellowbrick==1.5
zict==3.0.0
zipp==3.21.0
zstandard==0.23.0
Wilkinson, Leland. 2012. The Grammar of Graphics. Springer.
Footnotes
1. This chapter will be built around the Quarto ecosystem. In the meantime, you can consult the excellent documentation of this ecosystem and practice, which is the best way to learn.
2. Thankfully, with a vast amount of online code using matplotlib, code assistants like ChatGPT or GitHub Copilot are invaluable for creating charts based on instructions.
3. The names of these libraries are inspired by the Summer Triangle constellation, of which Vega and Altair are two members.
Citation
BibTeX citation:
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}