Part 2: Communicating with data

Data scientists need to be able to synthesize the information contained in a dataset through graphical representation, because the human brain grasps information better through figures than through tables. Data visualization matters both as part of an exploratory approach, to understand the structure of the phenomena under study, and as part of the phase of communicating results to audiences who don't necessarily have access to the raw data and must make do with summaries. This part of the course is an introduction to this vast subject through the practical construction of descriptive graphs and maps.

Introduction
Visualisation

Author: Lino Galiana
Published: 2025-03-19

1 Introduction

An essential part of a data scientist's work is to synthesize the information contained in their datasets in order to distinguish the signal, which deserves attention, from the noise inherent in any dataset. During an exploratory phase, there is a constant back-and-forth between synthesized information and disaggregated datasets. It is therefore essential to know how to synthesize the information in a dataset in order to grasp its structure, which can then guide further analyses, whether for a modeling phase or for data correction (anomaly detection or fixing faulty data retrieval).

We have already explored a key part of this work, namely the construction of relevant and reliable descriptive statistics. However, if we were content to present information using raw outputs of the groupby and agg combo on a Pandas DataFrame, our understanding of the data would remain quite limited. Building stylized tables with great_tables was already a step forward in this process but, in truth, our brain processes information much more intuitively through simple graphical visualizations than through a table.
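To make this concrete, here is a minimal sketch (the dataset is made up for illustration) contrasting the raw output of the groupby and agg combo with the same information rendered as a simple bar chart:

```python
import pandas as pd

# Made-up data standing in for any real dataset
df = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C", "C"],
    "sales": [10, 12, 30, 28, 5, 7],
})

# Raw output of the groupby/agg combo: accurate, but hard to scan
summary = df.groupby("region").agg(total_sales=("sales", "sum"))
print(summary)

# The same information as a bar chart: grasped at a glance
summary.plot(kind="bar")
```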

1.1 Data visualization, an essential part of communication work

As humans, our cognitive capacities are limited: we can only absorb so much information at once, whereas computers are capable of processing large volumes of it. For a data scientist, this means that using computational and statistical skills to obtain synthetic representations of our many datasets is essential to meet operational or scientific needs. The range of methods and tools that make up the data scientist's toolbox aims to simplify the understanding and subsequent exploitation of datasets whose volume exceeds our cognitive capacities.

This brings us to data visualization: a set of tools and principles for representing stylized facts or contextualizing individual data points in a synthetic manner. Data visualization is the art and science of visually representing complex and abstract information through visual elements. Its primary goal is to synthesize the information contained in a dataset to make its key features easier to understand for further analysis. Among other things, it makes it possible to highlight trends, correlations, or anomalies that would be difficult, or even impossible, to grasp from raw data alone, which needs some context to make sense of.

Data visualization plays a crucial role in the data analysis process by providing visual means to explore, interpret, and communicate information. It facilitates communication between data experts, decision-makers, and the general public, enabling the latter to benefit from the rigorous work of the former to make sense of the data without the need for deep conceptual knowledge that underpins the synthesized information.

1.2 The role of visualization in the data value creation process

Data visualization is not limited to the final phase of a project, the communication of results to an audience that does not have access to the data or the means to exploit it. Visualization plays a role at every stage of the data value creation process. It is an essential part of the transition from a record (a snapshot of a phenomenon) to data: a record that has value because it carries information, on its own or when combined with other records.

The daily work of a data scientist involves examining a dataset from every angle to identify key value extraction opportunities. Quickly knowing what statistics to represent, and how, is crucial for saving time during this exploratory phase. This is primarily a form of self-communication that can afford to be rough around the edges, as the goal is to sketch the work before refining certain aspects. The challenge at this stage of the process is not to overlook any dimension that could potentially bring value.

The truly time-consuming communication work comes when presenting to an audience with limited access to the data, unfamiliar with the sources, with a limited attention span, or without quantitative skills. These audiences cannot be satisfied with raw outputs such as a DataFrame in a notebook or a graph created in seconds with the plot method of Pandas. It is important to adapt to their evolving expectations and to the tools they are familiar with, which explains the growing importance of websites dedicated to data visualizations.

2 Communicating, an opening to data storytelling

Data visualization thus holds a special place among the various techniques of data science. It is involved at all stages of the data production process, from upstream (exploratory analysis) to downstream (presenting results to various audiences), and when well-constructed, it allows us to intuitively grasp the structure of the data or the key issues of its analysis.

As an art of synthesis, data visualization is also the art of storytelling, and when done well, it can even reach the level of artistic production. Data visualization is a profession in its own right, with more and more practitioners found in media outlets or specialized companies (Datawrapper, for example).

Without aiming to create visualizations as sophisticated as those produced by specialists, every data scientist should be able to quickly generate visualizations that synthesize the information in the datasets at hand. A clear and readable visualization, while remaining simple, can be more effective than a speech in conveying a message.

Just like a speech, a visualization is a form of communication in which a speaker (the person constructing the visualization) seeks to convey information to a recipient (potentially the same person as the speaker, since a visualization can be created for oneself during exploratory analysis). It is no surprise that during the period when semiology played a significant role in intellectual debates, especially around the figure of Roland Barthes, the concept of graphic semiology emerged, centered around Jacques Bertin (Bertin 1967; Palsky 2017). This approach offers a framework for assessing the relevance of the techniques used to convey a graphic message, and many visualizations could be improved at little cost by following a few of its rules.

Eric Mauvière, a French statistician and a successor to Bertin's school of graphic semiology, offers excellent content on the subject. Some of his presentations, notably the one for SSPHub presented in Note 2.1, should be viewed in all data science training programs, as they highlight the numerous pitfalls encountered by data scientists.

An example of two visualizations made from the same dataset by Eric Mauvière (see Note 2.1).

3 Communicating, an opening to app development

The goal of this course is to introduce the main tools and the approach that data scientists should adopt when working with various datasets. However, it is becoming increasingly common for data scientists to develop and provide interactive applications offering a range of explorations and automated data visualizations. These are more advanced topics than this course covers, but they often serve as an entry point to data science for audiences close to data scientists, such as data engineers, data analysts, or statisticians.

We will mention some of the preferred tools for doing this, especially ecosystems related to web applications and Javascript tools. This need, now fairly standard for data scientists, bridges the gap with production deployment, which is the main focus of a third-year ENSAE course designed by Romain Avouac and myself (course website ensae-reproductibilite.github.io/). This current website, for example, is built on this principle using tools that allow Python code to be reproducibly executed on standardized servers and then made available through a website.

4 The Python ecosystem

Returning to our course: in this section we will present some basic libraries and visualizations in Python that provide a good starting point. There are plenty of resources for deepening one's command of the art of visualization, such as Wilke (2019).

4.1 Data visualization packages

The Python ecosystem for data visualization is vast and diverse. Entire books could be dedicated to it (Dale 2022). Python offers numerous libraries to quickly and relatively easily produce data visualizations¹.

The graphical libraries are mainly divided into two families:

  • Libraries for static representations. These are primarily intended for integration into fixed publications such as PDFs or text documents. We will mainly present Matplotlib and Seaborn, but others are emerging, such as Plotnine, an adaptation of ggplot2 to the Python ecosystem.
  • Libraries for interactive representations. These are suited to web publication and allow readers to interact with the displayed graphical representation. Libraries offering these features usually rely on JavaScript, the language of the web ecosystem, with Python serving as the entry point. We will primarily discuss Plotly and Folium in this family, but many other frameworks exist in this field². A minimal sketch contrasting the two families follows this list.
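As announced above, here is a minimal sketch (with made-up data) drawing the same line chart once with Matplotlib, which produces a fixed image, and once with Plotly, which produces an HTML figure the reader can hover over and zoom into:

```python
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

df = pd.DataFrame({"year": [2020, 2021, 2022, 2023], "value": [3.2, 4.1, 3.8, 4.5]})

# Static family: a fixed image, suited to PDFs and printed documents
fig, ax = plt.subplots()
ax.plot(df["year"], df["value"])
ax.set_title("A static line chart")
plt.show()

# Interactive family: an HTML/JavaScript figure with hover and zoom
px.line(df, x="year", y="value", title="An interactive line chart").show()
```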

It is entirely possible to create sophisticated visualizations with an end-to-end Python workflow since it is a versatile language with a very rich ecosystem. However, Python is not a cure-all, and sometimes it can be useful to finalize a perfectly polished product with other languages, such as JavaScript for interactive visualizations or QGIS for cartographic work. This course will provide the basic tools to quickly and enjoyably produce work, but as the saying goes, the devil is in the details, so one should not insist on using Python for every task.

In the realm of visualization, this course takes the approach of exploring a few central libraries through a limited number of examples by replicating charts found on the open data website of the city of Paris. The best training for visualization remains practicing on datasets, so it is recommended to explore the richness of the open data ecosystem to experiment with visualizations.

4.2 Visualization applications

This part of the course focuses on simple synthetic representations. It does not (yet?) cover the construction of data visualization applications where a set of graphs update synchronously based on user interactions.

This indeed exceeds the scope of an introductory course, as building these applications requires mastering more complex concepts such as the interaction between a web page and a server, some knowledge of Linux, etc. The concepts necessary to understand these tools are at the heart of the course “Deploying Data Science Projects” that Romain Avouac and I teach in the third year at ENSAE.

Nevertheless, since data value creation in the form of applications is very common, it is useful, at a minimum, to mention the distinction between static sites and dynamic applications to provide the right approach and point to the appropriate tools. In the world of applications, it is important to distinguish between the front (the page visible to the application’s users) and the back office (the engine that performs actions based on parameters chosen by the user on the page).

There are primarily two paradigms for making these two elements interact, and the key difference between them is the kind of server they rely on: a static site runs on a web server, whereas a reactive application such as Streamlit relies on a standard backend server. The two types of servers differ in function and usage:

  • A web server is specifically designed to store, process, and deliver web pages (the front) to clients: HTML, CSS, and JavaScript files, images, etc. Web servers listen for HTTP/HTTPS requests from user browsers and respond by sending the requested data. This does not preclude complex data processing steps or reactivity obtained by embedding JavaScript in the application, but any Python processing is done before the application is made available. For Python users, several static site generators are available, with deployment typically via hosting on GitHub Pages. The two most common ecosystems are Quarto Markdown and Django, the former (a true static site generator) being far simpler to use and maintain than the latter (a full web framework). This site, for example, is built using Quarto, which ensures reproducibility of the presented examples and an ergonomic, customizable formatting of the results.
  • A standard backend server is designed to perform operations in response to a front, in this case a web page. For an application built with Python, this is a server with an appropriate Python environment to execute the code required to respond to any action taken by an application user. The code is executed on demand rather than once and for all, as in the previous approach. This paradigm allows for more application complexity but represents an additional challenge during the deployment phase. In the Python ecosystem, the two main tools for building such applications are Streamlit and Dash, with the former being quicker to implement than the latter; a minimal Streamlit sketch follows this list. More recently, Shiny, the dominant equivalent ecosystem in R, has been adapted for Python by Posit.
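The sketch below illustrates this on-demand execution model with a hypothetical Streamlit app (made-up data, hypothetical file name app.py): moving the slider re-runs the Python code on the backend server and redraws the chart. It would be launched with streamlit run app.py.

```python
# app.py: a minimal Streamlit sketch with made-up data
import numpy as np
import pandas as pd
import streamlit as st

st.title("A minimal reactive application")

# The slider is rendered on the front; moving it triggers Python code on the backend
n = st.slider("Number of points", min_value=10, max_value=200, value=50)

# Re-executed on every interaction, rather than once at build time
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.arange(n), "y": rng.normal(size=n).cumsum()})
st.line_chart(df, x="x", y="y")
```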

The ecosystems presented above for reactive applications are web frameworks. They are distinct from heavy clients such as tkinter, the historical tool for building graphical user interfaces. Beyond the more rudimentary look of tkinter interfaces compared with those of Streamlit, Dash, or Shiny, there are strong reasons to prefer these web frameworks over tkinter.

Tkinter is a heavy client, meaning it is tied to an operating system and requires pre-installation of packages before the interface can run. While it is certainly possible to make it portable, as discussed in the production course, there are many reasons why this approach may lead to errors or unexpected bugs. Web frameworks have the advantage of simplifying this deployment process by separating the front (HTML and CSS pages) from the back (the Python code). They have naturally become more popular, even though many dated online resources still exist for developing applications with tkinter.

When it comes to building applications, the first instinct should be: “Do I need to build a reactive application, or will a static site suffice?” The latter is much easier to implement and has minimal maintenance overhead, making it a rational choice in many cases. If building a static site becomes complex, for example because of sophisticated calculations that would be difficult to implement without JavaScript skills, you can then consider separating the front from the back by delegating the calculations to an API built, for example, with FastAPI; a minimal sketch of this approach follows. This can be a practical way to deploy a machine learning model, as will be discussed in the final chapter of the modeling section. If implementing an API seems too complicated or overkill for the task, then you can turn to a reactive application like Streamlit.
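To fix ideas, here is a minimal FastAPI sketch of that front/back separation, with a hard-coded rule standing in for a real trained model (the file name api.py and the /predict route are hypothetical). A static front page could then call /predict?surface=42 with a simple JavaScript fetch.

```python
# api.py: a minimal FastAPI sketch; run with: uvicorn api:app
from fastapi import FastAPI

app = FastAPI()

@app.get("/predict")
def predict(surface: float):
    # Hypothetical rule standing in for a real trained model
    predicted_price = 3000.0 * surface
    # FastAPI parses and validates ?surface=... from the query string
    return {"surface": surface, "predicted_price": predicted_price}
```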

Again, building an application involves concepts that go beyond an introductory level in Python. However, being aware of the right practices can save significant time by avoiding pitfalls due to poor initial choices.

4.3 Summary of this section

Returning to the content of this section after this aside: it is divided into two parts, and each chapter has a dual nature, depending on whether we focus on static or dynamic representations:

  • First, we will discuss standard graphical representations (histograms, bar charts, etc.) to synthesize quantitative information;
    • Static representations will rely on Pandas, Matplotlib, and Seaborn
    • Reactive charts will be built using Plotly
  • Second, we will present cartographic representations:
    • Static maps created with Geopandas or plotnine
    • Reactive maps using Folium (a Python adaptation of the Leaflet.js library); a minimal Folium sketch follows this list
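As a foretaste of those cartographic chapters, here is a minimal Folium sketch producing an interactive map (the coordinates of Paris are used purely as an example):

```python
import folium

# An interactive map centered on Paris, rendered via Leaflet.js
m = folium.Map(location=[48.8566, 2.3522], zoom_start=12)
folium.Marker([48.8566, 2.3522], popup="Paris").add_to(m)

# In a notebook, displaying `m` renders the map; it can also be exported as HTML
m.save("map.html")
```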

4.4 Useful references

Data visualization is an art that is learned primarily through practice, especially at the beginning. However, it is not always easy to produce readable and ergonomic visualizations, so it is helpful to draw inspiration from examples by specialists (major media outlets offer excellent visualizations).

A few references mentioned in this introduction:

Bertin, Jacques. 1967. Sémiologie graphique. Paris: Mouton/Gauthier-Villars.
Dale, Kyran. 2022. Data Visualization with Python and JavaScript. O’Reilly Media.
Palsky, Gilles. 2017. “La sémiologie graphique de Jacques Bertin a cinquante ans.” Visions carto (en ligne).
Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.


Footnotes

  1. To be honest, for a long time Python was a bit less enjoyable in this regard than R, which benefits from the indispensable ggplot2 library.

    Matplotlib, the main graphical library in Python, is not built on the grammar of graphics and is more cumbersome to use than ggplot2.

    seaborn, which we will present, simplifies graphical representation somewhat but, again, it is difficult to find anything more flexible and universal than ggplot2.

    The plotnine library aims to provide a similar implementation to ggplot for Python users. Its development is worth following.↩︎

  2. In this regard, I highly recommend keeping up with data visualization news on the platform Observable, which tends to bring together the communities of dataviz specialists and data analysts. The library Plot could become a new standard in the coming years, a sort of intermediate between ggplot and d3.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.