Introduction

This introduction presents the course objectives, the pedagogical approach, the main theme of the course, as well as the practical details.
Author

Lino Galiana

Published

2025-03-19

1 Introduction

Important

This course gathers all the content of the Python for Data Science course that I have been teaching at ENSAE since 2018. The course was previously taught by Xavier Dupré. About 170 students take it each year. In 2024, a gradual rollout of an English version, equivalent to the French one, began; it is intended to serve as an introductory data science course for European statistical institutes, following a European call for projects.

This site (pythonds.linogaliana.fr/) is the main entry point for the course. It centralizes all the content created during the course for practical work or provided additionally for continuing education purposes. This course is open source and I welcome suggestions for improvement on GitHub or through the comments at the bottom of each page. As Python is a living and dynamic language, practices evolve, and this course continuously adapts to the changing ecosystem of data science, while trying to distinguish lasting shifts in practice from passing trends.

Additional elements are available in the introductory slides. More advanced elements are present in another course dedicated to deploying data science projects that I teach with Romain Avouac in the final year of ENSAE (ensae-reproductibilite.github.io/website).

Website Architecture

This course features tutorials and complete exercises. Each page is structured around a concrete problem and presents the generic approach to solving it.

You can navigate the site architecture via the table of contents or through the links to previous or subsequent content at the end of each page. Some sections, notably the one dedicated to modeling, offer extended examples to illustrate the approach in more detail.

Python, with its recognizable logo of two intertwined snakes, is a language that has been around for over thirty years but experienced a renaissance during the 2010s due to the surge of interest in data science.

Python, more than any other programming language, brings together diverse communities: statisticians, developers, application and IT infrastructure managers, high school students (Python has been part of the French baccalaureate program for several years), and researchers in both theoretical and applied fields.

Unlike many programming languages with fairly homogeneous communities, Python has managed to bring together a wide range of users thanks to a few central principles: the readability of the language, the ease of using modules, the simplicity of integrating it with higher-performance languages for specific tasks, and the vast amount of documentation available online. Being the second best language for performing a given task can thus be a source of success when competitors do not have a similarly broad range of advantages.
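To give a flavor of this readability, here is a minimal, self-contained sketch; the city names and population figures are purely illustrative. Filtering and summarizing a small dataset reads almost like plain English:

```python
# Purely illustrative data: a few cities and (approximate, invented) populations
cities = [
    {"name": "Paris", "population": 2_100_000},
    {"name": "Lyon", "population": 520_000},
    {"name": "Toulouse", "population": 490_000},
]

# Keep cities above 500,000 inhabitants and sort their names alphabetically
large_cities = sorted(
    c["name"] for c in cities if c["population"] > 500_000
)
print(large_cities)  # ['Lyon', 'Paris']
```

The list comprehension syntax, readable even to newcomers, is one small example of the design choices that lowered Python's barrier to entry.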

The success of Python, due to its nature as a Swiss Army knife language, is inseparable from the emergence of the data scientist profile, a role capable of contributing at every stage of extracting value from data. Davenport and Patil (2012), in the Harvard Business Review, called it the “sexiest job of the 21st century” and, ten years later, provided in the same review a comprehensive overview of the evolving skills expected of a data scientist (Davenport and Patil 2022). It is not only data scientists who are expected to use Python; within the ecosystem of data-related jobs (data scientist, data engineer, ML engineer…), Python serves as a tower of Babel enabling communication between these interdependent profiles.

The richness of Python allows it to be used in all phases of data processing, from retrieving and structuring data from various sources to extracting value from it. Through the lens of data science, we will see that Python is an excellent candidate to assist data scientists in all aspects of data work.

This course introduces various tools that allow for the connection of data and theories using Python. However, this course goes beyond a simple introduction to the language and provides more advanced elements, especially on the latest innovations enabled by data science in work methods.

2 Why Use Python for Data Analysis?

Python first became known in the data science world for providing, early on, the tools needed to train machine learning algorithms on various types of data. Indeed, the success of Scikit-Learn1, TensorFlow2, and more recently PyTorch3 in the data science community contributed greatly to Python's adoption. However, reducing Python to a few machine learning libraries would be limiting, as it is truly a Swiss Army knife for data scientists, social scientists, and economists. Python's success story is not just about having provided machine learning libraries at an opportune time: the language has real advantages for new data practitioners.

The appeal of Python is its central role in a larger ecosystem of powerful, flexible, and open-source tools. Like R, it belongs to the class of languages that can be used daily for a wide variety of tasks. In many areas explored in this course, Python is, by far, the programming language offering the most complete and accessible ecosystem.

Beyond machine learning, which we have already discussed, Python is indispensable when it comes to retrieving data via APIs or web scraping4, two approaches that we will explore in the first part of the course. In the fields of tabular data analysis5, web content publishing, and graphic production, Python presents an ecosystem increasingly similar to R's, thanks to the growing investment in the Python community by Posit, the company behind RStudio and the major R libraries for data science.
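As a foretaste of the web scraping part of the course, here is a minimal sketch using BeautifulSoup. The HTML snippet is invented for the example (a real use case would first download a page, for instance with the requests library), but the parsing logic is the same:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML table standing in for a downloaded web page
html = """
<table>
  <tr><th>Station</th><th>Bikes</th></tr>
  <tr><td>Bastille</td><td>12</td></tr>
  <tr><td>Opera</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")[1:]  # skip the header row
stations = {
    row.find_all("td")[0].text: int(row.find_all("td")[1].text)
    for row in rows
}
print(stations)  # {'Bastille': 12, 'Opera': 5}
```

A few lines turn raw HTML into a structured Python object ready for analysis, which is precisely what makes Python attractive for data retrieval.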

Nevertheless, these elements are not meant to engage in the sterile R vs Python debate. These two languages have many more points of convergence than divergence, making it very simple to transpose good practices from one language to the other. This point is discussed more extensively in the advanced course I teach with Romain Avouac in the final year at ENSAE: ensae-reproductibilite.github.io/website.

Ultimately, data scientists and researchers in social sciences or economics will use R or Python almost interchangeably and alternately. This course will regularly present analogies with R to help those discovering Python who are already familiar with R to better understand certain points.

3 Course Objectives

3.1 Introducing the Approach to Data Science

This course is aimed at practitioners of data science, understood here broadly as the combination of techniques from mathematics, statistics, and computer science to produce useful knowledge from data. Data science is not only a scientific discipline; it also aims to provide a set of tools to meet operational objectives. Learning its main tool, the Python language, is therefore also an opportunity to discuss the rigorous scientific approach to adopt when working with data. This course presents the approach to handling a dataset: the problems encountered, the solutions to overcome them, and the implications of those solutions. It is thus not just a course on a technical tool, detached from scientific issues.

Is a Mathematical Background Required for This Course?

This course assumes a desire to use Python intensively for data analysis within a rigorous statistical framework. It only briefly touches on the statistical or algorithmic foundations behind some of the techniques discussed, which are often the subject of dedicated teachings, particularly at ENSAE.

Not knowing these concepts does not prevent understanding the content of this website, as more advanced concepts are generally presented separately in dedicated boxes. The ease of using Python avoids the need to program a model oneself, which makes it possible to apply models without being an expert. Knowledge of models will be more necessary for interpreting results.

However, even though it is relatively easy to use complex models with Python, it is very useful to have some background on them before embarking on a modeling approach. This is one of the reasons why modeling comes later in this course: in addition to involving advanced statistical concepts, it is necessary to have understood the stylized facts in our data to produce relevant modeling. A thorough understanding of data structure and its alignment with model assumptions is essential for building high-quality models.
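To make the point concrete, here is a minimal sketch showing how little code applying a model requires with Scikit-Learn; the synthetic data and parameters are purely illustrative. The few lines below hide strong assumptions (a linear decision boundary, independent observations) that only statistical background lets you assess:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two Gaussian clouds of points, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fitting takes one line; understanding what the model assumes is the hard part
model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # in-sample accuracy, high on such separable data
```

The ease of this API is exactly why this course insists on understanding data and model assumptions before reaching for `fit`.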

3.2 Reproducibility

This course places a central emphasis on the concept of reproducibility. This requirement is reflected in various ways throughout this teaching, primarily by ensuring that all examples and exercises in this course can be tested using Jupyter notebooks6.

The entire content of the website is reproducible in various computing environments. It is, of course, possible to copy and paste the code snippets present on this site, using the button above the code examples:

x = "Try to copy-paste me"

Click on the button to copy this content and paste it elsewhere.

However, since this site presents many examples, the back-and-forth between a Python testing environment and this site could be cumbersome. Each chapter is therefore easily retrievable as a Jupyter notebook via buttons at the beginning of each page. For example, here are those buttons for the Numpy tutorial:

View on GitHub · Onyxia · Open In Colab

Recommendations regarding the preferred environments for using these notebooks are deferred to the next chapter.

The requirement for reproducibility is also evident in the choice of examples used in this course. All content on this site relies on open data, whether it is French data (mainly from the centralizing platform data.gouv or the website of Insee) or American data. Results are therefore reproducible for someone with an identical environment7.

Note

American researchers have discussed a reproducibility crisis in the field of machine learning (Kapoor and Narayanan 2022). Issues with the scientific publishing ecosystem and the economic stakes behind academic publications in the field of machine learning are prominent factors that may explain this.

However, academic teaching also bears a responsibility in this area. Students and researchers are not trained in these topics, and if they do not adopt this requirement early in their careers, they may not be encouraged to do so later. For this reason, in addition to training in Python and data science, this course introduces the use of the version control software Git in a dedicated section. All student projects must be open source, which is one of the best ways for a teacher to ensure that students produce quality code.

3.3 Assessment

ENSAE students validate the course through an in-depth project. Details about the course assessment, as well as a list of previously completed projects, are available in the Assessment section.

4 Course Outline

This course is an introduction to the issues of data science through the learning of the Python language. As the term “data science” suggests, a significant part of this course is dedicated to working with data: retrieval, structuring, exploration, and linking. This is the subject of the first part of the course, “Manipulating Data”, which serves as the foundation for the rest of the course. Unfortunately, many programs in data science, applied statistics, or social and economic sciences overlook this part of the data scientist's work, sometimes referred to as “data wrangling” or “feature engineering”, even though it represents a significant share of the job and is essential for building a relevant model.

The goal of this part is to illustrate the challenges related to retrieving various types of data sources and their exploitation using Python. The examples will be varied to illustrate the richness of the data that can be analyzed with Python: municipal \(CO_2\) emission statistics, real estate transaction data, energy diagnostics of housing, Vélib station attendance data…

The second part is dedicated to producing visualizations with Python. After retrieving and cleaning data, one generally wants to synthesize it through tables, graphics, or maps. This part is a brief introduction to the topic (“Communicating with Python”). Being quite introductory, its goal is mainly to present a few concepts that will be consolidated later.
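As a small preview of that part, here is a minimal matplotlib sketch; the categories and counts are invented for the example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Hypothetical counts, purely illustrative
categories = ["A", "B", "C"]
counts = [3, 7, 5]

fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_title("A minimal bar chart")
ax.set_ylabel("Count")
fig.savefig("example.png")
```

The later chapters build from this kind of basic chart toward more polished graphics and interactive maps.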

The third part is dedicated to modeling, through the example of electoral science (“Modeling with Python”). Its goal is to illustrate the scientific approach of machine learning and the related methodological and technical choices, and to open onto questions that will be discussed in the rest of the university curriculum.

The fourth part of the course takes a step aside to focus on the specific issues of working with textual data, in the chapter “Introduction to Natural Language Processing (NLP) with Python”. As this research field is particularly active, the chapter is only an introduction to the subject. For further reading, refer to Russell and Norvig (2020), chapter 24.

References

Davenport, Thomas H, and DJ Patil. 2012. “Data Scientist, the Sexiest Job of the 21st Century.” Harvard Business Review 90 (5): 70–76. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.
———. 2022. “Is Data Scientist Still the Sexiest Job of the 21st Century?” Harvard Business Review 90.
Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-Based Science.” arXiv. https://doi.org/10.48550/ARXIV.2207.07048.
Russell, Stuart J., and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th Edition). Pearson. http://aima.cs.berkeley.edu/.

Additional information

The environment this website has been built and tested on is described below.

Latest built version: 2025-03-19

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0

View file history

SHA Date Author Description
3f1d2f3f 2025-03-15 15:55:59 Lino Galiana Fix problem with uv and malformed files (#599)
388fd975 2025-02-28 17:34:09 Lino Galiana Colab again and again… (#595)
488780a4 2024-09-25 14:32:16 Lino Galiana Change badge (#556)
5d15b063 2024-09-23 15:39:40 lgaliana Handling badges problem
f8b04136 2024-08-28 15:15:04 Lino Galiana Révision complète de la partie introductive (#549)
0908656f 2024-08-20 16:30:39 Lino Galiana English sidebar (#542)
a987feaa 2024-06-23 18:43:06 Lino Galiana Fix broken links (#506)
69a45850 2024-06-12 20:02:14 Antoine Palazzolo correct link (#502)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
16842200 2023-12-02 12:06:40 Antoine Palazzolo Première partie de relecture de fin du cours (#467)
1f23de28 2023-12-01 17:25:36 Lino Galiana Stockage des images sur S3 (#466)
69cf52bd 2023-11-21 16:12:37 Antoine Palazzolo [On-going] Suggestions chapitres modélisation (#452)
a7711832 2023-10-09 11:27:45 Antoine Palazzolo Relecture TD2 par Antoine (#418)
e8d0062d 2023-09-26 15:54:49 Kim A Relecture KA 25/09/2023 (#412)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
6178ebeb 2023-09-26 14:18:34 Lino Galiana Change quarto project type (#409)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
80823022 2023-08-25 17:48:36 Lino Galiana Mise à jour des scripts de construction des notebooks (#395)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
dde3e934 2023-07-21 22:22:05 Lino Galiana Fix bug on chapter order (#385)
2dbf8533 2023-07-05 11:21:40 Lino Galiana Add nice featured images (#368)
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
Back to top

Footnotes

  1. Library developed by the French public research laboratories of INRIA since 2007.↩︎

  2. Library initially used by Google for their internal needs, it was made public in 2015. Although less used now, this library had a significant influence in the 2010s by promoting the use of neural networks in research and operational applications.↩︎

  3. Library developed by Meta since 2018 and affiliated since 2022 with the PyTorch foundation.↩︎

  4. In these two areas, the most serious competitor to Python is Javascript. However, the community around this language is more focused on web development issues than on data science.↩︎

  5. Tabular data are structured data, organized, as their name indicates, in a table format that allows matching observations with variables. This structuring differs from other types of more complex data: free texts, images, sounds, videos… In the domain of unstructured data, Python is the hegemonic language for analysis. In the domain of tabular data, Python’s competitive advantage is less pronounced, particularly compared to R, but these two languages offer a core set of fairly similar functionalities. We will regularly draw parallels between these two languages in the chapters dedicated to the Pandas library.↩︎

  6. A notebook is an interactive environment that allows you to write and run code live. It combines, in a single document, text and executable code whose outputs are displayed after computation. This is extremely convenient for learning the Python language. For more details, see the official Jupyter documentation.↩︎

  7. Opening chapters as notebooks in standardized environments, as will be proposed starting from the next chapter, ensures that you have a controlled environment. Personal installations of Python are likely to have undergone modifications that can alter your environment and cause unexpected and hard-to-understand errors: this is not a recommended use for this course. As you will discover in the next chapter, cloud environments offer comfort regarding environment standardization.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.