Introduction

This introduction presents the course objectives, the pedagogical approach, the main theme of the course, as well as the practical details.
Author

Lino Galiana

Published

2025-03-19

1 Introduction

Important

This course gathers all the content of the Python for Data Science course that I have been teaching at ENSAE since 2018. The course was previously taught by Xavier Dupré. About 170 students take it each year. In 2024, a gradual rollout of an English version, equivalent to the French one, began; it is intended to serve as an introductory data science course for European statistical institutes, following a European call for projects.

This site (pythonds.linogaliana.fr/) is the main entry point for the course. It centralizes all the content created during the course for practical work or provided additionally for continuing education purposes. This course is open source and I welcome suggestions for improvement on GitHub or through the comments at the bottom of each page. As Python is a living and dynamic language, practices evolve, and this course continuously adapts to the changing ecosystem of data science, while trying to distinguish lasting shifts in practice from passing trends.

Additional elements are available in the introductory slides. More advanced elements are present in another course dedicated to deploying data science projects that I teach with Romain Avouac in the final year of ENSAE (ensae-reproductibilite.github.io/website).

Website Architecture

This course features tutorials and complete exercises. Each page is structured around a concrete problem and presents the generic approach to solving it.

You can navigate the site architecture via the table of contents or through the links to previous or subsequent content at the end of each page. Some sections, notably the one dedicated to modeling, offer extended examples to illustrate the approach in more detail.

Python, with its recognizable logo of two intertwined snakes, is a language that has been around for over thirty years but experienced a renaissance during the 2010s due to the surge of interest in data science.

Python, more than any other programming language, brings together diverse communities: statisticians, developers, application and IT infrastructure managers, high school students (Python has been part of the French baccalaureate program for several years), and researchers in both theoretical and applied fields.

Unlike many programming languages with fairly homogeneous communities, Python has managed to bring together a wide range of users thanks to a few central principles: the readability of the language, the ease of using modules, the simplicity of integrating it with higher-performance languages for specific tasks, and the vast amount of documentation available online. Being the second best language for performing a given task can thus be a source of success when competitors do not have a similarly broad range of advantages.
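To give a flavor of this readability, here is a minimal, self-contained sketch; the city names and population figures are purely illustrative. Filtering and summarizing a small dataset reads almost like plain English:

```python
# Purely illustrative data: a few cities and (approximate, invented) populations
cities = [
    {"name": "Paris", "population": 2_100_000},
    {"name": "Lyon", "population": 520_000},
    {"name": "Toulouse", "population": 490_000},
]

# Keep cities above 500,000 inhabitants and sort their names alphabetically
large_cities = sorted(
    c["name"] for c in cities if c["population"] > 500_000
)
print(large_cities)  # ['Lyon', 'Paris']
```

The list comprehension syntax, readable even to newcomers, is one small example of the design choices that lowered Python's barrier to entry.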

The success of Python, due to its nature as a Swiss Army knife language, is inseparable from the emergence of the data scientist profile, a role capable of contributing at every stage of extracting value from data. Davenport and Patil (2012), in the Harvard Business Review, called it the “sexiest job of the 21st century” and, ten years later, provided in the same review a comprehensive overview of the evolving skills expected of a data scientist (Davenport and Patil 2022). It is not only data scientists who are expected to use Python; within the ecosystem of data-related jobs (data scientist, data engineer, ML engineer…), Python serves as a tower of Babel enabling communication between these interdependent profiles.

The richness of Python allows it to be used in all phases of data processing, from retrieving and structuring data from various sources to extracting value from it. Through the lens of data science, we will see that Python is an excellent candidate to assist data scientists in all aspects of data work.

This course introduces various tools that allow for the connection of data and theories using Python. However, this course goes beyond a simple introduction to the language and provides more advanced elements, especially on the latest innovations enabled by data science in work methods.

2 Why Use Python for Data Analysis?

Python first became known in the data science world for providing, early on, the tools needed to train machine learning algorithms on various types of data. Indeed, the success of Scikit-Learn1, TensorFlow2, and more recently PyTorch3 in the data science community contributed greatly to Python's adoption. However, reducing Python to a few machine learning libraries would be limiting, as it is truly a Swiss Army knife for data scientists, social scientists, and economists. Python's success story is not just about having provided machine learning libraries at an opportune time: the language has real advantages for new data practitioners.

The appeal of Python is its central role in a larger ecosystem of powerful, flexible, and open-source tools. Like R, it belongs to the class of languages that can be used daily for a wide variety of tasks. In many areas explored in this course, Python is, by far, the programming language offering the most complete and accessible ecosystem.

Beyond machine learning, which we have already discussed, Python is indispensable when it comes to retrieving data via APIs or web scraping4, two approaches that we will explore in the first part of the course. In the fields of tabular data analysis5, web content publishing, and graphic production, Python presents an ecosystem increasingly similar to R's, thanks to the growing investment in the Python community by Posit, the company behind RStudio and the major R libraries for data science.
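As a foretaste of the web scraping part of the course, here is a minimal sketch using BeautifulSoup. The HTML snippet is invented for the example (a real use case would first download a page, for instance with the requests library), but the parsing logic is the same:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML table standing in for a downloaded web page
html = """
<table>
  <tr><th>Station</th><th>Bikes</th></tr>
  <tr><td>Bastille</td><td>12</td></tr>
  <tr><td>Opera</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")[1:]  # skip the header row
stations = {
    row.find_all("td")[0].text: int(row.find_all("td")[1].text)
    for row in rows
}
print(stations)  # {'Bastille': 12, 'Opera': 5}
```

A few lines turn raw HTML into a structured Python object ready for analysis, which is precisely what makes Python attractive for data retrieval.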

Nevertheless, these elements are not meant to engage in the sterile R vs Python debate. These two languages have many more points of convergence than divergence, making it very simple to transpose good practices from one language to the other. This point is discussed more extensively in the advanced course I teach with Romain Avouac in the final year at ENSAE: ensae-reproductibilite.github.io/website.

Ultimately, data scientists and researchers in social sciences or economics will use R or Python almost interchangeably and alternately. This course will regularly present analogies with R to help those discovering Python who are already familiar with R to better understand certain points.

3 Course Objectives

3.1 Introducing the Approach to Data Science

This course is aimed at practitioners of data science, understood here broadly as the combination of techniques from mathematics, statistics, and computer science to produce useful knowledge from data. Data science is not only a scientific discipline; it also aims to provide a set of tools to meet operational objectives. Learning its main tool, the Python language, is therefore also an opportunity to discuss the rigorous scientific approach to adopt when working with data. This course presents the approach to handling a dataset: the problems encountered, the solutions to overcome them, and the implications of those solutions. It is thus not just a course on a technical tool, detached from scientific issues.

Is a Mathematical Background Required for This Course?

This course assumes a desire to use Python intensively for data analysis within a rigorous statistical framework. It only briefly touches on the statistical or algorithmic foundations behind some of the techniques discussed, which are often the subject of dedicated teachings, particularly at ENSAE.

Not knowing these concepts does not prevent understanding the content of this website, as more advanced concepts are generally presented separately in dedicated boxes. The ease of using Python avoids the need to program a model oneself, which makes it possible to apply models without being an expert. Knowledge of models will be more necessary for interpreting results.

However, even though it is relatively easy to use complex models with Python, it is very useful to have some background on them before embarking on a modeling approach. This is one of the reasons why modeling comes later in this course: in addition to involving advanced statistical concepts, it is necessary to have understood the stylized facts in our data to produce relevant modeling. A thorough understanding of data structure and its alignment with model assumptions is essential for building high-quality models.
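To make the point concrete, here is a minimal sketch showing how little code applying a model requires with Scikit-Learn; the synthetic data and parameters are purely illustrative. The few lines below hide strong assumptions (a linear decision boundary, independent observations) that only statistical background lets you assess:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two Gaussian clouds of points, one per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fitting takes one line; understanding what the model assumes is the hard part
model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # in-sample accuracy, high on such separable data
```

The ease of this API is exactly why this course insists on understanding data and model assumptions before reaching for `fit`.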

3.2 Reproducibility

This course places a central emphasis on the concept of reproducibility. This requirement is reflected in various ways throughout this teaching, primarily by ensuring that all examples and exercises in this course can be tested using Jupyter notebooks6.

The entire content of the website is reproducible in various computing environments. It is, of course, possible to copy and paste the code snippets present on this site, using the button above the code examples:

x = "Try to copy-paste me"

Click on the button to copy this content and paste it elsewhere.

However, since this site presents many examples, the back-and-forth between a Python testing environment and this site could be cumbersome. Each chapter is therefore easily retrievable as a Jupyter notebook via buttons at the beginning of each page. For example, here are those buttons for the Numpy tutorial:

View on GitHub · Onyxia · Open In Colab

Recommendations regarding the preferred environments for using these notebooks are deferred to the next chapter.

The requirement for reproducibility is also evident in the choice of examples used in this course. All content on this site relies on open data, whether it is French data (mainly from the centralizing platform data.gouv or the website of Insee) or American data. Results are therefore reproducible for someone with an identical environment7.

Note

American researchers have discussed a reproducibility crisis in the field of machine learning (Kapoor and Narayanan 2022). Issues with the scientific publishing ecosystem and the economic stakes behind academic publications in the field of machine learning are prominent factors that may explain this.

However, academic teaching also bears a responsibility in this area. Students and researchers are not trained in these topics, and if they do not adopt this requirement early in their careers, they may not be encouraged to do so later. For this reason, in addition to training in Python and data science, this course introduces the use of the version control software Git in a dedicated section. All student projects must be open source, which is one of the best ways for a teacher to ensure that students produce quality code.

3.3 Assessment

ENSAE students validate the course through an in-depth project. Details about the course assessment, as well as a list of previously completed projects, are available in the Assessment section.

4 Course Outline

This course is an introduction to the issues of data science through the learning of the Python language. As the term “data science” suggests, a significant part of this course is dedicated to working with data: retrieval, structuring, exploration, and linking. This is the subject of the first part of the course, “Manipulating Data”, which serves as the foundation for the rest of the course. Unfortunately, many programs in data science, applied statistics, or social and economic sciences overlook this part of the data scientist's work, sometimes referred to as “data wrangling” or “feature engineering”, even though it represents a significant share of the job and is essential for building a relevant model.

The goal of this part is to illustrate the challenges related to retrieving various types of data sources and their exploitation using Python. The examples will be varied to illustrate the richness of the data that can be analyzed with Python: municipal \(CO_2\) emission statistics, real estate transaction data, energy diagnostics of housing, Vélib station attendance data…

The second part is dedicated to producing visualizations with Python. After retrieving and cleaning data, one generally wants to synthesize it through tables, graphics, or maps. This part is a brief introduction to the topic (“Communicating with Python”). Being quite introductory, its goal is mainly to present a few concepts that will be consolidated later.
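As a small preview of that part, here is a minimal matplotlib sketch; the categories and counts are invented for the example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Hypothetical counts, purely illustrative
categories = ["A", "B", "C"]
counts = [3, 7, 5]

fig, ax = plt.subplots()
ax.bar(categories, counts)
ax.set_title("A minimal bar chart")
ax.set_ylabel("Count")
fig.savefig("example.png")
```

The later chapters build from this kind of basic chart toward more polished graphics and interactive maps.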

The third part is dedicated to modeling, through the example of electoral science (“Modeling with Python”). Its goal is to illustrate the scientific approach of machine learning and the related methodological and technical choices, and to open onto questions that will be discussed in the rest of the university curriculum.

The fourth part of the course takes a step aside to focus on the specific issues of working with textual data, in the chapter “Introduction to Natural Language Processing (NLP) with Python”. As this research field is particularly active, the chapter is only an introduction to the subject. For further reading, refer to Russell and Norvig (2020), chapter 24.

References

Davenport, Thomas H, and DJ Patil. 2012. “Data Scientist, the Sexiest Job of the 21st Century.” Harvard Business Review 90 (5): 70–76. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.
———. 2022. “Is Data Scientist Still the Sexiest Job of the 21st Century?” Harvard Business Review 90.
Kapoor, Sayash, and Arvind Narayanan. 2022. “Leakage and the Reproducibility Crisis in ML-Based Science.” arXiv. https://doi.org/10.48550/ARXIV.2207.07048.
Russell, Stuart J., and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th Edition). Pearson. http://aima.cs.berkeley.edu/.

Additional information

The environment this website has been built and tested on is described below.

Latest built version: 2025-03-19

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0

View file history

SHA Date Author Description
3f1d2f3f 2025-03-15 15:55:59 Lino Galiana Fix problem with uv and malformed files (#599)
388fd975 2025-02-28 17:34:09 Lino Galiana Colab again and again… (#595)
488780a4 2024-09-25 14:32:16 Lino Galiana Change badge (#556)
5d15b063 2024-09-23 15:39:40 lgaliana Handling badges problem
f8b04136 2024-08-28 15:15:04 Lino Galiana Révision complète de la partie introductive (#549)
0908656f 2024-08-20 16:30:39 Lino Galiana English sidebar (#542)
a987feaa 2024-06-23 18:43:06 Lino Galiana Fix broken links (#506)
69a45850 2024-06-12 20:02:14 Antoine Palazzolo correct link (#502)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
16842200 2023-12-02 12:06:40 Antoine Palazzolo Première partie de relecture de fin du cours (#467)
1f23de28 2023-12-01 17:25:36 Lino Galiana Stockage des images sur S3 (#466)
69cf52bd 2023-11-21 16:12:37 Antoine Palazzolo [On-going] Suggestions chapitres modélisation (#452)
a7711832 2023-10-09 11:27:45 Antoine Palazzolo Relecture TD2 par Antoine (#418)
e8d0062d 2023-09-26 15:54:49 Kim A Relecture KA 25/09/2023 (#412)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
6178ebeb 2023-09-26 14:18:34 Lino Galiana Change quarto project type (#409)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
80823022 2023-08-25 17:48:36 Lino Galiana Mise à jour des scripts de construction des notebooks (#395)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
dde3e934 2023-07-21 22:22:05 Lino Galiana Fix bug on chapter order (#385)
2dbf8533 2023-07-05 11:21:40 Lino Galiana Add nice featured images (#368)
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
Back to top

Footnotes

  1. Library developed by the French public research laboratories of INRIA since 2007.↩︎

  2. Library initially used by Google for their internal needs, it was made public in 2015. Although less used now, this library had a significant influence in the 2010s by promoting the use of neural networks in research and operational applications.↩︎

  3. Library developed by Meta since 2018 and affiliated since 2022 with the PyTorch foundation.↩︎

  4. In these two areas, the most serious competitor to Python is Javascript. However, the community around this language is more focused on web development issues than on data science.↩︎

  5. Tabular data are structured data, organized, as their name indicates, in a table format that allows matching observations with variables. This structuring differs from other types of more complex data: free texts, images, sounds, videos… In the domain of unstructured data, Python is the hegemonic language for analysis. In the domain of tabular data, Python’s competitive advantage is less pronounced, particularly compared to R, but these two languages offer a core set of fairly similar functionalities. We will regularly draw parallels between these two languages in the chapters dedicated to the Pandas library.↩︎

  6. A notebook is an interactive environment that allows you to write and run code live. It combines, in a single document, text and executable code whose outputs are displayed after computation. This is extremely convenient for learning the Python language. For more details, see the official Jupyter documentation.↩︎

  7. Opening chapters as notebooks in standardized environments, as will be proposed starting from the next chapter, ensures that you have a controlled environment. Personal installations of Python are likely to have undergone modifications that can alter your environment and cause unexpected and hard-to-understand errors: this is not a recommended use for this course. As you will discover in the next chapter, cloud environments offer comfort regarding environment standardization.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.