Python pour la data science

Lino Galiana

doi:10.5281/zenodo.8229676

Although data scientists are often associated with the implementation of artificial intelligence models, training and using these models is not necessarily their daily work.

In practice, gathering heterogeneous data sources, then structuring and harmonizing them for exploratory analysis before modeling or visualization, represents a significant part of data scientists’ work. In many environments, this is even the essence of the role. Developing relevant models requires deep reflection on the data, an essential step that should not be overlooked.

This course, like many introductory resources on data science (Wickham, Çetinkaya-Rundel, and Grolemund 2023; VanderPlas 2016; McKinney 2012), will therefore offer a lot of content on data manipulation, an essential skill for data scientists.

Software built around the database concept has become the main tool for data scientists. Being able to apply a set of standard operations to databases, whatever their nature, makes programmers more efficient than repeating these operations manually, as in Excel.

All the dominant programming languages in the data science ecosystem are based on the dataframe principle. It is even a central object in some software, notably R. The logic of SQL, a language for declaring data operations that has been around for over fifty years, provides a relevant framework for performing standardized operations on columns (creating new columns, selecting subsets of rows, etc.).
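As a minimal sketch of these standardized operations, the snippet below creates a new column and selects a subset of rows and columns with Pandas, with the rough SQL equivalent given in comments. The dataset and its column names (`city`, `population`, `area_km2`) are invented for illustration and do not come from the course.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Marseille", "Lille"],
        "population": [2_100_000, 520_000, 870_000, 235_000],
        "area_km2": [105.0, 48.0, 241.0, 35.0],
    }
)

# Create a new column
# (SQL: SELECT *, population / area_km2 AS density FROM df)
df = df.assign(density=df["population"] / df["area_km2"])

# Select a subset of rows
# (SQL: WHERE population > 500000)
large_cities = df.loc[df["population"] > 500_000]

# Select a subset of columns
# (SQL: SELECT city, density)
summary = large_cities[["city", "density"]]

print(summary)
```

Each step returns a new dataframe, so operations can be chained in the same way that clauses are combined in a SQL query.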

However, the dataframe only recently became established in Python, thanks to the Pandas package created by Wes McKinney. The rise of the Pandas library (downloaded over 5 million times per day in 2023) is largely responsible for Python’s success in the data science ecosystem. In just a few years, it has completely renewed the way this very flexible language is used for data analysis.

This part of the course is a general introduction to the rich ecosystem of data manipulation with Python. These chapters cover both data retrieval and the restructuring and analysis of that data.

Summary of this section

Pandas has become essential in the Python ecosystem for data science. Pandas itself is built on top of the Numpy package, so understanding Numpy helps in becoming comfortable with Pandas. Numpy is a low-level library for storing and manipulating data. It is at the heart of the data science ecosystem because most libraries, even those handling unstructured objects, use objects built from Numpy¹.
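The relationship between the two libraries can be illustrated with a short sketch. It assumes a small numeric dataframe whose columns (`height_cm`, `weight_kg`) and values are hypothetical; with default settings, such columns can be exposed as Numpy arrays and accept Numpy functions directly.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
df = pd.DataFrame(
    {
        "height_cm": [170.0, 182.5, 165.0],
        "weight_kg": [65.0, 80.0, 58.5],
    }
)

# A numeric column can be exposed as a Numpy array
heights = df["height_cm"].to_numpy()
print(type(heights), heights.dtype)

# Numpy functions can be applied directly to a Pandas Series;
# the result is still a Series, with the index preserved
log_weight = np.log(df["weight_kg"])

# Arithmetic is vectorized, as with Numpy arrays: no explicit loop
bmi = df["weight_kg"] / (df["height_cm"] / 100) ** 2
print(bmi.round(1))
```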

The Pandas approach, which provides a unified entry point for manipulating datasets of very different natures, has been extended to geographic objects with Geopandas. This allows geographic data to be manipulated as if it were classic structured data. Geographic data and cartographic representation are becoming increasingly common with the rise of open localized data and geolocated big data.

However, structured data imported from flat files is not the only data source. APIs and web scraping allow for flexible downloading or extraction of data from web pages or specialized portals. These data, particularly those obtained through web scraping, often require a bit more data cleaning work, especially with character strings.
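To give an idea of what this cleaning can look like, here is a minimal sketch using the `.str` accessor of Pandas. The raw values are made up to mimic typical scraping output (stray whitespace, inconsistent case, missing values, prices stored as text); they are not taken from the course.

```python
import pandas as pd

# Hypothetical scraped values, invented for illustration
raw = pd.Series(["  Paris ", "LYON\n", "marseille", None, "Saint-Étienne  "])

# Trim whitespace, harmonize case, collapse repeated spaces.
# Missing values are propagated rather than raising an error.
cleaned = (
    raw.str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
)
print(cleaned.tolist())

# Hypothetical prices scraped as text, with a decimal comma
prices = pd.Series(["12,50 €", "8 €", "n/a"])

# Replace the decimal comma, extract the numeric part, convert to float.
# Values without any number become NaN.
amounts = pd.to_numeric(
    prices.str.replace(",", ".", regex=False).str.extract(r"(\d+(?:\.\d+)?)")[0],
    errors="coerce",
)
print(amounts.tolist())
```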

The Pandas ecosystem thus represents a Swiss army knife for data analysis. This is why this course will cover it extensively. Before trying to implement an ad hoc solution, it is often useful to ask the following question: “Could I do this with the basic functionalities of Pandas?” Asking this question can prevent arduous paths and save a lot of time.

However, Pandas is not suitable for handling large volumes of data. To process such data, it is recommended to use Polars or Dask, which adopt the logic of Pandas but optimize its functionality; Spark, if you have suitable infrastructure, generally in big data environments; or DuckDB, if you are willing to use SQL queries rather than a high-level library.

Exercises

This section provides both detailed tutorials and guided exercises. You can view them on this site or open them through the badges at the beginning of each chapter.

Going further

This course does not really address issues of volume or speed of computation. Pandas can show its limits in this area with large datasets (several gigabytes).

It is therefore worth considering:

The book Modern Pandas for additional insights on performance with Pandas;
The question of sparse objects;
The packages Dask or Polars to speed up computations;
DuckDB for very efficient SQL queries;
PySpark for very large datasets.

References

Here is a selective bibliography of interesting books complementary to the chapters in the “Manipulation” section of this course:

McKinney, Wes. 2012. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.

VanderPlas, Jake. 2016. Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly Media, Inc.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc.

Additional information

Python environment

This site was built automatically through a GitHub Action using the Quarto reproducible publishing software (version 1.7.33).

The environment used to obtain the results is reproducible via uv. The pyproject.toml file used to build this environment is available on the linogaliana/python-datascientist repository.

pyproject.toml

[project]
name = "python-datascientist"
version = "0.1.0"
description = "Source code for Lino Galiana's Python for data science course"
readme = "README.md"
requires-python = ">=3.12,<3.13"
dependencies = [
    "altair==5.4.1",
    "black==24.8.0",
    "cartiflette",
    "contextily==1.6.2",
    "duckdb>=0.10.1",
    "folium>=0.19.6",
    "geoplot==0.5.1",
    "graphviz==0.20.3",
    "great-tables==0.12.0",
    "ipykernel>=6.29.5",
    "jupyter>=1.1.1",
    "jupyter-cache==1.0.0",
    "kaleido==0.2.1",
    "langchain-community==0.3.9",
    "loguru==0.7.3",
    "markdown>=3.8",
    "nbclient==0.10.0",
    "nbformat==5.10.4",
    "nltk>=3.9.1",
    "pip>=25.1.1",
    "plotly>=6.1.2",
    "plotnine==0.13.6",
    "polars==1.8.2",
    "pyarrow==17.0.0",
    "pynsee==0.1.8",
    "python-dotenv==1.0.1",
    "pywaffle==1.1.1",
    "requests>=2.32.3",
    "scikit-image==0.24.0",
    "scipy==1.13.0",
    "spacy==3.8.4",
    "webdriver-manager==4.0.2",
    "wordcloud==1.9.3",
    "xlrd==2.0.1",
    "yellowbrick==1.5",
]

[tool.uv.sources]
cartiflette = { git = "https://github.com/inseefrlab/cartiflette" }

To use exactly the same environment (version of Python and packages), please refer to the documentation for uv.

File history

SHA	Date	Author	Description
7006f605	2025-07-28 14:20:47	Lino Galiana	Une première PR qui gère plein de bugs détectés par Nicolas (#630)
91431fa2	2025-06-09 17:08:00	Lino Galiana	Improve homepage hero banner (#612)
5f08b572	2024-08-29 10:33:57	Lino Galiana	Traduction de l’introduction (#551)
005d89b8	2023-12-20 17:23:04	Lino Galiana	Finalise l’affichage des statistiques Git (#478)
1f23de28	2023-12-01 17:25:36	Lino Galiana	Stockage des images sur S3 (#466)
69cf52bd	2023-11-21 16:12:37	Antoine Palazzolo	[On-going] Suggestions chapitres modélisation (#452)
154f09e4	2023-09-26 14:59:11	Antoine Palazzolo	Des typos corrigées par Antoine (#411)
9a4e2267	2023-08-28 17:11:52	Lino Galiana	Action to check URL still exist (#399)
80823022	2023-08-25 17:48:36	Lino Galiana	Mise à jour des scripts de construction des notebooks (#395)
3bdf3b06	2023-08-25 11:23:02	Lino Galiana	Simplification de la structure 🤓 (#393)
5d4874a8	2023-08-11 15:09:33	Lino Galiana	Pimp les introductions des trois premières parties (#387)
8e5edba6	2022-09-02 11:59:57	Lino Galiana	Ajoute un chapitre dask (#264)
f10815b5	2022-08-25 16:00:03	Lino Galiana	Notebooks should now look more beautiful (#260)
d201e3cd	2022-08-03 15:50:34	Lino Galiana	Pimp la homepage ✨ (#249)
12965bac	2022-05-25 15:53:27	Lino Galiana	:launch: Bascule vers quarto (#226)
5cac236e	2021-12-16 19:46:43	Lino Galiana	un petit mot sur mercator (#201)
4cdb759c	2021-05-12 10:37:23	Lino Galiana	:sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
0a0d0348	2021-03-26 20:16:22	Lino Galiana	Ajout d’une section sur S3 (#97)
4677769b	2020-09-15 18:19:24	Lino Galiana	Nettoyage des coquilles pour premiers TP (#37)
d48e68fa	2020-09-08 18:35:07	Lino Galiana	Continuer la partie pandas (#13)
913047d3	2020-09-08 14:44:41	Lino Galiana	Harmonisation des niveaux de titre (#17)

Footnotes

1. Some libraries are gradually moving away from Numpy, which is not always the most suitable for managing certain types of data. The Arrow framework is becoming the lower layer used by more and more data science libraries. This blog post provides a detailed explanation of this topic.

Citation

BibTeX citation:

@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}

For attribution, please cite this work as:

Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.