1 Introduction
The previous sections focused on acquiring cross-functional skills for working with data. So far, we have mostly dealt with structured data: modest in size but already rich in analytical potential. This new section turns to a subject that, at first glance, may seem unlikely to be handled by computers, one that has fueled centuries of philosophical debate, from Plato to Saussure: the richness of human language.
By distinguishing “language” from “tongue”, that is, by defining the former as the capacity to express and communicate thought through signs and the latter as the conventional implementation of that capacity, we align ourselves with the field of linguistics and treat language as data. This opens the door to statistical and algorithmic analysis. Yet even if statistical regularities exist, how can computers, ultimately limited to just 0s and 1s, grasp an object as complex as language, one that takes humans years to understand and master?1
2 Natural Language Processing
Natural Language Processing (NLP) refers to the set of techniques that allow computers to understand, analyze, synthesize, and generate human language2.
NLP is a field at the intersection of statistics and linguistics that has experienced significant growth in recent years, academically, operationally, and industrially.
Some applications of these techniques have become essential in our daily lives, such as search engines, machine translation and, more recently, chatbots, whose development has accelerated rapidly since the launch of ChatGPT in December 2022.
3 Section Summary
This part of the course is dedicated to text data analysis, with literary examples 📖 for fun. It serves as a gradual introduction to the topic, focusing on the basic concepts needed to later understand more advanced principles and sophisticated techniques3. This section mainly covers:
- The challenges of cleaning textual fields and frequency analysis. This is somewhat old-school NLP, but understanding it is essential to progress further;
- Language modeling using several approaches.
Before diving into the topic of embeddings, it is important to understand the contributions and limitations of concepts like the bag of words or TF-IDF (term frequency - inverse document frequency). One of the main benefits of large language models, namely the richness of their context window, which allows them to better grasp textual nuances and speaker intent, becomes clearer once the limitations of traditional NLP are understood.
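To make these two notions concrete before the dedicated chapters, here is a minimal sketch of both representations with scikit-learn, on a toy corpus chosen purely for illustration:

```python
# Minimal sketch: bag-of-words vs. TF-IDF on a toy corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag of words: raw term counts; word order and context are discarded
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF: counts reweighted so that terms frequent in one document
# but rare across the corpus stand out
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```

Both representations turn each document into a fixed-length vector, which is exactly what makes statistical analysis possible, and exactly what discards the context that embeddings later recover.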
As an introductory perspective, this course focuses on frequency-based approaches, especially the bag-of-words approach, to ease into the later exploration of the Pandora’s box that is embeddings.
3.1 Text Cleaning and Frequency Analysis
Python is an excellent tool for text data analysis. Basic methods for transforming textual data or dictionaries, combined with specialized libraries such as NLTK and SpaCy, make it possible to normalize and analyze text data very efficiently. Python is much better equipped than R for text data analysis.
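For instance, a typical normalization pipeline with SpaCy looks like the following sketch (it assumes the small English model, en_core_web_sm, has been downloaded beforehand):

```python
# Minimal normalization sketch with SpaCy
# (assumes: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The sailors were running towards the ships.")

# Tokenization, lemmatization and stop-word/punctuation filtering in one pass
lemmas = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct
]
print(lemmas)  # e.g. ['sailor', 'run', 'ship']
```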
There is a wealth of online resources on this subject, and the best way to learn remains hands-on practice with a corpus to clean.
This section first revisits how to structure and clean a textual corpus through the bag of words approach. It aims to demonstrate how to turn a corpus into a tool suitable for statistical analysis:
- It first introduces the challenges of text data cleaning through an analysis of The Count of Monte Cristo by Alexandre Dumas, which shows how to quickly summarize the information available in a large volume of text data (as illustrated by ?@fig-wordcloud-dumas);
- It then offers a series of exercises on text cleaning based on the works of Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, aiming to highlight the specificity of each author’s vocabulary (for example, ?@fig-waffle-fear). These exercises are available in the second chapter of the section; a minimal sketch of this kind of cleaning and counting follows below.
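As a taste of those exercises, here is a cleaning-and-counting sketch with NLTK on a placeholder snippet (the chapters themselves work on the full novels):

```python
# Sketch of frequency analysis: clean, filter stop words, count
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "It was a dark and stormy night; the rain fell in torrents..."

# Basic cleaning: lowercase, keep only alphabetic tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Remove very frequent, uninformative words before counting
stop_words = set(stopwords.words("english"))
counts = Counter(t for t in tokens if t not in stop_words)
print(counts.most_common(5))
```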
This frequency-based analysis provides perspective on the nature of text data and on recurring issues in reducing the dimensionality of natural-language corpora. Just as descriptive statistics naturally lead to modeling, this frequency-based approach quickly raises the question of the underlying rules governing our text corpora.
3.2 Language Modeling
The remainder of this section introduces the challenges of language modeling, a topic that is currently very popular due to the success of ChatGPT. However, before delving into large language models (LLMs), those neural networks with billions of parameters trained on massive volumes of data, it is important to first understand some preliminary modeling techniques.
We begin by exploring an alternative approach that takes into account the context in which a word appears. The introduction of Latent Dirichlet Allocation (LDA) serves as an opportunity to present document modeling through topics, even though this approach has since fallen out of favor compared with embedding-based methods.
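As a preview, here is a minimal LDA sketch with scikit-learn on a toy corpus (the corpus and number of topics are illustrative assumptions; the dedicated chapter uses its own data):

```python
# Minimal LDA sketch: documents as mixtures of topics,
# topics as distributions over words
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the spacecraft reached orbit around the planet",
    "the rocket launch was delayed by weather",
    "the recipe calls for flour butter and sugar",
    "bake the cake for thirty minutes in the oven",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the most probable words for each inferred topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")
```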
Toward the end of this section of the course, we will introduce the challenge of transforming textual fields into numeric vectors. To do so, we will present the principle behind Word2Vec, which makes it possible, for instance, to identify that Man and Woman are semantically close despite their significant syntactic distance.
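To fix ideas, here is a minimal sketch of this principle with the Gensim library (an assumption made for illustration; the corpus below is a toy one, far too small to produce meaningful vectors in practice):

```python
# Sketch of the Word2Vec idea with Gensim (toy corpus, illustration only)
from gensim.models import Word2Vec

# In practice the model is trained on millions of sentences
sentences = [
    ["the", "man", "walked", "to", "work"],
    ["the", "woman", "walked", "to", "work"],
    ["the", "man", "spoke", "to", "the", "woman"],
    ["the", "king", "spoke", "to", "the", "queen"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

# Words used in similar contexts end up with similar vectors,
# so their cosine similarity is high despite no shared spelling
print(model.wv.similarity("man", "woman"))
```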
This chapter serves as a bridge to the concept of embedding, a major recent revolution in NLP. Embeddings enable the comparison of corpora not only by syntactic similarity (do they share common words?) but also by semantic similarity (do they share a theme or meaning?)4. Covering Word2Vec will give curious learners a solid foundation from which to explore transformer-based models, which are now the benchmark in NLP.
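As a glimpse of what such models offer, here is a hedged sketch using the sentence-transformers library (chosen here purely for illustration, not necessarily the tool used later in the course): semantically close sentences get close vectors even when they share few words.

```python
# Semantic comparison with a transformer-based model
# (sentence-transformers is an illustrative choice, not prescribed here)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The weather is lovely today.",    # same meaning as the next sentence
    "It is a beautiful, sunny day.",   # few shared words with the first
    "The stock market fell sharply.",  # unrelated topic
]
embeddings = model.encode(sentences)

# Cosine similarity: high for the first pair, low for the unrelated pair
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```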
To Go Further
Research in the field of NLP is highly active. It is therefore advisable to stay curious and explore additional resources, as no single source can compile all knowledge—especially in a field as dynamic as NLP.
To deepen the skills discussed in this course, I strongly recommend this course by HuggingFace.
To understand the internal architecture of an LLM, this post by Sebastian Raschka is very helpful.
These chapters only scratch the surface of NLP use cases for data scientists. In public statistics, for instance, one major NLP use case is the application of automatic classification techniques to convert free-text survey answers into the predefined fields of a nomenclature. This is an adaptation of multi-level classification problems to public statistics, a heavy user of standardized nomenclatures.
One example is a project on automated job classification using the PCS (socio-professional categories) typology, based on a model trained with the Fasttext library.
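To give a rough idea of how such a supervised classifier is trained, here is a hedged sketch with the fasttext library; the training file and labels below are hypothetical and unrelated to the actual PCS project:

```python
# Sketch of supervised text classification with fastText
# (hypothetical training file, not the actual PCS project data)
import fasttext

# fastText expects one example per line, labels prefixed with __label__, e.g.:
#   __label__engineer develops software for an aerospace company
#   __label__baker prepares bread and pastries in a bakery
model = fasttext.train_supervised(input="jobs_train.txt")

# Predict the most likely category for a free-text job description
labels, probs = model.predict("teaches mathematics in a secondary school")
print(labels, probs)
```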
Additional information
This page has been tested on the environment described below.
Latest built version: 2025-05-26
Python version used:
'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package | Version |
---|---|
affine | 2.4.0 |
aiobotocore | 2.15.1 |
aiohappyeyeballs | 2.4.3 |
aiohttp | 3.10.8 |
aioitertools | 0.12.0 |
aiosignal | 1.3.1 |
alembic | 1.13.3 |
altair | 5.4.1 |
aniso8601 | 9.0.1 |
annotated-types | 0.7.0 |
anyio | 4.9.0 |
appdirs | 1.4.4 |
archspec | 0.2.3 |
asttokens | 2.4.1 |
attrs | 24.2.0 |
babel | 2.17.0 |
bcrypt | 4.2.0 |
beautifulsoup4 | 4.12.3 |
black | 24.8.0 |
blinker | 1.8.2 |
blis | 1.3.0 |
bokeh | 3.5.2 |
boltons | 24.0.0 |
boto3 | 1.35.23 |
botocore | 1.35.23 |
branca | 0.7.2 |
Brotli | 1.1.0 |
bs4 | 0.0.2 |
cachetools | 5.5.0 |
cartiflette | 0.1.9 |
Cartopy | 0.24.1 |
catalogue | 2.0.10 |
cattrs | 24.1.3 |
certifi | 2025.4.26 |
cffi | 1.17.1 |
charset-normalizer | 3.3.2 |
chromedriver-autoinstaller | 0.6.4 |
click | 8.1.7 |
click-plugins | 1.1.1 |
cligj | 0.7.2 |
cloudpathlib | 0.21.1 |
cloudpickle | 3.0.0 |
colorama | 0.4.6 |
comm | 0.2.2 |
commonmark | 0.9.1 |
conda | 24.9.1 |
conda-libmamba-solver | 24.7.0 |
conda-package-handling | 2.3.0 |
conda_package_streaming | 0.10.0 |
confection | 0.1.5 |
contextily | 1.6.2 |
contourpy | 1.3.0 |
cryptography | 43.0.1 |
cycler | 0.12.1 |
cymem | 2.0.11 |
cytoolz | 1.0.0 |
dask | 2024.9.1 |
dask-expr | 1.1.15 |
databricks-sdk | 0.33.0 |
dataclasses-json | 0.6.7 |
debugpy | 1.8.6 |
decorator | 5.1.1 |
Deprecated | 1.2.14 |
diskcache | 5.6.3 |
distributed | 2024.9.1 |
distro | 1.9.0 |
docker | 7.1.0 |
duckdb | 1.1.1 |
en_core_web_sm | 3.8.0 |
entrypoints | 0.4 |
et_xmlfile | 2.0.0 |
exceptiongroup | 1.2.2 |
executing | 2.1.0 |
fastexcel | 0.14.0 |
fastjsonschema | 2.21.1 |
fiona | 1.10.1 |
Flask | 3.0.3 |
folium | 0.19.6 |
fontawesomefree | 6.6.0 |
fonttools | 4.54.1 |
fr_core_news_sm | 3.8.0 |
frozendict | 2.4.4 |
frozenlist | 1.4.1 |
fsspec | 2024.9.0 |
geographiclib | 2.0 |
geopandas | 1.0.1 |
geoplot | 0.5.1 |
geopy | 2.4.1 |
gitdb | 4.0.11 |
GitPython | 3.1.43 |
google-auth | 2.35.0 |
graphene | 3.3 |
graphql-core | 3.2.4 |
graphql-relay | 3.2.0 |
graphviz | 0.20.3 |
great-tables | 0.12.0 |
greenlet | 3.1.1 |
gunicorn | 22.0.0 |
h11 | 0.16.0 |
h2 | 4.1.0 |
hpack | 4.0.0 |
htmltools | 0.6.0 |
httpcore | 1.0.9 |
httpx | 0.28.1 |
httpx-sse | 0.4.0 |
hyperframe | 6.0.1 |
idna | 3.10 |
imageio | 2.37.0 |
importlib_metadata | 8.5.0 |
importlib_resources | 6.4.5 |
inflate64 | 1.0.1 |
ipykernel | 6.29.5 |
ipython | 8.28.0 |
itsdangerous | 2.2.0 |
jedi | 0.19.1 |
Jinja2 | 3.1.4 |
jmespath | 1.0.1 |
joblib | 1.4.2 |
jsonpatch | 1.33 |
jsonpointer | 3.0.0 |
jsonschema | 4.23.0 |
jsonschema-specifications | 2025.4.1 |
jupyter-cache | 1.0.0 |
jupyter_client | 8.6.3 |
jupyter_core | 5.7.2 |
kaleido | 0.2.1 |
kiwisolver | 1.4.7 |
langchain | 0.3.25 |
langchain-community | 0.3.9 |
langchain-core | 0.3.61 |
langchain-text-splitters | 0.3.8 |
langcodes | 3.5.0 |
langsmith | 0.1.147 |
language_data | 1.3.0 |
lazy_loader | 0.4 |
libmambapy | 1.5.9 |
locket | 1.0.0 |
loguru | 0.7.3 |
lxml | 5.4.0 |
lz4 | 4.3.3 |
Mako | 1.3.5 |
mamba | 1.5.9 |
mapclassify | 2.8.1 |
marisa-trie | 1.2.1 |
Markdown | 3.6 |
markdown-it-py | 3.0.0 |
MarkupSafe | 2.1.5 |
marshmallow | 3.26.1 |
matplotlib | 3.9.2 |
matplotlib-inline | 0.1.7 |
mdurl | 0.1.2 |
menuinst | 2.1.2 |
mercantile | 1.2.1 |
mizani | 0.11.4 |
mlflow | 2.16.2 |
mlflow-skinny | 2.16.2 |
msgpack | 1.1.0 |
multidict | 6.1.0 |
multivolumefile | 0.2.3 |
munkres | 1.1.4 |
murmurhash | 1.0.13 |
mypy_extensions | 1.1.0 |
narwhals | 1.41.0 |
nbclient | 0.10.0 |
nbformat | 5.10.4 |
nest_asyncio | 1.6.0 |
networkx | 3.3 |
nltk | 3.9.1 |
numpy | 2.1.2 |
opencv-python-headless | 4.10.0.84 |
openpyxl | 3.1.5 |
opentelemetry-api | 1.16.0 |
opentelemetry-sdk | 1.16.0 |
opentelemetry-semantic-conventions | 0.37b0 |
orjson | 3.10.18 |
outcome | 1.3.0.post0 |
OWSLib | 0.33.0 |
packaging | 24.1 |
pandas | 2.2.3 |
paramiko | 3.5.0 |
parso | 0.8.4 |
partd | 1.4.2 |
pathspec | 0.12.1 |
patsy | 0.5.6 |
Pebble | 5.1.1 |
pexpect | 4.9.0 |
pickleshare | 0.7.5 |
pillow | 10.4.0 |
pip | 24.2 |
platformdirs | 4.3.6 |
plotly | 5.24.1 |
plotnine | 0.13.6 |
pluggy | 1.5.0 |
polars | 1.8.2 |
preshed | 3.0.10 |
prometheus_client | 0.21.0 |
prometheus_flask_exporter | 0.23.1 |
prompt_toolkit | 3.0.48 |
protobuf | 4.25.3 |
psutil | 6.0.0 |
ptyprocess | 0.7.0 |
pure_eval | 0.2.3 |
py7zr | 0.22.0 |
pyarrow | 17.0.0 |
pyarrow-hotfix | 0.6 |
pyasn1 | 0.6.1 |
pyasn1_modules | 0.4.1 |
pybcj | 1.0.6 |
pycosat | 0.6.6 |
pycparser | 2.22 |
pycryptodomex | 3.23.0 |
pydantic | 2.11.5 |
pydantic_core | 2.33.2 |
pydantic-settings | 2.9.1 |
Pygments | 2.18.0 |
PyNaCl | 1.5.0 |
pynsee | 0.1.8 |
pyogrio | 0.10.0 |
pyOpenSSL | 24.2.1 |
pyparsing | 3.1.4 |
pyppmd | 1.1.1 |
pyproj | 3.7.0 |
pyshp | 2.3.1 |
PySocks | 1.7.1 |
python-dateutil | 2.9.0 |
python-dotenv | 1.0.1 |
python-magic | 0.4.27 |
pytz | 2024.1 |
pyu2f | 0.1.5 |
pywaffle | 1.1.1 |
PyYAML | 6.0.2 |
pyzmq | 26.2.0 |
pyzstd | 0.16.2 |
querystring_parser | 1.2.4 |
rasterio | 1.4.3 |
referencing | 0.36.2 |
regex | 2024.9.11 |
requests | 2.32.3 |
requests-cache | 1.2.1 |
requests-toolbelt | 1.0.0 |
retrying | 1.3.4 |
rich | 14.0.0 |
rpds-py | 0.25.1 |
rsa | 4.9 |
rtree | 1.4.0 |
ruamel.yaml | 0.18.6 |
ruamel.yaml.clib | 0.2.8 |
s3fs | 2024.9.0 |
s3transfer | 0.10.2 |
scikit-image | 0.24.0 |
scikit-learn | 1.5.2 |
scipy | 1.13.0 |
seaborn | 0.13.2 |
selenium | 4.33.0 |
setuptools | 74.1.2 |
shapely | 2.0.6 |
shellingham | 1.5.4 |
six | 1.16.0 |
smart-open | 7.1.0 |
smmap | 5.0.0 |
sniffio | 1.3.1 |
sortedcontainers | 2.4.0 |
soupsieve | 2.5 |
spacy | 3.8.4 |
spacy-legacy | 3.0.12 |
spacy-loggers | 1.0.5 |
SQLAlchemy | 2.0.35 |
sqlparse | 0.5.1 |
srsly | 2.5.1 |
stack-data | 0.6.2 |
statsmodels | 0.14.4 |
tabulate | 0.9.0 |
tblib | 3.0.0 |
tenacity | 9.0.0 |
texttable | 1.7.0 |
thinc | 8.3.6 |
threadpoolctl | 3.5.0 |
tifffile | 2025.5.24 |
toolz | 1.0.0 |
topojson | 1.9 |
tornado | 6.4.1 |
tqdm | 4.67.1 |
traitlets | 5.14.3 |
trio | 0.30.0 |
trio-websocket | 0.12.2 |
truststore | 0.9.2 |
typer | 0.16.0 |
typing_extensions | 4.13.2 |
typing-inspect | 0.9.0 |
typing-inspection | 0.4.1 |
tzdata | 2024.2 |
Unidecode | 1.4.0 |
url-normalize | 2.2.1 |
urllib3 | 2.4.0 |
uv | 0.7.8 |
wasabi | 1.1.3 |
wcwidth | 0.2.13 |
weasel | 0.4.1 |
webdriver-manager | 4.0.2 |
websocket-client | 1.8.0 |
Werkzeug | 3.0.4 |
wheel | 0.44.0 |
wordcloud | 1.9.3 |
wrapt | 1.16.0 |
wsproto | 1.2.0 |
xgboost | 2.1.1 |
xlrd | 2.0.1 |
xyzservices | 2024.9.0 |
yarl | 1.13.1 |
yellowbrick | 1.5 |
zict | 3.0.0 |
zipp | 3.20.2 |
zstandard | 0.23.0 |
View file history
SHA | Date | Author | Description |
---|---|---|---|
d6b67125 | 2025-05-23 18:03:48 | Lino Galiana | Traduction des chapitres NLP (#603) |
ff42cf23 | 2024-04-25 20:05:33 | linogaliana | Editorisalisation NLP |
005d89b8 | 2023-12-20 17:23:04 | Lino Galiana | Finalise l’affichage des statistiques Git (#478) |
4cd44f35 | 2023-12-11 17:37:50 | Antoine Palazzolo | Relecture NLP (#474) |
deaafb6f | 2023-12-11 13:44:34 | Thomas Faria | Relecture Thomas partie NLP (#472) |
1f23de28 | 2023-12-01 17:25:36 | Lino Galiana | Stockage des images sur S3 (#466) |
a1ab3d94 | 2023-11-24 10:57:02 | Lino Galiana | Reprise des chapitres NLP (#459) |
7bd768a6 | 2023-08-28 09:14:55 | linogaliana | Erreur image |
862ea4b3 | 2023-08-28 11:07:31 | Lino Galiana | Ajoute référence au post de Raschka (#398) |
3bdf3b06 | 2023-08-25 11:23:02 | Lino Galiana | Simplification de la structure 🤓 (#393) |
f2905a7d | 2023-08-11 17:24:57 | Lino Galiana | Introduction de la partie NLP (#388) |
5d4874a8 | 2023-08-11 15:09:33 | Lino Galiana | Pimp les introductions des trois premières parties (#387) |
f21a24d3 | 2023-07-02 10:58:15 | Lino Galiana | Pipeline Quarto & Pages 🚀 (#365) |
a408cc96 | 2023-02-01 09:07:27 | Lino Galiana | Ajoute bouton suggérer modification (#347) |
164fa689 | 2022-11-30 09:13:45 | Lino Galiana | Travail partie NLP (#328) |
495599d7 | 2021-12-19 18:33:05 | Lino Galiana | Des éléments supplémentaires dans la partie NLP (#202) |
4f675284 | 2021-12-12 08:37:21 | Lino Galiana | Improve website appareance (#194) |
4cdb759c | 2021-05-12 10:37:23 | Lino Galiana | :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105) |
d164635d | 2020-12-08 16:22:00 | Lino Galiana | :books: Première partie NLP (#87) |
Footnotes
While writing these lines, I asked ChatGPT to tell me a joke. “Why do programmers hate the beach? Because the sand keeps changing paradigms!”
Not understanding it, I asked for an explanation. Here’s what I got:
Of course, I’d be happy to explain! This joke plays on the double meaning of the word “paradigm” and is a pun involving programming and the beach.
In programming, a “paradigm” refers to a model or method of programming, such as object-oriented or functional programming. Programmers are often familiar with different paradigms.
On the other hand, in a beach context, “paradigm” is humorously misused to mean “tide”—alluding to the natural shifting of sand and water.
The joke plays on the humorous frustration programmers might feel when adapting to new programming paradigms, just like how sand constantly shifts with the tides. By combining both meanings of “paradigm,” the joke creates an absurd and amusing scenario that will resonate with those familiar with programming and beaches.
In the end, I found the explanation funnier than the joke itself.↩︎
The concept of “natural language processing” has tended to replace the more specific term “text mining.” Text mining refers to the search for — or mining of — information within a set of texts. While it is one application area of NLP, it is not the only one. Advances in computing power and in the formalization of language have expanded the scope of fields where computational linguistics is applied.↩︎
For example, the concept of embedding—the transformation of a text field into a multidimensional numeric vector—is central in NLP today but is only briefly mentioned here.↩︎
An example of the value of this approach can be seen in ?@fig-relevanc-table-embedding.↩︎
Citation
BibTeX citation:
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.