Cleaning and structuring information in textual data

As text corpora are very large objects with a low signal-to-noise ratio, it is necessary to carry out a series of text cleaning steps. This chapter explores some classic cleaning methods on literary corpora: Alexandre Dumas' The Count of Monte Cristo and three Anglo-Saxon gothic writers (Lovecraft, Poe, Shelley).

NLP

Author: Lino Galiana

Published: 2025-05-26


1 Introduction

1.1 Recap

As mentioned in the introduction to this section on natural language processing, the main goal of the techniques we will explore is the synthetic representation of language.

Natural Language Processing (NLP) aims to extract information from text through statistical content analysis. This definition includes a wide range of NLP applications (translation, sentiment analysis, recommendation, monitoring, etc.).

This approach involves transforming a text—understandable by humans—into a number, which is the appropriate format for a computer in the context of statistical or algorithmic approaches.

Turning textual information into numerical values suitable for statistical analysis is no easy task. Text data are unstructured since the sought-after information—specific to each analysis—is hidden in a large mass of information that must also be interpreted within a certain context (the same word or phrase can have different meanings depending on the context).

If that wasn’t already difficult enough, there are additional challenges specific to text analysis, as this data is:

  • noisy: spelling, typos…
  • evolving: language changes with new words, meanings…
  • complex: variable structures, agreements…
  • ambiguous: synonymy, polysemy, hidden meanings…
  • language-specific: no single set of rules applies across languages
  • high-dimensional: infinite combinations of word sequences

1.2 Chapter Objective

In this chapter, we will focus on frequency-based methods within the bag of words paradigm. These are somewhat old school compared to the more sophisticated approaches we’ll cover later. However, introducing them allows us to address several typical challenges of text data that remain central in modern NLP.

The main takeaway from this section is that since text data is very high-dimensional—language is a rich object—we need methods to reduce the noise in our text corpora to better capture the signal.

This part serves as an introduction, drawing on classic works of French and English literature. It will also present some key libraries that form part of the essential toolkit for data scientists: NLTK and SpaCy. The following chapters will then focus on language modeling.

The SpaCy Library

NLTK is the historical text analysis library in Python, dating back to the early 2000s. The industrial application of NLP in data science is more recent and owes a lot to the increased collection of unstructured data by social networks. This has led to a renewal of the NLP field, both in research and in industrial applications.

The spaCy package is one of the tools that enabled this industrialization of NLP methods. Designed around the concept of data pipelines, it is much more convenient to use for a text data processing chain that involves multiple transformation steps.
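As a small illustration of this pipeline design (a sketch, assuming the fr_core_news_sm model downloaded later in this chapter is already installed), a loaded model exposes the sequence of components it applies to each text:

import spacy

# A loaded model is a pipeline of components applied in sequence
nlp = spacy.load("fr_core_news_sm")
print(nlp.pipe_names)
# typically something like ['tok2vec', 'morphologizer', 'parser', 'lemmatizer', 'attribute_ruler', 'ner']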

1.3 Method

Text analysis aims to transform text into manipulable numerical data. To do this, it is necessary to define a minimal semantic unit. This textual unit can be a word, a sequence of n words (an n-gram), or even a string of characters (e.g., punctuation can be meaningful). This is called a token.
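For instance, with a toy sentence, word-level tokens and bigrams (n-grams with n = 2) can be illustrated with plain Python:

sentence = "le comte de monte cristo"

tokens = sentence.split()                # word-level tokens
bigrams = list(zip(tokens, tokens[1:]))  # n-grams with n = 2

print(tokens)   # ['le', 'comte', 'de', 'monte', 'cristo']
print(bigrams)  # [('le', 'comte'), ('comte', 'de'), ('de', 'monte'), ('monte', 'cristo')]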

Various techniques (such as clustering or supervised classification) can then be used depending on the objective, in order to exploit the transformed information. However, text cleaning steps are essential. Otherwise, an algorithm will be unable to detect meaningful information in the infinite range of possibilities.

The following packages will be useful throughout this chapter:

!pip install pywaffle
!pip install spacy
!pip install plotnine
!pip install great_tables
!pip install wordcloud

2 Example Dataset

2.1 The Count of Monte Cristo

The example dataset is The Count of Monte Cristo by Alexandre Dumas. It is available for free on the Project Gutenberg website (http://www.gutenberg.org), along with thousands of other public-domain books.

The simplest way to retrieve it is to use the urllib.request module to download the text file and lightly clean it to keep only the core content of the book:

from urllib import request

url = "https://www.gutenberg.org/files/17989/17989-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

dumas = (
  raw
  .split("*** START OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***")[1]
  .split("*** END OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***")[0]
)

import re

def clean_text(text):
    text = text.lower()  # convert to lowercase
    text = " ".join(text.split())
    return text

dumas = clean_text(dumas)

dumas[10000:10500]
(The content of the book is extracted here in a somewhat simplistic way.)
" mes yeux. --vous avez donc vu l'empereur aussi? --il est entré chez le maréchal pendant que j'y étais. --et vous lui avez parlé? --c'est-à-dire que c'est lui qui m'a parlé, monsieur, dit dantès en souriant. --et que vous a-t-il dit? --il m'a fait des questions sur le bâtiment, sur l'époque de son départ pour marseille, sur la route qu'il avait suivie et sur la cargaison qu'il portait. je crois que s'il eût été vide, et que j'en eusse été le maître, son intention eût été de l'acheter; mais je lu"

2.2 The Anglo-Saxon Corpus

We will use an Anglo-Saxon corpus featuring three authors of gothic literature: Edgar Allan Poe (EAP), H.P. Lovecraft (HPL), and Mary Shelley (MWS).

The data is available in a CSV file provided on GitHub. The direct URL for retrieval is https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/raw/master/data/spooky.csv.

Having a corpus that compares multiple authors will allow us to understand how text data cleaning facilitates comparative analysis.

We can use the following code to read and prepare this data:

import pandas as pd

url = 'https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/raw/master/data/spooky.csv'
# 1. Import the data
horror = pd.read_csv(url, encoding='latin-1')
# 2. Capitalize column names
horror.columns = horror.columns.str.capitalize()
# 3. Remove the "id" prefix
horror['ID'] = horror['Id'].str.replace("id", "")
horror = horror.set_index('Id')

The dataset thus pairs an author with a sentence they wrote:

horror.head()
Text Author ID
Id
id26305 This process, however, afforded me no means of... EAP 26305
id17569 It never once occurred to me that the fumbling... HPL 17569
id11008 In his left hand was a gold snuff box, from wh... EAP 11008
id27763 How lovely is spring As we looked from Windsor... MWS 27763
id12958 Finding nothing else, not even gold, the Super... HPL 12958

We can observe that the excerpts from the 3 authors are not necessarily balanced within the dataset. If this corpus is later used for modeling, it will be important to account for this imbalance.

(
  horror
  .value_counts('Author')
  .plot(kind = "barh")
)

3 Initial Frequency Analysis

The standard approach in statistics—starting with descriptive analysis before modeling—also applies to text data analysis. Text mining therefore begins with a statistical analysis to determine the structure of the corpus.

Before diving into a systematic analysis of each author’s lexical field, we will first focus on a single word: fear.

3.1 Targeted Exploration

Tip

The exercise below uses a graphical representation called a waffle chart. It is a better alternative to pie charts, whose slices are hard for the human eye to compare accurately, making proportions easy to misjudge.

Exercise 1: Word Frequency

First, we will focus on our Anglo-Saxon corpus (horror)

  1. Count the number of sentences, for each author, in which the word fear appears.
  2. Use pywaffle to generate the charts below that visually summarize the number of occurrences of the word “fear” by author.
  3. Repeat the analysis with the word “horror”.

The resulting count should be as follows

wordtoplot
Author
EAP 70
HPL 160
MWS 211
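One possible way to obtain this count and the waffle chart (a sketch; the wordtoplot name and the layout parameters are illustrative):

import matplotlib.pyplot as plt
from pywaffle import Waffle

# Number of sentences, per author, containing the word "fear"
wordtoplot = (
    horror["Text"]
    .str.lower()
    .str.contains("fear")
    .groupby(horror["Author"])
    .sum()
)

# One square per sentence, coloured by author
fig = plt.figure(
    FigureClass=Waffle,
    rows=15,
    values=wordtoplot.to_dict(),
    title={"label": "Sentences mentioning 'fear', by author"},
    legend={"loc": "upper left", "bbox_to_anchor": (1, 1)},
)
plt.show()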

This produces the following waffle chart:

Figure 3.1: Distribution of the term fear across our three authors' corpus

This clearly highlights the imbalance in our dataset when focusing on the term “fear”, with Mary Shelley accounting for nearly 50% of the observations.

If we repeat the analysis with the term “horror”, we get the following figure:

Figure 3.2: Distribution of the term horror across our three authors' corpus

3.2 Converting Text into Tokens

In the previous exercise, we performed a one-off search, which doesn’t scale well. To generalize this approach, a corpus is typically broken down into independent semantic units: tokens.

Tip

We will need to import several ready-to-use corpora to work with the NLTK or SpaCy libraries. The instructions below will help you retrieve all these resources.

To retrieve all our ready-to-use NLTK corpora, we do the following

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('genesis')
nltk.download('wordnet')
nltk.download('omw-1.4')

For SpaCy, it is necessary to use the following command line:

!python -m spacy download fr_core_news_sm
!python -m spacy download en_core_web_sm

Rather than implementing an inefficient tokenizer yourself, it is more appropriate to use one from a specialized library. Historically, the simplest choice was to use the tokenizer from NLTK, the classic Python text mining library:

from nltk.tokenize import word_tokenize
word_tokenize(dumas[10000:10500])

As we can see, this tokenizer is somewhat crude and inconsistent: j'y étais is split into 4 tokens (['j', "'", 'y', 'étais']) whereas l'acheter remains a single token. NLTK is an English-centric library, and its tokenization algorithm is not always well suited to French grammar. In such cases, it is better to use SpaCy, a more modern library for this kind of task. Besides being well documented, it is better adapted to non-English languages. As shown in the documentation example on tokenizers, its algorithm is quite sophisticated:

Example of a tokenization algorithm (from the SpaCy documentation)

It can be applied in the following way:

import spacy
nlp = spacy.load("fr_core_news_sm")
doc = nlp(dumas[10000:10500])

text_tokenized = []
for token in doc:
    text_tokenized += [token.text]

", ".join(text_tokenized)
" , mes, yeux, ., --vous, avez, donc, vu, l', empereur, aussi, ?, --il, est, entré, chez, le, maréchal, pendant, que, j', y, étais, ., --et, vous, lui, avez, parlé, ?, --c', est, -, à, -, dire, que, c', est, lui, qui, m', a, parlé, ,, monsieur, ,, dit, dantès, en, souriant, ., --et, que, vous, a, -t, -il, dit, ?, --il, m', a, fait, des, questions, sur, le, bâtiment, ,, sur, l', époque, de, son, départ, pour, marseille, ,, sur, la, route, qu', il, avait, suivie, et, sur, la, cargaison, qu', il, portait, ., je, crois, que, s', il, eût, été, vide, ,, et, que, j', en, eusse, été, le, maître, ,, son, intention, eût, été, de, l', acheter, ;, mais, je, lu"

As we can see, there are still many elements cluttering our corpus structure, starting with punctuation. However, we will be able to easily remove these later on, as we will see.

3.3 Word Cloud: A First Generalized Analysis

At this point, we still have no clear sense of the structure of our corpus: word count, most frequent words, etc.

To get an idea of the corpus structure, we can start by looking at the word distribution in Dumas' work. Let's begin with the first 30,000 characters and count the unique words:

from collections import Counter

doc = nlp(dumas[:30000])

# Extract tokens, convert to lowercase and filter out punctuation and spaces
tokens = [token.text.lower() for token in doc if not token.is_punct and not token.is_space]

# Count the frequency of each token
token_counts = Counter(tokens)

There are already many distinct words at the beginning of the work.

len(token_counts)
1401

We can observe the high dimensionality of the corpus, with around 1,400 unique words in the first 30,000 characters of Dumas’ work.

token_count_all = list(token_counts.items())

# Create a DataFrame from the list of tuples
token_count_all = pd.DataFrame(token_count_all, columns=['word', 'count'])

If we look at the distribution of word frequencies (an analysis we will extend later when discussing Zipf’s law), we can see that many words appear only once (nearly half of them), that the frequency density drops off quickly, and that the tail of the distribution deserves closer attention than the following figure allows:

from plotnine import *
(
  ggplot(token_count_all) +
  geom_histogram(aes(x = "count")) +
  scale_x_log10()
)

Now, if we look at the 25 most frequent words, we can see that they are not very informative for analyzing the meaning of our document:

# Sort the tokens by frequency in descending order
sorted_token_counts = token_counts.most_common(25)
sorted_token_counts = pd.DataFrame(sorted_token_counts, columns=['word', 'count'])
Word  Number of occurrences
de 176
le 149
et 124
à 116
l' 102
que 100
vous 88
la 83
il 81
un 77
je 73
en 70
est 61
qui 53
dantès 53
d' 50
dit 50
les 49
du 46
a 41
ne 37
n' 37
mon 37
son 36
pas 36
Number of occurrences in the first 30,000 characters of The Count of Monte Cristo

If we represent this ranking graphically:

(
    ggplot(sorted_token_counts, aes(x='word', y='count')) +
    geom_point(stat='identity', size = 3, color = "red") +
    scale_x_discrete(
      limits=sorted_token_counts.sort_values("count")["word"].tolist()
    ) +
    coord_flip() +
    theme_minimal() +
    labs(title='Word Frequency', x='Word', y='Count')
)

We will focus on these filler words later on, as it will be important to account for them in our deeper document analyses.

Through these word counts, we’ve gained a first intuition about the nature of our corpus. However, a more visual approach would help us gain further insight. Word clouds are convenient graphical representations for visualizing the most frequent words, when used appropriately. They are very easy to produce in Python with the wordcloud package. Some formatting parameters even allow the shape of the cloud to be adjusted to match an image.

Exercise 2: Wordcloud
  1. Using the WordCloud class from the wordcloud package, create three word clouds representing the most commonly used words by each author in the horror corpus¹.
  2. Create a word cloud for the dumas corpus using a mask like the one below.
Example mask for question 2

Image URL: https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png


The word clouds generated for question 1 are as follows:

Figure 3.3: Lovecraft
Figure 3.4: Poe
Figure 3.5: Shelley

Whereas the one generated from Dumas’ work takes the following shape:

Figure 3.6: Word cloud produced from The Count of Monte Cristo
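One possible way to produce such a masked word cloud for question 2 (a sketch; the mask is read from the image URL given above, and random_state is fixed as suggested in the footnote):

from io import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import requests
from PIL import Image
from wordcloud import WordCloud

# Read the mask image and turn it into an array usable by WordCloud
img_url = "https://minio.lab.sspcloud.fr/lgaliana/generative-art/pythonds/book.png"
mask = np.array(Image.open(BytesIO(requests.get(img_url).content)))

# Build the cloud from the raw text of the novel
wordcloud = WordCloud(background_color="white", mask=mask, max_words=200, random_state=21)
wordcloud.generate(dumas)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()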

If it wasn’t already obvious, these visualizations clearly show the need to clean our text. For instance, in the case of Dumas’ work, the name of the main character, Dantès, is obscured by various articles or connecting words that interfere with the analysis. In the Anglo-Saxon corpus, similar terms like “the”, “of”, etc., dominate.

These words are called stop words. This is a clear example of why text should be cleaned before analysis (unless one is interested in Zipf’s law, see the next exercise).

3.4 Aside: Zipf’s Law

In the 1930s, Zipf observed a statistical regularity in Joyce’s Ulysses. The most frequent word appeared \(x\) times, the second most frequent word appeared half as often, the third a third as often, and so on. Statistically, this means that the frequency of occurrence \(f(n_i)\) of a word is related to its rank \(n_i\) in the frequency order by a law of the form:

\[f(n_i) = c/n_i\]

where \(c\) is a constant.

More generally, Zipf’s law can be derived from a power-law frequency distribution: \(f(n_i) = cn_{i}^{-k}\). Empirically, this means that we can use Poisson regressions to estimate the law’s parameters, following the specification:

\[ \mathbb{E}\bigg( f(n_i)|n_i \bigg) = \exp(\beta_0 + \beta_1 \log(n_i)) \]

Generalized linear models (GLMs) allow us to perform this type of regression. In Python, they are available via the statsmodels package, whose outputs are heavily inspired by specialized econometric software such as Stata.

count_words = pd.DataFrame({'counter' : horror
    .groupby('Author')
    .apply(lambda s: ' '.join(s['Text']).split())
    .apply(lambda s: Counter(s))
    .apply(lambda s: s.most_common())
    .explode()}
)
count_words[['word','count']] = pd.DataFrame(count_words['counter'].tolist(), index=count_words.index)
count_words = count_words.reset_index()

count_words = count_words.assign(
    tot_mots_auteur = lambda x: (x.groupby("Author")['count'].transform('sum')),
    freq = lambda x: x['count'] /  x['tot_mots_auteur'],
    rank = lambda x: x.groupby("Author")['count'].transform('rank', ascending = False)
)

Let’s begin by visualizing the relationship between frequency and rank:

from plotnine import *

g = (
    ggplot(count_words) +
    geom_point(aes(y = "freq", x = "rank", color = 'Author'), alpha = 0.4) +
    scale_x_log10() + scale_y_log10() +
    theme_minimal()
)

We do indeed observe a log-linear relationship between the two in the plot:

Using statsmodels, let’s formally verify this relationship:

import statsmodels.api as sm
import numpy as np

exog = sm.add_constant(np.log(count_words['rank'].astype(float)))

model = sm.GLM(count_words['freq'].astype(float), exog, family = sm.families.Poisson()).fit()

# Display the model results
print(model.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                   freq   No. Observations:                69301
Model:                            GLM   Df Residuals:                    69299
Model Family:                 Poisson   Df Model:                            1
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -23.011
Date:                Mon, 26 May 2025   Deviance:                     0.065676
Time:                        18:57:55   Pearson chi2:                   0.0656
No. Iterations:                     5   Pseudo R-squ. (CS):          0.0002431
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4388      1.089     -2.239      0.025      -4.574      -0.303
rank          -0.9831      0.189     -5.196      0.000      -1.354      -0.612
==============================================================================

The coefficient on rank is close to -1, which corresponds to the \(f(n_i) = c/n_i\) form: an approximately log-linear relationship between rank and word frequency. In other words, the most frequent word occurs about twice as often as the second most frequent, three times as often as the third, and so on. Zipf’s law is thus, at least approximately, observed empirically in this corpus of three authors.
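To read the fit directly in terms of the power law \(f(n_i) = c n_i^{-k}\), the estimated parameters can be extracted from the model (a small sketch):

# Recover the power-law parameters implied by the Poisson regression
k_hat = -model.params["rank"]          # estimated exponent of the power law
c_hat = np.exp(model.params["const"])  # estimated multiplicative constant
print(f"Estimated exponent k: {k_hat:.2f}, constant c: {c_hat:.4f}")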

4 Text Cleaning

4.1 Removing Stop Words

As we have seen, whether in French or English, a number of connecting words, while grammatically necessary, carry little informational value and prevent us from identifying the main information-bearing words in our corpus.

Therefore, it is necessary to clean our corpus by removing such terms. This cleaning process goes beyond simply removing words; it’s also an opportunity to eliminate other problematic tokens, such as punctuation.

Let’s start by downloading the stopwords corpus

import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /github/home/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True

The list of English stopwords in NLTK is as follows:

from nltk.corpus import stopwords
", ".join(stopwords.words("english"))
"a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can, couldn, couldn't, d, did, didn, didn't, do, does, doesn, doesn't, doing, don, don't, down, during, each, few, for, from, further, had, hadn, hadn't, has, hasn, hasn't, have, haven, haven't, having, he, he'd, he'll, her, here, hers, herself, he's, him, himself, his, how, i, i'd, if, i'll, i'm, in, into, is, isn, isn't, it, it'd, it'll, it's, its, itself, i've, just, ll, m, ma, me, mightn, mightn't, more, most, mustn, mustn't, my, myself, needn, needn't, no, nor, not, now, o, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, re, s, same, shan, shan't, she, she'd, she'll, she's, should, shouldn, shouldn't, should've, so, some, such, t, than, that, that'll, the, their, theirs, them, themselves, then, there, these, they, they'd, they'll, they're, they've, this, those, through, to, too, under, until, up, ve, very, was, wasn, wasn't, we, we'd, we'll, we're, were, weren, weren't, we've, what, when, where, which, while, who, whom, why, will, with, won, won't, wouldn, wouldn't, y, you, you'd, you'll, your, you're, yours, yourself, yourselves, you've"

The list provided by SpaCy is more comprehensive (we already downloaded the en_core_web_sm model above):

nlp_english = spacy.load('en_core_web_sm')
stop_words_english = nlp_english.Defaults.stop_words
", ".join(stop_words_english)
"always, ever, off, seems, was, sometimes, beyond, more, ‘re, he, whereby, elsewhere, above, part, behind, them, no, hence, whether, somehow, latterly, say, 've, well, something, whereas, yours, beside, becomes, afterwards, out, give, between, whither, amount, front, unless, due, seem, how, himself, two, about, being, for, also, 'd, down, see, will, namely, whence, the, quite, must, her, my, ten, who, us, by, put, its, ’re, three, per, else, within, twelve, did, below, most, few, get, not, never, same, several, some, full, neither, do, from, hundred, thereby, they, sometime, then, ’s, ’m, made, it, serious, have, your, alone, because, nor, seemed, until, our, throughout, side, as, if, this, already, during, make, hereupon, itself, once, what, hers, seeming, yourselves, four, myself, formerly, next, that, though, were, thru, move, whoever, take, which, where, only, into, eleven, too, please, up, wherever, last, over, along, amongst, a, towards, before, noone, ‘m, anywhere, can, former, whom, than, are, you, ourselves, these, with, still, be, all, yet, another, has, on, upon, when, am, name, and, his, through, fifty, indeed, ca, own, whole, doing, besides, further, whenever, herself, twenty, among, third, such, ‘ve, enough, eight, become, ’ve, whereupon, ‘ll, others, there, to, ours, thus, forty, nobody, five, rather, themselves, under, after, whereafter, could, nowhere, therefore, least, becoming, fifteen, mine, i, bottom, but, even, moreover, either, except, anything, used, became, much, thence, using, n‘t, just, since, meanwhile, yourself, re, both, around, without, very, across, often, in, thereupon, first, however, those, ‘d, or, regarding, show, call, of, nevertheless, been, their, n't, beforehand, one, ’ll, anyway, other, herein, empty, hereafter, top, should, everywhere, nothing, many, me, does, ’d, whatever, everything, toward, against, while, sixty, whose, onto, now, 'll, everyone, anyone, anyhow, cannot, had, perhaps, 's, 're, n’t, almost, although, keep, go, him, at, somewhere, none, six, each, may, 'm, really, would, nine, here, why, any, various, via, otherwise, is, back, every, again, she, wherein, ‘s, less, might, hereby, therein, latter, someone, we, so, done, an, together, mostly, thereafter"

This time, if we look at the list of French stopwords in NLTK:

", ".join(stopwords.words("french"))
'au, aux, avec, ce, ces, dans, de, des, du, elle, en, et, eux, il, ils, je, la, le, les, leur, lui, ma, mais, me, même, mes, moi, mon, ne, nos, notre, nous, on, ou, par, pas, pour, qu, que, qui, sa, se, ses, son, sur, ta, te, tes, toi, ton, tu, un, une, vos, votre, vous, c, d, j, l, à, m, n, s, t, y, été, étée, étées, étés, étant, étante, étants, étantes, suis, es, est, sommes, êtes, sont, serai, seras, sera, serons, serez, seront, serais, serait, serions, seriez, seraient, étais, était, étions, étiez, étaient, fus, fut, fûmes, fûtes, furent, sois, soit, soyons, soyez, soient, fusse, fusses, fût, fussions, fussiez, fussent, ayant, ayante, ayantes, ayants, eu, eue, eues, eus, ai, as, avons, avez, ont, aurai, auras, aura, aurons, aurez, auront, aurais, aurait, aurions, auriez, auraient, avais, avait, avions, aviez, avaient, eut, eûmes, eûtes, eurent, aie, aies, ait, ayons, ayez, aient, eusse, eusses, eût, eussions, eussiez, eussent'

We can see that this list is fairly short and would benefit from being more comprehensive. The one provided by SpaCy is closer to what we would expect.

stop_words_french = nlp.Defaults.stop_words
", ".join(stop_words_french)
"quatorze, son, environ, à, delà, certaine, un, auxquelles, du, sur, permet, en, dix-neuf, sans, lequel, quant-à-soi, quarante, y, pourrait, soit, etant, qui, maintenant, ô, desquelles, nous-mêmes, lui-même, egalement, hou, cinquante, vos, voici, elles-memes, pourrais, t', semblent, hé, sont, suivants, leur, celle-ci, quatre-vingt, les, eu, vers, hem, ont, suivre, semblable, merci, cinq, parfois, differentes, anterieures, aurait, devers, tel, sauf, quelques, sent, vingt, peux, pres, ma, unes, avait, notre, quant, si, pour, différents, toutes, da, assez, lès, avec, ou, et, quelqu'un, sera, mille, peu, troisièmement, bas, ces, dont, voilà, facon, hormis, suit, reste, également, plutot, il, plutôt, parmi, desquels, m’, auraient, allaient, chacun, souvent, nombreux, puisque, ça, moi-meme, font, restent, hui, ni, seize, elle-même, mêmes, auront, suffisante, ce, sienne, cent, étant, seuls, aussi, quoi, seules, stop, leurs, alors, treize, different, huitième, quels, quelque, telles, ho, est, hi, precisement, sien, préalable, specifique, eux-mêmes, abord, enfin, autrui, meme, vous, dire, tente, celles, nombreuses, soi-même, pas, mien, dedans, serait, quiconque, car, celui-ci, diverses, antérieur, anterieure, parlent, te, proche, nôtres, différent, as, neuvième, onze, d’, suivantes, mais, puis, étaient, laquelle, nôtre, relativement, chaque, était, depuis, dix-huit, toute, quand, soi-meme, différente, suivant, basee, ouias, ouste, sa, feront, huit, spécifique, comment, directement, quel, neanmoins, uns, pouvait, malgré, attendu, deuxième, derrière, na, tiennes, encore, sept, aura, dixième, tend, a, dits, doivent, autres, lorsque, apres, aux, desormais, toi-même, nous, ci, hue, la, lui, durant, vé, ouverts, lui-meme, dessous, celles-là, té, suis, faisant, auxquels, étais, dix, revoilà, ès, jusqu, c’, suivante, bat, celle-là, donc, tenant, avant, suffit, seul, elle-meme, avons, on, deja, etait, hors, ne, cependant, plusieurs, ton, quinze, doit, gens, que, onzième, longtemps, l', toi, douzième, sixième, ta, allons, votres, differente, certaines, tres, parler, celui, soixante, des, retour, elle, nouveau, miens, comme, duquel, sinon, deux, désormais, mienne, moindres, antérieures, après, etre, afin, troisième, etais, eux, trente, lesquelles, l’, seule, devra, où, cela, miennes, ait, ai, rendre, m', soi, peuvent, jusque, t’, telle, toi-meme, relative, auquel, juste, ils, ouvert, deuxièmement, très, celui-là, ceci, rend, première, procedant, personne, possible, premier, exactement, précisement, envers, tes, je, suffisant, effet, seulement, ainsi, semble, possibles, quatrièmement, j', i, seraient, vas, celle, celles-la, etc, tellement, ses, celle-la, tien, directe, s', differents, cette, parle, quatre, siennes, celles-ci, eh, tenir, elles, etaient, avais, nos, vont, cinquantaine, hep, dehors, maint, surtout, pense, dix-sept, ayant, tout, houp, divers, avoir, or, quelles, avaient, fais, moi-même, o, malgre, selon, devant, revoici, tels, prealable, certes, entre, notamment, vôtre, semblaient, près, siens, être, vôtres, cinquantième, revoila, chez, parce, va, cet, premièrement, debout, c', me, tu, même, outre, quelle, autre, derriere, moins, lesquels, es, aupres, par, ah, diverse, compris, toujours, partant, qu’, qu', douze, chacune, certains, touchant, quatrième, quoique, concernant, dans, se, dessus, pourquoi, ceux-ci, dit, voila, mon, trois, cinquième, d', faisaient, ceux, plus, nul, memes, anterieur, sous, autrement, ha, une, ceux-là, vu, importe, excepté, votre, specifiques, déjà, pendant, tienne, au, 
j’, elles-mêmes, quelconque, septième, dite, six, le, peut, lors, vais, aie, vous-mêmes, seront, dès, mes, pu, restant, fait, certain, là, façon, sait, via, tiens, celui-la, combien, laisser, déja, â, ouverte, tant, moi, tous, n’, spécifiques, dejà, s’, de, néanmoins, n', différentes, antérieure"
Exercise 3: Text Cleaning
  1. Take Dumas’ work and clean it using Spacy. Generate the word cloud again and draw your conclusions.
  2. Perform the same task on the Anglo-Saxon dataset. Ideally, you should be able to use the SpaCy pipeline functionality.
# Function to clean the text
def clean_text(doc):
    # Tokenize, remove stop words and punctuation, and lemmatize
    cleaned_tokens = [
        token.lemma_ for token in doc if not token.is_stop and not token.is_punct
    ]
    # Join tokens back into a single string
    cleaned_text = " ".join(cleaned_tokens)
    return cleaned_text
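One way to apply this function at scale (a sketch; the Text_clean column name is illustrative) is to stream the horror sentences through the English pipeline with SpaCy's pipe, and to run the French pipeline on the beginning of Dumas' novel:

# Stream each sentence through the English pipeline and clean it
horror["Text_clean"] = [clean_text(doc) for doc in nlp_english.pipe(horror["Text"])]

# Same cleaning on the beginning of Dumas' novel, with the French pipeline
dumas_clean = clean_text(nlp(dumas[:30000]))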

This cleaning is starting to pay off: more meaningful words begin to stand out, notably the names of the characters (Dantès, Danglars, etc.):

Figure 4.1: Word cloud produced from The Count of Monte Cristo after cleaning
Figure 4.2: Lovecraft
Figure 4.3: Poe
Figure 4.4: Shelley

4.2 Stemming and Lemmatization

To go further in text harmonization, it is possible to establish equivalence classes between words. For example, when conducting frequency analysis, one might want to treat “cheval” and “chevaux” as equivalent. Depending on the context, different forms of the same word (plural, singular, conjugated) can be treated as equivalent and replaced with a canonical form.

There are two main approaches:

  • Lemmatization, which requires knowledge of grammatical roles (example: “chevaux” becomes “cheval”);
  • Stemming, which is more rudimentary but faster, especially when dealing with spelling errors. In this case, “chevaux” might become “chev”, but that could also match “chevet” or “cheveux”.

This approach has the advantage of reducing the vocabulary size that both the computer and modeler must handle. Several stemming algorithms exist, including the Porter Stemming Algorithm and the Snowball Stemming Algorithm.

Note

To access the necessary corpus for lemmatization, you need to download it the first time using the following commands:

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

Let’s take this character string:

"naples. comme d'habitude, un pilote côtier partit aussitôt du port, rasa le château d'if, et alla aborder le navire entre le cap de morgion et l'île de rion. aussitôt, co"

The stemmed version is as follows:

"napl,.,comm,d'habitud,,,un,pilot,côti,part,aussitôt,du,port,,,ras,le,château,d'if,,,et,alla,abord,le,navir,entre,le,cap,de,morgion,et,l'îl,de,rion,.,aussitôt,,,co"

At this stage, the words become less intelligible for humans but can still be understandable for machines. This choice is not trivial, and its relevance depends on the specific use case.

Lemmatizers allow for more nuanced harmonization. They rely on knowledge bases, such as WordNet, an open lexical database. For instance, the words “women”, “daughters”, and “leaves” will be lemmatized as follows:

from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

for word in ["women","daughters", "leaves"]:
    print(f"The lemmatized form of {word} is: {lemm.lemmatize(word)}")
The lemmatized form of women is: woman
The lemmatized form of daughters is: daughter
The lemmatized form of leaves is: leaf
Exercise 4: Lemmatization with nltk

Following the previous example, use a WordNetLemmatizer on the dumas[1030:1200] corpus and observe the result.
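A minimal sketch of one possible answer:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemm = WordNetLemmatizer()

lemmatized = ", ".join(
    lemm.lemmatize(token) for token in word_tokenize(dumas[1030:1200], language="french")
)
print(lemmatized)

Since WordNet only covers English, most French words pass through unchanged, which is what the output below shows.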

The lemmatized version of this small excerpt from Dumas’ work is as follows:

"naples, ., comme, d'habitude, ,, un, pilote, côtier, partit, aussitôt, du, port, ,, rasa, le, château, d'if, ,, et, alla, aborder, le, navire, entre, le, cap, de, morgion, et, l'île, de, rion, ., aussitôt, ,, co"

4.3 Limitation

In frequency-based approaches, where the goal is to find similarity between texts based on term co-occurrence, the question of forming equivalence classes is fundamental. Words are either identical or different—there is no subtle gradation. For instance, one must decide whether “python” and “pythons” are equivalent or not, without any nuance distinguishing “pythons”, “anaconda”, or “table” from “python”. Modern approaches, which no longer rely solely on word frequency, allow for more nuance in synthesizing the information present in textual data.

Additional information

Below is the environment this chapter has been built and tested with.

Latest built version: 2025-05-26

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.15.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.8
aioitertools 0.12.0
aiosignal 1.3.1
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.9.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 24.2.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.3.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.35.23
botocore 1.35.23
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.1.9
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.3
certifi 2025.4.26
cffi 1.17.1
charset-normalizer 3.3.2
chromedriver-autoinstaller 0.6.4
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.0
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.1.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.14.0
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.19.6
fontawesomefree 6.6.0
fonttools 4.54.1
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.4.1
fsspec 2024.9.0
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.16.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.9
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.5.0
importlib_resources 6.4.5
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.7
langchain 0.3.25
langchain-community 0.3.9
langchain-core 0.3.61
langchain-text-splitters 0.3.8
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.4.0
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.26.1
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.13
mypy_extensions 1.1.0
narwhals 1.41.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.9.1
numpy 2.1.2
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.18
outcome 1.3.0.post0
OWSLib 0.33.0
packaging 24.1
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 0.5.6
Pebble 5.1.1
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.10
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
protobuf 4.25.3
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.22.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.6
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.23.0
pydantic 2.11.5
pydantic_core 2.33.2
pydantic-settings 2.9.1
Pygments 2.18.0
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.1.4
pyppmd 1.1.1
pyproj 3.7.0
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.2.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 14.0.0
rpds-py 0.25.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2024.9.0
s3transfer 0.10.2
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.13.0
seaborn 0.13.2
selenium 4.33.0
setuptools 74.1.2
shapely 2.0.6
shellingham 1.5.4
six 1.16.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.35
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.6
threadpoolctl 3.5.0
tifffile 2025.5.24
toolz 1.0.0
topojson 1.9
tornado 6.4.1
tqdm 4.67.1
traitlets 5.14.3
trio 0.30.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.16.0
typing_extensions 4.13.2
typing-inspect 0.9.0
typing-inspection 0.4.1
tzdata 2024.2
Unidecode 1.4.0
url-normalize 2.2.1
urllib3 2.4.0
uv 0.7.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.16.0
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2024.9.0
yarl 1.13.1
yellowbrick 1.5
zict 3.0.0
zipp 3.20.2
zstandard 0.23.0


Footnotes

  1. To obtain the same results as shown below, you can set the argument random_state=21.

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.