1 Introduction
The previous sections focused on acquiring cross-functional skills for working with data. So far, we have mostly dealt with structured data: modest in size but already rich in analytical potential. This new section turns to a subject that, at first glance, may seem unlikely to be handled by computers, one that has fueled centuries of philosophical debate, from Plato to Saussure: the richness of human language.
By distinguishing “language” from “tongue”, that is, by defining the former as the capacity to express and communicate thought through signs and the latter as the conventional implementation of that capacity, we align ourselves with the field of linguistics and treat language as data. This opens the door to statistical and algorithmic analysis. Yet even if statistical regularities exist, how can computers, ultimately limited to just 0s and 1s, grasp an object as complex as language, one that takes humans years to understand and master?1
2 Natural Language Processing
Natural Language Processing (NLP) refers to the set of techniques that allow computers to understand, analyze, synthesize, and generate human language2.
NLP is a field at the intersection of statistics and linguistics that has experienced significant growth in recent years, academically, operationally, and industrially.
Some applications of these techniques have become essential in our daily lives, such as search engines, machine translation and, more recently, chatbots, whose development has accelerated rapidly since the launch of ChatGPT in December 2022.
3 Section Summary
This part of the course is dedicated to text data analysis, with literary examples 📖 for fun. It serves as a gradual introduction to the topic, focusing on the basic concepts needed to later understand more advanced principles and sophisticated techniques3. This section mainly covers:
- The challenges of cleaning textual fields and frequency analysis. This is somewhat old-school NLP, but understanding it is essential to progress further;
- Language modeling using several approaches.
Before diving into the topic of embeddings, it is important to understand the contributions and limitations of concepts like the bag of words or TF-IDF (term frequency - inverse document frequency). One of the main benefits of large language models, namely the richness of their context window, which allows them to better grasp textual nuances and speaker intent, becomes clearer once the limitations of traditional NLP are understood.
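To make these two notions concrete before the dedicated chapters, here is a minimal sketch of both representations with scikit-learn, on a toy corpus chosen purely for illustration:

```python
# Minimal sketch: bag-of-words vs. TF-IDF on a toy corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag of words: raw term counts; word order and context are discarded
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# TF-IDF: counts reweighted so that terms frequent in one document
# but rare across the corpus stand out
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```

Both representations turn each document into a fixed-length vector, which is exactly what makes statistical analysis possible, and exactly what discards the context that embeddings later recover.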
As an introductory perspective, this course focuses on frequency-based approaches, especially the bag-of-words approach, to ease into the later exploration of the Pandora’s box that is embeddings.
3.1 Text Cleaning and Frequency Analysis
Python is an excellent tool for text data analysis. Basic methods for transforming textual data or dictionaries, combined with specialized libraries such as NLTK and SpaCy, make it possible to normalize and analyze text data very efficiently. Python is much better equipped than R for text data analysis.
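For instance, a typical normalization pipeline with SpaCy looks like the following sketch (it assumes the small English model, en_core_web_sm, has been downloaded beforehand):

```python
# Minimal normalization sketch with SpaCy
# (assumes: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The sailors were running towards the ships.")

# Tokenization, lemmatization and stop-word/punctuation filtering in one pass
lemmas = [
    token.lemma_.lower()
    for token in doc
    if not token.is_stop and not token.is_punct
]
print(lemmas)  # e.g. ['sailor', 'run', 'ship']
```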
There is a wealth of online resources on this subject, and the best way to learn remains hands-on practice with a corpus to clean.
This section first revisits how to structure and clean a textual corpus through the bag of words approach. It aims to demonstrate how to turn a corpus into a tool suitable for statistical analysis:
- It first introduces the challenges of text data cleaning through an analysis of The Count of Monte Cristo by Alexandre Dumas, which shows how to quickly summarize the information available in a large volume of text data (as illustrated by ?@fig-wordcloud-dumas);
- It then offers a series of exercises on text cleaning based on the works of Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, aiming to highlight the specificity of each author’s vocabulary (for example, ?@fig-waffle-fear). These exercises are available in the second chapter of the section; a minimal sketch of this kind of cleaning and counting follows below.
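As a taste of those exercises, here is a cleaning-and-counting sketch with NLTK on a placeholder snippet (the chapters themselves work on the full novels):

```python
# Sketch of frequency analysis: clean, filter stop words, count
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

text = "It was a dark and stormy night; the rain fell in torrents..."

# Basic cleaning: lowercase, keep only alphabetic tokens
tokens = re.findall(r"[a-z]+", text.lower())

# Remove very frequent, uninformative words before counting
stop_words = set(stopwords.words("english"))
counts = Counter(t for t in tokens if t not in stop_words)
print(counts.most_common(5))
```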
This frequency-based analysis provides perspective on the nature of text data and on recurring issues in reducing the dimensionality of natural-language corpora. Just as descriptive statistics naturally lead to modeling, this frequency-based approach quickly raises the question of the underlying rules governing our text corpora.
3.2 Language Modeling
The remainder of this section introduces the challenges of language modeling, a topic that is currently very popular due to the success of ChatGPT. However, before delving into large language models (LLMs), those neural networks with billions of parameters trained on massive volumes of data, it is important to first understand some preliminary modeling techniques.
We begin by exploring an alternative approach that takes into account the context in which a word appears. The introduction of Latent Dirichlet Allocation (LDA) serves as an opportunity to present document modeling through topics, even though this approach has since fallen out of favor compared with embedding-based methods.
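As a preview, here is a minimal LDA sketch with scikit-learn on a toy corpus (the corpus and number of topics are illustrative assumptions; the dedicated chapter uses its own data):

```python
# Minimal LDA sketch: documents as mixtures of topics,
# topics as distributions over words
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the spacecraft reached orbit around the planet",
    "the rocket launch was delayed by weather",
    "the recipe calls for flour butter and sugar",
    "bake the cake for thirty minutes in the oven",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the most probable words for each inferred topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top}")
```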
Toward the end of this section of the course, we will introduce the challenge of transforming textual fields into numeric vectors. To do so, we will present the principle behind Word2Vec, which makes it possible, for instance, to identify that Man and Woman are semantically close despite their significant syntactic distance.
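To fix ideas, here is a minimal sketch of this principle with the Gensim library (an assumption made for illustration; the corpus below is a toy one, far too small to produce meaningful vectors in practice):

```python
# Sketch of the Word2Vec idea with Gensim (toy corpus, illustration only)
from gensim.models import Word2Vec

# In practice the model is trained on millions of sentences
sentences = [
    ["the", "man", "walked", "to", "work"],
    ["the", "woman", "walked", "to", "work"],
    ["the", "man", "spoke", "to", "the", "woman"],
    ["the", "king", "spoke", "to", "the", "queen"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

# Words used in similar contexts end up with similar vectors,
# so their cosine similarity is high despite no shared spelling
print(model.wv.similarity("man", "woman"))
```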
This chapter serves as a bridge to the concept of embedding, a major recent revolution in NLP. Embeddings enable the comparison of corpora not only by syntactic similarity (do they share common words?) but also by semantic similarity (do they share a theme or meaning?)4. Covering Word2Vec will give curious learners a solid foundation from which to explore transformer-based models, which are now the benchmark in NLP.
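As a glimpse of what such models offer, here is a hedged sketch using the sentence-transformers library (chosen here purely for illustration, not necessarily the tool used later in the course): semantically close sentences get close vectors even when they share few words.

```python
# Semantic comparison with a transformer-based model
# (sentence-transformers is an illustrative choice, not prescribed here)
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The weather is lovely today.",    # same meaning as the next sentence
    "It is a beautiful, sunny day.",   # few shared words with the first
    "The stock market fell sharply.",  # unrelated topic
]
embeddings = model.encode(sentences)

# Cosine similarity: high for the first pair, low for the unrelated pair
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```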
To Go Further
Research in the field of NLP is highly active. It is therefore advisable to stay curious and explore additional resources, as no single source can compile all knowledge—especially in a field as dynamic as NLP.
To deepen the skills discussed in this course, I strongly recommend this course by HuggingFace.
To understand the internal architecture of an LLM, this post by Sebastian Raschka is very helpful.
These chapters only scratch the surface of NLP use cases for data scientists. In public statistics, for instance, one major NLP use case is the application of automatic classification techniques to convert free-text survey answers into the predefined fields of a nomenclature. This is an adaptation of multi-level classification problems to public statistics, a heavy user of standardized nomenclatures.
One example is a project on automated job classification using the PCS (socio-professional categories) typology, based on a model trained with the Fasttext library.
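To give a rough idea of how such a supervised classifier is trained, here is a hedged sketch with the fasttext library; the training file and labels below are hypothetical and unrelated to the actual PCS project:

```python
# Sketch of supervised text classification with fastText
# (hypothetical training file, not the actual PCS project data)
import fasttext

# fastText expects one example per line, labels prefixed with __label__, e.g.:
#   __label__engineer develops software for an aerospace company
#   __label__baker prepares bread and pastries in a bakery
model = fasttext.train_supervised(input="jobs_train.txt")

# Predict the most likely category for a free-text job description
labels, probs = model.predict("teaches mathematics in a secondary school")
print(labels, probs)
```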
Additional information
This page has been tested on the environment described below.
Latest built version: 2025-05-26
Python version used:
'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package | Version |
---|---|
affine | 2.4.0 |
aiobotocore | 2.15.1 |
aiohappyeyeballs | 2.4.3 |
aiohttp | 3.10.8 |
aioitertools | 0.12.0 |
aiosignal | 1.3.1 |
alembic | 1.13.3 |
altair | 5.4.1 |
aniso8601 | 9.0.1 |
annotated-types | 0.7.0 |
anyio | 4.9.0 |
appdirs | 1.4.4 |
archspec | 0.2.3 |
asttokens | 2.4.1 |
attrs | 24.2.0 |
babel | 2.17.0 |
bcrypt | 4.2.0 |
beautifulsoup4 | 4.12.3 |
black | 24.8.0 |
blinker | 1.8.2 |
blis | 1.3.0 |
bokeh | 3.5.2 |
boltons | 24.0.0 |
boto3 | 1.35.23 |
botocore | 1.35.23 |
branca | 0.7.2 |
Brotli | 1.1.0 |
bs4 | 0.0.2 |
cachetools | 5.5.0 |
cartiflette | 0.1.9 |
Cartopy | 0.24.1 |
catalogue | 2.0.10 |
cattrs | 24.1.3 |
certifi | 2025.4.26 |
cffi | 1.17.1 |
charset-normalizer | 3.3.2 |
chromedriver-autoinstaller | 0.6.4 |
click | 8.1.7 |
click-plugins | 1.1.1 |
cligj | 0.7.2 |
cloudpathlib | 0.21.1 |
cloudpickle | 3.0.0 |
colorama | 0.4.6 |
comm | 0.2.2 |
commonmark | 0.9.1 |
conda | 24.9.1 |
conda-libmamba-solver | 24.7.0 |
conda-package-handling | 2.3.0 |
conda_package_streaming | 0.10.0 |
confection | 0.1.5 |
contextily | 1.6.2 |
contourpy | 1.3.0 |
cryptography | 43.0.1 |
cycler | 0.12.1 |
cymem | 2.0.11 |
cytoolz | 1.0.0 |
dask | 2024.9.1 |
dask-expr | 1.1.15 |
databricks-sdk | 0.33.0 |
dataclasses-json | 0.6.7 |
debugpy | 1.8.6 |
decorator | 5.1.1 |
Deprecated | 1.2.14 |
diskcache | 5.6.3 |
distributed | 2024.9.1 |
distro | 1.9.0 |
docker | 7.1.0 |
duckdb | 1.1.1 |
en_core_web_sm | 3.8.0 |
entrypoints | 0.4 |
et_xmlfile | 2.0.0 |
exceptiongroup | 1.2.2 |
executing | 2.1.0 |
fastexcel | 0.14.0 |
fastjsonschema | 2.21.1 |
fiona | 1.10.1 |
Flask | 3.0.3 |
folium | 0.19.6 |
fontawesomefree | 6.6.0 |
fonttools | 4.54.1 |
fr_core_news_sm | 3.8.0 |
frozendict | 2.4.4 |
frozenlist | 1.4.1 |
fsspec | 2024.9.0 |
geographiclib | 2.0 |
geopandas | 1.0.1 |
geoplot | 0.5.1 |
geopy | 2.4.1 |
gitdb | 4.0.11 |
GitPython | 3.1.43 |
google-auth | 2.35.0 |
graphene | 3.3 |
graphql-core | 3.2.4 |
graphql-relay | 3.2.0 |
graphviz | 0.20.3 |
great-tables | 0.12.0 |
greenlet | 3.1.1 |
gunicorn | 22.0.0 |
h11 | 0.16.0 |
h2 | 4.1.0 |
hpack | 4.0.0 |
htmltools | 0.6.0 |
httpcore | 1.0.9 |
httpx | 0.28.1 |
httpx-sse | 0.4.0 |
hyperframe | 6.0.1 |
idna | 3.10 |
imageio | 2.37.0 |
importlib_metadata | 8.5.0 |
importlib_resources | 6.4.5 |
inflate64 | 1.0.1 |
ipykernel | 6.29.5 |
ipython | 8.28.0 |
itsdangerous | 2.2.0 |
jedi | 0.19.1 |
Jinja2 | 3.1.4 |
jmespath | 1.0.1 |
joblib | 1.4.2 |
jsonpatch | 1.33 |
jsonpointer | 3.0.0 |
jsonschema | 4.23.0 |
jsonschema-specifications | 2025.4.1 |
jupyter-cache | 1.0.0 |
jupyter_client | 8.6.3 |
jupyter_core | 5.7.2 |
kaleido | 0.2.1 |
kiwisolver | 1.4.7 |
langchain | 0.3.25 |
langchain-community | 0.3.9 |
langchain-core | 0.3.61 |
langchain-text-splitters | 0.3.8 |
langcodes | 3.5.0 |
langsmith | 0.1.147 |
language_data | 1.3.0 |
lazy_loader | 0.4 |
libmambapy | 1.5.9 |
locket | 1.0.0 |
loguru | 0.7.3 |
lxml | 5.4.0 |
lz4 | 4.3.3 |
Mako | 1.3.5 |
mamba | 1.5.9 |
mapclassify | 2.8.1 |
marisa-trie | 1.2.1 |
Markdown | 3.6 |
markdown-it-py | 3.0.0 |
MarkupSafe | 2.1.5 |
marshmallow | 3.26.1 |
matplotlib | 3.9.2 |
matplotlib-inline | 0.1.7 |
mdurl | 0.1.2 |
menuinst | 2.1.2 |
mercantile | 1.2.1 |
mizani | 0.11.4 |
mlflow | 2.16.2 |
mlflow-skinny | 2.16.2 |
msgpack | 1.1.0 |
multidict | 6.1.0 |
multivolumefile | 0.2.3 |
munkres | 1.1.4 |
murmurhash | 1.0.13 |
mypy_extensions | 1.1.0 |
narwhals | 1.41.0 |
nbclient | 0.10.0 |
nbformat | 5.10.4 |
nest_asyncio | 1.6.0 |
networkx | 3.3 |
nltk | 3.9.1 |
numpy | 2.1.2 |
opencv-python-headless | 4.10.0.84 |
openpyxl | 3.1.5 |
opentelemetry-api | 1.16.0 |
opentelemetry-sdk | 1.16.0 |
opentelemetry-semantic-conventions | 0.37b0 |
orjson | 3.10.18 |
outcome | 1.3.0.post0 |
OWSLib | 0.33.0 |
packaging | 24.1 |
pandas | 2.2.3 |
paramiko | 3.5.0 |
parso | 0.8.4 |
partd | 1.4.2 |
pathspec | 0.12.1 |
patsy | 0.5.6 |
Pebble | 5.1.1 |
pexpect | 4.9.0 |
pickleshare | 0.7.5 |
pillow | 10.4.0 |
pip | 24.2 |
platformdirs | 4.3.6 |
plotly | 5.24.1 |
plotnine | 0.13.6 |
pluggy | 1.5.0 |
polars | 1.8.2 |
preshed | 3.0.10 |
prometheus_client | 0.21.0 |
prometheus_flask_exporter | 0.23.1 |
prompt_toolkit | 3.0.48 |
protobuf | 4.25.3 |
psutil | 6.0.0 |
ptyprocess | 0.7.0 |
pure_eval | 0.2.3 |
py7zr | 0.22.0 |
pyarrow | 17.0.0 |
pyarrow-hotfix | 0.6 |
pyasn1 | 0.6.1 |
pyasn1_modules | 0.4.1 |
pybcj | 1.0.6 |
pycosat | 0.6.6 |
pycparser | 2.22 |
pycryptodomex | 3.23.0 |
pydantic | 2.11.5 |
pydantic_core | 2.33.2 |
pydantic-settings | 2.9.1 |
Pygments | 2.18.0 |
PyNaCl | 1.5.0 |
pynsee | 0.1.8 |
pyogrio | 0.10.0 |
pyOpenSSL | 24.2.1 |
pyparsing | 3.1.4 |
pyppmd | 1.1.1 |
pyproj | 3.7.0 |
pyshp | 2.3.1 |
PySocks | 1.7.1 |
python-dateutil | 2.9.0 |
python-dotenv | 1.0.1 |
python-magic | 0.4.27 |
pytz | 2024.1 |
pyu2f | 0.1.5 |
pywaffle | 1.1.1 |
PyYAML | 6.0.2 |
pyzmq | 26.2.0 |
pyzstd | 0.16.2 |
querystring_parser | 1.2.4 |
rasterio | 1.4.3 |
referencing | 0.36.2 |
regex | 2024.9.11 |
requests | 2.32.3 |
requests-cache | 1.2.1 |
requests-toolbelt | 1.0.0 |
retrying | 1.3.4 |
rich | 14.0.0 |
rpds-py | 0.25.1 |
rsa | 4.9 |
rtree | 1.4.0 |
ruamel.yaml | 0.18.6 |
ruamel.yaml.clib | 0.2.8 |
s3fs | 2024.9.0 |
s3transfer | 0.10.2 |
scikit-image | 0.24.0 |
scikit-learn | 1.5.2 |
scipy | 1.13.0 |
seaborn | 0.13.2 |
selenium | 4.33.0 |
setuptools | 74.1.2 |
shapely | 2.0.6 |
shellingham | 1.5.4 |
six | 1.16.0 |
smart-open | 7.1.0 |
smmap | 5.0.0 |
sniffio | 1.3.1 |
sortedcontainers | 2.4.0 |
soupsieve | 2.5 |
spacy | 3.8.4 |
spacy-legacy | 3.0.12 |
spacy-loggers | 1.0.5 |
SQLAlchemy | 2.0.35 |
sqlparse | 0.5.1 |
srsly | 2.5.1 |
stack-data | 0.6.2 |
statsmodels | 0.14.4 |
tabulate | 0.9.0 |
tblib | 3.0.0 |
tenacity | 9.0.0 |
texttable | 1.7.0 |
thinc | 8.3.6 |
threadpoolctl | 3.5.0 |
tifffile | 2025.5.24 |
toolz | 1.0.0 |
topojson | 1.9 |
tornado | 6.4.1 |
tqdm | 4.67.1 |
traitlets | 5.14.3 |
trio | 0.30.0 |
trio-websocket | 0.12.2 |
truststore | 0.9.2 |
typer | 0.16.0 |
typing_extensions | 4.13.2 |
typing-inspect | 0.9.0 |
typing-inspection | 0.4.1 |
tzdata | 2024.2 |
Unidecode | 1.4.0 |
url-normalize | 2.2.1 |
urllib3 | 2.4.0 |
uv | 0.7.8 |
wasabi | 1.1.3 |
wcwidth | 0.2.13 |
weasel | 0.4.1 |
webdriver-manager | 4.0.2 |
websocket-client | 1.8.0 |
Werkzeug | 3.0.4 |
wheel | 0.44.0 |
wordcloud | 1.9.3 |
wrapt | 1.16.0 |
wsproto | 1.2.0 |
xgboost | 2.1.1 |
xlrd | 2.0.1 |
xyzservices | 2024.9.0 |
yarl | 1.13.1 |
yellowbrick | 1.5 |
zict | 3.0.0 |
zipp | 3.20.2 |
zstandard | 0.23.0 |
View file history
SHA | Date | Author | Description |
---|---|---|---|
d6b67125 | 2025-05-23 18:03:48 | Lino Galiana | Traduction des chapitres NLP (#603) |
ff42cf23 | 2024-04-25 20:05:33 | linogaliana | Editorisalisation NLP |
005d89b8 | 2023-12-20 17:23:04 | Lino Galiana | Finalise l’affichage des statistiques Git (#478) |
4cd44f35 | 2023-12-11 17:37:50 | Antoine Palazzolo | Relecture NLP (#474) |
deaafb6f | 2023-12-11 13:44:34 | Thomas Faria | Relecture Thomas partie NLP (#472) |
1f23de28 | 2023-12-01 17:25:36 | Lino Galiana | Stockage des images sur S3 (#466) |
a1ab3d94 | 2023-11-24 10:57:02 | Lino Galiana | Reprise des chapitres NLP (#459) |
7bd768a6 | 2023-08-28 09:14:55 | linogaliana | Erreur image |
862ea4b3 | 2023-08-28 11:07:31 | Lino Galiana | Ajoute référence au post de Raschka (#398) |
3bdf3b06 | 2023-08-25 11:23:02 | Lino Galiana | Simplification de la structure 🤓 (#393) |
f2905a7d | 2023-08-11 17:24:57 | Lino Galiana | Introduction de la partie NLP (#388) |
5d4874a8 | 2023-08-11 15:09:33 | Lino Galiana | Pimp les introductions des trois premières parties (#387) |
f21a24d3 | 2023-07-02 10:58:15 | Lino Galiana | Pipeline Quarto & Pages 🚀 (#365) |
a408cc96 | 2023-02-01 09:07:27 | Lino Galiana | Ajoute bouton suggérer modification (#347) |
164fa689 | 2022-11-30 09:13:45 | Lino Galiana | Travail partie NLP (#328) |
495599d7 | 2021-12-19 18:33:05 | Lino Galiana | Des éléments supplémentaires dans la partie NLP (#202) |
4f675284 | 2021-12-12 08:37:21 | Lino Galiana | Improve website appareance (#194) |
4cdb759c | 2021-05-12 10:37:23 | Lino Galiana | :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105) |
d164635d | 2020-12-08 16:22:00 | Lino Galiana | :books: Première partie NLP (#87) |
Footnotes
While writing these lines, I asked ChatGPT to tell me a joke. “Why do programmers hate the beach? Because the sand keeps changing paradigms!”
Not understanding it, I asked for an explanation. Here’s what I got:
Of course, I’d be happy to explain! This joke plays on the double meaning of the word “paradigm” and is a pun involving programming and the beach.
In programming, a “paradigm” refers to a model or method of programming, such as object-oriented or functional programming. Programmers are often familiar with different paradigms.
On the other hand, in a beach context, “paradigm” is humorously misused to mean “tide”—alluding to the natural shifting of sand and water.
The joke plays on the humorous frustration programmers might feel when adapting to new programming paradigms, just like how sand constantly shifts with the tides. By combining both meanings of “paradigm,” the joke creates an absurd and amusing scenario that will resonate with those familiar with programming and beaches.
In the end, I found the explanation funnier than the joke itself.↩︎
The concept of “natural language processing” has tended to replace the more specific term “text mining.” Text mining refers to the search for — or mining of — information within a set of texts. While it is one application area of NLP, it is not the only one. Advances in computing power and in the formalization of language have expanded the scope of fields where computational linguistics is applied.↩︎
For example, the concept of embedding—the transformation of a text field into a multidimensional numeric vector—is central in NLP today but is only briefly mentioned here.↩︎
An example of the value of this approach can be seen in ?@fig-relevanc-table-embedding.↩︎
Citation
BibTeX citation:
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.