Part 4: Natural Language Processing (NLP)

This part of the course introduces natural language processing (NLP), a scientific field at the crossroads of linguistics and statistics that has become central to data science as a result of the craze for generative AI. Using literary examples, this section first explores classic methods such as frequency analysis and the processing of textual corpora as bags of words. It then turns to language modelling, which opens the way to more advanced approaches. The aim of this chapter is to recall a few general points about the vast field of NLP.

Introduction | NLP

Author: Lino Galiana
Published: 2025-05-26

1 Introduction

The previous sections focused on acquiring cross-functional skills for working with data. Naturally, we have so far mostly focused on structured data—modest in size but already rich in analytical potential. This new section turns to a subject that, at first glance, may seem unlikely to be handled by computers—a topic of centuries-old philosophical debate, from Plato to Saussure: the richness of human language.

By distinguishing “language” (the capacity to express and communicate thought through signs) from a “tongue” (a conventional implementation of that capacity), we align ourselves with the field of linguistics and treat language as data. This opens the door to statistical and algorithmic analysis. Yet, even if statistical regularities exist, how can computers, ultimately limited to 0s and 1s, grasp an object as complex as language, one that takes humans years to understand and master?1

2 Natural Language Processing

Natural Language Processing (NLP) refers to the set of techniques that allow computers to understand, analyze, synthesize, and generate human language2.

NLP is a disciplinary field at the intersection of statistics and linguistics that has experienced significant growth in recent years, academically, operationally, and industrially. Some applications of these techniques have become essential in our daily lives, such as search engines, machine translation and, more recently, chatbots, whose development has accelerated rapidly since the launch of ChatGPT at the end of 2022.

3 Section Summary

This part of the course is dedicated to text data analysis, with literary 📖 examples for fun. It serves as a gradual introduction to the topic by focusing on the basic concepts necessary for later understanding of more advanced principles and sophisticated techniques3. This section mainly covers:

  • The challenges of cleaning textual fields and frequency analysis. This is somewhat old school NLP but understanding it is essential to progress further;
  • Language modeling using several approaches.

Before diving into the topic of embeddings, it’s important to understand the contributions and limitations of concepts like the bag of words or TF-IDF (term frequency - inverse document frequency). One of the main benefits of large language models—namely the richness of their contextual window that allows them to better grasp textual nuances and speaker intentionality—becomes clearer when the limitations of traditional NLP are understood.

As an introductory perspective, this course focuses on frequency-based approaches, especially the bag-of-words approach, to ease into the later exploration of the Pandora’s box that is embeddings.
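To make these notions concrete, here is a minimal sketch of the bag-of-words representation and a hand-rolled TF-IDF score. The three-sentence corpus is purely illustrative, and the implementation uses only the standard library:

```python
import math
from collections import Counter

# A toy corpus (made-up sentences, purely illustrative)
corpus = [
    "the count escaped from the castle",
    "the castle stood on the island",
    "the count planned his revenge",
]

# Bag of words: each document becomes a word-count vector,
# discarding word order entirely
bags = [Counter(doc.split()) for doc in corpus]

def tf_idf(term: str, doc_index: int) -> float:
    """Term frequency times inverse document frequency."""
    tf = bags[doc_index][term] / sum(bags[doc_index].values())
    n_docs_with_term = sum(1 for bag in bags if term in bag)
    idf = math.log(len(corpus) / n_docs_with_term)
    return tf * idf

# "the" appears in every document, so its idf (hence its tf-idf) is zero;
# "revenge" is specific to one document, so it scores higher than "count"
assert tf_idf("the", 0) == 0.0
assert tf_idf("revenge", 2) > tf_idf("count", 2)
```

Scikit-learn's `CountVectorizer` and `TfidfVectorizer` implement the same ideas (with smoothing variants) for real corpora.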

3.1 Text Cleaning and Frequency Analysis

Python is an excellent tool for text data analysis. Basic methods for transforming textual data or dictionaries, combined with specialized libraries such as NLTK and SpaCy, make it possible to perform normalization and text data analysis very efficiently. Python is much better equipped than R for text data analysis. There is a wealth of online resources on this subject, and the best way to learn remains hands-on practice with a corpus to clean.
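As a first taste, the whole normalization pipeline can be sketched with the standard library alone. The stopword list below is a deliberately tiny, hand-made one (including a couple of French function words from the excerpt); NLTK and SpaCy ship curated stopword lists, tokenizers, and lemmatizers:

```python
import re
from collections import Counter

# Hand-made toy stopword list (NLTK and SpaCy provide real ones)
STOPWORDS = {"the", "of", "and", "a", "to", "in", "at", "on", "de", "la"}

def normalize(text: str) -> list[str]:
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]

# An excerpt from the opening of The Count of Monte Cristo (English translation)
excerpt = ("On the 24th of February, 1815, the look-out at Notre-Dame "
           "de la Garde signalled the three-master, the Pharaon.")

tokens = normalize(excerpt)
counts = Counter(tokens)
assert "the" not in counts        # stopwords are gone
assert counts["pharaon"] == 1     # case flattened, punctuation stripped
```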

This section first revisits how to structure and clean a textual corpus through the bag of words approach. It aims to demonstrate how to turn a corpus into a tool suitable for statistical analysis:

  • It first introduces the challenges of text data cleaning through an analysis of The Count of Monte Cristo by Alexandre Dumas, showing how to quickly summarize the information available in a large volume of text data (as illustrated by the word cloud built from the novel);
  • It then offers a series of exercises on text cleaning based on the works of Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, aiming to highlight the specificity of each author’s vocabulary (for example, by comparing their use of fear-related terms). These exercises are available in the second chapter of the section.

This frequency-based analysis provides perspective on the nature of text data and on recurring issues in reducing the dimensionality of natural-language corpora. Just as descriptive statistics naturally lead to modeling, this frequency approach quickly creates the desire to identify the underlying rules of our text corpora.

3.2 Language Modeling

The remainder of this section introduces the challenges of language modeling. These are currently very popular due to the success of ChatGPT. However, before delving into large language models (LLMs)—those neural networks with billions of parameters trained on massive data volumes—it’s important to first understand some preliminary modeling techniques.

We begin by exploring an alternative approach that takes into account the context in which a word appears. The introduction of Latent Dirichlet Allocation (LDA) serves as an opportunity to present document modeling through topics. However, this approach has fallen out of favor in comparison to methods related to the concept of embedding.
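A minimal sketch of topic modeling with scikit-learn's `LatentDirichletAllocation` (gensim is another common choice); the four documents below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four made-up documents mixing two themes (sea voyage vs. imprisonment)
corpus = [
    "the ship sailed from the harbour at dawn",
    "the sailor watched the ship leave the harbour",
    "the prisoner dug a tunnel out of his cell",
    "the cell was dark and the prisoner waited",
]

# LDA is fit on raw word counts (a bag-of-words matrix), not TF-IDF
counts = CountVectorizer(stop_words="english").fit_transform(corpus)

# Ask for two latent topics; each document is modeled as a mixture of them
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# One row per document, one column per topic; each row sums to 1
assert doc_topics.shape == (4, 2)
assert abs(doc_topics[0].sum() - 1) < 1e-8
```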

Toward the end of this course section, we will introduce the challenge of transforming textual fields into numeric vector form. To do so, we will present the principle behind Word2Vec, which makes it possible to recognize, for instance, that Man and Woman are semantically close despite their significant syntactic distance. This chapter serves as a bridge to the concept of embedding, a major recent revolution in NLP, which enables corpora to be compared not only by syntactic similarity (do they share common words?) but also by semantic similarity (do they share a theme or meaning?)4. Covering Word2Vec will give curious learners a solid foundation to then explore transformer-based models, which are now the benchmark in NLP.
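To fix intuition, here is a toy illustration with hand-picked three-dimensional vectors. Real Word2Vec embeddings have hundreds of dimensions and are learned from a corpus, but the geometry of the comparison is the same:

```python
import math

# Hypothetical, hand-picked vectors chosen so that "man" and "woman"
# point in nearly the same direction while "ship" points elsewhere
embeddings = {
    "man":   [0.9, 0.1, 0.3],
    "woman": [0.8, 0.2, 0.3],
    "ship":  [0.1, 0.9, 0.7],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: close to 1 for nearly parallel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.hypot(*u) * math.hypot(*v)
    return dot / norms

# The strings "man" and "woman" share few letters,
# yet their vectors are far closer than "man" and "ship"
assert cosine(embeddings["man"], embeddings["woman"]) > cosine(embeddings["man"], embeddings["ship"])
```

Once a Word2Vec model has been trained, libraries such as gensim expose exactly this comparison through `model.wv.similarity("man", "woman")`.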

To Go Further

Research in the field of NLP is highly active. It is therefore advisable to stay curious and explore additional resources, as no single source can compile all knowledge—especially in a field as dynamic as NLP.

To deepen the skills discussed in this course, I strongly recommend this course by HuggingFace.

To understand the internal architecture of an LLM, this post by Sebastian Raschka is very helpful.

These chapters only scratch the surface of NLP use cases for data scientists. In public statistics, for instance, one major use case is automatic classification: converting free-text survey answers into the predefined fields of a nomenclature. This is an adaptation of multi-level classification problems to public statistics, a heavy user of standardized nomenclatures.

One example is a project on automated job classification using the PCS (socio-professional categories) typology, based on a model trained with the Fasttext library.
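As a toy sketch of this kind of classification task (using scikit-learn rather than Fasttext, with made-up job descriptions and invented category labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical free-text survey answers with invented category labels
texts = [
    "baker in a small bakery", "pastry chef and baker",
    "software developer in a bank", "backend python developer",
    "nurse in a public hospital", "night shift nurse",
]
labels = ["food", "food", "it", "it", "health", "health"]

# TF-IDF features fed into a linear classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Map a new free-text answer onto the predefined categories
prediction = clf.predict(["python developer"])[0]
assert prediction in clf.classes_
```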

Additional information

Latest built version: 2025-05-26

Python version used: 3.12.6


Footnotes

  1. While writing these lines, I asked ChatGPT to tell me a joke.

    Why do programmers hate the beach? Because the sand keeps changing paradigms!

    Not understanding it, I asked for an explanation. Here’s what I got:

    Of course, I’d be happy to explain! This joke plays on the double meaning of the word “paradigm” and is a pun involving programming and the beach.

    In programming, a “paradigm” refers to a model or method of programming, such as object-oriented or functional programming. Programmers are often familiar with different paradigms.

    On the other hand, in a beach context, “paradigm” is humorously misused to mean “tide”—alluding to the natural shifting of sand and water.

    The joke plays on the humorous frustration programmers might feel when adapting to new programming paradigms, just like how sand constantly shifts with the tides. By combining both meanings of “paradigm,” the joke creates an absurd and amusing scenario that will resonate with those familiar with programming and beaches.

    In the end, I found the explanation funnier than the joke itself.↩︎

  2. The concept of “natural language processing” has tended to replace the more specific term “text mining.” Text mining refers to the search for — or mining of — information within a set of texts. While it is one application area of NLP, it is not the only one. Advances in computing power and in the formalization of language have expanded the scope of fields where computational linguistics is applied.↩︎

  3. For example, the concept of embedding—the transformation of a text field into a multidimensional numeric vector—is central in NLP today but is only briefly mentioned here.↩︎

  4. An example of the value of this approach is shown in the chapter on embeddings.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.