Frequentist analysis using the bag-of-words approach: forces and limitations

This chapter takes a closer look at frequentist analysis applied to textual data by presenting the challenges in the field of information retrieval. After presenting the cleaning and vectorisation of texts using TF-IDF, it discusses the limitations of the bag of words approach and paves the way for embeddings, vectors representing textual objects by proximity of meaning, the subject of the next chapter.
NLP

Author: Lino Galiana
Published: 2025-05-26

If you want to try the examples in this tutorial:
View on GitHub · Onyxia · Open In Colab

To move forward in this chapter, we need to perform some preliminary installations:

!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

It is also useful to define the following function, taken from our previous chapter:

def clean_text(doc):
    # Tokenize, remove stop words and punctuation, and lemmatize
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    # Join tokens back into a single string
    cleaned_text = ' '.join(cleaned_tokens)
    return cleaned_text
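
Note that clean_text expects a spaCy Doc object rather than a raw string, so texts must first go through a spaCy pipeline. A minimal illustration, assuming the English model downloaded above:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were running through the old houses.")
clean_text(doc)  # roughly: 'cat run old house'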

1 Introduction

Previously, we saw the importance of cleaning data to filter down the volume of information present in unstructured data. The goal of this chapter is to deepen our understanding of the frequency-based approach applied to text data. We will explore how this frequentist analysis helps summarize the information contained within a text corpus. We’ll also look at how to refine the bag of words approach by taking into account the order or proximity of terms within a sentence.

1.1 Data

We will reuse the English-language dataset from the previous chapter, which includes texts from the gothic authors Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS).

import pandas as pd

url='https://github.com/GU4243-ADS/spring2018-project1-ginnyqg/raw/master/data/spooky.csv'
# 1. Import the data
horror = pd.read_csv(url, encoding='latin-1')
# 2. Capitalize the column names
horror.columns = horror.columns.str.capitalize()
# 3. Remove the "id" prefix
horror['ID'] = horror['Id'].str.replace("id", "")
horror = horror.set_index('Id')

2 The TF-IDF Measure (term frequency - inverse document frequency)

2.1 The Document-Term Matrix

As mentioned earlier, we construct a synthetic representation of our corpus as a bag of words, where words are sampled more or less frequently depending on their appearance frequency. This is, of course, a simplified representation of reality: word sequences are not just random independent words.

However, before addressing those limitations, we should complete the bag-of-words approach. The most characteristic representation of this paradigm is the document-term matrix, mainly used to compare corpora. It involves creating a matrix where each document is represented by the presence or absence of terms in our corpus. The idea is to count how often words (terms, in columns) appear in each sentence or phrase (documents, in rows). This matrix then becomes a numerical representation of the text data.

Consider a corpus made up of the following three sentences:

  • The practice of knitting and crocheting
  • Passing on the passion for stamps
  • Living off one’s passion

The corresponding document-term matrix is:

| Sentence | and | crocheting | for | knitting | living | of | one’s | on | passion | passing | practice | stamps | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The practice of knitting and crocheting | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Passing on the passion for stamps | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| Living off one’s passion | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

Each sentence in the corpus is thus associated with a numeric vector. For instance, the sentence “The practice of knitting and crocheting”, which is meaningless to a machine on its own, becomes a numeric vector it can work with: [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1]. This is a sparse representation of language, since each document (row) only contains a small portion of the total vocabulary (the columns). Words absent from a document are represented as zeros, hence a sparse vector. As we’ll see later, this numeric representation is very different from modern embedding approaches, which rely on dense representations.
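
In practice, we rarely build this matrix by hand. As a minimal sketch, Scikit-Learn’s CountVectorizer reproduces the same logic on the three sentences above (its default tokenization and column ordering may differ slightly from the hand-built table):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The practice of knitting and crocheting",
    "Passing on the passion for stamps",
    "Living off one's passion"
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)  # sparse document-term matrix
pd.DataFrame(
    dtm.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=corpus
)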

2.2 Use for Information Retrieval

Different documents can then be compared based on these measures. This is one of the methods used by search engines, although the most advanced ones rely on far more sophisticated approaches. The tf-idf metric (term frequency–inverse document frequency) allows for calculating a relevance score between a search term and a document using two components:

\[ \text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D) \]

Let \(t\) be a specific term (e.g., a word), \(d\) a specific document, and \(D\) the entire set of documents in the corpus.

  • The tf component computes a function that increases with the frequency of the search term in the document under consideration;

  • The idf component computes a function that decreases with the frequency of the term across the entire document set (or corpus).

  • The first part (term frequency, TF) is the frequency of occurrence of term \(t\) in document \(d\). There are normalization strategies available to avoid biasing the score in favor of longer documents.

\[ \text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \]

where \(f_{t,d}\) is the raw count of how many times term \(t\) appears in document \(d\), and the denominator is the total number of terms in document \(d\).

  • The second part (inverse document frequency, IDF) measures the rarity—or conversely, the commonness—of a term across the corpus. If \(N\) is the total number of documents in the corpus \(D\), this part of the metric is given by

\[ \text{idf}(t, D) = \log \left( \frac{N}{|\{d \in D : t \in d\}|} \right) \]

The denominator \(|\{d \in D : t \in d\}|\) corresponds to the number of documents in which the term \(t\) appears. The rarer the term across the corpus, the more weight its presence in a given document carries.
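
To fix ideas, with \(N = 3\) documents and a term appearing in only one of them, \(\text{idf}(t, D) = \log(3/1) \approx 0.48\) (using base-10 logarithms), whereas a term present in all three documents gets \(\text{idf}(t, D) = \log(3/3) = 0\): ubiquitous terms contribute nothing to the score.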

Many search engines use this logic to find the most relevant documents for a given query. A notable example is Elasticsearch, software widely used to build powerful search engines. To rank the documents most relevant to a search term, it relies on the BM25 ranking function, a more advanced relative of the TF-IDF measure.

2.3 Example

Let’s illustrate this with a small corpus. The following code implements a TF-IDF metric. It slightly deviates from the standard definition to avoid division by zero.

import numpy as np

# Example documents
documents = [
    "Le corbeau et le renard",
    "Rusé comme un renard",
    "Le chat est orange comme un renard"
]

# Tokenization (simple whitespace split)
def preprocess(doc):
    return doc.lower().split()

tokenized_docs = [preprocess(doc) for doc in documents]

# Term frequency (TF)
def term_frequency(term, tokenized_doc):
    term_count = tokenized_doc.count(term)
    return term_count / len(tokenized_doc)

# Document frequency (DF): number of documents containing the term
def document_frequency(term, tokenized_docs):
    return sum(1 for doc in tokenized_docs if term in doc)

# Calculate inverse document frequency (IDF)
def inverse_document_frequency(word, corpus):
    # Add 1 to numerator and denominator to avoid division by zero
    count_of_documents = len(corpus) + 1
    count_of_documents_with_word = sum([1 for doc in corpus if word in doc]) + 1
    idf = np.log10(count_of_documents/count_of_documents_with_word) + 1
    return idf

# Calculate TF-IDF scores in each document
def tf_idf_term(term):
  tf_idf_scores = pd.DataFrame(
    [
      [
      term_frequency(term, doc),
      inverse_document_frequency(term, tokenized_docs)
      ] for doc in tokenized_docs
    ],
    columns = ["TF", "IDF"]
  )
  tf_idf_scores["TF-IDF"] = tf_idf_scores["TF"] * tf_idf_scores["IDF"]
  return tf_idf_scores

Let’s begin by computing the TF-IDF score of the word “chat” (cat) for each document. Naturally, it is the third document—the only one where the word appears—that has the highest score:

tf_idf_term("chat")
TF IDF TF-IDF
0 0.000000 1.30103 0.000000
1 0.000000 1.30103 0.000000
2 0.142857 1.30103 0.185861

What about the term “renard” (fox in French) which appears in all the documents (making the \(\text{idf}\) component equal to 1)? In this case, the document where the word appears most frequently—in this example, the second document—has the highest score.

tf_idf_term("renard")
TF IDF TF-IDF
0 0.200000 1.0 0.200000
1 0.250000 1.0 0.250000
2 0.142857 1.0 0.142857

2.4 Application

The previous example doesn’t scale very well. Fortunately, Scikit-Learn provides an implementation of TF-IDF vectorization, which we can explore in a new exercise.

Exercise 1: TF-IDF Frequency Calculation
  1. Use the TF-IDF vectorizer from scikit-learn to transform your corpus into a document x terms matrix. Use the stop_words option to avoid inflating the matrix size. Name the model tfidf and the resulting dataset tfs.
  2. After constructing the document x terms matrix with the code below, find the rows where terms matching abandon are non-zero.
  3. Identify the 50 excerpts where the TF-IDF score for the word “fear” is highest and their associated authors. Determine the distribution of authors among these 50 documents.
  4. Inspect the top 10 scores where TF-IDF for “fear” is highest.
Hint for question 2
feature_names = tfidf.get_feature_names_out()
corpus_index = [n for n in list(tfidf.vocabulary_.keys())]
horror_dense = pd.DataFrame(tfs.todense(), columns=feature_names)
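
One possible way to tackle question 1 (a sketch: we pass spaCy’s English stop-word list to the vectorizer, which is presumably how the stop_words shown below were obtained):

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_sm")
stop_words = list(nlp.Defaults.stop_words)

tfidf = TfidfVectorizer(stop_words=stop_words)
tfs = tfidf.fit_transform(horror['Text'])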

The vectorizer obtained at the end of question 1 is as follows:

TfidfVectorizer(stop_words=['anyway', 'what', 'elsewhere', 'whereby',
                            'something', 'latterly', 'here', 'them', 'seem',
                            'five', 'these', 'sometime', 'call', 'although',
                            'also', 'were', 'just', 'via', '‘d', 'make',
                            'unless', 'nevertheless', 'alone', 'enough', 'an',
                            'around', 'used', 'through', 'first', 'show', ...])
/opt/conda/lib/python3.12/site-packages/sklearn/feature_extraction/text.py:406: UserWarning:

Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ll', 've'] not in stop_words.
aaem ab aback abaft abandon abandoned abandoning abandonment abaout abased ... zodiacal zoilus zokkar zone zones zopyrus zorry zubmizzion zuro á¼
0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.267616 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24783 columns

The lines where the term “abandon” appears are as follows (question 2):

Index([    4,   116,   215,   571,   839,  1042,  1052,  1069,  2247,  2317,
        2505,  3023,  3058,  3245,  3380,  3764,  3886,  4425,  5289,  5576,
        5694,  6812,  7500,  9013,  9021,  9077,  9560, 11229, 11395, 11451,
       11588, 11827, 11989, 11998, 12122, 12158, 12189, 13666, 15259, 16516,
       16524, 16759, 17547, 18019, 18072, 18126, 18204, 18251],
      dtype='int64')

The document-term matrix associated with these is as follows:

aaem ab aback abaft abandon abandoned abandoning abandonment abaout abased ... zodiacal zoilus zokkar zone zones zopyrus zorry zubmizzion zuro á¼
4 0.0 0.0 0.0 0.0 0.000000 0.267616 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
116 0.0 0.0 0.0 0.0 0.000000 0.359676 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
215 0.0 0.0 0.0 0.0 0.249090 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
571 0.0 0.0 0.0 0.0 0.000000 0.153280 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
839 0.0 0.0 0.0 0.0 0.312172 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24783 columns

Here we notice the drawback of not applying stemming or lemmatization: the variations of “abandon” are spread across many columns. “abandoned” is treated as being just as different from “abandon” as it is from “fear”. This is one of the limitations of the bag-of-words approach.
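
One way to mitigate this would be to stem or lemmatize the corpus before vectorizing it, as in the previous chapter. A quick sketch with NLTK’s Snowball stemmer shows how these variants would collapse into a single column:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
[stemmer.stem(w) for w in ["abandon", "abandoned", "abandoning", "abandonment"]]
# all four variants map to the same stem, so they would share a single column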

The output for question 3 (here simply confirming that 50 excerpts were selected) is:

Text    50
dtype: int64

The excerpts with the highest scores (question 4) are as follows:

['We could not fear we did not.',
 '"And now I do not fear death.',
 'Be of heart and fear nothing.',
 'Indeed I had no fear on her account.',
 'I smiled, for what had I to fear?',
 'I did not like everything about what I saw, and felt again the fear I had had.',
 'At length, in an abrupt manner she asked, "Where is he?" "O, fear not," she continued, "fear not that I should entertain hope Yet tell me, have you found him?',
 'I have not the slightest fear for the result.',
 '"I fear you are right there," said the Prefect.']

We observe that the highest scores correspond either to short excerpts where the word appears once, or to longer excerpts where the word “fear” appears multiple times.

3 An Initial Enhancement of the Bag-of-Words Approach: n-grams

We previously identified two main limitations of the bag-of-words approach: its disregard for context and its sparse representation of language, which sometimes leads to weak similarity matches between texts. However, within the bag-of-words paradigm, it is possible to account for the sequence of tokens using n-grams.

To recap, in the traditional bag of words approach, word order doesn’t matter. A text is treated as a collection of words drawn independently, with varying frequencies based on their occurrence probabilities. Drawing a specific word doesn’t affect the likelihood of subsequent words.

A way to introduce relationships between sequences of tokens is through n-grams. This method considers not only word frequencies but also which words follow others. It’s particularly useful for disambiguating homonyms. The computation of n-grams 1 is the simplest method for incorporating context.
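
As a quick illustration, here is a minimal sketch of n-gram extraction using NLTK’s ngrams helper (any tokenized sentence will do):

from nltk.util import ngrams

tokens = "the practice of knitting and crocheting".split()
list(ngrams(tokens, 2))  # bigrams: [('the', 'practice'), ('practice', 'of'), ...]
list(ngrams(tokens, 3))  # trigrams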

To carry out this type of analysis, we need to download an additional corpus:

import nltk
nltk.download('genesis')
nltk.corpus.genesis.words('english-web.txt')
[nltk_data] Downloading package genesis to /github/home/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
['In', 'the', 'beginning', 'God', 'created', 'the', ...]

NLTK provides methods for incorporating context. To do this, we compute n-grams—that is, sequences of n consecutive word co-occurrences. Generally, we limit ourselves to bigrams or at most trigrams:

  • Classification models, sentiment analysis, document comparison, etc., that rely on n-grams with large n quickly face sparse data issues, reducing their predictive power;
  • Performance drops quickly as n increases, and data storage costs increase substantially (roughly n times larger than the original dataset).

Let’s quickly examine the context in which the word fear appears in the works of Edgar Allan Poe (EAP). To do this, we first transform the EAP corpus into NLTK tokens:

eap_clean = horror.loc[horror["Author"] == "EAP"]
eap_clean = ' '.join(eap_clean['Text'])
tokens = eap_clean.split()
print(tokens[:10])
text = nltk.Text(tokens)
print(text)
['This', 'process,', 'however,', 'afforded', 'me', 'no', 'means', 'of', 'ascertaining', 'the']
<Text: This process, however, afforded me no means of...>

You will need the functions BigramCollocationFinder.from_words and BigramAssocMeasures.likelihood_ratio:

Exercise 2: n-grams and the Context of the Word “fear”
  1. Use the concordance method to display the context in which the word fear appears.
  2. Select and display the top collocations, for instance using the likelihood ratio criterion.

When two words are strongly associated, it may be due to their rarity. Therefore, it’s often necessary to apply filters—for example, ignore bigrams that occur fewer than 5 times in the corpus.

  3. Repeat the previous task using the BigramCollocationFinder model, followed by the apply_freq_filter method to retain only bigrams appearing at least 5 times. Then, instead of the likelihood ratio, test the method nltk.collocations.BigramAssocMeasures().jaccard (a sketch is given after this exercise).

  4. Focus only on collocations involving the word fear.
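
A possible sketch for questions 3 and 4 (tokens comes from the code above; the frequency threshold and the number of collocations kept are arbitrary choices):

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)  # keep only bigrams seen at least 5 times
print(finder.nbest(bigram_measures.jaccard, 15))  # 15 best collocations (Jaccard)

# Question 4: keep only the collocations involving "fear"
finder.apply_ngram_filter(lambda w1, w2: "fear" not in (w1, w2))
print(finder.nbest(bigram_measures.likelihood_ratio, 10))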

Using the concordance method (question 1), the list should look like this:

Exemples d'occurences du terme 'fear' :
Displaying 13 of 13 matches:
d quick unequal spoken apparently in fear as well as in anger. What he said wa
hutters were close fastened, through fear of robbers, and so I knew that he co
to details. I even went so far as to fear that, as I occasioned much trouble, 
years of age, was heard to express a fear "that she should never see Marie aga
ich must be entirely remodelled, for fear of serious accident I mean the steel
 my arm, and I attended her home. 'I fear that I shall never see Marie again.'
clusion here is absurd. "I very much fear it is so," replied Monsieur Maillard
bt of ultimately seeing the Pole. "I fear you are right there," said the Prefe
er occurred before.' Indeed I had no fear on her account. For a moment there w
erhaps so," said I; "but, Legrand, I fear you are no artist. It is my firm int
 raps with a hammer. Be of heart and fear nothing. My daughter, Mademoiselle M
e splendor. I have not the slightest fear for the result. The face was so far 
arriers of iron that hemmed me in. I fear you have mesmerized" adding immediat

Although it is easy to see the words that appear before and after, this list is rather hard to interpret because it combines a lot of information.

Collocation involves identifying bigrams that frequently occur together. Among all observed word pairs, the idea is to select the “best” ones based on a statistical model. Using this method (question 2), we get:

[('of', 'the'),
 ('in', 'the'),
 ('had', 'been'),
 ('to', 'be'),
 ('have', 'been'),
 ('I', 'had'),
 ('It', 'was'),
 ('it', 'is'),
 ('could', 'not'),
 ('from', 'the'),
 ('upon', 'the'),
 ('more', 'than'),
 ('it', 'was'),
 ('would', 'have'),
 ('with', 'a'),
 ('did', 'not'),
 ('I', 'am'),
 ('the', 'a'),
 ('at', 'once'),
 ('might', 'have')]

If we instead keep only frequent bigrams and rank them with the Jaccard measure (question 3), the best collocations are:

"Gad Fly"
'Hum Drum,'
'Rowdy Dow,'
Brevet Brigadier
Barrière du
ugh ugh
Ourang Outang
Chess Player
John A.
A. B.
hu hu
General John
'Oppodeldoc,' whoever
mille, mille,
Brigadier General

This list is more meaningful: it includes character names, places, and recurring expressions (such as Chess Player).

As for the collocations involving the word fear (question 4):

[('fear', 'of'), ('fear', 'God'), ('I', 'fear'), ('the', 'fear'), ('The', 'fear'), ('fear', 'him'), ('you', 'fear')]

If we perform the same analysis for the term love, we logically find subjects that are commonly associated with the verb:

[('love', 'me'), ('love', 'he'), ('will', 'love'), ('I', 'love'), ('love', ','), ('you', 'love'), ('the', 'love')]

4 Some Applications

We just discussed an initial application of the bag of words approach: grouping texts based on shared terms. However, this is not the only use case. We will now explore two additional applications that lead us toward language modeling: named entity recognition and classification.

4.1 Named Entity Recognition

Named Entity Recognition (NER) is an information extraction technique used to identify the type of certain terms in a text, such as locations, people, quantities, etc.

To illustrate this, let’s return to The Count of Monte Cristo and examine a short excerpt from the work to see how named entity recognition operates:

from urllib import request

url = "https://www.gutenberg.org/files/17989/17989-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

dumas = (
  raw
  .split("*** START OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***")[1]
  .split("*** END OF THE PROJECT GUTENBERG EBOOK LE COMTE DE MONTE-CRISTO, TOME I ***")[0]
)

import re

def clean_text(text):
    text = text.lower()            # lowercase the text
    text = " ".join(text.split())  # normalize whitespace
    return text

dumas = clean_text(dumas)

dumas[10000:10500]
(The content of the book is extracted here in a somewhat simplistic way.)
" mes yeux. --vous avez donc vu l'empereur aussi? --il est entré chez le maréchal pendant que j'y étais. --et vous lui avez parlé? --c'est-à-dire que c'est lui qui m'a parlé, monsieur, dit dantès en souriant. --et que vous a-t-il dit? --il m'a fait des questions sur le bâtiment, sur l'époque de son départ pour marseille, sur la route qu'il avait suivie et sur la cargaison qu'il portait. je crois que s'il eût été vide, et que j'en eusse été le maître, son intention eût été de l'acheter; mais je lu"
import spacy
from spacy import displacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp(dumas[15000:17000])
# displacy.render(doc, style="ent", jupyter=True)

The named entity recognition provided by default in general-purpose libraries is often underwhelming; it is frequently necessary to supplement the default rules with ad hoc rules specific to each corpus.

In practice, named entity recognition was recently used by Etalab to pseudonymize administrative documents. This involves identifying certain sensitive information (such as civil status, address, etc.) through entity recognition and replacing it with pseudonyms.

4.2 Text Data Classification: The Fasttext Algorithm

FastText is a shallow, single-hidden-layer neural network developed by Facebook (now Meta) researchers in 2016 for text classification and language modeling. As we will see, this model serves as a bridge to more refined forms of language modeling, although FastText remains far simpler than large language models (LLMs). One of the main use cases of FastText is supervised text classification: determining a text’s category—for example, identifying whether a song’s lyrics belong to the rap or rock genre. This is a supervised model because it learns to recognize features—in this case, pieces of text—that lead to good prediction performance on both training and test sets.

The concept of a feature might seem odd for text data, which is inherently unstructured. For structured data, as discussed in the modeling section, the approach was straightforward: features were observed variables, and the classification algorithm identified the best combination to predict the label. With text data, we must build features from the text itself—turning unstructured data into structured form. This is where the concepts we’ve covered so far come into play.

FastText uses a “bag of n-grams” approach. It considers that features are derived not only from words in the corpus but also from multiple levels of n-grams. The general architecture of FastText looks like this:

Diagram of the FastText architecture

What interests us here is the left side of the diagram—“feature extraction”—since the embedding part relates to concepts we will cover in upcoming chapters. In the figure’s example, the text “Business engineering and services” is tokenized into words as we’ve seen earlier. But FastText also creates multiple levels of n-grams. For instance, it generates word bigrams: “Business engineering”, “engineering and”, “and services”; and also character four-grams like “busi”, “usin”, and “sine”. FastText then transforms all these items into numeric vectors. Unlike the term-frequency representations we’ve seen, these vectors are not based on corpus frequency (as in document-term matrices) but are word embeddings. We’ll explore this concept in future chapters.
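
To give a concrete idea of the supervised workflow, here is a minimal sketch using the fasttext Python package (the training file, its labels and the hyperparameters below are purely illustrative):

import fasttext

# train.txt would contain one labelled example per line, e.g.:
# __label__rock here comes the sun...
# __label__rap  ...
model = fasttext.train_supervised(
    input="train.txt",
    wordNgrams=2,  # add word bigrams to the unigram features
    epoch=25
)

# Predict the label of a new document
model.predict("lyrics of a new song to classify")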

FastText is widely used in official statistics, as many textual data sources need to be classified into aggregated nomenclatures.

Here is an example of how such a model can be used for activity classification.



Footnotes

  1. We use the term bigrams for two-word co-occurrences, trigrams for three-word ones, etc.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.