1 Introduction
It is common to associate data scientists with the idea of complex artificial intelligence models. The popular success of tools like ChatGPT contributes to this perception. However, modeling is generally only one phase of a data scientist’s work, just as visualization is. In some organizations with a more specialized division of labor, data engineers are as involved in the modeling phase as data scientists.
It is a common misconception, especially among newcomers, that the data scientist’s work can be reduced to the modeling phase alone. This phase depends heavily on the quality of the data cleaning and structuring work done beforehand. Implementing complex models that can handle unstructured data is resource-intensive and costly. Only a limited number of players can train large language models¹ ex nihilo, since training a single model costs at least $300,000, even before any inference phase (Izsak, Berchansky, and Levy 2021). Training large language models is also energy-intensive, which can lead to a significant carbon footprint (Strubell, Ganesh, and McCallum 2019; Arcep 2019).
Fortunately, it is possible to implement lighter models, which we will introduce in the coming chapters. This part of the course will primarily focus on presenting machine learning algorithms. We can define machine learning broadly as a set of techniques that enable algorithms to detect structures or statistical regularities from a set of observations without these structures being defined a priori by modelers. This definition helps differentiate this approach from other fields of artificial intelligence, particularly symbolic AI, where each observation is characterized by an exhaustive and predefined set of rules. Although this broad definition encompasses traditional inferential statistics, it still highlights the major philosophical difference between machine learning and econometrics, as we will discuss later.
Russell and Norvig (2020) define artificial intelligence as follows:
“The study of [intelligent] agents that receive percepts from the environment and take action. Each such agent is implemented by a function that maps percepts to actions, and we cover different ways to represent these functions, such as production systems, reactive agents, logical planners, neural networks, and decision-theoretic systems”
This very broad definition includes many different approaches within the field of artificial intelligence. It defines artificial intelligence as a very generic decision rule derived from data. Mathematically, this means linking perceptions \(\mathbb{X}\)—facts considered given—to a decision \(y\) through a decision rule \(f\): \(y=f(\mathbb{X})\) (where decision \(y\) comes from a set of decisions, either restricted or broad, depending on the phenomenon, noted as \(\mathcal{Y}\)). The way this function \(f\) is constructed distinguishes different fields of artificial intelligence.
The European AI Act of 2024 offers a similar definition though expressed in different terms:
“A machine-based system designed to operate with varying levels of autonomy and adaptability after deployment, which, for explicit or implicit purposes, deduces from the data it receives how to generate results such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.”
In this course, we will only discuss approaches built around learning—those that aim to induce necessarily uncertain rules from a dataset. This approach is very different from symbolic AI, which offers limited autonomy to the machine since its behavior is constrained by a set, sometimes large, of deterministic rules.
👈️ Artificial intelligence
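To make the contrast between symbolic AI and learning more concrete, here is a minimal sketch in plain Python. The data, the function names and the age threshold of 50 are all invented for illustration: the first rule is fixed in advance by the modeler, whereas the second is induced from the observations by choosing the threshold that makes the fewest errors.

```python
# Toy data, invented for illustration: customer ages and whether they bought (1) or not (0).
ages = [18, 22, 25, 31, 38, 44, 47, 52, 60, 65]
bought = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


def symbolic_rule(age):
    """Symbolic approach: the threshold (50) is imposed a priori by the modeler."""
    return int(age >= 50)


def learn_threshold(x, y):
    """Learning approach: scan candidate thresholds and keep the one with the fewest errors."""
    best_threshold, best_errors = None, len(y) + 1
    for candidate in sorted(set(x)):
        errors = sum(int(xi >= candidate) != yi for xi, yi in zip(x, y))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


# The decision rule f is now derived from the data rather than written by hand.
threshold = learn_threshold(ages, bought)


def learned_rule(age):
    return int(age >= threshold)


errors_symbolic = sum(symbolic_rule(a) != b for a, b in zip(ages, bought))
errors_learned = sum(learned_rule(a) != b for a, b in zip(ages, bought))
print(threshold, errors_symbolic, errors_learned)
```

Both rules have the form \(y=f(\mathbb{X})\); only the way \(f\) is constructed differs. On a real problem the learned rule would of course be evaluated on data not used to choose the threshold.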
The choice to focus on simple machine learning algorithms in the modeling section, rather than directly jumping into neural networks, first allows us to present the scientific approach related to learning, especially to achieve satisfactory performance when extrapolating to data not encountered during the training phase. This also highlights issues that are relevant even for more complex models, such as data preparation to reduce noise in the data, enabling models to extract more reliable structures from the data. In fact, to be more effective than more parsimonious approaches, deep learning techniques, particularly neural networks, require either very large volumes of data (millions or tens of millions of observations) or complex-structured data, such as natural language or images. In many cases, simpler models, such as machine learning techniques, are more than sufficient.
2 Modeling: An Approach at the Heart of Statistics
A statistical model is a simplified and structured representation of a real phenomenon, built from observations drawn from a partial dataset.
👈️ Statistical model
A model aims to capture relationships and underlying structures within this data, allowing for hypothesis formulation, making predictions, and extrapolating conclusions beyond the measured dataset. Statistical models thus provide an analytical framework to explore, understand, and interpret the information contained in the data.
Representing reality as a model is a foundational principle in statistics as a scientific discipline, with applications across many fields: economics, sociology, geography, biology, physics, and more. The specific term may vary depending on the discipline, but the scientific approach is typically consistent: the modeler establishes relationships between several theoretical variables with empirical counterparts to quantify the relationship between them. This approach is central to inferential statistics, as opposed to descriptive statistics. In both cases, the objective is to use a sample, a limited set of observed data, to better understand a population, the entire dataset relevant to a study. The difference between the two approaches lies in how this extrapolation is conducted. In inferential statistics, the goal is generally to infer general laws, with statistical uncertainty margins, from observed data, whether it is about the statistical distribution of a variable (univariate statistics) or relationships between multiple variables. Descriptive statistics, on the other hand, simply summarize information within a dataset, often through distribution moments (mean, quantiles, etc.), without aiming to provide a general explanation of the data-generating process.
👈️ Inferential statistics, descriptive statistics, sample, population
These two approaches are not mutually exclusive; rather, they are complementary. Attempting an inferential approach without a thorough descriptive analysis often leads to dead ends or unreliable conclusions. Inferential analysis can also enrich a descriptive analysis by allowing information to be prioritized within a dataset, guiding descriptive work by focusing on key findings.
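The difference between the two approaches can be sketched with the standard library alone. The simulated "population" of incomes below is purely hypothetical, and the normal approximation used for the interval is an assumption that is reasonable here only because the sample is fairly large.

```python
import random
import statistics
from statistics import NormalDist

random.seed(0)

# Hypothetical population: 100,000 simulated incomes (not real data).
population = [random.gauss(2000, 500) for _ in range(100_000)]
# In practice we only observe a sample of it.
sample = random.sample(population, 400)

# Descriptive statistics: summarize the observed sample.
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)

# Inferential statistics: make a statement about the population mean,
# together with a margin of statistical uncertainty.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5
z = NormalDist().inv_cdf(0.975)  # roughly 1.96 for a 95% interval
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.1f}, median: {sample_median:.1f}")
print(f"95% confidence interval for the population mean: [{ci_low:.1f}, {ci_high:.1f}]")
```

The first two quantities only describe the 400 observations; the interval is a claim about the 100,000 values that were not all observed.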
In economic research, empirical models are generally used to associate specific structural parameters of economic behavior models with quantitative values. Statistical models, like economic models, always contain an element of unreality (Friedman 1953; Salmon 2010), and accepting a model’s implications too literally, even if it has strong predictive performance, can be risky and reflect a scientific bias. Rather than identifying the true data-generating process, the goal is to select the least inaccurate model.
At ENSAE, empirical modeling is primarily seen in two application areas: machine learning and econometrics. The distinction is largely semantic—a linear regression can be considered either a machine learning or econometric technique—but also conceptual:
- In machine learning, the structure imposed by the modeler is minimal, and algorithms, based on statistical performance criteria, help select a mathematical rule that best fits the data.
- In econometrics, the assumptions about the structure of laws are stronger (even in semi- or non-parametric frameworks) and are more often imposed by the modeler.
In this part of the course, we will focus mainly on machine learning, as it offers a more practical perspective than econometrics, which is more directly associated with complex statistical concepts like asymptotic theory.
The adoption of machine learning in economic literature has been gradual, as data structuring often serves as the empirical counterpart of theoretical hypotheses regarding actor or market behavior (Athey and Imbens 2019; Charpentier, Flachaire, and Ly 2018). To simplify, econometrics focuses on understanding the causal effect of certain variables on another. This implies that what matters to the econometrician is primarily the estimation of parameters (and the uncertainty around them) that quantify the effect of one variable on another. Again, to simplify, machine learning focuses on a predictive objective, exploiting correlations between variables. From this perspective, causality is less important than knowing that a variation of \(x\)% in one variable can predict a change of \(\beta x\) in the target variable; the reason is irrelevant. Mullainathan and Spiess (2017) summarized this fundamental difference as follows: econometrics concerns itself with \(\widehat{\beta}\), while machine learning focuses on \(\widehat{y}\). Both are, of course, connected in a linear framework, but this difference in approach has important implications for the structure of the models studied, particularly their parsimony².
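The distinction between \(\widehat{\beta}\) and \(\widehat{y}\) can be illustrated on a single linear model. The sketch below uses simulated data (the true slope of 2 and the noise level are arbitrary assumptions) and NumPy's least squares solver: the same fit is read once through its coefficients and their standard errors, and once through the quality of its predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data: the true slope is 2 by construction (arbitrary choice).
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Econometric reading: the parameters and the uncertainty around them.
residuals = y - X @ beta_hat
sigma2 = residuals @ residuals / (n - X.shape[1])
std_errors = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Machine learning reading: how good are the predictions?
# (Computed in-sample here for brevity; a held-out sample would normally be used.)
y_hat = X @ beta_hat
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

print("beta_hat:", beta_hat.round(3), "standard errors:", std_errors.round(3))
print("RMSE of y_hat:", round(float(rmse), 3))
```

In this linear setting the two readings come from the same computation, which is why the difference is one of emphasis rather than of technique.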
3 Some useful definitions
In this part of the course, we will use several terms familiar to machine learning practitioners, but which need to be clarified to understand the following chapters.
3.1 Training and inference
Machine learning is an operational approach: the objective is generally to estimate relationships between observed variables to create a decision rule, which can then be extrapolated to another data sample. The next two chapters aim to present the scientific approach for achieving high-quality extrapolation.
Training (or learning) is the phase where a machine learning model refines relationships based on a dataset. To draw an analogy with human learning, it is the phase where the model “studies” before an exam.
👈️ Training, learning
Inference is the phase where the decision rule is applied to new data, unseen during training. To continue the previous analogy, it could involve new questions on the exam (evaluation phase) or the application of the learned knowledge to real-world situations.
👈️ Inference
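The two phases map directly onto the `fit` and `predict` methods of Scikit-Learn estimators. The following sketch assumes Scikit-Learn and NumPy are installed and uses simulated data; the choice of a logistic regression and of a 20% test set is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Simulated dataset: two features and a binary label (assumption for illustration).
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Keep 20% of the observations aside: they play the role of the "exam questions".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)  # training: the model "studies" the training set

y_pred = model.predict(X_test)  # inference: the rule is applied to unseen data
test_accuracy = model.score(X_test, y_test)  # share of correct predictions
print(f"Accuracy on unseen data: {test_accuracy:.2f}")
```

Evaluating on `X_test` rather than `X_train` is what makes the accuracy an estimate of performance on new data.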
3.2 Machine learning and deep learning
Up to this point, we have frequently used, without defining it, the concept of machine learning.
Machine learning is a set of algorithmic techniques that allow computers to learn from examples and adjust a model without the rules being explicitly programmed. Through iterative algorithms and a performance metric, classification or prediction rules are learned that relate features to a target variable (label)³.
👈️ Machine learning, label, features
There are many algorithms that differ in how they introduce a more or less formal structure into the relationship between observed variables. We will cover only a few of these algorithms: support vector machines (SVM), logistic regression, decision trees, random forests, etc. Simple to implement using the Scikit-Learn library, they will provide an understanding of the original approach to machine learning, which can be further explored later.
Within the broad family of machine learning algorithms, neural network techniques are increasingly treated as a field in their own right. They are grouped within a family called deep learning. These networks are loosely inspired by the functioning of the human brain and are composed of many interconnected layers of neurons. The canonical structure is illustrated in Figure 3.1. Deep learning is useful for building models that learn complex and abstract representations from raw data, sometimes bypassing the difficult task of manually defining features. Image analysis (computer vision) and natural language processing are the main application areas for these methods.
👈️ Deep learning, neural networks
Neural networks are complex models. With large and complex datasets, they can be heavy, or even impossible, to train on standard machines. However, if the data matches what a pretrained model expects, that model can be used for inference without retraining, or after a marginal round of additional training known as fine-tuning. To extend the analogy with human learning, fine-tuning is similar to updating one’s knowledge with a new lesson before an exam. It isn’t necessary to relearn everything, only to refine knowledge with the latest course content.
A large number of pretrained models are available on HuggingFace, a platform for sharing deep learning models (the GitHub of deep learning). HuggingFace also offers courses on the subject, notably on natural language processing (NLP). We will look at natural language processing in the next part of this course, but in a more modest way, going back over the concepts needed before implementing sophisticated language models.
👈️ Fine-tuning
In this part of the course, we will not delve deeply into deep learning because these models, to be effective, require either large structured datasets, which are rarely available as open data and have complex variable relationships, or unstructured data such as text, images, videos, etc. Text data will be covered in the next part of the course, as they involve specific concepts that require an understanding of structured data modeling challenges.
3.3 Supervised and unsupervised learning
An important dividing line between methods is whether or not we observe the label (the variable \(y\)) to be modeled.
Take, for example, an e-commerce site that has information on its customers such as age, gender, and place of residence. This site may want to leverage this information in different ways to model purchasing behavior.
First, the site may wish to predict the purchase volume of a new customer with certain characteristics. In this case, it is possible to use the amounts spent by other customers based on their characteristics. Information for our new customer is not measured, but it can be inferred from a set of similar observations.
However, it is entirely possible to train a model on an unobserved label, provided this makes sense. For instance, our e-commerce site may wish to determine, based on the characteristics of a new customer and of its existing customer base, whether this customer belongs to a specific group of consumers: big spenders, frugal shoppers, etc. Of course, we never know a priori which group a consumer belongs to, but grouping customers with similar behaviors gives meaning to this categorization. In this case, the algorithm learns which characteristics matter most for grouping consumers with similar behavior, so that any new customer can be assigned to a group.
These two examples illustrate the different approaches depending on whether we attempt to build models on an observed label or not. This distinction represents one of the fundamental dualities in machine learning techniques:
👈️ Supervised learning, unsupervised learning
- Supervised learning: the target value is known and can be used to evaluate the model’s quality;
- Unsupervised learning: the target value is unknown, and statistical criteria are used to select the most plausible data structure.
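The e-commerce example above can be sketched with Scikit-Learn on simulated customers. Everything in the data is an assumption made for illustration (two customer profiles, features restricted to age and monthly visits, a spending amount generated from the visits); the point is only to contrast a model fitted on an observed label with one that has no label at all.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical customer profiles; columns are (age, monthly visits).
frugal = rng.normal(loc=[25, 2], scale=[3, 0.5], size=(100, 2))
big_spenders = rng.normal(loc=[50, 10], scale=[3, 0.5], size=(100, 2))
customers = np.vstack([frugal, big_spenders])

# Observed label, simulated here: the amount spent by each existing customer.
spending = 20 + 3 * customers[:, 1] + rng.normal(scale=2, size=200)

new_customer = np.array([[30, 4]])

# Supervised learning: the label is observed and used to fit the model.
regressor = LinearRegression().fit(customers, spending)
predicted_spending = regressor.predict(new_customer)

# Unsupervised learning: no label, the algorithm groups similar customers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
groups = kmeans.labels_
new_group = kmeans.predict(new_customer)

print("Predicted spending:", predicted_spending.round(1))
print("Group assigned to the new customer:", new_group)
```

Note that the cluster numbers (0 or 1) carry no meaning by themselves: it is up to the analyst to interpret each group, for instance as "big spenders" or "frugal shoppers".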
This part of the course will illustrate these two approaches using the same dataset: the results of U.S. elections. In the supervised learning case, we will aim to directly model the election candidates’ results (either the score or the winner). In the unsupervised learning case, we will try to group territories with similar voting behavior based on socio-demographic factors.
3.4 Classification and Regression
A second fundamental distinction that determines the choice of machine learning method to implement is the nature of the label. Is it a continuous variable or a discrete variable, meaning it takes a limited number of categories?
This difference in data type leads to two types of approaches:
👈️ Classification, regression
- In classification tasks, where our label \(y\) has a finite number of values⁴, we aim to predict the class or group to which our data belongs. For example, if you have coffee in the morning, are you part of the group of grumpy morning people? Performance metrics typically use the proportion of correct or incorrect classifications to assess model quality.
- In regression tasks, where our label is a numerical value, we seek to directly predict the value of our variable in the model. For instance, given a certain age, what would be your daily expenditure on fast food? Performance metrics are usually averages of the differences between the prediction and the observed value, often with varying degrees of sophistication.
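The two families of performance metrics mentioned above can be computed by hand on a few made-up predictions. The values below are invented for illustration; in practice Scikit-Learn provides these metrics in its `sklearn.metrics` module.

```python
# Classification: labels are categories, we count correct predictions.
y_true_class = ["grumpy", "cheerful", "grumpy", "grumpy", "cheerful"]
y_pred_class = ["grumpy", "grumpy", "grumpy", "cheerful", "cheerful"]

n_correct = sum(t == p for t, p in zip(y_true_class, y_pred_class))
accuracy = n_correct / len(y_true_class)  # 3 correct out of 5

# Regression: labels are numbers, we average the prediction errors.
y_true_reg = [12.0, 7.5, 3.0, 9.0]
y_pred_reg = [10.0, 8.5, 3.0, 11.0]

errors = [t - p for t, p in zip(y_true_reg, y_pred_reg)]  # 2, -1, 0, -2
mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e**2 for e in errors) / len(errors)  # mean squared error
rmse = mse**0.5  # root mean squared error, in the unit of the label

print(f"accuracy={accuracy}, MAE={mae}, MSE={mse}, RMSE={rmse}")
```

The squared variants penalize large errors more heavily than the absolute one, which is one example of the "varying degrees of sophistication" among regression metrics.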
In summary, the following cheat sheet, taken from the Scikit-Learn documentation, provides an initial overview of the different model families:
[Figure: cheat sheet from the Scikit-Learn documentation]
4 Data
In this section, we will focus on electoral science. This branch of research lies at the interface between political science, sociology, economics and geography. Its founding work is Siegfried’s (1913) Tableau de la France de l’ouest, known for several conclusions, the most famous of which is “limestone votes left, granite right”. More recently, the debates surrounding Piketty and Cagé (2023), a work that draws on multiple French open datasets, demonstrate the value of data science for understanding a real phenomenon, namely the determinants of voting.
Most examples in this section are based on the 2020 U.S. election results at the county level. Several datasets are used for this purpose:
- Election data is a reconstruction from the MIT Election Lab data, available on GitHub (by tonmcg) or directly on the MIT Election Lab website.
- Socioeconomic data (population, income and poverty data, unemployment rate, education variables) come from the USDA.
- The shapefile used for the maps comes from the Census Bureau data. The file can be downloaded directly from this URL: https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_us_county_20m.zip
The code to build a single database from these various sources is shown below, for those interested:
View code for data retrieval
import urllib
import urllib.request
import os
import zipfile
from urllib.request import Request, urlopen
from pathlib import Path
import numpy as np
import pandas as pd
import geopandas as gpd


def download_url(url, save_path):
    with urllib.request.urlopen(url) as dl_file:
        with open(save_path, "wb") as out_file:
            out_file.write(dl_file.read())


def create_votes_dataframes():
    Path("data").mkdir(parents=True, exist_ok=True)

    download_url(
        # Backup of "https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_us_county_20m.zip",
        "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/shapefile_county_us_2019.zip",
        "data/shapefile",
    )
    with zipfile.ZipFile("data/shapefile", "r") as zip_ref:
        zip_ref.extractall("data/counties")

    shp = gpd.read_file("data/counties/cb_2019_us_county_20m.shp")
    shp = shp[~shp["STATEFP"].isin(["02", "69", "66", "78", "60", "72", "15"])]
    shp

    df_election = pd.read_csv(
        "https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-20/master/2020_US_County_Level_Presidential_Results.csv"
    )
    df_election.head(2)
    population = pd.read_excel(
        "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/48747/PopulationEstimates.xlsx?v=85724",
        header=4,
    ).rename(columns={"FIPStxt": "FIPS"})
    education = pd.read_excel(
        "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/48747/Education.xlsx?v=34502",
        header=3,
    ).rename(columns={"FIPS Code": "FIPS", "Area name": "Area_Name"})
    unemployment = pd.read_excel(
        "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/48747/Unemployment.xlsx?v=35579",
        header=4,
    ).rename(columns={"FIPS_Code": "FIPS", "area_name": "Area_Name", "Stabr": "State"})
    income = pd.read_excel(
        "https://ers.usda.gov/sites/default/files/_laserfiche/DataFiles/48747/PovertyEstimates.xlsx?v=36737",
        header=4,
    ).rename(columns={"FIPS_Code": "FIPS", "Stabr": "State", "Area_name": "Area_Name"})

    dfs = [
        df.set_index(["FIPS", "State"])
        for df in [population, education, unemployment, income]
    ]
    data_county = pd.concat(dfs, axis=1)
    df_election = df_election.merge(
        data_county.reset_index(), left_on="county_fips", right_on="FIPS"
    )
    df_election["county_fips"] = df_election["county_fips"].astype(str).str.lstrip("0")
    shp["FIPS"] = shp["GEOID"].astype(str).str.lstrip("0")
    votes = shp.merge(df_election, left_on="FIPS", right_on="county_fips")

    req = Request(
        "https://dataverse.harvard.edu/api/access/datafile/3641280?gbrecs=false"
    )
    req.add_header(
        "User-Agent",
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    )
    content = urlopen(req)
    df_historical = pd.read_csv(content, sep="\t")
    # df_historical = pd.read_csv('https://dataverse.harvard.edu/api/access/datafile/3641280?gbrecs=false', sep = "\t")

    df_historical = df_historical.dropna(subset=["FIPS"])
    df_historical["FIPS"] = df_historical["FIPS"].astype(int)
    df_historical["share"] = (
        df_historical["candidatevotes"] / df_historical["totalvotes"]
    )
    df_historical = df_historical[["year", "FIPS", "party", "candidatevotes", "share"]]
    df_historical["party"] = df_historical["party"].fillna("other")

    df_historical_wide = df_historical.pivot_table(
        index="FIPS", values=["candidatevotes", "share"], columns=["year", "party"]
    )
    df_historical_wide.columns = [
        "_".join(map(str, s)) for s in df_historical_wide.columns.values
    ]
    df_historical_wide = df_historical_wide.reset_index()
    df_historical_wide["FIPS"] = df_historical_wide["FIPS"].astype(str).str.lstrip("0")
    votes["FIPS"] = votes["GEOID"].astype(str).str.lstrip("0")
    votes = votes.merge(df_historical_wide, on="FIPS")
    votes["winner"] = np.where(
        votes["votes_gop"] > votes["votes_dem"], "republican", "democrats"
    )

    return votes
This section is by no means exhaustive. It serves as an entry point into modeling, based on a series of examples developed around a common theme. For those interested in learning more about econometric models, which will be less emphasized than machine learning models here, I recommend reading Turrell and contributors (2021).
5 Going further
This section is an introduction to machine learning. It does not cover recent research areas, among which we can highlight:
👈️ Interpretability, conformal prediction, Bayesian methods
- Interpretability: a set of methods aimed at opening up the black box of machine learning models. This includes techniques that help to better understand how a model, given certain inputs, arrives at a prediction. Popular methods today include LIME and Shapley values. For more information, a good starting point is Christoph (2020).
- Conformal prediction: a statistical approach that quantifies the uncertainty of a prediction by producing a prediction interval (or prediction set) for each individual prediction. Under mild assumptions, it guarantees a predetermined coverage level, which helps deliver reliable and understandable predictions. For further reading, refer to the technical article by Angelopoulos and Bates (2021).
- Bayesian methods: a set of methods that treat parameters as uncertain and update this uncertainty on the basis of observed data. These methods are frequently used in statistical contexts where flexibility is nevertheless required with regard to modeling assumptions. They were popularized by the book by Silver (2012), creator of the FiveThirtyEight website, which presents several applications (sports forecasts, elections, etc.), and they are the subject of several dedicated courses at ENSAE, notably the third-year Markov Chain Monte Carlo (MCMC) course. Nevertheless, we will occasionally refer to this family of methods, in particular when we present the naive Bayes classifier.
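To give a flavor of the conformal prediction item above, here is a minimal sketch of one common variant, split conformal prediction, written with NumPy on simulated data. The data-generating process, the linear predictor and the 90% target level are assumptions chosen for illustration; this is not the full treatment found in Angelopoulos and Bates (2021).

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n):
    """Simulated data (assumption): y is linear in x plus Gaussian noise."""
    x = rng.uniform(0, 10, size=n)
    y = 2.0 * x + rng.normal(scale=1.0, size=n)
    return x, y


x_train, y_train = simulate(500)  # used to fit the model
x_calib, y_calib = simulate(500)  # used to calibrate the intervals
x_new, y_new = simulate(2000)  # plays the role of future observations

# 1. Fit any point predictor on the training set (here a simple line).
slope, intercept = np.polyfit(x_train, y_train, deg=1)


def predict(x):
    return slope * x + intercept


# 2. Nonconformity scores: absolute errors on the calibration set.
scores = np.abs(y_calib - predict(x_calib))

# 3. Conformal quantile of the scores, with the finite-sample correction.
alpha = 0.1  # target: 90% coverage
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
q_hat = np.sort(scores)[k - 1]

# 4. Prediction intervals for new points, and their empirical coverage.
lower = predict(x_new) - q_hat
upper = predict(x_new) + q_hat
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(f"Half-width: {q_hat:.2f}, empirical coverage: {coverage:.3f}")
```

The empirical coverage should land close to the 90% target, whatever predictor is used in step 1: a poor predictor simply yields wider intervals.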
To take your first steps in modeling, particularly data preprocessing, you can also check out topic 3 of a hackathon organized by Insee in 2023, Exploring the eating habits of the French, on the SSP Cloud or on Github.
The aim of the topic is to work on food consumption and habits data from the INCA 3 study. You will work on several themes:
- Exploratory data analysis and visualization
- Clustering of individuals: from pre-processing to classical unsupervised learning methods (PCA, K-means, hierarchical agglomerative clustering)
- BMI prediction: first steps towards supervised learning methods and associated preprocessings.
5.1 References
Latest built version: 2025-03-19
Footnotes
1. We will periodically revisit the principle of large language models, which have become central in the data science ecosystem within just a few years and are also becoming popular tools for the general public, like ChatGPT.
2. As we said, this differentiation is a bit of a caricature, especially now that economists are more familiar with the concepts of predictive performance on training and test subsets (but this is a slow evolution). Conversely, research in machine learning is very dynamic on the question of the explainability and interpretability of machine learning models, notably around the concept of Shapley values. Nevertheless, the philosophical difference between these two schools of thought continues to influence the way econometrics or machine learning is practiced in different scientific fields.
3. Drawing an analogy with econometrics, features are explanatory variables or covariates (the matrix \(X\)), and the label is the explained variable (\(y\)).
4. We will focus on the simplest binary case. In this type of problem, the variable \(y\) has two categories: winner-loser, 0-1, yes-no… However, there are many use cases where the variable has more categories, such as satisfaction scores ranging from 0 to 5 or A to D. Implementing these models is more complex, but the general approach often involves breaking the problem down into a set of binary models to enable the use of simple and stable metrics.
Citation
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}