An introduction to regression

Linear regression is often the first statistical model encountered in a quantitative curriculum. It is an intuitive and rich method, and machine learning lets us approach it differently from econometrics. With scikit-learn and statsmodels, we have all the tools needed to satisfy both data scientists and economists.

Modeling
Author

Lino Galiana

Published

2025-05-26


The previous chapter aimed to propose a first model for understanding the counties where the Republican Party wins. The variable of interest was binary (win or lose), placing us within the framework of a classification model.

Now, using the same data, we will propose a regression model to explain the Republican Party’s score. The variable of interest is thus continuous. We will ignore the fact that it is bounded between 0 and 100, which means that, to be fully rigorous, we would need to transform the scale so that predictions fall within this interval.

This chapter uses the same dataset as before, presented in the introduction to this section: US presidential election results crossed with socio-demographic variables. The code is available on GitHub.

!pip install --upgrade xlrd  # Colab workaround: xlrd version bug
!pip install geopandas
import requests

url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
open('getdata.py', 'wb').write(r.content)

import getdata
votes = getdata.create_votes_dataframes()

This chapter will use several modeling packages, the main ones being scikit-learn and statsmodels. Here is a suggested import block for all of them.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

1 General Principle

The general principle of regression consists of finding a law \(h_\theta(X)\) such that

\[ h_\theta(X) = \mathbb{E}_\theta(Y|X) \]

This formalization is extremely general and is not limited to linear regression.

In econometrics, regression offers an alternative to maximum likelihood methods and moment methods. Regression encompasses a very broad range of methods, depending on the family of models (parametric, non-parametric, etc.) and model structures.

1.1 Linear Regression

This is the simplest way to represent the law \(h_\theta(X)\) as a linear combination of variables \(X\) and parameters \(\theta\). In this case,

\[ \mathbb{E}_\theta(Y|X) = X\beta \]

This relationship is, under this formulation, theoretical. It must be estimated from the observed data \(y\). The method of least squares aims to minimize the quadratic error between the prediction and the observed data (which explains why regression can be seen as a Machine Learning problem). In general, the method of least squares seeks to find the set of parameters \(\theta\) such that

\[ \theta = \arg \min_{\theta \in \Theta} \mathbb{E}\bigg[ \left( y - h_\theta(X) \right)^2 \bigg] \]

Which, in the context of linear regression, is expressed as follows:

\[ \beta = \arg\min_{\beta} \mathbb{E}\bigg[ \left( y - X\beta \right)^2 \bigg] \]

When the theoretical model (\(\mathbb{E}_\theta(Y|X) = X\beta\)) is applied to data, the model is formalized as follows:

\[ Y = X\beta + \epsilon \]

With a certain distribution of the noise \(\epsilon\) that depends on the assumptions made. For example, with \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) i.i.d., the least-squares estimator of \(\beta\) coincides with the maximum likelihood estimator, whose asymptotic theory ensures consistency and asymptotic efficiency (the Cramér-Rao bound).
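The least-squares problem above has a well-known closed-form solution, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A minimal sketch on synthetic data (the coefficients below are arbitrary values chosen for illustration):

```python
import numpy as np

# Synthetic data: y = X @ beta + noise, with an illustrative beta
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form least-squares solution: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 2.0, -0.5]
```

In practice, libraries like scikit-learn solve this numerically in a more stable way, but the normal equations are the underlying principle.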

1.1.1 Application

Following in the footsteps of Siegfried (1913), our objective in this chapter is to explain and predict the Republican score from a few socioeconomic variables. Unlike the previous chapter, where we focused on a binary outcome (Republican victory or defeat), this time we model the Republican score directly.

The next exercise aims to demonstrate how to perform linear regression using scikit-learn. In this area, statsmodels is significantly more comprehensive, as the following exercise will demonstrate. The main advantage of performing regressions with scikit-learn is the ability to compare the results of linear regression with other regression models when selecting the best predictive model.

Exercise 1a: Linear Regression with scikit-learn
  1. Using a few variables, for example, ‘Unemployment_rate_2019’, ‘Median_Household_Income_2021’, ‘Percent of adults with less than a high school diploma, 2018-22’, “Percent of adults with a bachelor’s degree or higher, 2018-22”, explain the variable per_gop using a training sample X_train prepared beforehand.

⚠️ Use the variable Median_Household_Income_2021 in log form; otherwise, its scale might dominate and obscure other effects.

  2. Display the values of the coefficients, including the constant.

  3. Evaluate the relevance of the model using \(R^2\) and assess the fit quality with the MSE.

  4. Plot a scatter plot of observed values and prediction errors. Do you observe any specification issues?
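A sketch of the workflow the exercise calls for, using synthetic data as a stand-in for the votes dataframe (the real exercise uses the socioeconomic columns listed above and per_gop as the target; the coefficients here are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = 0.5 + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.05, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Question 1-2: fit and inspect coefficients, including the constant
model = LinearRegression().fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Question 3: R2 and MSE on the test sample
y_pred = model.predict(X_test)
print("R2 :", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

# Question 4: errors against observed values, to eyeball specification issues
errors = y_test - y_pred
plt.scatter(y_test, errors, s=5)
plt.xlabel("Observed values")
plt.ylabel("Prediction error")
```

On well-specified synthetic data the errors show no pattern; on the votes data they will not look so random, which is precisely the point of question 4.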

In question 4, it can be observed that the distribution of errors is clearly not random with respect to \(X\).

The model therefore suffers from a specification issue, so work will need to be done on the selected variables later. Before that, we can redo this exercise using the statsmodels package.

Exercise 1b: Linear Regression with statsmodels

This exercise aims to demonstrate how to perform linear regression using statsmodels, which offers features more similar to those of R and less oriented toward Machine Learning.

The goal is still to explain the Republican score based on a few variables.

  1. Using a few variables, for example, ‘Unemployment_rate_2019’, ‘Median_Household_Income_2021’, ‘Percent of adults with less than a high school diploma, 2015-19’, “Percent of adults with a bachelor’s degree or higher, 2015-19”, explain the variable per_gop.
    ⚠️ Use the variable Median_Household_Income_2021 in log form; otherwise, its scale might dominate and obscure other effects.

  2. Display a regression table.

  3. Evaluate the model’s relevance using \(R^2\).

  4. Use the formula API to regress the Republican score as a function of the variable Unemployment_rate_2019, Unemployment_rate_2019 squared, and the log of Median_Household_Income_2021.

R2:  0.4310933195576123
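A sketch of question 4 with the formula API, again on synthetic stand-in data (the column names income and unemployment are illustrative, not the actual columns of the votes dataframe, and the coefficients are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a quadratic and a log effect
rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "unemployment": rng.uniform(2, 15, size=n),
    "income": rng.uniform(30_000, 120_000, size=n),
})
df["per_gop"] = (
    60 - 1.5 * df["unemployment"] + 0.05 * df["unemployment"] ** 2
    - 4 * np.log(df["income"]) + rng.normal(scale=2, size=n)
)

# Formula API: transformations are written directly in the formula;
# I(...) protects arithmetic such as squaring
model = smf.ols(
    "per_gop ~ unemployment + I(unemployment**2) + np.log(income)", data=df
).fit()
print(model.summary())        # full regression table (question 2)
print("R2:", model.rsquared)  # model fit (question 3)
```

The ability to write transformations inside the formula, as in R, is one of the main conveniences of statsmodels over scikit-learn for this kind of specification.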
Tip

To generate a well-formatted table for a report in \(\LaTeX\), you can use the method Summary.as_latex. For an HTML report, you can use Summary.as_html.

Note

Users of R will find many familiar features in statsmodels, particularly the ability to use a formula to define a regression. The philosophy of statsmodels is similar to that which influenced the construction of R’s stats and MASS packages: providing a general-purpose library with a wide range of models.

However, statsmodels benefits from being more modern compared to R’s packages. Since the 1990s, R packages aiming to provide missing features in stats and MASS have proliferated, while statsmodels, born in the 2010s, only had to propose a general framework (the generalized estimating equations) to encompass these models.

1.2 Logistic Regression

We applied our linear regression to a continuous outcome variable. How do we handle a binary distribution?
In this case, \(\mathbb{E}_{\theta} (Y|X) = \mathbb{P}_{\theta} (Y = 1|X)\).
Logistic regression is a generalized linear model: it assumes the log-odds of the outcome are linear in \(X\):

\[ \text{logit}\bigg(\mathbb{E}_{\theta}(Y|X)\bigg) = \text{logit}\bigg(\mathbb{P}_{\theta}(Y = 1|X)\bigg) = X\beta \]

The \(\text{logit}\) function is \(]0,1[ \to \mathbb{R}: p \mapsto \log(\frac{p}{1-p})\).

It maps a probability into \(\mathbb{R}\). Its inverse is the sigmoid function (\(x \mapsto \frac{1}{1 + e^{-x}}\)), a central concept in deep learning.
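A quick numerical check that the logit and the sigmoid are inverses, using the implementations available in scipy (where the sigmoid is called expit):

```python
import numpy as np
from scipy.special import expit, logit  # sigmoid and its inverse

p = np.array([0.1, 0.5, 0.9])
log_odds = logit(p)          # maps ]0,1[ into R
round_trip = expit(log_odds) # maps back to the original probabilities
print(log_odds)
print(round_trip)
```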

It should be noted that probabilities are not observed; what is observed is the binary outcome (0/1). This leads to two different perspectives on logistic regression:

  • In econometrics, interest lies in the latent model that determines the choice of the outcome. For example, if observing the choice to participate in the labor market, the goal is to model the factors determining this choice;
  • In Machine Learning, the latent model is only necessary to classify observations into the correct category.

Parameter estimation for \(\beta\) can be performed by maximum likelihood or by iteratively reweighted least squares, the two approaches being equivalent under certain assumptions. Unlike linear regression, there is no closed-form solution, so the estimate is obtained numerically.

Exercise 2a: Logistic Regression with scikit-learn

Using scikit-learn with training and test samples:

  1. Evaluate the effect of the already-used variables on the probability of Republicans winning. Display the values of the coefficients.
  2. Derive a confusion matrix and a measure of model quality.
  3. Remove regularization using the penalty parameter. What effect does this have on the estimated parameters?
Exercise 2b: Logistic Regression with statsmodels

Using training and test samples:

  1. Evaluate the effect of the already-used variables on the probability of Republicans winning.
  2. Perform a likelihood ratio test regarding the inclusion of the (log)-income variable.

The p-value of the likelihood ratio test being close to 1 means that we cannot reject the restricted model: the log-income variable adds almost no information to the model.

Tip

The test statistic is: \[ LR = -2\log\bigg(\frac{\mathcal{L}_{\theta_0}}{\mathcal{L}_{\theta}}\bigg) = -2(\ell_{\theta_0} - \ell_{\theta}) \] where \(\theta_0\) denotes the restricted model and \(\theta\) the unrestricted one. Under the null hypothesis, it follows a \(\chi^2\) distribution with as many degrees of freedom as there are restrictions.
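A sketch of a likelihood ratio test with statsmodels, on synthetic data where x1 genuinely matters and x2 does not (both variables and coefficients are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Synthetic binary outcome: only x1 enters the true model
rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df["x1"])))
df["y"] = (rng.uniform(size=n) < p).astype(int)

full = smf.logit("y ~ x1 + x2", data=df).fit(disp=0)

# Test the irrelevant variable x2: LR = -2 (llf_restricted - llf_full)
restricted_x2 = smf.logit("y ~ x1", data=df).fit(disp=0)
lr_x2 = -2 * (restricted_x2.llf - full.llf)
p_x2 = chi2.sf(lr_x2, df=1)  # one restriction -> one degree of freedom

# Same test for the relevant variable x1
restricted_x1 = smf.logit("y ~ x2", data=df).fit(disp=0)
lr_x1 = -2 * (restricted_x1.llf - full.llf)
p_x1 = chi2.sf(lr_x1, df=1)

print("Drop x2: LR =", lr_x2, "p-value =", p_x2)  # large p-value
print("Drop x1: LR =", lr_x1, "p-value =", p_x1)  # tiny p-value
```

A large p-value (fail to reject the restricted model) is what the exercise above interprets as the variable adding almost no information.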

2 Going Further

This chapter only introduces the concepts of regression in a very introductory way. To expand on this, it is recommended to explore further based on your interests and needs.

In the field of machine learning, the main areas for deeper exploration are:

  • Alternative regression models like random forests.
  • Boosting and bagging methods, to learn how multiple models can be trained jointly and their predictions combined, by majority vote or averaging, to reach better decisions than any single model.
  • Issues related to model explainability, a very active research area, to better understand the decision criteria of models.

In the field of econometrics, the main areas for deeper exploration are:

  • Generalized linear models to explore regression with more general assumptions than those we have made so far;
  • Hypothesis testing to delve deeper into these questions beyond our likelihood ratio test.


Additional information

This section lists the environment the code in this chapter was tested with.

Latest built version: 2025-05-26

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.15.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.8
aioitertools 0.12.0
aiosignal 1.3.1
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.9.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 24.2.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.3.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.35.23
botocore 1.35.23
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.1.9
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.3
certifi 2025.4.26
cffi 1.17.1
charset-normalizer 3.3.2
chromedriver-autoinstaller 0.6.4
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.0
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.1.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.14.0
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.19.6
fontawesomefree 6.6.0
fonttools 4.54.1
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.4.1
fsspec 2024.9.0
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.16.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.9
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.5.0
importlib_resources 6.4.5
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.7
langchain 0.3.25
langchain-community 0.3.9
langchain-core 0.3.61
langchain-text-splitters 0.3.8
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.4.0
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.26.1
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.13
mypy_extensions 1.1.0
narwhals 1.41.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.9.1
numpy 2.1.2
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.18
outcome 1.3.0.post0
OWSLib 0.33.0
packaging 24.1
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 0.5.6
Pebble 5.1.1
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.10
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
protobuf 4.25.3
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.22.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.6
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.23.0
pydantic 2.11.5
pydantic_core 2.33.2
pydantic-settings 2.9.1
Pygments 2.18.0
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.1.4
pyppmd 1.1.1
pyproj 3.7.0
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.2.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 14.0.0
rpds-py 0.25.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2024.9.0
s3transfer 0.10.2
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.13.0
seaborn 0.13.2
selenium 4.33.0
setuptools 74.1.2
shapely 2.0.6
shellingham 1.5.4
six 1.16.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.35
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.6
threadpoolctl 3.5.0
tifffile 2025.5.24
toolz 1.0.0
topojson 1.9
tornado 6.4.1
tqdm 4.67.1
traitlets 5.14.3
trio 0.30.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.16.0
typing_extensions 4.13.2
typing-inspect 0.9.0
typing-inspection 0.4.1
tzdata 2024.2
Unidecode 1.4.0
url-normalize 2.2.1
urllib3 2.4.0
uv 0.7.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.16.0
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2024.9.0
yarl 1.13.1
yellowbrick 1.5
zict 3.0.0
zipp 3.20.2
zstandard 0.23.0

View file history

SHA Date Author Description
48dccf14 2025-01-14 21:45:34 lgaliana Fix bug in modeling section
d4f89590 2024-12-20 14:36:20 lgaliana format fstring R2
8c8ca4c0 2024-12-20 10:45:00 lgaliana Traduction du chapitre clustering
a5ecaedc 2024-12-20 09:36:42 Lino Galiana Traduction du chapitre modélisation (#582)
ff0820bc 2024-11-27 15:10:39 lgaliana Mise en forme chapitre régression
06d003a1 2024-04-23 10:09:22 Lino Galiana Continue la restructuration des sous-parties (#492)
8c316d0a 2024-04-05 19:00:59 Lino Galiana Fix cartiflette deprecated snippets (#487)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
7d12af8b 2023-12-05 10:30:08 linogaliana Modularise la partie import pour l’avoir partout
417fb669 2023-12-04 18:49:21 Lino Galiana Corrections partie ML (#468)
a06a2689 2023-11-23 18:23:28 Antoine Palazzolo 2ème relectures chapitres ML (#457)
889a71ba 2023-11-10 11:40:51 Antoine Palazzolo Modification TP 3 (#443)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
a8f90c2f 2023-08-28 09:26:12 Lino Galiana Update featured paths (#396)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
78ea2cbd 2023-07-20 20:27:31 Lino Galiana Change titles levels (#381)
29ff3f58 2023-07-07 14:17:53 linogaliana description everywhere
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
58c71287 2023-06-11 21:32:03 Lino Galiana change na subset (#362)
2ed4aa78 2022-11-07 15:57:31 Lino Galiana Reprise 2e partie ML + Règle problème mathjax (#319)
f10815b5 2022-08-25 16:00:03 Lino Galiana Notebooks should now look more beautiful (#260)
494a85ae 2022-08-05 14:49:56 Lino Galiana Images featured ✨ (#252)
d201e3cd 2022-08-03 15:50:34 Lino Galiana Pimp la homepage ✨ (#249)
12965bac 2022-05-25 15:53:27 Lino Galiana :launch: Bascule vers quarto (#226)
9c71d6e7 2022-03-08 10:34:26 Lino Galiana Plus d’éléments sur S3 (#218)
c3bf4d42 2021-12-06 19:43:26 Lino Galiana Finalise debug partie ML (#190)
fb14d406 2021-12-06 17:00:52 Lino Galiana Modifie l’import du script (#187)
37ecfa3c 2021-12-06 14:48:05 Lino Galiana Essaye nom différent (#186)
5d0a5e38 2021-12-04 07:41:43 Lino Galiana MAJ URL script recup data (#184)
5c104904 2021-12-03 17:44:08 Lino Galiana Relec (antuki?) partie modelisation (#183)
2a8809fb 2021-10-27 12:05:34 Lino Galiana Simplification des hooks pour gagner en flexibilité et clarté (#166)
2e4d5862 2021-09-02 12:03:39 Lino Galiana Simplify badges generation (#130)
4cdb759c 2021-05-12 10:37:23 Lino Galiana :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc 2021-04-30 21:44:04 Lino Galiana 🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
59eadf58 2020-11-12 16:41:46 Lino Galiana Correction des typos partie ML (#81)
347f50f3 2020-11-12 15:08:18 Lino Galiana Suite de la partie machine learning (#78)
671f75a4 2020-10-21 15:15:24 Lino Galiana Introduction au Machine Learning (#72)

References

Siegfried, André. 1913. Tableau Politique de La France de l’ouest Sous La Troisième république: 102 Cartes Et Croquis, 1 Carte Hors Texte. A. Colin.

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.