An introduction to regression

Linear regression is often the first statistical model encountered in a quantitative curriculum. It is an intuitive and rich method, and machine learning lets us approach it differently from econometrics. With scikit-learn and statsmodels, we have all the tools needed to satisfy both data scientists and economists.

Modeling
Author

Lino Galiana

Published

2025-05-26


The previous chapter aimed to propose a first model for understanding the counties where the Republican Party wins. The variable of interest was binary (win or lose), placing us within the framework of a classification model.

Now, using the same data, we will propose a regression model to explain the Republican Party’s score. The variable of interest is thus continuous. We will ignore the fact that it is bounded between 0 and 100, which means that, to be fully rigorous, we would need to transform the scale so that predictions fall within this interval.

This chapter uses the same dataset as before, presented in the introduction to this section: US presidential election results crossed with socio-demographic variables. The code is available on GitHub.

!pip install --upgrade xlrd  # Colab workaround: xlrd version bug
!pip install geopandas
import requests

url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
open('getdata.py', 'wb').write(r.content)

import getdata
votes = getdata.create_votes_dataframes()

This chapter will use several modeling packages, the main ones being scikit-learn and statsmodels. Here is a suggested import block for all of them.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import sklearn.metrics
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

1 General Principle

The general principle of regression consists of finding a law \(h_\theta(X)\) such that

\[ h_\theta(X) = \mathbb{E}_\theta(Y|X) \]

This formalization is extremely general and is not limited to linear regression.

In econometrics, regression offers an alternative to maximum likelihood methods and moment methods. Regression encompasses a very broad range of methods, depending on the family of models (parametric, non-parametric, etc.) and model structures.

1.1 Linear Regression

This is the simplest way to represent the law \(h_\theta(X)\) as a linear combination of variables \(X\) and parameters \(\theta\). In this case,

\[ \mathbb{E}_\theta(Y|X) = X\beta \]

This relationship is, under this formulation, theoretical. It must be estimated from the observed data \(y\). The method of least squares aims to minimize the quadratic error between the prediction and the observed data (which explains why regression can be seen as a Machine Learning problem). In general, the method of least squares seeks to find the set of parameters \(\theta\) such that

\[ \theta = \arg \min_{\theta \in \Theta} \mathbb{E}\bigg[ \left( y - h_\theta(X) \right)^2 \bigg] \]

Which, in the context of linear regression, is expressed as follows:

\[ \beta = \arg\min_{\beta} \mathbb{E}\bigg[ \left( y - X\beta \right)^2 \bigg] \]

When the theoretical model (\(\mathbb{E}_\theta(Y|X) = X\beta\)) is applied to data, the model is formalized as follows:

\[ Y = X\beta + \epsilon \]

With a certain distribution of the noise \(\epsilon\) that depends on the assumptions made. For example, with \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) i.i.d., the least-squares estimator of \(\beta\) coincides with the maximum likelihood estimator, whose asymptotic theory ensures consistency and asymptotic efficiency (the Cramér-Rao bound).
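The least-squares problem above has a well-known closed-form solution, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). A minimal sketch on synthetic data (the coefficients below are arbitrary values chosen for illustration):

```python
import numpy as np

# Synthetic data: y = X @ beta + noise, with an illustrative beta
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form least-squares solution: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 2.0, -0.5]
```

In practice, libraries like scikit-learn solve this numerically in a more stable way, but the normal equations are the underlying principle.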

1.1.1 Application

Following in the footsteps of Siegfried (1913), our objective in this chapter is to explain and predict the Republican score from a few socioeconomic variables. Unlike the previous chapter, where we focused on a binary outcome (Republican victory or defeat), this time we model the Republican score directly.

The next exercise aims to demonstrate how to perform linear regression using scikit-learn. In this area, statsmodels is significantly more comprehensive, as the following exercise will demonstrate. The main advantage of performing regressions with scikit-learn is the ability to compare the results of linear regression with other regression models when selecting the best predictive model.

Exercise 1a: Linear Regression with scikit-learn
  1. Using a few variables, for example, ‘Unemployment_rate_2019’, ‘Median_Household_Income_2021’, ‘Percent of adults with less than a high school diploma, 2018-22’, “Percent of adults with a bachelor’s degree or higher, 2018-22”, explain the variable per_gop using a training sample X_train prepared beforehand.

⚠️ Use the variable Median_Household_Income_2021 in log form; otherwise, its scale might dominate and obscure other effects.

  2. Display the values of the coefficients, including the constant.

  3. Evaluate the relevance of the model using \(R^2\) and assess the fit quality with the MSE.

  4. Plot a scatter plot of observed values and prediction errors. Do you observe any specification issues?
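A sketch of the workflow the exercise calls for, using synthetic data as a stand-in for the votes dataframe (the real exercise uses the socioeconomic columns listed above and per_gop as the target; the coefficients here are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = 0.5 + X @ np.array([0.3, -0.2, 0.1]) + rng.normal(scale=0.05, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Question 1-2: fit and inspect coefficients, including the constant
model = LinearRegression().fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Question 3: R2 and MSE on the test sample
y_pred = model.predict(X_test)
print("R2 :", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

# Question 4: errors against observed values, to eyeball specification issues
errors = y_test - y_pred
plt.scatter(y_test, errors, s=5)
plt.xlabel("Observed values")
plt.ylabel("Prediction error")
```

On well-specified synthetic data the errors show no pattern; on the votes data they will not look so random, which is precisely the point of question 4.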

In question 4, it can be observed that the distribution of errors is clearly not random with respect to \(X\).

The model therefore suffers from a specification issue, so work will need to be done on the selected variables later. Before that, we can redo this exercise using the statsmodels package.

Exercise 1b: Linear Regression with statsmodels

This exercise aims to demonstrate how to perform linear regression using statsmodels, which offers features more similar to those of R and less oriented toward Machine Learning.

The goal is still to explain the Republican score based on a few variables.

  1. Using a few variables, for example, ‘Unemployment_rate_2019’, ‘Median_Household_Income_2021’, ‘Percent of adults with less than a high school diploma, 2015-19’, “Percent of adults with a bachelor’s degree or higher, 2015-19”, explain the variable per_gop.
    ⚠️ Use the variable Median_Household_Income_2021 in log form; otherwise, its scale might dominate and obscure other effects.

  2. Display a regression table.

  3. Evaluate the model’s relevance using \(R^2\).

  4. Use the formula API to regress the Republican score as a function of the variable Unemployment_rate_2019, Unemployment_rate_2019 squared, and the log of Median_Household_Income_2021.

R2:  0.4310933195576123
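A sketch of question 4 with the formula API, again on synthetic stand-in data (the column names income and unemployment are illustrative, not the actual columns of the votes dataframe, and the coefficients are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with a quadratic and a log effect
rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "unemployment": rng.uniform(2, 15, size=n),
    "income": rng.uniform(30_000, 120_000, size=n),
})
df["per_gop"] = (
    60 - 1.5 * df["unemployment"] + 0.05 * df["unemployment"] ** 2
    - 4 * np.log(df["income"]) + rng.normal(scale=2, size=n)
)

# Formula API: transformations are written directly in the formula;
# I(...) protects arithmetic such as squaring
model = smf.ols(
    "per_gop ~ unemployment + I(unemployment**2) + np.log(income)", data=df
).fit()
print(model.summary())        # full regression table (question 2)
print("R2:", model.rsquared)  # model fit (question 3)
```

The ability to write transformations inside the formula, as in R, is one of the main conveniences of statsmodels over scikit-learn for this kind of specification.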
Tip

To generate a well-formatted table for a report in \(\LaTeX\), you can use the method Summary.as_latex. For an HTML report, you can use Summary.as_html.

Note

Users of R will find many familiar features in statsmodels, particularly the ability to use a formula to define a regression. The philosophy of statsmodels is similar to that which influenced the construction of R’s stats and MASS packages: providing a general-purpose library with a wide range of models.

However, statsmodels benefits from being more modern compared to R’s packages. Since the 1990s, R packages aiming to provide missing features in stats and MASS have proliferated, while statsmodels, born in the 2010s, only had to propose a general framework (the generalized estimating equations) to encompass these models.

1.2 Logistic Regression

We applied our linear regression to a continuous outcome variable. How do we handle a binary distribution?
In this case, \(\mathbb{E}_{\theta} (Y|X) = \mathbb{P}_{\theta} (Y = 1|X)\).
Logistic regression is a generalized linear model: it assumes the log-odds of the outcome are linear in \(X\):

\[ \text{logit}\bigg(\mathbb{E}_{\theta}(Y|X)\bigg) = \text{logit}\bigg(\mathbb{P}_{\theta}(Y = 1|X)\bigg) = X\beta \]

The \(\text{logit}\) function is \(]0,1[ \to \mathbb{R}: p \mapsto \log(\frac{p}{1-p})\).

It maps a probability into \(\mathbb{R}\). Its inverse is the sigmoid function (\(x \mapsto \frac{1}{1 + e^{-x}}\)), a central concept in deep learning.
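A quick numerical check that the logit and the sigmoid are inverses, using the implementations available in scipy (where the sigmoid is called expit):

```python
import numpy as np
from scipy.special import expit, logit  # sigmoid and its inverse

p = np.array([0.1, 0.5, 0.9])
log_odds = logit(p)          # maps ]0,1[ into R
round_trip = expit(log_odds) # maps back to the original probabilities
print(log_odds)
print(round_trip)
```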

It should be noted that probabilities are not observed; what is observed is the binary outcome (0/1). This leads to two different perspectives on logistic regression:

  • In econometrics, interest lies in the latent model that determines the choice of the outcome. For example, if observing the choice to participate in the labor market, the goal is to model the factors determining this choice;
  • In Machine Learning, the latent model is only necessary to classify observations into the correct category.

Parameter estimation for \(\beta\) can be performed by maximum likelihood or by iteratively reweighted least squares, the two approaches being equivalent under certain assumptions. Unlike linear regression, there is no closed-form solution, so the estimate is obtained numerically.

Exercise 2a: Logistic Regression with scikit-learn

Using scikit-learn with training and test samples:

  1. Evaluate the effect of the already-used variables on the probability of Republicans winning. Display the values of the coefficients.
  2. Derive a confusion matrix and a measure of model quality.
  3. Remove regularization using the penalty parameter. What effect does this have on the estimated parameters?
Exercise 2b: Logistic Regression with statsmodels

Using training and test samples:

  1. Evaluate the effect of the already-used variables on the probability of Republicans winning.
  2. Perform a likelihood ratio test regarding the inclusion of the (log)-income variable.

The p-value of the likelihood ratio test being close to 1 means that we cannot reject the restricted model: the log-income variable adds almost no information to the model.

Tip

The test statistic is: \[ LR = -2\log\bigg(\frac{\mathcal{L}_{\theta_0}}{\mathcal{L}_{\theta}}\bigg) = -2(\ell_{\theta_0} - \ell_{\theta}) \] where \(\theta_0\) denotes the restricted model and \(\theta\) the unrestricted one. Under the null hypothesis, it follows a \(\chi^2\) distribution with as many degrees of freedom as there are restrictions.
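A sketch of a likelihood ratio test with statsmodels, on synthetic data where x1 genuinely matters and x2 does not (both variables and coefficients are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Synthetic binary outcome: only x1 enters the true model
rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + 1.2 * df["x1"])))
df["y"] = (rng.uniform(size=n) < p).astype(int)

full = smf.logit("y ~ x1 + x2", data=df).fit(disp=0)

# Test the irrelevant variable x2: LR = -2 (llf_restricted - llf_full)
restricted_x2 = smf.logit("y ~ x1", data=df).fit(disp=0)
lr_x2 = -2 * (restricted_x2.llf - full.llf)
p_x2 = chi2.sf(lr_x2, df=1)  # one restriction -> one degree of freedom

# Same test for the relevant variable x1
restricted_x1 = smf.logit("y ~ x2", data=df).fit(disp=0)
lr_x1 = -2 * (restricted_x1.llf - full.llf)
p_x1 = chi2.sf(lr_x1, df=1)

print("Drop x2: LR =", lr_x2, "p-value =", p_x2)  # large p-value
print("Drop x1: LR =", lr_x1, "p-value =", p_x1)  # tiny p-value
```

A large p-value (fail to reject the restricted model) is what the exercise above interprets as the variable adding almost no information.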

2 Going Further

This chapter only introduces the concepts of regression in a very introductory way. To expand on this, it is recommended to explore further based on your interests and needs.

In the field of machine learning, the main areas for deeper exploration are:

  • Alternative regression models like random forests.
  • Boosting and bagging methods, to learn how multiple models can be trained jointly and their predictions combined, by majority vote or averaging, to reach better decisions than any single model.
  • Issues related to model explainability, a very active research area, to better understand the decision criteria of models.

In the field of econometrics, the main areas for deeper exploration are:

  • Generalized linear models to explore regression with more general assumptions than those we have made so far;
  • Hypothesis testing to delve deeper into these questions beyond our likelihood ratio test.


Additional information

This section lists the environment the code in this chapter was tested with.

Latest built version: 2025-05-26

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.15.1
aiohappyeyeballs 2.4.3
aiohttp 3.10.8
aioitertools 0.12.0
aiosignal 1.3.1
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.9.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 24.2.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.3.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.35.23
botocore 1.35.23
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.1.9
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.3
certifi 2025.4.26
cffi 1.17.1
charset-normalizer 3.3.2
chromedriver-autoinstaller 0.6.4
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.0
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.1.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.14.0
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.19.6
fontawesomefree 6.6.0
fonttools 4.54.1
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.4.1
fsspec 2024.9.0
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.16.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.9
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.5.0
importlib_resources 6.4.5
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.4
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.7
langchain 0.3.25
langchain-community 0.3.9
langchain-core 0.3.61
langchain-text-splitters 0.3.8
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.4.0
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.26.1
matplotlib 3.9.2
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.13
mypy_extensions 1.1.0
narwhals 1.41.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.9.1
numpy 2.1.2
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.18
outcome 1.3.0.post0
OWSLib 0.33.0
packaging 24.1
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 0.5.6
Pebble 5.1.1
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.4.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.10
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
protobuf 4.25.3
psutil 6.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.22.0
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.6
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.23.0
pydantic 2.11.5
pydantic_core 2.33.2
pydantic-settings 2.9.1
Pygments 2.18.0
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.1.4
pyppmd 1.1.1
pyproj 3.7.0
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.2.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 14.0.0
rpds-py 0.25.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2024.9.0
s3transfer 0.10.2
scikit-image 0.24.0
scikit-learn 1.5.2
scipy 1.13.0
seaborn 0.13.2
selenium 4.33.0
setuptools 74.1.2
shapely 2.0.6
shellingham 1.5.4
six 1.16.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.35
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.6
threadpoolctl 3.5.0
tifffile 2025.5.24
toolz 1.0.0
topojson 1.9
tornado 6.4.1
tqdm 4.67.1
traitlets 5.14.3
trio 0.30.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.16.0
typing_extensions 4.13.2
typing-inspect 0.9.0
typing-inspection 0.4.1
tzdata 2024.2
Unidecode 1.4.0
url-normalize 2.2.1
urllib3 2.4.0
uv 0.7.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.16.0
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2024.9.0
yarl 1.13.1
yellowbrick 1.5
zict 3.0.0
zipp 3.20.2
zstandard 0.23.0

View file history

SHA Date Author Description
48dccf14 2025-01-14 21:45:34 lgaliana Fix bug in modeling section
d4f89590 2024-12-20 14:36:20 lgaliana format fstring R2
8c8ca4c0 2024-12-20 10:45:00 lgaliana Traduction du chapitre clustering
a5ecaedc 2024-12-20 09:36:42 Lino Galiana Traduction du chapitre modélisation (#582)
ff0820bc 2024-11-27 15:10:39 lgaliana Mise en forme chapitre régression
06d003a1 2024-04-23 10:09:22 Lino Galiana Continue la restructuration des sous-parties (#492)
8c316d0a 2024-04-05 19:00:59 Lino Galiana Fix cartiflette deprecated snippets (#487)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
7d12af8b 2023-12-05 10:30:08 linogaliana Modularise la partie import pour l’avoir partout
417fb669 2023-12-04 18:49:21 Lino Galiana Corrections partie ML (#468)
a06a2689 2023-11-23 18:23:28 Antoine Palazzolo 2ème relectures chapitres ML (#457)
889a71ba 2023-11-10 11:40:51 Antoine Palazzolo Modification TP 3 (#443)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
a8f90c2f 2023-08-28 09:26:12 Lino Galiana Update featured paths (#396)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
78ea2cbd 2023-07-20 20:27:31 Lino Galiana Change titles levels (#381)
29ff3f58 2023-07-07 14:17:53 linogaliana description everywhere
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
58c71287 2023-06-11 21:32:03 Lino Galiana change na subset (#362)
2ed4aa78 2022-11-07 15:57:31 Lino Galiana Reprise 2e partie ML + Règle problème mathjax (#319)
f10815b5 2022-08-25 16:00:03 Lino Galiana Notebooks should now look more beautiful (#260)
494a85ae 2022-08-05 14:49:56 Lino Galiana Images featured ✨ (#252)
d201e3cd 2022-08-03 15:50:34 Lino Galiana Pimp la homepage ✨ (#249)
12965bac 2022-05-25 15:53:27 Lino Galiana :launch: Bascule vers quarto (#226)
9c71d6e7 2022-03-08 10:34:26 Lino Galiana Plus d’éléments sur S3 (#218)
c3bf4d42 2021-12-06 19:43:26 Lino Galiana Finalise debug partie ML (#190)
fb14d406 2021-12-06 17:00:52 Lino Galiana Modifie l’import du script (#187)
37ecfa3c 2021-12-06 14:48:05 Lino Galiana Essaye nom différent (#186)
5d0a5e38 2021-12-04 07:41:43 Lino Galiana MAJ URL script recup data (#184)
5c104904 2021-12-03 17:44:08 Lino Galiana Relec (antuki?) partie modelisation (#183)
2a8809fb 2021-10-27 12:05:34 Lino Galiana Simplification des hooks pour gagner en flexibilité et clarté (#166)
2e4d5862 2021-09-02 12:03:39 Lino Galiana Simplify badges generation (#130)
4cdb759c 2021-05-12 10:37:23 Lino Galiana :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc 2021-04-30 21:44:04 Lino Galiana 🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
59eadf58 2020-11-12 16:41:46 Lino Galiana Correction des typos partie ML (#81)
347f50f3 2020-11-12 15:08:18 Lino Galiana Suite de la partie machine learning (#78)
671f75a4 2020-10-21 15:15:24 Lino Galiana Introduction au Machine Learning (#72)

References

Siegfried, André. 1913. Tableau Politique de La France de l’ouest Sous La Troisième république: 102 Cartes Et Croquis, 1 Carte Hors Texte. A. Colin.

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.