Python for data science

Lino Galiana

doi:10.5281/zenodo.8229676

This chapter uses the same dataset as the rest of this part, presented in its introduction: vote data from US presidential elections merged with sociodemographic variables. The code is available on Github.

!pip install --upgrade xlrd  # Colab bug with the xlrd version
!pip install geopandas

import requests

url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
# Save the downloaded helper script locally and close the file properly
with open('getdata.py', 'wb') as f:
    f.write(r.content)

import getdata
votes = getdata.create_votes_dataframes()

Until now, we have assumed that the variables useful for predicting the Republican vote were known to the modeler. We have therefore used only a limited subset of the variables available in our data. Beyond the computational burden of building a model with a large number of variables, choosing a restricted number of variables (a parsimonious model) also limits the risk of overfitting.

How, then, should we choose the right number of variables and the best combination of them? Many methods exist, including:

relying on statistical performance criteria that penalize non-parsimonious models, for example the BIC;
backward elimination techniques;
building models whose objective function penalizes the lack of parsimony (the approach we will take here).

In this chapter, we present the main issues in variable selection through the LASSO.

We will use the following functions and packages:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
import sklearn.metrics
from sklearn.linear_model import LinearRegression
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
import seaborn as sns

1 Principle of the LASSO

1.1 General principle

The class of feature selection models is very broad and covers a very diverse set of models. We will focus on the LASSO (Least Absolute Shrinkage and Selection Operator), an extension of linear regression that aims to select sparse models. This type of model is central in the field of compressed sensing (where the term L1-regularization is used rather than LASSO). The LASSO is a special case of elastic-net regression, another well-known special case being ridge regression. Unlike classical linear regression, these methods also work in a setting where \(p>N\), that is, where the number of regressors is very large because it exceeds the number of observations.
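To make the \(p>N\) case concrete, here is a minimal sketch on synthetic data. The data-generating process, the dimensions and the value of alpha are all assumptions chosen for illustration; they are not taken from the election dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): N = 50 observations, p = 200 regressors,
# of which only the first 5 truly matter.
rng = np.random.default_rng(0)
N, p = 50, 200
X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:5] = [4.0, -3.0, 3.0, -2.5, 2.0]
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# The alpha value is an arbitrary choice for this illustration
model = Lasso(alpha=0.1, max_iter=50_000).fit(X, y)
n_nonzero = int(np.sum(model.coef_ != 0))
print(f"{n_nonzero} nonzero coefficients out of {p}")
```

Even though there are more regressors than observations, the fit returns a coefficient vector in which most entries are exactly zero.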

1.2 Penalization

By adopting a penalized objective function, the LASSO sets a certain number of coefficients to exactly 0. The variables whose coefficient is nonzero thus pass the selection test.

Tip

The LASSO is a constrained optimization program. We look for the estimator \(\beta\) that minimizes the quadratic error (as in linear regression) under an additional constraint that regularizes the parameters: \[ \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) \\ \text{s.t. } \sum_{j=1}^p |\beta_j| \leq t \]

This program can be reformulated using the Lagrangian, which yields a more tractable minimization program:

\[ \beta^{\text{LASSO}} = \arg \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) + \alpha \sum_{j=1}^p |\beta_j| = \arg \min_{\beta} ||y-X\beta||_{2}^{2} + \lambda ||\beta||_1 \]

where \(\lambda\) is a rewriting of the previous regularization parameter and depends on \(\alpha\). The strength of the penalty applied to non-parsimonious models depends on this parameter.
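The role of the penalty strength can be illustrated with a short sketch on synthetic data (the data and the grid of alpha values are assumptions made for this example). In scikit-learn, Lasso minimizes \(\frac{1}{2n}||y-X\beta||_2^2 + \alpha ||\beta||_1\), so its alpha argument plays the role of the penalty parameter above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 200 observations, 20 regressors,
# only the first 3 have a true nonzero effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
beta_true = np.zeros(20)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(scale=0.5, size=200)

# Count the nonzero coefficients for increasing penalty strengths
n_selected = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_selected[alpha] = int(np.sum(model.coef_ != 0))

print(n_selected)
```

As alpha grows, fewer variables survive; with a large enough penalty, every coefficient is set to zero and the model reduces to its intercept.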

1.3 First LASSO regression

Since we want to find the best predictors of the Republican vote, we will remove the variables that can be derived directly from it: the competitors' scores!

import pandas as pd

df2 = pd.DataFrame(votes.drop(columns='geometry'))
df2 = df2.loc[
  :,
  ~df2.columns.str.endswith(
    ('_democrat','_green','_other', 'winner', 'per_point_diff', 'per_dem')
    )
  ]


df2 = df2.loc[:,~df2.columns.duplicated()]

In this exercise, we will also use a function to extract the variables selected by the LASSO. Here it is:

Function to retrieve the variables retained by the selection step

from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

def extract_features_selected(lasso: Pipeline, preprocessing_step_name: str = 'preprocess') -> pd.Series:
    """
    Extracts selected features based on the coefficients obtained from Lasso regression.

    Parameters:
    - lasso (Pipeline): The scikit-learn pipeline containing a trained Lasso regression model.
    - preprocessing_step_name (str): The name of the preprocessing step in the pipeline. Default is 'preprocess'.

    Returns:
    - pd.Series: A Pandas Series containing selected features with non-zero coefficients.
    """
    # Check if lasso object is provided
    if not isinstance(lasso, Pipeline):
        raise ValueError("The provided lasso object is not a scikit-learn pipeline.")

    # Extract the final estimator from the pipeline
    lasso_model = lasso[-1]

    # Check if lasso_model is a Lasso regression model
    if not isinstance(lasso_model, Lasso):
        raise ValueError("The final step of the pipeline is not a Lasso regression model.")

    # Check if lasso model has 'coef_' attribute
    if not hasattr(lasso_model, 'coef_'):
        raise ValueError("The provided Lasso regression model does not have 'coef_' attribute. "
                         "Make sure it is a trained Lasso regression model.")

    # Get feature names from the preprocessing step
    features_preprocessing = lasso[preprocessing_step_name].get_feature_names_out()

    # Extract selected features based on non-zero coefficients
    features_selec = pd.Series(features_preprocessing[np.abs(lasso_model.coef_) > 0])

    return features_selec

Exercise 1: First LASSO

We are still trying to predict the variable per_gop. Before estimating the model, we will create a few intermediate objects that will be used to define our pipeline:

1. In our DataFrame, replace infinite values with NaN.
2. Create a training sample and a test sample.

We can now move on to the core of the pipeline definition. This example can serve as a source of inspiration, as can this one.

3. First create the preprocessing steps for our model. It is customary to separate the steps applied to continuous numeric variables from those applied to the other variables, called categorical.

For numeric variables, impute with the mean, then standardize;
For categorical variables, linear regression techniques require a one-hot encoding expansion. Before this one-hot encoding, impute with the most frequent value.

4. Finalize the pipeline by adding the estimation step, then fit a LASSO model penalized with \(\alpha = 0.1\).

Assuming your pipeline is stored in an object named pipeline and that its last step is named model, you can access this step directly with pipeline['model'].

5. Display the coefficient values. Which variables have a nonzero value?
6. Show that the selected variables are sometimes highly correlated.
7. Compare the performance of this parsimonious model with that of a model with more variables.
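One possible way to approach the last comparison is sketched below on synthetic data rather than on the election dataset. The data, the alpha value and the choice of refitting an ordinary linear regression on the selected columns are assumptions of this sketch, not the exercise's reference solution.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 30 regressors, only the first two matter
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 30))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Step 1: variable selection with the LASSO
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
kept = np.flatnonzero(lasso.coef_)

# Step 2: compare a sparse model with a model using every variable
sparse_ols = LinearRegression().fit(X_train[:, kept], y_train)
full_ols = LinearRegression().fit(X_train, y_train)

rmse_sparse = np.sqrt(mean_squared_error(y_test, sparse_ols.predict(X_test[:, kept])))
rmse_full = np.sqrt(mean_squared_error(y_test, full_ols.predict(X_test)))
print(f"{kept.size} variables kept, RMSE sparse={rmse_sparse:.3f}, full={rmse_full:.3f}")
```

Both errors are computed on the test sample, so the comparison reflects out-of-sample performance rather than in-sample fit.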

Hint for question 1

# Replace infinite values with NaN
df2.replace([np.inf, -np.inf], np.nan, inplace=True)
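The first two questions can be sketched on a toy DataFrame. The column ratio, the values and the split proportion are assumptions for illustration; only per_gop is a name taken from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): one target and one feature containing infinite values
toy = pd.DataFrame({
    "per_gop": [0.4, 0.6, 0.55, 0.3, 0.7],
    "ratio": [1.0, np.inf, 2.5, -np.inf, 0.5],
})

# Question 1: infinite values become NaN so that an imputer can handle them later
toy = toy.replace([np.inf, -np.inf], np.nan)

# Question 2: separate features and target, then split into train and test samples
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="per_gop"), toy["per_gop"], test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible from one run to the next.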

Hint for question 3

A pipeline definition follows this structure:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

numeric_pipeline = Pipeline(steps=[
    ('impute', ...),  # define the imputation method here
    ('scale', ...)    # define the standardization method here
])

categorical_pipeline = ...  # adapt the template above

# It is up to you to define numerical_features and categorical_features beforehand
preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

The preprocessing pipeline (question 3) takes the following form:

ColumnTransformer(transformers=[('number',
                                 Pipeline(steps=[('impute', SimpleImputer()),
                                                 ('scale', StandardScaler())]),
                                 ['ALAND', 'AWATER', 'votes_gop', 'votes_dem',
                                  'total_votes', 'diff', 'FIPS_y',
                                  'Rural_Urban_Continuum_Code_2013',
                                  'Rural_Urban_Continuum_Code_2023',
                                  'Urban_Influence_2013',
                                  'Economic_typology_2015', 'CENSUS_2020_POP',
                                  'ESTIMATES_BASE_2020', 'POP_EST...
                                  'DEATHS_2020', 'DEATHS_2021', 'DEATHS_2022',
                                  'DEATHS_2023', 'NATURAL_CHG_2020', ...]),
                                ('category',
                                 Pipeline(steps=[('impute',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('one-hot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 ['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID',
                                  'GEOID', 'NAME', 'LSAD', 'FIPS_x',
                                  'state_name', 'county_fips', 'county_name',
                                  'State', 'Area_Name', 'FIPS'])])

/home/runner/work/python-datascientist/python-datascientist/.venv/lib/python3.12/site-packages/sklearn/impute/_base.py:635: UserWarning:

Skipping features without any observed values: ['POV04_2021' 'CI90LB04_2021' 'CI90UB04_2021' 'PCTPOV04_2021'
 'CI90LB04P_2021' 'CI90UB04P_2021']. At least one non-missing value is needed for imputation with strategy='mean'.

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('number',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer()),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['ALAND', 'AWATER',
                                                   'votes_gop', 'votes_dem',
                                                   'total_votes', 'diff',
                                                   'FIPS_y',
                                                   'Rural_Urban_Continuum_Code_2013',
                                                   'Rural_Urban_Continuum_Code_2023',
                                                   'Urban_Influence_2013',
                                                   'Economic_typology_2015',
                                                   'CENSUS_2020_POP',...
                                                   'NATURAL_CHG_2020', ...]),
                                                 ('category',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['STATEFP', 'COUNTYFP',
                                                   'COUNTYNS', 'AFFGEOID',
                                                   'GEOID', 'NAME', 'LSAD',
                                                   'FIPS_x', 'state_name',
                                                   'county_fips', 'county_name',
                                                   'State', 'Area_Name',
                                                   'FIPS'])])),
                ('model', Lasso(alpha=0.1))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

?Documentation for PipelineiFitted

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('number',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer()),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['ALAND', 'AWATER',
                                                   'votes_gop', 'votes_dem',
                                                   'total_votes', 'diff',
                                                   'FIPS_y',
                                                   'Rural_Urban_Continuum_Code_2013',
                                                   'Rural_Urban_Continuum_Code_2023',
                                                   'Urban_Influence_2013',
                                                   'Economic_typology_2015',
                                                   'CENSUS_2020_POP',...
                                                   'NATURAL_CHG_2020', ...]),
                                                 ('category',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['STATEFP', 'COUNTYFP',
                                                   'COUNTYNS', 'AFFGEOID',
                                                   'GEOID', 'NAME', 'LSAD',
                                                   'FIPS_x', 'state_name',
                                                   'county_fips', 'county_name',
                                                   'State', 'Area_Name',
                                                   'FIPS'])])),
                ('model', Lasso(alpha=0.1))])

preprocess: ColumnTransformer

?Documentation for preprocess: ColumnTransformer

ColumnTransformer(transformers=[('number',
                                 Pipeline(steps=[('impute', SimpleImputer()),
                                                 ('scale', StandardScaler())]),
                                 ['ALAND', 'AWATER', 'votes_gop', 'votes_dem',
                                  'total_votes', 'diff', 'FIPS_y',
                                  'Rural_Urban_Continuum_Code_2013',
                                  'Rural_Urban_Continuum_Code_2023',
                                  'Urban_Influence_2013',
                                  'Economic_typology_2015', 'CENSUS_2020_POP',
                                  'ESTIMATES_BASE_2020', 'POP_EST...
                                  'DEATHS_2020', 'DEATHS_2021', 'DEATHS_2022',
                                  'DEATHS_2023', 'NATURAL_CHG_2020', ...]),
                                ('category',
                                 Pipeline(steps=[('impute',
                                                  SimpleImputer(strategy='most_frequent')),
                                                 ('one-hot',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False))]),
                                 ['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID',
                                  'GEOID', 'NAME', 'LSAD', 'FIPS_x',
                                  'state_name', 'county_fips', 'county_name',
                                  'State', 'Area_Name', 'FIPS'])])

number

['ALAND', 'AWATER', 'votes_gop', 'votes_dem', 'total_votes', 'diff', 'FIPS_y', 'Rural_Urban_Continuum_Code_2013', 'Rural_Urban_Continuum_Code_2023', 'Urban_Influence_2013', 'Economic_typology_2015', 'CENSUS_2020_POP', 'ESTIMATES_BASE_2020', 'POP_ESTIMATE_2020', 'POP_ESTIMATE_2021', 'POP_ESTIMATE_2022', 'POP_ESTIMATE_2023', 'N_POP_CHG_2020', 'N_POP_CHG_2021', 'N_POP_CHG_2022', 'N_POP_CHG_2023', 'BIRTHS_2020', 'BIRTHS_2021', 'BIRTHS_2022', 'BIRTHS_2023', 'DEATHS_2020', 'DEATHS_2021', 'DEATHS_2022', 'DEATHS_2023', 'NATURAL_CHG_2020', 'NATURAL_CHG_2021', 'NATURAL_CHG_2022', 'NATURAL_CHG_2023', 'INTERNATIONAL_MIG_2020', 'INTERNATIONAL_MIG_2021', 'INTERNATIONAL_MIG_2022', 'INTERNATIONAL_MIG_2023', 'DOMESTIC_MIG_2020', 'DOMESTIC_MIG_2021', 'DOMESTIC_MIG_2022', 'DOMESTIC_MIG_2023', 'NET_MIG_2020', 'NET_MIG_2021', 'NET_MIG_2022', 'NET_MIG_2023', 'RESIDUAL_2020', 'RESIDUAL_2021', 'RESIDUAL_2022', 'RESIDUAL_2023', 'GQ_ESTIMATES_BASE_2020', 'GQ_ESTIMATES_2020', 'GQ_ESTIMATES_2021', 'GQ_ESTIMATES_2022', 'GQ_ESTIMATES_2023', 'R_BIRTH_2021', 'R_BIRTH_2022', 'R_BIRTH_2023', 'R_DEATH_2021', 'R_DEATH_2022', 'R_DEATH_2023', 'R_NATURAL_CHG_2021', 'R_NATURAL_CHG_2022', 'R_NATURAL_CHG_2023', 'R_INTERNATIONAL_MIG_2021', 'R_INTERNATIONAL_MIG_2022', 'R_INTERNATIONAL_MIG_2023', 'R_DOMESTIC_MIG_2021', 'R_DOMESTIC_MIG_2022', 'R_DOMESTIC_MIG_2023', 'R_NET_MIG_2021', 'R_NET_MIG_2022', 'R_NET_MIG_2023', '2003 Urban Influence Code', '2013 Urban Influence Code', '2013 Rural-urban Continuum Code', '2023 Rural-urban Continuum Code', 'Less than a high school diploma, 1970', 'High school diploma only, 1970', 'Some college (1-3 years), 1970', 'Four years of college or higher, 1970', 'Percent of adults with less than a high school diploma, 1970', 'Percent of adults with a high school diploma only, 1970', 'Percent of adults completing some college (1-3 years), 1970', 'Percent of adults completing four years of college or higher, 1970', 'Less than a high school diploma, 1980', 'High school diploma only, 
1980', 'Some college (1-3 years), 1980', 'Four years of college or higher, 1980', 'Percent of adults with less than a high school diploma, 1980', 'Percent of adults with a high school diploma only, 1980', 'Percent of adults completing some college (1-3 years), 1980', 'Percent of adults completing four years of college or higher, 1980', 'Less than a high school diploma, 1990', 'High school diploma only, 1990', "Some college or associate's degree, 1990", "Bachelor's degree or higher, 1990", 'Percent of adults with less than a high school diploma, 1990', 'Percent of adults with a high school diploma only, 1990', "Percent of adults completing some college or associate's degree, 1990", "Percent of adults with a bachelor's degree or higher, 1990", 'Less than a high school diploma, 2000', 'High school diploma only, 2000', "Some college or associate's degree, 2000", "Bachelor's degree or higher, 2000", 'Percent of adults with less than a high school diploma, 2000', 'Percent of adults with a high school diploma only, 2000', "Percent of adults completing some college or associate's degree, 2000", "Percent of adults with a bachelor's degree or higher, 2000", 'Less than a high school diploma, 2008-12', 'High school diploma only, 2008-12', "Some college or associate's degree, 2008-12", "Bachelor's degree or higher, 2008-12", 'Percent of adults with less than a high school diploma, 2008-12', 'Percent of adults with a high school diploma only, 2008-12', "Percent of adults completing some college or associate's degree, 2008-12", "Percent of adults with a bachelor's degree or higher, 2008-12", 'Less than a high school diploma, 2018-22', 'High school diploma only, 2018-22', "Some college or associate's degree, 2018-22", "Bachelor's degree or higher, 2018-22", 'Percent of adults with less than a high school diploma, 2018-22', 'Percent of adults with a high school diploma only, 2018-22', "Percent of adults completing some college or associate's degree, 2018-22", "Percent of adults 
with a bachelor's degree or higher, 2018-22", 'Urban_Influence_Code_2013', 'Metro_2013', 'Civilian_labor_force_2000', 'Employed_2000', 'Unemployed_2000', 'Unemployment_rate_2000', 'Civilian_labor_force_2001', 'Employed_2001', 'Unemployed_2001', 'Unemployment_rate_2001', 'Civilian_labor_force_2002', 'Employed_2002', 'Unemployed_2002', 'Unemployment_rate_2002', 'Civilian_labor_force_2003', 'Employed_2003', 'Unemployed_2003', 'Unemployment_rate_2003', 'Civilian_labor_force_2004', 'Employed_2004', 'Unemployed_2004', 'Unemployment_rate_2004', 'Civilian_labor_force_2005', 'Employed_2005', 'Unemployed_2005', 'Unemployment_rate_2005', 'Civilian_labor_force_2006', 'Employed_2006', 'Unemployed_2006', 'Unemployment_rate_2006', 'Civilian_labor_force_2007', 'Employed_2007', 'Unemployed_2007', 'Unemployment_rate_2007', 'Civilian_labor_force_2008', 'Employed_2008', 'Unemployed_2008', 'Unemployment_rate_2008', 'Civilian_labor_force_2009', 'Employed_2009', 'Unemployed_2009', 'Unemployment_rate_2009', 'Civilian_labor_force_2010', 'Employed_2010', 'Unemployed_2010', 'Unemployment_rate_2010', 'Civilian_labor_force_2011', 'Employed_2011', 'Unemployed_2011', 'Unemployment_rate_2011', 'Civilian_labor_force_2012', 'Employed_2012', 'Unemployed_2012', 'Unemployment_rate_2012', 'Civilian_labor_force_2013', 'Employed_2013', 'Unemployed_2013', 'Unemployment_rate_2013', 'Civilian_labor_force_2014', 'Employed_2014', 'Unemployed_2014', 'Unemployment_rate_2014', 'Civilian_labor_force_2015', 'Employed_2015', 'Unemployed_2015', 'Unemployment_rate_2015', 'Civilian_labor_force_2016', 'Employed_2016', 'Unemployed_2016', 'Unemployment_rate_2016', 'Civilian_labor_force_2017', 'Employed_2017', 'Unemployed_2017', 'Unemployment_rate_2017', 'Civilian_labor_force_2018', 'Employed_2018', 'Unemployed_2018', 'Unemployment_rate_2018', 'Civilian_labor_force_2019', 'Employed_2019', 'Unemployed_2019', 'Unemployment_rate_2019', 'Civilian_labor_force_2020', 'Employed_2020', 'Unemployed_2020', 'Unemployment_rate_2020', 
'Civilian_labor_force_2021', 'Employed_2021', 'Unemployed_2021', 'Unemployment_rate_2021', 'Civilian_labor_force_2022', 'Employed_2022', 'Unemployed_2022', 'Unemployment_rate_2022', 'Median_Household_Income_2021', 'Med_HH_Income_Percent_of_State_Total_2021', 'Rural-urban_Continuum_Code_2003', 'Urban_Influence_Code_2003', 'Rural-urban_Continuum_Code_2013', 'Urban_Influence_Code_ 2013', 'POVALL_2021', 'CI90LBALL_2021', 'CI90UBALL_2021', 'PCTPOVALL_2021', 'CI90LBALLP_2021', 'CI90UBALLP_2021', 'POV017_2021', 'CI90LB017_2021', 'CI90UB017_2021', 'PCTPOV017_2021', 'CI90LB017P_2021', 'CI90UB017P_2021', 'POV517_2021', 'CI90LB517_2021', 'CI90UB517_2021', 'PCTPOV517_2021', 'CI90LB517P_2021', 'CI90UB517P_2021', 'MEDHHINC_2021', 'CI90LBINC_2021', 'CI90UBINC_2021', 'POV04_2021', 'CI90LB04_2021', 'CI90UB04_2021', 'PCTPOV04_2021', 'CI90LB04P_2021', 'CI90UB04P_2021', 'candidatevotes_2000_republican', 'candidatevotes_2004_republican', 'candidatevotes_2008_republican', 'candidatevotes_2012_republican', 'candidatevotes_2016_republican', 'share_2000_republican', 'share_2004_republican', 'share_2008_republican', 'share_2012_republican', 'share_2016_republican']

Once finalized (question 4), the pipeline takes the following form:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('number',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer()),
                                                                  ('scale',
                                                                   StandardScaler())]),
                                                  ['ALAND', 'AWATER',
                                                   'votes_gop', 'votes_dem',
                                                   'total_votes', 'diff',
                                                   'FIPS_y',
                                                   'Rural_Urban_Continuum_Code_2013',
                                                   'Rural_Urban_Continuum_Code_2023',
                                                   'Urban_Influence_2013',
                                                   'Economic_typology_2015',
                                                   'CENSUS_2020_POP',...
                                                   'NATURAL_CHG_2020', ...]),
                                                 ('category',
                                                  Pipeline(steps=[('impute',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  ['STATEFP', 'COUNTYFP',
                                                   'COUNTYNS', 'AFFGEOID',
                                                   'GEOID', 'NAME', 'LSAD',
                                                   'FIPS_x', 'state_name',
                                                   'county_fips', 'county_name',
                                                   'State', 'Area_Name',
                                                   'FIPS'])])),
                ('model', Lasso(alpha=0.1))])

2 Role of the penalty \(\alpha\) in variable selection

So far, we have taken the hyperparameter \(\alpha\) as given. What role does it play in the conclusions of our model? To find out, we can explore how its value affects the number of variables that pass the selection step.

For the next exercise, we will consider only the quantitative variables in order to speed up computation. With non-parsimonious models, the many categories of our categorical variables make the optimization problem hard to solve in practice.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Infinite values cannot be imputed: treat them as missing
df2.replace([np.inf, -np.inf], np.nan, inplace=True)

# Target: Republican vote share, expressed in percent
X_train, X_test, y_train, y_test = train_test_split(
    df2.drop(["per_gop"], axis = 1),
    100*df2[['per_gop']], test_size=0.2, random_state=0
)

numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

# Mean imputation followed by standardization, on numeric columns only
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
preprocessed_features = pd.DataFrame(
  numeric_pipeline.fit_transform(
    X_train.drop(columns = categorical_features)
  )
)

/home/runner/work/python-datascientist/python-datascientist/.venv/lib/python3.12/site-packages/sklearn/impute/_base.py:635: UserWarning:

Skipping features without any observed values: ['POV04_2021' 'CI90LB04_2021' 'CI90UB04_2021' 'PCTPOV04_2021'
 'CI90LB04P_2021' 'CI90UB04P_2021']. At least one non-missing value is needed for imputation with strategy='mean'.
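The warning above means that the imputer drops columns with no observed value, so its output has fewer columns than its input. Below is a minimal sketch, on made-up data with illustrative column names, of one possible way to handle this: removing the empty columns beforehand so that column names stay aligned with the imputed array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the column "empty" has no observed value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "empty": [np.nan, np.nan, np.nan],
    "b": [0.5, 0.2, np.nan],
})

# With default settings the imputer silently shrinks the output (with a warning)
imputed_raw = SimpleImputer(strategy="mean").fit_transform(toy)
print(toy.shape, "->", imputed_raw.shape)

# Dropping fully missing columns first keeps names and values aligned
toy_clean = toy.dropna(axis=1, how="all")
imputed_clean = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy_clean),
    columns=toy_clean.columns,
)
print(imputed_clean)
```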

Exercise 2: Role of the penalty parameter

Use the lasso_path function to assess the number of parameters selected by LASSO as \(\alpha\) varies (go through \(\alpha \in \{0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0\}\)).

The relationship you should obtain between \(\alpha\) and the number of selected parameters is decreasing: the higher \(\alpha\) is, the fewer variables the model selects.
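A minimal sketch of the mechanics of this exercise, assuming synthetic data generated with `make_regression` as a stand-in for the chapter's preprocessed features (so the counts will differ from those obtained on the election data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 features, only 5 of which are informative
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)

my_alphas = np.array([0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0])

# coefs has shape (n_features, n_alphas); alphas_out is the grid as ordered by lasso_path
alphas_out, coefs, _ = lasso_path(X, y, alphas=my_alphas)

# Number of non-zero coefficients for each value of alpha
n_selected = np.count_nonzero(coefs, axis=0)
for alpha, k in zip(alphas_out, n_selected):
    print(f"alpha={alpha:.3f}  selected variables: {k}")
```

Pairing `alphas_out` with the columns of `coefs`, rather than with the input grid, avoids depending on the order in which the path is returned.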

3 Cross-validation to select the model

Which \(\alpha\) should we prefer? To answer this, we run a cross-validation in order to choose the model whose selected variables best predict the Republican result.

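from sklearn.linear_model import LassoCV

# Grid of candidate penalties (same values as in exercise 2)
my_alphas = np.array([0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0])

# 5-fold cross-validation over the grid, on the preprocessed numeric features
lcv = (
  LassoCV(
    alphas=my_alphas,
    fit_intercept=False,
    random_state=0,
    cv=5
    ).fit(
      preprocessed_features, y_train
    )
)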

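We can retrieve the “best” \(\alpha\):

print("optimal alpha:", lcv.alpha_)

optimal alpha: 1.0

This value is the upper bound of the grid explored, so it may be worth extending the grid towards larger values of \(\alpha\).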
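To see how this choice is made, one can inspect the cross-validated error for each candidate \(\alpha\). The sketch below assumes synthetic data from `make_regression` (not the chapter's data) and uses the `mse_path_` and `alphas_` attributes of a fitted `LassoCV`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed features
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)

my_alphas = np.array([0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0])
toy_lcv = LassoCV(alphas=my_alphas, cv=5, random_state=0).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds): average over folds for each alpha
mean_mse = toy_lcv.mse_path_.mean(axis=1)
for alpha, mse in zip(toy_lcv.alphas_, mean_mse):
    print(f"alpha={alpha:.3f}  mean CV MSE: {mse:.2f}")

print("selected alpha:", toy_lcv.alpha_)
```

The selected value is the one that minimizes the mean squared error averaged over folds; a flat error curve suggests that several values of \(\alpha\) perform almost equally well.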

This optimal \(\alpha\) can then be used to fit a new pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: mean imputation, then standardization
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

# Categorical columns: most frequent value imputation, then one-hot encoding
categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

# LASSO with the penalty chosen by cross-validation
model = Lasso(
  fit_intercept=False,
  alpha = lcv.alpha_
)

lasso_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)
])

lasso_optimal = lasso_pipeline.fit(X_train,y_train)

# extract_features_selected: helper assumed to be defined earlier in the chapter
features_selec2 = extract_features_selected(lasso_optimal)

The selected variables are:

0                                          R_BIRTH_2021
1                                          R_DEATH_2023
2     Percent of adults completing some college or a...
3     Percent of adults with a bachelor's degree or ...
4     Percent of adults with a bachelor's degree or ...
5     Percent of adults with a high school diploma o...
6                                        CI90LBINC_2021
7                        candidatevotes_2016_republican
8                                 share_2008_republican
9                                 share_2012_republican
10                                share_2016_republican
11                                           STATEFP_22
12                                              LSAD_06
dtype: object

This corresponds to a model with 13 selected variables.

Tip

If the model appears insufficiently parsimonious, the definition of the relevant variables should be revisited to check whether a different scale would be more appropriate for some of them (for example, the log).
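As an illustration of this tip, the sketch below applies a log transformation before standardization, using `FunctionTransformer`. The variable `population` is a made-up right-skewed series, chosen as an assumption for the example rather than taken from the chapter's data.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed variable (for instance a population count)
population = pd.DataFrame(
    {"population": rng.lognormal(mean=10, sigma=1, size=1000)}
)

# log(1 + x) followed by standardization
log_scale_pipeline = Pipeline(steps=[
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])
transformed = log_scale_pipeline.fit_transform(population)

# Compare the asymmetry of the variable before and after the transformation
skew_before = population["population"].skew()
skew_after = pd.Series(np.ravel(transformed)).skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Such a step could replace the plain `StandardScaler` for selected columns of the numeric pipeline, at the cost of deciding which variables deserve it.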

Additional information

environment files have been tested on.

Latest built version: 2025-07-29

Python version used:

'3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0]'

Package	Version
affine	2.4.0
aiobotocore	2.22.0
aiohappyeyeballs	2.6.1
aiohttp	3.11.18
aioitertools	0.12.0
aiosignal	1.3.2
altair	5.4.1
annotated-types	0.7.0
anyio	4.9.0
appdirs	1.4.4
argon2-cffi	25.1.0
argon2-cffi-bindings	21.2.0
arrow	1.3.0
asttokens	3.0.0
async-lru	2.0.5
attrs	25.3.0
babel	2.17.0
beautifulsoup4	4.13.4
black	24.8.0
bleach	6.2.0
blis	1.3.0
boto3	1.37.3
botocore	1.37.3
branca	0.8.1
Brotli	1.1.0
bs4	0.0.2
cartiflette	0.0.3
Cartopy	0.24.1
catalogue	2.0.10
cattrs	24.1.3
certifi	2025.7.14
cffi	1.17.1
charset-normalizer	3.4.2
chromedriver-autoinstaller	0.6.4
click	8.2.1
click-plugins	1.1.1
cligj	0.7.2
cloudpathlib	0.21.1
comm	0.2.2
commonmark	0.9.1
confection	0.1.5
contextily	1.6.2
contourpy	1.3.2
cycler	0.12.1
cymem	2.0.11
dataclasses-json	0.6.7
debugpy	1.8.14
decorator	5.2.1
defusedxml	0.7.1
diskcache	5.6.3
duckdb	1.3.0
en_core_web_sm	3.8.0
et_xmlfile	2.0.0
executing	2.2.0
fastexcel	0.14.0
fastjsonschema	2.21.1
fiona	1.10.1
folium	0.19.6
fontawesomefree	6.6.0
fonttools	4.58.0
fqdn	1.5.1
frozenlist	1.6.0
fsspec	2025.5.0
geographiclib	2.0
geopandas	1.0.1
geoplot	0.5.1
geopy	2.4.1
graphviz	0.20.3
great-tables	0.12.0
greenlet	3.2.2
h11	0.16.0
htmltools	0.6.0
httpcore	1.0.9
httpx	0.28.1
httpx-sse	0.4.0
idna	3.10
imageio	2.37.0
importlib_metadata	8.7.0
importlib_resources	6.5.2
inflate64	1.0.1
ipykernel	6.29.5
ipython	9.3.0
ipython_pygments_lexers	1.1.1
ipywidgets	8.1.7
isoduration	20.11.0
jedi	0.19.2
Jinja2	3.1.6
jmespath	1.0.1
joblib	1.5.1
json5	0.12.0
jsonpatch	1.33
jsonpointer	3.0.0
jsonschema	4.23.0
jsonschema-specifications	2025.4.1
jupyter	1.1.1
jupyter-cache	1.0.0
jupyter_client	8.6.3
jupyter-console	6.6.3
jupyter_core	5.7.2
jupyter-events	0.12.0
jupyter-lsp	2.2.5
jupyter_server	2.16.0
jupyter_server_terminals	0.5.3
jupyterlab	4.4.3
jupyterlab_pygments	0.3.0
jupyterlab_server	2.27.3
jupyterlab_widgets	3.0.15
kaleido	0.2.1
kiwisolver	1.4.8
langchain	0.3.25
langchain-community	0.3.9
langchain-core	0.3.61
langchain-text-splitters	0.3.8
langcodes	3.5.0
langsmith	0.1.147
language_data	1.3.0
lazy_loader	0.4
loguru	0.7.3
lxml	5.4.0
mapclassify	2.8.1
marisa-trie	1.2.1
Markdown	3.8
markdown-it-py	3.0.0
MarkupSafe	3.0.2
marshmallow	3.26.1
matplotlib	3.10.3
matplotlib-inline	0.1.7
mdurl	0.1.2
mercantile	1.2.1
mistune	3.1.3
mizani	0.11.4
multidict	6.4.4
multivolumefile	0.2.3
murmurhash	1.0.13
mypy_extensions	1.1.0
narwhals	1.40.0
nbclient	0.10.0
nbconvert	7.16.6
nbformat	5.10.4
nest-asyncio	1.6.0
networkx	3.4.2
nltk	3.9.1
notebook	7.4.3
notebook_shim	0.2.4
numpy	2.2.6
openpyxl	3.1.5
orjson	3.10.18
outcome	1.3.0.post0
overrides	7.7.0
OWSLib	0.33.0
packaging	24.2
pandas	2.2.3
pandocfilters	1.5.1
parso	0.8.4
pathspec	0.12.1
patsy	1.0.1
Pebble	5.1.1
pexpect	4.9.0
pillow	11.2.1
pip	25.1.1
platformdirs	4.3.8
plotly	6.1.2
plotnine	0.13.6
polars	1.8.2
preshed	3.0.9
prometheus_client	0.22.1
prompt_toolkit	3.0.51
propcache	0.3.1
psutil	7.0.0
ptyprocess	0.7.0
pure_eval	0.2.3
py7zr	0.22.0
pyarrow	17.0.0
pybcj	1.0.6
pycparser	2.22
pycryptodomex	3.23.0
pydantic	2.11.5
pydantic_core	2.33.2
pydantic-settings	2.9.1
Pygments	2.19.1
pynsee	0.1.8
pyogrio	0.11.0
pyparsing	3.2.3
pyppmd	1.1.1
pyproj	3.7.1
pyshp	2.3.1
PySocks	1.7.1
python-dateutil	2.9.0.post0
python-dotenv	1.0.1
python-json-logger	3.3.0
python-magic	0.4.27
pytz	2025.2
pywaffle	1.1.1
PyYAML	6.0.2
pyzmq	26.4.0
pyzstd	0.17.0
rasterio	1.4.3
referencing	0.36.2
regex	2024.11.6
requests	2.32.3
requests-cache	1.2.1
requests-toolbelt	1.0.0
retrying	1.3.4
rfc3339-validator	0.1.4
rfc3986-validator	0.1.1
rich	14.0.0
rpds-py	0.25.1
rtree	1.4.0
s3fs	2025.5.0
s3transfer	0.11.3
scikit-image	0.24.0
scikit-learn	1.6.1
scipy	1.13.0
seaborn	0.13.2
selenium	4.34.2
Send2Trash	1.8.3
setuptools	80.8.0
shapely	2.1.1
shellingham	1.5.4
six	1.17.0
smart-open	7.1.0
sniffio	1.3.1
sortedcontainers	2.4.0
soupsieve	2.7
spacy	3.8.4
spacy-legacy	3.0.12
spacy-loggers	1.0.5
SQLAlchemy	2.0.41
srsly	2.5.1
stack-data	0.6.3
statsmodels	0.14.4
tabulate	0.9.0
tenacity	9.1.2
terminado	0.18.1
texttable	1.7.0
thinc	8.3.6
threadpoolctl	3.6.0
tifffile	2025.5.24
tinycss2	1.4.0
topojson	1.9
tornado	6.5.1
tqdm	4.67.1
traitlets	5.14.3
trio	0.30.0
trio-websocket	0.12.2
typer	0.15.3
types-python-dateutil	2.9.0.20250516
typing_extensions	4.14.1
typing-inspect	0.9.0
typing-inspection	0.4.1
tzdata	2025.2
Unidecode	1.4.0
uri-template	1.3.0
url-normalize	2.2.1
urllib3	2.5.0
wasabi	1.1.3
wcwidth	0.2.13
weasel	0.4.1
webcolors	24.11.1
webdriver-manager	4.0.2
webencodings	0.5.1
websocket-client	1.8.0
widgetsnbextension	4.0.14
wordcloud	1.9.3
wrapt	1.17.2
wsproto	1.2.0
xlrd	2.0.1
xyzservices	2025.4.0
yarl	1.20.0
yellowbrick	1.5
zipp	3.21.0

View file history

SHA	Date	Author	Description
94648290	2025-07-22 18:57:48	Lino Galiana	Fix boxes now that it is better supported by jupyter (#628)
91431fa2	2025-06-09 17:08:00	Lino Galiana	Improve homepage hero banner (#612)
48dccf14	2025-01-14 21:45:34	lgaliana	Fix bug in modeling section
8c8ca4c0	2024-12-20 10:45:00	lgaliana	Traduction du chapitre clustering
a5ecaedc	2024-12-20 09:36:42	Lino Galiana	Traduction du chapitre modélisation (#582)
5ff770b5	2024-12-04 10:07:34	lgaliana	Partie ML plus esthétique
d2422572	2024-08-22 18:51:51	Lino Galiana	At this point, notebooks should now all be functional ! (#547)
c641de05	2024-08-22 11:37:13	Lino Galiana	A series of fix for notebooks that were bugging (#545)
0908656f	2024-08-20 16:30:39	Lino Galiana	English sidebar (#542)
06d003a1	2024-04-23 10:09:22	Lino Galiana	Continue la restructuration des sous-parties (#492)
8c316d0a	2024-04-05 19:00:59	Lino Galiana	Fix cartiflette deprecated snippets (#487)
005d89b8	2023-12-20 17:23:04	Lino Galiana	Finalise l’affichage des statistiques Git (#478)
3437373a	2023-12-16 20:11:06	Lino Galiana	Améliore l’exercice sur le LASSO (#473)
7d12af8b	2023-12-05 10:30:08	linogaliana	Modularise la partie import pour l’avoir partout
417fb669	2023-12-04 18:49:21	Lino Galiana	Corrections partie ML (#468)
0b405bc2	2023-11-27 20:58:37	Lino Galiana	Update box lasso
a06a2689	2023-11-23 18:23:28	Antoine Palazzolo	2ème relectures chapitres ML (#457)
889a71ba	2023-11-10 11:40:51	Antoine Palazzolo	Modification TP 3 (#443)
9a4e2267	2023-08-28 17:11:52	Lino Galiana	Action to check URL still exist (#399)
a8f90c2f	2023-08-28 09:26:12	Lino Galiana	Update featured paths (#396)
3bdf3b06	2023-08-25 11:23:02	Lino Galiana	Simplification de la structure 🤓 (#393)
78ea2cbd	2023-07-20 20:27:31	Lino Galiana	Change titles levels (#381)
29ff3f58	2023-07-07 14:17:53	linogaliana	description everywhere
f21a24d3	2023-07-02 10:58:15	Lino Galiana	Pipeline Quarto & Pages 🚀 (#365)
e12187b2	2023-06-12 10:31:40	Lino Galiana	Feature selection deprecated functions (#363)
f5ad0210	2022-11-15 17:40:16	Lino Galiana	Relec clustering et lasso (#322)
f10815b5	2022-08-25 16:00:03	Lino Galiana	Notebooks should now look more beautiful (#260)
494a85ae	2022-08-05 14:49:56	Lino Galiana	Images featured ✨ (#252)
d201e3cd	2022-08-03 15:50:34	Lino Galiana	Pimp la homepage ✨ (#249)
12965bac	2022-05-25 15:53:27	Lino Galiana	:launch: Bascule vers quarto (#226)
9c71d6e7	2022-03-08 10:34:26	Lino Galiana	Plus d’éléments sur S3 (#218)
70587527	2022-03-04 15:35:17	Lino Galiana	Relecture Word2Vec (#216)
c3bf4d42	2021-12-06 19:43:26	Lino Galiana	Finalise debug partie ML (#190)
fb14d406	2021-12-06 17:00:52	Lino Galiana	Modifie l’import du script (#187)
37ecfa3c	2021-12-06 14:48:05	Lino Galiana	Essaye nom différent (#186)
2c8fd0dd	2021-12-06 13:06:36	Lino Galiana	Problème d’exécution du script import data ML (#185)
5d0a5e38	2021-12-04 07:41:43	Lino Galiana	MAJ URL script recup data (#184)
5c104904	2021-12-03 17:44:08	Lino Galiana	Relec @antuki partie modelisation (#183)
2a8809fb	2021-10-27 12:05:34	Lino Galiana	Simplification des hooks pour gagner en flexibilité et clarté (#166)
2e4d5862	2021-09-02 12:03:39	Lino Galiana	Simplify badges generation (#130)
4cdb759c	2021-05-12 10:37:23	Lino Galiana	:sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
7f9f97bc	2021-04-30 21:44:04	Lino Galiana	🐳 + 🐍 New workflow (docker 🐳) and new dataset for modelization (2020 🇺🇸 elections) (#99)
8fea62ed	2020-11-13 11:58:17	Lino Galiana	Correction de quelques typos partie ML (#85)
347f50f3	2020-11-12 15:08:18	Lino Galiana	Suite de la partie machine learning (#78)

Citation

BibTeX

@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python pour la data science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {fr}
}

Please cite this work as follows:

Galiana, Lino. 2023. Python pour la data science. https://doi.org/10.5281/zenodo.8229676.

	parsimonious	non-parsimonious
RMSE	2.699583	2.491548
R2	0.972809	0.976838
Number of parameters	27	256

Dep. Variable:	per_gop	R-squared:	0.968
Model:	OLS	Adj. R-squared:	0.968
Method:	Least Squares	F-statistic:	9.292e+04
Date:	Tue, 29 Jul 2025	Prob (F-statistic):	0.00
Time:	14:01:36	Log-Likelihood:	6603.5
No. Observations:	3107	AIC:	-1.320e+04
Df Residuals:	3105	BIC:	-1.319e+04
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	0.0109	0.002	5.056	0.000	0.007	0.015
share_2016_republican	1.0101	0.003	304.835	0.000	1.004	1.017

Omnibus:	2045.232	Durbin-Watson:	1.982
Prob(Omnibus):	0.000	Jarque-Bera (JB):	51553.266
Skew:	2.731	Prob(JB):	0.00
Kurtosis:	22.193	Cond. No.	9.00