!pip install --upgrade xlrd  # workaround for a Colab bug with the xlrd version
!pip install geopandas
This chapter uses the same dataset as before, presented in the introduction to this part: vote data from the US presidential elections combined with sociodemographic variables. The code is available on GitHub.
import requests
url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
# Save the helper script locally so that it can be imported as a module
open('getdata.py', 'wb').write(r.content)
import getdata
votes = getdata.create_votes_dataframes()
So far, we have assumed that the variables useful for predicting the Republican vote were known to the modeler. Thus, we have only used a limited portion of the available variables in our data. However, beyond the computational burden of building a model with a large number of variables, choosing a limited number of variables (a parsimonious model) reduces the risk of overfitting.
How, then, can we choose the right number of variables and the best combination of these variables? There are multiple methods, including:
- Relying on statistical performance criteria that penalize non-parsimonious models. For example, BIC.
- Backward elimination techniques.
- Building models where the statistic of interest penalizes the lack of parsimony (which is what we aim to do here).
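As an illustration of the first approach, the sketch below compares two nested linear models with the BIC on simulated data. It assumes Gaussian errors, in which case the BIC equals, up to an additive constant, \(n \ln(\text{RSS}/n) + k \ln(n)\). The helper name bic_ols and the simulated data are purely illustrative and are not part of the chapter's dataset.

```python
import numpy as np


def bic_ols(X: np.ndarray, y: np.ndarray) -> float:
    """BIC (up to an additive constant) of an OLS fit with intercept, assuming Gaussian errors."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))   # residual sum of squares
    k = design.shape[1]                             # number of estimated coefficients
    return n * np.log(rss / n) + k * np.log(n)


rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 10))
# In this simulated example, only the first two columns drive the outcome
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

bic_small = bic_ols(X[:, :2], y)   # parsimonious model
bic_full = bic_ols(X, y)           # model that also includes 8 irrelevant variables
print(f"BIC with 2 variables:  {bic_small:.1f}")
print(f"BIC with 10 variables: {bic_full:.1f}")
```

The model with the lower BIC is preferred. The irrelevant variables reduce the residual sum of squares only marginally, so the \(k \ln(n)\) penalty is expected to dominate and favor the two-variable model.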
In this chapter, we will present the main challenges of variable selection through LASSO.
We will subsequently use the following functions or packages:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
import sklearn.metrics
from sklearn.linear_model import LinearRegression
import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path
import seaborn as sns
1 The Principle of LASSO
1.1 General Principle
The class of feature selection models is very broad and includes a diverse range of models. We will focus on LASSO (Least Absolute Shrinkage and Selection Operator), an extension of linear regression that favors sparse models. This type of model is central to the field of compressed sensing (where the term L1-regularization is more commonly used than LASSO). LASSO is a special case of elastic-net regression, another famous special case being ridge regression. Unlike classical linear regression, these methods also work in a framework where \(p>N\), i.e., where the number of predictors is larger than the number of observations.
1.2 Regularization
By adopting the principle of a penalized objective function, LASSO allows certain coefficients to be set exactly to 0. The variables whose coefficients remain non-zero are the ones that pass the selection test.
LASSO is a constrained optimization problem. It seeks to find the estimator \(\beta\) that minimizes the quadratic error (linear regression) under an additional constraint regularizing the parameters: \[ \begin{aligned} \min_{\beta} \quad & \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) \\ \text{s.t.} \quad & \sum_{j=1}^p |\beta_j| \leq t \end{aligned} \]
This program is reformulated using the Lagrangian, allowing for a more tractable minimization program:
\[ \beta^{\text{LASSO}} = \arg \min_{\beta} \frac{1}{2}\mathbb{E}\bigg( \big( X\beta - y \big)^2 \bigg) + \alpha \sum_{j=1}^p |\beta_j| = \arg \min_{\beta} ||y-X\beta||_{2}^{2} + \lambda ||\beta||_1 \]
where \(\lambda\) is a reformulation of the previous regularization term, depending on \(\alpha\). The strength of the penalty applied to non-parsimonious models depends on this parameter.
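To make the role of the penalty concrete, here is a minimal sketch on simulated data (not the election dataset). scikit-learn's Lasso minimizes \(\frac{1}{2N}||y-X\beta||_2^2 + \alpha ||\beta||_1\), so its alpha argument plays the role of the penalty weight. The data-generating process and the grid of alpha values are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]            # only 3 informative variables (simulated)
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Smallest penalty for which every coefficient is zero.
# The data are centered because Lasso fits an intercept by default.
Xc, yc = X - X.mean(axis=0), y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

alphas = [0.001, 0.1, 0.5, 1.01 * alpha_max]
counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    counts.append(int(np.sum(model.coef_ != 0)))   # number of selected variables
    print(f"alpha={alpha:.3f}: {counts[-1]} non-zero coefficients")
```

As the penalty grows, fewer coefficients tend to stay away from zero; above alpha_max the model keeps no variable at all and reduces to predicting the mean.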
1.3 First LASSO Regression
As we aim to find the best predictors of the Republican vote, we will remove the variables from which it can be directly derived: the competitors’ scores!
import pandas as pd
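df2 = pd.DataFrame(votes.drop(columns='geometry'))
# Drop the columns whose name ends with one of these suffixes
# (str.endswith expects a single pattern, hence the tuple)
df2 = df2.loc[
    :, ~df2.columns.str.endswith(
        ('_democrat', '_green', '_other', 'winner', 'per_point_diff', 'per_dem')
    )
]
# Keep only the first occurrence of duplicated column names
df2 = df2.loc[:, ~df2.columns.duplicated()]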
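The column filter above relies on passing several suffixes at once. The sketch below shows the same idea on a small made-up DataFrame; the column names are hypothetical, and passing a tuple of patterns to the pandas str.endswith accessor assumes a reasonably recent pandas version (tuple patterns are documented from pandas 1.5 onwards).

```python
import pandas as pd

# Toy table with invented columns, only to illustrate the filtering logic
toy = pd.DataFrame(
    [[0.6, 0.4, 0.01, "gop", 1.0]],
    columns=["per_gop", "per_dem", "share_2016_green", "winner", "ALAND"],
)

suffixes = ("_green", "winner", "per_dem")   # a tuple, not separate arguments
mask = toy.columns.str.endswith(suffixes)    # True where the name ends with any suffix
kept = toy.loc[:, ~mask]                     # keep the columns that do not match
print(list(kept.columns))
```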
In this exercise, we will also use the following function to extract the variables selected by LASSO:
Function to retrieve the variables retained by the selection step
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
def extract_features_selected(lasso: Pipeline, preprocessing_step_name: str = 'preprocess') -> pd.Series:
    """
    Extracts selected features based on the coefficients obtained from Lasso regression.

    Parameters:
    - lasso (Pipeline): The scikit-learn pipeline containing a trained Lasso regression model.
    - preprocessing_step_name (str): The name of the preprocessing step in the pipeline. Default is 'preprocess'.

    Returns:
    - pd.Series: A Pandas Series containing selected features with non-zero coefficients.
    """
    # Check that the provided object is a scikit-learn pipeline
    if not isinstance(lasso, Pipeline):
        raise ValueError("The provided lasso object is not a scikit-learn pipeline.")

    # Extract the final estimator from the pipeline
    lasso_model = lasso[-1]

    # Check that the final step is a Lasso regression model
    if not isinstance(lasso_model, Lasso):
        raise ValueError("The final step of the pipeline is not a Lasso regression model.")

    # Check that the model has been fitted (it then exposes 'coef_')
    if not hasattr(lasso_model, 'coef_'):
        raise ValueError("The provided Lasso regression model does not have 'coef_' attribute. "
                         "Make sure it is a trained Lasso regression model.")

    # Get feature names from the preprocessing step
    features_preprocessing = lasso[preprocessing_step_name].get_feature_names_out()

    # Keep the features whose coefficient is non-zero
    features_selec = pd.Series(features_preprocessing[np.abs(lasso_model.coef_) > 0])

    return features_selec
We are still trying to predict the variable per_gop. Before making our estimation, we will create certain intermediate objects to define our pipeline:
1. In our DataFrame, replace infinite values with NaN.
2. Create a training sample and a test sample.
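These two preparatory steps can be sketched as follows on a small invented table. The column name ratio, the 20% test share and the random seed are assumptions for illustration; the exercise does not prescribe them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "per_gop": [0.61, 0.42, 0.55, 0.70, 0.38, 0.49, 0.66, 0.58, 0.45, 0.52],
    "ratio": [1.2, np.inf, 0.8, -np.inf, 1.0, 0.9, 1.1, 0.7, 1.3, 1.0],
})

# Step 1: infinite values become missing values, to be imputed later in the pipeline
toy = toy.replace([np.inf, -np.inf], np.nan)

# Step 2: separate the target from the features, then split into train and test samples
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="per_gop"),
    toy["per_gop"],
    test_size=0.2,       # illustrative share
    random_state=123,    # fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)
```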
Now we can move on to defining our pipeline. This example might serve as inspiration, as well as this one.
3. First, create the preprocessing steps for our model. For this, it is common to separate the steps applied to continuous numerical variables from those applied to categorical variables.
   - For numerical variables, impute with the mean and then standardize;
   - For categorical variables, linear regression techniques require using one-hot encoding. Before performing one-hot encoding, impute with the most frequent value.
4. Finalize the pipeline by adding the estimation step and then estimate a LASSO model penalized with \(\alpha = 0.1\).
Assuming your pipeline is stored in an object named pipeline and the last step is named model, you can directly access this step using the object pipeline['model'].
5. Display the coefficient values. Which variables have a non-zero value?
6. Show that the selected variables are sometimes highly correlated.
7. Compare the performance of this parsimonious model with that of a model with more variables.
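The last three questions can be approached along the lines of the sketch below, written on simulated data with two deliberately correlated predictors. Everything here (variable names, the helper make_pipeline_with, the use of RMSE on a held-out sample as the performance measure) is an assumption for illustration and not the chapter's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
base = rng.normal(size=n)
X = pd.DataFrame({
    "x_a": base,
    "x_b": base + rng.normal(scale=0.3, size=n),   # strongly correlated with x_a
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = 1.0 * X["x_a"] + 1.0 * X["x_b"] + rng.normal(scale=0.5, size=n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


def make_pipeline_with(estimator) -> Pipeline:
    """Hypothetical helper: same preprocessing, different final estimator."""
    return Pipeline(steps=[("preprocess", StandardScaler()), ("model", estimator)])


lasso_pipe = make_pipeline_with(Lasso(alpha=0.1)).fit(X_train, y_train)
ols_pipe = make_pipeline_with(LinearRegression()).fit(X_train, y_train)

# Coefficients of the final step, accessed by its name
coefs = pd.Series(lasso_pipe["model"].coef_, index=X.columns)
selected = coefs[coefs != 0].index.tolist()
print(coefs)

# Pairwise correlations among the selected variables
print(X_train[selected].corr())

# Out-of-sample comparison between the sparse model and the full linear regression
rmse = {
    name: float(np.sqrt(mean_squared_error(y_test, pipe.predict(X_test))))
    for name, pipe in [("lasso", lasso_pipe), ("ols", ols_pipe)]
}
print(rmse)
```

Which of the two models performs better depends on the data and on the penalty, so the comparison should be read on the test sample rather than assumed in advance.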
Help for Question 1
# Replace infinities with NaN
df2.replace([np.inf, -np.inf], np.nan, inplace=True)
Help for Question 3
The pipeline definition follows this structure:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_pipeline = Pipeline(steps=[
    ('impute', ...),  # define the imputation method here
    ('scale', ...),   # define the standardization method here
])

categorical_pipeline = ...  # adapt the template

# Define numerical_features and categorical_features beforehand
preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])
The preprocessing pipeline (question 3) takes the following form:
ColumnTransformer(transformers=[('number', Pipeline(steps=[('impute', SimpleImputer()), ('scale', StandardScaler())]), ['ALAND', 'AWATER', 'votes_gop', 'votes_dem', 'total_votes', 'diff', 'FIPS_y', 'Rural_Urban_Continuum_Code_2013', 'Rural_Urban_Continuum_Code_2023', 'Urban_Influence_2013', 'Economic_typology_2015', 'CENSUS_2020_POP', 'ESTIMATES_BASE_2020', 'POP_EST... 'DEATHS_2020', 'DEATHS_2021', 'DEATHS_2022', 'DEATHS_2023', 'NATURAL_CHG_2020', ...]), ('category', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'GEOID', 'NAME', 'LSAD', 'FIPS_x', 'state_name', 'county_fips', 'county_name', 'State', 'Area_Name', 'FIPS'])])
/opt/conda/lib/python3.12/site-packages/sklearn/impute/_base.py:635: UserWarning:
Skipping features without any observed values: ['POV04_2021' 'CI90LB04_2021' 'CI90UB04_2021' 'PCTPOV04_2021'
'CI90LB04P_2021' 'CI90UB04P_2021']. At least one non-missing value is needed for imputation with strategy='mean'.
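The warning above signals that some columns contain no observed value at all, so the mean imputer has nothing to compute and skips them. The sketch below reproduces this behavior on a tiny invented table and shows one possible way to avoid it, namely dropping fully missing columns beforehand; whether that is appropriate for the real dataset is left to the reader.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "PCTPOVALL_2021": [12.0, np.nan, 15.5],
    "POV04_2021": [np.nan, np.nan, np.nan],   # no observed value at all
})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")           # silence the expected warning in this demo
    imputed = SimpleImputer(strategy="mean").fit_transform(toy)

# The fully missing column is skipped, so only one column remains
print(imputed.shape)

# Dropping fully missing columns before the pipeline avoids the warning altogether
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```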
The pipeline takes the following form once finalized (question 4):
Pipeline(steps=[('preprocess', ColumnTransformer(transformers=[('number', Pipeline(steps=[('impute', SimpleImputer()), ('scale', StandardScaler())]), ['ALAND', 'AWATER', 'votes_gop', 'votes_dem', 'total_votes', 'diff', 'FIPS_y', 'Rural_Urban_Continuum_Code_2013', 'Rural_Urban_Continuum_Code_2023', 'Urban_Influence_2013', 'Economic_typology_2015', 'CENSUS_2020_POP',... 'NATURAL_CHG_2020', ...]), ('category', Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))]), ['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'GEOID', 'NAME', 'LSAD', 'FIPS_x', 'state_name', 'county_fips', 'county_name', 'State', 'Area_Name', 'FIPS'])])), ('model', Lasso(alpha=0.1))])
'Civilian_labor_force_2021', 'Employed_2021', 'Unemployed_2021', 'Unemployment_rate_2021', 'Civilian_labor_force_2022', 'Employed_2022', 'Unemployed_2022', 'Unemployment_rate_2022', 'Median_Household_Income_2021', 'Med_HH_Income_Percent_of_State_Total_2021', 'Rural-urban_Continuum_Code_2003', 'Urban_Influence_Code_2003', 'Rural-urban_Continuum_Code_2013', 'Urban_Influence_Code_ 2013', 'POVALL_2021', 'CI90LBALL_2021', 'CI90UBALL_2021', 'PCTPOVALL_2021', 'CI90LBALLP_2021', 'CI90UBALLP_2021', 'POV017_2021', 'CI90LB017_2021', 'CI90UB017_2021', 'PCTPOV017_2021', 'CI90LB017P_2021', 'CI90UB017P_2021', 'POV517_2021', 'CI90LB517_2021', 'CI90UB517_2021', 'PCTPOV517_2021', 'CI90LB517P_2021', 'CI90UB517P_2021', 'MEDHHINC_2021', 'CI90LBINC_2021', 'CI90UBINC_2021', 'POV04_2021', 'CI90LB04_2021', 'CI90UB04_2021', 'PCTPOV04_2021', 'CI90LB04P_2021', 'CI90UB04P_2021', 'candidatevotes_2000_republican', 'candidatevotes_2004_republican', 'candidatevotes_2008_republican', 'candidatevotes_2012_republican', 'candidatevotes_2016_republican', 'share_2000_republican', 'share_2004_republican', 'share_2008_republican', 'share_2012_republican', 'share_2016_republican']
SimpleImputer()
StandardScaler()
['STATEFP', 'COUNTYFP', 'COUNTYNS', 'AFFGEOID', 'GEOID', 'NAME', 'LSAD', 'FIPS_x', 'state_name', 'county_fips', 'county_name', 'State', 'Area_Name', 'FIPS']
SimpleImputer(strategy='most_frequent')
OneHotEncoder(handle_unknown='ignore', sparse_output=False)
Lasso(alpha=0.1)
The model is quite parsimonious: it uses only a subset of our initial variables, which is all the more notable because one-hot encoding split our categorical variables into many columns.
At the end of question 5, the selected variables are:
0 ALAND
1 FIPS_y
2 Rural_Urban_Continuum_Code_2023
3 N_POP_CHG_2020
4 INTERNATIONAL_MIG_2023
5 DOMESTIC_MIG_2023
6 RESIDUAL_2020
7 RESIDUAL_2021
8 2023 Rural-urban Continuum Code
9 Percent of adults with a bachelor's degree or ...
10 Percent of adults with a high school diploma o...
11 Percent of adults with a bachelor's degree or ...
12 Percent of adults with less than a high school...
13 Percent of adults with a bachelor's degree or ...
14 Metro_2013
15 Unemployment_rate_2000
16 Unemployment_rate_2002
17 Unemployment_rate_2003
18 Unemployment_rate_2012
19 Unemployment_rate_2014
20 Rural-urban_Continuum_Code_2003
21 Rural-urban_Continuum_Code_2013
22 CI90LB017P_2021
23 CI90LB517P_2021
24 candidatevotes_2016_republican
25 share_2012_republican
26 share_2016_republican
dtype: object
Some variables make sense, such as education-related variables. Notably, one of the best predictors for the Republican score in 2020 is… the Republican score (and mechanically the Democratic score) in 2016 and 2012.
Additionally, redundant variables are selected (for example, both Rural_Urban_Continuum_Code_2023 and 2023 Rural-urban Continuum Code), so a more thorough data cleaning phase would be necessary.
The parsimonious model performs slightly worse (higher RMSE, lower R2), but it does so with 27 parameters instead of 256:
Metric | Parsimonious | Non-parsimonious |
---|---|---|
RMSE | 2.699583 | 2.491548 |
R2 | 0.972809 | 0.976838 |
Number of parameters | 27 | 256 |
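As an illustration of how such a comparison table can be built, here is a minimal sketch on synthetic data. The dataset, the `summarize` helper and the value `alpha=1.0` are assumptions made for the example; they are not the chapter's data or code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 50 features, only 5 of which are informative
X, y = make_regression(
    n_samples=500, n_features=50, n_informative=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def summarize(model):
    """Hypothetical helper: fit on the train set, score on the test set."""
    pred = model.fit(X_tr, y_tr).predict(X_te)
    return {
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "R2": r2_score(y_te, pred),
        # Count the coefficients that are not exactly zero
        "Number of parameters": int(np.sum(model.coef_ != 0)),
    }


comparison = pd.DataFrame({
    "parsimonious": summarize(Lasso(alpha=1.0)),
    "non-parsimonious": summarize(LinearRegression()),
})
print(comparison)
```

Reading the two columns side by side makes the trade-off explicit: a small loss in fit can be weighed against the number of coefficients the model needs.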
Moreover, it can already be noted that regressing the 2020 score on the 2016 score results in very good explanatory performance, suggesting that voting behaves like an autoregressive process:
import statsmodels.api as sm
import statsmodels.formula.api as smf
smf.ols("per_gop ~ share_2016_republican", data = df2).fit().summary()
Dep. Variable: | per_gop | R-squared: | 0.968 |
Model: | OLS | Adj. R-squared: | 0.968 |
Method: | Least Squares | F-statistic: | 9.292e+04 |
Date: | Wed, 19 Mar 2025 | Prob (F-statistic): | 0.00 |
Time: | 20:03:25 | Log-Likelihood: | 6603.5 |
No. Observations: | 3107 | AIC: | -1.320e+04 |
Df Residuals: | 3105 | BIC: | -1.319e+04 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
Intercept | 0.0109 | 0.002 | 5.056 | 0.000 | 0.007 | 0.015 |
share_2016_republican | 1.0101 | 0.003 | 304.835 | 0.000 | 1.004 | 1.017 |
Omnibus: | 2045.232 | Durbin-Watson: | 1.982 |
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 51553.266 |
Skew: | 2.731 | Prob(JB): | 0.00 |
Kurtosis: | 22.193 | Cond. No. | 9.00 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
2 Role of the Penalty \(\alpha\) in Variable Selection
So far, we have taken the hyperparameter \(\alpha\) as given. What role does it play in the conclusions of our modeling? To investigate this, we can explore the effect of its value on the number of variables passing the selection step.
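As a reminder, and assuming the parameterization used by scikit-learn's Lasso (with \(n\) observations, a feature matrix \(X\) with columns \(x_j\), and a target \(y\)), the estimator solves:

\[
\hat{\beta}(\alpha) = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
\]

At the optimum, any variable \(j\) such that \(\frac{1}{n} \lvert x_j^\top (y - X\hat{\beta}) \rvert < \alpha\) has a coefficient exactly equal to zero. A larger \(\alpha\) therefore excludes variables whose correlation with the residual is weak, which is the mechanism behind the selection.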
For the next exercise, we will consider exclusively quantitative variables to speed up the computations. Indeed, with non-parsimonious models, the multiple categories of our categorical variables make the optimization problem difficult to handle.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df2.replace([np.inf, -np.inf], np.nan, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(
    df2.drop(["per_gop"], axis = 1),
    100*df2[['per_gop']], test_size=0.2, random_state=0
)

numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
preprocessed_features = pd.DataFrame(
    numeric_pipeline.fit_transform(
        X_train.drop(columns = categorical_features)
    )
)
/opt/conda/lib/python3.12/site-packages/sklearn/impute/_base.py:635: UserWarning:
Skipping features without any observed values: ['POV04_2021' 'CI90LB04_2021' 'CI90UB04_2021' 'PCTPOV04_2021'
'CI90LB04P_2021' 'CI90UB04P_2021']. At least one non-missing value is needed for imputation with strategy='mean'.
Use the lasso_path function to evaluate the number of parameters selected by LASSO as \(\alpha\) varies (explore \(\alpha \in \{0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0\}\)).
The relationship you should obtain between \(\alpha\) and the number of parameters is as follows:
We see that the higher \(\alpha\) is, the fewer variables the model selects.
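One possible way to obtain this relationship is sketched below on synthetic data. The dataset and the variable names are assumptions made for the example; in the chapter, the inputs would be the preprocessed features and the training target instead. Since lasso_path does not fit an intercept, the target is centered first.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 30 features, only 5 of which are informative
X, y = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)
y = y - y.mean()  # lasso_path does not estimate an intercept

my_alphas = np.array([0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0])

# alphas_path comes back sorted in decreasing order;
# coefs_path has shape (n_features, n_alphas)
alphas_path, coefs_path, _ = lasso_path(X, y, alphas=my_alphas)

# Number of non-zero coefficients for each value of alpha
n_selected = (coefs_path != 0).sum(axis=0)

for alpha, n_vars in zip(alphas_path, n_selected):
    print(f"alpha={alpha:<6} -> {n_vars} selected variables")
```

Plotting `n_selected` against `alphas_path` (for instance with a logarithmic x-axis) gives the kind of decreasing curve described above.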
3 Cross-Validation to Select the Model
Which \(\alpha\) should be preferred? For this, cross-validation should be performed to choose the model for which the variables passing the selection phase best predict the Republican outcome.
from sklearn.linear_model import LassoCV

my_alphas = np.array([0.001,0.01,0.02,0.025,0.05,0.1,0.25,0.5,0.8,1.0])

lcv = (
    LassoCV(
        alphas=my_alphas,
        fit_intercept=False,
        random_state=0,
        cv=5
    ).fit(
        preprocessed_features, y_train
    )
)
The “best” \(\alpha\) can be retrieved as follows:
print("optimal alpha:", lcv.alpha_)
optimal alpha: 1.0
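Note that the value retained here, 1.0, is the largest value of the grid, which may indicate that the cross-validated error is still decreasing at the edge of the explored range. The sketch below, on synthetic data (an assumption made for the example), shows one way to inspect the mean cross-validation error for each \(\alpha\) and to check whether the selected value lies on a boundary of the grid.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data (assumption), not the election dataset
X, y = make_regression(
    n_samples=300, n_features=40, n_informative=5, noise=10.0, random_state=0
)

my_alphas = np.array([0.001, 0.01, 0.02, 0.025, 0.05, 0.1, 0.25, 0.5, 0.8, 1.0])
lcv = LassoCV(alphas=my_alphas, cv=5).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); average over the folds
mean_cv_mse = lcv.mse_path_.mean(axis=1)
for alpha, mse in zip(lcv.alphas_, mean_cv_mse):
    print(f"alpha={alpha:<6} mean CV MSE={mse:.2f}")

# If the selected alpha is the smallest or largest candidate,
# widening the grid in that direction may be worth trying
on_boundary = bool(
    np.isclose(lcv.alpha_, [my_alphas.min(), my_alphas.max()]).any()
)
print("selected alpha:", lcv.alpha_, "| on grid boundary:", on_boundary)
```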
The selected \(\alpha\) can then be used to run a new pipeline:
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])

categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])

model = Lasso(
    fit_intercept=False,
    alpha = lcv.alpha_
)

lasso_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', model)
])

lasso_optimal = lasso_pipeline.fit(X_train,y_train)

features_selec2 = extract_features_selected(lasso_optimal)
The selected variables are:
0 R_BIRTH_2021
1 R_DEATH_2023
2 Percent of adults completing some college or a...
3 Percent of adults with a bachelor's degree or ...
4 Percent of adults with a bachelor's degree or ...
5 Percent of adults with a high school diploma o...
6 CI90LBINC_2021
7 candidatevotes_2016_republican
8 share_2008_republican
9 share_2012_republican
10 share_2016_republican
11 STATEFP_22
12 LSAD_06
dtype: object
This corresponds to a model with 13 selected variables.
If the model appears to be insufficiently parsimonious, it would be necessary to revisit the variable definition phase to determine whether different scales for some variables might be more appropriate (e.g., using the log).
Additional information
environment files have been tested on.
Latest built version: 2025-03-19
Python version used:
'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Citation
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}