Preprocessing before building machine learning models

In order to obtain data that is consistent with modeling assumptions, it is essential to take the time to prepare the data to be supplied to a model. The quality of the prediction depends heavily on this preliminary work, known as preprocessing. This chapter presents the issues involved and illustrates them using the Scikit Learn library, which makes this work less tedious and more reliable.

Author: Lino Galiana
Published: 2025-03-19


1 Introduction

The introduction to this section discussed the importance of adopting an algorithmic rather than a statistical approach for modeling empirical processes. The goal of this chapter is to introduce machine learning methodology and the choices that an algorithmic approach entails for structuring data work. This will also be an opportunity to introduce the Python machine learning ecosystem, particularly its core library: Scikit-Learn.

The aim of this chapter is to present some data preparation elements. This is a fundamental step that should not be overlooked. Models rely on certain assumptions, usually about the theoretical distribution of the variables they take as input.

It is necessary to align the empirical distribution with these assumptions, which requires a restructuring of the data. This will lead to more relevant modeling results. In the chapter on pipelines, we will see how to industrialize these preprocessing steps to simplify applying a model to a dataset different from the one on which it was estimated.

This chapter, like the entire machine learning section, is a practical introduction illustrated from an electoral prediction perspective. Specifically, it involves predicting the results of the 2020 U.S. elections at the county level based on socio-demographic variables. The underlying idea is that there are sociological, economic, or demographic factors influencing voting behavior, but the motivations or complex interactions between these factors are not well understood.

1.1 Introduction to the Scikit ecosystem

Scikit Learn is currently the go-to library in the machine learning ecosystem. It is a library that, despite its many implemented methods, offers the advantage of a unified entry point. This unified approach is one of the reasons for its early success. R only recently gained a unified library, namely tidymodels.

Another reason for Scikit’s success is its operational focus: deploying models developed through Scikit pipelines is cost-effective. A dedicated chapter of this course covers pipelines. Together with Romain Avouac, we offer a more advanced course in the final year at ENSAE, where we present some challenges related to deploying models developed with Scikit.

The Scikit user guide is a valuable reference to consult regularly. The section on preprocessing, the focus of this chapter, is available here.

Scikit Learn, a French Success! 🐓🥖🥐

Scikit Learn is an open-source library originating from the work of Inria 🇫🇷. For over 10 years, this French public institution has developed and maintained this package, which is downloaded 2 million times a day. In 2023, to secure the maintenance of this package, a startup named Probabl.ai was created around the team of Scikit developers.

To explore the depth of the Scikit ecosystem, it is recommended to follow the Scikit MOOC, developed as part of the Inria Academy initiative.

1.2 Data preparation

Exercise 1 lets interested readers rebuild this dataset step by step.

The following packages are needed to import and visualize the election data:

!pip install --upgrade xlrd
!pip install geopandas

The data sources are varied, so the code that builds the final dataset is provided directly. Assembling a single dataset from them is somewhat tedious, but it is a good Pandas review exercise for those who want to try it:

Exercise 1 (Optional): Build the Database

This exercise is OPTIONAL

  1. Download and import the shapefile from this link
  2. Exclude the following states: “02”, “69”, “66”, “78”, “60”, “72”, “15”
  3. Import election results from this link
  4. Import the datasets available on the USDA site, ensuring that FIPS code variables are named consistently across all 4 datasets
  5. Merge these 4 datasets into a single socioeconomic characteristics dataset
  6. Merge with the election data using the FIPS code
  7. Merge with the shapefile using the FIPS code. Be mindful of leading zeros in some codes. It is recommended to use the str.lstrip method to remove them
  8. Import election data from 2000 to 2016 from the MIT Election Lab. The data can be directly queried from this URL: https://dataverse.harvard.edu/api/access/datafile/3641280?gbrecs=false
  9. Create a share variable to account for each candidate’s vote share. Keep only the columns "year", "FIPS", "party", "candidatevotes", "share"
  10. Perform a long to wide conversion using the pivot_table method to keep one row per county x year, with each candidate’s results in columns for that county.
  11. Merge with the rest of the dataset using the FIPS code.
import requests

# Download the script that builds the combined dataset from the course repository
url = 'https://raw.githubusercontent.com/linogaliana/python-datascientist/main/content/modelisation/get_data.py'
r = requests.get(url, allow_redirects=True)
open('getdata.py', 'wb').write(r.content)

# Import it and build the votes DataFrame
import getdata
votes = getdata.create_votes_dataframes()

However, before focusing on data preparation, we will spend some time exploring the structure of the data from which we want to build a model. This is essential to understand its nature and choose an appropriate model.

This code introduces a dataset named votes into the environment. It is a combined dataset from different sources and appears as follows:

votes.loc[:, votes.columns != "geometry"].head(3)
STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND AWATER FIPS_x ... share_2008_democrat share_2008_other share_2008_republican share_2012_democrat share_2012_other share_2012_republican share_2016_democrat share_2016_other share_2016_republican winner
0 29 227 00758566 0500000US29227 29227 Worth 06 690564983 493903 29227 ... 0.363714 0.034072 0.602215 0.325382 0.041031 0.633588 0.186424 0.041109 0.772467 republican
1 31 061 00835852 0500000US31061 31061 Franklin 06 1491355860 487899 31061 ... 0.284794 0.019974 0.695232 0.250000 0.026042 0.723958 0.149432 0.045427 0.805140 republican
2 36 013 00974105 0500000US36013 36013 Chautauqua 06 2746047476 1139407865 36013 ... 0.495627 0.018104 0.486269 0.425017 0.115852 0.459131 0.352012 0.065439 0.582550 republican

3 rows × 305 columns

The following choropleth map provides a quick visualization of the results (Alaska and Hawaii are excluded).

Code to reproduce this map
from plotnine import *

# republican : red, democrat : blue
color_dict = {'republican': '#FF0000', 'democrats': '#0000FF'}

(
  ggplot(votes) +
  geom_map(aes(fill = "winner")) +
  scale_fill_manual(color_dict) +
  labs(fill = "Winner") +
  theme_void() +
  theme(legend_position = "bottom")
)

The Territorial Trap

As mentioned in the chapter on mapping, choropleth maps can give the misleading impression that the Republican Party won the 2020 election by a large margin, because this type of representation gives more visual weight to large, sparsely populated areas than to dense ones. This explains why such maps have been used to justify contesting the election results.

There are alternative representations better suited to such phenomena where density is significant. One such representation is proportional circles (see Insee (2018), “The Territorial Trap in Cartography”). Proportional circles allow the eye to focus on denser areas rather than large open spaces. With this representation, it becomes clear that the Democratic vote is in the majority, which was obscured by the flat color fill.

The GIF “Land does not vote, people do”, which gained popularity in 2020, offers another visualization approach. The original map was created with JavaScript. However, with Python, we have several tools to replicate this map at a low cost using one of the JavaScript overlays discussed in the visualization section.

Code to reproduce this interactive map
import numpy as np
import pandas as pd
import geopandas as gpd
import plotly
import plotly.graph_objects as go


# Compute county centroids and the winning party in each county
centroids = votes.copy()
centroids.geometry = centroids.centroid

color_dict = {"republican": '#FF0000', 'democrats': '#0000FF'}
centroids["winner"] = np.where(centroids['votes_gop'] > centroids['votes_dem'], 'republican', 'democrats')

# Keep only the fields needed for the scatter map
centroids['lon'] = centroids['geometry'].x
centroids['lat'] = centroids['geometry'].y
centroids = pd.DataFrame(centroids[["county_name", 'lon', 'lat', 'winner', 'CENSUS_2020_POP', "state_name"]])

df = centroids.copy()

df['color'] = df['winner'].replace(color_dict)
df['size'] = df['CENSUS_2020_POP'] / 6000  # scale population to a plottable marker size
df['text'] = df['CENSUS_2020_POP'].astype(int).apply(lambda x: '<br>Population: {:,} people'.format(x))
df['hover'] = df['county_name'].astype(str) + df['state_name'].apply(lambda x: ' ({}) '.format(x)) + df['text']

fig_plotly = go.Figure(
  data=go.Scattergeo(
  locationmode = 'USA-states',
  lon=df["lon"], lat=df["lat"],
  text = df["hover"],
  mode = 'markers',
  marker_color = df["color"],
  marker_size = df['size'],
  hoverinfo="text"
  )
)

fig_plotly.update_traces(
  marker = {'opacity': 0.5, 'line_color': 'rgb(40,40,40)', 'line_width': 0.5, 'sizemode': 'area'}
)

fig_plotly.update_layout(
  title_text = "Reproduction of the \"Acres don't vote, people do\" map <br>(Click legend to toggle traces)",
  showlegend = True,
  geo = {"scope": 'usa', "landcolor": 'rgb(217, 217, 217)'}
)

fig_plotly.show()

2 General Approach

In this chapter, we will focus on data preparation to be done before modeling work. This step is essential to ensure consistency between the data and modeling assumptions and to produce scientifically valid analyses.

The general approach we will adopt in this chapter, which will be refined in subsequent chapters, is as follows:

Figure 2.1: Illustration of machine learning methodology

Figure 2.1 illustrates the structure of a machine learning problem.

First, the available dataset is split into two parts: training sample and validation sample. The former is used to train a model, and its prediction quality is evaluated on the latter to limit overfitting bias. The next chapter will delve deeper into model evaluation. At this stage of our progress, we will focus in this chapter on data issues.

The Scikit library is particularly convenient because it offers many machine learning algorithms with a few unified entry points, especially the fit and predict methods. However, the unification extends beyond algorithm training. All data preparation steps integrated into Scikit offer these same entry points. In other words, data preparations are built like parameter estimations that can be reapplied to another dataset. For example, this data preparation might be a mean and variance estimation for normalizing variables. The mean and variance are calculated on the training sample, and the same values can be reapplied to another dataset to normalize it in the same way.
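To make this concrete, here is a minimal sketch of that fit/transform logic on simulated data (the arrays, seed, and parameters below are purely illustrative and are not taken from the votes dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two numeric features drawn at random
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10, scale=3, size=(100, 2))
X_test = rng.normal(loc=10, scale=3, size=(50, 2))

scaler = StandardScaler()
scaler.fit(X_train)                    # "training": estimate means and variances on X_train
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)  # the same parameters are reapplied to new data

print(scaler.mean_, scaler.scale_)     # the learned preprocessing "parameters"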

3 Exploring data structure

The first necessary step before diving into modeling is to determine which variables to include in the model.

Pandas functionalities are sufficient at this stage for exploring simple structures. However, when dealing with a dataset with numerous explanatory variables (features in machine learning, covariates in econometrics), it is often wise to start with a variable selection step, which we will cover later in the dedicated section.

Before selecting the best set of explanatory variables, we will start with a small and arbitrary selection. The first task is to represent the relationships within the data, particularly the relationship between explanatory variables and the dependent variable (the Republican Party’s score), as well as relationships among explanatory variables.

Exercise 2 (Optional): Examining Correlations Between Variables

This exercise is OPTIONAL

  1. Create a smaller DataFrame df2 with the variables winner, votes_gop, Unemployment_rate_2019, Median_Household_Income_2021, Percent of adults with less than a high school diploma, 2018-22, Percent of adults with a bachelor's degree or higher, 2018-22.
  2. Use a graph to represent the correlation matrix. You can use the seaborn package and its heatmap function.
  3. Represent the relationships between these variables with a scatter plot matrix (for example with pd.plotting.scatter_matrix).
# 1. Create the df2 DataFrame
df2 = votes.set_index("GEOID").loc[
    : ,
    [
        "winner", "votes_gop",
        'Unemployment_rate_2019', 'Median_Household_Income_2021',
        'Percent of adults with less than a high school diploma, 2018-22',
        "Percent of adults with a bachelor's degree or higher, 2018-22"
    ]
]
df2 = df2.dropna()

The correlation matrix created with seaborn (question 2) will look as follows:

The scatter plot obtained after question 3 will look like:

# 3. Scatter plot matrix
pd.plotting.scatter_matrix(df2)
(The output is a 5 × 5 grid of Axes objects: the scatter plot matrix crosses votes_gop, Unemployment_rate_2019, Median_Household_Income_2021 and the two educational attainment variables.)

The result of question 4, on the other hand, should look like the following chart:

4 Transforming Data

Differences in scale or distribution between variables can diverge from the underlying assumptions in models.

For example, in linear regression, categorical variables are not treated the same way as variables with values in \(\mathbb{R}\). A discrete variable (taking a finite number of values) must be transformed into a sequence of 0/1 variables (dummies) relative to a reference category to meet the assumptions of linear regression. This type of transformation is known as one-hot encoding, which we will revisit. It is one of many transformations available in Scikit to align a dataset with mathematical assumptions.

All these data preparation tasks fall under preprocessing or feature engineering. One advantage of using Scikit is that preprocessing tasks can be considered learning tasks. Preprocessing involves learning parameters from a data structure (e.g., estimating means and variances to subtract from each observation), and these parameters can then be applied to observations not used to calculate them. In other words, this data preparation fits seamlessly into the pipeline shown in Figure 2.1.

4.1 Continuous Variable Preprocessing

We will cover two very common preprocessing steps for continuous variables:

  1. Standardization transforms data so that it has zero mean and unit variance, i.e. the first two moments of a \(\mathcal{N}(0,1)\) distribution.

  2. Normalization rescales each observation so that it has unit norm (\(\ell_1\) or \(\ell_2\)). In other words, with the chosen norm, each row's norm equals 1.

There are other options, such as MinMaxScaler to rescale variables according to observed minimum and maximum bounds. The choice of method depends on the algorithms chosen later: the assumptions of k-nearest neighbors (knn) differ from those of a random forest. For this reason, complete pipelines are usually defined, integrating both preprocessing and learning, which will be discussed in upcoming chapters.
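As a quick illustration of the difference between these options, here is a small sketch on a toy matrix (the values are arbitrary):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

print(StandardScaler().fit_transform(X))        # each column: zero mean, unit variance
print(MinMaxScaler().fit_transform(X))          # each column rescaled to [0, 1]
print(Normalizer(norm="l2").fit_transform(X))   # each row rescaled to unit l2 norm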

Caution

For statisticians, the term normalization in Scikit terminology can have a counterintuitive meaning. One might expect normalization to transform a variable so that \(X \sim \mathcal{N}(0,1)\). In Scikit, this is actually standardization.

4.1.1 Standardization

Standardization transforms data so that it has zero mean and unit variance, like a \(\mathcal{N}(0,1)\) distribution. Many machine learning models perform best when features are on this common scale. Even when it is not strictly required, as with logistic regression, it can speed up the convergence of the fitting algorithm.

Exercise 3: Standardization
  1. Standardize the Median_Household_Income_2021 variable (do not overwrite the values!) and examine the histogram before and after normalization. This transformation should be applied to the entire column; the next questions will address sample splitting and extrapolation.

Note: This should yield a distribution centered at zero, and we could verify that the empirical variance is indeed 1. We could also check that this is true when transforming multiple columns at once.

  2. Create scaler, a Transformer built on the first 1000 rows of your df2 DataFrame, excluding the target variable winner. Check the mean and standard deviation of each column for these same observations.

Note: The parameters used for subsequent standardization are stored in the .mean_ and .scale_ attributes.

These attributes can be seen as parameters trained on a specific dataset that can be reused on another, as long as the dimensions match.

  3. Apply scaler to the remaining rows of the DataFrame and compare the distributions obtained for the Median_Household_Income_2021 variable.

Note: Once applied to another DataFrame, you may notice that the distribution is not exactly standardized in the DataFrame on which the parameters were not estimated. This is normal; the initial sample was not random, so the means and variances of this sample do not necessarily match those of the complete sample.
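One possible sketch of these three questions, assuming the df2 DataFrame built in Exercise 2 is available (this is an outline of the expected steps, not the official solution):

from sklearn.preprocessing import StandardScaler

# Question 1: standardize a single column without overwriting it
income_std = StandardScaler().fit_transform(df2[["Median_Household_Income_2021"]])
print(income_std.mean(), income_std.std())  # approximately 0 and 1

# Question 2: fit a scaler on the first 1000 rows, excluding the target variable
X = df2.drop(columns="winner")
scaler = StandardScaler().fit(X.head(1000))
print(scaler.mean_, scaler.scale_)  # parameters learned on those 1000 observations

# Question 3: reapply the same parameters to the remaining rows
X_rest_std = scaler.transform(X.iloc[1000:])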

Before standardization, our variable has this distribution:


After standardization, the scale of the variable has changed.


We indeed obtain a mean equal to 0 and a variance equal to 1, within numerical approximations:

| Statistic | Value |
|-----------|-----------|
| Mean | -0.000000 |
| Variance | 1.000323 |

For question 2, if we attempt to present the obtained statistics in a readable table, we get

| Variable | Mean before Scaling | Std before Scaling | Mean after Scaling | Std after Scaling |
|----------|--------------------|--------------------|--------------------|-------------------|
| votes_gop | 17.4K | 32.8K | −3.4E−17 | 1.00 |
| Unemployment_rate_2019 | 3.81 | 1.29 | 1.1E−16 | 1.00 |
| Median_Household_Income_2021 | 58.3K | 14.0K | 1.2E−16 | 1.00 |
| Percent of adults with less than a high school diploma, 2018-22 | 11.7 | 5.97 | 1.6E−16 | 1.00 |
| Percent of adults with a bachelor's degree or higher, 2018-22 | 23.0 | 9.79 | 2.6E−16 | 1.0 |
| y_standard | −0.026 | 0.92 | −7.1E−18 | 1.00 |

It is very clear from this table that the standardization worked well.

Now, if we create a formal transformer for our variables (question 3)

StandardScaler()

We can extrapolate our standardizer to a larger dataset. If we look at the distribution obtained on the first 1000 rows (question 3), we find a scale consistent with a \(\mathcal{N}(0,1)\) distribution for the unemployment variable:

Figure 4.1: Standardized unemployment rate on the observations used for training

However, we see that this distribution does not correspond to the one that would truly normalize the rest of the data.

Figure 4.2: Standardized unemployment rate on observations not used for training

This illustrates a classic machine learning problem known as data drift, which occurs when we extrapolate to data whose distribution no longer matches that of the training data. This situation typically arises when an algorithm has been trained on a biased sample of the population, or with non-stationary time series. It is therefore essential to think carefully about how the training sample is built and how far results can be extrapolated to a broader population: the external validity of the model, whether for the data preparation or for the learning algorithm itself, may be worthless if this step is rushed.

Data Drift

Data drift refers to a shift in data distribution over time, leading to a degradation in the performance of a machine learning model that, by design, was trained on past data.

This phenomenon can occur due to changes in the target population, shifts in data characteristics, or external factors.

Detecting data drift is crucial to adjust or retrain the model, ensuring its relevance and accuracy. Detection techniques include statistical tests and monitoring specific metrics.
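For instance, a two-sample Kolmogorov-Smirnov test can flag a shift between the training distribution and newly observed data. Here is a minimal sketch on simulated data (the 0.5 shift is purely illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, size=1000)      # distribution seen at training time
production = rng.normal(0.5, 1, size=1000)   # shifted distribution observed later

# A small p-value suggests the two samples come from different distributions
res = stats.ks_2samp(reference, production)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.3g}")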

4.1.2 Normalization

Normalization is the process of rescaling data so that each observation has unit norm (\(\ell_1\) or \(\ell_2\)). In other words, with the chosen norm, each row's norm equals 1. By default, Scikit uses an \(\ell_2\) norm. This transformation is especially useful in text classification or clustering.

This is also an opportunity to explore how to split data into multiple samples using Scikit's train_test_split function. We will use 70% of the data to estimate the normalization parameters (training phase) and extrapolate to the remaining 30%. This split is fairly standard but can of course be adapted to the project. The advantage of train_test_split over manual sampling with Pandas' sample method is that Scikit's function gives much more control over the sampling, in particular when stratification is desired, while remaining reliable; doing this by hand is tedious and error-prone.

Exercise 4: Normalization
  1. Using the documentation for the train_test_split function in Scikit, create two samples (70% and 30% of the data, respectively).
  2. Normalize the Median_Household_Income_2021 variable (do not overwrite the values!) and examine the histogram before and after normalization.
  3. Verify that the \(\mathcal{l}_2\) norm is indeed equal to 1 (using the np.linalg.norm function with the axis=1 argument) for the first 10 observations in the training set and then for the other observations.
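A possible sketch of this exercise, again assuming the df2 DataFrame from Exercise 2 (here the whole numeric table is normalized row by row, and the income column can then be inspected):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

# Question 1: 70% / 30% split
X = df2.drop(columns="winner")
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Question 2: fit the normalizer on the training sample, reapply it to the test sample
scaler = Normalizer(norm="l2").fit(X_train)
X_train_norm = scaler.transform(X_train)
X_test_norm = scaler.transform(X_test)

# Question 3: every row should now have unit l2 norm
print(np.linalg.norm(X_train_norm[:10], axis=1))
print(np.linalg.norm(X_test_norm[:10], axis=1))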
Question 2, before normalization

Question 2, transformed variable, on the normalization (training) sample

Question 2, transformed variable, obtained from the trained parameters

Finally, if we compute the norm, we obtain the expected result on both the train sample and the extrapolated sample.

X_train_norm2 X_test_norm2
0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0

4.2 Encoding Categorical Values

Categorical data must be recoded into numeric values to be integrated into machine learning models.

This can be done in several ways with Scikit:

  • LabelEncoder: transforms a vector ["a","b","c"] into a numeric vector [0,1,2]. This approach has the drawback of introducing an order to the categories, which is not always desirable.
  • OrdinalEncoder: a generalized version of LabelEncoder designed to apply to matrices (\(X\)), while LabelEncoder applies mainly to a vector (\(y\)).

For one-hot encoding, several methods are available:

  • pandas.get_dummies performs a dummy expansion: a vector of size \(n\) with \(K\) categories is transformed into an \(n \times K\) matrix, where each column is the dummy variable for category \(k\). Because the \(K\) dummy columns sum to one, they are multicollinear; in a linear regression with a constant, one category should therefore be dropped before estimation.

  • OneHotEncoder is a generalized (and optimized) version of dummy expansion. This is the recommended method.
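A minimal sketch of both approaches on a toy column (the color variable is purely illustrative; sparse_output requires a recent Scikit version):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# pandas dummy expansion; drop_first=True removes the reference category
print(pd.get_dummies(colors["color"], drop_first=True))

# Scikit equivalent: once fitted, the same encoding can be reapplied to new data
enc = OneHotEncoder(sparse_output=False, drop="first")
print(enc.fit_transform(colors[["color"]]))
print(enc.categories_)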

4.3 Imputation

Data often contains missing values, that is, observations in our DataFrame containing a NaN. These gaps can cause bugs or misinterpretations when moving to modeling. One initial approach could be to remove all observations with a NaN in at least one column. However, if our table contains many NaNs, or if these are spread across numerous columns, we risk removing a significant number of rows, and, with that, losing important information, as missing values are rarely randomly distributed.

While this solution remains viable in many cases, a more robust approach called imputation exists. This method involves replacing missing values with a specified value. For example:

  • Mean imputation: replacing all NaNs in a column with the column’s average;
  • Median imputation on the same principle, or using the most frequent column value for categorical variables;
  • Regression imputation: using other variables to interpolate an appropriate replacement value.

More complex methods exist, but in many cases the approaches above already give perfectly satisfactory results. The Scikit package makes imputation very straightforward (documentation here).
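As an illustration, here is a minimal sketch of mean imputation with Scikit's SimpleImputer on a toy array (the values are arbitrary); switching strategy to "median" or "most_frequent" covers the other cases mentioned above:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0], [np.nan, 12.0], [3.0, np.nan], [4.0, 16.0]])

# Mean imputation: column means are learned by fit and reused by transform
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(imputer.statistics_)  # the learned column means
print(X_imputed)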

4.4 Handling Outliers

Outliers are observations that significantly deviate from the general trend of other observations in a dataset. In other words, they are data points that stand out unusually from the overall data distribution. This may be due to data entry errors, respondents who incorrectly answered a survey, or simply extreme values that may bias a model too much.

For example, these could be 3 individuals measuring over 4 meters in height within a population or household incomes exceeding 10 million euros per month at a national level.

It is good practice to routinely examine the distribution of available variables to check if some values deviate too significantly from others. Sometimes these values will interest us, for instance, if we are focusing solely on very high incomes (top 0.1%) in France. However, often these values will be more of a hindrance, especially if they don’t make sense in the real world.

If we find that the presence of these extreme values or outliers in our dataset is more problematic than helpful, it is reasonable to simply remove them. Most of the time, we set a percentage of data to remove, such as 0.1%, 1%, or 5%, then remove the corresponding extreme values in both tails of the distribution.

Several packages can perform these operations, which become more complex when outliers are examined across multiple variables at once. The IsolationForest class in the sklearn.ensemble module is particularly noteworthy.
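A small sketch of both approaches, quantile trimming and IsolationForest, on simulated incomes (the thresholds and the contamination rate are illustrative choices):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
incomes = np.concatenate([rng.normal(50_000, 10_000, size=995), [5e6, 8e6, 1e7, 2e7, 3e7]])

# Simple approach: trim 1% in each tail of the distribution
low, high = np.quantile(incomes, [0.01, 0.99])
trimmed = incomes[(incomes >= low) & (incomes <= high)]

# Model-based alternative: IsolationForest labels outliers with -1
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(incomes.reshape(-1, 1))
inliers = incomes[labels == 1]
print(len(trimmed), len(inliers))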

4.5 Application exercise

Be careful with new categories!

Transformers create a mapping between text categories and numeric values. This assumes that the data used to build this mapping includes all possible values for the text categories.

However, if new categories appear, the classifier will not know how to transform these into numeric values, which will cause an error in Scikit. This technical error makes sense, as it would require updating not only the mapping but also the estimation of underlying parameters.
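The sketch below reproduces this situation on a toy column; one common way around the error, when acceptable, is the handle_unknown="ignore" option of OneHotEncoder, which encodes unseen categories as all-zero rows (the data here is illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"state": ["Texas", "Iowa", "Texas"]})
new = pd.DataFrame({"state": ["Texas", "Vermont"]})  # "Vermont" was never seen during fit

enc = OneHotEncoder(sparse_output=False)  # handle_unknown="error" by default
enc.fit(train)
# enc.transform(new) would raise a ValueError because of the unseen category

enc_safe = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(train)
print(enc_safe.transform(new))  # the unknown category becomes a row of zeros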

Exercise 5: Encoding Categorical Variables
  1. Create df that retains only the state_name and county_name variables in votes.

  2. Apply a LabelEncoder to state_name. Note: the label encoding result is relatively intuitive, especially when compared with the initial vector.

  3. Observe the dummy expansion of state_name

  4. Apply an OrdinalEncoder to df[['state_name', 'county_name']]. Note: the ordinal encoding result is consistent with the label encoding.

  5. Apply a OneHotEncoder to df[['state_name', 'county_name']]

Note: scikit optimizes the object required to store the result of a transformation model. For example, the One Hot encoding result is a very large object. In this case, scikit uses a Sparse matrix.
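A possible sketch of these five questions, assuming the votes DataFrame loaded earlier (this is an outline, not the official solution):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

# Question 1
df = votes.loc[:, ["state_name", "county_name"]]

# Question 2: LabelEncoder applies to a single vector
label_state = LabelEncoder().fit_transform(df["state_name"])

# Question 3: dummy expansion with pandas
dummies_state = pd.get_dummies(df["state_name"])

# Question 4: OrdinalEncoder generalizes LabelEncoder to several columns
ordinal = OrdinalEncoder().fit_transform(df[["state_name", "county_name"]])

# Question 5: OneHotEncoder returns a sparse matrix by default
one_hot = OneHotEncoder().fit_transform(df[["state_name", "county_name"]])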

If we examine the labels and their numeric transpositions via LabelEncoder

array([[23, 'Missouri'],
       [25, 'Nebraska'],
       [30, 'New York'],
       ...,
       [41, 'Texas'],
       [41, 'Texas'],
       [41, 'Texas']], shape=(3107, 2), dtype=object)
Alabama Arizona Arkansas California Colorado Connecticut Delaware District of Columbia Florida Georgia ... South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False True False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3102 False False False False False False False False False False ... False False False False False False False False False False
3103 False False False False False False False False False False ... False False False False False False False False False False
3104 False False False False False False False False False False ... False False True False False False False False False False
3105 False False False False False False False False False False ... False False True False False False False False False False
3106 False False False False False False False False False False ... False False True False False False False False False False

3107 rows × 49 columns

If we examine the OrdinalEncoder:

array([23., 25., 30., ..., 41., 41., 41.], shape=(3107,))
<3107x1891 sparse matrix of type '<class 'numpy.float64'>'
    with 6214 stored elements in Compressed Sparse Row format>

Reference

Insee. 2018. “Guide de Sémiologie Cartographique.”

Additional information

Environment used to build and test the examples on this page.

Latest built version: 2025-03-19

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0


Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.