Analyzing multiple datasets and producing summary statistics with Pandas

The introductory chapter to Pandas presented how data were organized as DataFrames and how the Pandas ecosystem can be useful to perform simple operations on datasets. This chapter consolidates these principles by introducing two classic types of operations from the data scientist’s toolbox: group statistics and data merging.

Tutorial
Data Manipulation
Author

Lino Galiana

Published

2024-07-10


1 Introduction

The introductory chapter to Pandas presented the concept of data organized in the form of a DataFrame and the practicality of the Pandas ecosystem for performing simple operations on a dataset.

It is rare to work exclusively on a raw source. A dataset generally gains value when compared to other sources. For researchers, this allows contextualizing the information present in one source by comparing or associating it with other sources. For data scientists in the private sector, it often involves linking information about the same person in multiple customer databases or comparing customers with each other.

One of the benefits of modern data science tools, especially Pandas, is the ease with which they allow restructuring sources to work on multiple datasets in a project. This chapter consolidates the principles previously seen by refining the data processing. It will mainly explore two types of operations:

  • Group descriptive statistics;
  • Data merging by common characteristics.

Performing this work simply, reliably, and efficiently is essential for data scientists as this task is common. Fortunately, Pandas handles this very well with structured data. In the following chapters, and also throughout the section on text data processing, we will see how to handle less structured data.

Through this work, we will deepen our understanding of a real phenomenon through detailed descriptive statistics. This is an essential step before moving on to inferential statistics, the approach that consists of formalizing and generalizing correlations or causal relationships between observed characteristics and a phenomenon.

Skills at the end of this chapter
  • Retrieve an official dataset from Insee;
  • Build group descriptive statistics and switch between data levels;
  • Merge data (reshape, merge) to add value;
  • Create a beautiful table to communicate descriptive statistics.

1.1 Environment

The previous chapter used almost exclusively the Pandas library. In this chapter, we will use other packages in addition to it.

As explained below, we will use a library called pynsee to retrieve Insee data useful for enriching our Ademe dataset. This library is not installed by default in Python. Before using it, it is necessary to install it, along with the great_tables library that we will see at the end of this chapter:

!pip install xlrd
!pip install pynsee
!pip install great_tables

The instruction !pip install <pkg> is a way to tell Jupyter, the execution engine behind notebooks, that the following command (pip install <pkg>) is a system command to be executed outside of Python (in the terminal, for example, for a Linux system).

The essential packages to start this chapter are as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pynsee
import pynsee.download

To obtain reproducible results, we can set the seed of the pseudo-random number generator:

np.random.seed(123)

1.2 Data used

This tutorial continues the exploration of the dataset from the previous chapter:

  • Greenhouse gas emissions estimated at the municipal level by ADEME. The dataset is available on data.gouv and can be directly queried in Python with this URL;

The issues of data enrichment (associating one source with another based on common characteristics) will be presented using two sources produced by Insee:

  • The official geographic code, a reference produced by Insee used to identify municipalities with a unique code, unlike the postal code;
  • The Filosofi data, a source on French income at a fine spatial scale constructed by Insee from tax returns and social benefit information. In this case, we will use income levels and populations1 at the municipal level to compare them with our emissions data.

To facilitate the import of Insee data, it is recommended to use the pynsee package, which simplifies access to the main Insee datasets available on the insee.fr website or via APIs.

Note

The pynsee package has two main entry points:

  • The Insee APIs, which will be illustrated in the dedicated chapter.
  • Some datasets directly from the Insee website (insee.fr)

In this chapter, we will exclusively use the second approach through the pynsee.download module.

The list of available data from this package is here. The download_file function expects a unique identifier to know which database to fetch and restructure from the insee.fr website.

Knowing the list of available databases

To know the list of available databases, call get_file_list after importing pynsee. This returns a DataFrame in which you can search, for example, with a keyword:

import pynsee

meta = pynsee.get_file_list()
meta.loc[meta["label"].str.contains(r"Filosofi.*2016")]
id name label collection link type zip big_zip data_file tab ... label_col date_ref meta_file separator type_col long_col val_col encoding last_row missing_value
79 FILOSOFI_COM_2016 FILOSOFI_COM Données Filosofi niveau communal – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls COM ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
80 FILOSOFI_EPCI_2016 FILOSOFI_EPCI Données Filosofi niveau EPCI – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls EPCI ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
81 FILOSOFI_ARR_2016 FILOSOFI_ARR Données Filosofi niveau arondissement – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls ARR ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
82 FILOSOFI_DEP_2016 FILOSOFI_DEP Données Filosofi niveau départemental – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls DEP ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
83 FILOSOFI_REG_2016 FILOSOFI_REG Données Filosofi niveau régional – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls REG ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
84 FILOSOFI_METRO_2016 FILOSOFI_METRO Données Filosofi niveau France métropolitaine ... FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls METRO ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
85 FILOSOFI_AU2010_2016 FILOSOFI_AU2010 Données Filosofi niveau aire urbaine – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls AU2010 ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
86 FILOSOFI_UU2010_2016 FILOSOFI_UU2010 Données Filosofi niveau unité urbaine – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls UU2010 ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN
87 FILOSOFI_ZE2010_2016 FILOSOFI_ZE2010 Données Filosofi niveau zone d’emploi – 2016 FILOSOFI https://www.insee.fr/fr/statistiques/fichier/4... xls True False base-cc-filosofi-2016.xls ZE2010 ... NaN 2016-01-01 NaN NaN NaN NaN NaN NaN NaN NaN

9 rows × 24 columns

Here, meta["label"].str.contains(r"Filosofi.*2016") means: "pandas, find me all labels containing the terms Filosofi and 2016" (.* means "any number of characters in between").
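As a small illustration of this regex logic on a toy Series (values invented for the example):

import pandas as pd

s = pd.Series(
    ["Données Filosofi niveau communal – 2016", "Filosofi 2015", "Autre base 2016"]
)
s.str.contains(r"Filosofi.*2016")
# 0     True   (Filosofi ... 2016)
# 1    False   (no 2016 after Filosofi)
# 2    False   (no Filosofi)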

2 Retrieving data for this chapter

2.1 French carbon emissions dataset

As explained in the previous chapter, these data can be imported very simply with Pandas:

import pandas as pd

url = "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"
emissions = pd.read_csv(url)
emissions.head(2)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207

Let us store right away the names of the emitting sectors present in the dataset, to simplify later use:

secteurs = emissions.select_dtypes(include="number").columns

Subsequent exploitations of these data will use the departmental dimension, the construction of which we demonstrated in the previous chapter:

emissions["dep"] = emissions["INSEE commune"].str[:2]

2.2 Income dataset

We will use the Filosofi data (income data) at the municipal level for 2016. It is not the same year as the CO2 emissions data, so it is not perfectly rigorous, but it will still illustrate the main functionalities of Pandas.

The main entry point of the pynsee package is the download_file function.

The code to download the data is as follows:

from pynsee.download import download_file

filosofi = download_file("FILOSOFI_COM_2016")

The resulting DataFrame looks like this:

filosofi.sample(3)
CODGEO LIBGEO NBMENFISC16 NBPERSMENFISC16 MED16 PIMP16 TP6016 TP60AGE116 TP60AGE216 TP60AGE316 ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
34640 93077 Villemomble 11968 30123.5 21782.222222222223 61 19 22 23 19 ... 23.1 9 5.1 2.1 1.7 1.3 -19.6 9371.481481481482 40950.8 4.36972532901237
9988 28073 Champseru 121 326.5 24806 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17537 48190 Termes 92 210.5 19108.571428571428 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 29 columns

Pandas automatically handled the variable types. It does this relatively well, but a check is always useful for variables that have a specific status.
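For instance, a quick look at the inferred types (a minimal sketch; output not shown here):

# Count the columns by inferred dtype
filosofi.dtypes.value_counts()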

For variables that are not of type float but should be, we modify their type.

filosofi = filosofi.astype({c: "float" for c in filosofi.columns[2:]})

A quick glance at the data gives a fairly precise idea of how the data are organized. We notice that some variables in filosofi seem to have many missing values (statistical secrecy), while others seem complete. If we want to exploit filosofi, we need to pay attention to the chosen variable.
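One hedged way to quantify these missing values, column by column:

# Share of missing values per column, from most to least affected
filosofi.isna().mean().sort_values(ascending=False).head(10)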

Our ultimate goal will be to link the information contained between these two datasets. Otherwise, we risk being frustrated: we will want to know more about carbon emissions but will be very limited in the possibilities of analysis without adding additional information from filosofi.

3 Descriptive statistics by group

3.1 Principle

In the previous chapter, we saw how to obtain an aggregated statistic easily with Pandas. However, it is common to have data with intermediate analysis strata that are relevant: geographical variables, membership in socio-demographic groups related to recorded characteristics, temporal period indicators, etc. To better understand the structure of the data, data scientists are often led to construct descriptive statistics on sub-groups present in the data. For example, we previously constructed emission statistics at the national level. But what about the emission profiles of different departments? To answer this question, it will be useful to aggregate our data at the departmental level. This will give us different information from the initial dataset (municipal level) and the most aggregated level (national level).

In SQL, it is very simple to segment the data, perform operations on coherent blocks, and recollect the results in the appropriate dimension. The underlying logic, split-apply-combine, has been adopted by data manipulation languages and libraries in general, and pandas is no exception.

The following image, from this site, well represents how the split-apply-combine approach works:

Split-apply-combine (Source: unlhcc.github.io)

In Pandas, we use groupby to segment the data according to one or more axes (this tutorial on the subject is particularly useful). All the aggregation operations (counting, averages, etc.) that we saw earlier can be applied by group.

Technically, this operation involves creating an association between labels (values of group variables) and observations. Using the groupby method does not trigger operations until a statistic is implemented; it simply creates a formal relationship between observations and groupings that will be used later:

filosofi["dep"] = filosofi["CODGEO"].str[:2]
filosofi.groupby("dep").__class__
pandas.core.groupby.generic.DataFrameGroupBy

As long as we do not call an action on a DataFrame by group, such as head or display, pandas performs no operations. This is called lazy evaluation. For example, the result of df.groupby('dep') is a transformation that has not yet been evaluated:

filosofi.groupby("dep")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f464fc95550>

3.2 Illustration 1: counting by group

To illustrate how this principle applies to counting, we can count the number of municipalities per department in 2023 (a statistic that changes every year because of municipal mergers). For this, we simply take the reference list of French municipalities from the official geographical code (COG) and count the distinct commune codes by department:

import pandas as pd

url_cog_2023 = "https://www.insee.fr/fr/statistiques/fichier/6800675/v_commune_2023.csv"
cog_2023 = pd.read_csv(url_cog_2023)

With this dataset, without resorting to group statistics, we can already know how many municipalities, departments, and regions we have in France, respectively:

# We restrict to the status "Commune" because this file also contains Insee codes
# for other statuses, such as the municipal arrondissements of Paris, Lyon, and Marseille
communes = cog_2023.loc[cog_2023["TYPECOM"] == "COM"]
communes.loc[:, ["COM", "DEP", "REG"]].nunique()
COM    34945
DEP      101
REG       18
dtype: int64

Now, let’s look at the departments with the most municipalities. It is the same counting function where we play, this time, on the group from which the statistic is calculated.

Calculating this statistic is quite straightforward when you understand the principle of calculating statistics with Pandas:

communes = cog_2023.loc[cog_2023["TYPECOM"] == "COM"]
communes.groupby("DEP").agg({"COM": "nunique"})
COM
DEP
01 392
02 798
03 317
04 198
05 162
... ...
971 32
972 34
973 22
974 24
976 17

101 rows × 1 columns

In SQL, we would use the following query:

SELECT dep, COUNT(DISTINCT COM) AS COM
FROM communes
WHERE TYPECOM = 'COM'
GROUP BY dep;

The output is indexed by the grouping variable (the department). This is not very convenient, as we mentioned in the previous chapter. It is more practical to turn this index back into an ordinary column with reset_index. Finally, with sort_values, we obtain the desired statistic:

(
    communes.groupby("DEP")
    .agg({"COM": "nunique"})
    .reset_index()
    .sort_values("COM", ascending=False)
)
DEP COM
62 62 890
1 02 798
80 80 772
57 57 725
76 76 708
... ... ...
96 971 32
99 974 24
98 973 22
100 976 17
75 75 1

101 rows × 2 columns

3.3 Illustration 2: aggregates by group

To illustrate aggregates by group, we can use the Insee filosofi dataset and count the population using the variable NBPERSMENFISC16.

To calculate the total for the whole of France, we can do it in two ways:

filosofi["NBPERSMENFISC16"].sum() * 1e-6
66.9322415
filosofi.agg({"NBPERSMENFISC16": "sum"}).div(1e6)
NBPERSMENFISC16    66.932242
dtype: float64

where the results are reported in millions of people. The logic is the same for group statistics: we simply replace filosofi with filosofi.groupby('dep') to create a version of our dataset partitioned by department:

# Note the order of operations: first perform the groupby, then select the column of interest
filosofi.groupby("dep")["NBPERSMENFISC16"].sum()
dep
01     613088.0
02     514249.0
03     329435.0
04     156537.5
05     133992.5
        ...    
92    1583682.0
93    1586664.5
94    1345977.0
95    1226059.0
97    1191947.0
Name: NBPERSMENFISC16, Length: 97, dtype: float64
filosofi.groupby("dep").agg({"NBPERSMENFISC16": "sum"})
NBPERSMENFISC16
dep
01 613088.0
02 514249.0
03 329435.0
04 156537.5
05 133992.5
... ...
92 1583682.0
93 1586664.5
94 1345977.0
95 1226059.0
97 1191947.0

97 rows × 1 columns

The second approach is more practical because it directly gives a Pandas DataFrame rather than an indexed Series. From this, a few basic manipulations suffice to produce a shareable table on departmental demographics. However, this table would be somewhat rudimentary since we currently only have the department numbers. To get the department names, we would need a second dataset and a merge on the information common to both (in this case, the department number). This is the subject of the next part.
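Before that, here is a minimal sketch of the "few basic manipulations" mentioned above (the column name population is chosen here for readability):

population_dep = (
    filosofi.groupby("dep")
    .agg({"NBPERSMENFISC16": "sum"})
    .reset_index()
    .rename(columns={"NBPERSMENFISC16": "population"})
    .sort_values("population", ascending=False)
)
population_dep.head(3)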

3.4 Application

This application exercise uses the Ademe dataset named emissions previously discussed.

Exercise 1: Group Aggregations
  1. Calculate the total emissions of the "Résidentiel" (residential) sector by department and express each value relative to that of the most polluting department in this sector. Take a critical look at the reality that this statistic reflects.

  2. Calculate the total emissions for each sector in each department. For each department, calculate the proportion of total emissions coming from each sector.

Hint for this question
  • “Group by” = groupby
  • “Total emissions” = agg({*** : "sum"})

In question 1, the result should be as follows:

dep Résidentiel Résidentiel (% valeur max)
59 59 3.498347e+06 1.000000
75 75 1.934580e+06 0.552998
69 69 1.774653e+06 0.507283
62 62 1.738090e+06 0.496832
57 57 1.644192e+06 0.469991
This ranking may reflect demographics rather than the process we wish to measure. Without the addition of information on the population of each département to control for this factor, it is difficult to know whether there is a structural difference in behavior between the inhabitants of Nord (département 59) and Moselle (département 57). 






At the end of question 2, let's take the share of emissions from agriculture and the tertiary sector in departmental emissions: 
Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire ... Part Agriculture Part Autres transports Part Autres transports international Part CO2 biomasse hors-total Part Déchets Part Energie Part Industrie hors-énergie Part Résidentiel Part Routier Part Tertiaire
dep
23 1.430068e+06 5060.057601 0.000000 210196.604389 26550.858041 9752.578164 28626.245699 134197.515156 434767.868975 70733.245013 ... 60.855172 0.215326 0.000000 8.944716 1.129846 0.415012 1.218163 5.710647 18.501132 3.009986
48 7.510594e+05 5697.938112 0.000000 70903.948092 26011.591018 6065.340741 17803.285591 61033.998303 253618.488432 43661.121359 ... 60.772448 0.461052 0.000000 5.737238 2.104744 0.490781 1.440564 4.938605 20.521701 3.532867
15 1.539204e+06 8261.874450 18.804572 228415.892777 44814.875202 13138.432196 85214.659284 128315.601994 443832.903418 84364.615635 ... 59.761414 0.320777 0.000730 8.868517 1.739990 0.510115 3.308560 4.982005 17.232336 3.275556
12 2.122331e+06 13796.608978 3124.844800 331420.983449 52412.681268 35405.484754 112897.690887 268862.654280 795413.985637 170211.845832 ... 54.336847 0.353227 0.080004 8.485186 1.341893 0.906467 2.890457 6.883541 20.364540 4.357839
32 1.026604e+06 4599.824552 0.000000 201732.703762 50950.668326 16651.432346 53468.498055 158218.000190 446345.993580 105662.674213 ... 49.732924 0.222835 0.000000 9.772766 2.468261 0.806664 2.590235 7.664734 21.622845 5.118737

5 rows × 21 columns

Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire ... Part Agriculture Part Autres transports Part Autres transports international Part CO2 biomasse hors-total Part Déchets Part Energie Part Industrie hors-énergie Part Résidentiel Part Routier Part Tertiaire
dep
75 0.000000 42216.829025 1.837660e+02 1.186577e+06 27358.781206 147965.117571 434314.469384 1.934580e+06 1.625583e+06 1.331630e+06 ... 0.000000 0.627255 0.002730 17.630092 0.406495 2.198457 6.453018 28.743870 24.152808 19.785275
94 2259.429643 218992.353559 3.146283e+05 6.914050e+05 213619.661516 76341.230740 467189.038927 1.336894e+06 1.169432e+06 7.636502e+05 ... 0.043001 4.167781 5.987888 13.158562 4.065530 1.452898 8.891367 25.443275 22.256193 14.533505
92 91.408184 12340.794839 2.101194e+02 1.067889e+06 264497.880711 242842.018012 706597.424067 1.466794e+06 1.198420e+06 8.360132e+05 ... 0.001577 0.212930 0.003625 18.425550 4.563695 4.190041 12.191761 25.308332 20.677765 14.424724
93 2018.470982 59617.086124 1.101400e+06 7.259516e+05 252166.943778 102837.663903 433216.360990 1.316452e+06 1.396911e+06 8.630178e+05 ... 0.032277 0.953326 17.612287 11.608558 4.032355 1.644458 6.927483 21.051146 22.337751 13.800359
83 151715.557862 21772.374976 2.854770e+04 5.795888e+05 233522.964403 47044.063669 139710.930613 5.938382e+05 1.944266e+06 5.610540e+05 ... 3.527399 0.506209 0.663736 13.475487 5.429428 1.093778 3.248291 13.806786 45.204334 13.044551

5 rows × 21 columns

These results are quite logical; rural departments have a larger share of their emissions from agriculture, while urban departments have higher emissions from the tertiary sector, which is related to the higher density of these areas.

With these statistics, we progress in understanding our dataset and, consequently, the nature of CO2 emissions in France. Descriptive statistics by group help us better grasp the spatial heterogeneity of our phenomenon.

However, we remain limited in our ability to interpret the obtained statistics without additional information. To give meaning and value to a statistic, it is generally necessary to relate it to other knowledge; otherwise, it remains disconnected from any context.

In the rest of this chapter, we will consider a primary approach which is the merging of complementary data. This process is called data enrichment. These data can be observations at the same level as the original source. For example, one of the most common merges is associating a customer database with a purchase database to relate purchasing behavior to characteristics that may explain it. Data merges can also occur at different conceptual levels, generally at a more aggregated level to contextualize finer data and compare an observation to measures within a similar group. For instance, we can associate individual travel times and modes with those of the same age group or people living in the same municipality to compare the differences between certain individuals and a similar sociodemographic group.

4 Restructuring datasets

4.1 Principle

When we have multiple pieces of information for the same individual or group, we generally find two types of data structures:

  • Wide format: the data contains repeated observations for the same individual (or group) in different columns.
  • Long format: the data contains repeated observations for the same individual in different rows, with a column distinguishing the observation levels.

An example of the distinction between the two can be taken from Hadley Wickham’s reference book, R for Data Science:

Wide and Long Data Formats (Source: R for Data Science)

The following cheat sheet will help remember the functions to apply if needed:

Switching from a wide format to a long format (or vice versa) can be extremely practical because certain functions are more suitable for one form of data than the other.

Generally, with Python as with R, long formats are often preferable. Wide formats are rather designed for spreadsheets like Excel, where we have a limited number of rows to create pivot tables from.
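To fix ideas, here is a minimal sketch of both reshaping directions on a toy DataFrame (the column names are invented for the example; melt and pivot_table are the standard Pandas tools for this):

import pandas as pd

wide = pd.DataFrame(
    {"commune": ["A", "B"], "Agriculture": [10, 20], "Routier": [5, 15]}
)

# Wide -> long: one row per (commune, sector) pair
long_df = wide.melt(id_vars="commune", var_name="secteur", value_name="emissions")

# Long -> wide: back to one column per sector
wide_again = long_df.pivot_table(
    index="commune", columns="secteur", values="emissions"
).reset_index()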

4.2 Application

The ADEME data, and the Insee data as well, are in the wide format. The next exercise illustrates the benefit of converting from wide to long format before creating a plot with the plot method seen in the previous chapter.

Exercise 2: Restructuring Data: From Wide to Long
  1. Create a copy of the ADEME data by doing df_wide = emissions.copy()

  2. Restructure the data into the long format to have emission data by sector while keeping the commune as the level of analysis (pay attention to other identifying variables).

  3. Sum the emissions by sector and represent it graphically.

  4. For each department, identify the most polluting sector.

5 Joining data

5.1 Principle

Here we will focus on the most favorable case, which is the situation where information allows for an exact match between two databases[^fuzzy]. It is a daily necessity for data scientists to merge information present in multiple files. For example, in business databases, customer information (address, age, etc.) will be in one file, sales in another, and product characteristics in a third file. To have a complete base that compares all this information, it will be necessary to join these three files based on common information.

This practice stems from the fact that many information systems take the form of a star schema:

Illustration of the star schema (Source: Databricks)

This structuring of information is closely related to the model of relational tables from the 1980s. Today, there are more flexible data models where information is stacked in a data lake without an a priori structure. Nevertheless, this star schema model retains relevance because it allows sharing information only with those who need it, leaving it to those who need to link data to do so.

Since the logic of the star schema historically comes from relational databases, it is natural that it is an approach intrinsically linked to the philosophy of SQL, even in the vocabulary. The term “data join” is often used, inherited from the SQL JOIN term, and the way to describe joins (left join, right join…) comes directly from the associated SQL instructions.

We generally speak of left and right datasets to illustrate joins:

Joins

5.2 Implementation with Pandas

In Pandas, the most practical method to join datasets based on common characteristics is merge. Its main arguments allow for controlling the join behavior. We will explore them visually.

In our case, for constructing statistics on carbon emissions, the left base will be the emissions DataFrame, and the right base will be the filosofi DataFrame:

emissions.head(2)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01
filosofi.head(2)
CODGEO LIBGEO NBMENFISC16 NBPERSMENFISC16 MED16 PIMP16 TP6016 TP60AGE116 TP60AGE216 TP60AGE316 ... PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16 dep
0 01001 L'Abergement-Clémenciat 313.0 795.5 22679.000000 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 01
1 01002 L'Abergement-de-Varey 101.0 248.0 24382.083333 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 01

2 rows × 30 columns

The join keys are the variable(s) needed to merge two datasets. They are the variables common to both. They do not need to have the same name on both sides, but they must share common values; otherwise, the intersection of the two datasets is the empty set.

We can manipulate two dimensions in the join (this will be clearer later with graphical examples):

  • There are mainly three types of merges: left join, right join, or a combination of the two, depending on the type of pivot we want to implement.
  • Then, there are two ways to merge the values once we have chosen a pivot: inner or outer join. In the first case, we only keep the observations where the join keys are present in both datasets; in the second, we keep all observations of the pivot key variables, even if the second dataset does not have such observations, resulting in missing values.

In the examples below, we will use the commune codes and departments as join keys. Using the department is not necessary since it is directly deduced from the commune code, but it helps illustrate the principle of joins on multiple variables. Note that the name of the commune is intentionally set aside for joins, even though it is common information to both datasets. However, as it is a textual field, which may follow different formatting norms in the two datasets, it is not reliable for an exact join.
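To see the formatting issue concretely, we can compare the name of the same commune in the two sources (the values below are those visible in the extracts shown earlier):

# Same commune, two different spellings: the name is not a reliable join key
emissions.loc[emissions["INSEE commune"] == "01001", "Commune"].iloc[0]
# "L'ABERGEMENT-CLEMENCIAT"
filosofi.loc[filosofi["CODGEO"] == "01001", "LIBGEO"].iloc[0]
# "L'Abergement-Clémenciat"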

To illustrate the principle of the left or right pivot, we will create two identifier variables for the row in our left and right datasets. This will allow us to easily find rows present in one dataset but not in the other.

emissions = emissions.reset_index(names=["id_left"])
filosofi = filosofi.reset_index(names=["id_right"])

5.2.1 Left join

Let’s start with the left join. As its name indicates, we will take the left variable as the pivot:

left_merged = emissions.merge(
    filosofi, left_on=["INSEE commune", "dep"], right_on=["CODGEO", "dep"], how="left"
)
left_merged.head(3)
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
0 0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 ... 27.0 8.8 6.9 2.8 2.1 2.0 -15.7 10457.083333 33880.555556 3.239962

3 rows × 44 columns

It is recommended to always explicitly specify the join keys using the left_on, right_on, or on arguments if the variable names are common between the two datasets. If there are common variable names between the datasets that are not defined as join keys, they will not be used for the join but will be retained with a suffix that defaults to _x and _y (configurable using the suffixes argument).
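A toy illustration of this suffix behaviour, with invented column names and values:

import pandas as pd

left = pd.DataFrame({"code": ["A", "B"], "population": [10, 20]})
right = pd.DataFrame({"code": ["A", "B"], "population": [11, 19]})

# "population" is not a join key, so it is kept twice with the default suffixes
left.merge(right, on="code")
#   code  population_x  population_y

# The suffixes argument lets us control those names
left.merge(right, on="code", suffixes=("_left", "_right"))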

The Pandas syntax is directly inspired by SQL, so the left join above translates fairly transparently into SQL:

SELECT *
FROM emissions
LEFT JOIN filosofi
  ON emissions.`INSEE commune` = filosofi.CODGEO
  AND emissions.dep = filosofi.dep;

By performing a left join, we should, in principle, have as many rows as in the left dataset:

left_merged.shape[0] == emissions.shape[0]
True

If that were not the case, it would indicate a duplicated key on the right. Thanks to our id_right variable, we can identify the rows of the left dataset whose commune code has no match on the right:

left_merged.loc[left_merged["id_right"].isna()].tail(3)
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
35348 35348 91182 COURCOURONNES 24.548795 103.360309 NaN 9623.065698 111.241872 1276.170296 3745.877636 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35360 35360 91222 ESTOUCHES 1790.002871 NaN NaN 113.797978 30.548162 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
35687 35687 95259 GADANCOURT 312.298700 NaN NaN 142.113291 11.372909 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 44 columns

This is because we are using data that do not come from the same reference year of the official geographical code (2016 vs 2018). During this interval, there were geographical changes, notably commune mergers. For example, the commune of Courcouronnes seen above is found merged with Évry in the filosofi dataset (the right dataset):

filosofi.loc[filosofi["LIBGEO"].str.lower().str.contains("courcouronnes")]
id_right CODGEO LIBGEO NBMENFISC16 NBPERSMENFISC16 MED16 PIMP16 TP6016 TP60AGE116 TP60AGE216 ... PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16 dep
34441 34441 91228 Évry-Courcouronnes 23761.0 65184.5 17107.0 47.0 26.0 30.0 26.0 ... 4.0 10.5 4.4 3.1 2.9 -15.0 9139.090909 30888.0 3.379767 91

1 rows × 31 columns

In a public statistics construction exercise, we could not afford this discrepancy in years.

5.2.2 Right join

The principle is the same, but this time it is the right base that is taken as the pivot:

right_merged = emissions.merge(
    filosofi, left_on=["INSEE commune", "dep"], right_on=["CODGEO", "dep"], how="right"
)
right_merged.head(3)
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
0 0.0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 ... 27.0 8.8 6.9 2.8 2.1 2.0 -15.7 10457.083333 33880.555556 3.239962

3 rows × 44 columns

The equivalent instruction in SQL would be:

SELECT *
FROM filosofi
RIGHT JOIN emissions
  ON filosofi.CODGEO = emissions.`INSEE commune`
  AND filosofi.dep = emissions.dep;

We can, as before, check the consistency of the dimensions:

right_merged.shape[0] == filosofi.shape[0]
True

To check the number of rows in the Filosofi data that we do not have in our greenhouse gas emissions dataset, we can do:

right_merged["id_left"].isna().sum()
61

It’s a small number. What are these observations?

right_merged.loc[
    right_merged["id_left"].isna(),
    filosofi.columns.tolist() + emissions.columns.tolist(),
]
id_right CODGEO LIBGEO NBMENFISC16 NBPERSMENFISC16 MED16 PIMP16 TP6016 TP60AGE116 TP60AGE216 ... Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep
4346 4346 13055 Marseille 362971.0 815045.5 18248.000000 47.0 26.0 34.0 30.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 13
27115 27115 69123 Lyon 225617.0 459305.5 22806.000000 60.0 15.0 21.0 13.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 69
29288 29288 75056 Paris 1027657.0 2074629.5 26808.000000 69.0 16.0 17.0 13.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 75
34874 34874 97201 L'Ajoupa-Bouillon 688.0 1634.0 14116.833333 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
34875 34875 97202 Les Anses-d'Arlet 1390.0 3274.5 14645.277778 25.0 37.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34927 34927 97420 Sainte-Suzanne 7827.0 23118.0 15110.666667 26.0 38.0 53.0 39.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
34928 34928 97421 Salazie 2358.0 6967.0 11280.937500 12.0 58.0 NaN 63.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
34929 34929 97422 Le Tampon 27916.0 73468.5 14243.809524 25.0 41.0 56.0 45.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
34930 34930 97423 Les Trois-Bassins 2386.0 7105.0 14031.785714 23.0 42.0 NaN 42.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97
34931 34931 97424 Cilaos 2038.0 5603.0 12034.375000 15.0 53.0 NaN 60.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 97

61 rows × 45 columns

It is surprising to see that Paris, Lyon, and Marseille are present in the communal statistics dataset but not in the emissions dataset. To understand why, let’s search in our emissions data for observations related to Marseille:

emissions.loc[emissions["Commune"].str.lower().str.contains("marseille")]
This is because the Ademe emissions dataset provides information on districts in the three largest cities, whereas the Insee dataset does not have this breakdown.

5.2.3 Inner join

This keeps only the observations whose keys are found in both tables, i.e. the intersection of the two.

inner_merged = emissions.merge(
    filosofi, left_on=["INSEE commune", "dep"], right_on=["CODGEO", "dep"], how="inner"
)
inner_merged.head(3)
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
0 0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 ... 27.0 8.8 6.9 2.8 2.1 2.0 -15.7 10457.083333 33880.555556 3.239962

3 rows × 44 columns

In SQL, this would be:

SELECT *
FROM emissions
INNER JOIN filosofi
  ON emissions.`INSEE commune` = filosofi.CODGEO
  AND emissions.dep = filosofi.dep;

The number of rows in our dataset can be compared to the left and right datasets:

inner_merged.shape[0] == (left_merged.shape[0] - left_merged["id_right"].isna().sum())
True
inner_merged.shape[0] == (right_merged.shape[0] - right_merged["id_left"].isna().sum())
True

5.2.4 Full join

The full (outer) join keeps all observations: it behaves like a left join, then adds the right-hand observations that found no match on the left.

full_merged = emissions.merge(
    filosofi, left_on=["INSEE commune", "dep"], right_on=["CODGEO", "dep"], how="outer"
)
full_merged.head(3)
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie ... PPEN16 PPAT16 PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16
0 0.0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1.0 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 ... 27.0 8.8 6.9 2.8 2.1 2.0 -15.7 10457.083333 33880.555556 3.239962

3 rows × 44 columns

As usual, the translation to SQL is almost immediate:

SELECT *
FROM emissions
FULL OUTER JOIN filosofi
  ON emissions.`INSEE commune` = filosofi.CODGEO
  AND emissions.dep = filosofi.dep;

This time, the result combines three sets of observations:

  • The observations matched in both datasets (the inner join);
  • The left-hand observations with no matching key on the right;
  • The right-hand observations with no matching key on the left.

We can verify this by comparing the counts of unmatched identifiers:
(full_merged["id_left"].isna().sum() + full_merged["id_right"].isna().sum()) == (
    left_merged["id_right"].isna().sum() + right_merged["id_left"].isna().sum()
)
True

5.2.5 In summary
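As a recap, one quick sketch is to compare the number of rows produced by each value of the how argument on our data:

for how in ["left", "right", "inner", "outer"]:
    merged = emissions.merge(
        filosofi,
        left_on=["INSEE commune", "dep"],
        right_on=["CODGEO", "dep"],
        how=how,
    )
    print(f"{how:>6} join: {merged.shape[0]} rows")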

5.3 Examples of identifiers in French data

5.3.1 The Official Geographic Code (COG): The identifier for geographic data

For geographic data, there are many identifiers depending on the study problem. Among the main needs is the ability to match geographic data using a common administrative identifier. For example, associating two datasets at the municipal level.

For this, the reference identifier is the Insee code, derived from the Official Geographic Code (COG), which we have been using since the last chapter and will extensively use throughout the different chapters of this course. Given that the administrative geography is constantly evolving, the Insee code database is a living base. The Insee website and APIs provide access to the post-war historical data for long-term geographic analysis.

Postal codes cannot be considered as an identifier: they can group several municipalities or, conversely, one municipality can have several postal codes. It is a system managed by La Poste that was not designed for statistical analysis.

To illustrate the problem, from the data provided by La Poste, we can see that postal code 11420 corresponds to 11 municipalities:

codes_postaux = pd.read_csv(
    "https://datanova.laposte.fr/data-fair/api/v1/datasets/laposte-hexasmal/raw",
    sep=";",
    encoding="latin1",
    dtype={"Code_postal": "str", "#Code_commune_INSEE": "str"},
)
codes_postaux.loc[codes_postaux["Code_postal"] == "11420"]
#Code_commune_INSEE Nom_de_la_commune Code_postal Libellé_d_acheminement Ligne_5
3921 11033 BELPECH 11420 BELPECH NaN
3944 11057 CAHUZAC 11420 CAHUZAC NaN
4080 11184 LAFAGE 11420 LAFAGE NaN
4124 11226 MAYREVILLE 11420 MAYREVILLE NaN
4134 11236 MOLANDIER 11420 MOLANDIER NaN
4176 11277 PECHARIC ET LE PY 11420 PECHARIC ET LE PY NaN
4177 11278 PECH LUNA 11420 PECH LUNA NaN
4182 11283 PEYREFITTE SUR L HERS 11420 PEYREFITTE SUR L HERS NaN
4189 11290 PLAIGNE 11420 PLAIGNE NaN
4264 11365 ST SERNIN 11420 ST SERNIN NaN
4317 11419 VILLAUTOU 11420 VILLAUTOU NaN

Anticipating on the skills developed in the upcoming chapters, we can represent the problem cartographically by taking the example of the Aude department. The code to produce the map of commune codes is given as is, not developed, as it uses concepts and libraries that will be presented in the next chapter:

from cartiflette import carti_download

# Download the official contours of Aude produced by IGN using the cartiflette library
shp_communes = carti_download(
    values=["11"],
    crs=4326,
    borders="COMMUNE",
    simplification=50,
    filter_by="DEPARTEMENT",
    source="EXPRESS-COG-CARTO-TERRITOIRE",
    year=2022,
)

# Join with the postal code file using the commune code common to both sources
codes_postaux11 = shp_communes.merge(
    codes_postaux, left_on="INSEE_COM", right_on="#Code_commune_INSEE"
)
# Aggregate the geometries at the postal code level
codes_postaux11 = codes_postaux11.dissolve(by="Code_postal")

# Create a map from our two layers
ax = shp_communes.plot(color="white", edgecolor="blue", linewidth=0.5)
ax = codes_postaux11.plot(ax=ax, color="none", edgecolor="black")
ax.set_axis_off()

Geography of postal codes and municipalities in Aude (11)

5.3.2 Sirene: the identifier in business data

To connect French business microdata, there is a unique identification number: the Siren number. It is an identification number in a legal business directory essential for all legal, fiscal, and other procedures. For companies that have multiple establishments—for example, in several cities—there is a derived identifier called the Siret: the 9 digits of the Siren number are followed by 5 establishment identification digits. Moreover, public administrations are also concerned with the Siren number: being involved in market operations (purchasing equipment, renting goods, etc.), they also have a Siren identifier. As they are registered in legal directories whose information is public, the Siren numbers and the associated company names are available in open data, for example, on annuaire-entreprises.data.gouv.fr/ for occasional searches, or on data.gouv.fr.
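As a quick sketch of this structure (the Siret below is the first one visible in the extract further down):

siret = "02405037900023"   # Siret of an establishment
siren = siret[:9]          # the first 9 digits identify the legal unit (Siren)
etablissement = siret[9:]  # the last 5 digits identify the establishment
print(siren, etablissement)  # 024050379 00023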

This Sirene database is a treasure trove of information, sometimes amusing, about French companies. For example, the site tif.hair/ cataloged the proportion of hair salons with puns in their names. When an entrepreneur declares the creation of a business, they receive a Siren number and an activity code (the APE code) related to the description of their business activity. This code allows the classification of a business activity in the French Classification of Activities (NAF), which will be used by Insee for the publication of sectoral statistics. In the case of hairdressers, the code in the NAF is 96.02A. From the open data available, it is possible, in a few lines of Python, to get the list of all hairdressers and then explore this data (the subject of the next optional exercise).

The following optional exercise suggests replicating, in a simplified manner, the survey done by tif.hair/ on puns in hair salon names. It allows practicing some text manipulation methods, ahead of the chapter dedicated to regular expressions.

Since the dataset of all companies is quite large (around 4GB in CSV after decompression), it is more practical to use a dataset in Parquet format, which is more optimized (more details on this format in the advanced chapter dedicated to it).

To read this type of file optimally, it is recommended to use the DuckDB library, which allows consuming only the necessary data instead of downloading the entire file to read only a part of it as would be the case with Pandas (see the end of this chapter, section “Beyond Pandas”). The following SQL query translates into natural language as: “From the Parquet file, I only want a few columns of the file for hairdressers (APE: 96.02A) whose business name (denominationUsuelleEtablissement) is provided”:

import duckdb

coiffeurs = duckdb.sql(
    """
  SELECT
    siren, siret, dateDebut, enseigne1Etablissement, activitePrincipaleEtablissement, denominationUsuelleEtablissement
  FROM
    read_parquet('https://minio.lab.sspcloud.fr/lgaliana/data/sirene2024.parquet')
  WHERE
    activitePrincipaleEtablissement == '96.02A'
    AND
    denominationUsuelleEtablissement IS NOT NULL
"""
)
coiffeurs = coiffeurs.df()  # Convert the DuckDB result into a Pandas DataFrame
coiffeurs.head(3)
siren siret dateDebut enseigne1Etablissement activitePrincipaleEtablissement denominationUsuelleEtablissement
0 024050379 02405037900023 2016-01-14 None 96.02A SOPHA COIFFURE
1 024076481 02407648100019 2017-07-17 None 96.02A SAF COIFFURE
2 047142872 04714287200036 2011-07-31 None 96.02A JENNY
Optional Exercise: Punny Hairdressers

In this exercise, we will consider only the variable denominationUsuelleEtablissement.

  1. In this dataset, [ND] is a code for missing value. Since Python has no reason to know this a priori and therefore didn’t interpret these values as missing, use the replace method to replace [ND] with an empty text field. Also, recode missing values as an empty text field to avoid future errors related to the inability to apply certain text methods to missing values.
  2. Search for all occurrences where the term tif appears, paying attention to the capitalization of the variable. Look at some observations.
  3. Using this example, normalize the names of the salons by removing special characters and count the most frequent puns.

With question 2, we find a list of quite imaginative puns based on the term tif.

Of course, to go further, it would be better to normalize the data more thoroughly, check that the information sought is not spread across multiple columns, and conduct visual inspections to detect hidden puns. But already, in just a few minutes, we have partial statistics on the phenomenon of punny hairdressers.

5.3.3 The NIR and the issue of individual identifiers’ confidentiality

For individuals, there exists a unique identifier that allows linking them across different data sources: the NIR, also known as the INSEE number or social security number. This number is necessary for the administration to manage social benefits (health, retirement, family…). Beyond this function, which can be useful daily, this number is a unique individual identifier in the National Register of Physical Persons (RNIPP).

This identifier is mainly present in management databases related to payroll, social benefits, etc. However, unlike the Sirene number, it contains several sensitive pieces of information and is inherently linked to the sensitive issue of social security rights.

Social security number (Source: Améli)

To address this problem, the non-significant statistical code (CSNS) or hashed NIR, a non-identifying anonymous individual identifier, was recently implemented. The goal of this anonymized identifier is to reduce the dissemination of personal information that, although allowing civil servants and researchers to deterministically link numerous databases, provided analysts with non-essential information about the individuals in question.


5.4 Application

5.4.1 Why do we need a commune code when we already have its name?

This exercise will take a step back to understand why we assumed above that the commune code was the key for joining data.

Exercise 3: Verification of Join Keys

Let’s start by checking the dimensions of the DataFrames and the structure of some key variables. In this case, the fundamental variables for linking our data are the communal variables. Here, we have two geographical variables: a commune code and a commune name.

  1. Check the dimensions of the DataFrames.

  2. Identify in filosofi the commune names that correspond to several commune codes and select those codes. In other words, identify the LIBGEO values associated with more than one CODGEO and store them in a vector x (tip: be careful with the index of x).

We temporarily focus on observations where the label involves more than two different commune codes.

  • Question 3. Look at these observations in filosofi.

  • Question 4. To get a better view, reorder the obtained dataset alphabetically.

  • Question 5. Determine the average size (variable number of people: NBPERSMENFISC16) and some descriptive statistics of this data. Compare it to the same statistics on the data where labels and commune codes coincide.

  • Question 6. Check the major cities (more than 100,000 people) for the proportion of cities where the same name is associated with different commune codes.

  • Question 7. Check in filosofi the cities where the label is equal to Montreuil. Also, check those that contain the term ‘Saint-Denis’.

This small exercise reassures us as the duplicated labels are actually the same commune names but in different departments. So, these are not duplicated observations. We can thus rely on the commune codes, which are unique.

5.4.2 Associating different sources to compute carbon footprints

Exercise 4: Calculate the Carbon Footprint per Inhabitant

First, we will calculate the carbon footprint of each commune.

  1. Create a variable emissions that corresponds to the total emissions of a commune.

  2. Perform a left join between the emissions data and the framing data2.

  3. Calculate the carbon footprint (total emissions / population).

At this stage, we might want to move towards modeling to try to explain the determinants of the carbon footprint based on communal variables. However, for an inferential approach to be relevant, it is necessary to check some descriptive statistics beforehand.

  4. Generate a histogram of the carbon footprint per commune, in level and in log.

With a better understanding of our data, we are getting closer to inferential statistics. However, so far we have constructed univariate statistics and have not sought to understand the results by looking at the relationship with other variables. This brings us to bivariate statistics, especially correlation analysis. This work is important because any subsequent modeling will consist of refining the correlation analysis to account for cross-correlations between multiple factors. Here, we propose to do this analysis in a minimal way.

  5. Look at the correlation between the framing variables and the carbon footprint. Do some variables seem likely to influence the carbon footprint?

At the end of question 5, the correlation graph is as follows:

6 Formatting descriptive statistics tables

A Pandas DataFrame is automatically formatted when viewed from a notebook as a minimally styled HTML table. This formatting is convenient for viewing data, a necessary task for data scientists, but it doesn’t go much beyond that.

In an exploratory phase, it can be useful to have a more complete table, including minimal visualizations, to better understand the data. In the final phase of a project, when communicating about it, having an attractive visualization is advantageous. The outputs of notebooks are not a satisfactory solution for these needs and require the medium of the notebook, which can deter some users.

Fortunately, the young great_tables package allows such tables to be built programmatically, rivalling the tedious, hard-to-reproduce manual productions made in Excel. This package is a Python port of the R package gt. great_tables builds HTML tables, offering great formatting richness and excellent integration with Quarto, the reproducible publishing tool developed by RStudio.

The following exercise will propose building a table with this package, step by step.

To focus on table construction, the necessary data preparations are provided directly. We will start from this dataset:

emissions_merged.head(3)
index_x id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie ... PPSOC16 PPFAM16 PPMINI16 PPLOGT16 PIMPOT16 D116 D916 RD16 dep_y empreinte
0 0 0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 ... NaN NaN NaN NaN NaN NaN NaN NaN 01 7.196009
1 1 1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 ... NaN NaN NaN NaN NaN NaN NaN NaN 01 5.378273
2 2 2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 ... 6.9 2.8 2.1 2.0 -15.7 10457.083333 33880.555556 3.239962 01 4.446281

3 rows × 49 columns

We will construct a table of descriptive statistics by department, paralleling the carbon footprint, median income, and population of each department.

emissions_table = (
    emissions_merged.rename(
        columns={"dep_y": "dep", "NBPERSMENFISC16": "population", "MED16": "revenu"}
    )
    .groupby("dep")
    .agg(
        {"empreinte": "sum", "revenu": "median", "population": "sum"}
    )  # not really the department's median income
    .reset_index()
    .sort_values(by="empreinte")
)

In this table, we will include horizontal bars, similar to the examples shown here. This is done by directly including the HTML code in the DataFrame column.

def create_bar(
    prop_fill: float, max_width: int, height: int, color: str = "green"
) -> str:
    """Create divs to represent prop_fill as a bar."""
    width = round(max_width * prop_fill, 2)
    px_width = f"{width}px"
    return f"""\
    <div style="width: {max_width}px; background-color: lightgrey;">\
        <div style="height:{height}px;width:{px_width};background-color:{color};"></div>\
    </div>\
    """


colors = {"empreinte": "green", "revenu": "red", "population": "blue"}

for variable in ["empreinte", "revenu", "population"]:
    emissions_table[f"raw_perc_{variable}"] = (
        emissions_table[variable] / emissions_table[variable].max()
    )
    emissions_table[f"bar_{variable}"] = emissions_table[f"raw_perc_{variable}"].map(
        lambda x: create_bar(x, max_width=75, height=20, color=colors[variable])
    )

We keep only the five departments with the smallest carbon footprints and the five with the largest.

emissions_min = (
    emissions_table.head(5)
    .assign(grp="5 départements les moins pollueurs")
    .reset_index(drop=True)
)
emissions_max = (
    emissions_table.tail(5)
    .assign(grp="5 départements les plus pollueurs")
    .reset_index(drop=True)
)

emissions_table = pd.concat([emissions_min, emissions_max])

Finally, to use some practical functions for selecting columns based on patterns, we will convert the data to the Polars format.

import polars as pl

emissions_table = pl.from_pandas(emissions_table)
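For instance, Polars column selectors make it easy to pick columns by name pattern, which is convenient for passing whole groups of columns to the table-building functions (illustrative sketch):

import polars.selectors as cs

# Select all the HTML bar columns created above, based on their name prefix.
emissions_table.select(cs.starts_with("bar_"))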
Exercise 5: A Beautiful Descriptive Statistics Table (Open Exercise)

Using this base table

GT(emissions_table, groupname_col="grp", rowname_col="dep")

construct a table in the style of the one below

Carbon Footprint
Initial descriptive statistics to refine

dep | Footprint          | Median Income     | Population
    | value      (%)*    | value      (%)*   | value       (%)*
5 départements les moins pollueurs
75  | 9.24       0.1%    | 30.5K    100.0%   | 2.07M      82.9%
92  | 20.72      0.2%    | 28.6K     93.6%   | 1.58M      63.3%
93  | 23.17      0.3%    | 17.3K     56.6%   | 1.59M      63.4%
94  | 32.57      0.4%    | 22.4K     73.5%   | 1.35M      53.8%
06  | 170.62     1.9%    | 21.3K     69.7%   | 1.11M      44.2%
5 départements les plus pollueurs
70  | 6,040.05  66.0%    | 20.0K     65.7%   | 225.88K     9.0%
76  | 6,108.32  66.7%    | 21.0K     69.0%   | 1.21M      48.3%
88  | 6,156.49  67.3%    | 19.9K     65.3%   | 353.37K    14.1%
62  | 6,292.99  68.7%    | 19.7K     64.5%   | 1.44M      57.5%
80  | 9,154.10 100.0%    | 20.0K     65.7%   | 535.82K    21.4%

*Note: The median income presented here is an approximation of the department's median income.
Reading: The (%) columns above are scaled to the maximum value of each variable.
Source: Calculations based on Ademe data
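As a starting point for the exercise, here is a minimal sketch using a few great_tables methods (method names taken from the great_tables API; the formatting choices are illustrative, and the bars, column labels and percentage columns would still need to be wired up to match the target table):

from great_tables import GT

table_gt = (
    GT(emissions_table, groupname_col="grp", rowname_col="dep")
    .tab_header(
        title="Carbon Footprint",
        subtitle="Initial descriptive statistics to refine",
    )
    .tab_spanner(label="Footprint", columns=["empreinte", "bar_empreinte"])
    .tab_spanner(label="Median Income", columns=["revenu", "bar_revenu"])
    .tab_spanner(label="Population", columns=["population", "bar_population"])
    .fmt_number(columns=["empreinte"], decimals=2)
    .fmt_number(columns=["revenu", "population"], decimals=0)
    .cols_hide(columns=["raw_perc_empreinte", "raw_perc_revenu", "raw_perc_population"])
    .tab_source_note(source_note="Source: calculations based on Ademe data")
)
table_gt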

This already tells us that our definition of the carbon footprint is certainly flawed. It seems unlikely that the inhabitants of department 77 (Seine-et-Marne) have a carbon footprint 500 times greater than that of intra-muros Paris. The main reason? We are not measuring consumption-based emissions but production-based emissions, which penalizes industrial areas or areas hosting airports…

To learn more about building tables with great_tables, you can replicate this exercise on producing electoral tables, which I proposed for an R course using gt, the R equivalent of great_tables.

7 Pandas: towards practice and beyond

7.1 Pandas in a chain of operations

Generally, in a project, data cleaning consists of a series of methods applied to a DataFrame (or to a Series when working exclusively on a single column). In other words, what we usually expect when working with Pandas is a chain of operations that takes a DataFrame as input and returns that same DataFrame, enriched, or an aggregated version of it.

This way of proceeding is at the heart of the dplyr syntax in R, but it is not always natural in Pandas, depending on the operations you want to implement. Indeed, the usual way to update a DataFrame in Pandas often looks like this:

import numpy as np
import pandas as pd

data = [[8000, 1000], [9500, np.nan], [5000, 2000]]
df = pd.DataFrame(data, columns=["salaire", "autre_info"])
df["salaire_net"] = df["salaire"] * 0.8

In SQL, you could create the new column directly within the query:

SELECT *, salaire*0.8 AS salaire_net FROM df

The tidyverse ecosystem in R, the equivalent of Pandas, follows the same logic as this SQL query. With dplyr, you would write:

df %>% mutate(salaire_net = salaire*0.8) 

Technically, you could do this with an assign in Pandas:

df = df.drop("salaire_net", axis="columns")  # drop the column to start again from the initial example
df = df.assign(salaire_net=lambda s: s["salaire"] * 0.8)

However, this assign syntax is not very natural: it requires passing a lambda function that receives the whole DataFrame where you would simply want to refer to a column. It is therefore neither very readable nor very practical.

It is nevertheless possible to chain operations on datasets using pipes. These follow the same philosophy as dplyr, itself inspired by the Unix pipe. This approach makes the code more readable by defining functions that perform operations on one or more columns of a DataFrame: the first argument of the function is the DataFrame, and the other arguments control its behavior:

def calcul_salaire_net(df: pd.DataFrame, col: str, taux: float = 0.8):
    df["salaire_net"] = df[col] * taux
    return df

This transforms our production chain into:

(df.pipe(calcul_salaire_net, "salaire"))
salaire autre_info salaire_net
0 8000 1000.0 6400.0
1 9500 NaN 7600.0
2 5000 2000.0 4000.0

7.2 Some limitations regarding Pandas syntax

There is a before and after Pandas in data analysis with Python. Without this incredibly practical package, Python, despite all its strengths, would have struggled to establish itself in the data analysis landscape. However, while Pandas offers a coherent syntax in many aspects, it is not perfect either. More recent data analysis paradigms in Python sometimes aim to correct these syntactical imperfections.

Among the most annoying points in everyday use is the need to regularly call reset_index when building descriptive statistics. Indeed, it can be dangerous to keep poorly controlled indices: if we are not careful during merge phases, Pandas may use them to align or join data, leading to surprises (see the toy illustration below).
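To make this concrete, here is a toy illustration (not taken from the original analysis) of how silent index alignment can undo an operation you thought you had performed:

# Assigning a Series back to a DataFrame aligns on the index, not on position:
# the "sorted" column ends up identical to the original one.
df_demo = pd.DataFrame({"dep": ["01", "02", "03"], "empreinte": [10.0, 2.0, 5.0]})
empreinte_sorted = df_demo["empreinte"].sort_values()
df_demo["empreinte_sorted"] = empreinte_sorted  # silently re-aligned on the index
df_demo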

Pandas is extremely well-designed for restructuring data from long to wide format or vice versa. However, this is not the only way to restructure a dataset that we might want to implement. It often happens that we want to compare the value of an observation to that of a group to which it belongs. This is particularly useful in anomaly analysis, outlier detection, or fraud investigation. Natively, in Pandas, you need to build an aggregate statistic by group and then merge it back to the initial data using the group variable. This is somewhat tedious:

emissions_moyennes = emissions.groupby("dep").agg({"Agriculture": "mean"}).reset_index()
emissions_enrichies = emissions.merge(
    emissions_moyennes, on="dep", suffixes=["", "_moyenne_dep"]
)
emissions_enrichies["relatives"] = (
    emissions_enrichies["Agriculture"] / emissions_enrichies["Agriculture_moyenne_dep"]
)
emissions_enrichies.head()
id_left INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep emissions Agriculture_moyenne_dep relatives
0 0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01 5724.424941 1974.535382 1.879645
1 1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01 1333.811619 1974.535382 0.240730
2 2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934 01 63261.689119 1974.535382 0.252740
3 3 01005 AMBERIEUX-EN-DOMBES 1859.160954 NaN NaN 1144.429311 216.217508 94.182310 276.448534 663.683146 1756.341319 782.404357 01 6795.867439 1974.535382 0.941569
4 4 01006 AMBLEON 448.966808 NaN NaN 77.033834 48.401549 NaN NaN 43.714019 398.786800 51.681756 01 1072.584766 1974.535382 0.227378

In the tidyverse, this two-step operation could be done in a single step, which is more convenient:

emissions %>%
  group_by(dep) %>%
  mutate(relatives = Agriculture/mean(Agriculture))

This isn’t too bad, but it does make Pandas processing chains longer and therefore increases the maintenance burden to keep them running over time.

More generally, Pandas processing chains can be quite verbose because it is often necessary to redefine the DataFrame rather than just the columns. For example, to filter rows and columns, you have to:

(
    emissions.loc[
        (emissions["dep"] == "12") & (emissions["Routier"] > 500),
        ["INSEE commune", "Commune"],
    ].head(5)
)
INSEE commune Commune
4058 12001 AGEN-D'AVEYRON
4059 12002 AGUESSAC
4062 12006 ALRANCE
4063 12007 AMBEYRAC
4064 12008 ANGLARS-SAINT-FELIX

In SQL, you could simply refer to the columns in the filter:

SELECT "INSEE commune", "Commune"
FROM emissions 
WHERE dep = '12' AND Routier > 500

In the tidyverse (R), you could also do this simply:

df %>%
  filter(dep == "12", Routier > 500) %>%
  select(`INSEE commune`, `Commune`)

8 Other paradigms

These two chapters have explored in depth the richness of the Pandas ecosystem, which is indispensable in the data scientist’s toolbox. Despite all the limitations we have mentioned, and the alternative solutions we will present, Pandas remains the central package of the data ecosystem with Python. In the following chapters, we will see its native integration with the Scikit ecosystem for machine learning or the extension of Pandas to spatial data with GeoPandas.

Other technical solutions that we will discuss here may be relevant if you want to handle large volumes of data or if you want to use alternative syntaxes.

The main alternatives to Pandas are Polars, DuckDB, and Spark. There is also Dask, a library for parallelizing Pandas operations.

8.1 Polars

Polars is certainly the paradigm most directly inspired by Pandas, down to the choice of name. A first fundamental difference lies in the internals: Polars relies on the Rust implementation of Arrow, whereas Pandas relies on NumPy, which costs it some performance. This makes Polars more efficient on large volumes of data, especially since many operations are parallelized and rely on lazy evaluation, a programming principle whereby queries are optimized on their overall logic rather than executed instruction by instruction in the order they were written.
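To illustrate the lazy approach, here is a minimal sketch (not from the original text) reusing the emissions DataFrame loaded earlier: the chain only describes a query plan, and nothing is computed before collect is called.

import polars as pl

lazy_query = (
    pl.from_pandas(emissions)
    .lazy()
    .filter(pl.col("dep") == "12")
    .group_by("dep")
    .agg(pl.col("Routier").mean().alias("routier_moyen"))
)
lazy_query.collect()  # the optimized plan is only executed here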

Another strength of Polars is its more consistent syntax, which benefits from more than fifteen years of hindsight on Pandas and almost a decade on dplyr (the data manipulation package of the R tidyverse). To take the previous example, there is no longer any need to keep repeating the name of the DataFrame: within an execution chain, all subsequent references are made with respect to the initial DataFrame.

import polars as pl

emissions_polars = pl.from_pandas(emissions)
(
    emissions_polars.filter(pl.col("dep") == "12", pl.col("Routier") > 500)
    .select("INSEE commune", "Commune")
    .head(5)
)
shape: (5, 2)
INSEE commune Commune
str str
"12001" "AGEN-D'AVEYRON"
"12002" "AGUESSAC"
"12006" "ALRANCE"
"12007" "AMBEYRAC"
"12008" "ANGLARS-SAINT-FELIX"

To learn about Polars, many online resources are available, including this notebook built for the public statistics data scientists network.

8.2 DuckDB

DuckDB is the newcomer in the data analysis ecosystem, pushing the limits of data processing with Python without resorting to big data tools like Spark. DuckDB epitomizes a new approach, summed up by the motto “Big data is dead”, in which fairly large data volumes can be processed without heavyweight infrastructure.

Besides being very efficient, since DuckDB can handle data volumes larger than the RAM of the computer or server, it offers the advantage of a uniform syntax across the languages that call it (Python, R, C++, or Javascript). DuckDB favors SQL syntax for data processing, with many pre-implemented functions that simplify common transformations (on text data, dates and times, etc.).

Compared to other SQL-based systems like PostgreSQL, DuckDB is very simple to install: it is just a Python library, whereas many tools like PostgreSQL require a dedicated infrastructure.

To reuse the previous example, we can directly use the SQL code mentioned earlier.

import duckdb

duckdb.sql(
    """
  SELECT "INSEE commune", "Commune"
  FROM emissions
  WHERE dep=='12' AND Routier>500
  LIMIT 5
  """
)
┌───────────────┬─────────────────────┐
│ INSEE commune │       Commune       │
│    varchar    │       varchar       │
├───────────────┼─────────────────────┤
│ 12001         │ AGEN-D'AVEYRON      │
│ 12002         │ AGUESSAC            │
│ 12006         │ ALRANCE             │
│ 12007         │ AMBEYRAC            │
│ 12008         │ ANGLARS-SAINT-FELIX │
└───────────────┴─────────────────────┘

Here, the clause FROM emissions works because DuckDB can execute SQL directly against a Pandas object present in the Python scope. If we instead read the file directly within the query, it gets slightly more complex, but the logic remains the same.

import duckdb

duckdb.sql(
    f"""
  SELECT "INSEE commune", "Commune"
  FROM read_csv_auto("{url}")
  WHERE
    substring("INSEE commune",1,2)=='12'
    AND
    Routier>500
  LIMIT 5
  """
)
┌───────────────┬─────────────────────┐
│ INSEE commune │       Commune       │
│    varchar    │       varchar       │
├───────────────┼─────────────────────┤
│ 12001         │ AGEN-D'AVEYRON      │
│ 12002         │ AGUESSAC            │
│ 12006         │ ALRANCE             │
│ 12007         │ AMBEYRAC            │
│ 12008         │ ANGLARS-SAINT-FELIX │
└───────────────┴─────────────────────┘

The rendering of the DataFrame is slightly different from Pandas because, like Polars and many large-scale data processing systems, DuckDB relies on lazy evaluation and therefore only displays a sample of the data. DuckDB and Polars are also well integrated with each other: you can run SQL on a Polars object via DuckDB, or convert a result read with DuckDB into a Polars object to keep working with Polars functions (see the sketch below).
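For instance, a minimal sketch of this two-way integration, reusing the emissions_polars object created above (the .pl() conversion method is part of DuckDB's Python relation API):

import duckdb

# SQL executed directly on the Polars DataFrame available in the Python scope...
top_routier = duckdb.sql(
    'SELECT "INSEE commune", "Routier" FROM emissions_polars ORDER BY "Routier" DESC LIMIT 5'
)
# ...then converted back to a Polars DataFrame to keep chaining Polars expressions.
top_routier.pl()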

One of the interests of DuckDB is its excellent integration with the Parquet ecosystem, the already mentioned data format that is becoming a standard in data sharing (for example, it is the cornerstone of data sharing on the HuggingFace platform). To learn more about DuckDB and discover its usefulness for reading data from the French population census, you can check out this blog post.
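As an illustration (the file name below is hypothetical), reading a Parquet file with DuckDB is a one-liner, and thanks to projection and predicate pushdown only the required columns and row groups are actually read:

import duckdb

# 'emissions.parquet' is a hypothetical local file, used for illustration only.
duckdb.sql(
    """
    SELECT "INSEE commune", "Routier"
    FROM read_parquet('emissions.parquet')
    WHERE "Routier" > 500
    LIMIT 5
    """
)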

8.3 Spark

DuckDB has pushed back the frontier of big data, which can be defined as the volume of data that can no longer be processed on a single machine without implementing a parallelization strategy.

Nevertheless, for very large data volumes, Python is well equipped with the PySpark library. This is a Python API for Spark, a big data framework built on Scala. The paradigm is built on the idea that Python users access, through this API, clusters with many nodes that process the data in parallel: the data are read in blocks and processed in parallel across the available nodes. The Spark DataFrame API has a syntax close to the previous paradigms, with more complex engineering happening behind the scenes to handle native parallelization; a minimal sketch is given below.
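To give a flavor of the syntax, here is a minimal sketch (not from the original text) of the same filter/select as in the previous sections, written with the Spark DataFrame API and assuming a local Spark session for illustration:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# A local session only for illustration; in practice Spark runs on a cluster.
spark = SparkSession.builder.appName("emissions").getOrCreate()
emissions_spark = spark.createDataFrame(emissions)  # from the Pandas DataFrame used earlier

(
    emissions_spark
    .filter((F.col("dep") == "12") & (F.col("Routier") > 500))
    .select("INSEE commune", "Commune")
    .show(5)
)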

Additional information

Environment on which this page has been tested:

Latest built version: 2024-07-10

Python version used:

'3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.12.2
aiohttp 3.9.3
aioitertools 0.11.0
aiosignal 1.3.1
alembic 1.13.1
aniso8601 9.0.1
annotated-types 0.7.0
appdirs 1.4.4
archspec 0.2.3
astroid 3.1.0
asttokens 2.4.1
attrs 23.2.0
Babel 2.15.0
bcrypt 4.1.2
beautifulsoup4 4.12.3
black 24.4.2
blinker 1.7.0
blis 0.7.11
bokeh 3.4.0
boltons 23.1.1
boto3 1.34.51
botocore 1.34.51
branca 0.7.1
Brotli 1.1.0
cachetools 5.3.3
cartiflette 0.0.2
Cartopy 0.23.0
catalogue 2.0.10
cattrs 23.2.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.18.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.3.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
confection 0.1.5
contextily 1.6.0
contourpy 1.2.1
cryptography 42.0.5
cycler 0.12.1
cymem 2.0.8
cytoolz 0.12.3
dask 2024.4.1
dask-expr 1.0.10
debugpy 1.8.1
decorator 5.1.1
dill 0.3.8
distributed 2024.4.1
distro 1.9.0
docker 7.0.0
duckdb 0.10.1
en-core-web-sm 3.7.1
entrypoints 0.4
et-xmlfile 1.1.0
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
fiona 1.9.6
flake8 7.0.0
Flask 3.0.2
folium 0.16.0
fontawesomefree 6.5.1
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2023.12.2
GDAL 3.8.4
gensim 4.3.2
geographiclib 2.0
geopandas 0.12.2
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.29.0
graphene 3.3
graphql-core 3.2.3
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.10.0
greenlet 3.0.3
gunicorn 21.2.0
htmltools 0.5.2
hvac 2.1.0
idna 3.6
imageio 2.34.2
importlib_metadata 7.1.0
importlib_resources 6.4.0
inflate64 1.0.0
ipykernel 6.29.3
ipython 8.22.2
ipywidgets 8.1.2
isort 5.13.2
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonpatch 1.33
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter-cache 1.0.0
jupyter_client 8.6.1
jupyter_core 5.7.2
jupyterlab_widgets 3.0.10
kaleido 0.2.1
kiwisolver 1.4.5
kubernetes 29.0.0
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
libmambapy 1.5.7
llvmlite 0.42.0
locket 1.0.0
lxml 5.2.2
lz4 4.3.3
Mako 1.3.2
mamba 1.5.7
mapclassify 2.6.1
marisa-trie 1.2.0
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
matplotlib-inline 0.1.6
mccabe 0.7.0
mdurl 0.1.2
menuinst 2.0.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.11.3
mlflow-skinny 2.11.3
msgpack 1.0.7
multidict 6.0.5
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.10
mypy 1.9.0
mypy-extensions 1.0.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.8.1
numba 0.59.1
numpy 1.26.4
oauthlib 3.2.2
opencv-python-headless 4.9.0.80
openpyxl 3.1.5
OWSLib 0.28.1
packaging 23.2
pandas 2.2.1
paramiko 3.4.0
parso 0.8.4
partd 1.4.1
pathspec 0.12.1
patsy 0.5.6
Pebble 5.0.7
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.3.0
pip 24.0
pkgutil_resolve_name 1.3.10
platformdirs 4.2.0
plotly 5.19.0
plotnine 0.13.6
pluggy 1.4.0
polars 0.20.31
preshed 3.0.9
prometheus_client 0.20.0
prometheus-flask-exporter 0.23.0
prompt-toolkit 3.0.42
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
py7zr 0.20.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybcj 1.0.2
pycodestyle 2.11.1
pycosat 0.6.6
pycparser 2.21
pycryptodomex 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pyflakes 3.2.0
Pygments 2.17.2
PyJWT 2.8.0
pylint 3.1.0
PyNaCl 1.5.0
pynsee 0.1.7
pyOpenSSL 24.0.0
pyparsing 3.1.2
pyppmd 1.1.0
pyproj 3.6.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.1
pyzmq 25.1.2
pyzstd 0.16.0
QtPy 2.4.1
querystring-parser 1.2.4
rasterio 1.3.10
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
requests-cache 1.2.1
requests-oauthlib 2.0.0
rich 13.7.1
rpds-py 0.18.0
rsa 4.9
Rtree 1.2.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.10.1
scikit-image 0.24.0
scikit-learn 1.4.1.post1
scipy 1.13.0
seaborn 0.13.2
setuptools 69.2.0
shapely 2.0.3
shellingham 1.5.4
six 1.16.0
smart_open 7.0.4
smmap 5.0.0
snuggs 1.4.7
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.7.5
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.29
sqlparse 0.4.4
srsly 2.4.8
stack-data 0.6.2
statsmodels 0.14.1
tabulate 0.9.0
tblib 3.0.0
tenacity 8.2.3
texttable 1.7.0
thinc 8.2.5
threadpoolctl 3.4.0
tifffile 2024.7.2
tomli 2.0.1
tomlkit 0.12.4
toolz 0.12.1
topojson 1.9
tornado 6.4
tqdm 4.66.2
traitlets 5.14.2
truststore 0.8.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2024.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.18
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.1
websocket-client 1.7.0
Werkzeug 3.0.2
wheel 0.43.0
widgetsnbextension 4.0.10
wordcloud 1.9.3
wrapt 1.16.0
xgboost 2.0.3
xlrd 2.0.1
xyzservices 2024.4.0
yarl 1.9.4
yellowbrick 1.5
zict 3.0.0
zipp 3.17.0
zstandard 0.22.0

View file history

SHA Date Author Description
fe4edc9 2024-07-10 18:37:22 Lino Galiana Second pandas chapter translated (#524)

Footnotes

  1. Ideally, it would be more coherent, for demographic data, to use the legal populations, from the census. However, this base is not yet natively integrated into the pynsee library that we will use in this chapter. An open exercise is proposed to construct population aggregates from anonymized individual census data (the detailed files).↩︎

  2. Ideally, it would be necessary to ensure that this join does not introduce bias. Indeed, since our reference years are not necessarily identical, there may be a mismatch between our two sources. Since the exercise is already long, we will not go down this path. Interested readers can perform such an analysis as an additional exercise.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.