!pip install xlrd
!pip install pynsee
!pip install great_tables
- How to construct fine aggregate statistics using Pandas methods;
- Restructure your data and join several DataFrames together;
- Create attractive tables to communicate aggregated results;
- Know the limits of Pandas and alternative packages.
1 Introduction
The introductory chapter to Pandas presented the concept of data organized in the form of a DataFrame and the practicality of the Pandas ecosystem for performing simple operations on a dataset.
It is rare to work exclusively on a raw source. A dataset generally gains value when compared to other sources. For researchers, this allows contextualizing the information present in one source by comparing or associating it with other sources. For data scientists in the private sector, it often involves linking information about the same person in multiple customer databases or comparing customers with each other.
One of the benefits of modern data science tools, especially Pandas, is the ease with which they allow restructuring sources to work on multiple datasets in a project. This chapter consolidates the principles previously seen by refining the data processing. It will mainly explore two types of operations:
- Group descriptive statistics;
- Data merging by common characteristics.
Performing this work simply, reliably, and efficiently is essential for data scientists, as this task is common. Fortunately, Pandas handles it very well with structured data. In the following chapters, and also throughout the section on text data processing, we will see how to handle less structured data.
Through this work, we will deepen our understanding of a real world phenomenon through detailed descriptive statistics. This is an essential step before moving on to inferential statistics, the approach that consists of formalizing and generalizing correlations or causal relationships between observed characteristics and a phenomenon.
- Retrieve an official dataset from Insee;
- Build group descriptive statistics and switch between data levels;
- Merge data (reshape, merge) to add value;
- Create a beautiful table to communicate descriptive statistics.
1.1 Environment
The previous chapter used almost exclusively the Pandas library. In this chapter, we will use other packages in addition to it.
As explained below, we will use a library called pynsee to retrieve Insee data useful for enriching our Ademe dataset. This library is not installed by default in Python. Before using it, it is necessary to install it, along with the great_tables library that we will see at the end of this chapter:
The instruction !pip install <pkg> is a way to tell Jupyter, the execution engine behind notebooks, that the following command (pip install <pkg>) is a system command to be executed outside of Python (in the terminal, for example, for a Linux system).
The essential packages to start this chapter are as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pynsee
import pynsee.download
To obtain reproducible results, you can set the seed of the pseudo-random number generator.
np.random.seed(123)
1.2 Data used
This tutorial continues the exploration of the dataset from the previous chapter:
- Greenhouse gas emissions estimated at the municipal level by ADEME. The dataset is available on data.gouv and can be directly queried in Python with this URL;
The issues of data enrichment (associating one source with another based on common characteristics) will be presented using two sources produced by Insee:
- The official geographic code, a reference produced by Insee used to identify municipalities with a unique code, unlike the postal code;
- The Filosofi data, a source on French income at a fine spatial scale constructed by Insee from tax returns and social benefit information. In this case, we will use income levels and populations1 at the municipal level to compare them with our emissions data.
To facilitate the import of Insee data, it is recommended to use the pynsee package, which simplifies access to the main Insee datasets available on the insee.fr website or via APIs.
The pynsee package has two main entry points:
- The Insee APIs, which will be illustrated in the dedicated chapter.
- Some datasets directly from the Insee website (insee.fr)
In this chapter, we will exclusively use the second approach, through the pynsee.download module.
The list of data available from this package is here. The download_file function expects a unique identifier to know which database to fetch and restructure from the insee.fr website.
Knowing the list of available databases
To know the list of available databases, you can use the pynsee.get_file_list() function after importing pynsee. It returns a DataFrame in which you can search, for example, using a keyword:
import pynsee
meta = pynsee.get_file_list()
meta.loc[meta['label'].str.contains(r"Filosofi.*2016")]
Here, meta['label'].str.contains(r"Filosofi.*2016") means: "Pandas, find me all labels containing the terms Filosofi and 2016" (.* means "any number of characters in between").
2 Retrieving data for this chapter
2.1 French carbon emissions dataset
As explained in the previous chapter, these data can be imported very simply with Pandas:
import pandas as pd
= "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"
url = pd.read_csv(url)
emissions 2) emissions.head(
We will already keep the names of the emitting sectors present in the database to simplify subsequent uses:
secteurs = emissions.select_dtypes(include='number').columns
Subsequent exploitations of these data will use the departmental dimension, the construction of which we demonstrated in the previous chapter:
emissions['dep'] = emissions["INSEE commune"].str[:2]
We will use the Filosofi data (income data) at the municipal level for 2016. It is not the same year as the CO2 emissions data, so it is not perfectly rigorous, but it will still illustrate the main functionalities of Pandas.
The main entry point of the pynsee.download module is the download_file function.
The code to download the data is as follows:
from pynsee.download import download_file
filosofi = download_file("FILOSOFI_COM_2016")
The resulting DataFrame looks like this:
filosofi.sample(3)
Pandas automatically handled the variable types. It does this relatively well, but a check is always useful for variables that have a specific status.
For variables that are not of type float but should be, we modify their type:
filosofi = (
    filosofi
    .astype({c: "float" for c in filosofi.columns[2:]})
)
A quick glance at the data gives a fairly precise idea of how they are organized. We notice that some variables in filosofi seem to have many missing values (statistical secrecy), while others seem complete. If we want to exploit filosofi, we need to pay attention to the chosen variable.
Our ultimate goal will be to link the information contained in these two datasets. Otherwise, we risk being frustrated: we will want to know more about carbon emissions but will be very limited in the possibilities of analysis without the additional information from filosofi.
3 Descriptive statistics by group
3.1 Principle
In the previous chapter, we saw how to obtain an aggregated statistic easily with Pandas. However, it is common to have data with relevant intermediate analysis strata: geographical variables, membership in socio-demographic groups related to recorded characteristics, temporal period indicators, etc. To better understand the structure of the data, data scientists are often led to construct descriptive statistics on sub-groups present in the data. For example, we previously constructed emission statistics at the national level. But what about the emission profiles of different departments? To answer this question, it will be useful to aggregate our data at the departmental level. This will give us information different from both the initial dataset (municipal level) and the most aggregated level (national level).
In SQL, it is very simple to segment data to perform operations on coherent blocks and recollect the results in the appropriate dimension. The underlying logic is that of split-apply-combine, which is adopted by all data manipulation languages, and Pandas is no exception.
The following image, from this site, illustrates well how the split-apply-combine approach works:
In Pandas, we use groupby to segment the data according to one or more axes (this tutorial on the subject is particularly useful). All the aggregation operations (counting, averages, etc.) that we saw earlier can be applied by group.
Technically, this operation involves creating an association between labels (values of the group variables) and observations. Using the groupby method does not trigger any operation until a statistic is requested; it simply creates a formal relationship between observations and groupings that will be used later:
"dep"] = filosofi["CODGEO"].str[:2]
filosofi['dep').__class__ filosofi.groupby(
As long as we do not call an action on a DataFrame by group, such as head or display, Pandas performs no operation. This is called lazy evaluation. For example, the result of df.groupby('dep') is a transformation that has not yet been evaluated:
filosofi.groupby('dep')
3.2 Illustration 1: counting by group
To illustrate the application of this principle to counting, we can count the number of municipalities by department in 2023 (this statistic changes every year due to municipal mergers). For this, we simply take the reference of French municipalities from the official geographical code (COG) and count by department using count:
import requests
from io import StringIO
import pandas as pd
= "https://www.insee.fr/fr/statistiques/fichier/6800675/v_commune_2023.csv"
url_cog_2023 = "https://minio.lab.sspcloud.fr/lgaliana/data/python-ENSAE/cog_2023.csv"
url_backup
# Try-except clause to avoid timout issue sometimes
# Without timeout problem, pd.read_csv(url_cog_2023) would be sufficient
try:
= requests.get(url_cog_2023)
response
response.raise_for_status()= pd.read_csv(StringIO(response.text))
cog_2023 except requests.exceptions.Timeout:
print("Failing back to backup")
= pd.read_csv(url_backup) cog_2023
With this dataset, without resorting to group statistics, we can already know how many municipalities, departments, and regions we have in France, respectively:
communes = cog_2023.loc[cog_2023['TYPECOM'] == "COM"]
communes.loc[:, ['COM', 'DEP', 'REG']].nunique()

1. We restrict to the status "Commune" because this file also contains Insee codes for other statuses, such as the "Municipal Arrondissements" of Paris, Lyon, and Marseille.
Now, let's look at the departments with the most municipalities. We use the same counting function, this time varying the group over which the statistic is computed.
Calculating this statistic is quite straightforward once you understand the principle of computing statistics with Pandas:
communes = cog_2023.loc[cog_2023['TYPECOM'] == "COM"]
communes.groupby('DEP').agg({'COM': 'nunique'})
In SQL, we would use the following query:
SELECT dep, COUNT(DISTINCT COM) AS COM
FROM communes
WHERE TYPECOM = 'COM'
GROUP BY dep;
The output is an indexed Series. This is not very convenient, as we mentioned in the previous chapter. It is more practical to transform this object into a DataFrame with reset_index. Finally, with sort_values, we obtain the desired statistic:
(
    communes
    .groupby('DEP')
    .agg({'COM': 'nunique'})
    .reset_index()
    .sort_values('COM', ascending=False)
)
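Since each row of communes is now a distinct commune, a shorter (though less explicit) way to get the same ranking is value_counts; this is only a sketch, not the chapter's canonical solution:
(
    communes['DEP']
    .value_counts()
    .reset_index()
)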
3.3 Illustration 2: aggregates by group
To illustrate aggregates by group, we can use the Insee filosofi dataset and count the population using the variable NBPERSMENFISC16.
To calculate the total for the whole of France, we can do it in two ways:
filosofi['NBPERSMENFISC16'].sum() * 1e-6
filosofi.agg({"NBPERSMENFISC16": "sum"}).div(1e6)
where the results are reported in millions of people. The logic is the same when doing group statistics; it is just a matter of replacing filosofi with filosofi.groupby('dep') to create a partitioned version of our dataset by department:
filosofi.groupby('dep')['NBPERSMENFISC16'].sum()

1. With this approach, pay attention to the order of operations: first perform the groupby, then select the column of interest.
filosofi.groupby('dep').agg({"NBPERSMENFISC16": "sum"})
The second approach is more practical because it directly gives a Pandas DataFrame and not an unnamed indexed series. From this, a few basic manipulations can suffice to produce a shareable table on departmental demographics. However, this table would be somewhat rudimentary, as we currently only have the department numbers. To get the names of the departments, we would need to use a second dataset and merge the information they have in common (in this case, the department number). This is the subject of the next part.
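As an illustration of these "few basic manipulations", here is a minimal sketch (the renaming is only for readability) turning the grouped sum into a sorted table of departmental populations, in millions:
pop_dep = (
    filosofi
    .groupby('dep')
    .agg({"NBPERSMENFISC16": "sum"})
    .div(1e6)
    .reset_index()
    .rename(columns={"NBPERSMENFISC16": "population_millions"})
    .sort_values("population_millions", ascending=False)
)
pop_dep.head()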
3.4 Application exercise
This application exercise uses the Ademe dataset named emissions discussed previously.
1. Calculate the total emissions of the “Residential” sector by department and compare the value to that of the most polluting department in this sector. Think about what this statistic actually reflects (a starting sketch is given after the hint below).
2. Calculate the total emissions for each sector in each department. For each department, calculate the proportion of total emissions coming from each sector.
Hint for this question
- “Group by” = groupby
- “Total emissions” = agg({*** : "sum"})
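A possible starting point for question 1, assuming the residential sector column is named "Résidentiel" in the ADEME data (check emissions.columns first):
# Total residential emissions by department, sorted to find the most polluting one
(
    emissions
    .groupby('dep')
    .agg({"Résidentiel": "sum"})
    .reset_index()
    .sort_values("Résidentiel", ascending=False)
    .head()
)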
In question 1, the result should be as follows:
This ranking may reflect demographics rather than the process we wish to measure. Without the addition of information on the population of each département to control for this factor, it is difficult to know whether there is a structural difference in behavior between the inhabitants of Nord (département 59) and Moselle (département 57).
At the end of question 2, let’s take the share of emissions from agriculture and the tertiary sector in departmental emissions:
These results are quite logical; rural departments have a larger share of their emissions from agriculture, while urban departments have higher emissions from the tertiary sector, which is related to the higher density of these areas.
With these statistics, we progress in understanding our dataset and, consequently, the nature of CO2 emissions in France. Descriptive statistics by group help us better grasp the spatial heterogeneity of our phenomenon.
However, we remain limited in our ability to interpret the obtained statistics without using additional information. To give meaning and value to a statistic, it is generally necessary to associate it with additional knowledge; otherwise, it remains detached.
In the rest of this chapter, we will consider a primary approach which is the merging of complementary data. This process is called data enrichment. These data can be observations at the same level as the original source. For example, one of the most common merges is associating a customer database with a purchase database to relate purchasing behavior to characteristics that may explain it. Data merges can also occur at different conceptual levels, generally at a more aggregated level to contextualize finer data and compare an observation to measures within a similar group. For instance, we can associate individual travel times and modes with those of the same age group or people living in the same municipality to compare the differences between certain individuals and a similar sociodemographic group.
4 Restructuring datasets
4.1 Principle
When we have multiple pieces of information for the same individual or group, we generally find two types of data structures:
- Wide format: the data contains repeated observations for the same individual (or group) in different columns.
- Long format: the data contains repeated observations for the same individual in different rows, with a column distinguishing the observation levels.
An example of the distinction between the two can be taken from Hadley Wickham’s reference book, R for Data Science:
The following cheat sheet will help remember the functions to apply if needed:
Switching from a wide format to a long format (or vice versa) can be extremely practical because certain functions are more suitable for one form of data than the other.
Generally, with Python as with R, long formats are often preferable. Wide formats are rather designed for spreadsheets like Excel, where we have a limited number of rows from which to create pivot tables.
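To make the distinction concrete, here is a minimal toy example (unrelated to the chapter's datasets) showing how melt goes from wide to long and pivot goes back:
import pandas as pd

wide = pd.DataFrame({
    "commune": ["A", "B"],
    "Agriculture": [10, 20],
    "Routier": [5, 7]
})

# Wide -> long: one row per (commune, sector) pair
long_df = wide.melt(id_vars="commune", var_name="secteur", value_name="emissions")

# Long -> wide: back to one column per sector
back_to_wide = long_df.pivot(index="commune", columns="secteur", values="emissions")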
4.2 Application
The ADEME data, and the Insee data as well, are in the wide format. The next exercise illustrates the benefit of converting to the long format before creating a plot with the plot method seen in the previous chapter.
1. Create a copy of the ADEME data by doing df_wide = emissions.copy().
2. Restructure the data into the long format to have emission data by sector while keeping the commune as the level of analysis (pay attention to the other identifying variables).
3. Sum the emissions by sector and represent the result graphically (a possible starting sketch is given after this list).
4. For each department, identify the most polluting sector.
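A possible starting sketch for the first three steps (the identifier columns kept in id_vars are assumptions about the dataset's structure):
df_wide = emissions.copy()

# Wide -> long: one row per commune x sector
df_long = df_wide.melt(
    id_vars=["INSEE commune", "Commune", "dep"],
    var_name="secteur",
    value_name="emissions"
)

# Total emissions by sector, as horizontal bars
(
    df_long
    .groupby("secteur")["emissions"]
    .sum()
    .sort_values()
    .plot(kind="barh")
)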
5 Joining data
5.1 Principle
Here we will focus on the most favorable case, the situation where information allows an exact match between two databases. It is a daily necessity for data scientists to merge information present in multiple files. For example, in business databases, customer information (address, age, etc.) will be in one file, sales in another, and product characteristics in a third file. To have a complete base that brings all this information together, it is necessary to join these three files based on common information.
This practice stems from the fact that many information systems take the form of a star schema:
This structuring of information is closely related to the model of relational tables from the 1980s. Today, there are more flexible data models where information is stacked in a data lake without an a priori structure. Nevertheless, this star schema model retains relevance because it allows sharing information only with those who need it, leaving it to those who need to link data to do so.
Since the logic of the star schema historically comes from relational databases, it is natural that it is an approach intrinsically linked to the philosophy of SQL, even in the vocabulary. The term “data join” is often used, inherited from the SQL JOIN term, and the way to describe joins (left join, right join…) comes directly from the associated SQL instructions.
We generally speak of left and right bases to illustrate the joins:
5.2 Implementation with Pandas
In Pandas, the most practical method for joining datasets based on common characteristics is merge. Its main arguments allow controlling the join behavior. We will explore them visually.
In our case, for constructing statistics on carbon emissions, the left base will be the emissions DataFrame and the right base will be the filosofi DataFrame:
emissions.head(2)
filosofi.head(2)
We refer to the variable(s) needed to merge the data as the join keys. These are the variables common to both datasets. They do not need to have the same name, but they must share common values; otherwise, the intersection between these two datasets is the empty set.
We can manipulate two dimensions in the join (this will be clearer later with graphical examples):
- There are mainly three types of merges: left join, right join, or a combination of the two, depending on the type of pivot we want to implement.
- Then, there are two ways to merge the values once we have chosen a pivot: inner or outer join. In the first case, we only keep the observations where the join keys are present in both datasets; in the second, we keep all observations of the pivot key variables, even if the second dataset does not have such observations, resulting in missing values.
In the examples below, we will use the commune codes and departments as join keys. Using the department is not necessary since it is directly deduced from the commune code, but it helps illustrate the principle of joins on multiple variables. Note that the name of the commune is intentionally set aside for joins, even though it is common information to both datasets. However, as it is a textual field, which may follow different formatting norms in the two datasets, it is not reliable for an exact join.
To illustrate the principle of the left or right pivot, we will create two identifier variables for the row in our left and right datasets. This will allow us to easily find rows present in one dataset but not in the other.
= emissions.reset_index(names = ['id_left'])
emissions = filosofi.reset_index(names = ['id_right']) filosofi
5.2.1 Left join
Let's start with the left join. As its name indicates, we will take the left dataset as the pivot:
left_merged = emissions.merge(
    filosofi,
    left_on=["INSEE commune", "dep"],
    right_on=["CODGEO", "dep"],
    how="left"
)
left_merged.head(3)
It is recommended to always explicitly specify the join keys using the left_on and right_on arguments, or on if the variable names are shared between the two datasets.
If there are variable names common to the two datasets that are not defined as join keys, they will not be used for the join but will be retained with a suffix, which defaults to _x and _y (configurable via the suffixes argument).
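A toy illustration of this suffix behaviour (toy data, not the chapter's datasets):
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "dep": ["01", "02"]})
right = pd.DataFrame({"id": [1, 2], "dep": ["01", "03"]})

# "dep" is not a join key here, so it comes back twice with custom suffixes
merged = left.merge(right, on="id", suffixes=("_left", "_right"))
merged.columns.tolist()  # ['id', 'dep_left', 'dep_right']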
The Pandas syntax is directly inspired by SQL, so we have a fairly transparent translation of the above instruction into SQL:
SELECT *
FROM emissions
LEFT JOIN filosofi
ON emissions.`INSEE commune` = filosofi.CODGEO
AND emissions.dep = filosofi.dep;
By performing a left join, we should, in principle, have as many rows as in the left dataset:
left_merged.shape[0] == emissions.shape[0]
Otherwise, it would indicate that there is a duplicate key on the right. Thanks to our id_right variable, we can identify the commune codes from the left dataset that have no match in the right one:
left_merged.loc[left_merged['id_right'].isna()].tail(3)
This is because we are using data that are not from the same reference year of the official geographical code (2016 vs 2018). During this interval, there were geographical changes, notably commune mergers. For example, the commune of Courcouronnes seen above can be found merged with Evry in the filosofi dataset (right base):
filosofi.loc[
    filosofi['LIBGEO']
    .str.lower()
    .str.contains("courcouronnes")
]
In a public statistics construction exercise, we could not afford this discrepancy in years.
5.2.2 Right join
The principle is the same, but this time it is the right base that is taken as the pivot:
right_merged = emissions.merge(
    filosofi,
    left_on=["INSEE commune", "dep"],
    right_on=["CODGEO", "dep"],
    how="right"
)
right_merged.head(3)
The equivalent instruction in SQL would be:
SELECT *
FROM emissions
RIGHT JOIN filosofi
ON emissions.`INSEE commune` = filosofi.CODGEO
AND emissions.dep = filosofi.dep;
We can, as before, check the consistency of the dimensions:
right_merged.shape[0] == filosofi.shape[0]
To check the number of rows in the Filosofi data that we do not have in our greenhouse gas emissions dataset, we can do:
right_merged['id_left'].isna().sum()
It’s a small number. What are these observations?
right_merged.loc[
    right_merged['id_left'].isna(),
    filosofi.columns.tolist() + emissions.columns.tolist()
]
It is surprising to see that Paris, Lyon, and Marseille are present in the communal statistics dataset but not in the emissions dataset. To understand why, let’s search in our emissions data for observations related to Marseille:
emissions.loc[
    emissions["Commune"]
    .str.lower()
    .str.contains('marseille')
]
This is because the Ademe emissions dataset provides information on districts in the three largest cities, whereas the Insee dataset does not have this breakdown.
5.2.3 Inner join
This is the dataset where the keys are found at the intersection of the two tables.
inner_merged = emissions.merge(
    filosofi,
    left_on=["INSEE commune", "dep"],
    right_on=["CODGEO", "dep"],
    how="inner"
)
inner_merged.head(3)
In SQL, this would be:
SELECT *
FROM emissions
INNER JOIN filosofi
ON emissions.`INSEE commune` = filosofi.CODGEO
AND emissions.dep = filosofi.dep;
The number of rows in our dataset can be compared to the left and right datasets:
inner_merged.shape[0] == (
    left_merged.shape[0] - left_merged['id_right'].isna().sum()
)
inner_merged.shape[0] == (
    right_merged.shape[0] - right_merged['id_left'].isna().sum()
)
5.2.4 Full join
The full join is a pivot to the left and then to the right for the information that was not found.
full_merged = emissions.merge(
    filosofi,
    left_on=["INSEE commune", "dep"],
    right_on=["CODGEO", "dep"],
    how="outer"
)
full_merged.head(3)
As usual, the translation to SQL is almost immediate:
SELECT *
FROM emissions
FULL OUTER JOIN filosofi
ON emissions.`INSEE commune` = filosofi.CODGEO
AND emissions.dep = filosofi.dep;
This time, we have a combination of our three initial datasets:
- The inner join;
- The left join on observations without the right key;
- The right join on observations without the left key;
(
    full_merged['id_left'].isna().sum() + full_merged['id_right'].isna().sum()
) == (
    left_merged['id_right'].isna().sum() + right_merged['id_left'].isna().sum()
)
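As a side note, the indicator argument of merge provides the same diagnosis in one pass, flagging each row as coming from the left only, the right only, or both; a minimal sketch:
full_merged_check = emissions.merge(
    filosofi,
    left_on=["INSEE commune", "dep"],
    right_on=["CODGEO", "dep"],
    how="outer",
    indicator=True
)
full_merged_check["_merge"].value_counts()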
5.2.5 In summary
5.3 Examples of identifiers in French data
5.3.1 The Official Geographic Code (COG): The identifier for geographic data
For geographic data, there are many identifiers depending on the study problem. Among the main needs is the ability to match geographic data using a common administrative identifier. For example, associating two datasets at the municipal level.
For this, the reference identifier is the Insee code, derived from the Official Geographic Code (COG), which we have been using since the last chapter and will extensively use throughout the different chapters of this course. Given that the administrative geography is constantly evolving, the Insee code database is a living base. The Insee website and APIs provide access to the post-war historical data for long-term geographic analysis.
Postal codes cannot be considered as an identifier: they can group several municipalities or, conversely, one municipality can have several postal codes. It is a system managed by La Poste that was not designed for statistical analysis.
To illustrate the problem, from the data provided by La Poste, we can see that postal code 11420 corresponds to 11 municipalities:
codes_postaux = pd.read_csv(
    "https://datanova.laposte.fr/data-fair/api/v1/datasets/laposte-hexasmal/raw",
    sep=";", encoding="latin1",
    dtype={"Code_postal": "str", "#Code_commune_INSEE": "str"}
)
codes_postaux.loc[codes_postaux['Code_postal'] == "11420"]
Anticipating the skills developed in the upcoming chapters, we can represent the problem cartographically using the example of the Aude department. The code producing this map is given as is, without further development, as it uses concepts and libraries that will be presented in the next chapter:
from cartiflette import carti_download

# (1) Download the official contours of Aude (department 11)
shp_communes = carti_download(
    values=["11"],
    crs=4326,
    borders="COMMUNE",
    simplification=50,
    filter_by="DEPARTEMENT",
    source="EXPRESS-COG-CARTO-TERRITOIRE",
    year=2022
)

# (2) Join postal codes to communes using the commune code
codes_postaux11 = shp_communes.merge(
    codes_postaux,
    left_on="INSEE_COM",
    right_on="#Code_commune_INSEE"
)
# (3) Aggregate the geometry at the postal code level
codes_postaux11 = codes_postaux11.dissolve(by="Code_postal")

# (4) Map the two layers
ax = shp_communes.plot(color='white', edgecolor='blue', linewidth=0.5)
ax = codes_postaux11.plot(ax=ax, color='none', edgecolor='black')
ax.set_axis_off()
1. Downloading the official contours of Aude produced by IGN using the cartiflette library
2. Joining the two data sources using the commune code
3. Aggregating the geometry at the postal code level
4. Creating a map from our two layers
5.3.2 Sirene: the identifier in business data
To connect French business microdata, there is a unique identification number: the Siren number. It is an identification number in a legal business directory essential for all legal, fiscal, and other procedures. For companies that have multiple establishments—for example, in several cities—there is a derived identifier called the Siret: the 9 digits of the Siren number are followed by 5 establishment identification digits. Moreover, public administrations are also concerned with the Siren number: being involved in market operations (purchasing equipment, renting goods, etc.), they also have a Siren identifier. As they are registered in legal directories whose information is public, the Siren numbers and the associated company names are available in open data, for example, on annuaire-entreprises.data.gouv.fr/ for occasional searches, or on data.gouv.fr.
This Sirene database is a treasure trove of information, sometimes amusing, about French companies. For example, the site tif.hair/ cataloged the proportion of hair salons with puns in their names. When an entrepreneur declares the creation of a business, they receive a Siren number and an activity code (the APE code) related to the description of their business activity. This code allows the classification of a business activity in the French Classification of Activities (NAF), which will be used by Insee for the publication of sectoral statistics. In the case of hairdressers, the code in the NAF is 96.02A. From the open data available, it is possible, in a few lines of Python, to get the list of all hairdressers and then explore this data (the subject of the next optional exercise).
The following optional exercise suggests replicating, in a simplified manner, the survey done by tif.hair/ on puns in hair salon names. It allows practicing some text manipulation methods, ahead of the chapter dedicated to regular expressions.
Since the dataset of all companies is quite large (around 4GB in CSV after decompression), it is more practical to use a dataset in Parquet format, which is more optimized (more details on this format in the advanced chapter dedicated to it).
To read this type of file optimally, it is recommended to use the DuckDB library, which allows consuming only the necessary data instead of downloading the entire file to read only a part of it, as would be the case with Pandas (see the end of this chapter, section "Beyond Pandas"). The following SQL query translates into natural language as: "From the Parquet file, I only want a few columns of the file for hairdressers (APE: 96.02A) whose business name (denominationUsuelleEtablissement) is provided":
import duckdb
coiffeurs = duckdb.sql("""
SELECT
  siren, siret, dateDebut, enseigne1Etablissement, activitePrincipaleEtablissement, denominationUsuelleEtablissement
FROM
  read_parquet('https://minio.lab.sspcloud.fr/lgaliana/data/sirene2024.parquet')
WHERE
  activitePrincipaleEtablissement == '96.02A'
  AND
  denominationUsuelleEtablissement IS NOT NULL
""")
coiffeurs = coiffeurs.df()
1. Convert the DuckDB dataframe to a Pandas DataFrame.
coiffeurs.head(3)
In this exercise, we will consider only the variable denominationUsuelleEtablissement.
1. In this dataset, [ND] is a code for missing values. Since Python has no reason to know this a priori and therefore did not interpret these values as missing, use the replace method to replace [ND] with an empty text field. Also recode missing values as empty text fields to avoid future errors related to the impossibility of applying certain text methods to missing values (see the sketch after this list).
2. Search for all occurrences where the term tif appears, paying attention to the capitalization of the variable. Look at some observations.
3. Using this example, normalize the names of the salons by removing special characters and count the most frequent puns.
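A possible sketch for questions 1 and 2 (the normalization asked for in question 3 is left open):
# Question 1: recode the "[ND]" placeholder and missing values as empty strings
coiffeurs["denominationUsuelleEtablissement"] = (
    coiffeurs["denominationUsuelleEtablissement"]
    .replace("[ND]", "")
    .fillna("")
)

# Question 2: case-insensitive search for "tif" in the business name
coiffeurs_tif = coiffeurs.loc[
    coiffeurs["denominationUsuelleEtablissement"].str.lower().str.contains("tif")
]
coiffeurs_tif.head()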
With question 2, we find a list of quite imaginative puns based on the term tif.
In a more interactive form, here is a list of all the hairdressers who have the word tif in the name of their registered business in the official data:
Of course, to go further, it would be better to normalize the data more thoroughly, check that the information sought is not spread across multiple columns, and conduct visual inspections to detect hidden puns. But already, in just a few minutes, we have partial statistics on the phenomenon of punny hairdressers.
5.3.4 Why do we need a commune code when we already have its name?
This exercise will take a step back to understand why we assumed above that the commune code was the key for joining data.
Let's start by checking the dimensions of the DataFrames and the structure of some key variables.
In this case, the fundamental variables for linking our data are the communal variables.
Here, we have two geographical variables: a commune code and a commune name.
Question 1. Check the dimensions of the DataFrames.
Question 2. Identify in filosofi the commune names that correspond to multiple commune codes and select their codes. In other words, identify the LIBGEO where there are duplicate CODGEO and store them in a vector x (tip: be careful with the index of x; a possible approach is sketched after the questions below).
We temporarily focus on observations where the label involves more than two different commune codes.
Question 3. Look at these observations in filosofi.
Question 4. To get a better view, reorder the obtained dataset alphabetically.
Question 5. Determine the average size (variable number of people: NBPERSMENFISC16) and some descriptive statistics of this data. Compare it to the same statistics on the data where labels and commune codes coincide.
Question 6. Check, among the major cities (more than 100,000 people), the proportion of cities where the same name is associated with different commune codes.
Question 7. Check in filosofi the cities where the label is equal to Montreuil. Also check those that contain the term 'Saint-Denis'.
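One possible way to answer question 2, i.e. to get the LIBGEO values associated with several CODGEO (the exact approach is left open by the exercise):
# Commune names appearing with more than one commune code; x is indexed by LIBGEO
x = (
    filosofi
    .groupby("LIBGEO")["CODGEO"]
    .nunique()
    .loc[lambda s: s > 1]
)
x.head()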
This small exercise reassures us as the duplicated labels are actually the same commune names but in different departments. So, these are not duplicated observations. We can thus rely on the commune codes, which are unique.
5.3.5 Associating different sources to compute carbon footprints
First, we will calculate the carbon footprint of each commune.
1. Create a variable emissions that corresponds to the total emissions of a commune.
2. Perform a left join between the emissions data and the framing data2.
3. Calculate the carbon footprint (total emissions / population).
A sketch of these three steps is given below.
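A possible sketch of these three steps, consistent with the preparation code shown later in this chapter:
# 1. Total emissions per commune (sum of the numeric sector columns)
emissions['emissions'] = emissions.sum(axis=1, numeric_only=True)

# 2. Left join with the Filosofi framing data
empreinte = emissions.merge(
    filosofi,
    how="left",
    left_on="INSEE commune",
    right_on="CODGEO"
)

# 3. Carbon footprint = total emissions / population
empreinte['empreinte'] = empreinte['emissions'] / empreinte['NBPERSMENFISC16']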
At this stage, we might want to move towards modeling to try to explain the determinants of the carbon footprint based on communal variables. However, for an inferential approach to be relevant, it is necessary to check some descriptive statistics beforehand.
- Generate a histogram of the carbon footprint per commune in level and log.
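A minimal sketch, assuming the empreinte column built above:
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
empreinte['empreinte'].plot(kind="hist", bins=50, ax=axes[0], title="Level")
np.log(empreinte['empreinte']).plot(kind="hist", bins=50, ax=axes[1], title="Log")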
With a better understanding of our data, we are getting closer to inferential statistics. However, so far we have constructed univariate statistics and have not sought to understand the results by looking at the relationship with other variables. This brings us to bivariate statistics, especially correlation analysis. This work is important because any subsequent modeling will consist of refining the correlation analysis to account for cross-correlations between multiple factors. Here, we propose to do this analysis in a minimal way.
- Look at the correlation between the framing variables and the carbon footprint. Do some variables seem to potentially influence the carbon footprint?
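A minimal sketch of such a correlation check, using framing variables already encountered in this chapter (NBPERSMENFISC16 for population, MED16 for median income):
empreinte[['empreinte', 'NBPERSMENFISC16', 'MED16']].corr()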
At the end of question 5, the correlation graph is as follows:
6 Formatting descriptive statistics tables
A Pandas DataFrame is automatically formatted, when viewed from a notebook, as a minimally styled HTML table. This formatting is convenient for viewing data, a necessary task for data scientists, but it doesn't go much beyond that.
In an exploratory phase, it can be useful to have a more complete table, including minimal visualizations, to better understand the data. In the final phase of a project, when communicating about it, having an attractive visualization is advantageous. Notebook outputs are not a satisfactory solution for these needs and require the medium of the notebook, which can deter some users.
Fortunately, the young package great_tables makes it possible to build, programmatically, tables that rival tedious manual productions in Excel and are otherwise difficult to replicate. This package is a Python port of the R package gt. great_tables builds HTML tables, offering great formatting richness and excellent integration with Quarto, the reproducible publishing tool developed by RStudio.
The following exercise will propose building a table with this package, step by step.
To focus on table construction, the necessary data preparations are provided directly. We will start from this dataset:
To ensure you are able to complete the next exercise, here is the dataframe required for it.
emissions['emissions'] = emissions.sum(axis=1, numeric_only=True)

emissions_merged = (
    emissions
    .reset_index()
    .merge(filosofi, left_on="INSEE commune", right_on="CODGEO")
)
emissions_merged['empreinte'] = emissions_merged['emissions'] / emissions_merged['NBPERSMENFISC16']
emissions_merged['empreinte'] = emissions_merged['empreinte'].astype(float)
emissions_table = (
    emissions_merged
    .rename(columns={"dep_y": "dep", "NBPERSMENFISC16": "population", "MED16": "revenu"})
    .groupby("dep")
    .agg({"empreinte": "sum", "revenu": "median", "population": "sum"})  # not really the median income
    .reset_index()
    .sort_values(by="empreinte")
)
In this table, we will include horizontal bars, similar to the examples shown here. This is done by directly including the HTML code in the DataFrame column.
def create_bar(prop_fill: float, max_width: int, height: int, color: str = "green") -> str:
    """Create divs to represent prop_fill as a bar."""
    width = round(max_width * prop_fill, 2)
    px_width = f"{width}px"
    return f"""\
    <div style="width: {max_width}px; background-color: lightgrey;">\
        <div style="height:{height}px;width:{px_width};background-color:{color};"></div>\
    </div>\
    """

colors = {'empreinte': "green", 'revenu': "red", 'population': "blue"}

for variable in ['empreinte', 'revenu', 'population']:
    emissions_table[f'raw_perc_{variable}'] = emissions_table[variable] / emissions_table[variable].max()
    emissions_table[f'bar_{variable}'] = emissions_table[f'raw_perc_{variable}'].map(
        lambda x: create_bar(x, max_width=75, height=20, color=colors[variable])
    )
We keep only the 5 smallest carbon footprints and the five largest.
emissions_min = emissions_table.head(5).assign(grp="5 départements les moins pollueurs").reset_index(drop=True)
emissions_max = emissions_table.tail(5).assign(grp="5 départements les plus pollueurs").reset_index(drop=True)

emissions_table = pd.concat([
    emissions_min,
    emissions_max
])
Finally, to use some practical functions for selecting columns based on patterns, we will convert the data to the Polars format.
import polars as pl
emissions_table = pl.from_pandas(emissions_table)
Using this base table:
GT(emissions_table, groupname_col="grp", rowname_col="dep")
construct a table in the style of the one below.
# Start from here
from great_tables import GT

GT(emissions_table, groupname_col="grp", rowname_col="dep")
The table you should obtain:
Thanks to this, we can already see that our definition of the carbon footprint is certainly flawed. It seems unlikely that the inhabitants of department 77 have a carbon footprint 500 times greater than that of intra-muros Paris. The main reason? We are not dealing with a concept of consumption emissions but of production emissions, which penalizes industrial areas or areas with airports…
To learn more about constructing tables with great_tables, you can replicate this exercise on producing electoral tables that I proposed for an R course with gt, the equivalent of great_tables for R.
7 Pandas: towards practice and beyond
7.1 Pandas in a chain of operations
Generally, in a project, data cleaning will consist of a series of methods applied to a DataFrame, or to a Series when working exclusively on a single column. In other words, what is usually expected when working with Pandas is a chain that takes a DataFrame as input and outputs the same DataFrame enriched, or an aggregated version of it.
This way of proceeding is at the heart of the dplyr syntax in R but is not necessarily native in Pandas, depending on the operations you want to implement. Indeed, the natural way to update a dataframe in Pandas often involves syntax like:
import numpy as np
import pandas as pd

data = [[8000, 1000], [9500, np.nan], [5000, 2000]]
df = pd.DataFrame(data, columns=['salaire', 'autre_info'])
df['salaire_net'] = df['salaire'] * 0.8
In SQL, you could directly update your database with the new column:
SELECT *, salaire*0.8 AS salaire_net FROM df
The tidyverse ecosystem in R, the equivalent of Pandas, works according to the same logic as SQL table updates. Indeed, you would use the following command with dplyr:
df %>% mutate(salaire_net = salaire*0.8)
Technically, you could do this with an assign in Pandas:
df = df.drop("salaire_net", axis="columns")
df = df.assign(salaire_net=lambda s: s['salaire'] * 0.8)

1. We delete the variable to start again from the initial example.
However, this assign syntax is not very natural. It requires passing a lambda function that expects a DataFrame as input where you would want a column. So it is not really a readable and practical syntax.
It is nevertheless possible to chain operations on datasets using pipes. These follow the same philosophy as dplyr, itself inspired by the Linux pipe. This approach makes the code more readable by defining functions that perform operations on one or more columns of a DataFrame. The first argument of the function is the DataFrame; the others control its behavior:
def calcul_salaire_net(df: pd.DataFrame, col: str, taux: float = 0.8):
    df["salaire_net"] = df[col] * taux
    return df
This transforms our production chain into:
(
    df
    .pipe(calcul_salaire_net, "salaire")
)
7.2 Some limitations regarding Pandas syntax
There is a before and after Pandas in data analysis with Python. Without this incredibly practical package, Python, despite all its strengths, would have struggled to establish itself in the data analysis landscape. However, while Pandas offers a coherent syntax in many aspects, it is not perfect either. More recent data analysis paradigms in Python sometimes aim to correct these syntactical imperfections.
Among the most annoying points in everyday use is the need to regularly perform reset_index when building descriptive statistics. Indeed, it can be dangerous to keep indices that are not well controlled because, if we are not careful during the merge phases, they can be misused by Pandas to join data, leading to surprises.
Pandas is extremely well designed for restructuring data from long to wide format or vice versa. However, this is not the only way of restructuring a dataset that we might want to implement. It often happens that we want to compare the value of an observation to that of a group to which it belongs. This is particularly useful in anomaly analysis, outlier detection, or fraud investigation. Natively, in Pandas, you need to build an aggregate statistic by group and then merge it back to the initial data using the group variable. This is somewhat tedious:
= emissions.groupby("dep").agg({"Agriculture": "mean"}).reset_index()
emissions_moyennes = (
emissions_enrichies
emissions= "dep", suffixes = ['', '_moyenne_dep'])
.merge(emissions_moyennes, on
)'relatives'] = emissions_enrichies["Agriculture"]/emissions_enrichies["Agriculture_moyenne_dep"]
emissions_enrichies[ emissions_enrichies.head()
In the tidyverse, this two-step operation could be done in a single step, which is more convenient:
emissions %>%
  group_by(dep) %>%
  mutate(relatives = Agriculture/mean(Agriculture))
This isn't too bad, but it does make Pandas processing chains longer and therefore increases the maintenance burden required to keep them running over time.
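For this specific pattern (comparing each observation to an aggregate of its own group), it is worth noting that Pandas also offers transform, which broadcasts the group statistic back to the original rows without an explicit merge; the verbosity argument still holds for more complex restructurings. A sketch:
emissions['relatives'] = (
    emissions['Agriculture']
    / emissions.groupby('dep')['Agriculture'].transform('mean')
)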
More generally, Pandas processing chains can be quite verbose because it is often necessary to redefine the DataFrame rather than just the columns. For example, to filter rows and columns, you have to:
(
    emissions
    .loc[
        (emissions["dep"] == "12") & (emissions["Routier"] > 500),
        ['INSEE commune', 'Commune']
    ]
    .head(5)
)
In SQL, you could simply refer to the columns in the filter:
SELECT "INSEE commune", 'Commune'
FROM emissions
WHERE dep=="12" AND Routier>500
In the tidyverse (R), you could also do this simply:
df %>%
  filter(dep == "12", Routier > 500) %>%
  select(`INSEE commune`, `Commune`)
8 Other paradigms
These two chapters have explored in depth the richness of the Pandas ecosystem, which is indispensable in the data scientist's toolbox. Despite all the limitations we have mentioned, and the alternative solutions we will present, Pandas remains the central package of the Python data ecosystem. In the following chapters, we will see its native integration with the Scikit ecosystem for machine learning and the extension of Pandas to spatial data with GeoPandas.
Other technical solutions that we will discuss here may be relevant if you want to handle large volumes of data or if you want to use alternative syntaxes.
The main alternatives to Pandas are Polars, DuckDB, and Spark. There is also Dask, a library for parallelizing Pandas operations.
8.1 Polars
Polars is certainly the paradigm most inspired by Pandas, even in the choice of name. The first fundamental difference lies in the internal layers used. Polars relies on the Rust implementation of Arrow, whereas Pandas relies on Numpy, which costs it some performance. This allows Polars to be more efficient on large volumes of data, especially since many operations are parallelized and rely on lazy evaluation, a programming principle that optimizes queries so they are executed in a logically optimal order rather than the order in which they are written.
Another strength of Polars is its more coherent syntax, which benefits from over fifteen years of Pandas existence and almost a decade of dplyr (the data manipulation package within the R tidyverse paradigm). To take the previous example, there is no longer any need to repeat the reference to the DataFrame: in an execution chain, all subsequent references are made with respect to the initial DataFrame.
import polars as pl

emissions_polars = pl.from_pandas(emissions)

(
    emissions_polars
    .filter(pl.col("dep") == "12", pl.col("Routier") > 500)
    .select('INSEE commune', 'Commune')
    .head(5)
)
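A minimal sketch of the lazy API mentioned above: nothing is computed until collect is called, which lets Polars optimize the whole query plan:
lazy_query = (
    emissions_polars
    .lazy()
    .filter(pl.col("dep") == "12")
    .group_by("dep")
    .agg(pl.col("Routier").mean())
)
lazy_query.collect()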
To learn about Polars, many online resources are available, including this notebook built for the public statistics data scientists network.
8.2 DuckDB
DuckDB is the newcomer in the data analysis ecosystem, pushing the limits of data processing with Python without resorting to big data tools like Spark. DuckDB epitomizes a new paradigm, the "Big data is dead" paradigm, where large data volumes can be processed without imposing infrastructures.
Besides its great efficiency, since DuckDB can handle data volumes larger than the computer or server's RAM, it offers the advantage of a uniform syntax across the languages that call DuckDB (Python, R, C++, or Javascript). DuckDB favors SQL syntax for data processing, with many pre-implemented functions to simplify certain data transformations (e.g., for text data, time data, etc.).
Compared to other SQL-based systems like PostgreSQL, DuckDB is very simple to install, as it is just a Python library, whereas many tools like PostgreSQL require an appropriate infrastructure.
To reuse the previous example, we can directly use the SQL code mentioned earlier.
import duckdb
duckdb.sql("""
SELECT "INSEE commune", "Commune"
FROM emissions
WHERE dep=='12' AND Routier>500
LIMIT 5
""")
Here, the clause FROM emissions comes from the fact that we can directly execute SQL on a Pandas object via DuckDB. If we read the file directly in the query, it gets slightly more complex, but the logic remains the same.
import duckdb
duckdb.sql(f"""
SELECT "INSEE commune", "Commune"
FROM read_csv_auto("{url}")
WHERE
substring("INSEE commune",1,2)=='12'
AND
Routier>500
LIMIT 5
""")
The rendering of the DataFrame is slightly different from Pandas because, like Polars and many large data processing systems, DuckDB relies on lazy evaluation and thus only displays a sample of the data. DuckDB and Polars are also well integrated with each other. You can run SQL on a Polars object via DuckDB or apply Polars functions to an object initially read with DuckDB.
One of the interests of DuckDB is its excellent integration with the Parquet ecosystem, the already mentioned data format that is becoming a standard in data sharing (for example, it is the cornerstone of data sharing on the HuggingFace platform). To learn more about DuckDB and discover its usefulness for reading data from the French population census, you can check out this blog post.
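As an illustration of this interoperability (a sketch): DuckDB's replacement scans let SQL refer directly to a Pandas or Polars object in scope, and a query result can be converted back to Polars with .pl():
import duckdb

# SQL directly on the Polars DataFrame created earlier
duckdb.sql("SELECT dep, COUNT(*) AS n FROM emissions_polars GROUP BY dep LIMIT 5")

# Converting a DuckDB result into a Polars DataFrame
emissions_pl = duckdb.sql("SELECT * FROM emissions LIMIT 5").pl()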
8.3 Spark
DuckDB has pushed the boundaries of big data, which can be defined as the volume of data that can no longer be processed on a single machine without implementing a parallelization strategy.
Nevertheless, for very large data volumes, Python is well equipped with the PySpark library. This is a Python API for Spark, a big data framework implemented in Scala. This paradigm is built on the idea that Python users access it via clusters with many nodes to process data in parallel. The data is read in blocks and processed in parallel depending on the number of parallel nodes. The Spark DataFrame API has a syntax close to the previous paradigms, with more complex engineering in the background related to native parallelization.
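To give an idea of the API, here is a minimal PySpark sketch of the same filter used in earlier examples; it is purely illustrative and assumes a working Spark installation:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Distribute the Pandas DataFrame (only sensible here for illustration)
emissions_spark = spark.createDataFrame(emissions)

(
    emissions_spark
    .filter((F.col("dep") == "12") & (F.col("Routier") > 500))
    .select("INSEE commune", "Commune")
    .show(5)
)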
Additional information
Environment files on which this chapter has been tested.
Python version used:
Package | Version |
---|---|
affine | 2.4.0 |
aiobotocore | 2.21.1 |
aiohappyeyeballs | 2.6.1 |
aiohttp | 3.11.13 |
aioitertools | 0.12.0 |
aiosignal | 1.3.2 |
alembic | 1.13.3 |
altair | 5.4.1 |
aniso8601 | 9.0.1 |
annotated-types | 0.7.0 |
anyio | 4.8.0 |
appdirs | 1.4.4 |
archspec | 0.2.3 |
asttokens | 2.4.1 |
attrs | 25.3.0 |
babel | 2.17.0 |
bcrypt | 4.2.0 |
beautifulsoup4 | 4.12.3 |
black | 24.8.0 |
blinker | 1.8.2 |
blis | 1.2.0 |
bokeh | 3.5.2 |
boltons | 24.0.0 |
boto3 | 1.37.1 |
botocore | 1.37.1 |
branca | 0.7.2 |
Brotli | 1.1.0 |
bs4 | 0.0.2 |
cachetools | 5.5.0 |
cartiflette | 0.0.2 |
Cartopy | 0.24.1 |
catalogue | 2.0.10 |
cattrs | 24.1.2 |
certifi | 2025.1.31 |
cffi | 1.17.1 |
charset-normalizer | 3.4.1 |
chromedriver-autoinstaller | 0.6.4 |
click | 8.1.8 |
click-plugins | 1.1.1 |
cligj | 0.7.2 |
cloudpathlib | 0.21.0 |
cloudpickle | 3.0.0 |
colorama | 0.4.6 |
comm | 0.2.2 |
commonmark | 0.9.1 |
conda | 24.9.1 |
conda-libmamba-solver | 24.7.0 |
conda-package-handling | 2.3.0 |
conda_package_streaming | 0.10.0 |
confection | 0.1.5 |
contextily | 1.6.2 |
contourpy | 1.3.1 |
cryptography | 43.0.1 |
cycler | 0.12.1 |
cymem | 2.0.11 |
cytoolz | 1.0.0 |
dask | 2024.9.1 |
dask-expr | 1.1.15 |
databricks-sdk | 0.33.0 |
dataclasses-json | 0.6.7 |
debugpy | 1.8.6 |
decorator | 5.1.1 |
Deprecated | 1.2.14 |
diskcache | 5.6.3 |
distributed | 2024.9.1 |
distro | 1.9.0 |
docker | 7.1.0 |
duckdb | 1.2.1 |
en_core_web_sm | 3.8.0 |
entrypoints | 0.4 |
et_xmlfile | 2.0.0 |
exceptiongroup | 1.2.2 |
executing | 2.1.0 |
fastexcel | 0.11.6 |
fastjsonschema | 2.21.1 |
fiona | 1.10.1 |
Flask | 3.0.3 |
folium | 0.17.0 |
fontawesomefree | 6.6.0 |
fonttools | 4.56.0 |
fr_core_news_sm | 3.8.0 |
frozendict | 2.4.4 |
frozenlist | 1.5.0 |
fsspec | 2023.12.2 |
geographiclib | 2.0 |
geopandas | 1.0.1 |
geoplot | 0.5.1 |
geopy | 2.4.1 |
gitdb | 4.0.11 |
GitPython | 3.1.43 |
google-auth | 2.35.0 |
graphene | 3.3 |
graphql-core | 3.2.4 |
graphql-relay | 3.2.0 |
graphviz | 0.20.3 |
great-tables | 0.12.0 |
greenlet | 3.1.1 |
gunicorn | 22.0.0 |
h11 | 0.14.0 |
h2 | 4.1.0 |
hpack | 4.0.0 |
htmltools | 0.6.0 |
httpcore | 1.0.7 |
httpx | 0.28.1 |
httpx-sse | 0.4.0 |
hyperframe | 6.0.1 |
idna | 3.10 |
imageio | 2.37.0 |
importlib_metadata | 8.6.1 |
importlib_resources | 6.5.2 |
inflate64 | 1.0.1 |
ipykernel | 6.29.5 |
ipython | 8.28.0 |
itsdangerous | 2.2.0 |
jedi | 0.19.1 |
Jinja2 | 3.1.6 |
jmespath | 1.0.1 |
joblib | 1.4.2 |
jsonpatch | 1.33 |
jsonpointer | 3.0.0 |
jsonschema | 4.23.0 |
jsonschema-specifications | 2024.10.1 |
jupyter-cache | 1.0.0 |
jupyter_client | 8.6.3 |
jupyter_core | 5.7.2 |
kaleido | 0.2.1 |
kiwisolver | 1.4.8 |
langchain | 0.3.20 |
langchain-community | 0.3.9 |
langchain-core | 0.3.45 |
langchain-text-splitters | 0.3.6 |
langcodes | 3.5.0 |
langsmith | 0.1.147 |
language_data | 1.3.0 |
lazy_loader | 0.4 |
libmambapy | 1.5.9 |
locket | 1.0.0 |
loguru | 0.7.3 |
lxml | 5.3.1 |
lz4 | 4.3.3 |
Mako | 1.3.5 |
mamba | 1.5.9 |
mapclassify | 2.8.1 |
marisa-trie | 1.2.1 |
Markdown | 3.6 |
markdown-it-py | 3.0.0 |
MarkupSafe | 3.0.2 |
marshmallow | 3.26.1 |
matplotlib | 3.10.1 |
matplotlib-inline | 0.1.7 |
mdurl | 0.1.2 |
menuinst | 2.1.2 |
mercantile | 1.2.1 |
mizani | 0.11.4 |
mlflow | 2.16.2 |
mlflow-skinny | 2.16.2 |
msgpack | 1.1.0 |
multidict | 6.1.0 |
multivolumefile | 0.2.3 |
munkres | 1.1.4 |
murmurhash | 1.0.12 |
mypy-extensions | 1.0.0 |
narwhals | 1.30.0 |
nbclient | 0.10.0 |
nbformat | 5.10.4 |
nest_asyncio | 1.6.0 |
networkx | 3.4.2 |
nltk | 3.9.1 |
numpy | 2.2.3 |
opencv-python-headless | 4.10.0.84 |
openpyxl | 3.1.5 |
opentelemetry-api | 1.16.0 |
opentelemetry-sdk | 1.16.0 |
opentelemetry-semantic-conventions | 0.37b0 |
orjson | 3.10.15 |
outcome | 1.3.0.post0 |
OWSLib | 0.28.1 |
packaging | 24.2 |
pandas | 2.2.3 |
paramiko | 3.5.0 |
parso | 0.8.4 |
partd | 1.4.2 |
pathspec | 0.12.1 |
patsy | 1.0.1 |
Pebble | 5.1.0 |
pexpect | 4.9.0 |
pickleshare | 0.7.5 |
pillow | 11.1.0 |
pip | 24.2 |
platformdirs | 4.3.6 |
plotly | 5.24.1 |
plotnine | 0.13.6 |
pluggy | 1.5.0 |
polars | 1.8.2 |
preshed | 3.0.9 |
prometheus_client | 0.21.0 |
prometheus_flask_exporter | 0.23.1 |
prompt_toolkit | 3.0.48 |
propcache | 0.3.0 |
protobuf | 4.25.3 |
psutil | 7.0.0 |
ptyprocess | 0.7.0 |
pure_eval | 0.2.3 |
py7zr | 0.20.8 |
pyarrow | 17.0.0 |
pyarrow-hotfix | 0.6 |
pyasn1 | 0.6.1 |
pyasn1_modules | 0.4.1 |
pybcj | 1.0.3 |
pycosat | 0.6.6 |
pycparser | 2.22 |
pycryptodomex | 3.21.0 |
pydantic | 2.10.6 |
pydantic_core | 2.27.2 |
pydantic-settings | 2.8.1 |
Pygments | 2.19.1 |
PyNaCl | 1.5.0 |
pynsee | 0.1.8 |
pyogrio | 0.10.0 |
pyOpenSSL | 24.2.1 |
pyparsing | 3.2.1 |
pyppmd | 1.1.1 |
pyproj | 3.7.1 |
pyshp | 2.3.1 |
PySocks | 1.7.1 |
python-dateutil | 2.9.0.post0 |
python-dotenv | 1.0.1 |
python-magic | 0.4.27 |
pytz | 2025.1 |
pyu2f | 0.1.5 |
pywaffle | 1.1.1 |
PyYAML | 6.0.2 |
pyzmq | 26.3.0 |
pyzstd | 0.16.2 |
querystring_parser | 1.2.4 |
rasterio | 1.4.3 |
referencing | 0.36.2 |
regex | 2024.9.11 |
requests | 2.32.3 |
requests-cache | 1.2.1 |
requests-toolbelt | 1.0.0 |
retrying | 1.3.4 |
rich | 13.9.4 |
rpds-py | 0.23.1 |
rsa | 4.9 |
rtree | 1.4.0 |
ruamel.yaml | 0.18.6 |
ruamel.yaml.clib | 0.2.8 |
s3fs | 2023.12.2 |
s3transfer | 0.11.3 |
scikit-image | 0.24.0 |
scikit-learn | 1.6.1 |
scipy | 1.13.0 |
seaborn | 0.13.2 |
selenium | 4.29.0 |
setuptools | 76.0.0 |
shapely | 2.0.7 |
shellingham | 1.5.4 |
six | 1.17.0 |
smart-open | 7.1.0 |
smmap | 5.0.0 |
sniffio | 1.3.1 |
sortedcontainers | 2.4.0 |
soupsieve | 2.5 |
spacy | 3.8.4 |
spacy-legacy | 3.0.12 |
spacy-loggers | 1.0.5 |
SQLAlchemy | 2.0.39 |
sqlparse | 0.5.1 |
srsly | 2.5.1 |
stack-data | 0.6.2 |
statsmodels | 0.14.4 |
tabulate | 0.9.0 |
tblib | 3.0.0 |
tenacity | 9.0.0 |
texttable | 1.7.0 |
thinc | 8.3.4 |
threadpoolctl | 3.6.0 |
tifffile | 2025.3.13 |
toolz | 1.0.0 |
topojson | 1.9 |
tornado | 6.4.2 |
tqdm | 4.67.1 |
traitlets | 5.14.3 |
trio | 0.29.0 |
trio-websocket | 0.12.2 |
truststore | 0.9.2 |
typer | 0.15.2 |
typing_extensions | 4.12.2 |
typing-inspect | 0.9.0 |
tzdata | 2025.1 |
Unidecode | 1.3.8 |
url-normalize | 1.4.3 |
urllib3 | 1.26.20 |
uv | 0.6.8 |
wasabi | 1.1.3 |
wcwidth | 0.2.13 |
weasel | 0.4.1 |
webdriver-manager | 4.0.2 |
websocket-client | 1.8.0 |
Werkzeug | 3.0.4 |
wheel | 0.44.0 |
wordcloud | 1.9.3 |
wrapt | 1.17.2 |
wsproto | 1.2.0 |
xgboost | 2.1.1 |
xlrd | 2.0.1 |
xyzservices | 2025.1.0 |
yarl | 1.18.3 |
yellowbrick | 1.5 |
zict | 3.0.0 |
zipp | 3.21.0 |
zstandard | 0.23.0 |
View file history
SHA | Date | Author | Description |
---|---|---|---|
3f1d2f3f | 2025-03-15 15:55:59 | Lino Galiana | Fix problem with uv and malformed files (#599) |
b26ad9a9 | 2025-02-02 10:34:58 | lgaliana | check without eval true |
2bb4397f | 2025-01-14 23:00:30 | lgaliana | remove failed pynsee code |
9cf2bde5 | 2024-10-18 15:49:47 | lgaliana | Reconstruction complète du chapitre de cartographie |
3b6301d4 | 2024-09-25 21:11:19 | lgaliana | change order |
127d17c5 | 2024-09-25 19:24:01 | lgaliana | formatting |
f2430634 | 2024-09-25 19:21:23 | lgaliana | le bon chemin |
3fc7b541 | 2024-09-25 19:11:23 | lgaliana | build pipeline |
6a00c518 | 2024-09-25 17:28:46 | Lino Galiana | Special callouts for vscode (#558) |
488780a4 | 2024-09-25 14:32:16 | Lino Galiana | Change badge (#556) |
46f038a4 | 2024-09-23 15:28:36 | Lino Galiana | Mise à jour du premier chapitre sur les figures (#553) |
59f5803d | 2024-09-22 16:41:46 | Lino Galiana | Update bike count source data for visualisation tutorial (#552) |
4404ec10 | 2024-09-21 18:28:42 | lgaliana | Back to eval true |
72d44dd6 | 2024-09-21 12:50:38 | lgaliana | Force build for pandas chapters |
0908656f | 2024-08-20 16:30:39 | Lino Galiana | English sidebar (#542) |
5108922f | 2024-08-08 18:43:37 | Lino Galiana | Improve notebook generation and tests on PR (#536) |
f4367f78 | 2024-08-08 11:44:37 | Lino Galiana | Traduction chapitre suite Pandas avec un profile (#534) |
72f42bb7 | 2024-07-25 19:06:38 | Lino Galiana | Language message on notebooks (#529) |
195dc9e9 | 2024-07-25 11:59:19 | linogaliana | Switch language button |
fe4edc98 | 2024-07-10 18:37:22 | Lino Galiana | Second pandas chapter translated (#524) |
065b0abd | 2024-07-08 11:19:43 | Lino Galiana | Nouveaux callout dans la partie manipulation (#513) |
c6ffbb60 | 2024-05-08 07:38:24 | lgaliana | Putting badges in advanced Pandas |
c3873ed6 | 2024-05-07 12:19:12 | lgaliana | Ajoute code commune |
e0d615e3 | 2024-05-03 11:15:29 | Lino Galiana | Restructure la partie Pandas (#497) |
Footnotes
1. Ideally, it would be more coherent, for demographic data, to use the legal populations from the census. However, this base is not yet natively integrated into the pynsee library that we will use in this chapter. An open exercise is proposed to construct population aggregates from anonymized individual census data (the detailed files).↩︎
2. Ideally, it would be necessary to ensure that this join does not introduce bias. Indeed, since our reference years are not necessarily identical, there may be a mismatch between our two sources. Since the exercise is already long, we will not go down this path. Interested readers can perform such an analysis as an additional exercise.↩︎
Citation
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}