Introduction to Pandas

Pandas is the central piece of the Python ecosystem for data science. This chapter presents the first data exploration that can be performed with Pandas in order to explore the structure of a dataset.

Tutorial
Exercises
Manipulation
Author

Lino Galiana

Published

2024-07-10

To try the examples in this tutorial:

1 Introduction

The Pandas package has been the central piece of the data science ecosystem for about a decade. The DataFrame, a central object in languages like R or Stata, had long been absent in the Python ecosystem. Yet, thanks to Numpy, all the basic components were present but needed to be reconfigured to meet the needs of data scientists.

Wes McKinney, when he built Pandas to provide a dataframe leveraging the numerical computation library Numpy in the background, enabled a significant leap forward for Python in data analysis, explaining its popularity in the data science ecosystem. Pandas is not without limitations1, which we will have the opportunity to discuss, but the vast array of analysis methods it offers greatly simplifies data analysis work. For more information on this package, the reference book by McKinney (2012) presents many of the package’s features.

In this chapter, we will focus on the most relevant elements in the context of an introduction to data science, leaving interested users to deepen their knowledge with the abundant resources available on the subject.

As datasets generally gain value by associating multiple sources, for example, to relate a record to contextual data or to link two client databases to obtain meaningful data, the next chapter will present how to merge different datasets with Pandas. By the end of the next chapter, thanks to data merging, we will have a detailed database on the carbon footprints of the French2.

1.1 Data used in this chapter

In this Pandas tutorial, we will use:

  • Greenhouse gas emissions estimated at the municipal level by ADEME. The dataset is available on data.gouv and can be queried directly in Python with this URL.

The next chapter will allow us to apply the elements presented in this chapter with the above data combined with contextual data at the municipal level.

Skills to be acquired by the end of this chapter
  • Import a dataset as a Pandas dataframe and explore its structure;
  • Perform manipulations on columns and rows;
  • Construct aggregate statistics and chain operations;
  • Use Pandas graphical methods to quickly represent data distribution.

1.2 Environment

We will follow the usual conventions in importing packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

To obtain reproducible results, you can set the seed of the pseudo-random number generator.

np.random.seed(123)

Throughout this demonstration of the main Pandas functionalities, and in the next chapter, I recommend regularly referring to the following resources:

As a reminder, to execute the code examples in an interactive notebook, you can use the shortcuts at the top of the page to launch it in your preferred environment.

2 Pandas logic

2.1 Anatomy of a Pandas table

The central object in the Pandas logic is the DataFrame. It is a special data structure with two dimensions, structured by aligning rows and columns. Unlike a matrix, columns can be of different types.

A DataFrame consists of the following elements:

  • the row index;
  • the column name;
  • the data value;

Structure of a Pandas DataFrame, borrowed from https://x.com/epfl_exts/status/997506000600084480

Structure of a Pandas DataFrame, borrowed from https://x.com/epfl_exts/status/997506000600084480

2.2 Before seeing DataFrame, we need to know Pandas Series

In fact, a DataFrame is a collection of objects called pandas.Series. These Series are one-dimensional objects that are extensions of the one-dimensional Numpy arrays3. In particular, to facilitate the handling of categorical or temporal data, additional variable types are available in Pandas compared to Numpy (categorical, datetime64, and timedelta64). These types are associated with optimized methods to facilitate the processing of this data.

There are several possible types for a pandas.Series, extending the basic data types in Python, which will determine the behavior of this variable. Indeed, many operations do not have the same meaning depending on whether the value is numeric or not.

The simplest types (int or float) correspond to numeric values:

poids = pd.Series([3, 7, 12])
poids
0     3
1     7
2    12
dtype: int64
Missing Values are tricky !

In general, if Pandas exclusively detects integer values in a variable, it will use the int type to optimize memory. This choice makes sense. However, it has a drawback: Numpy, and therefore by extension Pandas, cannot represent missing values for the int type (more on missing values below).

Pending the shift to Arrow, which can handle missing values in int, the method to use is to convert to the float type if the variable will have missing values, which is quite simple:

For textual data, it is just as simple:

animal = pd.Series(["cat", "dog", "koala"])
animal
0      cat
1      dog
2    koala
dtype: object

The object type is a catch-all for exclusively textual data types (type str) or a mix of textual and numerical data (type mixed). Historically, it was an intermediate type between the factor and character of R. However, recently, there is an equivalent type to factor in Pandas for variables with a finite and relatively short list of values, the category type. The object type can cause unexpected errors due to its mixed nature, so it is recommended to choose the nature of a variable and convert it:

1animal.astype("category")
2animal.astype(str)
1
To convert to category (the logical choice here)
2
To convert to str (if you want to perform subsequent textual operations)
0      cat
1      dog
2    koala
dtype: object

It is important to examine the types of your Pandas objects and convert them if they do not make sense; Pandas makes optimized choices, but it may be necessary to correct them because Pandas does not know your future data usage. This is one of the tasks to do during feature engineering, the set of steps for preparing data for future use.

3 From Series to DataFrame

We have created two independent series, animal and poids, which are related. In the matrix world, this would correspond to moving from a vector to a matrix. In the Pandas world, this means moving from a Series to a DataFrame.

This is done naturally with Pandas:

1animaux = pd.DataFrame(zip(animal, poids), columns=["animal", "poids"])
animaux
1
We need to use zip here because Pandas expects a structure like {"var1": [val1, val2], "var2": [val1, val2]}, which is not what we prepared previously. However, we will see that this approach is not the most common for creating a DataFrame.
animal poids
0 cat 3
1 dog 7
2 koala 12

3.1 Indexing

The essential difference between a Series and a Numpy object is indexing. In Numpy, indexing is implicit; it allows accessing data (the one at the index located at position i). With a Series, you can, of course, use a positional index, but more importantly, you can use more explicit indices.

This allows accessing data more naturally, using column names, for example:

animaux["poids"]
0     3
1     7
2    12
Name: poids, dtype: int64

The existence of an index makes subsetting, that is, selecting rows or columns, particularly easy. DataFrames have two indices: those for rows and those for columns. You can make selections on both dimensions. Anticipating later exercises, you can see that this will facilitate row selection:

animaux.loc[animaux["animal"] == "chat", "poids"]
Series([], Name: poids, dtype: int64)

This instruction is equivalent to the SQL command:

SELECT poids FROM animaux WHERE animal == "chat"

If we return to our animaux dataset, we can see the row number displayed on the left:

animaux
animal poids
0 cat 3
1 dog 7
2 koala 12

This is the default index for the row dimension because we did not configure one. It is not mandatory; it is quite possible to have an index corresponding to a variable of interest (we will discover this when we explore groupby in the next chapter). However, this can be tricky, and it is recommended that this be only transitory, hence the importance of regularly performing reset_index.

3.2 The concept of tidy data

The concept of tidy data, popularized by Hadley Wickham through his R packages (see Wickham, Çetinkaya-Rundel, and Grolemund (2023)), is highly relevant for describing the structure of a Pandas DataFrame. The three rules of tidy data are as follows:

  • Each variable has its own column;
  • Each observation has its own row;
  • A value, representing an observation of a variable, is located in a single cell.

Illustration of the tidy data concept (borrowed from H. Wickham)

Illustration of the tidy data concept (borrowed from H. Wickham)

These principles may seem like common sense, but you will find that many data formats do not adhere to them. For example, Excel spreadsheets often have values spanning multiple columns or several merged rows. Restructuring this data according to the tidy data principle will be crucial to performing analysis on it.

4 Importing Data with Pandas

If you had to manually create all your DataFrames from vectors, Pandas would not be practical. Pandas offers many functions to read data stored in different formats.

The simplest data to read are tabular data stored in an appropriate format. The two main formats to know are CSV and Parquet. The former has the advantage of simplicity - it is universal, known to all, and readable by any text editor. The latter is gaining popularity in the data ecosystem because it addresses some limitations of CSV (optimized storage, predefined variable types…) but has the disadvantage of not being readable without a suitable tool (which fortunately are increasingly present in standard code editors like VSCode). For more information on the difference between CSV and Parquet, refer to the deep dive chapters on the subject.

Data stored in other text-derived formats from CSV (.txt, .tsv…) or formats like JSON are readable with Pandas, but sometimes it takes a bit of iteration to find the correct reading parameters. In other chapters, we will discover that other data formats related to different data structures are also readable with Python.

Flat formats (.csv, .txt…) and the Parquet format are easy to use with Pandas because they are not proprietary and they store data in a tidy form. Data from spreadsheets, Excel or LibreOffice, are more or less complicated to import depending on whether they follow this schema or not. This is because these tools are used indiscriminately. While in the data science world, they should primarily be used to disseminate final tables for reporting, they are often used to disseminate raw data that could be better shared through more suitable channels.

One of the main difficulties with spreadsheets is that the data are usually associated with documentation in the same tab - for example, the spreadsheet has some lines describing the sources before the table - which requires human intelligence to assist Pandas during the import phase. This will always be possible, but when there is an alternative in the form of a flat file, there should be no hesitation.

4.1 Reading Data from a Local Path

This exercise aims to demonstrate the benefit of using a relative path rather than an absolute path to enhance code reproducibility. However, we will later recommend reading directly from the internet when possible and when it does not involve the recurrent download of a large file.

To prepare for this exercise, the following code will allow you to download data and write it locally:

import requests

url = "https://www.insee.fr/fr/statistiques/fichier/6800675/v_commune_2023.csv"
response = requests.get(url)

# Assurez-vous que la requête a réussi
if response.status_code == 200:
    with open("cog_2023.csv", "wb") as file:
        file.write(response.content)
else:
    print("Download failed. HTTP status :", response.status_code)
Preliminary Exercise: Importing a CSV (Optional)
  1. Use the code above ☝️ to download the data. Use Pandas to read the downloaded file.
  2. Find where the data was written. Observe the structure of this directory.
  3. Create a new folder using the file explorer (on the left in Jupyter or VSCode). Move the CSV and the notebook. Restart the kernel and adjust your code if needed. Repeat this process several times with different folders. What issues might you encounter?
TYPECOM COM REG DEP CTCD ARR TNCC NCC NCCENR LIBELLE CAN COMPARENT
0 COM 01001 84.0 01 01D 012 5 ABERGEMENT CLEMENCIAT Abergement-Clémenciat L'Abergement-Clémenciat 0108 NaN
1 COM 01002 84.0 01 01D 011 5 ABERGEMENT DE VAREY Abergement-de-Varey L'Abergement-de-Varey 0101 NaN

The main issue with reading from files stored locally is the risk of becoming dependent on a file system that is not necessarily shared. It is better, whenever possible, to directly read the data with an HTTPS link, which Pandas can handle. Moreover, when working with open data, this ensures that you are using the latest available data and not a local duplication that may not be up to date.

4.2 Reading from a CSV Available on the Internet

The URL to access the data can be stored in an ad hoc variable:

url = "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"

The goal of the next exercise is to get familiar with importing and displaying data using Pandas and displaying a few observations.

Exercise 1: Importing a CSV and Exploring Data Structure
  1. Import the data from Ademe using the Pandas package and the dedicated command for importing CSVs. Name the obtained DataFrame emissions4.
  2. Use the appropriate methods to display the first 10 values, the last 15 values, and a random sample of 10 values using the appropriate methods from the Pandas package.
  3. Draw 5 percent of the sample without replacement.
  4. Keep only the first 10 rows and randomly draw from these to obtain a DataFrame of 100 data points.
  5. Make 100 draws from the first 6 rows with a probability of 1/2 for the first observation and a uniform probability for the others.
If you get stuck on question 1

Read the documentation for read_csv (very well done) or look for examples online to discover this function.

As illustrated by this exercise, displaying DataFrames in notebooks is quite ergonomic. The first and last rows are displayed automatically. For valuation tables present in a report or research article, the next chapter introduces great_tables, which offers very rich table formatting features.

Warning

Be careful with display and commands that reveal data (head, tail, etc.) in a notebook that handles confidential data when using version control software like Git (see dedicated chapters).

Indeed, you may end up sharing data inadvertently in the Git history. As explained in the chapter dedicated to Git, a file named .gitignore is sufficient to create some rules to avoid unintentional sharing of data with Git.

5 Exploring the Structure of a DataFrame

Pandas offers a data schema quite familiar to users of statistical software like R. Similar to the main data processing paradigms like the tidyverse (R), the grammar of Pandas inherits from SQL logic. The philosophy is very similar: operations are performed to select rows, columns, sort rows based on column values, apply standardized treatments to variables, etc. Generally, operations that reference variable names are preferred over those that reference row or column numbers.

Whether you are familiar with SQL or R, you will find a similar logic to what you know, although the names might differ: df.loc[df['y']=='b'] may be written as df %>% filter(y=='b') (R) or SELECT * FROM df WHERE y == 'b' (SQL), but the logic is the same.

Pandas offers a plethora of pre-implemented functionalities. It is highly recommended, before writing a function, to consider if it is natively implemented in Numpy, Pandas, etc. Most of the time, if a solution is implemented in a library, it should be used as it will be more efficient than what you would implement.

To present the most practical methods for data analysis, we can use the example of the municipal CO2 consumption data from Ademe, which was the focus of the previous exercises.

df = pd.read_csv(
    "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"
)
df
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934
3 01005 AMBERIEUX-EN-DOMBES 1859.160954 NaN NaN 1144.429311 216.217508 94.182310 276.448534 663.683146 1756.341319 782.404357
4 01006 AMBLEON 448.966808 NaN NaN 77.033834 48.401549 NaN NaN 43.714019 398.786800 51.681756
... ... ... ... ... ... ... ... ... ... ... ... ...
35793 95676 VILLERS-EN-ARTHIES 1628.065094 NaN NaN 165.045396 65.063617 11.772789 34.556067 176.098160 309.627908 235.439109
35794 95678 VILLIERS-ADAM 698.630772 NaN NaN 1331.126598 111.480954 2.354558 6.911213 1395.529811 18759.370071 403.404815
35795 95680 VILLIERS-LE-BEL 107.564967 NaN NaN 8367.174532 225.622903 534.484607 1568.845431 22613.830247 12217.122402 13849.512001
35796 95682 VILLIERS-LE-SEC 1090.890170 NaN NaN 326.748418 108.969749 2.354558 6.911213 67.235487 4663.232127 85.657725
35797 95690 WY-DIT-JOLI-VILLAGE 1495.103542 NaN NaN 125.236417 97.728612 4.709115 13.822427 117.450851 504.400972 147.867245

35798 rows × 12 columns

5.1 Dimensions and Structure of a DataFrame

The first useful methods allow displaying some attributes of a DataFrame.

df.axes
[RangeIndex(start=0, stop=35798, step=1),
 Index(['INSEE commune', 'Commune', 'Agriculture', 'Autres transports',
        'Autres transports international', 'CO2 biomasse hors-total', 'Déchets',
        'Energie', 'Industrie hors-énergie', 'Résidentiel', 'Routier',
        'Tertiaire'],
       dtype='object')]
df.columns
Index(['INSEE commune', 'Commune', 'Agriculture', 'Autres transports',
       'Autres transports international', 'CO2 biomasse hors-total', 'Déchets',
       'Energie', 'Industrie hors-énergie', 'Résidentiel', 'Routier',
       'Tertiaire'],
      dtype='object')
df.index
RangeIndex(start=0, stop=35798, step=1)

To know the dimensions of a DataFrame, some practical methods can be used:

df.ndim
2
df.shape
(35798, 12)
df.size
429576

To determine the number of unique values of a variable, rather than writing a function yourself, use the nunique method. For example,

df["Commune"].nunique()
33338

Pandas offers many useful methods. Here is a summary of those related to data structure, accompanied by a comparison with R:

Operation pandas dplyr (R) data.table (R)
Retrieve column names df.columns colnames(df) colnames(df)
Retrieve dimensions df.shape dim(df) dim(df)
Retrieve unique values of a variable df['myvar'].nunique() df %>% summarise(distinct(myvar)) df[,uniqueN(myvar)]

5.2 Accessing Elements of a DataFrame

In SQL, performing operations on columns is done with the SELECT command. With Pandas, to access an entire column, several approaches can be used:

  • dataframe.variable, for example df.Energie. This method requires column names without spaces or special characters, which excludes many real datasets. It is not recommended.
  • dataframe[['variable']] to return the variable as a DataFrame. This method can be tricky for a single variable; it’s better to use dataframe.loc[:,['variable']], which is more explicit about the nature of the resulting object.
  • dataframe['variable'] to return the variable as a Series. For example, df[['Autres transports']] or df['Autres transports']. This is the preferred method.

To retrieve multiple columns at once, there are two approaches, with the second being preferable:

  • dataframe[['variable1', 'variable2']]
  • dataframe.loc[:, ['variable1', 'variable2']]

This is equivalent to SELECT variable1, variable2 FROM dataframe in SQL.

Using .loc may seem excessively verbose, but it ensures that you are performing a subset operation on the column dimension. DataFrames have two indices, for rows and columns, and implicit operations can sometimes cause surprises, so it’s more reliable to be explicit.

5.3 Accessing Rows

To access one or more values in a DataFrame, there are two recommended methods, depending on the form of the row or column indices used:

  • df.iloc: uses indices. This method is somewhat unreliable because a DataFrame’s indices can change during processing (especially when performing group operations).
  • df.loc: uses labels. This method is recommended.
Warning

Code snippets using the df.ix structure should be avoided as the function is deprecated and may disappear at any time.

iloc refers to the indexing from 0 to N, where N equals df.shape[0] of a pandas.DataFrame. loc refers to the values of df’s index. For example, with the pandas.DataFrame df_example:

df_example = pd.DataFrame(
    {"month": [1, 4, 7, 10], "year": [2012, 2014, 2013, 2014], "sale": [55, 40, 84, 31]}
)
df_example = df_example.set_index("month")
df_example
year sale
month
1 2012 55
4 2014 40
7 2013 84
10 2014 31
  • df_example.loc[1, :] will return the first row of df (row where the month index equals 1);
  • df_example.iloc[1, :] will return the second row (since Python indexing starts at 0);
  • df_example.iloc[:, 1] will return the second column, following the same principle.

Later exercises will allow practicing this syntax on our dataset of carbon emissions.

6 Main Data Manipulations

The most frequent operations in SQL are summarized in the following table. It is useful to know them (many data manipulation syntaxes use these terms) because, one way or another, they cover most data manipulation uses. We will describe some of them later:

Operation SQL pandas dplyr (R) data.table (R)
Select variables by name SELECT df[['Autres transports','Energie']] df %>% select(Autres transports, Energie) df[, c('Autres transports','Energie')]
Select observations based on one or more conditions FILTER df[df['Agriculture']>2000] df %>% filter(Agriculture>2000) df[Agriculture>2000]
Sort the table by one or more variables SORT BY df.sort_values(['Commune','Agriculture']) df %>% arrange(Commune, Agriculture) df[order(Commune, Agriculture)]
Add variables that are functions of other variables SELECT *, LOG(Agriculture) AS x FROM df df['x'] = np.log(df['Agriculture']) df %>% mutate(x = log(Agriculture)) df[,x := log(Agriculture)]
Perform an operation by group GROUP BY df.groupby('Commune').mean() df %>% group_by(Commune) %>% summarise(m = mean) df[,mean(Commune), by = Commune]
Join two databases (inner join) SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.x table1.merge(table2, left_on = 'id', right_on = 'x') table1 %>% inner_join(table2, by = c('id'='x')) merge(table1, table2, by.x = 'id', by.y = 'x')

6.1 Operations on Columns: Adding or Removing Variables, Renaming Them, etc.

Technically, Pandas DataFrames are mutable objects in the Python language, meaning it is possible to change the DataFrame as needed during processing.

The most classic operation is adding or removing variables to the data table. The simplest way to add columns is by reassignment. For example, to create a dep variable that corresponds to the first two digits of the commune code (INSEE code), simply take the variable and apply the appropriate treatment (in this case, keep only its first two characters):

df["dep"] = df["INSEE commune"].str[:2]
df.head(3)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934 01

In SQL, the method depends on the execution engine. In pseudo-code, it would be:

SELECT everything(), SUBSTR("code_insee", 2) AS dep FROM df

It is possible to apply this approach to creating columns on multiple columns. One of the advantages of this approach is that it allows recycling column names.

vars = ["Agriculture", "Déchets", "Energie"]

df[[v + "_log" for v in vars]] = np.log(df.loc[:, vars])
df.head(3)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep Agriculture_log Déchets_log Energie_log
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01 8.219171 4.619374 0.856353
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01 6.164010 4.946455 0.856353
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934 01 6.212693 8.578159 6.906086

The equivalent SQL query would be quite tedious to write. For such operations, the benefit of a high-level library like Pandas becomes clear.

Warning

This is possible thanks to the native vectorization of Numpy operations and the magic of Pandas that rearranges everything. This is not usable with just any function. For other functions, you will need to use assign, generally through lambda functions, temporary functions acting as pass-throughs. For example, to create a variable using this approach, you would do:

df.assign(Energie_log=lambda x: np.log(x["Energie"]))
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire dep Agriculture_log Déchets_log Energie_log
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01 8.219171 4.619374 0.856353
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01 6.164010 4.946455 0.856353
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934 01 6.212693 8.578159 6.906086
3 01005 AMBERIEUX-EN-DOMBES 1859.160954 NaN NaN 1144.429311 216.217508 94.182310 276.448534 663.683146 1756.341319 782.404357 01 7.527881 5.376285 4.545232
4 01006 AMBLEON 448.966808 NaN NaN 77.033834 48.401549 NaN NaN 43.714019 398.786800 51.681756 01 6.106949 3.879532 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35793 95676 VILLERS-EN-ARTHIES 1628.065094 NaN NaN 165.045396 65.063617 11.772789 34.556067 176.098160 309.627908 235.439109 95 7.395148 4.175366 2.465791
35794 95678 VILLIERS-ADAM 698.630772 NaN NaN 1331.126598 111.480954 2.354558 6.911213 1395.529811 18759.370071 403.404815 95 6.549122 4.713854 0.856353
35795 95680 VILLIERS-LE-BEL 107.564967 NaN NaN 8367.174532 225.622903 534.484607 1568.845431 22613.830247 12217.122402 13849.512001 95 4.678095 5.418865 6.281303
35796 95682 VILLIERS-LE-SEC 1090.890170 NaN NaN 326.748418 108.969749 2.354558 6.911213 67.235487 4663.232127 85.657725 95 6.994749 4.691070 0.856353
35797 95690 WY-DIT-JOLI-VILLAGE 1495.103542 NaN NaN 125.236417 97.728612 4.709115 13.822427 117.450851 504.400972 147.867245 95 7.309951 4.582194 1.549500

35798 rows × 16 columns

With methods from Pandas or Numpy like this, it is not beneficial and even counterproductive as it slows down the code.

Variables can be easily renamed using the rename method, which works well with dictionaries. To rename columns, specify the parameter axis = 'columns' or axis=1. The axis parameter is often necessary because many Pandas methods assume by default that operations are done on the row index:

df = df.rename({"Energie": "eneg", "Agriculture": "agr"}, axis=1)
df.head()
INSEE commune Commune agr Autres transports Autres transports international CO2 biomasse hors-total Déchets eneg Industrie hors-énergie Résidentiel Routier Tertiaire dep Agriculture_log Déchets_log Energie_log
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172 01 8.219171 4.619374 0.856353
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207 01 6.164010 4.946455 0.856353
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934 01 6.212693 8.578159 6.906086
3 01005 AMBERIEUX-EN-DOMBES 1859.160954 NaN NaN 1144.429311 216.217508 94.182310 276.448534 663.683146 1756.341319 782.404357 01 7.527881 5.376285 4.545232
4 01006 AMBLEON 448.966808 NaN NaN 77.033834 48.401549 NaN NaN 43.714019 398.786800 51.681756 01 6.106949 3.879532 NaN

Finally, to delete columns, use the drop method with the columns argument:

df = df.drop(columns=["eneg", "agr"])

6.2 Reordering Observations

The sort_values method allows reordering observations in a DataFrame, keeping the column order the same.

For example, to sort in descending order of CO2 consumption in the residential sector, you would do:

df = df.sort_values("Résidentiel", ascending=False)
df.head(3)
INSEE commune Commune Autres transports Autres transports international CO2 biomasse hors-total Déchets Industrie hors-énergie Résidentiel Routier Tertiaire dep Agriculture_log Déchets_log Energie_log
12167 31555 TOULOUSE 4482.980062 130.792683 576394.181208 88863.732538 277062.573234 410675.902028 586054.672836 288175.400126 31 7.268255 11.394859 11.424640
16774 44109 NANTES 138738.544337 250814.701179 193478.248177 18162.261628 77897.138554 354259.013785 221068.632724 173447.582779 44 5.513507 9.807101 9.767748
27294 67482 STRASBOURG 124998.576639 122266.944279 253079.442156 119203.251573 135685.440035 353586.424577 279544.852332 179562.761386 67 6.641974 11.688585 9.885411

Thus, in one line of code, you can identify the cities where the residential sector consumes the most. In SQL, you would do:

SELECT * FROM df ORDER BY "Résidentiel" DESC

6.3 Filtering

The operation of selecting rows is called FILTER in SQL. It is used based on a logical condition (clause WHERE). Data is selected based on a logical condition.

There are several methods in Pandas. The simplest is to use boolean masks, as seen in the chapter numpy.

For example, to select the municipalities in Hauts-de-Seine, you can start by using the result of the str.startswith method (which returns True or False):

df["INSEE commune"].str.startswith("92")
12167    False
16774    False
27294    False
12729    False
22834    False
         ...  
20742    False
20817    False
20861    False
20898    False
20957    False
Name: INSEE commune, Length: 35798, dtype: bool

str. is a special method in Pandas that allows treating each value of a vector as a native string in Python to which a subsequent method (in this case, startswith) is applied.

The above instruction returns a vector of booleans. We previously saw that the loc method is used for subsetting on both row and column indices. It works with boolean vectors. In this case, if subsetting on the row dimension (or column), it will return all observations (or variables) that satisfy this condition.

Thus, by combining these two elements, we can filter our data to get only the results for the 92:

df.loc[df["INSEE commune"].str.startswith("92")].head(2)
INSEE commune Commune Autres transports Autres transports international CO2 biomasse hors-total Déchets Industrie hors-énergie Résidentiel Routier Tertiaire dep Agriculture_log Déchets_log Energie_log
35494 92012 BOULOGNE-BILLANCOURT 1250.483441 34.234669 51730.704250 964.828694 25882.493998 92216.971456 64985.280901 60349.109482 92 NaN 6.871951 9.084530
35501 92025 COLOMBES 411.371588 14.220061 53923.847088 698.685861 50244.664227 87469.549463 52070.927943 41526.600867 92 NaN 6.549201 9.461557

The equivalent SQL code may vary depending on the execution engine (DuckDB, PostGre, MySQL) but would take a form similar to this:

SELECT * FROM df WHERE STARTSWITH("INSEE commune", "92")

6.4 Summary of Main Operations

The main manipulations are as follows:

Selecting Columns Renaming Columns

Creating New Columns Selecting Rows

Reordering the DataFrame

Reordering the DataFrame

7 Descriptive Statistics

To start again from the raw source, let’s recreate our dataset for the examples:

df = pd.read_csv(
    "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"
)
df.head(3)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207
2 01004 AMBERIEU-EN-BUGEY 499.043526 212.577908 NaN 10313.446515 5314.314445 998.332482 2930.354461 16616.822534 15642.420313 10732.376934

Pandas includes several methods to construct aggregate statistics: sum, count of unique values, count of non-missing values, mean, variance, etc.

The most generic method is describe

df.describe()
Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
count 35736.000000 9979.000000 2.891000e+03 35798.000000 35792.000000 3.449000e+04 3.449000e+04 35792.000000 35778.000000 35798.000000
mean 2459.975760 654.919940 7.692345e+03 1774.381550 410.806329 6.625698e+02 2.423128e+03 1783.677872 3535.501245 1105.165915
std 2926.957701 9232.816833 1.137643e+05 7871.341922 4122.472608 2.645571e+04 5.670374e+04 8915.902379 9663.156628 5164.182507
min 0.003432 0.000204 3.972950e-04 3.758088 0.132243 2.354558e+00 1.052998e+00 1.027266 0.555092 0.000000
25% 797.682631 52.560412 1.005097e+01 197.951108 25.655166 2.354558e+00 6.911213e+00 96.052911 419.700460 94.749885
50% 1559.381285 106.795928 1.992434e+01 424.849988 54.748653 4.709115e+00 1.382243e+01 227.091193 1070.895593 216.297718
75% 3007.883903 237.341501 3.298311e+01 1094.749825 110.820941 5.180027e+01 1.520467e+02 749.469293 3098.612157 576.155869
max 98949.317760 513140.971691 3.303394e+06 576394.181208 275500.374439 2.535858e+06 6.765119e+06 410675.902028 586054.672836 288175.400126

which is similar to the eponymous method in Stata or the PROC FREQ of SAS, two proprietary languages. However, it provides a lot of information, which can often be overwhelming. Therefore, it is more practical to work directly on a few columns or choose the statistics you want to use.

7.1 Counts

The first type of statistics you might want to implement involves counting or enumerating values.

For example, if you want to know the number of municipalities in your dataset, you can use the count method or nunique if you are interested in unique values.

df["Commune"].count()
35798
df["Commune"].nunique()
33338

In SQL, the first instruction would be SELECT COUNT(Commune) FROM df, the second SELECT COUNT DISTINCT Commune FROM df. Here, this allows us to understand that there may be duplicates in the Commune column, which needs to be considered if we want to uniquely identify our municipalities (this will be addressed in the next chapter on data merging).

The coherence of the Pandas syntax allows you to do this for several columns simultaneously. In SQL, this would be possible but the code would start to become quite verbose:

df.loc[:, ["Commune", "INSEE commune"]].count()
Commune          35798
INSEE commune    35798
dtype: int64
df.loc[:, ["Commune", "INSEE commune"]].nunique()
Commune          33338
INSEE commune    35798
dtype: int64

With these two simple commands, we understand that our INSEE commune variable (the INSEE code) will be more reliable for identifying municipalities than the names, which are not necessarily unique. The purpose of the INSEE code is to provide a unique identifier, unlike the postal code which can be shared by several municipalities.

7.2 Aggregated Statistics

Pandas includes several methods to construct statistics on multiple columns: sum, mean, variance, etc.

The methods are quite straightforward:

df["Agriculture"].sum()
df["Agriculture"].mean()
2459.975759687974

Again, the consistency of Pandas allows generalizing the calculation of statistics to multiple columns:

df.loc[:, ["Agriculture", "Résidentiel"]].sum()
df.loc[:, ["Agriculture", "Résidentiel"]].mean()
Agriculture    2459.975760
Résidentiel    1783.677872
dtype: float64

It is possible to generalize this to all columns. However, it is necessary to introduce the numeric_only parameter to perform the aggregation task only on the relevant variables:

df.mean(numeric_only=True)
Agriculture                        2459.975760
Autres transports                   654.919940
Autres transports international    7692.344960
CO2 biomasse hors-total            1774.381550
Déchets                             410.806329
Energie                             662.569846
Industrie hors-énergie             2423.127789
Résidentiel                        1783.677872
Routier                            3535.501245
Tertiaire                          1105.165915
dtype: float64
Warning

Version 2.0 of Pandas introduced a change in the behavior of aggregation methods.

It is now necessary to specify whether you want to perform operations exclusively on numeric columns. This is why we explicitly state the argument numeric_only = True here. This behavior was previously implicit.

The practical method to know is agg. This allows defining the statistics you want to calculate for each variable:

df.agg(
    {
        "Agriculture": ["sum", "mean"],
        "Résidentiel": ["mean", "std"],
        "Commune": "nunique",
    }
)
Agriculture Résidentiel Commune
sum 8.790969e+07 NaN NaN
mean 2.459976e+03 1783.677872 NaN
std NaN 8915.902379 NaN
nunique NaN NaN 33338.0

The output of aggregation methods is an indexed Series (methods like df.sum()) or directly a DataFrame (the agg method). It is generally more practical to have a DataFrame than an indexed Series if you want to rework the table to draw conclusions. Therefore, it is useful to transform the outputs from Series to DataFrame and then apply the reset_index method to convert the index into a column. From there, you can modify the DataFrame to make it more readable.

For example, if you are interested in the share of each sector in total emissions, you can proceed in two steps. First, create an observation per sector representing its total emissions:

# Step 1: create a clean DataFrame
emissions_totales = pd.DataFrame(
    df.sum(numeric_only=True), columns=["emissions"]
).reset_index(names="secteur")
emissions_totales
secteur emissions
0 Agriculture 8.790969e+07
1 Autres transports 6.535446e+06
2 Autres transports international 2.223857e+07
3 CO2 biomasse hors-total 6.351931e+07
4 Déchets 1.470358e+07
5 Energie 2.285203e+07
6 Industrie hors-énergie 8.357368e+07
7 Résidentiel 6.384140e+07
8 Routier 1.264932e+08
9 Tertiaire 3.956273e+07

Then, work minimally on the dataset to get some interesting conclusions about the structure of emissions in France:

emissions_totales["emissions (%)"] = (
    100 * emissions_totales["emissions"] / emissions_totales["emissions"].sum()
)
(emissions_totales.sort_values("emissions", ascending=False).round())
secteur emissions emissions (%)
8 Routier 126493164.0 24.0
0 Agriculture 87909694.0 17.0
6 Industrie hors-énergie 83573677.0 16.0
7 Résidentiel 63841398.0 12.0
3 CO2 biomasse hors-total 63519311.0 12.0
9 Tertiaire 39562729.0 7.0
5 Energie 22852034.0 4.0
2 Autres transports international 22238569.0 4.0
4 Déchets 14703580.0 3.0
1 Autres transports 6535446.0 1.0

This table is not well-formatted and far from being presentable, but it is already useful from an exploratory perspective. It helps us understand the most emitting sectors, namely transport, agriculture, and industry, excluding energy. The fact that energy is relatively low in emissions can be explained by the French energy mix, where nuclear power represents a majority of electricity production.

To go further in formatting this table to have communicable statistics outside of Python, we will explore great_tables in the next chapter.

Note

The data structure resulting from df.sum is quite practical (it is tidy). We could do exactly the same operation as df.sum(numeric_only = True) with the following code:

df.select_dtypes(include="number").agg(func=sum)
/tmp/ipykernel_1301/1114394405.py:1: FutureWarning:

The provided callable <built-in function sum> is currently using DataFrame.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
Agriculture                        8.790969e+07
Autres transports                  6.535446e+06
Autres transports international    2.223857e+07
CO2 biomasse hors-total            6.351931e+07
Déchets                            1.470358e+07
Energie                            2.285203e+07
Industrie hors-énergie             8.357368e+07
Résidentiel                        6.384140e+07
Routier                            1.264932e+08
Tertiaire                          3.956273e+07
dtype: float64

7.3 Missing Values

So far, we haven’t discussed what could be a stumbling block for a data scientist: missing values.

Real datasets are rarely complete, and missing values can reflect many realities: data retrieval issues, irrelevant variables for a given observation, etc.

Technically, Pandas handles missing values without issues (except for int variables, but that’s an exception). By default, missing values are displayed as NaN and are of type np.nan (for temporal values, i.e., of type datetime64, missing values are NaT). We get consistent aggregation behavior when combining two columns, one of which has missing values.

ventes = pd.DataFrame(
    {
        "prix": np.random.uniform(size=5),
        "client1": [i + 1 for i in range(5)],
        "client2": [i + 1 for i in range(4)] + [np.nan],
        "produit": [np.nan] + ["yaourt", "pates", "riz", "tomates"],
    }
)
ventes
prix client1 client2 produit
0 0.846498 1 1.0 NaN
1 0.688088 2 2.0 yaourt
2 0.183628 3 3.0 pates
3 0.697141 4 4.0 riz
4 0.260635 5 NaN tomates

Pandas will refuse to aggregate because, for it, a missing value is not zero:

ventes["client1"] + ventes["client2"]
0    2.0
1    4.0
2    6.0
3    8.0
4    NaN
dtype: float64

It is possible to remove missing values using dropna(). This method will remove all rows where there is at least one missing value.

ventes.dropna()
prix client1 client2 produit
1 0.688088 2 2.0 yaourt
2 0.183628 3 3.0 pates
3 0.697141 4 4.0 riz

In this case, we lose two rows. It is also possible to remove only the columns where there are missing values in a DataFrame with dropna() using the subset parameter.

ventes.dropna(subset=["produit"])
prix client1 client2 produit
1 0.688088 2 2.0 yaourt
2 0.183628 3 3.0 pates
3 0.697141 4 4.0 riz
4 0.260635 5 NaN tomates

This time we lose only one row, the one where produit is missing.

Pandas provides the ability to impute missing values using the fillna() method. For example, if you think the missing values in produit are zeros, you can do:

ventes.dropna(subset=["produit"]).fillna(0)
prix client1 client2 produit
1 0.688088 2 2.0 yaourt
2 0.183628 3 3.0 pates
3 0.697141 4 4.0 riz
4 0.260635 5 0.0 tomates

If you want to impute the median for the client2 variable, you can slightly change this code by encapsulating the median calculation inside:

(ventes["client2"].fillna(ventes["client2"].median()))
0    1.0
1    2.0
2    3.0
3    4.0
4    2.5
Name: client2, dtype: float64

For real-world datasets, it is useful to use the isna (or isnull) method combined with sum or mean to understand the extent of missing values in a dataset.

df.isnull().mean().sort_values(ascending=False)
Autres transports international    0.919241
Autres transports                  0.721241
Energie                            0.036538
Industrie hors-énergie             0.036538
Agriculture                        0.001732
Routier                            0.000559
Déchets                            0.000168
Résidentiel                        0.000168
INSEE commune                      0.000000
Commune                            0.000000
CO2 biomasse hors-total            0.000000
Tertiaire                          0.000000
dtype: float64

This preparatory step is useful for anticipating the question of imputation or filtering on missing values: are they missing at random or do they reflect an issue in data retrieval? The choices related to handling missing values are not neutral methodological choices. Pandas provides the technical tools to do this, but the legitimacy and relevance of these choices are specific to each dataset. Data explorations aim to detect clues to make an informed decision.

8 Quick Graphical Representations

Numerical tables are certainly useful for understanding the structure of a dataset, but their dense aspect makes them difficult to grasp. Having a simple graph can be useful to visualize the distribution of the data at a glance and thus understand the normality of an observation.

Pandas includes basic graphical methods to meet this need. They are practical for quickly producing a graph, especially after complex data manipulation operations. We will delve deeper into the issue of data visualizations in the Communicate section.

You can apply the plot() method directly to a Series:

df["Déchets"].plot()

The equivalent code with matplotlib would be:

import matplotlib.pyplot as plt

plt.plot(df.index, df["Déchets"])

By default, the obtained visualization is a series. This is not necessarily what is expected since it only makes sense for time series. As a data scientist working with microdata, you are more often interested in a histogram to get an idea of the data distribution. To do this, simply add the argument kind = 'hist':

df["Déchets"].hist()

With data that has a non-normalized distribution, which represents many real-world variables, histograms are generally not very informative. The log can be a solution to bring some extreme values to a comparable scale:

df["Déchets"].plot(kind="hist", logy=True)

The output is a matplotlib object. Customizing these figures is thus possible (and even desirable because the default matplotlib graphs are quite basic). However, this is a quick method for constructing figures that require work for a finalized visualization. This involves thorough work on the matplotlib object or using a higher-level library for graphical representation (seaborn, plotnine, plotly, etc.).

The part of this course dedicated to data visualization will briefly present these different visualization paradigms. These do not exempt you from using common sense in choosing the graph used to represent a descriptive statistic (see this conference by Eric Mauvière).

9 Synthesis Exercise

This exercise synthesizes several steps of data preparation and exploration to better understand the structure of the phenomenon we want to study, namely carbon emissions in France.

It is recommended to start from a clean session (in a notebook, you should do Restart Kernel) to avoid an environment polluted by other objects. You can then run the following code to get the necessary base:

import pandas as pd

emissions = pd.read_csv(
    "https://koumoul.com/s/data-fair/api/v1/datasets/igt-pouvoir-de-rechauffement-global/convert"
)
emissions.head(2)
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
0 01001 L'ABERGEMENT-CLEMENCIAT 3711.425991 NaN NaN 432.751835 101.430476 2.354558 6.911213 309.358195 793.156501 367.036172
1 01002 L'ABERGEMENT-DE-VAREY 475.330205 NaN NaN 140.741660 140.675439 2.354558 6.911213 104.866444 348.997893 112.934207
Exercice 2: Découverte des verbes de Pandas pour manipuler des données

En premier lieu, on propose de se familiariser avec les opérations sur les colonnes.

  1. Créer un dataframe emissions_copy ne conservant que les colonnes INSEE commune, Commune, Autres transports et Autres transports international
Indice pour cette question

  1. Comme les noms de variables sont peu pratiques, les renommer de la manière suivante :
    • INSEE commune \(\to\) code_insee
    • Autres transports \(\to\) transports
    • Autres transports international \(\to\) transports_international
Indice pour cette question

  1. On propose, pour simplifier, de remplacer les valeurs manquantes (NA) par la valeur 0. Utiliser la méthode fillna pour transformer les valeurs manquantes en 0.

  2. Créer les variables suivantes :

    • dep: le département. Celui-ci peut être créé grâce aux deux premiers caractères de code_insee en appliquant la méthode str ;
    • transports_total: les émissions du secteur transports (somme des deux variables)
Indice pour cette question

  1. Ordonner les données du plus gros pollueur au plus petit puis ordonner les données du plus gros pollueur au plus petit par département (du 01 au 95).
Indice pour cette question

  1. Ne conserver que les communes appartenant aux départements 13 ou 31. Ordonner ces communes du plus gros pollueur au plus petit.
Indice pour cette question

Revenir au jeu emission initial

  1. Calculer les émissions totales par secteur. Calculer la part de chaque secteur dans les émissions totales. Transformer en tonnes les volumes avant d’afficher les résultats

  2. Calculer pour chaque commune les émissions totales après avoir imputé les valeurs manquantes à 0. Garder les 100 communes les plus émettrices. Calculer la part de chaque secteur dans cette émission. Comprendre les facteurs pouvant expliquer ce classement.

Aide si vous êtes en difficulté sur la question 8

Jouer avec le paramètre axis lors de la construction d’une statistique agrégée.

In question 5, when the communes are ordered exclusively on the basis of the variable transport_total, the result is as follows:

code_insee Commune transports transports_international dep transports_total
31108 77291 LE MESNIL-AMELOT 133834.090767 3.303394e+06 77 3.437228e+06
31099 77282 MAUREGARD 133699.072712 3.303394e+06 77 3.437093e+06
31111 77294 MITRY-MORY 89815.529858 2.202275e+06 77 2.292090e+06

Question 6 gives us this classification:

code_insee Commune transports transports_international dep transports_total
4438 13096 SAINTES-MARIES-DE-LA-MER 271182.758578 0.000000 13 271182.758578
4397 13054 MARIGNANE 245375.418650 527360.799265 13 772736.217915
11684 31069 BLAGNAC 210157.688544 403717.366279 31 613875.054823

In question 7, the resulting table looks like this

secteur emissions emissions (%)
8 Routier 126493.0 24.0
0 Agriculture 87910.0 17.0
6 Industrie hors-énergie 83574.0 16.0
7 Résidentiel 63841.0 12.0
3 CO2 biomasse hors-total 63519.0 12.0
INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel Routier Tertiaire
4382 13039 FOS-SUR-MER 305.092893 1893.383189 1.722723e+04 50891.367548 275500.374439 2.296711e+06 6.765119e+06 9466.388806 74631.401993 42068.140058
22671 59183 DUNKERQUE 811.390947 3859.548994 3.327586e+05 71922.181764 23851.780482 1.934988e+06 5.997333e+06 113441.727216 94337.865738 70245.678455
4398 13056 MARTIGUES 855.299300 2712.749275 3.043476e+04 35925.561051 44597.426397 1.363402e+06 2.380185e+06 22530.797276 84624.862481 44394.822725
30560 76476 PORT-JEROME-SUR-SEINE 2736.931327 121.160849 2.086403e+04 22846.964780 78.941581 1.570236e+06 2.005643e+06 21072.566129 9280.824961 15270.357772
31108 77291 LE MESNIL-AMELOT 782.183307 133834.090767 3.303394e+06 3330.404124 111.613197 8.240952e+02 2.418925e+03 1404.400153 11712.541682 13680.471909
... ... ... ... ... ... ... ... ... ... ... ... ...
30056 74281 THONON-LES-BAINS 382.591429 145.151526 0.000000e+00 144503.815623 21539.524775 2.776024e+03 1.942086e+05 44545.339492 25468.667604 19101.562585
33954 86194 POITIERS 2090.630127 20057.297403 2.183418e+04 95519.054766 13107.640015 5.771021e+03 1.693938e+04 110218.156827 91921.787626 73121.300634
1918 06019 BLAUSASC 194.382622 0.000000 0.000000e+00 978.840592 208.084792 2.260375e+02 4.362880e+05 538.816198 3949.838612 740.022785
27607 68253 OTTMARSHEIM 506.238288 679.598495 3.526748e+01 4396.853500 237.508651 1.172570e+03 4.067258e+05 3220.875864 19317.977050 4921.144781
28643 71076 CHALON-SUR-SAONE 1344.584721 1336.547546 6.558180e+01 77082.218324 570.662275 1.016933e+04 2.075650e+05 77330.230592 29020.646500 34997.262920

100 rows × 12 columns

At the end of question 8, we better understand the factors that can explain high emissions at the municipal level. If we look at the top three emitting municipalities, we can see that they are cities with refineries:

INSEE commune Commune Agriculture Autres transports Autres transports international CO2 biomasse hors-total Déchets Energie Industrie hors-énergie Résidentiel ... Part Autres transports Part Autres transports international Part CO2 biomasse hors-total Part Déchets Part Energie Part Industrie hors-énergie Part Résidentiel Part Routier Part Tertiaire Part total
4382 13039 FOS-SUR-MER 305.092893 1893.383189 17227.234585 50891.367548 275500.374439 2.296711e+06 6.765119e+06 9466.388806 ... 0.019860 0.180696 0.533799 2.889719 24.090160 70.959215 0.099293 0.782807 0.441252 100.0
22671 59183 DUNKERQUE 811.390947 3859.548994 332758.630269 71922.181764 23851.780482 1.934988e+06 5.997333e+06 113441.727216 ... 0.044652 3.849791 0.832091 0.275949 22.386499 69.385067 1.312444 1.091425 0.812695 100.0
4398 13056 MARTIGUES 855.299300 2712.749275 30434.762405 35925.561051 44597.426397 1.363402e+06 2.380185e+06 22530.797276 ... 0.067655 0.759035 0.895975 1.112249 34.002905 59.361219 0.561912 2.110523 1.107196 100.0

3 rows × 24 columns

Thanks to our minimal explorations with Pandas, we see that this dataset provides information about the nature of the French productive fabric and the environmental consequences of certain activities.

References

  • The site pandas.pydata serves as a reference

  • The book Modern Pandas by Tom Augspurger: https://tomaugspurger.github.io/modern-1-intro.html

McKinney, Wes. 2012. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. " O’Reilly Media, Inc.".
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. " O’Reilly Media, Inc.".

Informations additionnelles

environment files have been tested on.

Latest built version: 2024-07-10

Python version used:

'3.11.6 | packaged by conda-forge | (main, Oct  3 2023, 10:40:35) [GCC 12.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.12.2
aiohttp 3.9.3
aioitertools 0.11.0
aiosignal 1.3.1
alembic 1.13.1
aniso8601 9.0.1
annotated-types 0.7.0
appdirs 1.4.4
archspec 0.2.3
astroid 3.1.0
asttokens 2.4.1
attrs 23.2.0
Babel 2.15.0
bcrypt 4.1.2
beautifulsoup4 4.12.3
black 24.4.2
blinker 1.7.0
blis 0.7.11
bokeh 3.4.0
boltons 23.1.1
boto3 1.34.51
botocore 1.34.51
branca 0.7.1
Brotli 1.1.0
cachetools 5.3.3
cartiflette 0.0.2
Cartopy 0.23.0
catalogue 2.0.10
cattrs 23.2.3
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.18.1
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.3.0
conda-libmamba-solver 24.1.0
conda-package-handling 2.2.0
conda_package_streaming 0.9.0
confection 0.1.5
contextily 1.6.0
contourpy 1.2.1
cryptography 42.0.5
cycler 0.12.1
cymem 2.0.8
cytoolz 0.12.3
dask 2024.4.1
dask-expr 1.0.10
debugpy 1.8.1
decorator 5.1.1
dill 0.3.8
distributed 2024.4.1
distro 1.9.0
docker 7.0.0
duckdb 0.10.1
en-core-web-sm 3.7.1
entrypoints 0.4
et-xmlfile 1.1.0
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
fiona 1.9.6
flake8 7.0.0
Flask 3.0.2
folium 0.16.0
fontawesomefree 6.5.1
fonttools 4.51.0
frozenlist 1.4.1
fsspec 2023.12.2
GDAL 3.8.4
gensim 4.3.2
geographiclib 2.0
geopandas 0.12.2
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.29.0
graphene 3.3
graphql-core 3.2.3
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.10.0
greenlet 3.0.3
gunicorn 21.2.0
htmltools 0.5.2
hvac 2.1.0
idna 3.6
imageio 2.34.2
importlib_metadata 7.1.0
importlib_resources 6.4.0
inflate64 1.0.0
ipykernel 6.29.3
ipython 8.22.2
ipywidgets 8.1.2
isort 5.13.2
itsdangerous 2.1.2
jedi 0.19.1
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonpatch 1.33
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter-cache 1.0.0
jupyter_client 8.6.1
jupyter_core 5.7.2
jupyterlab_widgets 3.0.10
kaleido 0.2.1
kiwisolver 1.4.5
kubernetes 29.0.0
langcodes 3.4.0
language_data 1.2.0
lazy_loader 0.4
libmambapy 1.5.7
llvmlite 0.42.0
locket 1.0.0
lxml 5.2.2
lz4 4.3.3
Mako 1.3.2
mamba 1.5.7
mapclassify 2.6.1
marisa-trie 1.2.0
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
matplotlib-inline 0.1.6
mccabe 0.7.0
mdurl 0.1.2
menuinst 2.0.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.11.3
mlflow-skinny 2.11.3
msgpack 1.0.7
multidict 6.0.5
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.10
mypy 1.9.0
mypy-extensions 1.0.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.3
nltk 3.8.1
numba 0.59.1
numpy 1.26.4
oauthlib 3.2.2
opencv-python-headless 4.9.0.80
openpyxl 3.1.5
OWSLib 0.28.1
packaging 23.2
pandas 2.2.1
paramiko 3.4.0
parso 0.8.4
partd 1.4.1
pathspec 0.12.1
patsy 0.5.6
Pebble 5.0.7
pexpect 4.9.0
pickleshare 0.7.5
pillow 10.3.0
pip 24.0
pkgutil_resolve_name 1.3.10
platformdirs 4.2.0
plotly 5.19.0
plotnine 0.13.6
pluggy 1.4.0
polars 0.20.31
preshed 3.0.9
prometheus_client 0.20.0
prometheus-flask-exporter 0.23.0
prompt-toolkit 3.0.42
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
py7zr 0.20.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybcj 1.0.2
pycodestyle 2.11.1
pycosat 0.6.6
pycparser 2.21
pycryptodomex 3.20.0
pydantic 2.8.2
pydantic_core 2.20.1
pyflakes 3.2.0
Pygments 2.17.2
PyJWT 2.8.0
pylint 3.1.0
PyNaCl 1.5.0
pynsee 0.1.7
pyOpenSSL 24.0.0
pyparsing 3.1.2
pyppmd 1.1.0
pyproj 3.6.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2024.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.1
pyzmq 25.1.2
pyzstd 0.16.0
QtPy 2.4.1
querystring-parser 1.2.4
rasterio 1.3.10
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
requests-cache 1.2.1
requests-oauthlib 2.0.0
rich 13.7.1
rpds-py 0.18.0
rsa 4.9
Rtree 1.2.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.10.1
scikit-image 0.24.0
scikit-learn 1.4.1.post1
scipy 1.13.0
seaborn 0.13.2
setuptools 69.2.0
shapely 2.0.3
shellingham 1.5.4
six 1.16.0
smart_open 7.0.4
smmap 5.0.0
snuggs 1.4.7
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.7.5
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.29
sqlparse 0.4.4
srsly 2.4.8
stack-data 0.6.2
statsmodels 0.14.1
tabulate 0.9.0
tblib 3.0.0
tenacity 8.2.3
texttable 1.7.0
thinc 8.2.5
threadpoolctl 3.4.0
tifffile 2024.7.2
tomli 2.0.1
tomlkit 0.12.4
toolz 0.12.1
topojson 1.9
tornado 6.4
tqdm 4.66.2
traitlets 5.14.2
truststore 0.8.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2024.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.18
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.1
websocket-client 1.7.0
Werkzeug 3.0.2
wheel 0.43.0
widgetsnbextension 4.0.10
wordcloud 1.9.3
wrapt 1.16.0
xgboost 2.0.3
xlrd 2.0.1
xyzservices 2024.4.0
yarl 1.9.4
yellowbrick 1.5
zict 3.0.0
zipp 3.17.0
zstandard 0.22.0

View file history

SHA Date Author Description
6ca4c1c 2024-07-08 17:24:11 Lino Galiana Traduction du premier chapitre Pandas (#520)
Back to top

Footnotes

  1. The equivalent ecosystem in R, the tidyverse, developed by Posit, is of more recent design than Pandas. Its philosophy could thus draw inspiration from Pandas while addressing some limitations of the Pandas syntax. Since both syntaxes are an implementation in Python or R of the SQL philosophy, it is natural that they resemble each other and that it is pertinent for data scientists to know both languages.↩︎

  2. Actually, it is not the carbon footprint but the national inventory since the database corresponds to a production view, not consumption. Emissions made in one municipality to satisfy the consumption of another will be attributed to the former where the carbon footprint concept would attribute it to the latter. Moreover, the emissions presented here do not include those produced by goods made abroad. This exercise is not about constructing a reliable statistic but rather understanding the logic of data merging to construct descriptive statistics.↩︎

  3. The original goal of Pandas is to provide a high-level library for more abstract low-level layers, such as Numpy arrays. Pandas is gradually changing these low-level layers to favor Arrow over Numpy without destabilizing the high-level commands familiar to Pandas users. This shift is due to the fact that Arrow, a low-level computation library, is more powerful and flexible than Numpy. For example, Numpy offers limited textual types, whereas Arrow provides greater freedom.↩︎

  4. Due to a lack of imagination, we are often tempted to call our main dataframe df or data. This is often a bad idea because the name is not very informative when you read the code a few weeks later. Self-documenting code, an approach that consists of having code that is self-explanatory, is a good practice, and it is recommended to give a simple yet effective name to know the nature of the dataset in question.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.