Part 1: data wrangling

Python has established itself as a credible alternative to R for data manipulation. The Pandas ecosystem has made it possible to democratize the use of DataFrames in Python and make it easier to manipulate of structured data thanks to the SQL philosophy. Python is also the most practical language for retrieving and manipulating unstructured data (webscraping, APIs). Python is tending to become, thanks to the development of APIs for other languages (C, Spark, Postgres, ElasticSearch…), the “one to rule them all” language.

Manipulation
Introduction
Author

Lino Galiana

Published

2025-03-19

Although data scientists are often associated with the implementation of artificial intelligence models, it is important not to forget that training and using these models do not necessarily represent the daily work of data scientists.

In practice, gathering heterogeneous data sources, structuring and harmonizing them for exploratory analysis prior to modeling or visualization represents a significant part of data scientists’ work. In many environments, this is even the essence of a data scientist’s role. Developing relevant models indeed requires deep reflection on the data; an essential step that should not be overlooked.

This course, like many introductory resources on data science (Wickham, Çetinkaya-Rundel, and Grolemund 2023; VanderPlas 2016; McKinney 2012), will therefore offer a lot of content on data manipulation, an essential skill for data scientists.

Programming software based around the database concept has become the main tool for data scientists. Being able to apply a number of standard operations on databases, regardless of their nature, allows programmers to be more efficient than if they had to repeat these operations manually, as in Excel.

All the dominant programming languages in the data science ecosystem are based on the dataframe principle. It is even a central object in some software, notably R. The logic of SQL, a language for declaring data operations that has been around for over fifty years, provides a relevant framework for performing standardized operations on columns (creating new columns, selecting subsets of rows, etc.).

However, the dataframe only recently became established in Python, thanks to the Pandas package created by Wes McKinney. The rise of the Pandas library (downloaded over 5 million times per day in 2023) is largely responsible for Python’s success in the data science ecosystem and has led, in just a few years, to a complete renewal of how coding in Python, such a flexible language, is approached for data analysis.

This part of the course is a general introduction to the rich ecosystem of data manipulation with Python. These chapters cover both data retrieval and the restructuring and analysis of that data.

Summary of that section

Pandas has become essential in the Python ecosystem for data science. Pandas itself is built on top of the Numpy package, which is useful to understand to be comfortable with Pandas. Numpy is a low-level library for storing and manipulating data. Numpy is at the heart of the data science ecosystem because most libraries, even those handling unstructured objects, use objects built from Numpy1.

The Pandas approach, which provides a unified entry point for manipulating datasets of very different natures, has been extended to geographic objects with Geopandas. This allows for the manipulation of geographic data as if it were classic structured data. Geographic data and cartographic representation are becoming increasingly common with the rise of open localized data and geolocated big-data.

However, structured data imported from flat files is not the only data source. APIs and web scraping allow for flexible downloading or extraction of data from web pages or specialized portals. These data, particularly those obtained through web scraping, often require a bit more data cleaning work, especially with character strings.

The Pandas ecosystem thus represents a Swiss army knife for data analysis. This is why this course will cover it extensively. Before trying to implement an ad hoc solution, it is often useful to ask the following question: “Could I do this with the basic functionalities of Pandas?” Asking this question can prevent arduous paths and save a lot of time.

However, Pandas is not suitable for handling large volumes of data. To process such data, it is recommended to use Polars or Dask, which adopt the logic of Pandas but optimize its functionality, Spark if you have suitable infrastructure, generally in big data environments, or DuckDB if you are willing to use SQL queries rather than a high-level library.

Exercises

This section provides both detailed tutorials and guided exercises.

You can view them on this site or use one of the badges at the beginning of the chapter, for example these to open the Pandas exercises chapter:

View on GitHub Onyxia Onyxia Open In Colab

Going further

This course does not really address issues of volume or speed of computation. Pandas can show its limits in this area with large datasets (several gigabytes).

It is therefore interesting to consider:

References

Here is a selective bibliography of interesting books complementary to the chapters in the “Manipulation” section of this course:

McKinney, Wes. 2012. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. " O’Reilly Media, Inc.".
VanderPlas, Jake. 2016. Python Data Science Handbook: Essential Tools for Working with Data. " O’Reilly Media, Inc.".
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. " O’Reilly Media, Inc.".

Informations additionnelles

environment files have been tested on.

Latest built version: 2025-03-19

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0

View file history

SHA Date Author Description
5f08b572 2024-08-29 10:33:57 Lino Galiana Traduction de l’introduction (#551)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
1f23de28 2023-12-01 17:25:36 Lino Galiana Stockage des images sur S3 (#466)
69cf52bd 2023-11-21 16:12:37 Antoine Palazzolo [On-going] Suggestions chapitres modélisation (#452)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
80823022 2023-08-25 17:48:36 Lino Galiana Mise à jour des scripts de construction des notebooks (#395)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
5d4874a8 2023-08-11 15:09:33 Lino Galiana Pimp les introductions des trois premières parties (#387)
8e5edba6 2022-09-02 11:59:57 Lino Galiana Ajoute un chapitre dask (#264)
f10815b5 2022-08-25 16:00:03 Lino Galiana Notebooks should now look more beautiful (#260)
d201e3cd 2022-08-03 15:50:34 Lino Galiana Pimp la homepage ✨ (#249)
12965bac 2022-05-25 15:53:27 Lino Galiana :launch: Bascule vers quarto (#226)
5cac236e 2021-12-16 19:46:43 Lino Galiana un petit mot sur mercator (#201)
4cdb759c 2021-05-12 10:37:23 Lino Galiana :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
0a0d0348 2021-03-26 20:16:22 Lino Galiana Ajout d’une section sur S3 (#97)
4677769b 2020-09-15 18:19:24 Lino Galiana Nettoyage des coquilles pour premiers TP (#37)
d48e68fa 2020-09-08 18:35:07 Lino Galiana Continuer la partie pandas (#13)
913047d3 2020-09-08 14:44:41 Lino Galiana Harmonisation des niveaux de titre (#17)
Back to top

Footnotes

  1. Some libraries are gradually moving away from Numpy, which is not always the most suitable for managing certain types of data. The Arrow framework is becoming the lower layer used by more and more data science libraries. This blog post provides a detailed explanation of this topic.↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.