This introduction presents the course objectives, the pedagogical approach, the main theme of the course, as well as the practical details.
This course brings together all the material from the
Python for Data Science class I’ve been teaching at ENSAE since 2020. Each year, around 190 students take this course. In 2024, an English version—equivalent to the French original—was gradually introduced. It is designed as an introductory data science course for European statistical institutes, following a European call for projects.
The site (pythonds.linogaliana.fr) serves as the main hub for the course. It gathers all course content, including practical assignments and additional materials aimed at continuing education. The course is open source, and I welcome suggestions for improvement either on GitHub or in the comments section at the bottom of each page.
Because Python is a living, fast-evolving language, the course is continuously updated to reflect the changing data science ecosystem. At the same time, it strives to differentiate lasting trends from short-lived fads.
You can find more information in the introductory slides. More advanced topics are covered in another course focused on deploying data science projects to production, which I co-teach with Romain Avouac in the final year at ENSAE (ensae-reproductibilite.github.io/website).
Website Architecture
This course offers comprehensive tutorials and exercises that can be read directly on the site or edited and run in an interactive Jupyter Notebook environment (see the next chapter for details).
Each page is built around a concrete problem and introduces a general approach to solving it. All examples are based on open data and are fully reproducible.
You can navigate the site using the table of contents or the previous/next links at the bottom of each page. Some sections - such as the one on modeling - include highlighted examples that illustrate the methodology and present different possible approaches to solving the same problem.
2 Why Python ?
Python, with its instantly recognizable logo, is a language that’s been around for over thirty years. But it was in the 2010s that it experienced a major resurgence, driven by the growing popularity of data science.
More than any other language, Python brings together a wide range of communities: statisticians, application developers, IT infrastructure managers, high school students (it has been part of the French baccalaureate curriculum for several years), and researchers in both theoretical and applied fields.
Unlike many programming languages that cater to relatively homogeneous communities, Python has succeeded in uniting diverse users thanks to a few key principles: its readable syntax, the simplicity of using modules, the ease of integration with more powerful languages for specific tasks, and the vast amount of online documentation. Sometimes, being the second-best tool for a task—while offering a broader set of advantages—can be the key to success.
Python’s success story is closely tied to the rise of the data scientist role, a profile capable of working across the entire data processing pipeline. In the Harvard Business Review, Davenport and Patil (2012) famously called it “the sexiest job of the 21st century.” A decade later, the same authors provided a full update on the evolving expectations for data scientists (Davenport and Patil 2022).
But it’s not only data scientists who need to use Python. In the broader ecosystem of data-related professions—data scientists, data engineers, ML engineers, and more—Python serves as a kind of Tower of Babel, enabling collaboration among interdependent roles.
This course introduces various tools that use Python to connect data with theoretical concepts from statistics and the economic and social sciences. However, it goes beyond a simple introduction to the language: it regularly reflects on both the strengths and the limitations of Python in meeting operational and scientific needs.
3 Why use Python for data analysis?
This question is slightly different: if Python is already a popular language for learning programming due to its ease of use, how did it also become the dominant language in the data and AI ecosystem?
Python first gained traction in the data science world for offering tools to train machine learning algorithms, even before such approaches became mainstream. Of course, the success of libraries like Scikit-Learn, TensorFlow, and more recently PyTorch, played a major role in Python’s adoption by the data science community. However, reducing Python to a handful of machine learning libraries would be overly simplistic. It is truly a Swiss Army knife for data scientists, social scientists, economists, and data practitioners of all kinds. Its success is not only due to offering the right tools at the right time, but also because the language itself offers real advantages for newcomers to data work.
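To give a concrete flavor of what made these libraries so attractive, here is a minimal sketch of the fit/predict workflow shared by virtually every Scikit-Learn model. The toy dataset and the choice of model are illustrative assumptions, not material from the course itself:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A toy dataset bundled with Scikit-Learn, chosen purely for illustration
X, y = load_iris(return_X_y=True)

# Hold out a test set so the evaluation is honest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The same fit/predict interface applies to nearly every model in the library
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```

This uniform interface is a large part of why the library lowered the barrier to entry: swapping one model for another rarely requires changing more than one line.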
What makes Python appealing is its central role in a broader ecosystem of powerful, flexible, and open-source tools. Like R, it belongs to a category of languages suitable for everyday use across a wide variety of tasks. In many of the fields covered in this course, Python has by far the richest and most accessible ecosystem. Unlike other popular languages such as JavaScript or Rust, it has a very gentle learning curve, allowing users to write high-quality code quickly - provided they learn the right habits, which this course (and the companion course on production workflows) aims to instill.
Beyond AI projects, Python is also indispensable for retrieving data via APIs or through web scraping, two techniques introduced early in the course. In areas like tabular data analysis, web publishing or data visualization, Python now offers an ecosystem comparable to R’s, thanks in part to growing investment from Posit, which has ported many of R’s most successful libraries—such as ggplot—to Python.
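As a glimpse of the API workflow introduced early in the course, here is a minimal sketch using the requests library. The endpoint and its query parameters are hypothetical placeholders, not an API actually used in the course:

```python
import requests

# Hypothetical open-data endpoint, used here only to illustrate the pattern
url = "https://api.example.org/v1/datasets"

# Most open-data APIs accept query parameters and return JSON
response = requests.get(url, params={"q": "co2", "page_size": 5}, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

# JSON maps naturally onto Python dicts and lists
payload = response.json()
for dataset in payload.get("data", []):
    print(dataset.get("title"))
```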
Note 3.1: Why discuss AI so little in a Python course?
While a significant portion of this course covers machine learning and related algorithms, I tend to resist the current trend - especially strong since the release of ChatGPT in late 2022 - of labeling everything as “AI”.
First, because the term is vague, overused, and often exploited for marketing purposes, capitalizing on its symbolic power drawn from science fiction to sell miracle solutions or stoke fear.
Second, because the term “AI” covers a vast range of possible methods, depending on how broadly we define it. The sections on modeling and NLP in this course, which are the closest to the AI field, focus on learning-based methods. But as definitions from Russell and Norvig (2020) or the European AI Act show, artificial intelligence encompasses much more:
The study of [intelligent] agents that perceive their environment and act upon it. Each such agent is implemented by a function that maps perceptions to actions. We study different ways to define this function, such as production systems, reactive agents, logical planners, neural networks, and decision-theoretic systems.
“AI system” means a machine-based system designed to operate with varying levels of autonomy and capable of adapting after deployment. It infers, based on its inputs, how to generate outputs—such as predictions, content, recommendations, or decisions—that can influence physical or virtual environments.
Finally, there’s also a pedagogical reason. Since 2023, “AI” has largely become synonymous with generative AI. But to understand how this radically different paradigm works - and to implement meaningful, value-driven generative AI projects - one must first understand the foundations and limitations of the machine learning approach. Otherwise, there’s a risk of building overly complex solutions for simple problems or misjudging the value of generative models compared to more traditional methods. Since this is an introductory course, I’ve chosen to focus on machine learning and introductory NLP, deep enough to be meaningful, while leaving it to the curious to explore generative AI further on their own.
That said, this course does not aim to stir up the sterile debate between R and Python. The two languages share far more than they differ, and best practices are often transferable between them. This idea is explored more deeply in the advanced course I co-teach with Romain Avouac (ensae-reproductibilite.github.io/website).
In practice, data scientists and researchers in social sciences or economics increasingly use R and Python interchangeably. This course will frequently draw analogies between the two, to help learners already familiar with R transition smoothly to Python.
4 Why learn Python when code-generating AIs exist?
Code assistants like Copilot and ChatGPT have fundamentally transformed software development. These tools are now part of the everyday toolkit of a data scientist, offering remarkable convenience by generating Python code from more or less well-specified instructions. Trained on massive amounts of publicly available code—and often fine-tuned for solving development tasks—they can be extremely helpful. The concept of vibe coding even pushes this further, aiming to let large language models (LLMs) take initiative without requiring human intermediaries to access the computational resources needed to run the code they generate.
So, if AIs can now generate code, why should we still learn how to code?
Because coding is not just about writing lines of code. It’s about understanding a problem, crafting a step-by-step strategy to tackle it, considering multiple solutions and trade-offs (e.g., speed, simplicity), testing and debugging. Code is a means to an engineering end. While AIs are very good at generating code, relating problems to known patterns, and even translating solutions across languages into Python, that’s only part of the picture.
Working with data is first and foremost an engineering process. Code is not the goal—it’s the tool that supports structured reasoning toward solving real-world problems. Just like an engineer designs a bridge to meet a practical need, a data scientist begins with an operational goal—such as building a recommendation system, evaluating the impact of a product launch, or forecasting sales—and reformulates it into an analytical task. This means translating scientific or business ideas into a set of questions, then breaking those down into logical steps, each of which can be executed by a computer.
In this context, an LLM can act as a valuable assistant—but only when the problem is well formulated. If the task is vague or ill-defined, the model’s answers will be approximate or even useless. On standard problems, the results may appear accurate. But for more specific, non-standard tasks, it often becomes necessary to iterate, refine the prompt, reframe the problem… and sometimes still fail to get a satisfactory result. Not because the model is poor, but because good problem formulation—the essence of problem engineering—makes all the difference.
For instance, uv saw rapid adoption in 2025, as did ruff the year before. It will be some time before generative AIs suggest this environment manager on their own rather than poetry. The existence of generative AIs therefore does not exempt us from keeping up an active technical watch and staying alert to changes in practices.
5 Course Objectives
5.1 Introducing the data science approach
This course is intended for practitioners of data science, understood here in the broadest sense as the combination of techniques from mathematics, statistics, and computer science to extract useful knowledge from data.
Since data science is not only an academic discipline but also a practical field aimed at achieving operational goals, learning its main tool—namely, the Python programming language—goes hand in hand with adopting a rigorous, scientific approach to data.
The objective of this course is to explore how to approach a dataset, identify and address common challenges, develop appropriate solutions, and reflect on their broader implications. It is therefore not merely a course about a technical tool, disconnected from scientific reasoning, but one rooted in understanding data through both technical and conceptual lenses.
Do I need a math background for this course?
This course assumes you are interested in using data-intensive Python within a rigorous statistical framework. It does not delve deeply into the statistical or algorithmic foundations of the techniques presented - many of which are covered in dedicated courses, particularly at ENSAE.
That said, not being familiar with these concepts should not prevent you from following this course. More advanced ideas are typically introduced separately, in dedicated callout boxes. Thanks to Python's ease of use, you will not need to implement complex models from scratch - making it possible to apply techniques even if you are not an expert in the underlying theory. What is important, however, is having enough understanding to correctly interpret the results.
Still, while Python makes it relatively easy to run sophisticated models, it is very helpful to have some perspective before diving into modeling. That explains why modeling appears later in the course: in addition to relying on advanced statistical concepts, effective modeling also requires a solid understanding of the data. You need to identify key patterns and assess whether your data fits the assumptions of the model. Without this foundation, it is difficult to build models that are truly meaningful or reliable.
5.2 Reproducibility
This course places strong emphasis on reproducibility. This principle is reflected in several ways. First and foremost, by ensuring that all examples and exercises can be run and tested using Jupyter notebooks.
All content on the website is designed to be reproducible across different computing environments. Of course, you’re free to copy and paste code snippets directly from the site using the button available at the top of each code block.
Click on the button to copy this content and paste it elsewhere.
However, since this site includes many examples, constantly switching between a Python environment and the website can become tedious. To make things easier, each chapter can be downloaded as a Jupyter notebook using the buttons provided at the top of each page.
For example, here are the buttons for the first chapter on Pandas:
Recommendations on the best environments for using these notebooks are provided in the next chapter.
The focus on reproducibility is also reflected in the choice of examples used throughout the course. All content on this site is based on open data, sourced either from French platforms - primarily the centralized portal data.gouv, which aggregates public datasets from French institutions, or the official statistics agency Insee, France’s national institute for statistics and economic studies - or from U.S. datasets. This ensures that results are reproducible for anyone working in an identical environment.
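In practice, this reproducibility takes a very simple form: examples read open data straight from a URL, so there is no manual download step to replicate. The address below is a hypothetical stand-in for the open datasets actually used in the course:

```python
import pandas as pd

# Hypothetical URL standing in for an open dataset
# (the course uses files hosted on portals such as data.gouv.fr)
url = "https://www.example.org/open-data/sample.csv"

# Reading directly from the URL keeps the example fully reproducible
df = pd.read_csv(url)
print(df.head())
```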
Note
American researchers have described a reproducibility crisis in the field of machine learning. The distortions of the scientific publishing ecosystem - combined with the economic incentives driving academic publications in machine learning - are often cited as major contributing factors.
However, university education also bears a share of the responsibility. Students and researchers are rarely trained in the principles of reproducibility, and if these practices are not introduced early in their careers, they are unlikely to adopt them later. This is why, in addition to teaching Python and data science, this course includes a dedicated section on using version control with Git.
All student projects are required to be open source—one of the most effective ways for instructors to encourage high-quality, transparent, and reproducible code.
5.3 Assessment
Students at ENSAE complete the course by working on an in-depth project. Details on how the course is assessed, along with a list of past student projects, can be found in the Evaluation section.
6 Course Outline
This course serves as an introduction to the core challenges of data science through learning the Python programming language. As the term “data science” implies, a significant portion of the course is dedicated to working directly with data: retrieving it, structuring it, exploring it, and combining it.
These topics are covered in the first part of the course, “Manipulating Data”, which lays the foundation for everything that follows. Unfortunately, many training programs in data science, applied statistics, or the economic and social sciences tend to overlook this crucial aspect of a data scientist’s work—often referred to as “data wrangling” or “feature engineering”. And yet, not only does it represent a large share of the day-to-day work in data science, it’s also essential for building relevant and accurate models.
The goal of this first section is to highlight the challenges involved in accessing and leveraging different types of data sources using Python. The examples are diverse, reflecting the variety of data that can be analyzed with Python: municipal \(CO_2\) emissions in France, real estate transaction records, housing energy performance diagnostics, bike-sharing data from the Velib system, and more.
The second part of the course focuses on creating visualizations with Python. Once your data has been cleaned and processed, you’ll typically want to summarize it—through tables, graphs, or maps. This part, “Communicating with Python”, offers a concise introduction to the topic. While somewhat introductory, it provides essential concepts that will be reinforced later in the course.
The third part centers on modeling, using electoral science as the main example (“Modeling with Python”). This section introduces the scientific reasoning behind machine learning, explores both methodological and technical choices, and sets the stage for deeper topics addressed later in the program.
The fourth part takes a step back to focus on the specific challenges of working with text data. This is the “Introduction to Natural Language Processing (NLP) with Python” chapter. Given that NLP is a rapidly evolving field, this section serves only as an introduction. For more advanced coverage, see Russell and Norvig (2020), chapter 24.
The course also includes a section on version control with Git (Discover Git). Why include this in a course about Python? Because learning Git helps you write better code, collaborate effectively, and test or share your work in reproducible environments. This is especially valuable in a world where platforms like GitHub act as professional showcases—and where companies and public institutions increasingly expect their data scientists to be proficient with Git.
Davenport, Thomas H., and DJ Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90.
———. 2022. “Is Data Scientist Still the Sexiest Job of the 21st Century?” Harvard Business Review 90.
Russell, Stuart J., and Peter Norvig. 2020. Artificial Intelligence: A Modern Approach (4th Edition). Pearson. http://aima.cs.berkeley.edu/.
Additional information
Below is the environment this site was built with and tested on.
Latest built version: 2025-06-18
Python version used:
'3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0]'
| Package | Version |
|---------|---------|
| affine | 2.4.0 |
| aiobotocore | 2.22.0 |
| aiohappyeyeballs | 2.6.1 |
| aiohttp | 3.11.18 |
| aioitertools | 0.12.0 |
| aiosignal | 1.3.2 |
| altair | 5.4.1 |
| annotated-types | 0.7.0 |
| anyio | 4.9.0 |
| appdirs | 1.4.4 |
| argon2-cffi | 25.1.0 |
| argon2-cffi-bindings | 21.2.0 |
| arrow | 1.3.0 |
| asttokens | 3.0.0 |
| async-lru | 2.0.5 |
| attrs | 25.3.0 |
| babel | 2.17.0 |
| beautifulsoup4 | 4.13.4 |
| black | 24.8.0 |
| bleach | 6.2.0 |
| blis | 1.3.0 |
| boto3 | 1.37.3 |
| botocore | 1.37.3 |
| branca | 0.8.1 |
| Brotli | 1.1.0 |
| bs4 | 0.0.2 |
| cartiflette | 0.0.3 |
| Cartopy | 0.24.1 |
| catalogue | 2.0.10 |
| cattrs | 24.1.3 |
| certifi | 2025.4.26 |
| cffi | 1.17.1 |
| charset-normalizer | 3.4.2 |
| chromedriver-autoinstaller | 0.6.4 |
| click | 8.2.1 |
| click-plugins | 1.1.1 |
| cligj | 0.7.2 |
| cloudpathlib | 0.21.1 |
| comm | 0.2.2 |
| commonmark | 0.9.1 |
| confection | 0.1.5 |
| contextily | 1.6.2 |
| contourpy | 1.3.2 |
| cycler | 0.12.1 |
| cymem | 2.0.11 |
| dataclasses-json | 0.6.7 |
| debugpy | 1.8.14 |
| decorator | 5.2.1 |
| defusedxml | 0.7.1 |
| diskcache | 5.6.3 |
| duckdb | 1.3.0 |
| en_core_web_sm | 3.8.0 |
| et_xmlfile | 2.0.0 |
| executing | 2.2.0 |
| fastexcel | 0.14.0 |
| fastjsonschema | 2.21.1 |
| fiona | 1.10.1 |
| folium | 0.19.6 |
| fontawesomefree | 6.6.0 |
| fonttools | 4.58.0 |
| fqdn | 1.5.1 |
| fr_core_news_sm | 3.8.0 |
| frozenlist | 1.6.0 |
| fsspec | 2025.5.0 |
| geographiclib | 2.0 |
| geopandas | 1.0.1 |
| geoplot | 0.5.1 |
| geopy | 2.4.1 |
| graphviz | 0.20.3 |
| great-tables | 0.12.0 |
| greenlet | 3.2.2 |
| h11 | 0.16.0 |
| htmltools | 0.6.0 |
| httpcore | 1.0.9 |
| httpx | 0.28.1 |
| httpx-sse | 0.4.0 |
| idna | 3.10 |
| imageio | 2.37.0 |
| importlib_metadata | 8.7.0 |
| importlib_resources | 6.5.2 |
| inflate64 | 1.0.1 |
| ipykernel | 6.29.5 |
| ipython | 9.3.0 |
| ipython_pygments_lexers | 1.1.1 |
| ipywidgets | 8.1.7 |
| isoduration | 20.11.0 |
| jedi | 0.19.2 |
| Jinja2 | 3.1.6 |
| jmespath | 1.0.1 |
| joblib | 1.5.1 |
| json5 | 0.12.0 |
| jsonpatch | 1.33 |
| jsonpointer | 3.0.0 |
| jsonschema | 4.23.0 |
| jsonschema-specifications | 2025.4.1 |
| jupyter | 1.1.1 |
| jupyter-cache | 1.0.0 |
| jupyter_client | 8.6.3 |
| jupyter-console | 6.6.3 |
| jupyter_core | 5.7.2 |
| jupyter-events | 0.12.0 |
| jupyter-lsp | 2.2.5 |
| jupyter_server | 2.16.0 |
| jupyter_server_terminals | 0.5.3 |
| jupyterlab | 4.4.3 |
| jupyterlab_pygments | 0.3.0 |
| jupyterlab_server | 2.27.3 |
| jupyterlab_widgets | 3.0.15 |
| kaleido | 0.2.1 |
| kiwisolver | 1.4.8 |
| langchain | 0.3.25 |
| langchain-community | 0.3.9 |
| langchain-core | 0.3.61 |
| langchain-text-splitters | 0.3.8 |
| langcodes | 3.5.0 |
| langsmith | 0.1.147 |
| language_data | 1.3.0 |
| lazy_loader | 0.4 |
| loguru | 0.7.3 |
| lxml | 5.4.0 |
| mapclassify | 2.8.1 |
| marisa-trie | 1.2.1 |
| Markdown | 3.8 |
| markdown-it-py | 3.0.0 |
| MarkupSafe | 3.0.2 |
| marshmallow | 3.26.1 |
| matplotlib | 3.10.3 |
| matplotlib-inline | 0.1.7 |
| mdurl | 0.1.2 |
| mercantile | 1.2.1 |
| mistune | 3.1.3 |
| mizani | 0.11.4 |
| multidict | 6.4.4 |
| multivolumefile | 0.2.3 |
| murmurhash | 1.0.13 |
| mypy_extensions | 1.1.0 |
| narwhals | 1.40.0 |
| nbclient | 0.10.0 |
| nbconvert | 7.16.6 |
| nbformat | 5.10.4 |
| nest-asyncio | 1.6.0 |
| networkx | 3.4.2 |
| nltk | 3.9.1 |
| notebook | 7.4.3 |
| notebook_shim | 0.2.4 |
| numpy | 2.2.6 |
| openpyxl | 3.1.5 |
| orjson | 3.10.18 |
| outcome | 1.3.0.post0 |
| overrides | 7.7.0 |
| OWSLib | 0.33.0 |
| packaging | 24.2 |
| pandas | 2.2.3 |
| pandocfilters | 1.5.1 |
| parso | 0.8.4 |
| pathspec | 0.12.1 |
| patsy | 1.0.1 |
| Pebble | 5.1.1 |
| pexpect | 4.9.0 |
| pillow | 11.2.1 |
| pip | 25.1.1 |
| platformdirs | 4.3.8 |
| plotly | 6.1.2 |
| plotnine | 0.13.6 |
| polars | 1.8.2 |
| preshed | 3.0.9 |
| prometheus_client | 0.22.1 |
| prompt_toolkit | 3.0.51 |
| propcache | 0.3.1 |
| psutil | 7.0.0 |
| ptyprocess | 0.7.0 |
| pure_eval | 0.2.3 |
| py7zr | 0.22.0 |
| pyarrow | 17.0.0 |
| pybcj | 1.0.6 |
| pycparser | 2.22 |
| pycryptodomex | 3.23.0 |
| pydantic | 2.11.5 |
| pydantic_core | 2.33.2 |
| pydantic-settings | 2.9.1 |
| Pygments | 2.19.1 |
| pynsee | 0.1.8 |
| pyogrio | 0.11.0 |
| pyparsing | 3.2.3 |
| pyppmd | 1.1.1 |
| pyproj | 3.7.1 |
| pyshp | 2.3.1 |
| PySocks | 1.7.1 |
| python-dateutil | 2.9.0.post0 |
| python-dotenv | 1.0.1 |
| python-json-logger | 3.3.0 |
| python-magic | 0.4.27 |
| pytz | 2025.2 |
| pywaffle | 1.1.1 |
| PyYAML | 6.0.2 |
| pyzmq | 26.4.0 |
| pyzstd | 0.17.0 |
| rasterio | 1.4.3 |
| referencing | 0.36.2 |
| regex | 2024.11.6 |
| requests | 2.32.3 |
| requests-cache | 1.2.1 |
| requests-toolbelt | 1.0.0 |
| retrying | 1.3.4 |
| rfc3339-validator | 0.1.4 |
| rfc3986-validator | 0.1.1 |
| rich | 14.0.0 |
| rpds-py | 0.25.1 |
| rtree | 1.4.0 |
| s3fs | 2025.5.0 |
| s3transfer | 0.11.3 |
| scikit-image | 0.24.0 |
| scikit-learn | 1.6.1 |
| scipy | 1.13.0 |
| seaborn | 0.13.2 |
| selenium | 4.33.0 |
| Send2Trash | 1.8.3 |
| setuptools | 80.8.0 |
| shapely | 2.1.1 |
| shellingham | 1.5.4 |
| six | 1.17.0 |
| smart-open | 7.1.0 |
| sniffio | 1.3.1 |
| sortedcontainers | 2.4.0 |
| soupsieve | 2.7 |
| spacy | 3.8.4 |
| spacy-legacy | 3.0.12 |
| spacy-loggers | 1.0.5 |
| SQLAlchemy | 2.0.41 |
| srsly | 2.5.1 |
| stack-data | 0.6.3 |
| statsmodels | 0.14.4 |
| tabulate | 0.9.0 |
| tenacity | 9.1.2 |
| terminado | 0.18.1 |
| texttable | 1.7.0 |
| thinc | 8.3.6 |
| threadpoolctl | 3.6.0 |
| tifffile | 2025.5.24 |
| tinycss2 | 1.4.0 |
| topojson | 1.9 |
| tornado | 6.5.1 |
| tqdm | 4.67.1 |
| traitlets | 5.14.3 |
| trio | 0.30.0 |
| trio-websocket | 0.12.2 |
| typer | 0.15.3 |
| types-python-dateutil | 2.9.0.20250516 |
| typing_extensions | 4.13.2 |
| typing-inspect | 0.9.0 |
| typing-inspection | 0.4.1 |
| tzdata | 2025.2 |
| Unidecode | 1.4.0 |
| uri-template | 1.3.0 |
| url-normalize | 2.2.1 |
| urllib3 | 2.4.0 |
| wasabi | 1.1.3 |
| wcwidth | 0.2.13 |
| weasel | 0.4.1 |
| webcolors | 24.11.1 |
| webdriver-manager | 4.0.2 |
| webencodings | 0.5.1 |
| websocket-client | 1.8.0 |
| widgetsnbextension | 4.0.14 |
| wordcloud | 1.9.3 |
| wrapt | 1.17.2 |
| wsproto | 1.2.0 |
| xlrd | 2.0.1 |
| xyzservices | 2025.4.0 |
| yarl | 1.20.0 |
| yellowbrick | 1.5 |
| zipp | 3.21.0 |
Scikit-Learn is a library developed since 2007 by French public research labs at INRIA. Open source from the outset, it is now maintained by :probabl., a startup created to manage the Scikit ecosystem, bringing together some of the INRIA research teams behind the core of the modern machine learning stack.
TensorFlow was developed internally at Google and made public in 2015. Although now less widely used - partly due to the rise of PyTorch - it played a major role in popularizing neural networks in both research and production during the 2010s.
PyTorch was developed by Meta starting in 2018 and has been governed by the PyTorch Foundation since 2022. It is now the most widely used framework for training neural networks.↩︎
In the domains of API access and web scraping, JavaScript is Python’s most serious competitor. However, its community is more focused on web development than on data science.↩︎
Tabular data refers to structured data organized in tables that map observations to variables. This structure contrasts with unstructured data like free text, images, audio, or video. In unstructured data analysis, Python dominates. For tabular data, Python’s advantage is less clear - especially compared to R - but both languages now offer similar capabilities. We will regularly draw parallels between them in the chapters on the Pandas library.↩︎
On this topic, see Thomas Wolf’s blog post The Einstein AI model. Although the post focuses on disruptive innovation and pays less attention to incremental progress, it offers useful insight: LLMs—despite bold predictions from tech influencers—are still just tools. They may excel at standardized tasks, but for now, they remain assistants.↩︎
Jupyter notebooks are interactive documents that allow you to combine code, text, and visualizations in a single file. They’re widely used in data science and education to make code both readable and executable.↩︎
Opening the chapters as notebooks in standardized environments - something explained in the next chapter - ensures you are working in a controlled setup. Personal Python installations often involve tweaks and adjustments that can alter your environment and lead to unexpected, hard-to-diagnose errors. For this reason, such local setups are not recommended for this course. As you’ll see in the next chapter, cloud-based environments offer the advantage of consistent, preconfigured setups that greatly improve reliability and ease of use.↩︎
Citation
BibTeX citation:
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}