Lino Galiana
2025-03-19
This course gathers all the content of the course Python for Data Science that I have been teaching at ENSAE since 2018. This course was previously taught by Xavier Dupré. About 170 students take this course each year. In 2024, a gradual introduction of an English version equivalent to the French version began, aimed at serving as an introductory course in data science for European statistical institutes thanks to a European call for projects.
This site (pythonds.linogaliana.fr/) is the main entry point for the course. It centralizes
all the content created during the course for practical work or provided additionally for continuing education purposes.
This course is open source
and I welcome suggestions for improvement on Github
or through the comments at the bottom of each page. As Python
is a living and dynamic language, practices evolve and this course continuously adapts to the changing ecosystem of data science, while trying to distinguish lasting practice evolutions from passing trends.
Additional elements are available in the introductory slides. More advanced elements are present in another course dedicated to deploying data science projects that I teach with Romain Avouac in the final year of ENSAE (ensae-reproductibilite.github.io/website).
This course features tutorials and complete exercises. Each page is structured around a concrete problem and presents the generic approach to solving this general problem.
You can navigate the site architecture via the table of contents or through the links to previous or subsequent content at the end of each page. Some sections, notably the one dedicated to modeling, offer extended examples to illustrate the approach in more detail.
Python, with its recognizable logo, is a language that has been around for over thirty years but experienced a renaissance during the 2010s due to the surge of interest in data science.
Python, more than any other programming language, brings together diverse communities: statisticians, developers, application or IT infrastructure managers, high school students (Python has been part of the French baccalaureate program for several years), and researchers in both theoretical and applied fields. Unlike many programming languages with a fairly homogeneous community, Python has managed to bring together a wide range of users thanks to a few central principles: the readability of the language, the ease of using modules, the simplicity of integrating it with more performant languages for specific tasks, the vast amount of documentation available online…
Being the second best language for performing a given
task
can thus be a source of success when competitors do not have
a similarly broad range of advantages.
The success of Python, due to its nature as a Swiss Army knife language, is inseparable from the emergence of the data scientist profile, a role capable of intervening at different levels of value creation from data.
Davenport and Patil (2012), in the Harvard Business Review,
talked about the “sexiest job of the 21st century”
and, ten years later, provided a comprehensive overview of the evolving
skills expected of a data scientist in the same review (Davenport and Patil 2022). It is not only data scientists who are expected to use Python; within the ecosystem of data-related jobs (data scientist, data engineer, ML engineer…), Python serves as a tower of Babel enabling communication between these interdependent profiles.
The richness of Python allows it to be used in all phases of data processing, from retrieving and structuring data from various sources to extracting value from it.
Through the lens of data science, we will see that Python
is
an excellent candidate to assist data scientists in all
aspects of data work.
This course introduces various tools that allow for the connection
of data and theories using Python
. However, this course
goes beyond a simple introduction to the language and provides
more advanced elements, especially on the latest
innovations enabled by data science in work methods.
Python for Data Analysis?

Python is first known in the world of data science for having
is first known in the world of data science for having
provided early on the tools useful for training machine learning algorithms on various types of data. Indeed,
the success of Scikit Learn1, Tensorflow2, or more recently PyTorch3 in the data science community has greatly contributed to the adoption of Python. However,
reducing Python
to a few machine learning libraries would be limiting, as it is
truly a Swiss Army knife for data scientists,
social scientists, or economists. The success story of Python
is not just about having provided machine learning libraries at an opportune time: this
language has real advantages for new data practitioners.
The appeal of Python
is its central role in a
larger ecosystem of powerful, flexible, and open-source tools.
Like R, it belongs to the class
of languages that can be used daily for a wide variety of tasks.
In many areas explored in this course, Python
is, by far,
the programming language offering the most complete and accessible ecosystem.
Beyond machine learning, which we have already discussed, Python
is
indispensable when it comes to retrieving data via
APIs or web scraping4, two approaches that we will explore
in the first part of the course. In the fields of tabular data analysis5,
web content publishing, or graphic production, Python presents an ecosystem increasingly similar to R's, due to the growing investment in the Python community of Posit, the company behind the major R libraries for data science.
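In practice, retrieving data via an API often boils down to fetching a JSON payload and flattening it into a table. A minimal sketch of that pattern with Pandas (the payload is inlined here in place of a live `requests.get(url).json()` call, and the field names are invented for illustration):

```python
import json

import pandas as pd

# In a real workflow, this list of records would come from
# requests.get(url).json() against an open-data API.
payload = json.loads("""
[
  {"city": "Paris", "year": 2023, "value": 12.5},
  {"city": "Lyon", "year": 2023, "value": 8.1}
]
""")

# Flatten the JSON records into a tabular structure:
# one row per record, one column per field.
df = pd.json_normalize(payload)
print(df.shape)  # (2, 3)
```

The same flattening step applies whatever the API, which is why this pattern recurs throughout the first part of the course.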
Nevertheless, these elements are not meant to engage in the sterile debate of R versus Python.
These two languages have many more points of convergence than divergence,
making it very simple to transpose good practices from one
language to the other. This is a point that is discussed more extensively
in the advanced course I teach with Romain Avouac in the final year
at ENSAE: ensae-reproductibilite.github.io/website.
Ultimately, data scientists and researchers in the social sciences or economics will use R or Python almost interchangeably and alternately. This course will regularly present analogies with R to help those discovering Python, but who are already familiar with R, better understand certain points.
This course is aimed at practitioners of data science,
understood here in a broad sense as the combination of techniques from mathematics, statistics, and computer science to produce useful knowledge from data.
As data science is not only a scientific discipline but also aims to provide a set of tools to meet operational objectives, learning the main tool necessary for acquiring knowledge in data science, namely the Python
language, is also an opportunity to discuss the rigorous scientific approach to be adopted when working with data. This course aims to present the approach to handling a dataset, the problems encountered, the solutions to overcome them, and the implications of these solutions. It is therefore not just a course on a technical tool, detached from scientific issues.
This course assumes a desire to use Python
intensively for data analysis within a rigorous statistical framework. It only briefly touches on the statistical or algorithmic foundations behind some of the techniques discussed, which are often the subject of dedicated teachings, particularly at ENSAE.
Not knowing these concepts does not prevent understanding the content of this website, as more advanced concepts are generally presented separately in dedicated boxes. The ease of using Python
avoids the need to program a model oneself, which makes it possible to apply models without being an expert. Knowledge of models will be more necessary for interpreting results.
However, even though it is relatively easy to use complex models with Python
, it is very useful to have some background on them before embarking on a modeling approach. This is one of the reasons why modeling comes later in this course: in addition to involving advanced statistical concepts, it is necessary to have understood the stylized facts in our data to produce relevant modeling. A thorough understanding of data structure and its alignment with model assumptions is essential for building high-quality models.
This course places a central emphasis on the concept of reproducibility. This requirement is reflected in various ways throughout this teaching, primarily by ensuring that all examples and exercises in this course can be tested using Jupyter
notebooks6.
The entire content of the website is reproducible in various computing environments. It is, of course, possible to copy and paste the code snippets present on this site, using the button above the code examples:
x = "Try to copy-paste me"
However, since this site presents many examples, the back-and-forth between a Python testing environment and this site could be cumbersome. Each chapter is therefore easily retrievable as a Jupyter
notebook via buttons at the beginning of each page. For example, here are those buttons for the Numpy
tutorial:
Recommendations regarding the preferred environments for using these notebooks are deferred to the next chapter.
The requirement for reproducibility is also evident
in the choice of examples used in this course.
All content on this site relies on open data, whether it is French data (mainly
from the centralizing platform data.gouv
or the
website of Insee) or American data. Results are therefore reproducible for someone
with an identical environment7.
American researchers have discussed a reproducibility crisis in the field of machine learning (Kapoor and Narayanan 2022). Issues with the scientific publishing ecosystem and the economic stakes behind academic publications in the field of machine learning are prominent factors that may explain this.
However, academic teaching also bears a responsibility
in this area. Students and researchers are not trained in these topics, and if they
do not adopt this requirement early in their careers, they may not be encouraged to do so later. For this reason, in addition to training in Python
and data science, this course
introduces the use of
the version control software Git
in a dedicated section.
All student projects must be open source, which is one of the best ways
for a teacher to ensure that students produce quality code.
ENSAE students validate the course through an in-depth project. Details about the course assessment, as well as a list of previously completed projects, are available in the Assessment section.
This course is an introduction to the issues of data science through the learning of the Python
language. As the term “data science” suggests, a significant part of this course is dedicated to working with data: retrieval, structuring, exploration, and linking. This is the subject of the first part of the course
“Manipulating Data”, which serves as the foundation for the rest of the course. Unfortunately, many programs in data science, applied statistics, or the social and economic sciences overlook this part of the data scientist’s work, sometimes referred to as “data wrangling” or “feature engineering”, which, in addition to representing a significant portion of the data scientist’s work, is essential for building a relevant model.
The goal of this part is to illustrate the challenges related to retrieving various types of data sources and their exploitation using Python
. The examples will be varied to illustrate the richness of the data that can be analyzed with Python
: municipal \(CO_2\) emission statistics, real estate transaction data, energy diagnostics of housing, Vélib station attendance data…
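As a foretaste of this wrangling work, a typical structuring step with Pandas: aggregating raw records into a usable statistic. The records below are made up for illustration, not the actual datasets mentioned above:

```python
import pandas as pd

# Toy records standing in for raw open data
df = pd.DataFrame({
    "commune": ["A", "A", "B", "B", "B"],
    "sector": ["housing", "transport", "housing", "transport", "industry"],
    "emissions": [10.0, 25.0, 7.0, 18.0, 40.0],
})

# Aggregate emissions per commune: a basic data-structuring operation
total = df.groupby("commune")["emissions"].sum()
print(total)  # A: 35.0, B: 65.0
```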
The second part is dedicated to producing visualizations with Python
. After retrieving and cleaning data, one generally wants to synthesize it through tables, graphics, or maps. This part is a brief introduction to this topic (“Communicating with Python
”). Being quite introductory, the goal of this part is mainly to provide some concepts that will be consolidated later.
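The kind of synthesis covered in that part can be as simple as a bar chart built with Matplotlib (the values are invented for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

values = {"A": 35, "B": 65}

# A minimal graphic: one bar per category, with a title and axis label
fig, ax = plt.subplots()
ax.bar(values.keys(), values.values())
ax.set_title("Emissions per commune (toy data)")
ax.set_ylabel("Emissions")
fig.savefig("emissions.png")
```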
The third part is dedicated to modeling through the example of electoral science (“Modeling with Python
”). The goal of this part is to illustrate the scientific approach of machine learning, the related methodological and technical choices, and to open up to the following issues that will be discussed in the rest of the university curriculum.
The fourth part of the course takes a step aside to focus on specific issues related to the exploitation of textual data. This is the chapter on “Introduction to Natural Language Processing (NLP) with Python
”. As this research field is particularly active, this chapter offers only an introduction to the subject. For further reading, refer to Russell and Norvig (2020), chapter 24.
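To give a flavour of that part: before reaching for dedicated libraries such as spaCy or NLTK, the most basic NLP operation, counting word frequencies, takes a few lines of standard Python:

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox"

# Tokenize naively on whitespace, then count occurrences of each token
tokens = text.lower().split()
counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 3), ('fox', 2)]
```

Real corpora require more careful tokenization and normalization, which is precisely what the dedicated chapters introduce.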
Below is the computing environment these files have been tested on.
Latest built version: 2025-03-19
Python version used:
'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package | Version |
---|---|
affine | 2.4.0 |
aiobotocore | 2.21.1 |
aiohappyeyeballs | 2.6.1 |
aiohttp | 3.11.13 |
aioitertools | 0.12.0 |
aiosignal | 1.3.2 |
alembic | 1.13.3 |
altair | 5.4.1 |
aniso8601 | 9.0.1 |
annotated-types | 0.7.0 |
anyio | 4.8.0 |
appdirs | 1.4.4 |
archspec | 0.2.3 |
asttokens | 2.4.1 |
attrs | 25.3.0 |
babel | 2.17.0 |
bcrypt | 4.2.0 |
beautifulsoup4 | 4.12.3 |
black | 24.8.0 |
blinker | 1.8.2 |
blis | 1.2.0 |
bokeh | 3.5.2 |
boltons | 24.0.0 |
boto3 | 1.37.1 |
botocore | 1.37.1 |
branca | 0.7.2 |
Brotli | 1.1.0 |
bs4 | 0.0.2 |
cachetools | 5.5.0 |
cartiflette | 0.0.2 |
Cartopy | 0.24.1 |
catalogue | 2.0.10 |
cattrs | 24.1.2 |
certifi | 2025.1.31 |
cffi | 1.17.1 |
charset-normalizer | 3.4.1 |
chromedriver-autoinstaller | 0.6.4 |
click | 8.1.8 |
click-plugins | 1.1.1 |
cligj | 0.7.2 |
cloudpathlib | 0.21.0 |
cloudpickle | 3.0.0 |
colorama | 0.4.6 |
comm | 0.2.2 |
commonmark | 0.9.1 |
conda | 24.9.1 |
conda-libmamba-solver | 24.7.0 |
conda-package-handling | 2.3.0 |
conda_package_streaming | 0.10.0 |
confection | 0.1.5 |
contextily | 1.6.2 |
contourpy | 1.3.1 |
cryptography | 43.0.1 |
cycler | 0.12.1 |
cymem | 2.0.11 |
cytoolz | 1.0.0 |
dask | 2024.9.1 |
dask-expr | 1.1.15 |
databricks-sdk | 0.33.0 |
dataclasses-json | 0.6.7 |
debugpy | 1.8.6 |
decorator | 5.1.1 |
Deprecated | 1.2.14 |
diskcache | 5.6.3 |
distributed | 2024.9.1 |
distro | 1.9.0 |
docker | 7.1.0 |
duckdb | 1.2.1 |
en_core_web_sm | 3.8.0 |
entrypoints | 0.4 |
et_xmlfile | 2.0.0 |
exceptiongroup | 1.2.2 |
executing | 2.1.0 |
fastexcel | 0.11.6 |
fastjsonschema | 2.21.1 |
fiona | 1.10.1 |
Flask | 3.0.3 |
folium | 0.17.0 |
fontawesomefree | 6.6.0 |
fonttools | 4.56.0 |
fr_core_news_sm | 3.8.0 |
frozendict | 2.4.4 |
frozenlist | 1.5.0 |
fsspec | 2023.12.2 |
geographiclib | 2.0 |
geopandas | 1.0.1 |
geoplot | 0.5.1 |
geopy | 2.4.1 |
gitdb | 4.0.11 |
GitPython | 3.1.43 |
google-auth | 2.35.0 |
graphene | 3.3 |
graphql-core | 3.2.4 |
graphql-relay | 3.2.0 |
graphviz | 0.20.3 |
great-tables | 0.12.0 |
greenlet | 3.1.1 |
gunicorn | 22.0.0 |
h11 | 0.14.0 |
h2 | 4.1.0 |
hpack | 4.0.0 |
htmltools | 0.6.0 |
httpcore | 1.0.7 |
httpx | 0.28.1 |
httpx-sse | 0.4.0 |
hyperframe | 6.0.1 |
idna | 3.10 |
imageio | 2.37.0 |
importlib_metadata | 8.6.1 |
importlib_resources | 6.5.2 |
inflate64 | 1.0.1 |
ipykernel | 6.29.5 |
ipython | 8.28.0 |
itsdangerous | 2.2.0 |
jedi | 0.19.1 |
Jinja2 | 3.1.6 |
jmespath | 1.0.1 |
joblib | 1.4.2 |
jsonpatch | 1.33 |
jsonpointer | 3.0.0 |
jsonschema | 4.23.0 |
jsonschema-specifications | 2024.10.1 |
jupyter-cache | 1.0.0 |
jupyter_client | 8.6.3 |
jupyter_core | 5.7.2 |
kaleido | 0.2.1 |
kiwisolver | 1.4.8 |
langchain | 0.3.20 |
langchain-community | 0.3.9 |
langchain-core | 0.3.45 |
langchain-text-splitters | 0.3.6 |
langcodes | 3.5.0 |
langsmith | 0.1.147 |
language_data | 1.3.0 |
lazy_loader | 0.4 |
libmambapy | 1.5.9 |
locket | 1.0.0 |
loguru | 0.7.3 |
lxml | 5.3.1 |
lz4 | 4.3.3 |
Mako | 1.3.5 |
mamba | 1.5.9 |
mapclassify | 2.8.1 |
marisa-trie | 1.2.1 |
Markdown | 3.6 |
markdown-it-py | 3.0.0 |
MarkupSafe | 3.0.2 |
marshmallow | 3.26.1 |
matplotlib | 3.10.1 |
matplotlib-inline | 0.1.7 |
mdurl | 0.1.2 |
menuinst | 2.1.2 |
mercantile | 1.2.1 |
mizani | 0.11.4 |
mlflow | 2.16.2 |
mlflow-skinny | 2.16.2 |
msgpack | 1.1.0 |
multidict | 6.1.0 |
multivolumefile | 0.2.3 |
munkres | 1.1.4 |
murmurhash | 1.0.12 |
mypy-extensions | 1.0.0 |
narwhals | 1.30.0 |
nbclient | 0.10.0 |
nbformat | 5.10.4 |
nest_asyncio | 1.6.0 |
networkx | 3.4.2 |
nltk | 3.9.1 |
numpy | 2.2.3 |
opencv-python-headless | 4.10.0.84 |
openpyxl | 3.1.5 |
opentelemetry-api | 1.16.0 |
opentelemetry-sdk | 1.16.0 |
opentelemetry-semantic-conventions | 0.37b0 |
orjson | 3.10.15 |
outcome | 1.3.0.post0 |
OWSLib | 0.28.1 |
packaging | 24.2 |
pandas | 2.2.3 |
paramiko | 3.5.0 |
parso | 0.8.4 |
partd | 1.4.2 |
pathspec | 0.12.1 |
patsy | 1.0.1 |
Pebble | 5.1.0 |
pexpect | 4.9.0 |
pickleshare | 0.7.5 |
pillow | 11.1.0 |
pip | 24.2 |
platformdirs | 4.3.6 |
plotly | 5.24.1 |
plotnine | 0.13.6 |
pluggy | 1.5.0 |
polars | 1.8.2 |
preshed | 3.0.9 |
prometheus_client | 0.21.0 |
prometheus_flask_exporter | 0.23.1 |
prompt_toolkit | 3.0.48 |
propcache | 0.3.0 |
protobuf | 4.25.3 |
psutil | 7.0.0 |
ptyprocess | 0.7.0 |
pure_eval | 0.2.3 |
py7zr | 0.20.8 |
pyarrow | 17.0.0 |
pyarrow-hotfix | 0.6 |
pyasn1 | 0.6.1 |
pyasn1_modules | 0.4.1 |
pybcj | 1.0.3 |
pycosat | 0.6.6 |
pycparser | 2.22 |
pycryptodomex | 3.21.0 |
pydantic | 2.10.6 |
pydantic_core | 2.27.2 |
pydantic-settings | 2.8.1 |
Pygments | 2.19.1 |
PyNaCl | 1.5.0 |
pynsee | 0.1.8 |
pyogrio | 0.10.0 |
pyOpenSSL | 24.2.1 |
pyparsing | 3.2.1 |
pyppmd | 1.1.1 |
pyproj | 3.7.1 |
pyshp | 2.3.1 |
PySocks | 1.7.1 |
python-dateutil | 2.9.0.post0 |
python-dotenv | 1.0.1 |
python-magic | 0.4.27 |
pytz | 2025.1 |
pyu2f | 0.1.5 |
pywaffle | 1.1.1 |
PyYAML | 6.0.2 |
pyzmq | 26.3.0 |
pyzstd | 0.16.2 |
querystring_parser | 1.2.4 |
rasterio | 1.4.3 |
referencing | 0.36.2 |
regex | 2024.9.11 |
requests | 2.32.3 |
requests-cache | 1.2.1 |
requests-toolbelt | 1.0.0 |
retrying | 1.3.4 |
rich | 13.9.4 |
rpds-py | 0.23.1 |
rsa | 4.9 |
rtree | 1.4.0 |
ruamel.yaml | 0.18.6 |
ruamel.yaml.clib | 0.2.8 |
s3fs | 2023.12.2 |
s3transfer | 0.11.3 |
scikit-image | 0.24.0 |
scikit-learn | 1.6.1 |
scipy | 1.13.0 |
seaborn | 0.13.2 |
selenium | 4.29.0 |
setuptools | 76.0.0 |
shapely | 2.0.7 |
shellingham | 1.5.4 |
six | 1.17.0 |
smart-open | 7.1.0 |
smmap | 5.0.0 |
sniffio | 1.3.1 |
sortedcontainers | 2.4.0 |
soupsieve | 2.5 |
spacy | 3.8.4 |
spacy-legacy | 3.0.12 |
spacy-loggers | 1.0.5 |
SQLAlchemy | 2.0.39 |
sqlparse | 0.5.1 |
srsly | 2.5.1 |
stack-data | 0.6.2 |
statsmodels | 0.14.4 |
tabulate | 0.9.0 |
tblib | 3.0.0 |
tenacity | 9.0.0 |
texttable | 1.7.0 |
thinc | 8.3.4 |
threadpoolctl | 3.6.0 |
tifffile | 2025.3.13 |
toolz | 1.0.0 |
topojson | 1.9 |
tornado | 6.4.2 |
tqdm | 4.67.1 |
traitlets | 5.14.3 |
trio | 0.29.0 |
trio-websocket | 0.12.2 |
truststore | 0.9.2 |
typer | 0.15.2 |
typing_extensions | 4.12.2 |
typing-inspect | 0.9.0 |
tzdata | 2025.1 |
Unidecode | 1.3.8 |
url-normalize | 1.4.3 |
urllib3 | 1.26.20 |
uv | 0.6.8 |
wasabi | 1.1.3 |
wcwidth | 0.2.13 |
weasel | 0.4.1 |
webdriver-manager | 4.0.2 |
websocket-client | 1.8.0 |
Werkzeug | 3.0.4 |
wheel | 0.44.0 |
wordcloud | 1.9.3 |
wrapt | 1.17.2 |
wsproto | 1.2.0 |
xgboost | 2.1.1 |
xlrd | 2.0.1 |
xyzservices | 2025.1.0 |
yarl | 1.18.3 |
yellowbrick | 1.5 |
zict | 3.0.0 |
zipp | 3.21.0 |
zstandard | 0.23.0 |
View file history
SHA | Date | Author | Description |
---|---|---|---|
3f1d2f3f | 2025-03-15 15:55:59 | Lino Galiana | Fix problem with uv and malformed files (#599) |
388fd975 | 2025-02-28 17:34:09 | Lino Galiana | Colab again and again… (#595) |
488780a4 | 2024-09-25 14:32:16 | Lino Galiana | Change badge (#556) |
5d15b063 | 2024-09-23 15:39:40 | lgaliana | Handling badges problem |
f8b04136 | 2024-08-28 15:15:04 | Lino Galiana | Révision complète de la partie introductive (#549) |
0908656f | 2024-08-20 16:30:39 | Lino Galiana | English sidebar (#542) |
a987feaa | 2024-06-23 18:43:06 | Lino Galiana | Fix broken links (#506) |
69a45850 | 2024-06-12 20:02:14 | Antoine Palazzolo | correct link (#502) |
005d89b8 | 2023-12-20 17:23:04 | Lino Galiana | Finalise l’affichage des statistiques Git (#478) |
16842200 | 2023-12-02 12:06:40 | Antoine Palazzolo | Première partie de relecture de fin du cours (#467) |
1f23de28 | 2023-12-01 17:25:36 | Lino Galiana | Stockage des images sur S3 (#466) |
69cf52bd | 2023-11-21 16:12:37 | Antoine Palazzolo | [On-going] Suggestions chapitres modélisation (#452) |
a7711832 | 2023-10-09 11:27:45 | Antoine Palazzolo | Relecture TD2 par Antoine (#418) |
e8d0062d | 2023-09-26 15:54:49 | Kim A | Relecture KA 25/09/2023 (#412) |
154f09e4 | 2023-09-26 14:59:11 | Antoine Palazzolo | Des typos corrigées par Antoine (#411) |
6178ebeb | 2023-09-26 14:18:34 | Lino Galiana | Change quarto project type (#409) |
9a4e2267 | 2023-08-28 17:11:52 | Lino Galiana | Action to check URL still exist (#399) |
80823022 | 2023-08-25 17:48:36 | Lino Galiana | Mise à jour des scripts de construction des notebooks (#395) |
3bdf3b06 | 2023-08-25 11:23:02 | Lino Galiana | Simplification de la structure 🤓 (#393) |
dde3e934 | 2023-07-21 22:22:05 | Lino Galiana | Fix bug on chapter order (#385) |
2dbf8533 | 2023-07-05 11:21:40 | Lino Galiana | Add nice featured images (#368) |
f21a24d3 | 2023-07-02 10:58:15 | Lino Galiana | Pipeline Quarto & Pages 🚀 (#365) |
Library developed by the French public research laboratories of INRIA since 2007.↩︎
Library initially used by Google for their internal needs, it was made public in 2015. Although less used now, this library had a significant influence in the 2010s by promoting the use of neural networks in research and operational applications.↩︎
Library developed by Meta since 2018 and affiliated since 2022 with the PyTorch foundation.↩︎
In these two areas, the most serious competitor to Python
is Javascript
. However, the community around this language is more focused
on web development issues than on data science.↩︎
Tabular data are structured data, organized,
as their name indicates, in a table format that allows matching
observations with variables. This structuring differs from other types
of more complex data: free texts, images, sounds, videos… In the domain of unstructured data,
Python
is the hegemonic language for analysis. In the domain of tabular data, Python
’s competitive advantage is less pronounced, particularly compared to R,
but these two languages offer a core set of fairly similar functionalities. We will
regularly draw parallels between these two languages
in the chapters dedicated to the Pandas
library.↩︎
A notebook is an interactive environment for writing and executing code live. It combines, in a single document, text and executable code whose outputs are displayed after computation. This is extremely convenient for learning the Python language. For more details, see the official Jupyter documentation.↩︎
Opening chapters as notebooks in standardized environments, as will be proposed starting from the next chapter, ensures that you have a controlled environment. Personal installations of Python
are likely to have undergone modifications that can alter your environment and cause unexpected and hard-to-understand errors: this is not a recommended use for this course. As you will discover in the next chapter, cloud environments offer comfort regarding environment standardization.↩︎
@book{galiana2023,
author = {Galiana, Lino},
title = {Python Pour La Data Science},
date = {2023},
url = {https://pythonds.linogaliana.fr/},
doi = {10.5281/zenodo.8229676},
langid = {en}
}