Mastering regular expressions

Regular expressions provide a very practical framework for flexibly manipulating textual data. They are especially useful for natural language processing (NLP) tasks or cleaning textual data.

Tutoriel
Manipulation
Author

Lino Galiana

Published

2025-03-19

If you want to try the examples in this tutorial:
View on GitHub Onyxia Onyxia Open In Colab

1 Introduction

Python offers a lot of very useful functionalities for handling textual data. This is one of the reasons for its success in the natural language processing (NLP) community (see the dedicated section).

In previous chapters, we sometimes needed to search for basic textual elements. This was possible with the str.find method from the Pandas package, which is a vectorized version of the basic find method. We could also use the basic method directly, especially when performing web scraping.

However, this search function quickly reaches its limits. For instance, if we want to find both the singular and plural occurrences of a term, we will need to use the find method at least twice. For conjugated verbs, it becomes even more complex, especially if their form changes according to the subject.

For complicated expressions, it is advisable to use regular expressions or “regex”. This is a feature found in many programming languages. It is a form of grammar that allows for searching for patterns.

Part of the content in this section is an adaptation of the collaborative documentation on R called utilitR to which I contributed. This chapter also draws from the book R for Data Science which presents a very pedagogical chapter on regex.

We will use the package re to illustrate our examples of regular expressions. This is the reference package used by Pandas in the background to vectorize text searches.

import re
import pandas as pd
Tip

Regular expressions (regex) are notoriously difficult to master. There are tools that make working with regular expressions easier.

  • The reference tool for this is [https://regex101.com/] which allows you to test regex in Python with an explanation accompanying the test.

  • Similarly, this site has a cheat sheet at the bottom of the page.

  • The Regex Crossword games allow you to learn regular expressions while having fun.

It can be useful to ask assistant AIs, such as Github Copilot or ChatGPT, for a first version of a regex by explaining the content you want to extract. This can save a lot of time, except when the AI is overconfident and offers you a completely wrong regex…

2 Principle

Regular expressions are a tool used to describe a set of possible strings according to a precise syntax, and thus define a pattern. Regular expressions are used, for example, when you want to extract a part of a string or replace a part of a string. A regular expression takes the form of a string, which can contain both literal elements and special characters with logical meaning.

For example, "ch.+n" is a regular expression that describes the following pattern: the literal string ch, followed by any string of at least one character (.+), followed by the letter n. In the string "J'ai un chien.", the substring "chien" matches this pattern. The same goes for "chapeau ron" in "J'ai un chapeau rond". In contrast, in the string "La soupe est chaude.", no substring matches this pattern (because no n appears after the ch).

To convince ourselves, we can look at the first two cases:

pattern = "ch.+n"
print(re.search(pattern, "La soupe est chaude."))
None

In the previous example, we had two adjacent quantifiers .+. The first (.) means any character1. The second (+) means “repeat the previous pattern”. In our case, the combination .+ allows us to repeat any character before finding an n. The number of times is indeterminate: it may not be necessary to intersperse characters before the n or it may be necessary to capture several:

print(re.search(pattern, "J'ai un chino"))
print(re.search(pattern, "J'ai un chiot très mignon."))
<re.Match object; span=(8, 12), match='chin'>
<re.Match object; span=(8, 25), match='chiot très mignon'>

2.1 Character classes

When searching, we are interested in characters and often in character classes: we look for a digit, a letter, a character in a specific set, or a character that does not belong to a specific set. Some sets are predefined, others must be defined using brackets.

To define a character set, you need to write this set within brackets. For example, [0123456789] denotes a digit. Since it is a sequence of consecutive characters, we can summarize this notation as [0-9].

For example, if we want to find all patterns that start with a c followed by an h and then a vowel (a, e, i, o, u), we can try this regular expression:

re.findall("[c][h][aeiou]", "chat, chien, veau, vache, chèvre")
['cha', 'chi', 'che']

It would be more practical to use Pandas in this case to isolate the lines that meet the logical condition (by adding the accents that are otherwise not included):

import pandas as pd
txt = pd.Series("chat, chien, veau, vache, chèvre".split(", "))
txt.str.match("ch[aeéèiou]")
0     True
1     True
2    False
3    False
4     True
dtype: bool

However, the usage of character classes as shown above is not the most common. They are preferred for identifying complex patterns rather than a sequence of literal characters. Memory aid tables illustrate some of the most common character classes ([:digit:] or \d…)

2.2 Quantifiers

We encountered quantifiers with our first regular expression. They control the number of times a pattern is matched.

The most common are:

  • ? : 0 or 1 match;
  • + : 1 or more matches;
  • * : 0 or more matches.

For example, colou?r will match both the American and British spellings:

re.findall("colou?r", "Did you write color or colour?")
['color', 'colour']

These quantifiers can of course be combined with other types of characters, especially character classes. This can be extremely useful. For example, \d+ will capture one or more digits, \s? will optionally add a space, [\w]{6,8} will match a word between six and eight letters.

It is also possible to define the number of repetitions with {}:

  • {n} matches exactly n times;
  • {n,} matches at least n times;
  • {n,m} matches between n and m times.

However, the repetition of terms by default only applies to the last character preceding the quantifier. We can confirm this with the example above:

print(re.match("toc{4}","toctoctoctoc"))
None

To address this issue, parentheses are used. The principle is the same as with numeric rules: parentheses allow for introducing hierarchy. To revisit the previous example, we get the expected result thanks to the parentheses:

print(re.match("(toc){4}","toctoctoctoc"))
print(re.match("(toc){5}","toctoctoctoc"))
print(re.match("(toc){2,4}","toctoctoctoc"))
<re.Match object; span=(0, 12), match='toctoctoctoc'>
None
<re.Match object; span=(0, 12), match='toctoctoctoc'>
Note

The regular expression algorithm always tries to match the largest piece to the regular expression.

For example, consider an HTML string:

s = "<h1>Super titre HTML</h1>"

The regular expression re.findall("<.*>", s) potentially matches three pieces:

  • <h1>
  • </h1>
  • <h1>Super titre HTML</h1>

It is the last one that will be chosen, as it is the largest. To select the smallest, you need to write the quantifiers like this: *?, +?. Here are a few examples:

s = "<h1>Super titre HTML</h1>\n<p><code>Python</code> est un langage très flexible</p>"
print(re.findall("<.*>", s))
print(re.findall("<p>.*</p>", s))
print(re.findall("<p>.*?</p>", s))
print(re.compile("<.*?>").findall(s))
['<h1>Super titre HTML</h1>', '<p><code>Python</code> est un langage très flexible</p>']
['<p><code>Python</code> est un langage très flexible</p>']
['<p><code>Python</code> est un langage très flexible</p>']
['<h1>', '</h1>', '<p>', '<code>', '</code>', '</p>']

2.3 Cheat sheet

The table below serves as a cheat sheet for regex:

Regular expression Meaning
"^" Start of the string
"$" End of the string
"\\." A dot
"." Any character
".+" Any non-empty sequence of characters
".*" Any sequence of characters, possibly empty
"[:alnum:]" An alphanumeric character
"[:alpha:]" A letter
"[:digit:]" A digit
"[:lower:]" A lowercase letter
"[:punct:]" A punctuation mark
"[:space:]" A space
"[:upper:]" An uppercase letter
"[[:alnum:]]+" A sequence of at least one alphanumeric character
"[[:alpha:]]+" A sequence of at least one letter
"[[:digit:]]+" A sequence of at least one digit
"[[:lower:]]+" A sequence of at least one lowercase letter
"[[:punct:]]+" A sequence of at least one punctuation mark
"[[:space:]]+" A sequence of at least one space
"[[:upper:]]+" A sequence of at least one uppercase letter
"[[:alnum:]]*" A sequence of alphanumeric characters, possibly empty
"[[:alpha:]]*" A sequence of letters, possibly empty
"[[:digit:]]*" A sequence of digits, possibly empty
"[[:lower:]]*" A sequence of lowercase letters, possibly empty
"[[:upper:]]*" A sequence of uppercase letters, possibly empty
"[[:punct:]]*" A sequence of punctuation marks, possibly empty
"[^[:alpha:]]+" A sequence of at least one character that is not a letter
"[^[:digit:]]+" A sequence of at least one character that is not a digit
"\|" Either the expression x or y is present
[abyz] One of the specified characters
[abyz]+ One or more of the specified characters (possibly repeated)
[^abyz] None of the specified characters are present

Some character classes have lighter syntax because they are very common. Among them:

Regular expression Meaning
\d Any digit
\D Any character that is not a digit
\s Any space (space, tab, newline)
\S Any character that is not a space
\w Any word character (letters and numbers)
\W Any non-word character (letters and numbers)

In the following exercise, you will be able to practice the previous examples on a slightly more complete regex. This exercise does not require knowledge of the nuances of the re package; you will only need re.findall.

This exercise will use the following string:

s = """date 0 : 14/9/2000
date 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978
date 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914
date 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976"""
s
'date 0 : 14/9/2000\ndate 1 : 20/04/1971     date 2 : 14/09/1913     date 3 : 2/3/1978\ndate 4 : 1/7/1986     date 5 : 7/3/47     date 6 : 15/10/1914\ndate 7 : 08/03/1941     date 8 : 8/1/1980     date 9 : 30/6/1976'
Exercise 1
  1. First, extract the day of birth.
    • The first digit of the day is 0, 1, 2, or 3. Translate this into a [X-X] sequence.
    • The second digit of the day is between 0 and 9. Translate this into the appropriate sequence.
    • Note that the first digit of the day is optional. Insert the appropriate quantifier between the two character classes.
    • Add the slash after the pattern.
    • Test with re.findall. You should get many more matches than needed. This is normal; at this stage, the regex is not yet finalized.
  2. Follow the same logic for the months, noting that Gregorian calendar months never exceed the first dozen. Test with re.findall.
  3. Do the same for the birth years, noting that, unless proven otherwise, for people alive today, the relevant millennia are limited. Test with re.findall.
  4. This regex is not natural; one could be satisfied with generic character classes \d, even though they might practically select impossible birth dates (e.g., 43/78/4528). This would simplify the regex, making it more readable. Don’t forget the usefulness of quantifiers.
  5. How can the regex be adapted to always be valid for our cases but also capture dates of the type YYYY/MM/DD? Test with 1998/07/12.

At the end of question 1, you should have this result:

['14/',
 '9/',
 '20/',
 '04/',
 '14/',
 '09/',
 '2/',
 '3/',
 '1/',
 '7/',
 '7/',
 '3/',
 '15/',
 '10/',
 '08/',
 '03/',
 '8/',
 '1/',
 '30/',
 '6/']

At the end of question 2, you should have this result, which is starting to take shape:

['14/9',
 '20/04',
 '14/09',
 '2/3',
 '1/7',
 '7/3',
 '15/10',
 '08/03',
 '8/1',
 '30/6']

At the end of question 3, you should be able to extract the dates:

['14/9/2000',
 '20/04/1971',
 '14/09/1913',
 '2/3/1978',
 '1/7/1986',
 '7/3/47',
 '15/10/1914',
 '08/03/1941',
 '8/1/1980',
 '30/6/1976']

If all goes well, by question 5, your regex should work:

['14/9/2000',
 '20/04/1971',
 '14/09/1913',
 '2/3/1978',
 '1/7/1986',
 '7/3/47',
 '15/10/1914',
 '08/03/1941',
 '8/1/1980',
 '30/6/1976',
 '1998/07/12']

3 Main re functions

Here is a summary table of the main functions of the re package with examples.

We have mainly used re.findall so far, which is one of the most practical functions in the package. re.sub and re.search are also quite useful. The others are less critical but can be helpful in specific cases.

Function Purpose
re.match(<regex>, s) Find and return the first match of the regular expression <regex> from the beginning of the string s
re.search(<regex>, s) Find and return the first match of the regular expression <regex> regardless of its position in the string s
re.finditer(<regex>, s) Find and return an iterator storing all matches of the regular expression <regex> regardless of their position(s) in the string s. Typically, a loop is performed over this iterator
re.findall(<regex>, s) Find and return all matches of the regular expression <regex> regardless of their position(s) in the string s as a list
re.sub(<regex>, new_text, s) Find and replace all matches of the regular expression <regex> regardless of their position(s) in the string s

To illustrate these functions, here are some examples:

Example of re.match 👇

re.match can only capture a pattern at the start of a string. Its utility is thus limited. Let’s capture toto:

re.match("(to){2}", "toto at the beach")
<re.Match object; span=(0, 4), match='toto'>
Example of re.search 👇

re.search is more powerful than re.match, allowing capture of terms regardless of their position in a string. For example, to capture age:

re.search("age", "toto is of age to go to the beach")
<re.Match object; span=(11, 14), match='age'>

And to capture exclusively “age” at the end of the string:

re.search("age$", "toto is of age to go to the beach")
Example of re.finditer 👇

re.finditer is, in my opinion, less practical than re.findall. Its main use compared to re.findall is capturing the position within a text field:

s = "toto is of age to go to the beach"
for match in re.finditer("age", s):
    start = match.start()
    end = match.end()
    print(f'String match "{s[start:end]}" at {start}:{end}')
String match "age" at 11:14
Example of re.sub 👇

re.sub allows capturing and replacing expressions. For example, let’s replace “age” with “âge”. But be careful, you don’t want to do this when the pattern is present in “beach”. So, we’ll add a negative condition: capture “age” only if it is not at the end of the string (which translates to regex as ?!$).

re.sub("age(?!$)", "âge", "toto a l'age d'aller à la plage")
"toto a l'âge d'aller à la plage"
When to use re.compile and raw strings?

re.compile can be useful when you use a regular expression multiple times in your code. It allows you to compile the regular expression into an object recognized by re, which can be more efficient in terms of performance when the regular expression is used repeatedly or on large data sets.

Raw strings (raw string) are special strings in Python that start with r. For example, r"toto at the beach". They can be useful to prevent escape characters from being interpreted by Python. For instance, if you want to search for a string containing a backslash \ in a string, you need to use a raw string to prevent the backslash from being interpreted as an escape character (\t, \n, etc.). The tester https://regex101.com/ also assumes you are using raw strings, so it can be useful to get used to them.

4 Generalization with Pandas

Pandas methods are extensions of those in re that avoid looping to check each line with a regex. In practice, when working with DataFrames, the pandas API is preferred over re. Code of the form df.apply(lambda x: re.<function>(<regex>,x), axis = 1) should be avoided as it is very inefficient.

The names sometimes change slightly compared to their re equivalents.

Method Description
str.count() Count the number of occurrences of the pattern in each line
str.replace() Replace the pattern with another value. Vectorized version of re.sub()
str.contains() Test if the pattern appears, line by line. Vectorized version of re.search()
str.extract() Extract groups that match a pattern and return them in a column
str.findall() Find and return all occurrences of a pattern. If a line contains multiple matches, a list is returned. Vectorized version of re.findall()

Additionally, there are str.split() and str.rsplit() methods which are quite useful.

Example of str.count 👇

You can count the number of times a pattern appears with str.count:

df = pd.DataFrame({"a": ["toto", "titi"]})
df['a'].str.count("to")
0    2
1    0
Name: a, dtype: int64
Example of str.replace 👇

Replace the pattern “ti” at the end of the string:

df = pd.DataFrame({"a": ["toto", "titi"]})
df['a'].str.replace("ti$", " punch")
0    toto
1    titi
Name: a, dtype: object
Example of str.contains 👇

Check the cases where our line ends with “ti”:

df = pd.DataFrame({"a": ["toto", "titi"]})
df['a'].str.contains("ti$")
0    False
1     True
Name: a, dtype: bool
Example of str.findall 👇
df = pd.DataFrame({"a": ["toto", "titi"]})
df['a'].str.findall("to")
0    [to, to]
1          []
Name: a, dtype: object
Warning

Currently, it is not necessary to add the regex = True argument, but this should be the case in a future version of pandas. It might be worth getting into the habit of adding it.

5 For more information

6 Additional exercises

6.1 Extracting email addresses

This is a classic use of regex

text_emails = 'Hello from toto@gmail.com to titi.grominet@yahoo.com about the meeting @2PM'
Exercise 2: Extracting email addresses

Use the structure of an email address [XXXX]@[XXXX] to retrieve this content.

['toto@gmail.com', 'titi.grominet@yahoo.com']

6.2 Extracting years from a pandas DataFrame

The general objective of the exercise is to clean columns in a DataFrame using regular expressions.

Exercise 3

The dataset in question contains books from the British Library and some related information. The dataset is available here: https://raw.githubusercontent.com/realpython/python-data-cleaning/master/Datasets/BL-Flickr-Images-Book.csv

The “Date of Publication” column is not always a year; sometimes there are other details. The goal of the exercise is to have a clean book publication date and to examine the distribution of publication years.

To do this, you can:

  • Either choose to perform the exercise without help. Your reading of the instructions ends here. You should carefully examine the dataset and transform it yourself.

  • Or follow the step-by-step instructions below.

Guided version 👇
  1. Read the data from the URL https://raw.githubusercontent.com/realpython/python-data-cleaning/master/Datasets/BL-Flickr-Images-Book.csv. Be careful with the separator.
  2. Keep only the columns ['Identifier', 'Place of Publication', 'Date of Publication', 'Publisher', 'Title', 'Author'].
  3. Observe the ‘Date of Publication’ column and note the issues with some rows (e.g., row 13).
  4. Start by looking at the number of missing values. We cannot do better after regex, and normally we should not have fewer…
  5. Determine the regex pattern for a publication date. Presumably, there are 4 digits forming a year. Use the str.extract() method with the expand = False argument (to keep only the first date matching our pattern)?
  6. We have 2 NaN values that were not present at the start of the exercise. What are they and why?
  7. What is the distribution of publication dates in the dataset? You can, for example, display a histogram using the plot method with the kind = "hist" argument.

Here is an example of the problem to detect in question 3:

Date of Publication Title
13 1839, 38-54 De Aardbol. Magazijn van hedendaagsche land- e...
14 1897 Cronache Savonesi dal 1500 al 1570 ... Accresc...
15 1865 See-Saw; a novel ... Edited [or rather, writte...
16 1860-63 Géodésie d'une partie de la Haute Éthiopie,...
17 1873 [With eleven maps.]
18 1866 [Historia geográfica, civil y politica de la ...
19 1899 The Crisis of the Revolution, being the story ...

Question 4 answer should be

np.int64(181)

With our regex (question 5), we obtain a DataFrame that is more in line with our expectations:

Date of Publication year
0 1879 [1878] 1879
7 NaN NaN
13 1839, 38-54 1839
16 1860-63 1860
23 1847, 48 [1846-48] 1847
... ... ...
8278 1883, [1884] 1883
8279 1898-1912 1898
8283 1831, 32 1831
8284 [1806]-22 1806
8286 1834-43 1834

1759 rows × 2 columns

As for the new NaN values, they are rows that did not contain any strings resembling years:

Date of Publication year
1081 112. G. & W. B. Whittaker NaN
7391 17 vols. University Press NaN

Finally, we obtain the following histogram of publication dates:

Informations additionnelles

environment files have been tested on.

Latest built version: 2025-03-19

Python version used:

'3.12.6 | packaged by conda-forge | (main, Sep 30 2024, 18:08:52) [GCC 13.3.0]'
Package Version
affine 2.4.0
aiobotocore 2.21.1
aiohappyeyeballs 2.6.1
aiohttp 3.11.13
aioitertools 0.12.0
aiosignal 1.3.2
alembic 1.13.3
altair 5.4.1
aniso8601 9.0.1
annotated-types 0.7.0
anyio 4.8.0
appdirs 1.4.4
archspec 0.2.3
asttokens 2.4.1
attrs 25.3.0
babel 2.17.0
bcrypt 4.2.0
beautifulsoup4 4.12.3
black 24.8.0
blinker 1.8.2
blis 1.2.0
bokeh 3.5.2
boltons 24.0.0
boto3 1.37.1
botocore 1.37.1
branca 0.7.2
Brotli 1.1.0
bs4 0.0.2
cachetools 5.5.0
cartiflette 0.0.2
Cartopy 0.24.1
catalogue 2.0.10
cattrs 24.1.2
certifi 2025.1.31
cffi 1.17.1
charset-normalizer 3.4.1
chromedriver-autoinstaller 0.6.4
click 8.1.8
click-plugins 1.1.1
cligj 0.7.2
cloudpathlib 0.21.0
cloudpickle 3.0.0
colorama 0.4.6
comm 0.2.2
commonmark 0.9.1
conda 24.9.1
conda-libmamba-solver 24.7.0
conda-package-handling 2.3.0
conda_package_streaming 0.10.0
confection 0.1.5
contextily 1.6.2
contourpy 1.3.1
cryptography 43.0.1
cycler 0.12.1
cymem 2.0.11
cytoolz 1.0.0
dask 2024.9.1
dask-expr 1.1.15
databricks-sdk 0.33.0
dataclasses-json 0.6.7
debugpy 1.8.6
decorator 5.1.1
Deprecated 1.2.14
diskcache 5.6.3
distributed 2024.9.1
distro 1.9.0
docker 7.1.0
duckdb 1.2.1
en_core_web_sm 3.8.0
entrypoints 0.4
et_xmlfile 2.0.0
exceptiongroup 1.2.2
executing 2.1.0
fastexcel 0.11.6
fastjsonschema 2.21.1
fiona 1.10.1
Flask 3.0.3
folium 0.17.0
fontawesomefree 6.6.0
fonttools 4.56.0
fr_core_news_sm 3.8.0
frozendict 2.4.4
frozenlist 1.5.0
fsspec 2023.12.2
geographiclib 2.0
geopandas 1.0.1
geoplot 0.5.1
geopy 2.4.1
gitdb 4.0.11
GitPython 3.1.43
google-auth 2.35.0
graphene 3.3
graphql-core 3.2.4
graphql-relay 3.2.0
graphviz 0.20.3
great-tables 0.12.0
greenlet 3.1.1
gunicorn 22.0.0
h11 0.14.0
h2 4.1.0
hpack 4.0.0
htmltools 0.6.0
httpcore 1.0.7
httpx 0.28.1
httpx-sse 0.4.0
hyperframe 6.0.1
idna 3.10
imageio 2.37.0
importlib_metadata 8.6.1
importlib_resources 6.5.2
inflate64 1.0.1
ipykernel 6.29.5
ipython 8.28.0
itsdangerous 2.2.0
jedi 0.19.1
Jinja2 3.1.6
jmespath 1.0.1
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
jupyter-cache 1.0.0
jupyter_client 8.6.3
jupyter_core 5.7.2
kaleido 0.2.1
kiwisolver 1.4.8
langchain 0.3.20
langchain-community 0.3.9
langchain-core 0.3.45
langchain-text-splitters 0.3.6
langcodes 3.5.0
langsmith 0.1.147
language_data 1.3.0
lazy_loader 0.4
libmambapy 1.5.9
locket 1.0.0
loguru 0.7.3
lxml 5.3.1
lz4 4.3.3
Mako 1.3.5
mamba 1.5.9
mapclassify 2.8.1
marisa-trie 1.2.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mdurl 0.1.2
menuinst 2.1.2
mercantile 1.2.1
mizani 0.11.4
mlflow 2.16.2
mlflow-skinny 2.16.2
msgpack 1.1.0
multidict 6.1.0
multivolumefile 0.2.3
munkres 1.1.4
murmurhash 1.0.12
mypy-extensions 1.0.0
narwhals 1.30.0
nbclient 0.10.0
nbformat 5.10.4
nest_asyncio 1.6.0
networkx 3.4.2
nltk 3.9.1
numpy 2.2.3
opencv-python-headless 4.10.0.84
openpyxl 3.1.5
opentelemetry-api 1.16.0
opentelemetry-sdk 1.16.0
opentelemetry-semantic-conventions 0.37b0
orjson 3.10.15
outcome 1.3.0.post0
OWSLib 0.28.1
packaging 24.2
pandas 2.2.3
paramiko 3.5.0
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
patsy 1.0.1
Pebble 5.1.0
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.1.0
pip 24.2
platformdirs 4.3.6
plotly 5.24.1
plotnine 0.13.6
pluggy 1.5.0
polars 1.8.2
preshed 3.0.9
prometheus_client 0.21.0
prometheus_flask_exporter 0.23.1
prompt_toolkit 3.0.48
propcache 0.3.0
protobuf 4.25.3
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py7zr 0.20.8
pyarrow 17.0.0
pyarrow-hotfix 0.6
pyasn1 0.6.1
pyasn1_modules 0.4.1
pybcj 1.0.3
pycosat 0.6.6
pycparser 2.22
pycryptodomex 3.21.0
pydantic 2.10.6
pydantic_core 2.27.2
pydantic-settings 2.8.1
Pygments 2.19.1
PyNaCl 1.5.0
pynsee 0.1.8
pyogrio 0.10.0
pyOpenSSL 24.2.1
pyparsing 3.2.1
pyppmd 1.1.1
pyproj 3.7.1
pyshp 2.3.1
PySocks 1.7.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-magic 0.4.27
pytz 2025.1
pyu2f 0.1.5
pywaffle 1.1.1
PyYAML 6.0.2
pyzmq 26.3.0
pyzstd 0.16.2
querystring_parser 1.2.4
rasterio 1.4.3
referencing 0.36.2
regex 2024.9.11
requests 2.32.3
requests-cache 1.2.1
requests-toolbelt 1.0.0
retrying 1.3.4
rich 13.9.4
rpds-py 0.23.1
rsa 4.9
rtree 1.4.0
ruamel.yaml 0.18.6
ruamel.yaml.clib 0.2.8
s3fs 2023.12.2
s3transfer 0.11.3
scikit-image 0.24.0
scikit-learn 1.6.1
scipy 1.13.0
seaborn 0.13.2
selenium 4.29.0
setuptools 76.0.0
shapely 2.0.7
shellingham 1.5.4
six 1.17.0
smart-open 7.1.0
smmap 5.0.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
spacy 3.8.4
spacy-legacy 3.0.12
spacy-loggers 1.0.5
SQLAlchemy 2.0.39
sqlparse 0.5.1
srsly 2.5.1
stack-data 0.6.2
statsmodels 0.14.4
tabulate 0.9.0
tblib 3.0.0
tenacity 9.0.0
texttable 1.7.0
thinc 8.3.4
threadpoolctl 3.6.0
tifffile 2025.3.13
toolz 1.0.0
topojson 1.9
tornado 6.4.2
tqdm 4.67.1
traitlets 5.14.3
trio 0.29.0
trio-websocket 0.12.2
truststore 0.9.2
typer 0.15.2
typing_extensions 4.12.2
typing-inspect 0.9.0
tzdata 2025.1
Unidecode 1.3.8
url-normalize 1.4.3
urllib3 1.26.20
uv 0.6.8
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdriver-manager 4.0.2
websocket-client 1.8.0
Werkzeug 3.0.4
wheel 0.44.0
wordcloud 1.9.3
wrapt 1.17.2
wsproto 1.2.0
xgboost 2.1.1
xlrd 2.0.1
xyzservices 2025.1.0
yarl 1.18.3
yellowbrick 1.5
zict 3.0.0
zipp 3.21.0
zstandard 0.23.0

View file history

SHA Date Author Description
3f1d2f3f 2025-03-15 15:55:59 Lino Galiana Fix problem with uv and malformed files (#599)
6c6dfe52 2024-12-20 13:40:33 lgaliana eval false for API chapter
9d8e69c3 2024-10-21 17:10:03 lgaliana update badges shortcode for all manipulation part
1953609d 2024-08-12 16:18:19 linogaliana One button is enough
c3f6cbc8 2024-08-12 12:18:22 linogaliana Correction LUA filters
f4e08292 2024-08-12 14:12:06 Lino Galiana Traduction chapitre regex (#539)
0d4cf51e 2024-08-08 06:58:53 linogaliana restore URL regex chapter
580cba77 2024-08-07 18:59:35 Lino Galiana Multilingual version as quarto profile (#533)
101465fb 2024-08-07 13:56:35 Lino Galiana regex, webscraping and API chapters in 🇬🇧 (#532)
065b0abd 2024-07-08 11:19:43 Lino Galiana Nouveaux callout dans la partie manipulation (#513)
06d003a1 2024-04-23 10:09:22 Lino Galiana Continue la restructuration des sous-parties (#492)
005d89b8 2023-12-20 17:23:04 Lino Galiana Finalise l’affichage des statistiques Git (#478)
3fba6124 2023-12-17 18:16:42 Lino Galiana Remove some badges from python (#476)
a06a2689 2023-11-23 18:23:28 Antoine Palazzolo 2ème relectures chapitres ML (#457)
69cf52bd 2023-11-21 16:12:37 Antoine Palazzolo [On-going] Suggestions chapitres modélisation (#452)
889a71ba 2023-11-10 11:40:51 Antoine Palazzolo Modification TP 3 (#443)
a7711832 2023-10-09 11:27:45 Antoine Palazzolo Relecture TD2 par Antoine (#418)
154f09e4 2023-09-26 14:59:11 Antoine Palazzolo Des typos corrigées par Antoine (#411)
9a4e2267 2023-08-28 17:11:52 Lino Galiana Action to check URL still exist (#399)
3bdf3b06 2023-08-25 11:23:02 Lino Galiana Simplification de la structure 🤓 (#393)
130ed717 2023-07-18 19:37:11 Lino Galiana Restructure les titres (#374)
f0c583c0 2023-07-07 14:12:22 Lino Galiana Images viz (#371)
ef28fefd 2023-07-07 08:14:42 Lino Galiana Listing pour la première partie (#369)
f21a24d3 2023-07-02 10:58:15 Lino Galiana Pipeline Quarto & Pages 🚀 (#365)
62b2a7c2 2022-12-28 15:00:50 Lino Galiana Suite chapitre regex (#340)
3c880d59 2022-12-27 17:34:59 Lino Galiana Chapitre regex + Change les boites dans plusieurs chapitres (#339)
f10815b5 2022-08-25 16:00:03 Lino Galiana Notebooks should now look more beautiful (#260)
494a85ae 2022-08-05 14:49:56 Lino Galiana Images featured ✨ (#252)
d201e3cd 2022-08-03 15:50:34 Lino Galiana Pimp la homepage ✨ (#249)
12965bac 2022-05-25 15:53:27 Lino Galiana :launch: Bascule vers quarto (#226)
9c71d6e7 2022-03-08 10:34:26 Lino Galiana Plus d’éléments sur S3 (#218)
9a3f7ad8 2021-10-31 18:36:25 Lino Galiana Nettoyage partie API + Git (#170)
2a8809fb 2021-10-27 12:05:34 Lino Galiana Simplification des hooks pour gagner en flexibilité et clarté (#166)
b138cf3e 2021-10-21 18:05:59 Lino Galiana Mise à jour TP webscraping et API (#164)
2e4d5862 2021-09-02 12:03:39 Lino Galiana Simplify badges generation (#130)
a5f48243 2021-07-16 14:20:27 Lino Galiana Exo supplémentaire webscraping marmiton 🍝 (#121) (#124)
4cdb759c 2021-05-12 10:37:23 Lino Galiana :sparkles: :star2: Nouveau thème hugo :snake: :fire: (#105)
6d010fa2 2020-09-29 18:45:34 Lino Galiana Simplifie l’arborescence du site, partie 1 (#57)
66f9f87a 2020-09-24 19:23:04 Lino Galiana Introduction des figures générées par python dans le site (#52)
5c1e76d9 2020-09-09 11:25:38 Lino Galiana Ajout des éléments webscraping, regex, API (#21)
Back to top

Footnotes

  1. Any character except for the newline (\n). Keep this in mind; I have already spent hours trying to understand why my . did not capture what I wanted spanning multiple lines…↩︎

Citation

BibTeX citation:
@book{galiana2023,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2023},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2023. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.