Web scraping with Python

Python makes it easy to retrieve a web page and extract its data for restructuring. Web scraping is an increasingly popular way of retrieving large amounts of information in real time. This chapter presents the two main paradigms, illustrated with BeautifulSoup and Selenium, as well as the main challenges of web scraping.

Author: Lino Galiana

Published: 2025-12-23

In this chapter, you will:
  • Understand the key challenges of web scraping, including legal concerns (e.g. GDPR, grey areas), site stability, and data reliability
  • Follow best practices when scraping: check the robots.txt file, space out your requests, avoid overloading servers, and scrape during off-peak hours when possible
  • Navigate the HTML structure of a web page (tags, parent-child relationships) to accurately target the elements you want to extract
  • Use the requests library to fetch web page content, and BeautifulSoup to parse and explore the HTML using methods like find and find_all
  • Practice your scraping skills with a hands-on exercise involving the French Ligue 1 football team list
  • Explore Selenium for simulating user interactions on JavaScript-driven dynamic pages
  • Understand the limitations of web scraping and know when it’s better to use more stable and reliable APIs

Web scraping refers to techniques for extracting content from websites. It is a very useful practice for anyone looking to work with information available online, but not necessarily in the form of an Excel table.

This chapter introduces you to how to create and run bots to quickly retrieve useful information for your current or future projects. It starts with some concrete use cases. This chapter is heavily inspired by and adapted from the work of Xavier Dupré, who previously taught this course.

1 Issues

A number of issues related to web scraping will only be briefly mentioned in this chapter.

1.2 Stability and Reliability of Retrieved Information

Data retrieval through web scraping is certainly practical, but it does not necessarily align with the use intended or desired by a data provider. Since data is costly to collect and make available, some sites may not want it to be extracted freely and easily, especially when the data could give a competitor commercially useful information (e.g., the price of a competing product).

As a result, companies often implement strategies to block or limit the amount of data scraped. The most common method is detecting and blocking requests made by bots rather than humans. For specialized entities, this detection is quite easy because numerous indicators can identify whether a website visit comes from a human user behind a browser or a bot. To mention just a few clues: browsing speed between pages, speed of data extraction, fingerprinting of the browser used, ability to answer random questions (captcha)…

The best practices mentioned later aim to ensure that a bot behaves civilly by adopting behavior close to that of a human without pretending to be one.

It’s also essential to be cautious about the information received through web scraping. Since data is central to some business models, some companies don’t hesitate to send false data to bots rather than blocking them. It’s fair play! Another trap technique is called the honey pot. These are pages that a human would never visit—for example, because they don’t appear in the graphical interface—but where a bot, automatically searching for content, might get stuck.

Without resorting to the strategy of blocking web scraping, other reasons can explain why a data retrieval that worked in the past may no longer work. The most frequent reason is a change in the structure of a website. Web scraping has the disadvantage of retrieving information from a very hierarchical structure. A change in this structure can make a bot incapable of retrieving content. Moreover, to remain attractive, websites frequently change, which can easily render a bot inoperative.

In general, one of the key takeaways from this chapter is that web scraping is a last resort solution for occasional data retrieval without any guarantee of future functionality. It is preferable to favor APIs when they are available. The latter resemble a contract (formal or not) between a data provider and a user, where needs (the data) and access conditions (number of requests, volume, authentication…) are defined, whereas web scraping is more akin to behavior in the Wild West.

1.3 Best Practices

The ability to retrieve data through a bot does not mean one can afford to be uncivilized. Indeed, when uncontrolled, web scraping can resemble a classic cyberattack aimed at taking down a website: a denial of service. The course by Antoine Palazzolo reviews some best practices that have emerged in the scraping community. It is recommended to read this resource to learn more about this topic. Several conventions are discussed, including:

  • Navigate from the site’s root to the robots.txt file to check the guidelines provided by the website’s developers to regulate the behavior of bots (this check is illustrated in the sketch after this list);
  • Space out each request by several seconds, as a human would, to avoid overloading the website and causing it to crash due to a denial of service;
  • Make requests during the website’s off-peak hours if it is not an internationally accessed site. For example, for a French-language site, running the bot during the night in metropolitan France is a good practice. To run a bot from Python at a pre-scheduled time, there are cronjobs.
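
To make the first two of these conventions concrete, here is a minimal sketch. It reads the site's robots.txt with the standard-library urllib.robotparser, declares a User-Agent and pauses between requests; the target URL and the five-second delay are arbitrary choices for illustration.

import time
import urllib.robotparser

import requests

# Check what the site allows for bots before requesting anything
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://fr.wikipedia.org/robots.txt")
rp.read()

user_agent = "Python for data science tutorial"
urls = [
    "https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020",
]

for url in urls:
    if not rp.can_fetch(user_agent, url):
        print(f"robots.txt disallows {url}, skipping")
        continue
    response = requests.get(url, headers={"User-Agent": user_agent})
    print(url, response.status_code)
    time.sleep(5)  # space out requests, as a human browsing the site would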

2 A Detour to the Web: How Does a Website Work?

Even though this lab doesn’t aim to provide a web course, you still need some basics on how a website works to understand how information is structured on a page.

A website is a collection of pages coded in HTML that describe both the content and the layout of a Web page.

To see this, open any web page and right-click on it.

  • On Chrome : Then click on “View page source” (CTRL+U);
  • On Firefox : “View Page Source” (CTRL+U);
  • On Edge : “View page source” (CTRL+U);
  • On Safari : enable the Develop menu in Safari’s advanced settings, then choose “Show Page Source”.

If you know which element interests you, you can also open the browser’s inspector (right-click on the element + “Inspect”) to display the tags surrounding your element more ergonomically, like a zoom.

2.1 Tags

On a web page, you will always find elements like <head>, <title>, etc. These are the codes that structure the content of an HTML page, and they are called tags. Examples of tags include <p>, <h1>, <h2>, <h3>, <strong>, and <em>. An opening tag such as <p> marks the beginning of a section, and the corresponding closing tag </p> marks its end. Most tags come in such pairs, with an opening tag and a closing tag (e.g., <p> and </p>).

For example, the main tags defining the structure of a table are as follows:

Tag Description
<table> Table
<caption> Table title
<tr> Table row
<th> Header cell
<td> Cell
<thead> Table header section
<tbody> Table body section
<tfoot> Table footer section

2.1.1 Application: A Table in HTML

For example, the following HTML code:

<table>
    <caption> Le Titre de mon tableau </caption>
    <tr>
        <th>Nom</th>
        <th>Profession</th>
    </tr>
    <tr>
        <td>Astérix</td>
        <td></td>
    </tr>
    <tr>
        <td>Obélix</td>
        <td>Tailleur de Menhir</td>
    </tr>
</table>

will render in the browser as:

Le Titre de mon tableau
Nom Profession
Astérix
Obélix Tailleur de Menhir

2.1.2 Parent and Child

In HTML, the terms parent and child refer to elements nested within each other. In the following construction, for example:

<div>
    <p>
       bla,bla
    </p>
</div>

On the web page, it will appear as follows:

bla,bla

One would say that the <div> element is the parent of the <p> element, while the <p> element is the child of the <div> element.

But why learn this for “scraping”?

Because to effectively retrieve information from a website, you need to understand its structure and, therefore, its HTML code. The Python functions used for scraping are primarily designed to help you navigate between tags. With Python, you will essentially replicate your manual search behavior to automate it.
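
As a first taste of what follows, here is a minimal sketch using BeautifulSoup (introduced in the next section) on the small snippet above, showing how to move between a tag, its parent and its children:

import bs4

# Parse the small <div><p> snippet from the previous section
html = "<div><p>bla,bla</p></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

p = soup.find("p")
print(p.text)          # 'bla,bla'
print(p.parent.name)   # 'div': the <div> element is the parent of <p>
print([child.name for child in soup.div.children])  # ['p']: <p> is its child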

3 Scraping with Python: The BeautifulSoup Package

3.1 Available Packages

In the first part of this chapter, we will primarily use the BeautifulSoup4 package, in conjunction with requests. The latter retrieves the raw content of a page, which is then parsed with BeautifulSoup4.

BeautifulSoup will suffice when you want to work on static HTML pages. As soon as the information you are looking for is generated via the execution of JavaScript scripts, you will need to use tools like Selenium.

Similarly, if you don’t know the URL, you’ll need to use a framework like Scrapy, which easily navigates from one page to another. This technique is called “web crawling”. Scrapy is more complex to handle than BeautifulSoup: if you want more details, visit the Scrapy tutorial page.

Web scraping is an area where reproducibility is difficult to implement. A web page may evolve regularly, and from one web page to another, the structure can be very different, making some code difficult to export. Therefore, the best way to have a functional program is to understand the structure of a web page and distinguish the elements that can be exported to other use cases from ad hoc requests.

The packages used in the first part of this chapter can be installed as follows:

!pip install lxml
!pip install bs4

Note

To use Selenium, Python must be able to communicate with a web browser (Firefox or Chromium). The webdriver-manager package lets Python locate this browser if it is already installed in a standard path. We therefore install it, along with selenium:

!pip install selenium
!pip install webdriver-manager

3.2 Retrieve the Content of an HTML Page

Let’s start slowly. Let’s take a Wikipedia page, for example, the one for the 2019-2020 Ligue 1 football season: 2019-2020 Championnat de France de football. We will want to retrieve the list of teams, as well as the URLs of the Wikipedia pages for these teams.

Step 1️⃣: Connect to the Wikipedia page and obtain the source code. For this, the simplest way is to use the requests package. This allows Python to make the appropriate HTTP request to obtain the content of a page from its URL:

import requests
url_ligue_1 = "https://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020"

request_text = requests.get(
    url_ligue_1,
    headers={"User-Agent": "Python for data science tutorial"}
).content
request_text[:150]
b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature'
Warning

To limit the volume of bots retrieving information from Wikipedia (a resource heavily used by LLMs, for example), you must now specify a user agent in your requests. This is good practice in any case, as it lets sites know who is using their resources.

Step 2️⃣: search this abundant source code for the tags that contain the information we are interested in. The main benefit of the BeautifulSoup package is that it offers easy-to-use methods for searching a complex document for HTML or XML tags and the text they contain.

import bs4
page = bs4.BeautifulSoup(request_text, "lxml")

If we print the page object created with BeautifulSoup, we see that it is no longer a string but an actual HTML page with tags. We can now search for elements within these tags.

3.3 The find method

As a first illustration of the power of BeautifulSoup, let’s find the title of the page. To do this, we use the .find method and ask it for the “title” tag.

print(page.find("title"))
<title>Championnat de France de football 2019-2020 — Wikipédia</title>

The .find method only returns the first occurrence of the element.

To verify this, you can:

  • copy the snippet of source code obtained when you search for a table,
  • paste it into a cell in your notebook,
  • and switch the cell to “Markdown”.

If we take the previous code and replace title with table, we get

print(page.find("table"))

which is the source text that generates the following table:

Généralités
Sport Football
Organisateur(s) LFP
Édition 82e
Lieu(x) France et Monaco
Date du … au … (arrêt définitif)
Participants 20 équipes
Matchs joués 279 (sur 380 prévus)
Site web officiel Site officiel

3.4 The find_all Method

To find all occurrences, use .find_all().

print("Il y a", len(page.find_all("table")), "éléments dans la page qui sont des <table>")
Il y a 34 éléments dans la page qui sont des <table>
Tip

Python is not the only language that can retrieve elements from a web page. This is one of the main purposes of JavaScript, which is available in any web browser.

For example, to draw a parallel with the page.find('title') we used in Python, you can open the previously mentioned page in your browser. After opening the browser’s developer tools (CTRL+SHIFT+K on Firefox), you can type document.querySelector("title") in the console to get the content of the HTML node you are looking for.

If you use Selenium for web scraping, you will encounter these JavaScript-style selectors in the methods you use.

Understanding the structure of a page and its interaction with the browser is extremely useful when doing scraping, even when the site is purely static, meaning it does not have elements reacting to user actions on the web browser.

4 Guided Exercise: Get the List of Ligue 1 Teams

In the “Participants” section of the page, there is a table listing the clubs taking part in the season.

Exercise 1: Retrieve the Participants of Ligue 1

To do this, we will proceed in 6 steps:

  1. Find the table
  2. Retrieve each row from the table
  3. Clean up the outputs by keeping only the text in a row
  4. Generalize for all rows
  5. Retrieve the table headers
  6. Finalize the table

1️⃣ Find the table

# identify the table in question: it is the first one with the class "wikitable sortable"
tableau_participants = page.find('table', {'class' : 'wikitable sortable'})
print(tableau_participants)

2️⃣ Retrieve each row from the table

Let’s first search for the rows, identified by the tr tag:

table_body = tableau_participants.find('tbody')
rows = table_body.find_all('tr')

You get a list where each element is one of the rows in the table. To illustrate this, we will first display the first row. This corresponds to the column headers:

print(rows[0])

The second row will correspond to the row of the first club listed in the table:

print(rows[1])

3️⃣ Clean the outputs by keeping only the text in a row

We will use the text attribute to strip away the HTML markup we obtained in step 2.

An example on the first club’s row:

  • We start by taking all the cells in that row, using the td tag.
  • Then, we loop through each cell and keep only the text from the cell using the text attribute.
  • Finally, we apply the strip() method to ensure the text is properly formatted (no unnecessary spaces, etc.).
cols = rows[1].find_all('td')
print(cols[0])
print(cols[0].text.strip())
<td><a href="/wiki/Paris_Saint-Germain_Football_Club" title="Paris Saint-Germain Football Club">Paris Saint-Germain</a>
</td>
Paris Saint-Germain
for ele in cols:
    print(ele.text.strip())
Paris Saint-Germain
1974
637
1er
Thomas Tuchel
2018
Parc des Princes
47 929
46

4️⃣ Generalize for all rows:

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]

We have successfully retrieved the information contained in the participants’ table. But the first row is strange: it’s an empty list…

These are the headers: they are recognized by the th tag, not td.

We will put all the content into a dictionary, to later convert it into a pandas DataFrame:

dico_participants = dict()
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    if len(cols) > 0 : 
        dico_participants[cols[0]] = cols[1:]

dico_participants
import pandas as pd
data_participants = pd.DataFrame.from_dict(dico_participants,orient='index')
data_participants.head()
0 1 2 3 4 5 6 7
Paris Saint-Germain 1974 637 1er Thomas Tuchel 2018 Parc des Princes 47 929 46
LOSC Lille 2000 120 2e Christophe Galtier 2017 Stade Pierre-Mauroy 49 712 59
Olympique lyonnais 1989 310 3e Rudi Garcia 2019 Groupama Stadium 57 206 60
AS Saint-Étienne 2004 100 4e Claude Puel 2019 Stade Geoffroy-Guichard 41 965 66
Olympique de Marseille 1996 110 5e André Villas-Boas 2019 Orange Vélodrome 66 226 69

5️⃣ Retrieve the table headers:

for row in rows:
    cols = row.find_all('th')
    print(cols)
    if len(cols) > 0 : 
        cols = [ele.get_text(separator=' ').strip().title() for ele in cols]
        columns_participants = cols
columns_participants
['Club',
 'Dernière Montée',
 'Budget [ 3 ] En M €',
 'Classement 2018-2019',
 'Entraîneur',
 'Depuis',
 'Stade',
 'Capacité En L1 [ 4 ]',
 'Nombre De Saisons En L1']

6️⃣ Finalize the table

data_participants.columns = columns_participants[1:]
data_participants.head()
Dernière Montée Budget [ 3 ] En M € Classement 2018-2019 Entraîneur Depuis Stade Capacité En L1 [ 4 ] Nombre De Saisons En L1
Paris Saint-Germain 1974 637 1er Thomas Tuchel 2018 Parc des Princes 47 929 46
LOSC Lille 2000 120 2e Christophe Galtier 2017 Stade Pierre-Mauroy 49 712 59
Olympique lyonnais 1989 310 3e Rudi Garcia 2019 Groupama Stadium 57 206 60
AS Saint-Étienne 2004 100 4e Claude Puel 2019 Stade Geoffroy-Guichard 41 965 66
Olympique de Marseille 1996 110 5e André Villas-Boas 2019 Orange Vélodrome 66 226 69

5 Going Further

5.1 Retrieving Stadium Locations

Try to understand, step by step, what the following code does (it retrieves additional information by navigating through the pages of the different clubs).

Code to retrieve stadium locations
import requests
import bs4
import pandas as pd


def retrieve_page(url: str) -> bs4.BeautifulSoup:
    """
    Retrieves and parses a webpage using BeautifulSoup.

    Args:
        url (str): The URL of the webpage to retrieve.

    Returns:
        bs4.BeautifulSoup: The parsed HTML content of the page.
    """
    r = requests.get(url, headers={"User-Agent": "Python for data science tutorial"})
    page = bs4.BeautifulSoup(r.content, 'html.parser')
    return page


def extract_team_name_url(team: bs4.element.Tag) -> dict:
    """
    Extracts the team name and its corresponding Wikipedia URL.

    Args:
        team (bs4.element.Tag): The BeautifulSoup tag containing the team information.

    Returns:
        dict: A dictionary with the team name as the key and the Wikipedia URL as the value, or None if not found.
    """
    try:
        team_url = team.find('a').get('href')
        equipe = team.find('a').get('title')
        url_get_info = f"http://fr.wikipedia.org{team_url}"
        print(f"Retrieving information for {equipe}")
        return {equipe: url_get_info}
    except AttributeError:
        print(f"No <a> tag for \"{team}\"")
        return None


def explore_team_page(wikipedia_team_url: str) -> bs4.BeautifulSoup:
    """
    Retrieves and parses a team's Wikipedia page.

    Args:
        wikipedia_team_url (str): The URL of the team's Wikipedia page.

    Returns:
        bs4.BeautifulSoup: The parsed HTML content of the team's Wikipedia page.
    """
    r = requests.get(
        wikipedia_team_url, headers={"User-Agent": "Python for data science tutorial"}
    )
    page = bs4.BeautifulSoup(r.content, 'html.parser')
    return page


def extract_stadium_info(search_team: bs4.BeautifulSoup) -> tuple:
    """
    Extracts stadium information from a team's Wikipedia page.

    Args:
        search_team (bs4.BeautifulSoup): The parsed HTML content of the team's Wikipedia page.

    Returns:
        tuple: A tuple containing the stadium name, latitude, and longitude, or (None, None, None) if not found.
    """
    for stadium in search_team.find_all('tr'):
        try:
            header = stadium.find('th', {'scope': 'row'})
            if header and header.contents[0].string == "Stade":
                name_stadium, url_get_stade = extract_stadium_name_url(stadium)
                if name_stadium and url_get_stade:
                    latitude, longitude = extract_stadium_coordinates(url_get_stade)
                    return name_stadium, latitude, longitude
        except (AttributeError, IndexError) as e:
            print(f"Error processing stadium information: {e}")
    return None, None, None


def extract_stadium_name_url(stadium: bs4.element.Tag) -> tuple:
    """
    Extracts the stadium name and URL from a stadium element.

    Args:
        stadium (bs4.element.Tag): The BeautifulSoup tag containing the stadium information.

    Returns:
        tuple: A tuple containing the stadium name and its Wikipedia URL, or (None, None) if not found.
    """
    try:
        url_stade = stadium.find_all('a')[1].get('href')
        name_stadium = stadium.find_all('a')[1].get('title')
        url_get_stade = f"http://fr.wikipedia.org{url_stade}"
        return name_stadium, url_get_stade
    except (AttributeError, IndexError) as e:
        print(f"Error extracting stadium name and URL: {e}")
        return None, None


def extract_stadium_coordinates(url_get_stade: str) -> tuple:
    """
    Extracts the coordinates of a stadium from its Wikipedia page.

    Args:
        url_get_stade (str): The URL of the stadium's Wikipedia page.

    Returns:
        tuple: A tuple containing the latitude and longitude of the stadium, or (None, None) if not found.
    """
    try:
        soup_stade = retrieve_page(url_get_stade)
        kartographer = soup_stade.find('a', {'class': "mw-kartographer-maplink"})
        if kartographer:
            coordinates = kartographer.get('data-lat') + "," + kartographer.get('data-lon')
            latitude, longitude = coordinates.split(",")
            return latitude.strip(), longitude.strip()
        else:
            return None, None
    except Exception as e:
        print(f"Error extracting stadium coordinates: {e}")
        return None, None


def extract_team_info(url_team_tag: bs4.element.Tag, division: str) -> dict:
    """
    Extracts information about a team, including its stadium and coordinates.

    Args:
        url_team_tag (bs4.element.Tag): The BeautifulSoup tag containing the team information.
        division (str): Team league

    Returns:
        dict: A dictionary with details about the team, including its division, name, stadium, latitude, and longitude.
    """

    team_info = extract_team_name_url(url_team_tag)
    url_team_wikipedia = next(iter(team_info.values()))
    name_team = next(iter(team_info.keys()))
    search_team = explore_team_page(url_team_wikipedia)
    name_stadium, latitude, longitude = extract_stadium_info(search_team)
    dict_stadium_team = {
        'division': division,
        'equipe': name_team,
        'stade': name_stadium,
        'latitude': latitude,
        'longitude': longitude
    }
    return dict_stadium_team


def retrieve_all_stadium_from_league(url_list: dict, division: str = "L1") -> pd.DataFrame:
    """
    Retrieves information about all stadiums in a league.

    Args:
        url_list (dict): A dictionary mapping divisions to their Wikipedia URLs.
        division (str): The division for which to retrieve stadium information.

    Returns:
        pd.DataFrame: A DataFrame containing information about the stadiums in the specified division.
    """
    page = retrieve_page(url_list[division])
    teams = page.find_all('span', {'class': 'toponyme'})
    all_info = []

    for team in teams:
        all_info.append(extract_team_info(team, division))

    stadium_df = pd.DataFrame(all_info)
    return stadium_df


# URLs for different divisions
url_list = {
    "L1": "http://fr.wikipedia.org/wiki/Championnat_de_France_de_football_2019-2020",
    "L2": "http://fr.wikipedia.org/wiki/Championnat_de_France_de_football_de_Ligue_2_2019-2020"
}

# Retrieve stadiums information for Ligue 1
stades_ligue1 = retrieve_all_stadium_from_league(url_list, "L1")
stades_ligue2 = retrieve_all_stadium_from_league(url_list, "L2")

stades = pd.concat(
    [stades_ligue1, stades_ligue2]
)
stades.head(5)
division equipe stade latitude longitude
0 L1 Paris Saint-Germain Football Club Parc des Princes 48.8413634 2.2530693
1 L1 LOSC Lille Stade Pierre-Mauroy 50.611962 3.130631
2 L1 Olympique lyonnais Parc Olympique lyonnais 45.7652477 4.9818707
3 L1 Association sportive de Saint-Étienne Stade Geoffroy-Guichard 45.460856 4.390344
4 L1 Olympique de Marseille Stade Vélodrome 43.269806 5.395922

At this stage, everything is in place to create a beautiful map. We will use folium for this, which is introduced in the visualization section.

5.2 Stadium Map with folium

Code to produce the map
import geopandas as gpd
import folium

stades = stades.dropna(subset = ['latitude', 'longitude'])
stades.loc[:, ['latitude', 'longitude']] = (
    stades
    .loc[:, ['latitude', 'longitude']]
    .astype(float)
)
stadium_locations = gpd.GeoDataFrame(
    stades, geometry = gpd.points_from_xy(stades.longitude, stades.latitude)
)

center = stadium_locations[['latitude', 'longitude']].mean().values.tolist()
sw = stadium_locations[['latitude', 'longitude']].min().values.tolist()
ne = stadium_locations[['latitude', 'longitude']].max().values.tolist()

m = folium.Map(location = center, tiles='openstreetmap')

# I can add marker one by one on the map
for i in range(0,len(stadium_locations)):
    folium.Marker(
        [stadium_locations.iloc[i]['latitude'], stadium_locations.iloc[i]['longitude']],
        popup=stadium_locations.iloc[i]['stade']
    ).add_to(m)

m.fit_bounds([sw, ne])

The resulting map should look like the following:

(Interactive folium map of the Ligue 1 and Ligue 2 stadium locations.)

6 Retrieving Information on Pokémon

The next exercise to practice web scraping involves retrieving information on Pokémon from the website pokemondb.net.

6.1 Unguided Version

Important

As with Wikipedia, this site requires requests to specify a User-Agent. For instance:

requests.get(... , headers = {'User-Agent': 'Mozilla/5.0'})
Exercise 2: Pokémon (Unguided Version)

For this exercise, we ask you to obtain various information about Pokémon:

  1. The personal information of the 893 Pokémon on the website pokemondb.net. The information we would like to ultimately obtain in a DataFrame is contained in 4 tables:

    • Pokédex data
    • Training
    • Breeding
    • Base stats
  2. We would also like you to retrieve images of each Pokémon and save them in a folder.

  • A small hint: use the requests and shutil modules.
  • For this question, you will need to research some elements on your own; not everything is covered in the lab.

For question 1, the goal is to obtain the source code of a table like the one below (Pokémon Nincada).

Pokédex data

National № 290
Type Bug Ground
Species Trainee Pokémon
Height 0.5 m (1′08″)
Weight 5.5 kg (12.1 lbs)
Abilities 1. Compound Eyes
Run Away (hidden ability)
Local № 042 (Ruby/Sapphire/Emerald)
111 (X/Y — Central Kalos)
043 (Omega Ruby/Alpha Sapphire)
104 (Sword/Shield)

Training

EV yield 1 Defense
Catch rate 255 (33.3% with PokéBall, full HP)
Base Friendship 70 (normal)
Base Exp. 53
Growth Rate Erratic

Breeding

Egg Groups Bug
Gender 50% male, 50% female
Egg cycles 15 (3,599–3,855 steps)

Base stats

          Base   Min   Max
HP          31   172   266
Attack      45    85   207
Defense     90   166   306
Sp. Atk     30    58   174
Sp. Def     30    58   174
Speed       40    76   196
Total      266

For question 2, the goal is to obtain images of the Pokémon.

6.2 Guided Version

The following sections will help you complete the above exercise step by step, in a guided manner.

First, we want to obtain the personal information of all the Pokémon on pokemondb.net.

The information we would like to ultimately obtain for the Pokémon is contained in 4 tables:

  • Pokédex data
  • Training
  • Breeding
  • Base stats

Next, we will retrieve and display the images.

6.2.1 Step 1: Create a DataFrame of Characteristics

Exercise 2b: Pokémon (Guided Version)

To retrieve the information, the code must be divided into several steps:

  1. Find the site’s main page and transform it into an intelligible object for your code. The following functions will be useful:

    • requests.get
    • bs4.BeautifulSoup
  2. From this code, create a function that retrieves a pokémon’s page content from its name. You can name this function get_name.

  3. From the bulbasaur page, obtain the 4 arrays we’re interested in:

    • look for the following element: (‘table’, { ‘class’ : “vitals-table”})
    • then store its elements in a dictionary
  4. Retrieve the list of Pokémon names, which will enable us to loop later. How many Pokémon can you find?

  5. Write a function that retrieves all the information on the first ten Pokémon in the list and integrates it into a DataFrame.

At the end of question 3, you should obtain a list of characteristics similar to this one:

The structure here is a dictionary, which is convenient.
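
As an illustration of questions 2 and 3, here is a minimal sketch. It assumes the site exposes each Pokémon at a URL of the form https://pokemondb.net/pokedex/{name} (to be checked against the site) and simply stacks the rows of the four vitals-table tables into a dictionary:

import bs4
import requests


def get_name(name: str) -> bs4.BeautifulSoup:
    # Hypothetical helper: fetch and parse a Pokémon's page from its name
    url = f"https://pokemondb.net/pokedex/{name}"
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    return bs4.BeautifulSoup(r.content, "lxml")


page_bulbasaur = get_name("bulbasaur")
caracteristiques = {}
for table in page_bulbasaur.find_all('table', {'class': "vitals-table"})[:4]:
    for row in table.find_all("tr"):
        header, cell = row.find("th"), row.find("td")
        if header and cell:
            # Keep the first data cell of each row as the characteristic's value
            caracteristiques[header.text.strip()] = cell.text.strip()

caracteristiques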

Finally, you can integrate the information of the first ten Pokémon into a DataFrame, which will look like this:

National № name Type Species Height Weight Abilities Local № EV yield Catch rate ... Growth Rate Egg Groups Gender Egg cycles HP Attack Defense Sp. Atk Sp. Def Speed
0 0001 bulbasaur Grass Poison Seed Pokémon 0.7 m (2′04″) 6.9 kg (15.2 lbs) 1. OvergrowChlorophyll (hidden ability) 0001 (Red/Blue/Yellow)0226 (Gold/Silver/Crysta... 1 Sp. Atk 45 (5.9% with PokéBall, full HP) ... Medium Slow Grass, Monster 87.5% male, 12.5% female 20(4,884–5,140 steps) 45 49 49 65 65 45
1 0002 ivysaur Grass Poison Seed Pokémon 1.0 m (3′03″) 13.0 kg (28.7 lbs) 1. OvergrowChlorophyll (hidden ability) 0002 (Red/Blue/Yellow)0227 (Gold/Silver/Crysta... 1 Sp. Atk, 1 Sp. Def 45 (5.9% with PokéBall, full HP) ... Medium Slow Grass, Monster 87.5% male, 12.5% female 20(4,884–5,140 steps) 60 62 63 80 80 60
2 0003 venusaur Grass Poison Seed Pokémon 2.0 m (6′07″) 100.0 kg (220.5 lbs) 1. OvergrowChlorophyll (hidden ability) 0003 (Red/Blue/Yellow)0228 (Gold/Silver/Crysta... 2 Sp. Atk, 1 Sp. Def 45 (5.9% with PokéBall, full HP) ... Medium Slow Grass, Monster 87.5% male, 12.5% female 20(4,884–5,140 steps) 80 82 83 100 100 80
3 0004 charmander Fire Lizard Pokémon 0.6 m (2′00″) 8.5 kg (18.7 lbs) 1. BlazeSolar Power (hidden ability) 0004 (Red/Blue/Yellow)0229 (Gold/Silver/Crysta... 1 Speed 45 (5.9% with PokéBall, full HP) ... Medium Slow Dragon, Monster 87.5% male, 12.5% female 20(4,884–5,140 steps) 39 52 43 60 50 65
4 0005 charmeleon Fire Flame Pokémon 1.1 m (3′07″) 19.0 kg (41.9 lbs) 1. BlazeSolar Power (hidden ability) 0005 (Red/Blue/Yellow)0230 (Gold/Silver/Crysta... 1 Sp. Atk, 1 Speed 45 (5.9% with PokéBall, full HP) ... Medium Slow Dragon, Monster 87.5% male, 12.5% female 20(4,884–5,140 steps) 58 64 58 80 65 80

5 rows × 22 columns

6.2.2 Step 2: Retrieve and Display Pokémon Photos

We would also like you to retrieve the images of the first 5 Pokémon and save them in a folder.

Exercise 2b: Pokémon (Guided Version)
  • The URLs of Pokémon images take the form “https://img.pokemondb.net/artwork/{pokemon}.jpg”. Use the requests and shutil modules to download and save the images locally (a sketch for a single Pokémon is given below).
  • Import these images stored in JPEG format into Python using the imread function from the skimage.io package.
!pip install scikit-image
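
As a hint, here is a minimal sketch for a single Pokémon, assuming the URL pattern given above (and using Mozilla/5.0 as the User-Agent, as suggested earlier):

import shutil

import requests
from skimage import io

pokemon = "bulbasaur"
url = f"https://img.pokemondb.net/artwork/{pokemon}.jpg"

# Download the artwork and write it to disk with shutil
r = requests.get(url, stream=True, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()
with open(f"{pokemon}.jpg", "wb") as f:
    shutil.copyfileobj(r.raw, f)

# Read the saved JPEG back with scikit-image and check its dimensions
image = io.imread(f"{pokemon}.jpg")
print(image.shape)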

First Pokémon in the Pokédex list

7 Selenium: Mimicking the Behavior of an Internet User

Until now, we have assumed that we always know the URL we are interested in. Additionally, the pages we have visited are “static”: they do not depend on any action or search by the user.

We will now see how to fill in fields on a website and retrieve the information we are interested in. In web development, a website’s reaction to a user’s action typically relies on JavaScript. The Selenium package lets you automate the behavior of a human user from within your code. It enables you to obtain information that is not in the HTML source but only appears after JavaScript scripts have been executed in the background.

Selenium behaves like a regular internet user: it clicks on links, fills out forms, etc.

7.1 First Example: Scraping a Search Engine

In this example, we will try to go to the Bing News site and enter a given topic in the search bar. To test, we will search with the keyword “Trump”.

Using Selenium requires a browser such as Chromium, the open-source browser on which Google Chrome is based. The version of chromedriver must be >= 2.36 and must match the version of Chrome/Chromium in your working environment. To install this browser in a Linux environment, you can refer to the dedicated section.

Important: Installing Selenium

On Colab, you can use the following commands:

!sudo apt-get update
!sudo apt install -y unzip xvfb libxi6 libgconf-2-4 -y
!sudo apt install chromium-chromedriver -y
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


If you are on the SSP Cloud, you can run the following commands:

!wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb -O /tmp/chrome.deb
!sudo apt-get update
!sudo -E apt-get install -y /tmp/chrome.deb
!pip install chromedriver-autoinstaller selenium

import chromedriver_autoinstaller
path_to_web_driver = chromedriver_autoinstaller.install()


You can then install Selenium itself, for example from a notebook cell with !pip install selenium.

First, you need to configure how Selenium will drive the browser. We initialize it with a few options:

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
#chrome_options.add_argument('--verbose')

Then we launch the browser:

from selenium.webdriver.chrome.service import Service
service = Service(executable_path=path_to_web_driver)

browser = webdriver.Chrome(
    service=service,
    options=chrome_options
)

We go to the Bing News site, and we specify the keyword we want to search for. In this case, we’re interested in news about Donald Trump. After inspecting the page using the browser’s developer tools, we see that the search bar is an element in the code called q (as in query). So we’ll ask selenium to search for this element:

browser.get('https://www.bing.com/news')
search = browser.find_element("name", "q")
print(search)
print([search.text, search.tag_name, search.id])

# send the word we would have typed in the search bar
search.send_keys("Trump")

search_button = browser.find_element("xpath", "//input[@id='sb_form_go']")
search_button.click()

Selenium can capture the image you would see in the browser with get_screenshot_as_png. This is useful to check that you have performed the intended action.
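
For instance, here is a minimal sketch saving the current state of the headless browser to a file (the filename is arbitrary):

# Save what the headless browser currently displays, to check the search worked
png = browser.get_screenshot_as_png()
with open("bing_news_trump.png", "wb") as f:
    f.write(png)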

Finally, we can extract the results. Several methods are available. The most convenient method, when available, is to use XPath, which is an unambiguous path to access an element. Indeed, multiple elements can share the same class or the same attribute, which can cause such a search to return multiple matches. To determine the XPath of an object, the developer tools of your web browser are handy. For example, in Firefox, once you have found an element in the inspector, you can right-click > Copy > XPath.
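
As an illustration, here is a minimal sketch of such an extraction. The CSS selector below (div.news-card a.title) is an assumption about Bing News's current markup and should be checked in the browser's inspector before use:

from selenium.webdriver.common.by import By

# Collect the links of the displayed news results
results = browser.find_elements(By.CSS_SELECTOR, "div.news-card a.title")
links = [element.get_attribute("href") for element in results]
print(links)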

Finally, to end our session, we ask Python to close the browser:

browser.quit()

We get the following results:

['https://www.msn.com/en-us/news/politics/tuesday-briefing-epstein-files-trump-battleships-ice-messages-turning-point-banksy-mural-and-more/ar-AA1STj4k?ocid=BingNewsVerp', 'https://www.nj.com/politics/2025/12/new-poll-trumps-downward-spiral-on-this-issue-may-continue-in-2026.html', 'https://www.nbcnews.com/politics/trump-administration/trump-names-louisiana-governor-jeff-landry-greenland-special-envoy-rcna250426', 'https://www.nytimes.com/2025/12/23/world/europe/naples-trump-figurines-nativity-christmas.html', 'https://www.msn.com/en-us/money/markets/us-stocks-rose-again-in-2025-after-overcoming-turbulence-from-tariffs-and-trump-s-fight-with-the-fed/ar-AA1STvr4?ocid=BingNewsVerp', 'https://www.msn.com/en-us/news/world/trump-says-us-will-keep-or-sell-oil-seized-from-venezuela/ar-AA1ST7Vp?ocid=BingNewsVerp', 'https://www.msn.com/en-us/money/markets/trump-suspends-all-large-offshore-wind-farms-under-construction-threatening-thousands-of-jobs-and-cheaper-energy/ar-AA1SQapN?ocid=BingNewsVerp', 'https://www.msn.com/en-us/news/politics/trump-humiliated-as-his-nemesis-tops-popularity-poll/ar-AA1STNvF?ocid=BingNewsVerp']

Other useful Selenium methods:

Method Result
find_element(...).click() Once you have found a reactive element, such as a button, you can click on it to navigate to a new page
find_element(...).send_keys("toto") Once you have found an element, such as a field for entering credentials, you can send it a value, in this case “toto”

7.2 Additional Exercise

To explore another application of web scraping, you can also tackle topic 5 of the 2023 edition of a non-competitive hackathon organized by Insee:

The NLP section of the course may be useful for the second part of the topic!

Additional information

This site was built automatically through a Github action using the Quarto reproducible publishing software (version 1.8.26).

The environment used to obtain the results is reproducible via uv. The pyproject.toml file used to build this environment is available on the linogaliana/python-datascientist repository

pyproject.toml
[project]
name = "python-datascientist"
version = "0.1.0"
description = "Source code for Lino Galiana's Python for data science course"
readme = "README.md"
requires-python = ">=3.12,<3.13"
dependencies = [
    "altair==5.4.1",
    "black==24.8.0",
    "cartiflette",
    "contextily==1.6.2",
    "duckdb>=0.10.1",
    "folium>=0.19.6",
    "geoplot==0.5.1",
    "graphviz==0.20.3",
    "great-tables>=0.12.0",
    "gt-extras>=0.0.8",
    "ipykernel>=6.29.5",
    "jupyter>=1.1.1",
    "jupyter-cache==1.0.0",
    "kaleido==0.2.1",
    "langchain-community>=0.3.27",
    "loguru==0.7.3",
    "markdown>=3.8",
    "nbclient==0.10.0",
    "nbformat==5.10.4",
    "nltk>=3.9.1",
    "pip>=25.1.1",
    "plotly>=6.1.2",
    "plotnine>=0.15",
    "polars==1.8.2",
    "pyarrow==17.0.0",
    "pynsee==0.1.8",
    "python-dotenv==1.0.1",
    "python-frontmatter>=1.1.0",
    "pywaffle==1.1.1",
    "requests>=2.32.3",
    "scikit-image==0.24.0",
    "scipy>=1.13.0",
    "selenium<4.39.0",
    "spacy==3.8.4",
    "webdriver-manager==4.0.2",
    "wordcloud==1.9.3",
]

[tool.uv.sources]
cartiflette = { git = "https://github.com/inseefrlab/cartiflette" }

[dependency-groups]
dev = [
    "nb-clean>=4.0.1",
]

To use exactly the same environment (version of Python and packages), please refer to the documentation for uv.


Citation

BibTeX citation:
@book{galiana2025,
  author = {Galiana, Lino},
  title = {Python Pour La Data Science},
  date = {2025},
  url = {https://pythonds.linogaliana.fr/},
  doi = {10.5281/zenodo.8229676},
  langid = {en}
}
For attribution, please cite this work as:
Galiana, Lino. 2025. Python Pour La Data Science. https://doi.org/10.5281/zenodo.8229676.