Mastering Web Scraping with Python Requests

Surenjanath Singh
Mar 14, 2023


Introduction

As I've shown in some of my previous articles, Python Requests is a powerful tool for web scraping (web harvesting): it handles cookies, headers, and sessions, as well as pagination and errors. We can scrape data from the internet as long as it is allowed and legal, which generally means publicly available data, not personal data, and always subject to a site's terms of service. Today, I'm going to give a quick rundown on mastering this skill.

Step 1: Handling HTTP Headers

My previous Spotify downloader article broke due to a missing header, which caused the server to deny my request for the hidden treasure (the data). Once I passed the header, that closed door opened and gave me the reward I was after. So what is a header? HTTP headers are additional pieces of information sent with HTTP requests and responses, such as the user agent, content type, and encoding. Python Requests makes it easy to set and modify them. For example:

import requests

# Set the headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Send a GET request with headers
response = requests.get('https://www.someexample.com', headers=headers)
print(response.content)

This sends a GET request to the specified URL with the headers set on the request. Remember that the user agent is the most important header: it tells the server that the request is coming from a browser or a phone (or a potato, if you like) rather than a script. No server likes automated traffic hammering it, since spammy bots can lag a server or even amount to a denial-of-service, which is why some sites block requests that don't carry a sensible user agent. See this project for a real use case of headers: https://github.com/surenjanath/Spotify-Playlist-Download/
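As a quick sanity check, you can inspect the headers Requests actually sent by looking at response.request.headers. Here's a minimal sketch, using httpbin.org (a public echo service, not part of the original project) just to show the difference between the default Requests user agent and a browser-style one:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Without custom headers, Requests identifies itself as python-requests/x.y.z
default_resp = requests.get('https://httpbin.org/headers')
print(default_resp.request.headers['User-Agent'])

# With a browser-style User-Agent, the server sees a "normal" browser
custom_resp = requests.get('https://httpbin.org/headers', headers=headers)
print(custom_resp.request.headers['User-Agent'])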

Step 2: Using Sessions

Sessions are great when scraping websites that require you to log in or send POST data to the server. A session preserves the headers, cookies, and any authentication state needed to browse that specific website. It isn't only for login flows either: some websites rely on a previous POST request to generate documents or kick off tasks, and a session is a great way to carry the state from those earlier GET/POST requests into the ones that follow. Here's an example:

import requests

session = requests.Session()

# Login with a POST request
login_data = {
    'username': 'username',
    'password': 'password'
}
response = session.post('https://www.someexample.com/login', data=login_data)

# Send another request with the session
response = session.get('https://www.someexample.com/profile')
print(response.content)

This logs in to the website using a POST request with the provided login data, then sends another request through the same session to the user's profile page. It's a simple illustration of what sessions are for: the cookies set by the login response are carried over automatically.
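Sessions also let you set headers once and reuse them on every request, and you can inspect the cookie jar to confirm that state really is being carried over. A minimal sketch, again using httpbin.org as a stand-in server rather than a real login flow:

import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})

# httpbin sets a cookie here...
session.get('https://httpbin.org/cookies/set/session_id/abc123')

# ...and the session sends it back automatically on the next request
print(session.cookies.get_dict())
print(session.get('https://httpbin.org/cookies').json())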

Step 3: Passing parameters such as params or data

Passing parameters, whether as params or data, is often crucial when scraping; in fact, any website that takes input from its frontend will expect parameters. The GET method takes what we call params, which you'll sometimes see at the end of a web URL, such as
Link = 'https://www.example.com/data?page={}'
The page={} part is the parameter; it always comes after a question mark ( ? ) in the URL. The POST method takes what we call data, which is what you'd use for operations such as writing or creating records from input captured on the frontend. There are two ways we can pass parameters:

  1. In the URL itself by putting a “?” at the end of the URL
  2. Passing it via a dictionary

Here’s an example :

import requests

session = requests.Session()

# Method 1 : build the parameter into the URL yourself
page = 14
url = f'https://www.example.com/data?page={page}'
response = session.get(url)

# Method 2 : let Requests build the query string from a dictionary
params = {
    'page': 14,
}
url = 'https://www.example.com/data'
response = session.get(url, params=params)

The same concept applies to the POST method, but POST data is normally passed as a dictionary; trying to stuff it into the URL itself will usually just cause errors. Example:

import requests

session = requests.Session()

login_data = {
    'username': 'username',
    'password': 'password'
}
url = 'https://www.example.com/data/login'
response = session.post(url, data=login_data)

Do note that you cannot load a POST request simply by typing a URL into your browser, since the address bar only sends GET requests. A GET URL built as in method 1, however, can be pasted straight into the browser: https://www.example.com/data?page=15 will load exactly the URL you've built.
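If you want to see the URL that Requests builds from a params dictionary, the response object exposes it as response.url. A small sketch using httpbin.org (an echo service, not the example.com placeholder) to make the result visible:

import requests

# httpbin.org/get echoes back the query string it received
response = requests.get('https://httpbin.org/get', params={'page': 15})

print(response.url)              # https://httpbin.org/get?page=15
print(response.json()['args'])   # {'page': '15'}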

Step 4: Handling Pagination

Pagination is the process of breaking a big dataset into more manageable chunks; websites do this to keep loading and response times down. Nobody wants to browse a site that takes a minute to load its data, right? So developers paginate their datasets, with something like 50 entries per page being a common choice for quick responses. That's inconvenient for web scrapers, but a while loop or even a for loop can handle it. A while loop combined with try/except error handling works well for sites that use AJAX-style pagination (an API), and if there is an API, life is easy. Here's a sample from the Spotify project showing pagination at work:

import requests

session = requests.Session()

# Excerpted from the project: Playlist_Link is the playlist API endpoint,
# and OFFSET_VARIABLE, offset_data and headers are defined elsewhere in the script
Playlist_Link = 'https://www.example.com/data'

offset = OFFSET_VARIABLE
page = 0
offset_data['offset'] = offset

response = session.get(url=Playlist_Link, headers=headers, params=offset_data)

while offset != None:
    if response.status_code == 200:
        Tdata = response.json()['trackList']
        page = response.json()['nextOffset']

        ### Do Something Here

        if page != None:
            # Request the next chunk using the offset the server returned
            offset_data['offset'] = page
            response = session.get(url=Playlist_Link, params=offset_data, headers=headers)
        else:
            break

This keeps sending GET requests to the endpoint with the current offset attached, letting you scrape every page of data until the server returns None for the next offset.
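If the site you're scraping uses plain numbered pages rather than server-supplied offsets, the same idea reduces to a loop that increments the page parameter until the server stops returning results. A generic sketch, assuming a hypothetical endpoint that returns a JSON list under an 'items' key:

import requests

session = requests.Session()
url = 'https://www.example.com/data'   # hypothetical paginated endpoint

all_items = []
page = 1

while True:
    response = session.get(url, params={'page': page})
    if response.status_code != 200:
        break

    items = response.json().get('items', [])   # assumed response shape
    if not items:
        break          # an empty page means we've reached the end

    all_items.extend(items)
    page += 1

print(f'Collected {len(all_items)} items across {page - 1} pages')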

Step 5: Handling Errors

Now for error handling. It is always crucial to manage failures appropriately when web scraping. From experience, you do not want to skip it, because recovering from an unhandled failure can be very time-consuming. Say you're scraping a website with about 3,000 entries to collect: if you don't handle errors such as timeouts, connection failures, HTTP problems, ValueError, or even AttributeError, you're going to rip the hair off your skull. Trust me on that. A scrape can run for minutes or hours depending on what you're doing, and if it fails without proper error handling you have to start all over, losing your data and a lot of time. So it's better to handle errors properly from the start. You can collect the errors in a separate list, write them to a file for better logging, or use Python's logging module to do it properly, but here's a basic approach. Sample code from https://github.com/surenjanath/Spotify-Playlist-Download


# Excerpted from the project: OFFSET_VARIABLE, Playlist_Link, headers,
# offset_data, session and get_ID() are defined elsewhere in the script
offset = OFFSET_VARIABLE
page = 0
offset_data['offset'] = offset

response = session.get(url=Playlist_Link, headers=headers, params=offset_data)

while offset != None:
    if response.status_code == 200:
        Tdata = response.json()['trackList']
        page = response.json()['nextOffset']

        for count, song in enumerate(Tdata):
            yt_id = get_ID(session=session, id=song['id'])

            if yt_id is not None:
                try:
                    pass   ## Doing Operations (download, parse, save, ...)
                except Exception as error_occured:
                    print('[*] Error Code : ', error_occured)
            else:
                print('[*] No data found for : ', song)

        if page != None:
            offset_data['offset'] = page
            response = session.get(url=Playlist_Link, params=offset_data, headers=headers)
        else:
            break

This checks the HTTP status and catches any other errors that may occur, printing the error message if there is one and carrying on with the operations otherwise. As you can see, I don't rely on try/except alone: I also guard with if statements, which helps when the data you're scraping doesn't always return a value. Either approach can cause you grief if it isn't done carefully.
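Since Requests raises specific exception classes, you can also catch them individually, retry a few times, and keep a log of what failed instead of letting one bad entry kill the whole run. A minimal sketch, with the URL and retry count as assumptions rather than values from the project:

import requests

session = requests.Session()
url = 'https://www.example.com/data'   # hypothetical endpoint
errors = []                            # collect failures for later review

for attempt in range(3):               # retry up to 3 times
    try:
        response = session.get(url, params={'page': 1}, timeout=10)
        response.raise_for_status()    # turn 4xx/5xx responses into HTTPError
        data = response.json()
        break
    except requests.exceptions.Timeout:
        errors.append((url, 'timeout'))
    except requests.exceptions.ConnectionError as e:
        errors.append((url, f'connection error: {e}'))
    except requests.exceptions.HTTPError as e:
        errors.append((url, f'HTTP error: {e}'))
        break                          # a 404/500 won't fix itself, stop retrying
    except ValueError as e:            # response body was not valid JSON
        errors.append((url, f'bad JSON: {e}'))
        break

print('Errors logged:', errors)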

Conclusion

Getting comfortable with Python Requests can give you a lot of freedom and control over your web scraping projects. It is a powerful tool that lets us handle cookies, headers, and sessions, as well as pagination and errors. In this article, we covered five important techniques for mastering web scraping with Python Requests: handling HTTP headers, using sessions, passing parameters such as params or data, handling pagination, and handling errors. By mastering these techniques, we can scrape data from websites more efficiently and effectively, and deal with the various situations that come up during a scrape. Learn to handle cookies, headers, sessions, pagination, and errors, and you'll be well on your way to mastering the more advanced scraping features of Python Requests. Lastly, as with any web scraping, always follow ethical guidelines and make sure the scraping does not violate any terms of service or legal restrictions.

Here are the links to the full code that the sample snippets were taken from:

Github : https://github.com/surenjanath/Spotify-Playlist-Download
Medium Article for that Project : Automating Spotify Playlist Music Download [ Spotify Free Version ]

If you liked this, follow me for more web scraping and other Python programming articles.

If you want to support me: https://www.buymeacoffee.com/surenjanath

See Part 2 : Mastering Web Scraping with Python Requests — Part 2


