Mastering Web Scraping with Python Requests — Part 2
In “Mastering Web Scraping with Python Requests”, we introduced Python Requests as a powerful tool for web scraping and emphasized the importance of handling HTTP headers, using sessions, passing parameters, handling pagination and handling basic errors, with sample code and explanations for each step. In this article we expand on that foundation to take your web scraping further. So let’s get into it.
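As a quick refresher before we continue, here’s the kind of request we were making by the end of Part 1: a Session carrying custom headers, with query parameters passed as a dictionary. The header value and URL below are just placeholders, not from the original article:
import requests
# A session reuses the connection and keeps cookies and headers across requests
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'})
# Query parameters are passed as a dict and encoded into the URL for us
params = {'page': 2, 'q': 'python'}
response = session.get('https://www.example.com/search', params=params)
print(response.url)           # final URL, e.g. https://www.example.com/search?page=2&q=python
print(response.status_code)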
Step 5 (cont’d): Handling response status code errors and timeouts with Requests
Websites often return errors, most commonly a 404, sometimes a 500, or even a 504. When you’re scraping data this can be very frustrating, because an unhandled error will simply break your code and waste a lot of your time. Requests has a built-in HTTPError exception that we can catch to handle these cases, which gives us more information about the error and possibly a way to deal with it. We can also set a timeout for the request, so the code doesn’t hang indefinitely when a website is unresponsive.
For example:
import requests
# Set the timeout
timeout = 5
try:
    response = requests.get('https://www.example.com', timeout=timeout)
    response.raise_for_status()
    print('[*] Results : ', response.content)
except requests.exceptions.HTTPError as error:
    print('[*] Error occurred : ', error)
In this example, we set a timeout of 5 seconds for the request; if the website does not respond within 5 seconds, Requests raises a timeout error. We also added a try-except block to handle any HTTP errors that might occur: the raise_for_status() method raises an HTTPError if the response came back with an error status code.
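Note that the except block above only catches HTTPError. If you also want to survive the timeout itself (and connection problems) without the script crashing, you can catch the more specific Requests exceptions as well. A minimal sketch, using the same placeholder URL:
import requests
timeout = 5
try:
    response = requests.get('https://www.example.com', timeout=timeout)
    response.raise_for_status()
    print('[*] Results : ', response.content)
except requests.exceptions.Timeout:
    # The server did not answer within `timeout` seconds
    print('[*] Request timed out')
except requests.exceptions.HTTPError as error:
    # raise_for_status() raised because of a 4xx/5xx status code
    print('[*] HTTP error occurred : ', error)
except requests.exceptions.RequestException as error:
    # Catch-all for other Requests problems (connection errors, invalid URLs, ...)
    print('[*] Request failed : ', error)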
Here are some response status codes that you will commonly come across when scraping data; they tell you whether the request succeeded or what kind of error occurred.
- 200 OK: The request was successful, and the response body contains the requested data.
- 201 Created: The request was successful, and a new resource was created.
- 204 No Content: The request was successful, but there is no data to return.
- 400 Bad Request: The request was invalid or incomplete.
- 401 Unauthorized: The request requires authentication, and the user is not authenticated.
- 403 Forbidden: The user is authenticated, but does not have permission to access the requested resource.
- 404 Not Found: The requested resource could not be found. This is the most common error code you’ll see.
- 405 Method Not Allowed: The requested HTTP method is not allowed for the requested resource.
- 500 Internal Server Error: There was an error on the server while processing the request.
- 502 Bad Gateway: The server acting as a gateway or proxy received an invalid response from the upstream server.
- 503 Service Unavailable: The server is currently unable to handle the request due to a temporary overload or maintenance of the server.
There are other response codes as well; it depends on the website or API, because some websites define their own custom error codes, so be on the lookout for those.
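How you react usually depends on the family of the code: 2xx means you can go ahead and parse the response, most 4xx codes (like 404) mean the page simply isn’t there and can be skipped, while 429 and the 5xx codes are often temporary and worth retrying later. A small hypothetical helper (not from the original article) to make that decision could look like this:
def decide(status_code):
    """Return what to do with a response based on its status code."""
    if 200 <= status_code < 300:
        return 'parse'   # success, go ahead and extract the data
    if status_code == 429 or status_code >= 500:
        return 'retry'   # rate limited or server trouble, try again later
    return 'skip'        # client errors like 404 usually mean the page is gone

print(decide(200))   # parse
print(decide(404))   # skip
print(decide(503))   # retry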
Here’s another example of dealing with the status code when it’s not 200 OK.
import requests
# Set the timeout
timeout = 5
try:
    response = requests.get('https://www.example.com', timeout=timeout)
    if response.status_code == 200:
        print('[*] Results : ', response.content)
    else:
        print('[*] Status Error Code : ', response.status_code)
        response.raise_for_status()
except requests.exceptions.HTTPError as error:
    print('[*] Error occurred : ', error)
In the example above, we set a timeout again and check the status code ourselves: if it’s 200 we print the results, otherwise we print the status code and let raise_for_status() escalate it to the except block as an HTTPError.
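Since this step mentions sessions: when you’re making many requests to the same site, you can combine a requests.Session with urllib3’s Retry so that transient errors (429 and the 5xx family) are retried automatically with a back-off delay instead of being handled by hand every time. The retry counts and URL below are just illustrative choices:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times on typical transient errors, waiting longer between attempts
retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retry)
session = requests.Session()
session.mount('https://', adapter)
session.mount('http://', adapter)

try:
    response = session.get('https://www.example.com', timeout=5)
    response.raise_for_status()
    print('[*] Results : ', response.content)
except requests.exceptions.RequestException as error:
    print('[*] Request failed : ', error)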
Step 6: Parsing HTML with BeautifulSoup
Now that we have fetched the page, we need to parse it. Parsing is the process of extracting the relevant data from a website’s HTML, and this can be done with BeautifulSoup, a Python library that makes it easy to pull information out of web pages. Like Requests, BeautifulSoup is an external library: to install it, open your terminal and run pip install beautifulsoup4.
Now we can parse our data. For example:
import requests
from bs4 import BeautifulSoup as bs
# Make the request
response = requests.get('https://www.example.com')
soup = bs(response.content, 'html.parser')
# Find the relevant data
title = soup.find('h1', {'class':'title'})
print(title.text)
In this example, we make a request to an example website and get the HTML content. We then parse the HTML using BeautifulSoup and search for the relevant data, which in this case is the title of the page.
We can use other parsers such as lxml, which in my opinion is the fastest. To use it, you’d need to pip install lxml and pass 'lxml' as the parser. For example:
import requests
from bs4 import BeautifulSoup as bs
# Make the request
response = requests.get('https://www.example.com')
soup = bs(response.content, 'lxml')
# Find the relevant data
title = soup.find('h1', {'class':'title'})
print(title.text)
You can also parse for other kinds of data; see the BeautifulSoup documentation for the full range of search methods.
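For example, here are a few common patterns; the tag and class names are just illustrative, so adjust them to the page you’re scraping:
import requests
from bs4 import BeautifulSoup as bs

response = requests.get('https://www.example.com')
soup = bs(response.content, 'lxml')

# All links on the page, reading the href attribute from each <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))

# CSS selectors work too: every <p> inside a <div class="content">
for p in soup.select('div.content p'):
    print(p.text.strip())

# Grab a single attribute, e.g. the src of the first image (if there is one)
img = soup.find('img')
if img is not None:
    print(img.get('src'))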
Also note that when parsing with BeautifulSoup, some elements carry more than one class (e.g. class="title heading"). Passing just one of those classes to find() is enough to match the element, and if you need to match on several classes at once you can use a CSS selector with select_one(), putting a period (.) in place of the spaces, e.g.
title = soup.select_one('h1.title.heading')
But wait: how do you know what to look for in the first place? Open up your browser’s DevTools, click the element-picker (inspect) icon, then click the element you’re interested in to see its tag and classes. Say the element is a button with the class 'button' whose text reads 'Please'. To get that button, you’d use the following:
btn = soup.find('button', {'class':'button'})
To get the button text, we’d just :
btn_text = btn.text
# Which would yield 'Please'
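One gotcha worth knowing: find() returns None when nothing matches, and calling .text on None raises an AttributeError. When you’re scraping many pages it pays to guard against that, for example:
btn = soup.find('button', {'class': 'button'})
if btn is not None:
    btn_text = btn.text.strip()
else:
    btn_text = ''   # or log the miss and skip this page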
Step 7: Saving scraped data to a file
After doing all the work of coding and testing our scraper, we need to save the data to a file for future use or later analysis. Python provides several ways to do this, including CSV, JSON, Excel and plain text files, but one of my favourite approaches is to load the data into pandas, do a little cleaning, and export it as an Excel file. Pandas is such an amazing Python library.
To do this, we’d:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

page = 100
URL = f'https://fitgirl-repacks.site/page/{page}/'
r = requests.get(URL)

if r.status_code == 200:
    soup = bs(r.content, 'lxml')
    main = soup.find('div', {'class': 'content-area'})
    articles = main.find_all('article')
    print('[*] Number of Posts : ', len(articles))

    scraped_games = []
    for article in articles:
        Name = article.find('h1', {'class': 'entry-title'}).text.strip()
        DatePosted = article.find('div', {'class': 'entry-meta'}).text.strip()
        NumofComments = article.find('span', {'class': 'comments-link'}).text.strip()
        GameLink = article.find('h1', {'class': 'entry-title'}).find_next().get('href')
        magnetLink = article.find('div', {'class': 'entry-content'}).find('ul').find('li').find_all('a')[1].get('href')
        Genre = article.find('div', {'class': 'entry-content'}).find('p').find_all('strong')[0].text.strip()
        Company = article.find('div', {'class': 'entry-content'}).find('p').find_all('strong')[1].text.strip()
        OriginalSize = article.find('div', {'class': 'entry-content'}).find('p').find_all('strong')[3].text.strip()
        RepackSize = article.find('div', {'class': 'entry-content'}).find('p').find_all('strong')[4].text.strip()

        print('[*] Name : ', Name)
        print('[*] Date Posted : ', DatePosted)
        print('[*] Number of Comments : ', NumofComments)
        print('[*] Game Link : ', GameLink)
        print('[*] magnet Link : ', magnetLink)
        print('[*] Genre : ', Genre)
        print('[*] Company : ', Company)
        print('[*] Original Size : ', OriginalSize)
        print('[*] Repack Size : ', RepackSize)
        print('*' * 50)

        ## Saving to dictionary variable
        data = {
            'Name': Name,
            'DatePosted': DatePosted,
            'NumofComments': NumofComments,
            'GameLink': GameLink,
            'magnetLink': magnetLink,
            'Genre': Genre,
            'Company': Company,
            'OriginalSize': OriginalSize,
            'RepackSize': RepackSize,
        }
        ## Adding dictionary to a list
        scraped_games.append(data)

    ## Exporting list to a dataframe in pandas
    df = pd.DataFrame(scraped_games)
    ## Export to excel
    df.to_excel('Scraped_Games.xlsx', index=False)
    df  # displays the DataFrame when run in a notebook
Results: (screenshots here showed the printed posts, the resulting DataFrame and the exported Excel file)
The above code is an actual web scraper that you can copy and test yourself; it scrapes data from one of my favourite online forums. Shout out to fitgirl! It collects the data in a pandas DataFrame and then exports it to an Excel file without the index.
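Note that to_excel() needs an Excel writer engine such as openpyxl installed (pip install openpyxl). And if you’d rather use one of the other formats mentioned earlier, pandas makes that a one-liner too; a quick sketch, assuming df is the DataFrame built above:
# CSV (plain text, also opens fine in Excel)
df.to_csv('Scraped_Games.csv', index=False)
# JSON, one record per scraped post
df.to_json('Scraped_Games.json', orient='records', indent=2)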
Conclusion
In conclusion, “Mastering Web Scraping with Python Requests” and “Mastering Web Scraping with Python Requests — Part 2” cover the fundamentals of web scraping in detail, along with some more advanced techniques to get you going on your own projects. Web scraping is a powerful way to gather data from websites, and Python Requests makes it easy to fetch pages while handling HTTP headers, sessions, parameters, pagination, errors and status codes. We also learned about BeautifulSoup, which parses the HTML and extracts the relevant data, and about the different ways Python can save scraped data to a file; here we used pandas to export our data to an Excel file. With these skills, and by following this tutorial, you have a strong foundation in web scraping and practical techniques to apply to your own projects.
If you liked this, follow me for more web scraping and Python programming articles.
If you wanna support me: https://www.buymeacoffee.com/surenjanath