Web Scraping and Data Cleaning NLCB Lotto Results
Hello, everyone. Today we are going to scrape and clean some Trinidad and Tobago NLCB Lotto results from a website (with no posted disclaimer) called: http://www.nlcbplaywhelotto.com/
Please note that scraping data from websites may be subject to legal and ethical considerations, so be sure to read and comply with the terms of service and any applicable laws or regulations. This is just for educational purposes and for data analysis.
Website Investigation
Anyway, to scrape the site, we first must inspect it and learn how it works, including which HTTP methods (GET/POST) it uses. To do this, we head over to the website, scroll down to ‘NLCB Lotto Plus Results’ (as that’s the information we want), and click on ‘Latest Results’.
You will see something similar to the screenshot below :
Now right-click (on Google Chrome) and select Inspect, and the DevTools panel will open.
Select Network in the list of tabs and clear it by clicking the circle with the line across it (the prohibition sign) 🚫
Make sure the record button (the first red button) is enabled and no requests are showing in the tab. Now we want to do a search; it’s easier to use the month search query than the other queries.
We now click on search and look at the requests sent through the devtool.
Looking closely at the requests in the DevTools, you will see that the data comes back as HTML, with our results sitting in a table inside the page. All the other requests are not useful to us, so we take that URL and see what else we can get from it.
Now we look at the URL.
URL = http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/?monthyear=Jan-23
Hmm, whatever comes after the ? is what we call a query parameter. So this URL took a parameter called monthyear, where on the previous page we had chosen Jan and 2023. What if we had chosen Feb 2022? Well, you can try that and see what happens. This seems like a simple request, and in each request we can see that there’s a table filled with lovely data. No authentication, no cookies, just plain old HTML data.
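Before we write the full scraper, here's a quick sketch of how requests assembles this URL from a params dictionary (using the Feb-22 example from above), so we never have to build the query string by hand:

import requests

req = requests.Request('GET', 'http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/',
                       params={'monthyear': 'Feb-22'})
print(req.prepare().url)
# http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/?monthyear=Feb-22

Now let’s set up our code.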
Getting the HTML data with Requests
Right, so let’s get this set up.
The libraries we would be using :
- Requests
- Pandas (you’re in for a secret)
Now, I know you may be wondering: why are we using pandas?
What is pandas? In short, pandas is an open-source Python library for data manipulation and analysis, but it can also scrape data from websites whenever a page stores its data in an HTML table, which our page does. Yes, we could scrape the data directly with pandas, but it’s faster to fetch the HTML with requests and pass it to pandas to extract the table.
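For the curious, here is that secret in action: pd.read_html can fetch and parse the page in a single call. A minimal sketch of the pandas-only route (assuming the page is reachable):

import pandas as pd

# One call: download the page and extract every <table> into a DataFrame.
tables = pd.read_html('http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/?monthyear=Jan-22')
print(len(tables))
print(tables[0].head())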
So let’s start.
Import our libraries :
import requests
import pandas as pd
If, as usual, you do not have the libraries above, you can always run pip install requests pandas
Back to our website: we know that the URL takes a monthyear parameter like Jan-23, which is Month-Year, so we need to generate these values. Also keep in mind that we saw the results go from 2001 to the current date. Wonder what happens if we send a request outside that range? If you’ve tried it, you’ll know: it returns ‘No result was found for Jan-00.’ I tried Jan 2000, no data.
Ok so to generate our parameter we would do the following :
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Year = ['{:02d}'.format(i) for i in range(1, 24)]  # '01' to '23' -> 2001 to 2023
So the code above is what we will use to create our parameters. In total, we have 12 x 23 = 276 links to scrape.
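To double-check ourselves, we can combine the two lists and count the monthyear values we will loop over:

params_list = [f'{m}-{y}' for y in Year for m in month]  # all month-year combinations
print(len(params_list))  # 276
print(params_list[:3])   # ['Jan-01', 'Feb-01', 'Mar-01']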
Let’s scrape a link and see what our results are like.
import requests
import pandas as pd

month = 'Jan'  # 3-letter month, per the monthyear format
year = 22      # last 2 digits of the year
url = 'http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/'
params = {
    'monthyear': f'{month}-{year}'
}
r = requests.get(url=url, params=params)
if r.status_code == 200:
    df = pd.read_html(r.content)  # a list of DataFrames, one per table
else:
    print('[*] Error Occurred : ', r.status_code)
If we get the error ‘ImportError: html5lib not found, please install it’, we just need to run: pip install html5lib
If our request is successful, we should see a table like :
As you can see above, we got a table, but it’s inside a list, so we only need df[0] to see our table better.
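A quick look at what pd.read_html actually handed back:

print(type(df), len(df))  # <class 'list'>, one DataFrame per <table> on the page
df[0].head()              # our results table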
Why does our table look like this? Our results are exactly what we want, just as they are shown on the website.
So now we would need to do some simple data manipulation.
Data Manipulation
Looking closely at our table, you can see that the dates are always on the even rows and our data is on the odd rows, which is great. Let’s try another case and see whether this holds for a different request: month=Dec and year=03 (always remember that the parameter is the 3-letter month and the last 2 digits of the year).
Running our code again: nice, it held true for this request as well, so we can assume that all our requests will have similar results. Now let’s make this table more readable.
A super nice and neat pythonic way to get the even rows in pandas is the following code: df[0].iloc[::2]
What .iloc and [::2] mean:
- .iloc : a method that accesses the DataFrame df using integer-based indexing.
- [::2] : a slice that selects every second row of the DataFrame, i.e. the even rows, starting from the first row (the row with index 0).
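A tiny toy example (made-up values, not the lottery data) to see the slicing in action:

import pandas as pd

# Toy frame mimicking the layout: dates on even rows, draw data on odd rows.
demo = pd.DataFrame({'Draw#': ['01-Jan-22', 'draw 1', '05-Jan-22', 'draw 2']})
print(demo.iloc[::2])   # rows 0 and 2 -> the dates
print(demo.iloc[1::2])  # rows 1 and 3 -> the draw data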
Now we need to reset the index and also rename a column to Date, which we do using the following code:
df[0].iloc[::2].reset_index().drop(columns=['index']).rename(columns={'Draw#':'Date'})['Date'] # Even Rows
- .reset_index() : resets the index to 0, 1, 2, 3, etc.
- .drop(columns=['index']) : drops the leftover index column that reset_index() creates.
- .rename(columns={'Draw#':'Date'}) : renames the Draw# column to Date, since these even rows hold the dates.
- ['Date'] : slices that single column out of the DataFrame.
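Continuing the toy example from above, here is what that chain does step by step:

even = demo.iloc[::2]                          # index is 0, 2
even = even.reset_index()                      # the old index becomes a column named 'index'
even = even.drop(columns=['index'])            # drop it; the index is now 0, 1
even = even.rename(columns={'Draw#': 'Date'})  # rename the column
print(even['Date'])                            # slice out Date as a Series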
You can read more about these methods in the pandas documentation.
Now let’s deal with the odd rows.
We would use the code: df[0].iloc[1::2].reset_index().drop(columns=['index']) # Odd Rows
Nice, now our data is looking promising. The reason for resetting the index is that it gives us a common key to merge the dates and the data together, because we do want our dates alongside our data. So to merge our tables, we will do the following:
dates = df[0].iloc[::2].reset_index().drop(columns=['index']).rename(columns={'Draw#':'Date'})['Date'] # Even Rows
Data = df[0].iloc[1::2].reset_index().drop(columns=['index']) # Odd Rows
table = pd.merge(dates,Data,left_index=True, right_index=True).copy()
table
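A side note on the design: since both frames now share the same fresh 0-based index, pd.concat should line them up identically; the index merge above is just the explicit version of that alignment:

table_alt = pd.concat([dates, Data], axis=1)  # positional alignment, same result here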
Now our data looks quite good. We need to do some simple cleanup in the Jackpot and Multiplier columns, and we also need to convert the Date column to a datetime object.
To clean the data :
def clean_jp(row):
    # Strip the '$' and thousands separators, then convert to float.
    try:
        return float(str(row).replace('$', '').replace(',', ''))
    except Exception:
        return 0  # anything unparseable becomes 0
table['Jackpot'] = table['Jackpot'].apply(clean_jp)
table['Date'] = pd.to_datetime(table['Date'], dayfirst=True)
table
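To sanity-check the cleaner, we can call it on a couple of sample values (the jackpot string below is made up for illustration):

print(clean_jp('$4,987,528.04'))  # 4987528.04
print(clean_jp(None))             # 0, the fallback for anything unparseable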
Now that we have done this for one page of data, let’s set it up to run over all
12 x 23 pages. Afterwards, we will do further cleaning.
Looping through 276 pages of data
Right let’s set up our final scraper then we will do some data cleaning.
To do this, we code the following :
import requests
import pandas as pd

url = 'http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/'
MONTH = ['Jan','Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep','Oct','Nov','Dec']
YEAR = ['{:02d}'.format(i) for i in range(1, 24)]  # '01' to '23' -> 2001 to 2023
DATA = []

for year in YEAR:
    for month in MONTH:
        params = {
            'monthyear': f'{month}-{year}'
        }
        r = requests.get(url=url, params=params)
        try:
            df = pd.read_html(r.content)[0].dropna(how='all')
            dates = df.iloc[::2].reset_index().drop(columns=['index']).rename(columns={'Draw#':'Date'})['Date']  # Even Rows
            Data = df.iloc[1::2].reset_index().drop(columns=['index'])  # Odd Rows
            table = pd.merge(dates, Data, left_index=True, right_index=True).copy()
            DATA.append(table)
            print('[*] Data Scraped : ', params['monthyear'])
        except Exception as e:
            print(f'[*] Error {e} Occurred : {month}-{year}')
            continue
df = pd.concat(DATA)
df
Right, after running the code we get our results.
I ran it without the cleanup part of the code so I could see whether any errors occurred, and I put the code into a try/except block to catch them all. Given that we only have a little missing data, we are in the green to clean it.
To clean up, we would :
table = df.copy()
def clean_jp(row):
    # Strip '$' and commas, then convert to float; fall back to 0.
    try:
        return float(str(row).replace('$', '').replace(',', ''))
    except Exception:
        return 0
table['Jackpot'] = table['Jackpot'].apply(clean_jp)
table['Date'] = pd.to_datetime(table['Date'], dayfirst=True)
table['Draw#'] = table['Draw#'].astype(int)
table['Power Ball'] = table['Power Ball'].astype(int)
table
Our results are shown above.
If we try to convert Multiplier and Wins the same way, we will come across an error, so we need to look at the unique values in these two columns:
print('[*] Unique Values in Multiplier : ', table.Multiplier.unique())
print('[*] Unique Values in Wins : ', table.Wins.unique())
We will change these missing values to -1 to mark that they were absent, and then convert the data type to integer:
print('[*] Unique Values in Multiplier : ', table.Multiplier.unique())
table.Multiplier = table.Multiplier.fillna(-1)
table['Multiplier'] = table['Multiplier'].astype(int)
print('[*] New Unique Values in Multiplier : ', table.Multiplier.unique())
print('[*] Unique Values in Wins : ', table.Wins.unique())
table.Wins = table.Wins.str.replace("X",'-1')
table['Wins'] = table['Wins'].astype(int)
print('[*] New Unique Values in Wins : ', table.Wins.unique())
table
And there we have it: our data is cleaned and ready to export.
To export our data, we simply run: table.to_excel('NLCB_Lotto_Results-2001-2023.xlsx', index=False)
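Note that to_excel (and read_excel) typically needs the openpyxl package installed (pip install openpyxl). As a quick sanity check, we can read the file back in:

check = pd.read_excel('NLCB_Lotto_Results-2001-2023.xlsx')
print(check.shape)   # should match our table's rows and columns
print(check.dtypes)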
Some recommendations:
- As you’ve noticed webscraping via a for loop takes a few minutes but this we can use asynchronous scraping instead which means sending out all the links and saving the data as it returns. This method is way faster than normal non async way. In another article we will do that.
- The numbers column can be split into balls 1 to 5, as that’s how many chances there are (see the sketch after this list).
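Here is the promised quick sketch of parallel scraping. I use a thread pool here rather than asyncio/aiohttp, which is one simple way to send the requests concurrently; the fully asynchronous version is left for the follow-up article:

import requests
from concurrent.futures import ThreadPoolExecutor

URL = 'http://www.nlcbplaywhelotto.com/nlcb-lotto-plus-results/'
MONTH = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
YEAR = ['{:02d}'.format(i) for i in range(1, 24)]

def fetch(monthyear):
    # Fetch one month's page; return the HTML bytes, or None on failure.
    try:
        r = requests.get(URL, params={'monthyear': monthyear}, timeout=30)
        return r.content if r.status_code == 200 else None
    except requests.RequestException:
        return None

monthyears = [f'{m}-{y}' for y in YEAR for m in MONTH]
with ThreadPoolExecutor(max_workers=12) as pool:
    pages = list(pool.map(fetch, monthyears))  # HTML for all 276 pages, in order

And for splitting the numbers, a hypothetical sketch. I am assuming the scraped column is named 'Numbers' and holds a single whitespace-separated string; check your actual column name and delimiter and adjust the split accordingly:

# 'Numbers' is an assumed column name; inspect table.columns for the real one.
balls = table['Numbers'].astype(str).str.split(expand=True)
balls.columns = [f'Ball{i}' for i in range(1, balls.shape[1] + 1)]
table = table.join(balls)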
So what are your final thoughts on this. You can scrape the PlayWhe section as a challenge. Just for fun. ^_^
Thanks for reading my article.
I hope you find this Medium article helpful and easy to follow. Also, share this article with your friends. Happy Coding!
GitHub Link along with the scraped data :
Link : https://github.com/surenjanath/Lotto_Webscraping