Prospecting for Data on the Web


At Enthought we teach a lot of scientists and engineers about using Python and the ecosystem of scientific Python packages for processing, analyzing, and visualizing data. Most of what we teach involves nice, clean data sets–collections of data that have been carefully collected, scrubbed, and prepared for analysis. While we also mention in passing the idea of collecting data from the web, work a few examples of general data cleanup, and at least show our students each of the tools needed, we seldom have enough time in class to follow a complete, practical example of web data prospecting from end to end. This newsletter should help remedy that.

The Problem

While the internet is a great resource for many things, including data, the web’s wild and tangled nature presents a few problems:

  • Searches are hit and miss. If you happen upon the right keywords, you may find what you are looking for. Then again, if you try the same keywords next week, or next year, you may get very different results.
  • If you find what you need one day, it might disappear another day. The whole site could be taken down. Or, the webmaster could decide to reorganize the pages, changing their addresses in the process.
  • Even if a page remains in place, the information on the page could be updated or reformatted in such a way that you cannot rely on the same process finding the same information.

In short, if you don’t plan on the web being an ephemeral resource (a metaphorical “intermittent stream” of data), you can have problems incorporating that data into your applications, tools, and research. As a result, the best approach is often to have good tools that allow you to easily collect what you find, clean it up for your own needs, and then save locally. This is where Python—and Pandas in particular—really shines.

Step 1 – Find the Data Resource

This is often the hardest part of the process. Using a web search engine, find a web page with the data you want. If you are really lucky, you will have a download link with a sanely formatted text file, a spreadsheet, or some other known format. Alternatively, you may simply find a web page that displays your data with a whole bunch of other stuff that you really don’t care about (images, links, other text, etc.).

As a simple example here, we are going to find some historical financial information about Amgen on the web. In particular, we are looking for dividend information. After identifying our page, we store the URL (the web address) in a variable.

>>> url = ''

This is a typical page with some extra stuff that we don’t care about, but it does have the data we need.

Step 2 – Scrape the Data Table(s)

Next we are going to use Pandas to retrieve the data embedded in the page. Pandas has a nice read_html() function that will pull down any tables on a web page and present them to us as Pandas DataFrames.

>>> import pandas as pd
>>> pd.read_html(url)
HTTPError: HTTP Error 403: Forbidden

Hmmm… that wasn’t nice. Looks like the https web protocol is causing a problem in this case. Basically read_html() doesn’t use https and tries to fall back to plain old http which is forbidden on the web server that we are trying to contact. For this URL we will try impersonating a web browser that can handle the secure protocol and see if we can work around the error. We can use the Python library requests to accomplish this task.

>>> import requests
>>> requests.get(url)
<Response [200]>

That’s better. We see the 200 OK success status response code that indicates the request has succeeded. Let’s actually grab the response and feed it into Pandas.

>>> response = requests.get(url)
>>> pd.read_html(response.text)
[    Record      Declared     Payable     Amount
0    11.16.2021  10.21.2021   12.08.2021  1.76

Note the leading ‘[‘ in the output above that tells us that Pandas is returning a list of something to us (we are expecting DataFrames). Let’s save the list and see what we have.

>>> tables = pd.read_html(response.text)
>>> type(tables)
<class 'list'>
>>> len(tables)

Good, there is only one item in the list. Sometimes there are several and we have to figure which one (or which ones) we are really interested in. Let’s take a look at the first entry.

>>> type(tables[0])
>>> tables[0].head()
               Record    Declared     Payable  Amount
0          11.16.2021  10.21.2021  12.08.2021    1.76
1          08.17.2021  07.30.2021  09.08.2021    1.76
2          05.17.2021  03.03.2021  06.08.2021    1.76
3          02.15.2021  12.16.2020  03.08.2021    1.76
4  Total dividends in 2021:   NaN        NaN     7.04

That is the right data, but it could use some clean up. We’ll take care of that next, after saving the data into a new variable.

>>> dividends = tables[0]

Step 3 – Clean Up the Data

There are several “Total dividends” lines in the data that basically mess up the data set, making it harder to work with. Let’s identify those in the DataFrame by looking at the contents of the Record column, creating a mask, and then removing any that do not match the mask.

>>> mask = dividends.Record.str.startswith('Total ')
>>> dividends = dividends[~mask]
>>> dividends.head()
       Record    Declared     Payable  Amount
0  11.16.2021  10.21.2021  12.08.2021    1.76
1  08.17.2021  07.30.2021  09.08.2021    1.76
2  05.17.2021  03.03.2021  06.08.2021    1.76
3  02.15.2021  12.16.2020  03.08.2021    1.76
5  11.16.2020  10.21.2020  12.08.2020    1.60

That did the trick. Now we have a bunch of date columns in the data in which the dates are stored as strings. Those are not very useful unless we just want to print them or look at them. Let’s turn them into real Python datetime objects. First let’s see if the Pandas to_datetime() function will work without too much trouble on this date format.

>>> pd.to_datetime('11.16.2021')
Timestamp('2021-11-16 00:00:00')

That looks good. Let’s apply pd.to_datetime() to each of our date columns (Record, Declared, and Payable), which are all but the last one (Amount).

>>> dividends.iloc[:, :-1] = dividends.iloc[:, :-1].apply(
>>> dividends.dtypes
Record      datetime64[ns]
Declared    datetime64[ns]
Payable     datetime64[ns]
Amount             float64
dtype: object
>>> dividends.head()
       Record    Declared     Payable  Amount
0  2021-11-16  2021-10-21  2021-12-08    1.76
1  2021-08-17  2021-07-30  2021-09-08    1.76
2  2021-05-17  2021-03-03  2021-06-08    1.76
3  2021-02-15  2020-12-16  2021-03-08    1.76
5  2020-11-16  2020-10-21  2020-12-08    1.60

Much better. Now we can use any of the date columns as a time series index in our analysis. The data is in reverse chronological order, so let’s get it sorted into a more usual order. At the same time, the original numerical index showing where we dropped the “Total dividends” lines isn’t all that useful, so let’s reset that too.

>>> dividends = dividends.sort_values('Record')
>>> dividends = dividends.reset_index(drop=True)
>>> dividends.head()
       Record    Declared     Payable  Amount
0  2011-08-18  2011-07-28  2011-09-08    0.28
1  2011-11-17  2011-10-13  2011-12-08    0.28
2  2012-02-15  2011-12-15  2012-03-07    0.36
3  2012-05-16  2012-03-15  2012-06-07    0.36
4  2012-08-16  2012-07-19  2012-09-07    0.36

That looks good.

Step 4 – Save the Data

Before you do anything else with this data, it would be a good idea to save it to your local disk in a usable format. A reasonable choice is a CSV file (but you could choose whatever format you are accustomed to working with).

>>> dividends.to_csv('dividends.csv', index=False)

That’s it. You now have a sanely formatted CSV on disk. If you reload it into Pandas you will need to reprocess the date columns. However, your dates are now in a standard ISO date format (yyyy-mm-dd) that is easy to parse and can be easily converted by many software packages without issue.

That’s the web data prospecting process. It is an iterative process that we typically do a step at a time, figuring out what we need to fix or do next as we go. Just make sure you save anything you need before moving on!


About the Author

Eric Olsen holds a Ph.D. in history from the University of Pennsylvania, a M.S. in software engineering from Pennsylvania State University, and a B.A. in computer science from Utah State University. Eric spent three decades working in software development in a variety of fields, including atmospheric physics research, remote sensing and GIS, retail, and banking. In each of these fields, Eric focused on building software systems to automate and standardize the many repetitive, time-consuming, and unstable processes that he encountered.

Share this article:

Related Content