Data Scraping with Google Sheets to assist Journalism and OSINT — Tutorial
The internet is full of data which can be used to make your life easier. Today’s tutorial will look at taking such data and turning it into a useful format for journalists and open-source intelligence analysts.
First, a bit about what we will cover. This tutorial will be based on a small project I did which converted data from a Wikipedia article into a Google Maps file.
The data was a table of Chinese Airbases and their coordinates, but in its format, it was cumbersome to analyse, and manually adding each base to a Maps projects would have taken a long time.
I employed a very useful data scraping formula in Google Sheets which was able to import the table in full, which was then able to be formatted to work in a custom Maps file.
This project has inspired others to add additional data, most notably, Twitter user @GONZOs_int, who has taken it further and added many more nations to the project.
Why use data scraping?
Data Scraping is a great way to compile large quantities of information for additional research. As someone who enjoys looking at satellite imagery, having all data pinned on maps is a really useful tool as it saves me adding each point manually.
It also allows you to start off quickly and build templates for research. It also, to a degree, removes user error.
This particular method that I will walk you through has its limitations – namely, the source of data – but the time savings allow you to check the quality of data in a much more visual manner on a map.
Getting started
The best types of information to get started on are tables (Wikipedia is great for this) which have coordinates. These can be easily scraped and collated into a spreadsheet and offer a lot of additional information.
If you want to follow along with this tutorial, I recommend using the same Wikipedia page that I used. Another good one to check out is Wikipedia’s “List of satellite map images with missing or unclear data”. The latter uses a number of tables so will require more effort, but is arguably more interesting as it contains lots of anomalies and censored locations. Who doesn’t want to speculate on the world’s most secretive spy bases?
Before we go further, I want to reassure people with limited spreadsheet experience. The formulas can look a bit daunting, but everything can be copy/pasted and you don’t need any mathematical skills to proceed.
Data Scraping Wikipedia tables in Google Sheets
The code needed to scrape data from the PLA air bases page is a simple Import HTML formula.
=ImportHtml("https://en.wikipedia.org/wiki/List_of_People%27s_Liberation_Army_Air_Force_airbases", "table", 6)
To explain this,
=ImportHtml
Tells Google Sheets to import the code on a particular page and at a particular location.
("WEB URL", "table", TABLE NUMBER)
The “WEB URL” is simply the URL of the page you want to scrape from. It needs to be in quotation marks.
The “table” simply tells Sheets to import the HTML as a table. There are other elements you can import, but this is probably the best starting point.
The TABLE NUMBER depends on the web page. If all works well, you might be able to use the number 1. In this case, the Sheet imports the wrong element as seen below.
To find the table number, you might have to use trial and error. I went through numbers 1–6 before I found the table that I needed.
Formatting the data to be transferred to a Maps file
To turn this into a Maps or Google Earth file, we first need to know what format it needs to be in.
Luckily, you can import CSV files which are a simple spreadsheet format.
Sadly, we can’t import it just yet because the coordinates need to be formatted.
When importing, we need the coordinates in separate latitude and longitude columns.
Below is the entry for the first base. It displays the coordinates in a number of ways. We only need one.
30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000
Let’s pick the last one (see below in bold and italic).
30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000
With that number in mind, we need to consider how we separate out those values. Manually is off the table. It takes too long, and does not prepare you for larger datasets.
We want to remove everything before the forward slash, as well as the forward slash itself. We also want to remove the semi-colon, and finally, we want to split the latitude and longitudes up into different columns.
I actually prefer using Excel for spreadsheet formatting because it offers better choices, so I downloaded the spreadsheet and loaded it into Excel.
To filter out all the unwanted data we can use the find and replace tool in Excel. This is easily accessible by hitting Control+H.
To use Find and replace, we need to replace anything before the forwards slash. If you haven’t used an * before, it is common computer speak for “anything”.
So having a “*/” means anything before the forward slash.
We can also remove the last characters “). The [“] quote symbol is easy enough to remove in the same way, but the bracket can be a hassle to change since it is a special character.
If you want to do it exactly the way I did it, create a new column called “Coordinates Fixed” and used a formula to remove the last character (an alternative method is about to be addressed which involves pasting the values as text which I’ll cover shortly).
=LEFT(C2, LEN(C2)-1)
Now we have the bracket removed from the coordinates, we just need to split them up into two columns, Latitude and Longitude. Make those columns and copy the coordinates data.
Once copied, you need to paste only the values. This is the other way to get around special characters causing trouble in the find and replace feature.
Right click on the first cell and you can see the paste options. It is important to paste only the values as we need to edit them as a string and not as a formula.
Then we can do the find and replace for anything after the semi-colon by using ;* in the find section and a blank space in the replace.
Finally, do the same to the longitude column, but remove everything before the semicolon by using find and replace and typing *; into the box.
You now have a formatted set of coordinates. Export the file as a .CSV and upload it to My Maps or Google Earth.
When it goes wrong
The process is relatively fool-proof but you need to check it all works before publishing.
In my example map, there is a problem. You can see that a base is present near Africa. Why is this?
To investigate this, we can refer back to our dataset. We need to look for the entry for Subi Reef.
This can now be fixed manually alongside any other errors, which hopefully, should be few and far between!