Data Scraping with Google Sheets to assist Journalism and OSINT — Tutorial

Tom Jarvis
7 min readOct 13, 2020

--

Every red dot is a People’s Liberation Army Air Force location. If you want to research the number of military planes China has, for example, why start from scratch? A lot of the work is already done — namely, a list of locations to start the search.

The internet is full of data which can be used to make your life easier. Today’s tutorial will look at taking such data and turning it into a useful format for journalists and open-source intelligence analysts.

First, a bit about what we will cover. This tutorial will be based on a small project I did which converted data from a Wikipedia article into a Google Maps file.

The data was a table of Chinese Airbases and their coordinates, but in its format, it was cumbersome to analyse, and manually adding each base to a Maps projects would have taken a long time.

The list of People’s Liberation Army air bases was long and manually adding each to a Maps project for further analysis would have taken a long time.

I employed a very useful data scraping formula in Google Sheets which was able to import the table in full, which was then able to be formatted to work in a custom Maps file.

This project has inspired others to add additional data, most notably, Twitter user @GONZOs_int, who has taken it further and added many more nations to the project.

Why use data scraping?

By the end of this tutorial, you will be able to very quickly tour all of China’s air bases and find some really cool aircraft — or whatever your research demands.

Data Scraping is a great way to compile large quantities of information for additional research. As someone who enjoys looking at satellite imagery, having all data pinned on maps is a really useful tool as it saves me adding each point manually.

It also allows you to start off quickly and build templates for research. It also, to a degree, removes user error.

This particular method that I will walk you through has its limitations – namely, the source of data – but the time savings allow you to check the quality of data in a much more visual manner on a map.

Getting started

The best types of information to get started on are tables (Wikipedia is great for this) which have coordinates. These can be easily scraped and collated into a spreadsheet and offer a lot of additional information.

If you want to follow along with this tutorial, I recommend using the same Wikipedia page that I used. Another good one to check out is Wikipedia’s “List of satellite map images with missing or unclear data”. The latter uses a number of tables so will require more effort, but is arguably more interesting as it contains lots of anomalies and censored locations. Who doesn’t want to speculate on the world’s most secretive spy bases?

Before we go further, I want to reassure people with limited spreadsheet experience. The formulas can look a bit daunting, but everything can be copy/pasted and you don’t need any mathematical skills to proceed.

Data Scraping Wikipedia tables in Google Sheets

The code needed to scrape data from the PLA air bases page is a simple Import HTML formula.

=ImportHtml("https://en.wikipedia.org/wiki/List_of_People%27s_Liberation_Army_Air_Force_airbases", "table", 6)
The code for importing a data table from Wikipedia into a Sheets spreadsheet. Note that the URL and the number at the end (in this case, 6) will depend on what you are researching.

To explain this,

=ImportHtml

Tells Google Sheets to import the code on a particular page and at a particular location.

("WEB URL", "table", TABLE NUMBER)

The “WEB URL” is simply the URL of the page you want to scrape from. It needs to be in quotation marks.

The “table” simply tells Sheets to import the HTML as a table. There are other elements you can import, but this is probably the best starting point.

The TABLE NUMBER depends on the web page. If all works well, you might be able to use the number 1. In this case, the Sheet imports the wrong element as seen below.

Her we can see the result of importing with the table reference of 1 (to make it work, it should be 6). This happened because there was an element higher up that it picked first.

To find the table number, you might have to use trial and error. I went through numbers 1–6 before I found the table that I needed.

This is what we should see when it works.

Formatting the data to be transferred to a Maps file

To turn this into a Maps or Google Earth file, we first need to know what format it needs to be in.

Luckily, you can import CSV files which are a simple spreadsheet format.

Sadly, we can’t import it just yet because the coordinates need to be formatted.

When importing, we need the coordinates in separate latitude and longitude columns.

Right now, the latitude and longitude data is not useable. We need to split them up and remove the characters that aren’t needed

Below is the entry for the first base. It displays the coordinates in a number of ways. We only need one.

30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000

Let’s pick the last one (see below in bold and italic).

30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000

With that number in mind, we need to consider how we separate out those values. Manually is off the table. It takes too long, and does not prepare you for larger datasets.

We want to remove everything before the forward slash, as well as the forward slash itself. We also want to remove the semi-colon, and finally, we want to split the latitude and longitudes up into different columns.

I actually prefer using Excel for spreadsheet formatting because it offers better choices, so I downloaded the spreadsheet and loaded it into Excel.

Load the data into Excel if you want an easier formatting experience.

To filter out all the unwanted data we can use the find and replace tool in Excel. This is easily accessible by hitting Control+H.

To use Find and replace, we need to replace anything before the forwards slash. If you haven’t used an * before, it is common computer speak for “anything”.

So having a “*/” means anything before the forward slash.

Alternatively, typing “/*” would replace the forward slash and anything after it.

We can also remove the last characters “). The [“] quote symbol is easy enough to remove in the same way, but the bracket can be a hassle to change since it is a special character.

If you want to do it exactly the way I did it, create a new column called “Coordinates Fixed” and used a formula to remove the last character (an alternative method is about to be addressed which involves pasting the values as text which I’ll cover shortly).

=LEFT(C2, LEN(C2)-1)
This formula removes the brackets and leaves us with only one undesirable character, the semi-colon. Luckily, this becomes important for the final stage.

Now we have the bracket removed from the coordinates, we just need to split them up into two columns, Latitude and Longitude. Make those columns and copy the coordinates data.

Once copied, you need to paste only the values. This is the other way to get around special characters causing trouble in the find and replace feature.

Right click on the first cell and you can see the paste options. It is important to paste only the values as we need to edit them as a string and not as a formula.

Then we can do the find and replace for anything after the semi-colon by using ;* in the find section and a blank space in the replace.

Finally, do the same to the longitude column, but remove everything before the semicolon by using find and replace and typing *; into the box.

You now have a formatted set of coordinates. Export the file as a .CSV and upload it to My Maps or Google Earth.

When it goes wrong

The process is relatively fool-proof but you need to check it all works before publishing.

In my example map, there is a problem. You can see that a base is present near Africa. Why is this?

You can see there is a base in Africa when in reality, there isn’t. Notice it’s location. On the Equator and the Prime Meridian. This looks like it has defaulted to 0 on both.

To investigate this, we can refer back to our dataset. We need to look for the entry for Subi Reef.

The entry for Subi Reef never had coordinates on the Wikipedia table, and so resulted in an error on the spreadsheet. This error translated into the default location on the map.

This can now be fixed manually alongside any other errors, which hopefully, should be few and far between!

--

--

Tom Jarvis
Tom Jarvis

Written by Tom Jarvis

OSINT Consultant and giant big huge nerd

No responses yet