Data Scraping with Google Sheets to assist Journalism and OSINT — Tutorial

Every red dot is a People’s Liberation Army Air Force location. If you want to research the number of military planes China has, for example, why start from scratch? A lot of the work is already done — namely, a list of locations to start the search.
The list of People’s Liberation Army air bases was long and manually adding each to a Maps project for further analysis would have taken a long time.

Why use data scraping?

By the end of this tutorial, you will be able to very quickly tour all of China’s air bases and find some really cool aircraft — or whatever your research demands.

Getting started

The best types of information to get started on are tables (Wikipedia is great for this) which have coordinates. These can be easily scraped and collated into a spreadsheet and offer a lot of additional information.

Data Scraping Wikipedia tables in Google Sheets

The code needed to scrape data from the PLA air bases page is a simple Import HTML formula.

=ImportHtml("", "table", 6)
The code for importing a data table from Wikipedia into a Sheets spreadsheet. Note that the URL and the number at the end (in this case, 6) will depend on what you are researching.
("WEB URL", "table", TABLE NUMBER)
Her we can see the result of importing with the table reference of 1 (to make it work, it should be 6). This happened because there was an element higher up that it picked first.
This is what we should see when it works.

Formatting the data to be transferred to a Maps file

To turn this into a Maps or Google Earth file, we first need to know what format it needs to be in.

Right now, the latitude and longitude data is not useable. We need to split them up and remove the characters that aren’t needed
30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000
30°35′00″N117°03′00″E / 30.58333°N 117.05000°E / 30.58333; 117.05000
Load the data into Excel if you want an easier formatting experience.
Alternatively, typing “/*” would replace the forward slash and anything after it.
=LEFT(C2, LEN(C2)-1)
This formula removes the brackets and leaves us with only one undesirable character, the semi-colon. Luckily, this becomes important for the final stage.

When it goes wrong

The process is relatively fool-proof but you need to check it all works before publishing.

You can see there is a base in Africa when in reality, there isn’t. Notice it’s location. On the Equator and the Prime Meridian. This looks like it has defaulted to 0 on both.
The entry for Subi Reef never had coordinates on the Wikipedia table, and so resulted in an error on the spreadsheet. This error translated into the default location on the map.



OSINT Consultant and giant big huge nerd

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store