Telegram OSINT: Generating a data ‘backbone’ for investigation

With Telegram growing ever more popular, vast amounts of data are being generated which we can use to map trends and fuel investigations. While most topics I discuss focus on smaller, granular cases, here I want to cover a broader one: the data backbone.

The data backbone is a much-neglected area in many investigations, and for good reason. It can be laborious to set up and technically demanding to maintain. While the most obvious example of a data backbone may be an ever-growing KML file in your Google Earth as you find more locations of interest, I actually want to step away from geospatial analysis for this guide.

Here we want to make a versatile Telegram trend monitor that can be easily tweaked and processed in a spreadsheet. The goal is to provide more information — particularly hyper-specific information — than what you would get just by scrolling through multiple Telegram channels.

A mock example

To avoid diving into a current topic, here’s our example scenario:

The land of Mordor is currently invading the rest of Middle Earth. Major human rights violations are being reported. As an investigator, you want to verify and track dragon attacks on civilian villages. This is straightforward when you have footage of such attacks, but can be especially difficult when all you have are fleeting mentions scattered across various social media platforms.

Each village has its own Telegram group or news channel, and there are also channels for dissident groups and militias that have formed to defend the land. Most importantly, there are a few ‘recon’ channels that are near Mordor that post whenever a dragon takes off, to give additional warning so people can take shelter.

The basic investigation we already know

Currently, as an investigator, you may see reports of an attack and start searching Telegram, especially the relevant local groups, for mentions of dragon attacks. There you may find footage that can be verified, and reports are then produced.

But this is a very reactive approach. Following a dragon attack, it may be days before mobile networks are back and people can upload footage. By then, not only is your reporting delayed, but any actions taken may also be too late.

Likewise, what if no footage emerges? You may have very little to go by unless you have a data backbone of dragon movements in the area to infer that something may have indeed taken place.

The optimisation

What if we could sort through 100,000 messages across hundreds of channels in minutes and get a better situational awareness of dragon sightings, when they happened, and even where? Then we could verify any claims by matching them against dates of known dragon activity. We may also have reason to believe an attack wasn’t carried out on a claimed date because we can see there was no activity or the activity was elsewhere.

This is our desired output. Each vertical bar is a different date, and the bar height corresponds to the number of sightings reported, so the tall bars mark the days where observed dragon activity was highest. This used a dataset of 100,000 Telegram messages. (The graph actually monitors a specific vehicle in an active conflict zone, but I changed it to “dragons” for the sake of this guide.)

Let’s revisit our scenario. As I mentioned, we have some recon groups on Telegram who regularly update the channel with sightings of dragons taking off from the bases in Mordor. These channels also report on other attacks and sightings that don’t involve dragons.

We can filter this data to visualise all dragon activity and have a backbone of context to refer to when verifying reports.
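For readers comfortable with a little Python, the matching step can be expressed in a few lines. This is only a minimal sketch and not part of the no-code workflow this guide follows: the dates and counts are invented for illustration, and `activity_near` and its `window_days` parameter are hypothetical names. It sums the sightings recorded within a day either side of a claimed attack date.

```python
from datetime import date, timedelta

# Hypothetical backbone: number of dragon sightings reported per day.
daily_sightings = {
    date(2022, 4, 1): 7,
    date(2022, 4, 2): 1,
    date(2022, 4, 5): 12,
}

def activity_near(counts, claimed, window_days=1):
    """Sum sightings from window_days before to window_days after the claimed date."""
    return sum(
        counts.get(claimed + timedelta(days=offset), 0)
        for offset in range(-window_days, window_days + 1)
    )

# A claim dated 5 April sits on a day of heavy recorded activity.
print(activity_near(daily_sightings, date(2022, 4, 5)))
# A claim dated 8 April has no recorded activity around it.
print(activity_near(daily_sightings, date(2022, 4, 8)))
```

A result of zero does not prove that nothing happened. It only shows that the channels in the dataset were silent around that date, which is a prompt to look more closely at the claim.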

Telegram Export

Telegram has an extremely useful feature in the desktop version of the app that allows you to export all chat history to HTML format. This is great for preservation, but it also means we can play with the data.

The issue is that I don’t want to rely on coding to process this data. So, if you are scared of coding, this guide is still for you.

To export data in Telegram, look at the top right of the desktop app and you will see the menu icon (three vertical dots).

Click the three dots and select Export Chat History

At this point, you want to make a decision. If you need to preserve everything, make sure you check the boxes for all files. In this case, we just want a simple text scrape. Uncheck everything including photos so we only get the messages. Having only text will make life easier for the following tools.

Uncheck all the boxes so we only get text

Convert to CSV

This is the part that makes the process accessible to everyone. Once the data is in a CSV format, we can have the posts in a spreadsheet, organised by date and time. Spreadsheets run on pretty much every computer and are accessible to colleagues who may not be as technically proficient. Data is only valuable if you can use it.

To do this, there’s a very useful tool that can be found on GitHub. Telegram Export and Converter Tool takes the HTML files and converts them into CSV so you can load the data into a spreadsheet and play with it.

This tool is simple to use and works insanely fast

The first thing you want to do is make sure your Telegram export worked correctly. You should have a folder in your downloads that contains these types of files:

Telegram exports have multiple HTML files as well as a folder for CSS, images, and JS. In larger channels, you may have tens of HTML files or maybe even more.
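If you would rather not count the files by eye, the check can be scripted. This is an optional sketch under stated assumptions: `list_export_pages` is a hypothetical helper name, and the folder name in the example is a placeholder for your own export folder. It simply lists the HTML files sitting directly inside the folder.

```python
from pathlib import Path

def list_export_pages(folder):
    """Return the names of the HTML files sitting directly inside the export folder."""
    return sorted(p.name for p in Path(folder).glob("*.html"))

if __name__ == "__main__":
    # Hypothetical folder name; replace it with the path to your own export.
    pages = list_export_pages("ChatExport_2022-04-05")
    print(f"{len(pages)} HTML file(s) found")
    for name in pages:
        print(" ", name)
```

If the count is zero, the export most likely did not finish or the path points at the wrong folder.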

You want to convert the HTML files to CSV with the tool mentioned above. To do this, go to the tool’s page on GitHub, click the green Code button, then download it as a zip.

This downloads the small program and means you can run it on your computer.

The downloaded folder can be unzipped and you will see the files inside. A small note: make sure you have Python installed for this to work.

The Python file is the tool you want to use

Copy the Python file into the folder with the Telegram exports. It can just be dragged across into the folder.

Drag the Python file into the folder you want to be converted.

Go to the address bar of that folder and copy the path of the folder. The reason we do this is that we are going to run the tool using the command line/terminal.

Don’t be afraid of this because you only need to know one command: “cd”. It stands for change directory, and all it does is tell the computer to move into that folder so that any future actions take place there.

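Type “cd” followed by a space, then paste the path to the folder. Wrapping the path in quotes keeps the command working when a folder name contains a space, as seen below:

cd "C:\Users\XXXXX\Downloads\Telegram Desktop\ChatExport_2022-04-05"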

Again don’t be scared of this, it’s only daunting the first time.

Then all you need to do is run the program by typing the name of the tool:

telegram-export-converter.py

Hit Enter and a window should briefly flash. You then have the Telegram export converted to CSV.

And now you have a file in that folder with all the chat history in a spreadsheet format.

The new spreadsheet file can then be modified however you like.

Now you have every message and the date and time in the spreadsheet. You can start filtering by the mention of “dragon” or any location of interest and graph out those mentions by date. You can also do more advanced processing of the data to collect sightings by date and location.
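Everything here can be done with spreadsheet filters, but the same filter-and-count step can also be scripted. The sketch below is illustrative only: the column names `date` and `text` and the timestamp format are assumptions, not the converter’s documented output, so check the header row of your own CSV and adjust the three constants. `mentions_per_day` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# ASSUMPTIONS: column names and timestamp format are placeholders.
DATE_COL = "date"
TEXT_COL = "text"
DATE_FMT = "%Y-%m-%d %H:%M:%S"

def mentions_per_day(csv_file, keyword):
    """Count rows whose text contains keyword (case-insensitive), grouped by day."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if keyword.lower() in (row[TEXT_COL] or "").lower():
            day = datetime.strptime(row[DATE_COL], DATE_FMT).date()
            counts[day] += 1
    return counts

# Invented sample standing in for the converted export.
sample = io.StringIO(
    "date,text\n"
    "2022-04-01 06:10:00,Dragon took off heading west\n"
    "2022-04-01 09:30:00,Market reopened today\n"
    "2022-04-01 21:45:00,Second dragon seen over the river\n"
    "2022-04-03 05:00:00,DRAGON airborne\n"
)

# Print one text bar per day, a rough stand-in for the bar chart.
for day, n in sorted(mentions_per_day(sample, "dragon").items()):
    print(day, "#" * n)
```

To run it against a real file, pass `open(path, newline="", encoding="utf-8")` in place of the sample.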

This is what the spreadsheet looks like when first opened. You can format it however you like and add filters to only show certain dragon types or anything mentioned in the text. It can all be mapped out by date or time. It could even be cross-referenced by any mentions of location.
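The cross-referencing by location could look something like the following. It is a rough sketch under assumptions: the village names are made up for the scenario, `sightings_by_day_and_place` is a hypothetical helper, and the messages are supplied as simple (day, text) pairs rather than read from the CSV.

```python
from collections import Counter

# Hypothetical village names; use the places relevant to your investigation.
PLACES = ["Bree", "Edoras", "Dale"]

def sightings_by_day_and_place(messages, keyword, places=PLACES):
    """Count keyword mentions per (day, place) pair.

    messages is an iterable of (day, text) tuples, e.g. rows read from the CSV.
    """
    counts = Counter()
    for day, text in messages:
        lowered = text.lower()
        if keyword.lower() not in lowered:
            continue  # message does not mention the keyword at all
        for place in places:
            if place.lower() in lowered:
                counts[(day, place)] += 1
    return counts

# Invented sample messages.
sample = [
    ("2022-04-01", "Dragon heading towards Bree"),
    ("2022-04-01", "dragon circling over bree and Dale"),
    ("2022-04-02", "Supply convoy reached Edoras"),
]
for (day, place), n in sorted(sightings_by_day_and_place(sample, "dragon").items()):
    print(day, place, n)
```

Plain substring matching is crude: a short place name can also match inside a longer word, so treat the counts as leads to check rather than as findings.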

Now you can have trend graphs of sightings of various attack equipment. You can filter for different types and various parameters.

Think about how else this could be applied. Perhaps you are looking to track the spread of certain terms or disinformation. Using a real-life example, you could map the rising use of words such as “Nazi” in pro-Russian propaganda channels.
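When tracking a term over time, raw counts can rise simply because a channel is posting more. One possible refinement, offered here as my own suggestion rather than a required step, is to look at the share of messages per month that contain the term. In this sketch `monthly_share` is a hypothetical helper, the sample messages are invented, and timestamps are assumed to start with “YYYY-MM”.

```python
from collections import Counter

def monthly_share(messages, term):
    """Return {"YYYY-MM": fraction of that month's messages containing term}.

    messages is an iterable of (timestamp, text) tuples, where timestamp is a
    string starting with "YYYY-MM" (an assumed format for this sketch).
    """
    totals, hits = Counter(), Counter()
    for timestamp, text in messages:
        month = timestamp[:7]
        totals[month] += 1
        if term.lower() in text.lower():
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}

# Invented sample messages using a fictional term.
sample = [
    ("2022-02-10 08:00:00", "Weather update for the valley"),
    ("2022-02-11 09:00:00", "Orc raiders reported near the ford"),
    ("2022-03-01 07:30:00", "Orc column on the road"),
    ("2022-03-02 18:00:00", "More orc sightings tonight"),
]
print(monthly_share(sample, "orc"))
```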

Other considerations

When filtering for keywords in the data you want to be specific. This can help rule out general discussion. Say you were monitoring the progress of a new tank battalion with the latest M20 (fictional) tanks, you wouldn’t want to use the search term “tanks”.

Filter instead for references to that specific model to get more relevant results. Filtering on the word “tanks” alone would pull in every discussion of tanks, much of it irrelevant.

This is where your own judgement is essential because every investigation is different and you must design your data processing with context in mind.

If you are monitoring air strikes, the term “air raid siren” may be a good one, but if a channel starts updating hourly advice telling people to listen out for the siren and seek shelter, it suddenly becomes noise in the dataset and yields false-positive results.

Likewise, if you are too specific, you may be overly reliant on the identification skills of the people running a channel and may lose data because they did not positively ID a vehicle.
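These trade-offs can be made explicit with include and exclude patterns. The following is a minimal sketch, assuming Python’s standard `re` module; `make_filter` is a hypothetical helper, and the exclude phrase is only an example of the kind of routine advice you might want to drop.

```python
import re

def make_filter(include, exclude=()):
    """Build a predicate: text must match an include pattern and no exclude pattern."""
    inc = [re.compile(p, re.IGNORECASE) for p in include]
    exc = [re.compile(p, re.IGNORECASE) for p in exclude]

    def is_relevant(text):
        return any(p.search(text) for p in inc) and not any(p.search(text) for p in exc)

    return is_relevant

# \b marks a word boundary, so "M20" matches but "M200" or "tanks" alone do not.
m20_filter = make_filter([r"\bM20s?\b"])

# Hypothetical exclude phrase for a channel that posts routine shelter advice.
siren_filter = make_filter([r"air raid siren"], exclude=[r"listen out for"])

print(m20_filter("Column of M20 tanks crossing the bridge"))
print(m20_filter("General chat about tanks"))
print(siren_filter("Air raid siren sounding in the east district"))
print(siren_filter("Reminder: listen out for the air raid siren"))
```

The first and third messages pass and the other two are dropped. Which patterns belong in each list is exactly the judgement call described above, and it will differ for every investigation.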

Conclusion

This barely scratches the surface of large-scale data scraping with Telegram but it hopefully demonstrates the power of a few tricks in being able to establish very rapid and visual contexts that are not always apparent just by scrolling through a feed.


OSINT Consultant and giant big huge nerd

Tom Jarvis
