Using custom GPT to create a hatespeech identification tool

Tom Jarvis
5 min readDec 20, 2023

OpenAI’s addition of custom GPTs that can be made to run certain tasks opens many avenues for repetitive analysis. Here we explore the creation and testing of a custom GPT running on GPT4 to identify and score messages on their hatespeech and toxicity scores. While this is not currently a scalable system due to API limitations it offers an insight into how such tools could be explored without training large models, instead taking a general approach.

This article contains examples which may be obscene or contain offensive messages. It is not intended as a full evaluation of the capabilities, nor do I do extensive data analysis, rather, I wanted to share this write-up as a point of inspiration and as an initial seed for future research.

Link to the custom GPT

Designing the tool

While it would be far better to build and train a dedicated hatespeech detection machine, the power of ChatGPT allows a very broad understanding of text in many languages.

Custom prompt for the GPT

Taking inspiration from Detoxify, this custom model evaluates across seven categories: Toxicity, Severe Toxicity, Obscene, Threat, Insult, Identity Attack, and Sexually Explicit content.

The prompt is simple:

The Hatespeech Analyzer, focused on gendered hate speech, evaluates Telegram messages across seven categories: Toxicity, Severe Toxicity, Obscene, Threat, Insult, Identity Attack, and Sexually Explicit content. It provides scores between 0 and 1 for each category, indicating the intensity of hate speech elements. After scoring, the Analyzer offers a concise summary in English, highlighting key points about its reasoning. This summary provides context and clarity to the scores, helping users understand the evaluation basis. The Analyzer’s style remains formal and professional, delivering objective and factual information in both scoring and the explanatory summary, exclusively in English.

One key addition to this tool is the lock on the output language being in English. It was found that, with inconsistencies, the input language sometimes would cause the output to default to that language.

Initial Results

Initial results with this type of prompt are promising. The tool works as it is intended and can readily differentiate between toxic and non-toxic content. It excels at looking at the meaning of the message and understanding it as a whole.

Toxic content

Results show that the tool can identify toxic content and pick the levels at which each category is present.

Non-toxic content

Similarly, it can identify the non-toxic content and understand the context between things that may be considered threatening — such as a missile test, but not a directed threat to an individual.

One example where this tool may shine compared to older models is it’s contextual understanding of certain terms. For example in the context of dogs, certain words may be less offensive than if they were applied to humans. This tool is accurately able to detect this nuance — something which other tools highlight is a weak-point.

Accurate understanding of the context of words and even acknowledgement of where misinterpretation may take place.

Repeatability

One of the key things to test with this is the repeatability of the output. To test this, the same input messages were given multiple times to ensure that the tool’s understanding of the content did not score wildly differently.

One of the key findings with this is that the tool performs with consistency, but within a narrow window of variation. It does a good job of maintaining the proportions of each category in relation to each other, but some runs do vary in the overall scoring.

Below are two examples of messages that were scored in triplicate across the seven categories. These results represent the consistency seen across running many different messages in triplicate.

While multiple passes of the same input produced variations in the scoring, there was a useable consistency in the results.
This consistency in repeated runs continues for posts with lower hatespeech scores too, including those which may have terms such as “Nazi” which could confuse the process.

From looking at the variability between repeat runs of the same message, it appears that almost all non-zero results have some variation. There was a maximum difference of 0.1 (scale 0 to 1).

Language

This tool was tested with English and Russian language content. It performed well on both languages within the designer and was able to understand and translate the context.

Once published as a custom GPT, the tool appears to have lost this functionality. As linked above, the tool struggles to handle any non-English content in its current form.

This can be “resolved” by taking the prompt and making your own custom GPT and running it in the creation mode.

Expandability

With this prompt, it was possible to enable CSV or JSON uploads for analysis, however on testing, it was not possible to make it analyse any large sets of messages.

The goal would be in future, if there are increased allowances to enable the upload of JSON and CSV to analyse messages row-by-row from Telegram exports, and then create a CSV output file with columns for each scored category.

This sort of process could potentially be implemented by using the API and a non-custom GPT alongside a Python script.

Conclusion

While this is not an optimal approach to analysing hatespeech in large datasets, there appears to be promising potential for tools like this to be able to accurately contextualise a message’s meaning and understand nuances. This may be especially beneficial compared to certain models.

--

--