Improving hate speech detection with large language models

Natalia Umansky et al.

European Journal of Political Research · 2026 · article
https://doi.org/10.1017/s1475676525100546
ABDC rating: A
Weight: 0.50

Abstract

Efforts to curb online hate speech depend on our ability to reliably detect it at scale. Previous studies have highlighted the strong zero-shot classification performance of large language models (LLMs), offering a potential tool to efficiently identify harmful content. Yet for complex and ambivalent tasks like hate speech detection, pre-trained LLMs can be insufficient and carry systemic biases. Domain-specific models fine-tuned for the given task and empirical context could help address these issues, but, as we demonstrate, the quality of data used for fine-tuning decisively matters. In this study, we fine-tuned GPT-4o-mini using a unique corpus of online comments annotated by diverse groups of coders with varying annotation quality: research assistants, activists, two kinds of crowd workers, and citizen scientists. We find that only annotations from those groups of annotators that are better than zero-shot GPT-4o-mini in recognizing hate speech improve the classification performance of the fine-tuned LLM. Specifically, fine-tuning using the highest-quality annotator group – trained research assistants – boosts classification performance by increasing the model’s precision without notably sacrificing the good recall of zero-shot GPT-4o-mini. In contrast, lower-quality annotations do not improve and may even decrease the ability to identify hate speech. By examining tasks reliant on human judgment and context, we offer insights that go beyond hate speech detection.
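The zero-shot baseline described in the abstract can be reproduced in outline with the public OpenAI API. The sketch below is an illustrative assumption, not the paper's actual prompt or pipeline: the instruction wording, the HATE/NOT_HATE label set, and the classify_comment helper are invented for demonstration, and a fine-tuned model ID can be substituted for the base model name to compare the two setups.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_comment(comment: str, model: str = "gpt-4o-mini") -> str:
        """Return a one-word hate-speech label for a single online comment."""
        response = client.chat.completions.create(
            model=model,    # swap in a fine-tuned ID ("ft:gpt-4o-mini...") here
            temperature=0,  # deterministic output for classification
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a content moderation classifier. "
                        "Answer with exactly one word: HATE or NOT_HATE."
                    ),
                },
                {"role": "user", "content": comment},
            ],
        )
        return response.choices[0].message.content.strip()

    print(classify_comment("example comment to classify"))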


Cite this paper

https://doi.org/10.1017/s1475676525100546

Or copy a formatted citation

@article{umansky2026,
  title        = {{Improving hate speech detection with large language models}},
  author       = {Umansky, Natalia and others},
  journal      = {European Journal of Political Research},
  year         = {2026},
  doi          = {10.1017/s1475676525100546},
}

Paste directly into BibTeX, Zotero, or your reference manager.


Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact     0.50 × 0.40 = 0.20
M · momentum            0.50 × 0.15 = 0.07
V · venue signal        0.50 × 0.05 = 0.03
R · text relevance †    0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
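For reference, the displayed evidence weight is consistent with a simple weighted sum of the four component scores under the Balanced-mode weights above. A minimal sketch, assuming that formula (variable names are illustrative):

    # Balanced-mode weights and the component scores shown on this page
    weights = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}
    scores  = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}

    # per-component contributions and their total
    contributions = {k: scores[k] * weights[k] for k in weights}
    evidence_weight = sum(contributions.values())

    for name, value in contributions.items():
        print(f"{name}: {value:.3f}")        # F: 0.200, M: 0.075, V: 0.025, R: 0.200
    print(f"total: {evidence_weight:.2f}")   # total: 0.50

The panel shows the M and V contributions to two decimals (0.075 as 0.07, 0.025 as 0.03); the exact terms still sum to the displayed weight of 0.50.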