Multimodal misinformation detection across diverse languages using RAG and LLMs
Sheetal Harris et al.
Abstract
The rapid spread of multimodal fake news (FN) on Online Social Networks (OSNs) threatens digital information ecosystems, particularly in low-resource languages. Existing multimodal fake news detection (FND) methods are largely limited to high-resource settings, restricting their global applicability. We propose M&M-RAG, a Multilingual & Multimodal Retrieval-Augmented Generation framework that leverages Large Vision-Language Models (LVLMs) and Large Language Models (LLMs) to verify news claims across English, Chinese, and Urdu. M&M-RAG integrates real-time multilingual evidence retrieval, language-aware prompting, and cross-modal reasoning for fact verification. We further introduce Multi-Ax-to-Grind Urdu, the first large-scale, multi-domain multimodal benchmark for FND in Urdu. Experiments on typologically diverse monolingual multimodal datasets demonstrate that M&M-RAG achieves state-of-the-art (SOTA) performance, with 94.6% accuracy and a 94.2% F1 score, surpassing models such as SpotFake, MPFN, MMCFND, and Semi-FND. The framework remains robust in zero-shot and cross-lingual scenarios under frozen-model inference, without task-specific fine-tuning. These results underscore the scalability and interpretability of LVLM-based approaches for combating multimodal misinformation, particularly in under-represented and typologically diverse languages.
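As a rough illustration of the pipeline the abstract describes, a retrieval-augmented verification loop over a frozen LVLM might look like the sketch below. All names (retrieve_evidence, lvlm_verify, the prompt wording) are hypothetical stand-ins under stated assumptions, not the authors' implementation; the abstract does not specify the retriever, prompts, or model used.

```python
# Minimal sketch of a retrieval-augmented multimodal claim-verification loop.
# Hypothetical: retrieve_evidence, lvlm_verify, and PROMPTS are placeholders,
# not the paper's actual components.

from dataclasses import dataclass

# Language-aware prompt templates (assumed): one per supported language.
PROMPTS = {
    "en": ("Claim: {claim}\nEvidence: {evidence}\n"
           "Is the claim supported by the image and evidence? "
           "Answer REAL or FAKE with a short rationale."),
    "zh": ("声明：{claim}\n证据：{evidence}\n"
           "该声明是否得到图像和证据的支持？请回答 REAL 或 FAKE 并给出简短理由。"),
    "ur": ("دعویٰ: {claim}\nشواہد: {evidence}\n"
           "کیا یہ دعویٰ تصویر اور شواہد سے ثابت ہوتا ہے؟ REAL یا FAKE میں مختصر وجہ کے ساتھ جواب دیں۔"),
}

@dataclass
class Verdict:
    label: str       # "REAL" or "FAKE"
    rationale: str   # model-generated explanation (supports interpretability)

def retrieve_evidence(claim: str, lang: str, k: int = 5) -> list[str]:
    """Placeholder for real-time multilingual evidence retrieval
    (e.g. a web-search API); returns the top-k snippets in the claim's language."""
    raise NotImplementedError("plug in a multilingual retriever here")

def lvlm_verify(prompt: str, image_path: str) -> Verdict:
    """Placeholder for frozen-model LVLM inference (no task-specific
    fine-tuning): the model reasons jointly over the prompt and the news image."""
    raise NotImplementedError("plug in an LVLM inference call here")

def verify_claim(claim: str, image_path: str, lang: str) -> Verdict:
    """One pass of an M&M-RAG-style loop: retrieve, prompt, cross-modal reason."""
    evidence = retrieve_evidence(claim, lang)
    prompt = PROMPTS[lang].format(claim=claim, evidence="\n".join(evidence))
    return lvlm_verify(prompt, image_path)
```

Keeping the LVLM frozen and moving all language- and task-specific behavior into retrieval and prompting is what lets this style of pipeline transfer to zero-shot and cross-lingual settings without retraining.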