Financial statement fraud detection using topic-driven financial sentiment analysis
Petr Hajek et al.
Abstract
Financial statement fraud undermines market integrity and incurs substantial costs for investors, regulators, and companies. Text-based detection methods have emerged as useful complements to traditional financial indicators, but many fail to incorporate domain-specific topics or sentiment cues, often missing subtle changes in deceptive communication. To overcome this problem, this study proposes a topic-driven financial sentiment analysis (TDFSA) model that detects corporate fraud by analyzing linguistic patterns in the Management Discussion & Analysis (MD&A) sections of annual reports. Our approach captures contextual sentiment within financially relevant topics using FinBERT embeddings. To evaluate these signals in fraud detection, we integrate the TDFSA outputs into a broader cost-sensitive evaluation framework. This framework combines text-based indicators with financial ratios to balance the need to avoid false alarms with the high cost of undetected fraud. Using data from U.S. firms flagged in SEC Accounting and Auditing Enforcement Releases from 2014 to 2024 and matched non-fraud peers, we examine trends in financial ratios, textual complexity, and sentiment dynamics in the three years preceding fraud events. The results show that models leveraging TDFSA achieve higher detection accuracy and lower cost than dictionary-based sentiment, generic topic models, and deep learning baselines.
• Topic-driven financial sentiment analysis (TDFSA) improves financial statement fraud detection.
• FinBERT embeddings capture both topic-level and sentiment context in MD&A disclosures.
• Cost-sensitive learning prioritizes preventing undetected fraud over false alarms, using a 6.46:1 ratio.
• The proposed model provides accurate and fair decision support for auditors and investors.
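As a rough illustration of what a 6.46:1 cost ratio implies for a classifier, the sketch below derives a cost-minimizing decision threshold and scores predictions by total misclassification cost. It is not the authors' implementation. It assumes the ratio means one undetected fraud (false negative) costs 6.46 times as much as one false alarm (false positive), that correct decisions cost nothing, and that the model outputs calibrated fraud probabilities. The function names and sample data are hypothetical.

```python
from typing import Sequence

# Assumed reading of the abstract's 6.46:1 ratio: a missed fraud costs
# 6.46 units for every 1 unit a false alarm costs.
COST_FN = 6.46
COST_FP = 1.0


def cost_optimal_threshold(cost_fn: float = COST_FN, cost_fp: float = COST_FP) -> float:
    """Probability above which flagging a firm has lower expected cost.

    Flagging is cheaper when p * cost_fn > (1 - p) * cost_fp,
    i.e. when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def total_cost(
    y_true: Sequence[int],
    y_prob: Sequence[float],
    threshold: float,
    cost_fn: float = COST_FN,
    cost_fp: float = COST_FP,
) -> float:
    """Sum misclassification costs; correct decisions are assumed free."""
    cost = 0.0
    for label, prob in zip(y_true, y_prob):
        flagged = prob > threshold
        if label == 1 and not flagged:
            cost += cost_fn  # undetected fraud
        elif label == 0 and flagged:
            cost += cost_fp  # false alarm
    return cost


if __name__ == "__main__":
    # Hypothetical labels (1 = fraud) and predicted fraud probabilities.
    labels = [1, 0, 1, 0]
    probs = [0.20, 0.20, 0.05, 0.05]
    t = cost_optimal_threshold()
    print(f"cost-optimal threshold: {t:.3f}")
    print(f"cost at that threshold: {total_cost(labels, probs, t):.2f}")
    print(f"cost at 0.5 threshold:  {total_cost(labels, probs, 0.5):.2f}")
```

Under these assumptions the threshold falls to about 0.134, well below the conventional 0.5, so the classifier accepts more false alarms in exchange for fewer missed fraud cases.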