From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management
Ning Li et al.
Abstract
This study examines whether large language models (LLMs) can serve as reliable supplements to human judgment in evaluating text-based task performance. Across two studies analyzing 744 knowledge-based performance outputs, we compare ratings from multiple LLM architectures (GPT-4, GPT-5, o3, Claude Sonnet 4, DeepSeek v3) against human evaluators (individual and aggregated ratings), with external expert consensus serving as the validity benchmark for both. Our multi-model design shows that several LLMs match or exceed the evaluation capabilities of human raters, with newer models performing best. Using external expert panels as the validation criterion, we find that advanced AI models achieve correlations of up to r = 0.62 with expert consensus, surpassing aggregated human ratings (r = 0.50). The AI systems are more consistent than human evaluators but vary in bias resistance: newer models show minimal susceptibility to halo effects, whereas earlier models are more vulnerable (GPT-4 declining 35.6%). Our findings validate LLMs as reliable supplements to human evaluation, establish external benchmarking protocols, and provide evidence-based guidance for selecting models according to evaluation requirements and bias-resistance needs.
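The abstract does not specify the scoring pipeline, but the external benchmarking logic it describes can be sketched simply: each rater's validity is the correlation of its scores with an expert-consensus score for the same outputs. The sketch below is illustrative only; the rating data and variable names (llm_scores, expert_consensus, human_raters) are assumptions, not values from the paper.

```python
# Minimal sketch of the external-benchmarking protocol described in the
# abstract: a rater's validity is the Pearson correlation between its scores
# and an external expert-consensus score. All data below are illustrative.
from statistics import correlation, mean  # correlation() requires Python 3.10+

# Hypothetical ratings of the same performance outputs (e.g., on a 1-7 scale).
expert_consensus = [5.0, 3.5, 6.0, 2.0, 4.5, 5.5, 3.0, 6.5]
llm_scores       = [5.2, 3.0, 6.1, 2.4, 4.0, 5.8, 3.3, 6.2]
human_raters = [
    [4, 4, 6, 3, 5, 5, 2, 6],   # rater A
    [6, 3, 5, 2, 4, 6, 4, 7],   # rater B
    [5, 2, 7, 1, 5, 5, 3, 6],   # rater C
]

# Aggregated human rating = mean across raters for each output.
aggregated_human = [mean(scores) for scores in zip(*human_raters)]

# Validity benchmark: correlation with the external expert consensus.
r_llm   = correlation(llm_scores, expert_consensus)
r_human = correlation(aggregated_human, expert_consensus)

print(f"LLM vs. expert consensus:               r = {r_llm:.2f}")
print(f"Aggregated humans vs. expert consensus: r = {r_human:.2f}")
```

Under a protocol of this form, the paper reports correlations of up to r = 0.62 for the strongest models and r = 0.50 for aggregated human ratings.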