From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management
Ning Li et al.
Abstract
This study examines whether large language models (LLMs) can serve as reliable supplements to human judgment in evaluating text-based task performance. Across two studies analyzing 744 knowledge-based performance outputs, we compare ratings from multiple LLM architectures (GPT-4, GPT-5, o3, Claude Sonnet 4, DeepSeek v3) against human evaluators (individual and aggregated ratings), with external expert consensus serving as the validity benchmark for both. Our multi-model design shows that several LLMs match or exceed the evaluation capabilities of human raters, with newer models performing best. Using external expert panels as the validation criterion, we find that advanced AI models achieve correlations of up to r = 0.62 with expert consensus, surpassing aggregated human ratings (r = 0.50). The AI systems are more consistent than human evaluators but vary in bias resistance: newer models show minimal susceptibility to halo effects, whereas earlier models are more vulnerable (GPT-4 declining 35.6%). Our findings validate LLMs as reliable supplements to human evaluation, establish external benchmarking protocols, and provide evidence-based guidance for selecting models according to evaluation requirements and bias-resistance needs.
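The abstract does not specify the scoring pipeline, but the external benchmarking logic it describes can be sketched simply: each rater's validity is the correlation of its scores with an expert-consensus score for the same outputs. The sketch below is illustrative only; the rating data and variable names (llm_scores, expert_consensus, human_raters) are assumptions, not values from the paper.

```python
# Minimal sketch of the external-benchmarking protocol described in the
# abstract: a rater's validity is the Pearson correlation between its scores
# and an external expert-consensus score. All data below are illustrative.
from statistics import correlation, mean  # correlation() requires Python 3.10+

# Hypothetical ratings of the same performance outputs (e.g., on a 1-7 scale).
expert_consensus = [5.0, 3.5, 6.0, 2.0, 4.5, 5.5, 3.0, 6.5]
llm_scores       = [5.2, 3.0, 6.1, 2.4, 4.0, 5.8, 3.3, 6.2]
human_raters = [
    [4, 4, 6, 3, 5, 5, 2, 6],   # rater A
    [6, 3, 5, 2, 4, 6, 4, 7],   # rater B
    [5, 2, 7, 1, 5, 5, 3, 6],   # rater C
]

# Aggregated human rating = mean across raters for each output.
aggregated_human = [mean(scores) for scores in zip(*human_raters)]

# Validity benchmark: correlation with the external expert consensus.
r_llm   = correlation(llm_scores, expert_consensus)
r_human = correlation(aggregated_human, expert_consensus)

print(f"LLM vs. expert consensus:               r = {r_llm:.2f}")
print(f"Aggregated humans vs. expert consensus: r = {r_human:.2f}")
```

Under a protocol of this form, the paper reports correlations of up to r = 0.62 for the strongest models and r = 0.50 for aggregated human ratings.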