From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management

Ning Li et al.

Personnel Psychology · 2026 · Article
https://doi.org/10.1111/peps.70020
AJG 4* · ABDC A*

Abstract

This study examines whether Large Language Models can serve as reliable supplements to human judgment in evaluating text‐based task performance. Through two studies analyzing 744 knowledge‐based performance outputs, we compare ratings from multiple LLM architectures (GPT‐4, GPT‐5, o3, Claude Sonnet 4, DeepSeek v3) against human evaluators (individual and aggregated ratings), with external expert consensus serving as the validity benchmark for both. Our multi‐model design reveals that various LLMs demonstrate comparable or superior evaluation capabilities relative to human raters, with newer models showing enhanced performance. Using external expert panels as validation criteria, we find that advanced AI models achieve correlations up to r = 0.62 with expert consensus, surpassing aggregated human ratings (r = 0.50). Different AI systems exhibit higher consistency than human evaluators while showing varying bias resistance: newer models demonstrate minimal susceptibility to halo effects, while earlier models show greater vulnerability (GPT‐4 declining 35.6%). Our findings validate LLMs as reliable supplements to human evaluation, establishing external benchmarking protocols and providing evidence‐based guidance for selecting appropriate models based on evaluation requirements and bias resistance needs.


Cite this paper

https://doi.org/10.1111/peps.70020

Or copy a formatted citation

@article{li2026,
  title        = {{From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management}},
  author       = {Li, Ning and others},
  journal      = {Personnel Psychology},
  year         = {2026},
  doi          = {10.1111/peps.70020},
}

Paste directly into BibTeX, Zotero, or your reference manager.



Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact: 0.50 × 0.40 = 0.20
M · momentum: 0.50 × 0.15 = 0.075
V · venue signal: 0.50 × 0.05 = 0.025
R · text relevance †: 0.50 × 0.40 = 0.20
Total: 0.20 + 0.075 + 0.025 + 0.20 = 0.50 (a computation sketch follows below)

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
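
As a rough illustration of how the composite above could be computed, here is a minimal Python sketch that assumes the evidence weight is a plain weighted sum of the four component scores; the function and variable names are hypothetical, not Arbiter's actual implementation.

# Illustrative sketch only: assumes the evidence weight is a simple weighted
# sum of the four component scores shown in the breakdown above.
BALANCED_MODE_WEIGHTS = {
    "F": 0.40,  # citation impact
    "M": 0.15,  # momentum
    "V": 0.05,  # venue signal
    "R": 0.40,  # text relevance
}

def evidence_weight(scores, weights):
    """Combine per-component scores (each 0-1) into a single evidence weight."""
    return sum(scores[k] * weights[k] for k in weights)

# Worked example matching the breakdown above (all components scored 0.50):
scores = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}
print(round(evidence_weight(scores, BALANCED_MODE_WEIGHTS), 2))  # 0.5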