Reliability Evidence for AI-Based Scores in Organizational Contexts: Applying Lessons Learned From Psychometrics

Andrew B. Speer et al.

Organizational Research Methods · 2025 · article
https://doi.org/10.1177/10944281251346404
AJG 4 · ABDC A*

Abstract

Machine learning and artificial intelligence (AI) are increasingly used within organizational research and practice to generate scores representing constructs (e.g., social effectiveness) or behaviors/events (e.g., turnover probability). Ensuring the reliability of AI scores is critical in these contexts, and yet reliability estimates are reported in inconsistent ways, if at all. The current article critically examines reliability estimation for AI scores. We describe different uses of AI scores and how this informs the data and model needed for estimating reliability. Additionally, we distinguish between reliability and validity evidence within this context. We also highlight how the parallel test assumption is required when relying on correlations between AI scores and established measures as an index of reliability, and yet this assumption is frequently violated. We then provide methods that are appropriate for reliability estimation for AI scores that are sensitive to the generalizations one aims to make. In conclusion, we assert that AI reliability estimation is a challenging task that requires a thorough understanding of the issues presented, but a task that is essential to responsible AI work in organizational contexts.

2 citations


Cite this paper

https://doi.org/10.1177/10944281251346404

Or copy a formatted citation

@article{speer2025,
  title        = {{Reliability Evidence for AI-Based Scores in Organizational Contexts: Applying Lessons Learned From Psychometrics}},
  author       = {Speer, Andrew B. and others},
  journal      = {Organizational Research Methods},
  year         = {2025},
  doi          = {10.1177/10944281251346404},
}




Evidence weight

0.41

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact:    0.25 × 0.40 = 0.10
M · momentum:           0.55 × 0.15 = 0.08
V · venue signal:       0.50 × 0.05 = 0.03
R · text relevance †:   0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
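The breakdown above is a weighted sum: each component score is multiplied by its "Balanced mode" weight and the products are added. A minimal sketch of that arithmetic, assuming the four component scores and weights shown on this page (the dictionary names are illustrative, not part of any actual Arbiter API):

```python
# Hypothetical reconstruction of the evidence-weight arithmetic shown above.
# Component scores (F = citation impact, M = momentum, V = venue signal,
# R = text relevance) and their "Balanced mode" weights, as listed on the page.
scores = {"F": 0.25, "M": 0.55, "V": 0.50, "R": 0.50}
weights = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}

# Weighted sum across the four components.
evidence_weight = sum(scores[k] * weights[k] for k in scores)

# 0.10 + 0.0825 + 0.025 + 0.20 = 0.4075, displayed as 0.41 after rounding.
print(f"{evidence_weight:.4f}")
```

Note that the per-component products on the page are themselves rounded to two decimals (0.0825 → 0.08, 0.025 → 0.03), so the displayed rows also sum to the displayed total of 0.41.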