Revisiting reliability with human and machine learning raters under scoring design and rater configuration in the many‐facet Rasch model

Xingyao Xiao et al.

British Journal of Mathematical and Statistical Psychology · 2026 · article
https://doi.org/10.1111/bmsp.70034
ABDC rating: B

Abstract

Constructed-response (CR) items are widely used to assess higher-order skills but require human scoring, which introduces variability and is costly at scale. Machine learning (ML)-based scoring offers a scalable alternative, yet its psychometric consequences in rater-mediated models remain underexplored. This study examines how scoring design, rater bias, ML inconsistency and model specification affect the reliability of ability estimation in polytomous CR assessments. Using Monte Carlo simulation, we manipulated human and ML rater bias, ML inconsistency and scoring density (complete, overlapping, isolated). Five estimation models were compared, including the Partial Credit Model (PCM) with fixed thresholds and the Many-Facet Partial Credit Model (MFPCM) with and without free calibration. Results showed that systematic bias, not random inconsistency, was the main source of error. Hybrid human-ML scoring improved estimation when raters were unbiased or exhibited opposing biases, but error compounded when biases aligned. Across designs, PCM with fixed thresholds consistently outperformed more complex alternatives, while anchoring CR items to selected-response metrics stabilized MFPCM estimation. The real-data application replicated these patterns. Findings show that scoring design and bias structure, rather than model complexity, drive the benefits of hybrid scoring and that anchoring offers a practical strategy for stabilizing estimation.
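The abstract's central manipulation, systematic rater bias versus random inconsistency under a Partial Credit Model, can be illustrated with a minimal simulation sketch. This is not the authors' simulation code; the function names, step difficulties, and bias/noise values are illustrative assumptions. Bias is modeled as a constant shift in the effective ability a rater responds to, inconsistency as zero-mean noise on that same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)

def pcm_probs(theta, deltas):
    """Partial Credit Model category probabilities for one item.

    deltas holds the m step difficulties; score categories run 0..m.
    """
    cum = np.concatenate(([0.0], np.cumsum(theta - deltas)))
    p = np.exp(cum - cum.max())          # numerically stabilized softmax
    return p / p.sum()

def rated_score(theta, deltas, bias=0.0, noise_sd=0.0):
    """Score from a rater whose severity shifts the effective ability
    by a systematic bias plus zero-mean random inconsistency."""
    eff = theta + bias + rng.normal(0.0, noise_sd)
    p = pcm_probs(eff, deltas)
    return rng.choice(len(p), p=p)

deltas = np.array([-1.0, 0.0, 1.0])      # 4 score categories (0-3)
thetas = rng.normal(size=5000)           # simulated examinee abilities

unbiased = np.array([rated_score(t, deltas) for t in thetas])
lenient  = np.array([rated_score(t, deltas, bias=0.5) for t in thetas])
noisy    = np.array([rated_score(t, deltas, noise_sd=0.5) for t in thetas])

# A lenient (systematically biased) rater shifts the whole score
# distribution; a merely inconsistent rater leaves the mean roughly
# unchanged, consistent with bias (not noise) driving estimation error.
print(unbiased.mean(), lenient.mean(), noisy.mean())
```

This illustrates the abstract's key contrast: symmetric inconsistency largely averages out across examinees, while a shared bias propagates directly into the observed scores and hence into ability estimates.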


Cite this paper

https://doi.org/10.1111/bmsp.70034

Or copy a formatted citation

@article{xingyao2026,
  title        = {{Revisiting reliability with human and machine learning raters under scoring design and rater configuration in the many‐facet Rasch model}},
  author       = {Xiao, Xingyao and others},
  journal      = {British Journal of Mathematical and Statistical Psychology},
  year         = {2026},
  doi          = {10.1111/bmsp.70034},
}

Paste directly into BibTeX, Zotero, or your reference manager.


Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact: 0.50 × 0.40 = 0.20
M · momentum: 0.50 × 0.15 = 0.07
V · venue signal: 0.50 × 0.05 = 0.03
R · text relevance †: 0.50 × 0.40 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
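The blended evidence weight above is simply the weighted sum of the four facet scores under the Balanced-mode weights. A minimal sketch of that arithmetic; the facet names and weights come from the breakdown above, but the variable names are illustrative, not part of any Arbiter API:

```python
# Balanced-mode facet weights (F/M/V/R) as shown on the page.
facet_weights = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}
# On this detail page every facet happens to score 0.50.
facet_scores = {name: 0.50 for name in facet_weights}

contributions = {name: facet_scores[name] * w
                 for name, w in facet_weights.items()}
evidence_weight = sum(contributions.values())

print(round(evidence_weight, 2))  # → 0.5
```

The per-facet products match the displayed contributions (0.20, 0.07, 0.03, 0.20 after rounding), totaling the 0.50 evidence weight shown.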