Combining Propensity Scores and Common Items for Test Score Equating

Inga Laukaityte et al.

Applied Psychological Measurement · 2025 · https://doi.org/10.1177/01466216251363240 · article
AJG 2 · ABDC B
Weight
0.37

Abstract

Ensuring that test scores are fair and comparable across different test forms and different test groups is a significant statistical challenge in educational testing. Methods to achieve score comparability, a process known as test score equating, often rely on including common test items or assuming that test taker groups are similar in key characteristics. This study explores a novel approach that combines propensity scores, based on test takers' background covariates, with information from common items using kernel smoothing techniques for binary-scored test items. An empirical analysis using data from a high-stakes college admissions test evaluates the standard errors and differences in adjusted test scores. A simulation study examines the impact of factors such as the number of test takers, the number of common items, and the correlation between covariates and test scores on the method's performance. The findings demonstrate that integrating propensity scores with common item information reduces standard errors and bias more effectively than using either source alone. This suggests that balancing the groups on the test takers' covariates enhances the fairness and accuracy of test score comparisons across different groups. The proposed method highlights the benefits of considering all the collected data to improve score comparability.

1 citation

Cite this paper

https://doi.org/10.1177/01466216251363240

Or copy a formatted citation

@article{inga2025,
  title        = {{Combining Propensity Scores and Common Items for Test Score Equating}},
  author       = {Laukaityte, Inga and others},
  journal      = {Applied Psychological Measurement},
  year         = {2025},
  doi          = {10.1177/01466216251363240},
}

Paste directly into BibTeX, Zotero, or your reference manager.


Evidence weight

0.37

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact: 0.16 × 0.4 = 0.06
M · momentum: 0.53 × 0.15 = 0.08
V · venue signal: 0.50 × 0.05 = 0.03
R · text relevance †: 0.50 × 0.4 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
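
The displayed weight of 0.37 is consistent with a weighted sum of the four component scores, each multiplied by its Balanced-mode weight. The page does not state the formula outright, so the sketch below treats the weighted sum as an assumption; the function name `evidence_weight` and the dictionary layout are hypothetical and used only for illustration.

```python
# Component scores and Balanced-mode weights as displayed on this page.
# Assumption: the evidence weight is the plain weighted sum of these terms.
components = {
    "F": (0.16, 0.40),  # citation impact
    "M": (0.53, 0.15),  # momentum
    "V": (0.50, 0.05),  # venue signal
    "R": (0.50, 0.40),  # text relevance (detail-page default of 0.50)
}


def evidence_weight(parts):
    """Return the sum of score * weight over all (score, weight) pairs."""
    return sum(score * weight for score, weight in parts.values())


weight = evidence_weight(components)
print(f"{weight:.2f}")  # prints 0.37
```

Because R carries a weight of 0.40, replacing the 0.50 default with a query-specific relevance score can shift the total by up to 0.20 in either direction.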