Model calibration and evaluation via optimal subsampling using electronic health record data

Joochul Lee et al.

Journal of the Royal Statistical Society. Series A: Statistics in Society · 2026 · https://doi.org/10.1093/jrsssa/qnag036 · article

AJG 3

What the paper says

A common challenge for validating risk prediction models using electronic health record (EHR) data is that labels for the predicted outcome are not directly available. Towards efficient and unbiased model validation, we study optimal sampling designs for labelling an informative subset of patients in an EHR cohort. Given a pre-specified number of outcome labels, our design aims to minimize the asymptotic variance of an improved inverse probability weighted (‘I-IPW’) estimator for predictive accuracy metrics. Implementing the sampling requires accurate risk estimates and an estimate of the predictive accuracy metric of interest. We therefore propose to implement sampling in two steps. First, a portion of the target number of labels is acquired by applying entropy sampling to a random subset of the cohort. These initial labels are used to calibrate risk estimates and obtain an initial estimate of the predictive accuracy metric, both of which are used to inform optimal sampling of the remaining labels. The final estimates of the predictive accuracy metrics are obtained by applying the I-IPW estimator to the cohort and all acquired labels pooled together. Results from simulation studies and application to a real EHR dataset indicate superior efficiency of the proposed sampling design and I-IPW estimator.
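
The first sampling step described above can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give the optimal sampling probabilities or the form of the I-IPW estimator, so the sketch covers only an entropy-based pilot sample and swaps in a plain Hájek (normalized) inverse probability weighted estimate. The Brier score as the accuracy metric, the simulated Beta-distributed risks, and all function names are assumptions made for illustration.

```python
import numpy as np


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli(p) outcome."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def entropy_poisson_sample(risk, n_labels, rng):
    """Select units independently with probability proportional to entropy.

    Returns a boolean selection mask and the inclusion probabilities.
    The realised sample size is random, with expectation n_labels
    unless some probabilities are capped at 1.
    """
    score = binary_entropy(risk)
    pi = np.minimum(1.0, n_labels * score / score.sum())
    selected = rng.random(risk.size) < pi
    return selected, pi


def hajek_ipw_brier(y, risk, pi):
    """Normalized IPW estimate of the Brier score from labelled units only.

    Stand-in for the paper's I-IPW estimator, whose form is not given
    in the abstract.
    """
    w = 1.0 / pi
    return float(np.sum(w * (y - risk) ** 2) / np.sum(w))


def demo(seed=0, cohort_size=20_000, subset_size=5_000, n_labels=1_000):
    """Simulated pilot step: random subset, entropy sampling, IPW estimate."""
    rng = np.random.default_rng(seed)
    risk = rng.beta(2.0, 5.0, size=cohort_size)  # hypothetical model risks
    y = rng.binomial(1, risk)  # true labels; unknown in practice until acquired

    # Draw a simple random subset of the cohort, then sample within it.
    subset = rng.choice(cohort_size, size=subset_size, replace=False)
    selected, pi = entropy_poisson_sample(risk[subset], n_labels, rng)
    labelled = subset[selected]

    estimate = hajek_ipw_brier(y[labelled], risk[labelled], pi[selected])
    full_cohort = float(np.mean((y - risk) ** 2))  # available only in simulation
    return estimate, full_cohort
```

Calling `demo()` returns the pilot-sample estimate alongside the full-cohort Brier score, which is computable here only because the labels are simulated. The constant subset-selection probability cancels in the normalized estimator, which is why only the within-subset inclusion probabilities appear in the weights.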

Evidence weight

0.50

Balanced mode · F 0.40 / M 0.15 / V 0.05 / R 0.40

F · citation impact	0.50 × 0.4 = 0.20
M · momentum	0.50 × 0.15 = 0.075
V · venue signal	0.50 × 0.05 = 0.025
R · text relevance †	0.50 × 0.4 = 0.20

† Text relevance is estimated at 0.50 on the detail page — for your query’s actual relevance score, open this paper from a search result.
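
The breakdown above is consistent with a weighted sum of the four component scores, and the short Python sketch below reproduces it. The weighted-sum reading is an assumption inferred from the figures shown, since the page does not state the formula; the function name and dictionary layout are hypothetical.

```python
# Balanced-mode weights and per-component scores as shown in the table above.
WEIGHTS = {"F": 0.40, "M": 0.15, "V": 0.05, "R": 0.40}
SCORES = {"F": 0.50, "M": 0.50, "V": 0.50, "R": 0.50}


def evidence_weight(scores, weights):
    """Weighted sum of component scores (assumed formula, not stated on the page)."""
    return sum(scores[key] * weights[key] for key in weights)


# Per-component contributions, i.e. the right-hand column of the table.
contributions = {key: SCORES[key] * WEIGHTS[key] for key in WEIGHTS}
total = evidence_weight(SCORES, WEIGHTS)
```

Because the weights sum to 1 and every component score is 0.50, the total matches the displayed evidence weight of 0.50.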