When Better Prediction Reduces Overlap: The Predictability Paradox in Propensity Score Matching with Machine Learning
Foong Soon Cheong
Abstract
Evidence from observational studies plays a central role in shaping public policy in health, education, and financial regulation, where randomized experiments are rarely feasible. Propensity score matching (PSM) is a widely used method for approximating fair comparisons between treatment and control groups. Incorporating machine learning into the estimation of propensity scores can strengthen prediction and enhance the credibility of findings. However, stronger predictive models create a “predictability paradox”: when treatment assignment is strongly predictable from observed covariates, improving predictive accuracy pushes the estimated propensity scores of treated and control units further apart, revealing limited overlap between the groups. In the limit, near-perfect prediction produces near-complete separation between groups, rendering traditional matching infeasible and confining inference to a narrow subset of units near the boundary of the propensity score distribution, a setting analogous to a regression discontinuity design (RDD). Researchers thus face a perverse incentive to use weaker models that yield statistically significant but spurious results. These dynamics jeopardize the reliability of evidence for policy. To safeguard decision-making, we propose a simple reform: require that studies using PSM disclose model error rates, including false positive and false negative rates, along with information on overlap and effective sample size.
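The paradox described in the abstract can be illustrated with a minimal simulation. The sketch below is not from the paper: it uses the true data-generating propensity, sigmoid(beta * x), rather than a fitted model, and the `overlap_fraction` helper and the common [0.1, 0.9] trimming band are illustrative choices. The parameter `beta` stands in for how predictable treatment assignment is from the covariate; as it grows, fewer units have propensity scores away from 0 and 1, so the region of common support shrinks.

```python
import numpy as np

def overlap_fraction(beta, n=100_000, lo=0.1, hi=0.9, seed=0):
    """Fraction of units whose true propensity score lies in [lo, hi].

    Treatment is assigned with probability sigmoid(beta * x), where
    x ~ N(0, 1). Larger beta means treatment is more predictable from
    the observed covariate, so propensities pile up near 0 and 1 and
    fewer units fall inside the common-support band [lo, hi].
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    p = 1.0 / (1.0 + np.exp(-beta * x))  # true propensity score
    return float(np.mean((p >= lo) & (p <= hi)))

# Overlap shrinks monotonically as predictability (beta) increases.
for beta in [0.5, 2.0, 5.0, 10.0]:
    print(f"beta={beta:5.1f}  overlap={overlap_fraction(beta):.3f}")
```

With `beta = 0.5` essentially every unit sits in the band, while at `beta = 10` only a thin slice of units near the decision boundary remains matchable, mirroring the RDD-like limit the abstract describes.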