Correcting Label-noise Corruption with lambda-Odds Ratio
Abstract
Attribution in large-scale observational studies is often distorted by outcome misclassification and selection, which can inflate weak associations into significant risk factors or spurious "protective" effects. We present the lambda-Odds Ratio ($\SPOR$), a misclassification-corrected odds-ratio estimator that (i) uses two ROC thresholds to select high-purity tails and (ii) corrects the observed $2\times 2$ table via a minimal-feasible ridge inversion. Large-sample Wald intervals are obtained on the log scale, with total variance that includes the delta-method contribution from estimating selection-conditional error rates on a validation cohort. In simulation, the naive log-odds ratio exhibits attenuation and coverage collapse, whereas the corrected estimator remains effectively unbiased and sustains substantially higher coverage with only modest RMSE inflation. In EHR-scale applications to Alzheimer's disease and related dementias, idiopathic pulmonary fibrosis, and autism spectrum disorder, the method reduces inflated discoveries seen with naive odds ratios and SHAP attributions, while surfacing biologically credible modulators. These results provide a practical, defensible framework for attribution in biobank- and EHR-scale studies.
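The table-correction step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single nondifferential sensitivity and specificity for the outcome label, applies the textbook matrix-method correction to each exposure column, and interprets "minimal-feasible ridge" as the smallest penalty on a fixed grid that leaves all corrected cells positive. The function names, the ridge grid, and the counts are hypothetical, and the two-threshold selection step and the Wald interval are omitted.

```python
import numpy as np


def naive_log_or(table):
    """Log odds ratio of a 2x2 table: rows = outcome (case, control),
    columns = exposure (exposed, unexposed)."""
    (a, b), (c, d) = np.asarray(table, dtype=float)
    return np.log(a * d / (b * c))


def corrected_column(observed, sens, spec, ridge):
    """Recover true (case, control) counts for one exposure group.

    Assumption: observed = M @ true, with M built from a known
    sensitivity and specificity. The ridge term stabilizes the
    inversion; ridge = 0 reproduces the exact inverse of M.
    """
    M = np.array([[sens, 1.0 - spec],
                  [1.0 - sens, spec]])
    A = M.T @ M + ridge * np.eye(2)
    return np.linalg.solve(A, M.T @ observed)


def corrected_log_or(table, sens, spec,
                     ridges=(0.0, 1e-4, 1e-3, 1e-2, 1e-1)):
    """Corrected log odds ratio using the smallest ridge on the
    (hypothetical) grid that yields strictly positive cell counts."""
    table = np.asarray(table, dtype=float)
    for ridge in ridges:
        corrected = np.column_stack(
            [corrected_column(table[:, j], sens, spec, ridge)
             for j in range(2)]
        )
        if np.all(corrected > 0):
            return naive_log_or(corrected), ridge
    raise ValueError("no ridge value on the grid gives a feasible table")


# Illustrative counts only: a true table with odds ratio 2.25
# (exposed: 200 cases / 800 controls; unexposed: 100 / 900),
# observed through labels with sensitivity 0.8 and specificity 0.9.
observed = np.array([[240.0, 170.0],
                     [760.0, 830.0]])

naive = naive_log_or(observed)                           # exp(naive) is about 1.54
estimate, ridge = corrected_log_or(observed, 0.8, 0.9)   # exp(estimate) is about 2.25
```

In this toy case the naive odds ratio is attenuated toward 1, and the correction recovers the true value with no ridge penalty needed. With noisier error rates or sparse cells, the exact inverse can produce negative counts, which is where a nonzero ridge would come into play.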