IONE: Incoherence-Oriented Neutralisation and Extraction for Detecting Hidden Population Structure in Observational Studies

Onishi Tatsuki

Abstract

Background: Observational studies are susceptible to multiple biases arising from hidden population structure, including confounding, Simpson's paradox, undetected effect modification, the ecological fallacy, and non-collapsibility. Existing adjustment methods such as propensity scores and prognostic scores address only measured confounders and provide no mechanism for detecting subgroup structure driven by unmeasured variables. We propose IONE (Incoherence-Oriented Neutralisation and Extraction), a framework that quantifies population incoherence and extracts coherent subpopulations using routinely measured variables alone.

Methods: We conducted a Monte Carlo simulation study following the ADEMP framework. Data were generated from a causal directed acyclic graph with three intentionally withheld variables (age, sex, BMI) influencing ten measured variables and a binary outcome. We evaluated six stratification methods in two families: decision power-based methods (predicted probability, residual, cross-validated, machine learning uncertainty), which exploit the outcome, and feature score-based methods (principal component analysis, clustering), which operate in the covariate space alone. Performance was assessed by the Adjusted Rand Index (ARI), eta-squared (η²), and a coherence indicator (C1) derived from the I² heterogeneity statistic. Phase 1 comprised 18,000 evaluations across 1,200 scenarios; sensitivity analyses comprised 48,600 evaluations across 8,100 scenarios. We additionally applied IONE to five published instances of Simpson's paradox: COVID-19 case fatality rates, kidney stone treatments, UC Berkeley admissions, Israeli vaccine effectiveness, and the smoking–mortality paradox.

Results: In simulations, all proposed methods significantly outperformed random stratification (best ARI = 0.020 vs. 0.000, p < 0.001). Decision power-based methods consistently outperformed feature score-based methods. The strength of the hidden variable's influence on measured variables (Z→X influence) was the primary determinant of performance, with ARI increasing up to 18-fold from weak to strong influence conditions. The coherence indicator C1 clearly distinguished incoherent from coherent populations (proposed methods C1 = 0.001 vs. random C1 = 0.863). In empirical validation, C1 correctly detected incoherence in all five examples (C1 = 0.001–0.034 vs. random C1 = 0.695–1.000). For two-group structures, stratification achieved high accuracy (kidney stone ARI = 0.851; Israeli vaccine ARI = 0.746). For multi-group structures, detection power was limited (COVID-19 ARI = 0.064; Berkeley ARI = 0.082).

Conclusions: IONE provides a two-tier contribution. First, the C1 coherence indicator reliably detects population incoherence regardless of subgroup complexity. Second, stratification-based extraction of coherent subpopulations is effective when hidden variables leave sufficiently strong traces in measured variables (η² > 0.4) and the subgroup structure is discrete. We recommend that coherence assessment be incorporated as a standard step in observational study reporting.
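The abstract names three quantities (ARI, η², and the I² statistic underlying C1) without giving formulas. The sketch below shows the textbook definitions of each in plain Python, as a reading aid only. The function names are hypothetical and are not taken from the preprint, and the mapping from I² to the C1 indicator is not stated in the abstract, so it is not attempted here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected agreement between two partitions (Hubert-Arabie ARI)."""
    n = len(true_labels)
    if n < 2 or n != len(pred_labels):
        raise ValueError("need two label sequences of equal length >= 2")
    # Pair counts within contingency cells, within rows, and within columns.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def eta_squared(values, groups):
    """Share of total variance in `values` explained by `groups` (SS_between / SS_total)."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total


def i_squared(estimates, variances):
    """Higgins-Thompson I^2 from per-stratum effect estimates and their variances."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))  # Cochran's Q
    df = len(estimates) - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0
```

For example, `adjusted_rand_index(hidden_group, recovered_stratum)` would score how well a stratification recovers a withheld grouping, and `eta_squared(measured_variable, hidden_group)` would quantify how strong a trace the hidden variable leaves in one measured variable. Both argument names are illustrative.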

Version published to 10.21203/rs.3.rs-9271445/v1 on Research Square
Apr 15, 2026

Related articles

Heterogeneity in Statistics: A Conceptual and Methodological Review

This article has 3 authors:
1. Zhanshan (Sam) Ma
2. Shu Liu
3. Aaron Ellison
This article has no evaluations
Latest version Apr 14, 2026

Emergent Causality and Robust Estimation in Open Quantum-Compatible Systems under Non-Unitary Selection

This article has 1 author:
1. Joonsung Kang
This article has no evaluations
Latest version Apr 16, 2026

Fine-grained Debiasing for Large Language Models via Bias Intensity and Probability Decoupling

This article has 4 authors:
1. Zhuge Yan
2. Xiaolong Gong
3. Wangchao Wu
4. Zhike Han
This article has no evaluations
Latest version Apr 6, 2026