Information Bottleneck Dominates Adversarial Training for Ancestry-Invariant Polygenic Risk Prediction: Dimensionality, Not Gradient Reversal, Controls the Fairness-Accuracy Tradeoff

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

In adversarial representation learning for fair prediction, the gradient reversal coefficient ( λ ) is widely treated as the primary control for sensitive-attribute invariance. We show this assumption is wrong. Using a dual-stream architecture for cross-ancestry polygenic risk score (PRS) prediction, we demonstrate that latent dimensionality — the information bottleneck — accounts for 8–27 × more variance in ancestry leakage than adversarial strength. Varying λ across a 20 × range changes leakage by only 2.2 percentage points; varying dimensionality across a 16 × range changes it by 46.6 pp. At dimension 8 with no adversarial training ( λ = 0), ancestry leakage is 32.9% (chance = 20%): the bottleneck alone achieves near-invariance. The adversary architecture (linear vs deep MLP) is equally irrelevant (0.6 pp range). We validate this finding across two unrelated domains — genomic ancestry invariance (6 clinical traits, 1000 Genomes, n = 2,504) and EEG subject invariance (pretrained HFTP + Braindecode dual-domain model, 20 subjects) — observing consistent dimensionality dominance (12.7:1 ratio in EEG).

For the genomic application, Stream 1 encodes population structure via DCT-II frequencydomain features (136 coefficients); Stream 2 encodes phenotype signal from top PRS SNPs (PCA to 128 dimensions). The architecture works equally well with standard genomic PCA as the ancestry stream ( R 2 = 0.217 vs 0.222), confirming the contribution is architectural, not encoding-specific. African-ancestry PRS reconstruction R 2 improves on all six traits (e.g., +5.1 pp for coronary artery disease). Linear models achieve higher aggregate R 2 but fail catastrophically on cross-ancestry transfer ( R 2 = − 12.45 for African-ancestry CAD). We emphasize that we predict PRS (a computed score), not disease phenotypes; validation on biobank-scale phenotype data is ongoing.

These results suggest the adversarial fairness community has been over-investing in adversary engineering relative to simple capacity control. Practitioners should select latent dimensionality first to set the information budget for the fairness-accuracy tradeoff, then optionally use adversarial training for marginal refinement.

Article activity feed