Information Bottleneck Dominates Adversarial Training for Ancestry-Invariant Polygenic Risk Prediction: Dimensionality, Not Gradient Reversal, Controls the Fairness-Accuracy Tradeoff

Philip Phuong Tran
Anh T. Do

Abstract

In adversarial representation learning for fair prediction, the gradient reversal coefficient (λ) is widely treated as the primary control for sensitive-attribute invariance. We show this assumption is wrong. Using a dual-stream architecture for cross-ancestry polygenic risk score (PRS) prediction, we demonstrate that latent dimensionality (the information bottleneck) accounts for 8–27× more variance in ancestry leakage than adversarial strength. Varying λ across a 20× range changes leakage by only 2.2 percentage points (pp); varying dimensionality across a 16× range changes it by 46.6 pp. At dimension 8 with no adversarial training (λ = 0), ancestry leakage is 32.9% (chance = 20%): the bottleneck alone achieves near-invariance. The adversary architecture (linear vs. deep MLP) matters equally little (0.6 pp range). We validate this finding across two unrelated domains: genomic ancestry invariance (6 clinical traits, 1000 Genomes, n = 2,504) and EEG subject invariance (pretrained HFTP + Braindecode dual-domain model, 20 subjects). Both show consistent dimensionality dominance (12.7:1 ratio in EEG).
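
As a quick sanity check on the figures quoted above, the following sketch recomputes the ratio of the two reported leakage ranges and the leakage in excess of chance at dimension 8. The helper names and the `normalized_leakage` rescaling are illustrative assumptions, not quantities defined in the abstract, and the range ratio is not necessarily how the 8–27× variance figure was derived.

```python
# Figures quoted in the abstract, in percentage points (pp) or percent.
LAMBDA_RANGE_PP = 2.2   # leakage change across a 20x range of lambda
DIM_RANGE_PP = 46.6     # leakage change across a 16x range of latent dimension
LEAKAGE_DIM8 = 32.9     # ancestry leakage (%) at dimension 8 with lambda = 0
CHANCE = 20.0           # chance level (%) stated in the abstract


def range_ratio(numerator_pp, denominator_pp):
    """Ratio of two leakage ranges, both in percentage points."""
    return numerator_pp / denominator_pp


def excess_over_chance(leakage_pct, chance_pct):
    """Leakage above the chance level, in percentage points."""
    return leakage_pct - chance_pct


def normalized_leakage(leakage_pct, chance_pct):
    """Hypothetical rescaling: 0 means chance, 1 means perfect prediction."""
    return (leakage_pct - chance_pct) / (100.0 - chance_pct)


if __name__ == "__main__":
    print(round(range_ratio(DIM_RANGE_PP, LAMBDA_RANGE_PP), 1))    # about 21.2
    print(round(excess_over_chance(LEAKAGE_DIM8, CHANCE), 1))      # 12.9
    print(round(normalized_leakage(LEAKAGE_DIM8, CHANCE), 3))      # about 0.161
```

The range ratio of roughly 21 falls inside the reported 8–27× band, and the bottleneck-only configuration sits 12.9 pp above chance.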

For the genomic application, Stream 1 encodes population structure via DCT-II frequency-domain features (136 coefficients); Stream 2 encodes phenotype signal from top PRS SNPs (PCA to 128 dimensions). The architecture works equally well with standard genomic PCA as the ancestry stream (R² = 0.217 vs. 0.222), confirming the contribution is architectural, not encoding-specific. African-ancestry PRS reconstruction R² improves on all six traits (e.g., +5.1 pp for coronary artery disease, CAD). Linear models achieve higher aggregate R² but fail catastrophically on cross-ancestry transfer (R² = −12.45 for African-ancestry CAD). We emphasize that we predict PRS (a computed score), not disease phenotypes; validation on biobank-scale phenotype data is ongoing.
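
For readers unfamiliar with the DCT-II used in Stream 1, a minimal pure-Python sketch of the transform is given here. The abstract does not say which signal is transformed, which normalization is applied, or how the 136 coefficients are selected; the function name `dct2`, the unnormalized form, and the choice to keep the first `n_coeffs` low-frequency coefficients are all assumptions made for illustration.

```python
import math


def dct2(signal, n_coeffs):
    """Unnormalized DCT-II, truncated to the first n_coeffs coefficients.

    X_k = sum_{n=0}^{N-1} x_n * cos(pi / N * (n + 0.5) * k)
    """
    n = len(signal)
    if not 0 < n_coeffs <= n:
        raise ValueError("n_coeffs must be between 1 and len(signal)")
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n_coeffs)
    ]


if __name__ == "__main__":
    # Hypothetical input: a vector of 200 per-site values (e.g. allele dosages).
    toy_vector = [float(i % 3) for i in range(200)]
    features = dct2(toy_vector, n_coeffs=136)
    print(len(features))  # 136
```

In practice a library routine such as `scipy.fft.dct` with `type=2` would replace the quadratic-time loop; the explicit sum is shown only to make the definition concrete.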

These results suggest the adversarial fairness community has been over-investing in adversary engineering relative to simple capacity control. Practitioners should select latent dimensionality first to set the information budget for the fairness-accuracy tradeoff, then optionally use adversarial training for marginal refinement.
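
The dimensionality-first recommendation can be sketched as a simple selection rule over a dimension sweep: keep only the dimensions whose leakage fits a chosen budget, then take the one with the best accuracy. The `SweepPoint` type, the `pick_dimension` helper, and every number in `toy_sweep` are hypothetical placeholders for illustration; they are not results from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class SweepPoint(NamedTuple):
    dim: int            # latent dimensionality
    leakage_pct: float  # sensitive-attribute leakage (%), measured with lambda = 0
    r2: float           # task accuracy (R^2) at this dimension


def pick_dimension(
    sweep: Sequence[SweepPoint], max_leakage_pct: float
) -> Optional[SweepPoint]:
    """Return the most accurate point whose leakage fits the budget, if any."""
    feasible = [p for p in sweep if p.leakage_pct <= max_leakage_pct]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.r2)


# Placeholder values only; NOT taken from the paper.
toy_sweep = [
    SweepPoint(dim=4, leakage_pct=25.0, r2=0.10),
    SweepPoint(dim=16, leakage_pct=40.0, r2=0.18),
    SweepPoint(dim=64, leakage_pct=70.0, r2=0.22),
]

choice = pick_dimension(toy_sweep, max_leakage_pct=45.0)
print(choice.dim if choice else "no dimension fits the budget")  # 16
```

Adversarial training with λ > 0 would then be an optional second step applied at the chosen dimension.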

Version published to 10.64898/2026.04.24.720752 on bioRxiv
Apr 29, 2026

