Fast and Memory-Efficient Dynamic Programming Approach for Large-Scale EHH-Based Selection Scans

Amatur Rahman
T. Quinn Smith
Zachary A. Szpiech

Abstract

Haplotype-based statistics are widely used to find genomic regions under positive selection. At the heart of many such statistics is the computation of extended haplotype homozygosity (EHH), which captures the decay of homozygosity away from a focal site. This computation, repeated for potentially millions of sites, is demanding because it involves iteratively tracking counts of unique haplotypes over long genomic distances and across many individuals. As a result, existing tools scale poorly on large population datasets such as the 1000 Genomes Project or the UK Biobank, which includes 500,000 individuals. Efficient computation becomes crucial as datasets grow, particularly when handling large sample sizes or generating training data for machine learning algorithms.
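To make the cost of this computation concrete, the following is a minimal Python sketch of a naive EHH calculation for a single focal site. It uses the standard definition of EHH as the fraction of haplotype pairs that remain identical from the focal site out to a given marker. It is not the dynamic programming algorithm proposed here, nor selscan's implementation; the function name `ehh_decay` and the list-of-lists input layout are illustrative assumptions.

```python
from collections import Counter
from math import comb


def ehh_decay(haplotypes, focal, allele, direction=1):
    """Naive EHH for the haplotypes carrying `allele` at site `focal`.

    haplotypes: list of equal-length sequences of 0/1 alleles, one per
                phased haplotype (hypothetical input layout).
    direction:  +1 to extend downstream, -1 to extend upstream.
    Returns a list of (site_index, EHH) pairs moving away from the focal site.
    """
    carriers = [h for h in haplotypes if h[focal] == allele]
    n = len(carriers)
    if n < 2:
        return []  # EHH is undefined with fewer than two haplotypes

    n_sites = len(haplotypes[0])
    # All carriers start in one group; groups are refined site by site.
    group_ids = [0] * n
    result = [(focal, 1.0)]
    site = focal + direction
    while 0 <= site < n_sites:
        # Two haplotypes stay together only if they were already in the
        # same group and carry the same allele at this site.
        keys = [(g, h[site]) for g, h in zip(group_ids, carriers)]
        relabel = {}
        group_ids = [relabel.setdefault(k, len(relabel)) for k in keys]
        counts = Counter(group_ids)
        # Fraction of haplotype pairs still identical out to this site.
        ehh = sum(comb(c, 2) for c in counts.values()) / comb(n, 2)
        result.append((site, ehh))
        site += direction
    return result
```

Each focal site requires a pass over every carrier haplotype at every site visited, so repeating this for millions of focal sites is what makes the naive approach expensive on large samples.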

Here, we propose a dynamic programming algorithm that substantially improves runtime and memory usage over existing tools on both real and simulated data. On real phased data, we achieve a 5-50x speedup with a minimal memory footprint. Our simulations show an even more pronounced performance gap with large populations (up to a 15x speedup and a 46x memory reduction). EHH-based statistics designed for unphased genotypes run an order of magnitude faster, and multi-parameter support yields a 20x runtime improvement. Source code and binaries are available at https://github.com/szpiech/selscan as selscan v2.1.

Version published to 10.1101/2025.04.09.647986v1 on bioRxiv
Apr 15, 2025
