Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors

Nathalie Gocht
Alexander Schliep

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Background: Estimating how many approximate matches a query sequence has in a genome is a fundamental task in bioinformatics, with applications including read mapping or genome mappability analysis. While exact string algorithms and specialized index data structures enable low-error matching, counting matches for short k-mers at high dissimilarity (small k/e ratio) becomes computationally intractable as the number of potential matches grow exponentially for Hamming distance, and even more so for other dissimilarity measures including edit distance. Results: We use machine learning to solve this algorithmic problem. Specifically, we propose deep learning models which in contrast to learned index approaches do not accelerate exact algorithms, but rather directly predict Hit Frequency Vectors (HFVs), which count approximate matches at increasing distance e = 1, 2, 3, . . .. The models are trained on HFVs which are computed using brute-force (for Hamming distance), respectively an efficient sampling approach followed by a search to account for undersampling close matches and correcting counts. Model training and inference are computationally efficient, and enable rapid genome-scale estimation of approximate match frequencies for short query sequences and small k/e. The models capture genome-wide match frequency distributions with high accuracy and allow fast batch prediction even in regimes where exhaustive computation is computationally infeasible for exact algorithms relying on seed-and-extend due to limited seed length leading to exponential effort. We provide loss functions and evaluation measures appropriate for histograms with bin count varying by several orders of magnitude. Conclusion: Our approach enables practical approximation of HFVs and can be easily extended to complex dissimilarities which are too computationally expensive to evaluate exhaustively across an entire genome. The combination of approximate HFV generation and predictive modeling extends the scope of sequence uniqueness analyses considerably. Our approach can trivially be extended to continuous dissimilarities like oligonucleotide binding affinity via appropriate binning for applications in DNA storage, target depletion, or oligonucleotide therapeutics. Lastly, we believe that our work can provide a blue-print for further applications in bioinformatics where straight-forward ML prediction—possibly followed by confirmation with exact algorithms—could have great utility.

Version published to 10.21203/rs.3.rs-8865258/v1 on Research Square
Feb 26, 2026

HitSV: Maximizing discovery of structural variants across sequencing technologies

This article has 5 authors:
1. Yadong Wang
2. Gaoyang Li
3. Yadong Liu
4. Bo Liu
5. Long Qian
This article has no evaluationsLatest version Feb 20, 2026
Genomic Data Classification via Universal Compression

This article has 6 authors:
1. Yasmine Omri
2. Naomi Sagan
3. Eugene Min
4. Heewoong Choi
5. Taesup Moon
6. Tsachy Weissman
This article has no evaluationsLatest version Mar 3, 2026
Benchmarking protein sequence and structure search methods for remote homology detection

This article has 6 authors:
1. Yuan Liu
2. Yingquan Zhou
3. Yan Huang
4. Hongyi Xin
5. Xiaoyong Pan
6. Hong-Bin Shen
This article has no evaluationsLatest version Feb 24, 2026

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

HitSV: Maximizing discovery of structural variants across sequencing technologies

Genomic Data Classification via Universal Compression

Benchmarking protein sequence and structure search methods for remote homology detection