Predicting Genome-Wide Approximate Match Frequencies with Hit Frequency Vectors

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Background: Estimating how many approximate matches a query sequence has in a genome is a fundamental task in bioinformatics, with applications including read mapping or genome mappability analysis. While exact string algorithms and specialized index data structures enable low-error matching, counting matches for short k-mers at high dissimilarity (small k/e ratio) becomes computationally intractable as the number of potential matches grow exponentially for Hamming distance, and even more so for other dissimilarity measures including edit distance. Results: We use machine learning to solve this algorithmic problem. Specifically, we propose deep learning models which in contrast to learned index approaches do not accelerate exact algorithms, but rather directly predict Hit Frequency Vectors (HFVs), which count approximate matches at increasing distance e = 1, 2, 3, . . .. The models are trained on HFVs which are computed using brute-force (for Hamming distance), respectively an efficient sampling approach followed by a search to account for undersampling close matches and correcting counts. Model training and inference are computationally efficient, and enable rapid genome-scale estimation of approximate match frequencies for short query sequences and small k/e. The models capture genome-wide match frequency distributions with high accuracy and allow fast batch prediction even in regimes where exhaustive computation is computationally infeasible for exact algorithms relying on seed-and-extend due to limited seed length leading to exponential effort. We provide loss functions and evaluation measures appropriate for histograms with bin count varying by several orders of magnitude. Conclusion: Our approach enables practical approximation of HFVs and can be easily extended to complex dissimilarities which are too computationally expensive to evaluate exhaustively across an entire genome. The combination of approximate HFV generation and predictive modeling extends the scope of sequence uniqueness analyses considerably. Our approach can trivially be extended to continuous dissimilarities like oligonucleotide binding affinity via appropriate binning for applications in DNA storage, target depletion, or oligonucleotide therapeutics. Lastly, we believe that our work can provide a blue-print for further applications in bioinformatics where straight-forward ML prediction—possibly followed by confirmation with exact algorithms—could have great utility.

Article activity feed