Introducing STRAND: A Foundational Sequence Transformer for Range Adaptive Nucleotide Decoding
Abstract
The advent of high-throughput sequencing has led to an exponential increase in genomic data, highlighting the need for efficient and accurate models to analyze and interpret this information. Here, we introduce a novel exomic foundational model that leverages a combination of human reference genome and multispecies data to improve variant detection and interpretation. Our model uses a short-range transformer architecture and is trained on a large dataset of human exomic sequences derived from the Tapestry study. Through a series of ablation studies and scaling experiments, we demonstrate the effectiveness of our model in next-token prediction and in identifying clinically pathogenic variants. We also show that our model outperforms existing models in a range of downstream tasks, including variant effect prediction and disease state identification. Our largest STRAND variant (1B parameters) surpassed previous benchmarks, achieving a mean accuracy of 0.880 (an 8.2% improvement over the original NT and a 7% improvement over NT-v2). Furthermore, we construct a unique exomic ClinVar dataset to evaluate the model’s performance on pathogenicity and disease states. Our results highlight the potential of this model to improve our understanding of the human exome and its role in disease. The model and its applications have significant implications for genomics-based diagnosis and personalized medicine, including tailored therapeutic development.