Lossless Pangenome Indexing Using Tag Arrays

Parsa Eskandar
Benedict Paten
Jouni Sirén

Abstract

Pangenome graphs represent genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due to the scale and complexity of pangenomic data. We present a practical and scalable indexing framework based on tag arrays, which annotate positions in the Burrows-Wheeler transform (BWT) with graph coordinates. Our method extends the FM-index with a run-length compressed tag structure that enables efficient retrieval of all unique graph locations where a query pattern appears. We introduce a novel construction algorithm that combines unique k-mers, graph-based extensions, and haplotype traversal to compute the tag array in a memory-efficient manner. To support large genomes, we process each chromosome independently and then merge the results into a unified index using properties of the multi-string BWT and r-index. Our evaluation on the HPRC graphs demonstrates that the tag array structure compresses effectively, scales well with added haplotypes, and preserves accurate mapping information across diverse regions of the genome. This indexing method enables lossless and haplotype-aware querying in complex pangenomes and offers a practical indexing layer for developing scalable aligners and downstream graph-based analysis tools. The index additionally supports efficient one-to-all coordinate translation, enabling any interval on a haplotype to be mapped to its corresponding intervals across all other haplotypes in the graph.
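
To make the idea of a tag array concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a toy graph of four nodes and two haplotype paths, builds the BWT with a naive suffix sort, and counts occurrences by scanning the BWT directly. The function names (`spell_path`, `build_index`, `find_tags`), the `(node_id, offset)` tag format, and the example graph are all hypothetical choices made for illustration. The construction described in the abstract (unique k-mers, graph-based extensions, per-chromosome merging) is not reproduced here.

```python
from bisect import bisect_right

def spell_path(path, nodes):
    """Spell one haplotype and tag each base with (node_id, offset).
    The tag format is an assumption made for this sketch."""
    seq, tags = "", []
    for node_id in path:
        seq += nodes[node_id]
        tags += [(node_id, off) for off in range(len(nodes[node_id]))]
    return seq + "$", tags + [None]  # the terminator carries no graph position

def build_index(text, tags):
    """Naive construction: suffix array by sorting, BWT, C array, and
    the tag array stored as runs of equal tags in suffix-array order."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps to the final '$'
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total  # number of characters in text smaller than c
        total += text.count(c)
    run_starts, run_tags = [], []
    for rank, pos in enumerate(sa):
        if not run_tags or run_tags[-1] != tags[pos]:
            run_starts.append(rank)  # a new run begins at this BWT position
            run_tags.append(tags[pos])
    return bwt, C, run_starts, run_tags

def find_tags(index, pattern):
    """Backward search for pattern, then collect the distinct tags of
    every run that overlaps the resulting BWT range [lo, hi)."""
    bwt, C, run_starts, run_tags = index
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return set()
        lo = C[c] + bwt[:lo].count(c)  # naive rank query, linear time
        hi = C[c] + bwt[:hi].count(c)
        if lo >= hi:
            return set()
    found = set()
    k = bisect_right(run_starts, lo) - 1  # run containing position lo
    while k < len(run_starts) and run_starts[k] < hi:
        found.add(run_tags[k])
        k += 1
    return found

# Hypothetical toy graph: node 1 branches to node 2 or node 3, then rejoins at node 4.
nodes = {1: "GAT", 2: "T", 3: "C", 4: "ACA"}
text, tags = "", []
for path in ([1, 2, 4], [1, 3, 4]):
    seq, seq_tags = spell_path(path, nodes)
    text += seq
    tags += seq_tags

index = build_index(text, tags)
print(sorted(find_tags(index, "CA")))  # [(3, 0), (4, 1)]
```

In this toy example the pattern "CA" occurs three times across the two haplotypes but starts at only two distinct graph positions, which is the kind of redundancy the run-length compressed tag structure exploits: the 16 BWT positions collapse into 9 tag runs. A practical index would replace the naive suffix sort and linear rank queries with succinct data structures.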

Version published to 10.21203/rs.3.rs-8233501/v1 on Research Square
Jan 18, 2026

