Re²Pair: Increasing the Scalability of RePair by Decreasing Memory Usage

Justin Kim
Rahul Varki
Marco Oliva
Christina Boucher


Abstract

The RePair compression algorithm produces a context-free grammar by iteratively substituting the most frequently occurring pair of consecutive symbols with a new symbol until every pair of consecutive symbols appears only once in the compressed text. It is widely used in bioinformatics, machine learning, and information retrieval, where random access to the original input text is needed. For example, in pangenomics, RePair is used for random access to a population of genomes. BigRePair improves the scalability of the original RePair algorithm by using Prefix-Free Parsing (PFP) to preprocess the text prior to building the RePair grammar. Despite the efficiency of PFP on repetitive text, the size of the parse does not scale well, which causes a memory bottleneck in BigRePair. In this paper, we design and implement recursive RePair (denoted as Re²Pair), which builds the RePair grammar using recursive PFP. Our novel algorithm faces the challenge of constructing the RePair grammar without direct access to the parse of the text, relying solely on the dictionary of the text and the parse and dictionary of the parse of the text. We compare Re²Pair to BigRePair using SARS-CoV-2 haplotypes and haplotypes from the 1000 Genomes Project. We show that Re²Pair achieves a peak memory reduction of over 40% and a speedup ranging from 12% to 79% compared to BigRePair when compressing the largest input texts in all experiments. Re²Pair is made publicly available under the GNU public license at https://github.com/jkim210/Recursive-RePair
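
The pair-substitution loop described in the first sentence of the abstract can be illustrated with a minimal sketch. This is a naive version that recounts all pairs on every round, so it is far slower than practical RePair implementations, and it is not the authors' Re²Pair or BigRePair code. All function names (`count_pairs`, `replace_pair`, `repair`, `expand`) are hypothetical, and the choice of integers for new symbols and of first-seen tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run like "aaa", the pair (a, a) at i overlaps the one at i - 1.
        if pair[0] == pair[1] and last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` with `new_sym`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive RePair sketch: returns (final sequence, rules).

    Terminals are characters; new symbols are integers 0, 1, 2, ...
    (an assumption for this sketch). rules[k] = (left, right).
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        # Ties are broken by first appearance (assumption for this sketch).
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # every remaining pair appears only once
        new_sym = len(rules)
        rules[new_sym] = pair
        seq = replace_pair(seq, pair, new_sym)
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding new symbols back into characters."""
    out = []
    stack = list(reversed(seq))
    while stack:
        sym = stack.pop()
        if sym in rules:
            left, right = rules[sym]
            stack.append(right)
            stack.append(left)
        else:
            out.append(sym)
    return "".join(out)


if __name__ == "__main__":
    seq, rules = repair("abababab")
    print(seq, rules)
    print(expand(seq, rules))
```

On a repetitive input the grammar stays small because each round removes every occurrence of the chosen pair at once; the memory cost of tracking pairs over a very large sequence is the kind of bottleneck the abstract attributes to BigRePair's parse.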

2012 ACM Subject Classification

Theory of computation → Formal languages and automata theory

Version published to 10.1101/2024.07.11.603142v1 on bioRxiv
Jul 16, 2024

Related articles

ConsensuSV-ONT - a modern method for accurate structural variant calling
Antoni Pietryga, Mateusz Chilinski, Sachin Gadakh, Dariusz Plewczynski
Latest version Jul 26, 2024

FoldToken4: Consistent & Hierarchical Fold Language
Zhangyang Gao, Cheng Tan, Stan Z. Li
Latest version Aug 4, 2024

DNA data storage: a generative tool for data encoding motifs
Samira Brunmayr, Omer Shimon Sella, Thomas Heinis
Latest version Jul 29, 2024
