Measuring Genomic Data with PFP

Zsuzsanna Lipták
Simone Lucà
Francesco Masillo

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Summary

Prefix free parsing [Boucher et al., Alg. Mol. Biol., 2019], is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as π , and show that it is a powerful theoretical tool in its own right. To show this, we present two use cases. We first study the application of π as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z, r , and δ . We then turn to the use of π as a measure for pangenome openness . In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by an order of magnitude. We close the paper with the first systematic study of the parameter choice for PFP (window size w and modulus p ). This gives rise to interesting open questions.

Availability and implementation

The source code is available at https://github.com/simolucaa/piPFP , the accession codes for all the datasets used at https://github.com/simolucaa/piPFP_experiments .

Version published to 10.1101/2025.02.21.639270v1 on bioRxiv
Feb 27, 2025

IGD: A simple, efficient genotype data format

This article has 2 authors:
1. Drew DeHaas
2. Xinzhu Wei
This article has no evaluationsLatest version Feb 8, 2025
GrAnnoT, a tool for efficient and reliable annotation transfer through pangenome graph

This article has 3 authors:
1. Nina Marthe
2. Matthias Zytnicki
3. Francois Sabot
This article has no evaluationsLatest version Mar 2, 2025
PDBCleanV2: A Python Library for Generating Consistent Structure Datasets

This article has 6 authors:
1. Fátima Pardo-Avila
2. Liv Weiner
3. Paulina Cabral
4. Nicholas Corsepius
5. Frédéric Poitevin
6. Michael Levitt
This article has no evaluationsLatest version Feb 19, 2025

Listed in

Abstract

Summary

Availability and implementation

Article activity feed

Related articles

IGD: A simple, efficient genotype data format

GrAnnoT, a tool for efficient and reliable annotation transfer through pangenome graph

PDBCleanV2: A Python Library for Generating Consistent Structure Datasets