Measuring Genomic Data with PFP

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Summary

Prefix free parsing [Boucher et al., Alg. Mol. Biol., 2019], is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as π , and show that it is a powerful theoretical tool in its own right. To show this, we present two use cases. We first study the application of π as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z, r , and δ . We then turn to the use of π as a measure for pangenome openness . In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by an order of magnitude. We close the paper with the first systematic study of the parameter choice for PFP (window size w and modulus p ). This gives rise to interesting open questions.

Availability and implementation

The source code is available at https://github.com/simolucaa/piPFP , the accession codes for all the datasets used at https://github.com/simolucaa/piPFP_experiments .

Article activity feed