Full text
2,222 characters
· extracted from
oa-doi-fallback
· click to expand
Abstract
Summary Prefix-free parsing [Boucher et al., Alg. Mol. Biol., 2019] is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as π, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of π as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and δ (the text’s substring complexity). We then turn to the use of π as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by an order of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions.
Availability and implementation The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/piPFP_experiments.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Email addresses: francesco.masillo{at}tu-dortmund.de (Francesco Masillo), zsuzsanna.liptak{at}univr.it (Zsuzsanna Lipták)
This revision includes the following updates: (1) We updated the references for the applications of Prefix-Free Parsing (PFP) in the Introduction. (2) We expanded our discussion on our applications of PFP and included previous results to provide a more comprehensive overview of our findings. (3) We performed a new experiment to confirm the correlation of our measure π and r. (4) We moved most of the supplementary material into the main body of the article and included a link to the raw results.
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.