Measuring Genomic Data with Prefix-Free Parsing

doi:10.1101/2025.02.21.639270

Measuring Genomic Data with Prefix-Free Parsing

2025 · doi:10.1101/2025.02.21.639270

preprint OA: closed

📄 Open PDF Full text JSON View at publisher

Full text 2,222 characters · extracted from oa-doi-fallback · click to expand

Abstract Summary Prefix-free parsing [Boucher et al., Alg. Mol. Biol., 2019] is a highly effective heuristic for computing text indexes for very large amounts of biological data. The algorithm constructs a data structure, the prefix-free parse (PFP) of the input, consisting of a dictionary and a parse, which is then used to speed up computation of the final index. In this paper, we study the size of the PFP, which we refer to as π, and show that it is a powerful tool in its own right. To show this, we present two use cases. We first study the application of π as a repetitiveness measure of the input text, and compare it to other currently used repetitiveness measures, including z (the number of Lempel-Ziv phrases), r (the number of runs of the Burrows-Wheeler Transform), and δ (the text’s substring complexity). We then turn to the use of π as a measure for pangenome openness. In both applications, our results are similar to existing measures, but our tool, in almost all cases, is more efficient than those computing the other measures, both in terms of time and space, sometimes by an order of magnitude. We close the paper with a detailed systematic study of the parameter choice for PFP (window size w and modulus p). This gives rise to interesting open questions. Availability and implementation The source code is available at https://github.com/simolucaa/piPFP. The accession codes for all the datasets used and the raw results are available at https://github.com/simolucaa/piPFP_experiments. Competing Interest Statement The authors have declared no competing interest. Footnotes Email addresses: francesco.masillo{at}tu-dortmund.de (Francesco Masillo), zsuzsanna.liptak{at}univr.it (Zsuzsanna Lipták) This revision includes the following updates: (1) We updated the references for the applications of Prefix-Free Parsing (PFP) in the Introduction. (2) We expanded our discussion on our applications of PFP and included previous results to provide a more comprehensive overview of our findings. (3) We performed a new experiment to confirm the correlation of our measure π and r. (4) We moved most of the supplementary material into the main body of the article and included a link to the raw results.

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

⚙ Ask this paper AI returns verbatim quotes from the full text · source: oa-doi-fallback ⓘ

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00