What we talk about when we talk about species

doi:10.32942/x2jw76

2025 · doi:10.32942/x2jw76

preprint OA: closed

This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.

Genome annotation, alignment, and phylogenetics are at the center of most work in evolutionary genomics. These techniques function best when rooted in prior work. Genes are mined from new genomes using evidence from old gene models. These genomes are aligned to well-worn references to create matrices for tree reconstruction. And trees are often populated with well-characterized genomes to add context to the newly sequenced. Genome inference traces a line back to model organisms, yoking the analysis of new genomes to layers of previous knowledge. We instead highlight methods that use unannotated and unaligned sequence to understand the information diversity of sequence ensembles. Any set of genomes can comprise our sequence ensemble. In a pandemic context, a sequence ensemble might be clinically isolated strains from one day. In a systematic context, a sequence ensemble could be the pangenome available for a clade. The normal bioinformatics playbook would have us align. But we instead compress. A sequence ensemble that compresses easily contains lower information diversity. For pandemics, we can use curves of information diversity to trace genomic novelty and monitor selective sweeps in new strains. For systematics, we can calculate compressibility quickly across all known bacterial taxa, leveling the criteria for species across clades. If we tolerate data loss, we can go one step further and capture structural evolution as we compress. Our approach sacrifices a lot. We skip many of the products of modern bioinformatics like variation anchored to known genes or genome alignment to prescribed references or pangenome graphs. But we gain speed, breadth, and the ability to respond to novelty.

DOI: https://doi.org/10.32942/X2JW76
Subjects: Life Sciences
Keywords: compression, Information Theory, pangenomes
Published: 2025-08-11 16:00
Last Updated: 2025-09-16 09:41
License: CC-BY Attribution-NonCommercial 4.0 International
Conflict of interest statement: None
Data and Code Availability Statement: Data publicly available
Language: English
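
The abstract's central move, compressing a sequence ensemble instead of aligning it, can be illustrated with a minimal sketch. This is not the authors' method or code: it assumes a general-purpose compressor (Python's zlib) as a stand-in, and the names compression_ratio and diversity_curve, along with the toy data, are hypothetical. It shows only the qualitative claim that an ensemble which compresses easily carries lower information diversity.

```python
import random
import zlib


def compression_ratio(sequences):
    """Compressed size divided by raw size for an ensemble of sequences.

    Lower values mean the ensemble is more redundant, i.e. it has lower
    information diversity in the sense used in the abstract.
    Assumption: zlib stands in for whatever compressor the preprint uses.
    """
    raw = "\n".join(sequences).encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


def diversity_curve(sequences):
    """Compressed size in bytes as sequences are added one at a time.

    A novel sequence adds many bytes to the running total; a near-duplicate
    of something already in the ensemble adds few.
    """
    return [
        len(zlib.compress("\n".join(sequences[: i + 1]).encode("ascii"), 9))
        for i in range(len(sequences))
    ]


# Toy data (hypothetical, not from the preprint): random 2,000-base strings.
rng = random.Random(0)


def random_genome(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


base = random_genome(2000)
clonal = [base] * 8                                # eight identical "isolates"
diverse = [random_genome(2000) for _ in range(8)]  # eight unrelated sequences

print("clonal ratio: ", round(compression_ratio(clonal), 3))
print("diverse ratio:", round(compression_ratio(diverse), 3))
print("clonal curve: ", diversity_curve(clonal))
print("diverse curve:", diversity_curve(diverse))
```

The clonal ensemble should compress far better than the diverse one, and its curve should flatten after the first sequence. One caveat on the stand-in: zlib only finds repeats within a 32 KB window, so real genomes would need a compressor with long-range matching; the toy sequences are kept short for that reason.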
