GFFx: A Rust-based suite of utilities for ultra-fast genomic feature extraction

preprint OA: closed CC-BY-NC-4.0
📄 Open PDF Full text JSON View at publisher
Full text 22,831 characters · extracted from oa-pdf · 7 sections · click to expand

Abstract

Genome annotations are becoming increasingly comprehensive due to the discovery of diverse regulatory elements and transcript variants. However, this improvement in annotation resolution poses major challenges for efficient querying, especially across large genomes and pangenomes. Existing tools often exhibit performance bottlenecks when handling large-scale genome annotation files, particularly for region-based queries and hierarchical model extraction. Here, we present GFFx, a Rust-based toolkit for ultra-fast and scalable genome annotation access. GFFx introduces a compact, model-aware indexing system inspired by binning strategies and leverages Rust’s strengths in execution speed, memory safety, and multithreading. It supports both feature- and region-based extraction with significant improvements in runtime and scalability over existing tools. Distributed via Cargo, GFFx provides a cross-platform command-line interface and a reusable library with a clean API, enabling seamless integration into custom pipelines. Benchmark

Results

demonstrate that GFFx offers substantial speedups and makes a practical, extensible solution for genome annotation workflows.

Keywords

GFF file; Genome Annotation; Rust Programming; Feature Extraction

Introduction

With the growing understanding of functional genome regions beyond conventional protein- coding genes, genome annotations are rapidly increasing in both complexity and volume. Large- scale efforts such as ENCODE [1], FANTOM [2], and Roadmap Epigenomics Program [3] have cataloged diverse noncoding elements—including enhancers, promoters, long non-coding RNAs (lncRNAs), and epigenetic marks—highlighting their roles in gene regulation, chromatin dynamics, and cellular identity. As novel regulatory elements, alternative isoforms, and lineage- or .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint tissue-specific transcripts continue to emerge, annotation datasets are expected to expand further [4]. The accumulation of such multilayered annotations, particularly across large genomes or pangenomes, poses growing challenges for storage, indexing, and efficient querying. However, existing tools often struggle to process ultra-large annotation files efficiently, particularly for region-based queries, hierarchical model extraction, or parallel execution. A scalable, high-performance solution optimized for such tasks is urgently needed. Rust, a modern systems programming language, offers high execution speed, memory safety, efficient multithreading, and cross-platform portability. These features have led to its increasing adoption in bioinformatics [5], as exemplified by Rust-Bio [6], Bigtools [7], Phylo-rs [8], and fibertools [9]. To address these challenges, we developed GFFx, a Rust-based toolkit for fast and scalable access to genome annotation files. GFFx fills a gap in current methods by enabling efficient indexing and querying of ultra-large General Feature Format (GFF) datasets. Designed as both a command-line tool and a reusable library, it can be integrated into larger pipelines and software systems. It also demonstrates Rust’s potential in computational biology by providing a robust, extensible foundation for high-performance annotation processing. Findings Performance benchmark in annotation indexing GFFx achieves high-performance efficiency through a modular indexing system anchored by two core indices, .prt and .gof, which capture feature hierarchical relationship and map annotation blocks to their byte-offsets for direct memory access, respectively. Complementary lightweight indices, including .fts, .a2f, .atn, .sqs, rit, and .rix, support subcommand-specific operations like feature extraction, attribute-based searches, and region queries with minimal I/O overhead (Fig. 1). .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint Figure 1. Architecture of the indexing system and subcommand interactions in GFFx. All index files are generated in advance from a GFF3 file (cream box) via the index module. While all subcommands (green boxes) have access to the complete set of indices, each subcommand loads only the subset relevant to its specific function. Core indices .gof and .prt (dark brown boxes) are universally required, whereas module-specific indices such as .fts, .a2f, .atn, .rit, rix, and .sqs (light brown boxes) are utilized only by specific subcommands as illustrated. Among commonly used GFF processing tools, only gffutils [10] preprocesses on GFF files by converting it into a SQLite database. We compared the runtime of index-building in GFFx and database-creating in gffutils (v0.13) using four representative GFF3 annotation datasets: the Homo sapiens genome (hg38), the Pungitius sinensis genome (ceob_ps_1.0), the Drosophila melanogaster genome (dm6), and the Arabidopsis thaliana genome (tair10.1), which contain 4 894 761, 652 095, 412 995 and 710 661 feature items, respectively. All benchmarks were performed on a dedicated compute node equipped with 2 × Intel(R) Xeon(R) Gold 6448H CPUs (32 cores/64 threads each), 1 TB DDR4 RAM, and dual Micron 7450 MTFDKCB960TFR NVMe SSDs (total capacity 1.92 TB). The results indicate that, despite its relatively complex indexing architecture, GFFx did not exhibit a significant increase in runtime. For GFF files of moderate size, GFFx took one to two minutes—approximately 1.44 to 2.17 times longer than gffutils. However, for the large hg38 GFF file, GFFx outperformed gffutils substantially, running 6.72 times faster (Supplementary Fig. S1). In addition, the total sizes of the index files generated by GFFx were remarkably smaller than those of database files produced by gffutils (Supplementary Table S1), underscoring another key advantage of GFFx. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint Benchmarking identifier-based feature extraction performance We benchmarked identifier-based feature extraction performance of GFFx against four existing tools: gffread (v0.12.8) [11], gffutils (v0.13) [10], bcbio-gff (v0.7.1) [12] and AGAT (v1.4.1) [13]. These benchmarks used the same four annotation GFF files as above, with 100 replicates per file. In each replicate, we randomly sampled 100,000 feature identifiers and extracted the corresponding entries. GFFx achieved median runtimes of 1.62 s, 0.41 s, 0.37 s and 0.38 s on hg38, ceob_ps_1.0, dm6 and tair10.1, respectively (Fig. 2a; Supplementary Table S2), corresponding to 12.33- to 80.27-fold speedups over the second fastest tool, gffread. Besides, GFFx required less memory than other tools except gffutils (Fig. 2b; Supplementary Table S2). Overall, GFFx achieves substantial speedups, with the speed increasing proportionally with the size and complexity of the annotation files, without incurring additional memory overhead. As genome assemblies become larger and the annotations grow more detailed, GFFx will continue to outpace other tools by an ever-widening margin. Figure 2. Comparison of identifier-based extraction performance among GFFx and other tools. (a) Median wall-clock time (log scale) for extracting 100 000 random 20-kbp intervals in dm6 (dark teal), ceob_ps_1.0 (cyan), tair10.1 (tan) and hg38 (pink) using GFFx, gffread, gffutils, BCBio, and AGAT. (b) Maximum resident set size (RSS, log scale), a measure of peak memory consumption, for each tool and dataset. Data represent the median of 100 replicate runs. Tools are ordered left-to-right by increasing median wall-clock time on hg38. Benchmarking region-based feature retrieval performance Subsequently, we benchmarked region-based retrieval performance of GFFx against four tools— gffutils, bcbio-gff, AGAT and bedtools (v2.31.1) [14]—substituting bedtools for gffread because gffread only handles single user-specified regions and does not accept BED files. Using the same four annotation GFF files with 100 replicates each, we generated BED4-format interval files containing 100,000 randomly sampled 20-kbp bins per replicate using the random command from .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint bedtools, which generates random genomic intervals. Among all tools, GFFx delivered the fastest region-based retrieval, with median runtimes of 0.46 s, 0.10 s, 0.18 s and 0.17 s on hg38, ce11_ps_1.0, dm6 and tair10.1, respectively (Fig. 3a; Supplementary Table S3). Excluding GFFx, bedtools was the next fastest, requiring 4.89–11.04 s (24.0–61.8-fold slower), while dedicated GFF processors were at least 201-fold slower. This performance gain of GFFx derives from its interval-tree algorithm, which reduces time complexity from O(N) to O(log N + k), where N represents the total number of intervals in a GFF file and k represents the number of overlapped intervals. Although the memory usage of GFFx is not always the lowest (Fig. 3b; Supplementary Tables S3), it remains under 1 GB across all tests, ensuring operability on standard personal computers without sacrificing speed. Figure 3. Comparison of region-based feature retrieval performance among GFFx and other tools. (a) Median wall-clock time (log scale) for extracting 100 000 random 20-kbp intervals in dm6 (dark teal), ceob_ps_1.0 (cyan), tair10.1 (tan) and hg38 (pink) using GFFx, bedtools, BCBio, gffutils and AGAT. (b) Maximum resident set size (RSS, log scale), a measure of peak memory consumption, for each tool and dataset. Data represent the median of 100 replicate runs. Tools are ordered left-to-right by increasing median wall-clock time on hg38.

Discussion

Here, we present GFFx, a Rust-based, modular, and high-performance toolkit for efficient processing and querying of ultra-large GFF3 genome annotation files. It addresses key limitations of existing tools through a compact, model-aware indexing system and by leveraging Rust's strengths in speed, memory safety, and multithreaded execution. Many widely used tools suffer from performance bottlenecks when processing large-scale annotations. For example, gffutils depends on relational databases, leading to long indexing times and high disk usage; AGAT and bcbio-gff offer broad functionality but are not optimized for fast querying; bedtools supports region-based queries but lacks model awareness; and gffread performs well only on small datasets and lacks parallel support. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint Region-based queries in GFFx are powered by an in-memory interval-tree index. Interval trees are a well-established data structure for efficiently storing and querying one-dimensional intervals that vary widely in length and often overlap or nest, making them an ideal fit for genome annotation data [15]. In an interval tree, each node represents a feature interval and tracks the maximum endpoint of its subtree. This pruning mechanism skips entire subtrees whose intervals lie outside the query region, avoiding full-file scans and enabling sublinear query times. Once features are identified, GFFx uses the .gof index, which maps feature IDs to byte offsets in the original GFF file, to retrieve annotation blocks directly, resulting in rapid end-to-end extraction even on large, complex datasets. Benchmark results show that GFFx significantly outperforms existing tools in both feature- and region-based extraction, offering large speedups while maintaining modest memory usage and strong parallel scalability. As genome annotations continue to grow in complexity and size, GFFx offers a practical and extensible foundation for future bioinformatics workflows. While robust for standard GFF3 files, the current implementation assumes well-formed input and does not yet support GTF or legacy GFF2 formats. Enhancing compatibility and fault tolerance— particularly for nonstandard annotations—remains an important area for development. Planned extensions include support for additional formats, distributed computing integration, and interactive search for large-scale databases. GFFx is distributed as a statically compiled binary for Linux, macOS, and Windows. It can also be used as a Rust library, allowing integration into custom pipelines and tools. Its modular architecture and clean API offer fine-grained access to core functions, making GFFx both performant and programmable. Full documentation is available at docs.rs/GFFx, and the GitHub repository includes user manuals, benchmarks, input data, and source code for complete reproducibility (https://github.com/Baohua-Chen/GFFx).

Methods

Architectural design of indexing system underpins GFFx performance GFFx was developed as a modular and high-performance command-line toolkit for processing large GFF files. Its efficiency is supported by a carefully engineered indexing system (Fig. 1). At the core of GFFx are two index files shared across all subcommands: .prt and .gof. The .prt index encodes the hierarchical relationships among annotated features and delineates annotation blocks as minimal, biologically coherent units, such as complete gene models or transcript structures. The .gof index maps each annotation block to its corresponding byte-offset range in the original GFF file, enabling direct memory-mapped access to specific regions without requiring full-file scanning or decompression. Together, these two indices provide the structural and positional backbone of GFFx, allowing fast and model-aware access to genome annotations with minimal input/output overhead. To minimize redundancy and reduce index file size, both .prt and .gof use .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint numeric feature identifiers assigned in order of appearance. The original string-form feature IDs are stored separately in the .fts file. In addition to the core indices, GFFx generates several auxiliary index files that support specific subcommands. The extract subcommand retrieves the full annotation block associated with a given feature and requires only the .fts index, which records all feature identifiers in order, together with the .prt and .gof files. For attribute-based queries, the .atn file stores all user- specified string-form identifiers found in the attribute field of the GFF file (such as “gene”, “Name”, or “symbol”), while the .a2f file maps each attribute value to its corresponding numeric feature ID. These two files are used by the search subcommand, which enables both exact and fuzzy attribute queries. The intersect subcommand uses an interval tree scheme. GFFx builds a .rit file containing all interval-tree nodes laid out sequentially and a companion .rix file that records offsets in .rit for each chromosome or scaffold, so that only the relevant subtree is loaded on demand. This reduces region-query time complexity from O(N) to O(log N), greatly speeding up lookups in large genomes. All indices are written in compact binary format and accessed on demand by each subcommand to minimize storage footprint and loading time. Efficient Runtime Strategies for Feature and Region-Based Extraction To achieve high-throughput querying from ultra-large GFF3 files, GFFx incorporates several performance-oriented design strategies beyond its indexing system. All subcommands operate directly on memory-mapped representations of the original GFF file using the memmap2 library. This eliminates the need for repeated I/O or line-by-line parsing by allowing byte-range access to annotation blocks through read-only mappings. Extracted regions or feature models are located via index lookups and retrieved efficiently by copying their byte slices directly from the memory- mapped buffer. To minimize redundant computation, GFFx leverages reference-counted shared memory to ensure that index structures such as .gof and .rit are loaded only once and reused across all operations. Output blocks are streamed directly to disk, avoiding large memory buffers, and the software assumes well-formed GFF3 input to reduce validation overhead. To ensure high-performance region-based feature extraction, GFFx leverages several optimizations provided by the Rust ecosystem, such as the use of “FxHashMap” for low-overhead hash-based mappings and “lexical_core” for converting ASCII byte sequences into integer coordinates with minimal latency. Additionally, input regions are pre-bucketed by chromosome and sorted by the start coordinates, ensuring each interval tree to be queried only with relevant regions, thereby reducing unnecessary computation and improving cache locality. Supplementary data .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint Supplementary data are available online. Conflict of interest None declared. Data availability The source code and a comprehensive user manual are available on GitHub: https://github.com/Baohua-Chen/GFFx. Benchmarking scripts and original results are also publicly available at: https://github.com/Baohua-Chen/GFFx_benchmarks. Funding This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 32300490) to D.W.

References

1. Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, Adrian J, et al.. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. Nature Publishing Group; 2020; doi: 10.1038/s41586-020-2493-4. 2. Forrest ARR, Kawaji H, Rehli M, Kenneth Baillie J, de Hoon MJL, Haberle V, et al.. A promoter- level mammalian expression atlas. Nature. Nature Publishing Group; 2014; doi: 10.1038/nature13182. 3. Satterlee JS, Chadwick LH, Tyson FL, McAlliste r K, Beaver J, Birnbaum L, et al.. The NIH Common Fund/Roadmap Epigenomics Program: Successes of a comprehensive consortium. Science Advances . American Association for the Advancement of Science; 2019; doi: 10.1126/sciadv.aaw6507. 4. Harrison PW, Amode MR, Austine-Orimoloye O, Azov AG, Barba M, Barnes I, et al.. Ensembl 2024. Nucleic Acids Research. 2024; doi: 10.1093/nar/gkad1049. 5. Jeffrey M. Perkel. Why scientists are turning to Rust. Nature. 588:1852020; 6. Köster J. Rust-Bio: a fast and safe bioinformatics library. Bioinformatics. 2016; doi: 10.1093/bioinformatics/btv573. 7. Huey JD, Abdennur N. Bigtools: a high-performance BigWig and BigBed library in Rust. Bioinformatics. 2024; doi: 10.1093/bioinformatics/btae350. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint 8. Vijendran S, Anderson T, Markin A, Eulenstein O. Phylo-rs: an extensible phylogenetic analysis library in rust. BMC Bioinformatics. 2025; doi: 10.1186/s12859-025-06234-w. 9. Jha A, Bohaczuk SC, Mao Y, Ranchalis J, Mallory BJ, Min AT, et al.. DNA-m6A calling and integrated long-read epigenetic and genetic analysis with fibertools. Genome Res . 2024; doi: 10.1101/gr.279095.124. 10. Dale R. Gffutils: GFF and GTF file manipulation and intercon-version. Github; 11. Pertea G, Pertea M. GFF Utilities: GffRead and GffCompare. F1000Res. 2020; doi: 10.12688/f1000research.23297.2. 12. Chapman B. bcbio-gff. GitHub. v0. 6.4. 13. Dainat J. Another Gtf/Gff Analysis Toolki t (AGAT): Resolve Interoperability Issues and Accomplish More with Your Annotations. Plant and Animal Genome XXIX Conference (January 8-12, 2022). PAG; 14. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; doi: 10.1093/bioinformatics/btq033. 15. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, et al.. Software for Computing and Annotating Genomic Ranges. PLOS Computational Biology . Public Library of Science; 2013; doi: 10.1371/journal.pcbi.1003118. .CC-BY-NC 4.0 International licensemade available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is The copyright holder for this preprintthis version posted August 12, 2025. ; https://doi.org/10.1101/2025.08.08.669426doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2025) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-23T02:00:01.238055+00:00
License: CC-BY-NC-4.0