pynnotate: a flexible tool for retrieving and processing GenBank data in molecular evolution research and education

doi:10.32942/x2294v

2026 · doi:10.32942/x2294v

preprint OA: closed

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Pynnotate is a Python-based tool designed for automated retrieval, parsing, and extraction of annotated gene sequences from GenBank records. The tool addresses the common challenges researchers face when working with GenBank data, including inconsistent gene nomenclature, redundant sequences, and the need for standardised gene extraction across multiple taxa. Pynnotate operates through both a graphical user interface and a command-line interface, making it accessible to users with varying levels of bioinformatics experience. The tool supports flexible sequence retrieval through manually defined accession numbers or NCBI query terms, and offers three distinct filtering modes: unconstrained (all sequences), strict (one sequence per species, prioritising gene completeness), and flexible (multiple sequences per species when they contribute different genes). Key features include synonym resolution for gene names, customisable sequence headers, metadata tracking, and automated gene extraction into separate files. Built-in dictionaries cover animal and plant mitochondrial DNA, chloroplast DNA, and ribosomal DNA, and users can also provide custom synonym dictionaries. The tool generates structured output including FASTA files, metadata matrices, and detailed logs, facilitating integration with downstream analyses. Designed for speed and scalability, pynnotate efficiently handles large datasets, allowing quick retrieval and extraction of annotated sequences across multiple taxa. Finally, pynnotate serves as a valuable resource for both research applications and educational settings, particularly benefiting educators conducting bioinformatics analyses with students who have limited command-line experience.

DOI: https://doi.org/10.32942/X2294V
Subjects: Bioinformatics, Ecology and Evolutionary Biology, Evolution
Keywords: bioinformatics, comparative genomics, feature extraction, molecular evolution, phylogenetics, Python, sequence annotation
Published: 2026-02-26 10:24
Last Updated: 2026-02-26 10:24
License: CC BY Attribution 4.0 International
Conflict of interest statement: None.
Data and Code Availability Statement: The ‘pynnotate’ public repository is available at https://github.com/fernandacaron/pynnotate.
Language: English
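
The abstract describes synonym resolution and three filtering modes without saying how they are implemented. The sketch below is one possible reading of that description, not pynnotate's actual code or API. The record layout, the `SYNONYMS` table, the function names, and the placeholder accessions (`A1`, `B1`, ...) are all assumptions made for illustration. It treats "gene completeness" as the number of recognised genes in a record, which is also an assumption.

```python
from collections import defaultdict

# Hypothetical synonym table: canonical name -> lower-case aliases.
# pynnotate ships its own dictionaries; this one is illustrative only.
SYNONYMS = {
    "COI": {"coi", "cox1", "co1"},
    "CYTB": {"cytb", "cob", "cyt b"},
    "ND2": {"nd2", "nad2"},
}

# Invert once so every alias points at its canonical name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in SYNONYMS.items()
    for alias in aliases
}

MODES = ("unconstrained", "strict", "flexible")


def canonical_gene(name):
    """Return the canonical gene name, or None for an unknown label."""
    return ALIAS_TO_CANONICAL.get(name.strip().lower())


def normalise(record):
    """Copy a record, replacing raw gene labels with canonical names."""
    genes = {canonical_gene(g) for g in record["genes"]} - {None}
    return {**record, "genes": genes}


def filter_records(records, mode="unconstrained"):
    """Apply one of the three filtering modes named in the abstract."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    records = [normalise(r) for r in records]
    if mode == "unconstrained":
        return records  # keep every sequence

    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append(rec)

    kept = []
    for species_records in by_species.values():
        # Most genes first; sorted() is stable, so ties keep input order.
        ranked = sorted(
            species_records, key=lambda r: len(r["genes"]), reverse=True
        )
        if mode == "strict":
            kept.append(ranked[0])  # one record per species
        else:  # flexible: keep a record only if it adds a new gene
            covered = set()
            for rec in ranked:
                if rec["genes"] - covered:
                    kept.append(rec)
                    covered |= rec["genes"]
    return kept


# Placeholder records; accessions and species names are made up.
records = [
    {"accession": "A1", "species": "Species one", "genes": {"cox1", "cob"}},
    {"accession": "A2", "species": "Species one", "genes": {"CO1"}},
    {"accession": "A3", "species": "Species one", "genes": {"NAD2"}},
    {"accession": "B1", "species": "Species two", "genes": {"cytb"}},
]

for mode in MODES:
    print(mode, [r["accession"] for r in filter_records(records, mode)])
```

With these placeholder records, strict mode keeps `A1` and `B1`, while flexible mode also keeps `A3` because it contributes a gene (`ND2`) that `A1` lacks. `A2` is dropped in both modes because its only gene is already covered by `A1`.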
