pynnotate: a flexible tool for retrieving and processing GenBank data in molecular evolution research and education
Abstract
Pynnotate is a Python-based tool for the automated retrieval, parsing, and extraction of annotated gene sequences from GenBank records. It addresses common challenges researchers face when working with GenBank data, including inconsistent gene nomenclature, redundant sequences, and the need for standardised gene extraction across multiple taxa. Pynnotate offers both a graphical user interface and a command-line interface, making it accessible to users with varying levels of bioinformatics experience. Sequences can be retrieved from manually defined accession numbers or from NCBI query terms, and three filtering modes are available: unconstrained (all sequences), strict (one sequence per species, prioritising gene completeness), and flexible (multiple sequences per species when they contribute different genes). Key features include synonym resolution for gene names, customisable sequence headers, metadata tracking, and automated gene extraction into separate files. Built-in dictionaries cover animal and plant mitochondrial DNA, chloroplast DNA, and ribosomal DNA, and users can also supply custom synonym dictionaries. The tool generates structured output, including FASTA files, metadata matrices, and detailed logs, which facilitates integration with downstream analyses. Designed for speed and scalability, pynnotate handles large datasets efficiently, allowing quick retrieval and extraction of annotated sequences across multiple taxa. Pynnotate is also a valuable resource for both research and education, particularly for educators conducting bioinformatics analyses with students who have limited command-line experience.
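To make the synonym resolution and the three filtering modes more concrete, the following is a minimal Python sketch of how such logic could work. It is not pynnotate's actual code or API: the function names (`canonical_gene`, `filter_records`), the record layout, the small synonym dictionary, and the accession labels are all hypothetical, and "gene completeness" is assumed here to mean the number of recognised genes annotated on a record.

```python
from collections import defaultdict

# Hypothetical synonym dictionary: canonical gene name -> known variants.
# The entries are illustrative only, not pynnotate's built-in dictionaries.
SYNONYMS = {
    "COI": {"coi", "cox1", "co1", "cytochrome c oxidase subunit i"},
    "CYTB": {"cytb", "cob", "cyt b", "cytochrome b"},
    "16S": {"16s", "16s rrna", "rrnl"},
}

# Reverse lookup: lower-cased variant -> canonical name.
LOOKUP = {v: canon for canon, variants in SYNONYMS.items() for v in variants}


def canonical_gene(name):
    """Return the canonical name for an annotated gene, or None if unknown."""
    return LOOKUP.get(name.strip().lower())


def filter_records(records, mode="unconstrained"):
    """Filter records (dicts with 'accession', 'species', 'genes') by mode.

    unconstrained: keep every record.
    strict: keep one record per species, the one with the most recognised genes.
    flexible: keep additional records per species only if they add new genes.
    """
    if mode not in {"unconstrained", "strict", "flexible"}:
        raise ValueError(f"unknown mode: {mode!r}")

    # Resolve annotated gene names to canonical names, dropping unknown ones.
    resolved = []
    for rec in records:
        genes = {canonical_gene(g) for g in rec["genes"]} - {None}
        resolved.append({**rec, "genes": genes})
    if mode == "unconstrained":
        return resolved

    by_species = defaultdict(list)
    for rec in resolved:
        by_species[rec["species"]].append(rec)

    kept = []
    for recs in by_species.values():
        # Most gene-rich record first; the stable sort keeps input order on ties.
        ranked = sorted(recs, key=lambda r: len(r["genes"]), reverse=True)
        if mode == "strict":
            kept.append(ranked[0])
        else:  # flexible
            seen = set()
            for rec in ranked:
                if rec["genes"] - seen:  # contributes at least one new gene
                    kept.append(rec)
                    seen |= rec["genes"]
    return kept


# Placeholder records; the accession labels are not real GenBank accessions.
records = [
    {"accession": "ACC1", "species": "Species one", "genes": ["COX1", "cob"]},
    {"accession": "ACC2", "species": "Species one", "genes": ["CO1"]},
    {"accession": "ACC3", "species": "Species one", "genes": ["16S rRNA"]},
    {"accession": "ACC4", "species": "Species two", "genes": ["cytochrome b"]},
]

for mode in ("unconstrained", "strict", "flexible"):
    print(mode, [r["accession"] for r in filter_records(records, mode)])
```

In this sketch, the strict mode keeps a single record per species, while the flexible mode additionally keeps any record that contributes a gene not already covered for that species. A real implementation would also need a tie-breaking rule (for example sequence length), which the abstract does not specify.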