Squeakuences: a portable tool for formatting ‘squeaky-clean’ sequences to eliminate bioinformatic software incompatibilities

Abstract

Computational analysis of biological sequences is the cornerstone of modern bioinformatics research. Complex processing and interpretation of data often entail multi-step workflows. The specific requirements and limitations of individual applications can require laborious reformatting and piecemeal ‘data-wrangling’ to produce a satisfactory input for each step in a pipeline. We present Squeakuences, a command line tool developed to simplify and automate FASTA file preparation for applications such as phylogenetics, gene annotation, and genome analysis. Implemented as a lightweight Python script, Squeakuences identifies and removes potentially problematic elements in sequence identifiers, such as non-alphanumeric characters, white space, and excessive character count. Squeakuences outputs a new, clean version of the sequence file for analysis, alongside metadata files that track the changes made. The user can customize Squeakuences’ behavior using optional arguments to meet individual processing and formatting requirements. We tested the performance of Squeakuences on molecular data from the human reference genome and found that runtime correlates with the number of sequences processed but not with file size. We expect Squeakuences to save time and manual effort when analyzing sequence data. The Squeakuences code is freely available at https://github.com/EvanForsythe/Squeakuences.
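
To make the kind of identifier cleaning described above concrete, the following is a minimal Python sketch. It is not the Squeakuences implementation: the function names `clean_identifier` and `clean_fasta`, the 30-character default limit, and the returned list of changes are all assumptions chosen for illustration. It strips every character that is not an ASCII letter or digit (which also removes white space), truncates long identifiers, and records each old and new identifier so the changes can be tracked.

```python
import re


def clean_identifier(header: str, max_len: int = 30) -> str:
    """Return a simplified identifier from a FASTA header (without '>').

    Hypothetical helper: the real tool's rules and defaults may differ.
    """
    # Remove anything that is not an ASCII letter or digit,
    # which also drops spaces, tabs, and punctuation.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", header)
    # Cap the identifier length (max_len is an assumed default).
    return cleaned[:max_len]


def clean_fasta(text: str, max_len: int = 30):
    """Clean every header in FASTA-formatted text.

    Returns the cleaned text and a list of (old_id, new_id) pairs.
    Sequence lines are passed through unchanged.
    """
    out_lines = []
    changes = []
    for line in text.splitlines():
        if line.startswith(">"):
            old_id = line[1:].strip()
            new_id = clean_identifier(old_id, max_len)
            changes.append((old_id, new_id))
            out_lines.append(">" + new_id)
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", changes


if __name__ == "__main__":
    example = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTN\n"
    cleaned_text, id_changes = clean_fasta(example, max_len=16)
    print(cleaned_text, end="")
    for old_id, new_id in id_changes:
        print(f"{old_id}\t{new_id}")
```

This sketch does not check whether truncation makes two identifiers identical; a production tool would likely need to handle such collisions.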
