Squeakuences: a portable tool for formatting ‘squeaky-clean’ sequences to eliminate bioinformatic software incompatibilities

Linnea E Lane
Ashley N Doerfler
Luna A L’Argent
Emily Touchette
Bronson Mills
Jessika Bryant
Savannah N Roller
Evan S Forsythe


Abstract

Computational analysis of biological sequences is the cornerstone of modern bioinformatics research. Complex processing and interpretation of data often entails multi-step workflows. The specific requirements and limitations of individual applications can require laborious reformatting and piecemeal ‘data-wrangling’ to produce a satisfactory input for each step in a pipeline. We present Squeakuences, a command-line tool developed to simplify and automate FASTA file preparation for applications such as phylogenetics, gene annotation, and genome analysis. Implemented in a lightweight Python script, Squeakuences identifies and removes potentially problematic elements in sequence identifiers, such as non-alphanumeric characters, white space, and excessive character count. Squeakuences outputs a new clean version of the sequence file for analysis alongside metadata files to track changes. The user can customize Squeakuences’ behavior using optional arguments to meet individual processing and formatting requirements. We tested the performance of Squeakuences on molecular data from the human reference genome and found that runtime correlates with the number of sequences processed but not with file size. We expect Squeakuences to save time and manual effort when analyzing sequence data. Squeakuences code is freely available at https://github.com/EvanForsythe/Squeakuences.
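To make the kind of identifier cleaning described in the abstract concrete, the following is a minimal illustrative sketch in Python. It is not the Squeakuences implementation: the function names `clean_identifier` and `clean_fasta_lines`, the `max_length` parameter, and the choice to replace white space with underscores are all assumptions made for illustration only.

```python
import re


def clean_identifier(header, max_length=50):
    """Return a cleaned copy of a FASTA header (without the leading '>').

    Hypothetical helper: the rules below are illustrative assumptions,
    not the rules Squeakuences itself applies.
    """
    # Collapse each run of white space into a single underscore.
    cleaned = re.sub(r"\s+", "_", header.strip())
    # Drop every character that is not a letter, digit, or underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", cleaned)
    # Truncate identifiers that exceed the allowed character count.
    return cleaned[:max_length]


def clean_fasta_lines(lines, max_length=50):
    """Clean header lines only; return (cleaned_lines, changes).

    `changes` is a list of (original, cleaned) identifier pairs, a simple
    stand-in for a metadata file that tracks what was modified.
    """
    cleaned_lines = []
    changes = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:]
            new = clean_identifier(original, max_length)
            changes.append((original, new))
            cleaned_lines.append(">" + new)
        else:
            # Sequence lines are passed through unchanged.
            cleaned_lines.append(line)
    return cleaned_lines, changes


example = [">sp|P12345| Homo sapiens (human) protein-1", "ATGC"]
cleaned, changes = clean_fasta_lines(example, max_length=20)
print(cleaned[0])
```

Note that truncation in a sketch like this can map two distinct identifiers to the same cleaned name, so a practical tool would also need some way to keep identifiers unique.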

Version published to 10.1101/2024.11.01.621607v3 on bioRxiv: Nov 8, 2024
Version published to 10.1101/2024.11.01.621607v2 on bioRxiv: Nov 5, 2024
Version published to 10.1101/2024.11.01.621607v1 on bioRxiv: Nov 3, 2024
