pLAST - a tool for rapid comparison and classification of bacterial plasmid sequences
Abstract
Motivation
The increasing number of fully sequenced bacterial plasmids being annotated and cataloged has prompted the development of computational tools for comparing and classifying them. Existing approaches typically compare full-length DNA sequences (e.g., Mash, BLASTn) or translated open reading frames (ORFs) (e.g., BLASTp, DIAMOND), with plasmid-level scores obtained by aggregating ORF-to-ORF similarities; however, they are either restricted to closely related plasmids or become computationally demanding in large-scale analyses.
Results
We describe pLAST (plasmid Language Analysis and Search Tool), a plasmid-search tool built using word2vec representations capturing both ORF-to-ORF similarity and gene neighborhood conservation. Benchmarks indicate that pLAST outperforms both DNA- and ORF-based methods in identifying functionally similar plasmids and, compared to the widely used Mash, it achieves 37%, 30%, and 13% improvements in detecting shared mating-pair formation (MPF) system, relaxase, and oriT types, respectively. This performance scales to datasets comprising thousands of sequences, as exemplified by clustering analyses of ∼56,000 plasmids, which reveal expected functional groups. Beyond global similarity, pLAST also returns per-ORF plasmid-plasmid alignments, enabling detection of shared functional modules.
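The idea of comparing plasmids through embedded ORF representations can be illustrated with a minimal sketch. It is not pLAST's implementation: the token names, the three-dimensional vectors, and the mean-pooling step are all assumptions made for illustration. In a word2vec-style model, each ORF family is a "word" and each plasmid is a "sentence", so tokens that occur in similar gene neighborhoods end up with similar vectors; here those vectors are simply hard-coded.

```python
import math

# Hypothetical toy embeddings for ORF-family tokens.
# Real vectors would be learned by a word2vec-style model; these are made up.
EMBEDDINGS = {
    "relaxase_A": [0.9, 0.1, 0.0],
    "t4cp_A": [0.8, 0.2, 0.1],
    "repA": [0.1, 0.9, 0.0],
    "parA": [0.0, 0.8, 0.3],
    "blaTEM": [0.1, 0.0, 0.9],
}


def plasmid_vector(tokens):
    """Mean-pool token vectors into one plasmid vector (assumed pooling scheme)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("plasmid contains no known ORF tokens")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Each plasmid is an ordered list of ORF-family tokens (hypothetical examples).
plasmid_a = ["relaxase_A", "t4cp_A", "repA"]
plasmid_b = ["relaxase_A", "t4cp_A", "parA"]
plasmid_c = ["blaTEM"]

sim_ab = cosine(plasmid_vector(plasmid_a), plasmid_vector(plasmid_b))
sim_ac = cosine(plasmid_vector(plasmid_a), plasmid_vector(plasmid_c))
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

In this toy setup, plasmids a and b share conjugation-related tokens and differ only in one maintenance gene, so they score as more similar to each other than a does to c. Mean pooling discards gene order, so any neighborhood signal would have to come from how the token vectors were trained.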
Availability and implementation
pLAST is freely accessible as a web server at https://plast.lbs.cent.uw.edu.pl/ and available as a Python module along with a precomputed database at https://github.com/labstructbioinf/pLAST for customized analysis.