PlasRAG: comprehensive plasmid characterization and retrieval through sequence-text alignment

Abstract

Plasmids play a pivotal role in the emergence of multidrug-resistant and pathogenic bacteria, posing significant clinical challenges. Metagenomic sequencing combined with advanced bioinformatics tools has outpaced traditional wet-lab methods, leading to the discovery of millions of plasmids from diverse origins. However, the rapidly growing number of unannotated plasmids calls for comprehensive characterization of their multi-faceted properties, such as risk indices and ecological contexts, to support downstream applications. This goal is hindered by several challenges: the limited availability of plasmid characterization tools, the inadequacy of alignment-based methods for novel plasmids, and inconsistencies in manual annotations across plasmid reference databases. To address these issues, we present PlasRAG, a novel tool that integrates two key modules: multi-faceted property characterization of query plasmids, and plasmid DNA retrieval based on textual queries. At its core, PlasRAG employs a bidirectional multi-modal information retrieval model that aligns DNA sequences with textual data, overcoming the limitations of traditional approaches. Within the characterization module, PlasRAG uses the retrieval-augmented generation (RAG) framework and the Llama-3 large language model (LLM) to provide accurate, context-aware responses to user queries. Rigorous experiments demonstrate that PlasRAG delivers robust performance and enhanced analytical capabilities, underscoring the effectiveness of its architectural design. In particular, experiments on a real-world plasmid dataset curated from diverse human gut metagenomes suggest that plasmids with a broader host range and encoded antibiotic resistance genes (ARGs) tend to spread more extensively. The source code of PlasRAG is available at https://github.com/Orin-beep/PlasRAG.
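
The general idea of bidirectional sequence-text retrieval feeding a RAG prompt can be illustrated with a minimal Python sketch. This is not PlasRAG's implementation. It assumes that plasmid sequences and text descriptions have already been embedded into a shared vector space. The three-dimensional vectors, the plasmid names, the description strings, and the helpers `cosine`, `top_k`, and `build_prompt` are all hypothetical and chosen only for illustration. In a real system the embeddings would come from trained encoders, and the prompt would be passed to an LLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=1):
    """Return the k keys of `index` whose vectors are most similar to query_vec."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical toy embeddings in a shared space (hand-made, not learned).
PLASMID_EMB = {
    "plasmid_A": [0.9, 0.1, 0.0],
    "plasmid_B": [0.1, 0.9, 0.2],
}
TEXT_EMB = {
    "carries antibiotic resistance genes": [0.8, 0.2, 0.1],
    "narrow host range, found in soil": [0.0, 0.7, 0.6],
}

def build_prompt(plasmid_id, question, k=1):
    """Retrieve the k closest descriptions and assemble a RAG-style prompt."""
    facts = top_k(PLASMID_EMB[plasmid_id], TEXT_EMB, k)
    context = "\n".join(f"- {fact}" for fact in facts)
    return f"Context:\n{context}\n\nQuestion: {question}"

if __name__ == "__main__":
    # Direction 1: text query -> plasmid retrieval.
    print(top_k(TEXT_EMB["carries antibiotic resistance genes"], PLASMID_EMB))
    # Direction 2: plasmid -> text retrieval, used as context for generation.
    print(build_prompt("plasmid_A", "Does this plasmid carry ARGs?"))
```

Because both modalities share one embedding space, the same `top_k` routine serves both directions: only the query vector and the index being searched are swapped.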
