Adapting bioinformatics data systems in the era of foundational models: leveraging retrieval-augmented generation and low-resource large language models



Abstract

We investigated how to leverage existing assets, such as curated life-science database catalogs, in the era of information retrieval powered by large language models (LLMs). Although LLMs exhibit unprecedented information-provision capabilities, they inherently suffer from hallucinations. Retrieval-augmented generation (RAG) is a promising approach to mitigating this issue. Furthermore, the analysis of personal data, such as human biological samples, must be conducted in an isolated environment, precluding the use of external Internet-based services. A system that integrates LLMs with RAG inside such an isolated environment would therefore significantly enhance research activities, including those involving personal data analysis. We evaluated the feasibility of using local LLMs and the effectiveness of RAG in reducing the incidence of hallucinations. Regarding the former, existing technologies such as Ollama suggest that local deployment is viable. For the latter, rigorous selection of data sources for RAG is essential; in particular, we found that establishing a well-structured repository of Japanese-language resources is crucial. Future challenges include optimizing the LLMs for this system and incorporating AI-agent functionality to enhance overall performance.
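The pipeline the abstract describes, retrieving passages from a curated catalog and grounding a locally hosted LLM's answer in them, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the catalog entries and the keyword-overlap retriever are hypothetical stand-ins (a real system would use an embedding index over the actual database catalog), while the HTTP call targets Ollama's standard local REST endpoint (`POST /api/generate` on port 11434), so no external Internet service is involved.

```python
import json
import urllib.request

# Hypothetical stand-in for a curated life-science database catalog.
CATALOG = [
    {"id": "db-001", "text": "A catalog entry describing a genome variation database."},
    {"id": "db-002", "text": "A catalog entry describing a protein structure repository."},
    {"id": "db-003", "text": "A catalog entry describing a metabolomics data archive."},
]


def retrieve(query, catalog, k=2):
    """Rank entries by naive word overlap with the query (toy retriever)."""
    qwords = set(query.lower().split())
    scored = sorted(
        catalog,
        key=lambda e: len(qwords & set(e["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query, passages):
    """Construct a grounded prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {p['text']}" for p in passages)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def ask_local_llm(prompt, model="llama3", host="http://localhost:11434"):
    """Query a locally running Ollama server; nothing leaves the isolated environment."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In use, `ask_local_llm(build_prompt(q, retrieve(q, CATALOG)))` would return an answer constrained to catalog content, which is the mechanism by which RAG mitigates hallucination.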
