Adapting bioinformatics data systems in the era of foundational models: leveraging retrieval-augmented generation and low-resource large language models



Abstract

We investigated how to leverage existing assets, such as curated life-science database catalogs, in the era of information retrieval powered by large language models (LLMs). Although LLMs exhibit unprecedented information-provision capabilities, they inherently suffer from hallucinations. Retrieval-augmented generation (RAG) is a promising approach to mitigating this issue. Furthermore, the analysis of personal data, such as human biological samples, must be conducted in an isolated environment, precluding the use of external Internet-based services. A system that integrates LLMs with RAG inside such an isolated environment would therefore significantly enhance research activities, including those involving personal data analysis. We evaluated the feasibility of using local LLMs and the effectiveness of RAG in reducing the incidence of hallucinations. Regarding the former, existing technologies such as Ollama suggest that local deployment is viable. For the latter, rigorous selection of data sources for RAG is essential; in particular, we found that establishing a well-structured repository of Japanese-language resources is crucial. Future challenges include optimizing the LLMs for this system and incorporating AI-agent functionality to enhance overall performance.
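The pipeline the abstract describes, retrieving passages from a curated catalog and grounding a locally hosted LLM's answer in them, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the catalog entries and the keyword-overlap retriever are hypothetical stand-ins (a real system would use an embedding index over the actual database catalog), while the HTTP call targets Ollama's standard local REST endpoint (`POST /api/generate` on port 11434), so no external Internet service is involved.

```python
import json
import urllib.request

# Hypothetical stand-in for a curated life-science database catalog.
CATALOG = [
    {"id": "db-001", "text": "A catalog entry describing a genome variation database."},
    {"id": "db-002", "text": "A catalog entry describing a protein structure repository."},
    {"id": "db-003", "text": "A catalog entry describing a metabolomics data archive."},
]


def retrieve(query, catalog, k=2):
    """Rank entries by naive word overlap with the query (toy retriever)."""
    qwords = set(query.lower().split())
    scored = sorted(
        catalog,
        key=lambda e: len(qwords & set(e["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(query, passages):
    """Construct a grounded prompt that restricts the model to the retrieved context."""
    context = "\n".join(f"- {p['text']}" for p in passages)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def ask_local_llm(prompt, model="llama3", host="http://localhost:11434"):
    """Query a locally running Ollama server; nothing leaves the isolated environment."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

In use, `ask_local_llm(build_prompt(q, retrieve(q, CATALOG)))` would return an answer constrained to catalog content, which is the mechanism by which RAG mitigates hallucination.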
