Open Source Large Language Models in Action: A Bioinformatics Chatbot for PRIDE database

Jingwen Bai
Selvakumar Kamatchinathan
Deepti J Kundu
Chakradhar Bandla
Juan Antonio Vizcaino
Yasset Perez Riverol

Abstract

We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database, the most popular proteomics data repository. Our system uses two advanced Large Language Models (LLMs), llama2-13b and chatglm2-6b, and includes a web service API (Application Programming Interface), a web interface, and sophisticated algorithms. We have developed a novel approach to constructing the vector-based representations that underpin the LLM responses, featuring a curated version and a comprehensive database of relevant links and paragraphs for each generated response. An important part of the framework is a benchmark component based on an Elo-ranking system, which provides a scalable method for evaluating the performance not only of llama2-13b and chatglm2-6b but also of any other available or future open-source LLM. Throughout the benchmarking process, the PRIDE documentation for external users was refined to improve its clarity and its effectiveness in addressing user queries. Although our infrastructure is demonstrated in the context of the PRIDE database, its modular and adaptable design makes it a valuable tool for improving user experiences across a range of bioinformatics and proteomics tools and resources, among other domains. Together, the integration of advanced LLMs, the vector-based construction, the benchmarking framework, and the optimized documentation form a robust and transferable chatbot assistant infrastructure.
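To make the Elo-ranking idea concrete, the following is a minimal sketch of how pairwise comparisons between model answers could be turned into ratings. It uses the standard Elo update rule. The abstract does not state the K-factor, the starting rating, or how each comparison is judged, so the values `k=32.0` and `initial=1000.0`, the function names, and the placeholder model names are all assumptions made for illustration, not the authors' implementation. The comparison outcomes are invented inputs, not benchmark results.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32.0 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_models(comparisons: Iterable[Tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Fold a sequence of (model_a, model_b, score_a) comparisons into ratings.

    Models not seen before start at the assumed initial rating.
    """
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Placeholder outcomes for illustration only; not real benchmark data.
    placeholder = [
        ("model_a", "model_b", 1.0),
        ("model_a", "model_b", 0.5),
        ("model_b", "model_a", 1.0),
    ]
    ranked = sorted(rank_models(placeholder).items(), key=lambda kv: -kv[1])
    for name, rating in ranked:
        print(f"{name}: {rating:.1f}")
```

Because each update only needs the two models involved in a comparison, a newly added model can enter the ranking by being compared against existing ones, which is one plausible reading of why the abstract describes the method as scalable.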

Version published to 10.22541/au.171025539.92037103/v1
Mar 12, 2024

Related articles

Best Practices for Using Large Language Models at Scale

This article has 5 authors:
1. Bhargavee Kannikanti
2. Arjun Coimbatore Nagarasan
3. Alberto Rosas
4. Sriram Kothandaraman
5. Sravan Kumar Kannuri
This article has no evaluations
Latest version Dec 12, 2025

topSEARCH: a Comprehensive Tool for the Retrieval and Analysis of Multi-Type Online Resources

This article has 6 authors:
1. Ander Cejudo
2. Yone Tellechea
3. Teresa García-Navarro
4. Amaia Calvo
5. Garazi Artola
6. Nekane Larburu
This article has no evaluations
Latest version Jan 20, 2026

ReviewAid: An Open-Source Tool for Efficient PICO-Based Screening and Data Extraction in Systematic Reviews

This article has 2 authors:
1. Vihaan Sahu
2. Mohith Balakrishnan
This article has no evaluations
Latest version Jan 5, 2026
