The effectiveness of Large Language Models with RAG for auto-annotating phenotype descriptions

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Ontologies are highly prevalent in biology and medicine and are always evolving. Annotating biological text, such as observed phenotype descriptions, with ontology terms is a challenging and tedious task. The process of annotation requires a contextual understanding of the input text and of the ontological terms available. While text-mining tools are available to assist they are largely based on directly matching words and phrases and so lack understanding of the meaning of the query item and of the ontology term labels. Large Language Models (LLMs), however, excel at tasks that require semantic understanding of input text and therefore may provide an improvement for the auto-annotation of text with ontological terms. Here we describe a series of workflows incorporating OpenAI GPT’s capabilities to annotate Arabidopsis thaliana and forest tree phenotypic observations with ontology terms, aiming for results that resemble manually curated annotations. These workflows make use of an LLM to intelligently parse phenotypes into short concepts, followed by finding appropriate ontology terms via embedding vector similarity or via Retrieval-Augmented Generation (RAG). The RAG model is a state-of-the-art approach that augments conversational prompts to the LLM with context-specific data to empower it beyond its pre-trained parameter space. We show that the RAG produces the most accurate automated annotations that are often highly similar or identical to expert-curated annotations.

Short description

Large Language Models excel at tasks that require semantic understanding of text. Here we use that capability to auto-annotate plant phenotypes with ontological terms and compare to expert annotation.

Article activity feed