Extracting massive ecological data on state and interactions of species using large language models

Abstract

The contemporary ecological crisis calls for the integration and synthesis of ecological data describing the state, change and processes of ecological communities. Such synthesis, however, depends on vast amounts of scattered and often hard-to-extract information dispersed across hundreds of thousands of scientific papers, for example descriptions of species-specific interactions and trophic relationships. Recent advances in natural language processing (NLP), and in particular the emergence of large language models (LLMs), offer a novel and potentially revolutionary solution to this persistent challenge, creating for the first time the opportunity to access and extract virtually all data ever published. Here, we demonstrate the transformative potential of LLMs by extracting all types of biological interactions among species directly from a corpus of 83,910 scientific articles. Our approach extracted a network of 144,402 interactions between 36,471 taxa. Performance analysis shows that the model achieves high sensitivity (70.0%) and excellent precision (89.5%). Our approach shows that LLMs can carry out complex extraction tasks on key ecological data at very large scale, paving the way for a multitude of potential applications in ecology and beyond.
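For readers less familiar with the two performance measures quoted above, the following is a minimal Python sketch of how sensitivity (recall) and precision are conventionally computed from true positives, false positives and false negatives. The function names and the example counts are hypothetical and chosen only for illustration; they are not the validation counts from the study. The F1 score at the end is derived here from the two reported percentages and is not a figure stated in the abstract.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of true interactions that were extracted: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of extracted interactions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and sensitivity r."""
    return 2 * p * r / (p + r)


# Hypothetical validation counts, for illustration only (not from the study).
tp, fp, fn = 70, 10, 30
example_sensitivity = sensitivity(tp, fn)  # 70 of 100 true interactions found
example_precision = precision(tp, fp)      # 70 of 80 extracted interactions correct

# Derived from the percentages reported in the abstract (89.5% and 70.0%).
reported_f1 = f1_score(0.895, 0.700)

print(f"example sensitivity: {example_sensitivity:.3f}")
print(f"example precision:   {example_precision:.3f}")
print(f"F1 from reported values: {reported_f1:.3f}")
```

Read this way, the reported figures indicate that most extracted interactions are correct, while roughly three in ten interactions present in the source text are missed.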
