Using large language models to address the bottleneck of georeferencing natural history collections

Yuyang Xie
Daniel Park
Miranda Sinnott-Armstrong
Joyce Ho
Tianlong Chen
Alan Weakley
Luis Aguirre
Jaein Choi
Marisa Laitinen
Nicholas Steeves
Chingyan Huang
Ran Xu
Xiao Feng

Abstract

Natural history collections are fundamental to biodiversity research. Their broad use depends on digitization, especially georeferencing, which translates textual locality descriptions into geographic coordinates. However, traditional georeferencing approaches are labor-intensive and costly, making georeferencing a major bottleneck in the digitization process that prevents the use of millions of specimens worldwide. This study investigated the potential of large language models (LLMs) to facilitate georeferencing. We used LLMs from OpenAI and DeepSeek to georeference 5,000 vascular plant specimen records with known coordinates, and compared the results against those of GEOLocate (a widely used georeferencing tool) and manual georeferencing. We found that the best-performing LLMs (e.g., gpt-4o) outperformed specialized tools like GEOLocate in spatial applicability and demonstrated near-human-level accuracy, with a median georeferencing error of <10 km. LLM-based georeferencing was also fast (<1 s per record) and affordable ($0.10 per 100 records), and thus presents a cost-effective approach. LLMs may not fully replace human curation in the short term, but they can be incorporated into current workflows to greatly increase the efficiency of georeferencing. Future advances in LLMs may revolutionize the digitization of natural history collections.
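The abstract reports a median georeferencing error but does not say how the error is computed. As a minimal sketch, assuming the error is the great-circle distance between a predicted coordinate and the known reference coordinate, the metric could be calculated as follows. The function names and the choice of the haversine formula are illustrative assumptions, not the authors' stated method.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_error_km(predicted, reference):
    """Median distance in km between paired (lat, lon) predictions and references."""
    return median(
        haversine_km(p[0], p[1], r[0], r[1]) for p, r in zip(predicted, reference)
    )


# Hypothetical example: three predictions compared against known coordinates.
predicted = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
reference = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
error_km = median_error_km(predicted, reference)
```

Under this assumption, a set of records meets the reported accuracy level when `median_error_km` returns a value below 10. At the reported rate of $0.10 per 100 records, the 5,000 records in the study would correspond to roughly $5 in model usage.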

Version published to 10.32942/x2134g
May 2, 2025

Related articles

Mapping 25,000 Cultural Heritage Sites with GIS and NLP: A Data-Driven Framework for Spatiotemporal Pattern Recognition
Zheng Xu, Wei Ren
Latest version Dec 4, 2025

RECODE - Relational Ecological COrpus for Data Extraction
Vasco Branco, Lidia Pivovarova, Kari-E Lintulaakso, Luís Correia, Lenka Baranovičová, Iiris Lahin, Francisco Dias, David Filipe, Pedro Cardoso
Latest version Nov 11, 2025

Geocoding historical census data for Stockholm, 1878-1950
Martin Önnerfors
Latest version Oct 23, 2025
