Domain-Specific Embedding Models for Hydrology and Environmental Sciences: Enhancing Semantic Retrieval and Question Answering in RAG Pipelines

Abstract

Large Language Models (LLMs) have shown strong performance across natural language processing tasks, yet their general-purpose embeddings often fall short in domains with specialized terminology and complex syntax, such as hydrology and environmental science. This study introduces HydroEmbed, a suite of open-source sentence embedding models fine-tuned for four QA formats: multiple-choice (MCQ), true/false (TF), fill-in-the-blank (FITB), and open-ended questions. The models were trained on the HydroLLM Benchmark, a domain-aligned dataset combining textbook and scientific article content. Fine-tuning used MultipleNegativesRankingLoss, CosineSimilarityLoss, and TripletLoss, with the loss selected to match each task's semantic structure. Evaluation was conducted on a held-out set of 400 textbook-derived QA pairs, using top-k similarity-based context retrieval and GPT-4o-mini for answer generation. Results show that the fine-tuned models match or exceed the performance of strong proprietary and open-source baselines, particularly on FITB and open-ended tasks, where domain alignment significantly improves semantic precision. The MCQ/TF model also achieved competitive accuracy. These findings highlight the value of task- and domain-specific embedding models for building robust retrieval-augmented generation (RAG) pipelines and intelligent QA systems in scientific domains. This work represents a foundational step toward HydroLLM, a domain-specialized language model ecosystem for environmental sciences.
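
The top-k similarity-based context retrieval mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `top_k`, the choice of cosine similarity as the scoring function, and the three-dimensional toy vectors are all assumptions made for illustration. In a real pipeline the vectors would come from an embedding model such as one of the HydroEmbed models, and the retrieved passages would be passed to the answer-generation model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, passage_vecs, k=2):
    """Return the indices of the k passages most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Hypothetical toy embeddings standing in for encoded textbook passages.
passages = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.8, 0.2, 0.1],
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a question

top_indices = top_k(query, passages, k=2)
print(top_indices)
```

With these toy vectors, passages 0 and 2 point in nearly the same direction as the query, so they are the two selected as context.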
