Domain-Specific Embedding Models for Hydrology and Environmental Sciences: Enhancing Semantic Retrieval and Question Answering in RAG Pipelines

Ramteja Sajja
Yusuf Sermet
Ibrahim Demir

Abstract

Large Language Models (LLMs) perform strongly across natural language processing tasks, yet their general-purpose embeddings often fall short in domains with specialized terminology and complex syntax, such as hydrology and environmental science. This study introduces HydroEmbed, a suite of open-source sentence embedding models fine-tuned for four question-answering (QA) formats: multiple-choice (MCQ), true/false (TF), fill-in-the-blank (FITB), and open-ended questions. The models were trained on the HydroLLM Benchmark, a domain-aligned dataset combining textbook and scientific article content. Three fine-tuning losses were used (MultipleNegativesRankingLoss, CosineSimilarityLoss, and TripletLoss), each selected to match the semantic structure of its task. Evaluation was conducted on a held-out set of 400 textbook-derived QA pairs, using top-k similarity-based context retrieval and GPT-4o-mini for answer generation. Results show that the fine-tuned models match or exceed the performance of strong proprietary and open-source baselines, particularly on FITB and open-ended tasks, where domain alignment significantly improves semantic precision. The MCQ/TF model also achieved competitive accuracy. These findings highlight the value of task- and domain-specific embedding models for building robust retrieval-augmented generation (RAG) pipelines and intelligent QA systems in scientific domains. This work represents a foundational step toward HydroLLM, a domain-specialized language model ecosystem for environmental sciences.
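
The top-k similarity-based context retrieval mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cosine` and `top_k`, the three example passages, and the 3-dimensional vectors are all hypothetical stand-ins. In a real pipeline the vectors would come from an embedding model, and the retrieved passages would be passed to the answer-generation model as context.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k passages most similar to the query."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Hypothetical passages and toy vectors, invented for illustration only.
passages = [
    "Baseflow is the portion of streamflow sustained by groundwater.",
    "Evapotranspiration combines evaporation and plant transpiration.",
    "A hydrograph plots discharge against time.",
]
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.6, 0.0, 0.8],
]
query_vec = [1.0, 0.0, 0.2]  # stand-in for an embedded question

hits = top_k(query_vec, passage_vecs, k=2)
# The retrieved passages would be concatenated into the prompt as context.
context = [passages[idx] for idx, _ in hits]
```

A fine-tuned embedding model changes only how the vectors are produced; the ranking step stays the same, which is why retrieval quality depends so directly on how well the embeddings capture domain terminology.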

Version published to 10.31223/x5dq71
Jul 13, 2025

Related articles

Towards HydroLLM: Building a Domain-Specific Language Model for Hydrology

This article has 3 authors:
1. Dilara Kizilkaya
2. Yusuf Sermet
3. Ibrahim Demir
This article has no evaluations
Latest version Jul 14, 2025

An Intelligent-Aware Transformer with Domain Adaptation and Contextual Reasoning for Question Answering

This article has 4 authors:
1. Jianyang Zhuo
2. Yuchen Han
3. Hairu Wen
4. Kejian Tong
This article has no evaluations
Latest version Jun 16, 2025

DSA-GNAS: Graph Neural Architecture Search with Deep Semantic Adaption of Large Language Models

This article has 5 authors:
1. Siyang Xiao
2. Jiamin Chen
3. Zhenpeng Wu
4. Shuqing Wu
5. Jianliang Gao
This article has no evaluations
Latest version Jun 13, 2025
