A Novel Multi-Layer Semantic Chunking and Embedding Dimension Transformation Techniques for Enhanced Retrieval Augmented Generation

Luyen Nguyen Tien
Binh Hoang Tieu

Abstract

Retrieval-Augmented Generation (RAG) systems combine information retrieval with generative models to provide accurate responses, but existing frameworks face limitations in semantic chunking flexibility and embedding compatibility across models. This paper introduces MERCED RAG, a novel framework addressing these challenges through two key innovations: (1) multi-layer semantic chunking with HDBSCAN/Agglomerative clustering algorithms and intelligent outlier handling strategies, and (2) comprehensive embedding dimension transformation techniques including DCT-based upsampling, orthogonal projection, weighted redistribution, and PCA reduction for cross-model interoperability. Our segmentation approach integrates semantic clustering with advanced algorithms and performs comparisons with traditional token-based methods, achieving 98.9% faithfulness and 100% answer relevancy. Our dimension transformation suite enables seamless integration of diverse embedding models while preserving 87.4% semantic similarity. Experimental evaluation on Vietnamese Undergraduate Training Regulations (VUTR) and Narrative Comprehension (NarrativeQA) datasets demonstrates significant improvements: overall scores of 1.027 on VUTR (+3.0% over semantic baseline, +5.2% over traditional baseline) and 0.760 on NarrativeQA (+2.6% over semantic baseline, +4.7% over traditional baseline). The NDCG score improves from 2.889 to 2.948 (+2.0%), establishing new benchmarks for RAG systems and validating our integrated approach to semantic chunking and cross-model compatibility.
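
The abstract names DCT-based upsampling and PCA reduction but does not give their details. As a rough illustration only, the sketch below shows one common way such transforms could be written with NumPy: zero-padding orthonormal DCT coefficients to raise the dimension, and projecting onto the top principal components to lower it. The function names (`dct_matrix`, `dct_upsample`, `pca_reduce`) and the random test embeddings are hypothetical and are not taken from the paper; the authors' actual formulation may differ.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)    # rescale the DC row so the matrix is orthonormal
    return m


def dct_upsample(vectors: np.ndarray, target_dim: int) -> np.ndarray:
    """Raise the dimension of row vectors by zero-padding their DCT coefficients.

    Illustrative assumption: with an orthonormal DCT this map is a linear
    isometry, so dot products between the input vectors are unchanged.
    """
    src_dim = vectors.shape[1]
    if target_dim < src_dim:
        raise ValueError("target_dim must be >= the source dimension")
    coeffs = vectors @ dct_matrix(src_dim).T       # forward DCT of each row
    padded = np.zeros((vectors.shape[0], target_dim))
    padded[:, :src_dim] = coeffs                   # high frequencies stay zero
    return padded @ dct_matrix(target_dim)         # inverse DCT in the larger space


def pca_reduce(vectors: np.ndarray, target_dim: int) -> np.ndarray:
    """Lower the dimension by projecting onto the top principal components."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T


# Hypothetical data: 8 embeddings of dimension 16.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
up = dct_upsample(emb, 32)    # 16 -> 32 dimensions
red = pca_reduce(emb, 4)      # 16 -> 4 dimensions
```

In this sketch the upsampled vectors keep the same pairwise dot products as the originals, while PCA reduction is lossy and keeps only the directions of highest variance in the sample.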

Version published to 10.21203/rs.3.rs-7562846/v1 on Research Square
Sep 17, 2025

Joint Modeling of Intelligent Retrieval-Augmented Generation in LLM-Based Knowledge Fusion

This article has 2 authors:
1. Di Wu
2. Shuaidong Pan
This article has no evaluations
Latest version Sep 30, 2025
Joint Modeling of Intelligent Retrieval-Augmented Generation in LLM-Based Knowledge Fusion

This article has 2 authors:
1. Di Wu
2. Shuaidong Pan
This article has no evaluations
Latest version Sep 10, 2025
Design and Evaluation of a Context-Aware Multimodal Recommendation and QA System with Retrieval-Augmented Generation

This article has 4 authors:
1. D. Venkata Subramanian
2. Bhuvan Unhelkar
3. S Ramacharan
4. Prasun Chakrabarti
This article has no evaluations
Latest version Sep 4, 2025
