A Novel Multi-Layer Semantic Chunking and Embedding Dimension Transformation Techniques for Enhanced Retrieval Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) systems combine information retrieval with generative models to provide accurate responses, but existing frameworks are limited in semantic chunking flexibility and in embedding compatibility across models. This paper introduces MERCED RAG, a novel framework that addresses these challenges through two key innovations: (1) multi-layer semantic chunking with HDBSCAN/Agglomerative clustering algorithms and intelligent outlier-handling strategies, and (2) a suite of embedding dimension transformation techniques, including DCT-based upsampling, orthogonal projection, weighted redistribution, and PCA reduction, for cross-model interoperability. Our segmentation approach integrates semantic clustering with advanced algorithms and is compared against traditional token-based methods, achieving 98.9% faithfulness and 100% answer relevancy. Our dimension transformation suite enables integration of diverse embedding models while preserving 87.4% semantic similarity. Experimental evaluation on the Vietnamese Undergraduate Training Regulations (VUTR) and Narrative Comprehension (NarrativeQA) datasets demonstrates significant improvements: overall scores of 1.027 on VUTR (+3.0% over the semantic baseline, +5.2% over the traditional baseline) and 0.760 on NarrativeQA (+2.6% over the semantic baseline, +4.7% over the traditional baseline). The NDCG score improves from 2.889 to 2.948 (+2.0%), establishing new benchmarks for RAG systems and validating our integrated approach to semantic chunking and cross-model compatibility.
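
To make the idea of DCT-based upsampling concrete, the following is a minimal sketch of one plausible reading of the term: take an orthonormal DCT-II of an embedding, zero-pad the coefficients to the target dimension, and apply the inverse transform. This is an assumption for illustration and may differ from what MERCED RAG actually does; the function names (`dct2`, `idct2`, `dct_upsample`, `cosine`) are hypothetical and use only the Python standard library.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II (i.e. orthonormal DCT-III)."""
    m = len(c)
    out = []
    for i in range(m):
        s = c[0] * math.sqrt(1.0 / m)
        s += sum(
            c[k] * math.sqrt(2.0 / m) * math.cos(math.pi * (i + 0.5) * k / m)
            for k in range(1, m)
        )
        out.append(s)
    return out

def dct_upsample(vec, target_dim):
    """Upsample vec to target_dim by zero-padding its DCT coefficients.

    Assumption: this is one common way to do DCT-based upsampling; it is
    not confirmed to be the variant used in the paper.
    """
    if target_dim < len(vec):
        raise ValueError("target_dim must be >= len(vec)")
    coeffs = dct2(vec)
    padded = coeffs + [0.0] * (target_dim - len(coeffs))
    return idct2(padded)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    a = [0.1, -0.4, 0.3, 0.8]
    b = [0.5, 0.2, -0.1, 0.4]
    ua, ub = dct_upsample(a, 8), dct_upsample(b, 8)
    print(len(ua), cosine(a, b), cosine(ua, ub))
```

Because both transforms in this sketch are orthonormal and zero-padding adds no energy, inner products between vectors upsampled this way are unchanged, so cosine similarity within a single model's embedding space is preserved. Matching dimensions in this manner does not by itself align the spaces of two different models.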
