SAREC: A Semantic-Aware Retrieval-Augmented Conformer for Multilingual Low-Resource Speech Recognition

Abstract

Low-resource speech recognition remains fundamentally limited by the training distribution, as neural networks cannot access linguistic knowledge beyond their training data. We present SAREC (Semantic-Aware Retrieval-Augmented Conformer), a novel neural architecture that augments Conformer-based acoustic encoders with dynamically retrieved phonetic exemplars during both training and inference. The key innovation is frame-level retrieval with learned cross-attention fusion, enabling phonetically precise knowledge integration at each encoding layer. Unlike existing utterance-level retrieval approaches (SRAG), SAREC retrieves and fuses acoustic prototypes at frame granularity, capturing phonetic patterns at multiple scales through strategic insertion at encoder layers 3, 6, 9, and 12. A learned gating mechanism dynamically weights acoustic versus exemplar information per frame, providing principled information fusion superior to simple concatenation. Evaluation on Telugu, Kannada, and Tamil demonstrates consistent improvements: Telugu 18.2% WER (Word Error Rate) (vs. 21.4% baseline, 15.0% relative, p < 0.001, 95% CI: [17.8, 18.6]%); Kannada 20.3% (vs. 24.8%, 18.1% relative); Tamil 16.4% (vs. 19.2%, 14.6% relative). Robustness improves by 12.1% at 5 dB SNR (Signal-to-Noise Ratio), and latency remains practical (356 ms per 10 s of audio). Ablation studies confirm the superiority of frame-level retrieval (15% over utterance-level, with K = 5 exemplars optimal). Three mechanisms explain the gains: (1) phonological disambiguation, (2) morphological integration, and (3) cross-dialect generalization. These results advance low-resource ASR accessibility through systematic frame-level retrieval augmentation.
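The frame-level fusion described above (cross-attention from each acoustic frame over K retrieved exemplars, followed by a per-frame sigmoid gate that weights acoustic versus exemplar information) can be sketched roughly as follows. This is a minimal NumPy illustration under assumptions of our own: the dimensions, the single hypothetical gate weight matrix `Wg`, and the `fuse` function name are all illustrative, not the paper's actual implementation, which uses learned cross-attention modules inside a Conformer encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(frames, exemplars, Wg):
    """Sketch of gated frame-level retrieval fusion.

    frames:    (T, d) acoustic encoder frames
    exemplars: (K, d) retrieved phonetic exemplar prototypes
    Wg:        (2d, 1) hypothetical learned gate weights
    """
    # Cross-attention: each frame attends over the K exemplars.
    scores = frames @ exemplars.T / np.sqrt(frames.shape[-1])  # (T, K)
    retrieved = softmax(scores) @ exemplars                    # (T, d)
    # Per-frame sigmoid gate on [frame; retrieved] features.
    g = 1.0 / (1.0 + np.exp(-(np.concatenate([frames, retrieved], axis=-1) @ Wg)))
    # Convex combination: gate weights acoustic vs. exemplar information.
    return g * frames + (1.0 - g) * retrieved

# Toy usage: 10 frames, 5 retrieved exemplars (K = 5, as in the ablation), d = 64.
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 64))
exemplars = rng.standard_normal((5, 64))
Wg = rng.standard_normal((128, 1))
out = fuse(frames, exemplars, Wg)
```

In this sketch the gate is a scalar per frame, so each output frame is a convex combination of its acoustic representation and its attention-weighted exemplar summary; the output keeps the same (T, d) shape and can therefore be inserted between encoder layers (e.g., at layers 3, 6, 9, and 12) without changing the surrounding architecture.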
