Step-Wise Approximation of CBOW Reduces Hallucinations in Tail Cases

Abstract

This paper introduces a cognitively inspired approach to word representation called step-wise approximation of embeddings, which bridges the gap between static word embeddings and fully contextualized language model outputs. Traditional embedding models like Word2Vec assign a single vector to each word, failing to account for polysemy and context-dependent meanings. In contrast, large language models produce distinct embeddings for every token instance, but at the cost of interpretability and computational efficiency. We propose modeling embeddings as piecewise-constant approximations that evolve in discrete semantic steps across contexts. This approach allows a word to be represented by a finite set of context-sensitive vectors, capturing different senses or usage patterns. We formalize the approximation process using entropy-minimizing segmentation and demonstrate its application in a continuous Word2Vec setting that handles context shifts smoothly. Our experiments show that this method improves representation quality for tail entities (words with limited training frequency), yielding up to a 5% improvement on question-answering tasks within a retrieval-augmented generation (RAG) framework. These results suggest that step-wise approximation offers a computationally efficient and interpretable alternative to contextual embeddings, with particular benefits for underrepresented vocabulary.
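
To make the piecewise-constant idea concrete, the sketch below shows one way such a segmentation could be computed: a word's ordered sequence of contextual vectors is split into a small number of "steps", each summarized by a single representative vector. This is only an illustrative sketch, not the paper's method; the greedy splitting, the sum-of-squares criterion (standing in for the entropy-minimizing segmentation), and the names `stepwise_approximation` and `max_steps` are assumptions introduced here.

```python
# Illustrative sketch (not the authors' implementation) of step-wise approximation:
# approximate a word's sequence of contextual vectors with a few piecewise-constant
# segments, so the word is represented by a small set of context-sensitive vectors
# instead of one static vector. The greedy split and scatter criterion are assumptions.
import numpy as np

def stepwise_approximation(context_vectors: np.ndarray, max_steps: int = 3):
    """Greedily split an ordered (n, d) array of context vectors into at most
    `max_steps` segments, each summarized by its mean vector."""
    n = len(context_vectors)
    boundaries = [0, n]  # segment boundaries; always contains 0 and n

    def sse(lo, hi):
        # within-segment scatter, used here as a stand-in for the entropy criterion
        seg = context_vectors[lo:hi]
        return float(((seg - seg.mean(axis=0)) ** 2).sum())

    while len(boundaries) - 1 < max_steps:
        best = None
        for i in range(len(boundaries) - 1):
            lo, hi = boundaries[i], boundaries[i + 1]
            for split in range(lo + 1, hi):
                gain = sse(lo, hi) - sse(lo, split) - sse(split, hi)
                if best is None or gain > best[0]:
                    best = (gain, split)
        if best is None or best[0] <= 0:
            break  # no further split reduces the criterion
        boundaries = sorted(boundaries + [best[1]])

    # one representative vector per step: the segment mean
    steps = [context_vectors[boundaries[i]:boundaries[i + 1]].mean(axis=0)
             for i in range(len(boundaries) - 1)]
    return boundaries, steps

# Toy usage: six contexts of a polysemous word drifting between two senses.
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(0.0, 0.1, (3, 4)), rng.normal(1.0, 0.1, (3, 4))])
bounds, step_vectors = stepwise_approximation(vecs, max_steps=2)
print(bounds)             # e.g. [0, 3, 6] -> two semantic steps
print(len(step_vectors))  # 2 context-sensitive vectors for this word
```

Under these assumptions, the number of steps bounds both the storage cost per word and the number of distinct senses the representation can express, which is the trade-off the abstract contrasts with fully contextual embeddings.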
