Step-Wise Approximation of CBOW Reduces Hallucinations in Tail Cases

Abstract

This paper introduces a cognitively inspired approach to word representation called step-wise approximation of embeddings, which bridges the gap between static word embeddings and fully contextualized language model outputs. Traditional embedding models like Word2Vec assign a single vector to each word, failing to account for polysemy and context-dependent meanings. In contrast, large language models produce distinct embeddings for every token instance, but at the cost of interpretability and computational efficiency. We propose modeling embeddings as piecewise-constant approximations that evolve in discrete semantic steps across contexts. This approach allows a word to be represented by a finite set of context-sensitive vectors, capturing different senses or usage patterns. We formalize the approximation process using entropy-minimizing segmentation and demonstrate its application in a continuous Word2Vec setting that handles context shifts smoothly. Our experiments show that this method improves representation quality for tail entities (words with limited training frequency), yielding up to a 5% improvement on question-answering tasks within a retrieval-augmented generation (RAG) framework. These results suggest that step-wise approximation offers a computationally efficient and interpretable alternative to contextual embeddings, with particular benefits for underrepresented vocabulary.
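
To make the piecewise-constant idea concrete, the sketch below shows one way such a segmentation could be computed: a word's ordered sequence of contextual vectors is split into a small number of "steps", each summarized by a single representative vector. This is only an illustrative sketch, not the paper's method; the greedy splitting, the sum-of-squares criterion (standing in for the entropy-minimizing segmentation), and the names `stepwise_approximation` and `max_steps` are assumptions introduced here.

```python
# Illustrative sketch (not the authors' implementation) of step-wise approximation:
# approximate a word's sequence of contextual vectors with a few piecewise-constant
# segments, so the word is represented by a small set of context-sensitive vectors
# instead of one static vector. The greedy split and scatter criterion are assumptions.
import numpy as np

def stepwise_approximation(context_vectors: np.ndarray, max_steps: int = 3):
    """Greedily split an ordered (n, d) array of context vectors into at most
    `max_steps` segments, each summarized by its mean vector."""
    n = len(context_vectors)
    boundaries = [0, n]  # segment boundaries; always contains 0 and n

    def sse(lo, hi):
        # within-segment scatter, used here as a stand-in for the entropy criterion
        seg = context_vectors[lo:hi]
        return float(((seg - seg.mean(axis=0)) ** 2).sum())

    while len(boundaries) - 1 < max_steps:
        best = None
        for i in range(len(boundaries) - 1):
            lo, hi = boundaries[i], boundaries[i + 1]
            for split in range(lo + 1, hi):
                gain = sse(lo, hi) - sse(lo, split) - sse(split, hi)
                if best is None or gain > best[0]:
                    best = (gain, split)
        if best is None or best[0] <= 0:
            break  # no further split reduces the criterion
        boundaries = sorted(boundaries + [best[1]])

    # one representative vector per step: the segment mean
    steps = [context_vectors[boundaries[i]:boundaries[i + 1]].mean(axis=0)
             for i in range(len(boundaries) - 1)]
    return boundaries, steps

# Toy usage: six contexts of a polysemous word drifting between two senses.
rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(0.0, 0.1, (3, 4)), rng.normal(1.0, 0.1, (3, 4))])
bounds, step_vectors = stepwise_approximation(vecs, max_steps=2)
print(bounds)             # e.g. [0, 3, 6] -> two semantic steps
print(len(step_vectors))  # 2 context-sensitive vectors for this word
```

Under these assumptions, the number of steps bounds both the storage cost per word and the number of distinct senses the representation can express, which is the trade-off the abstract contrasts with fully contextual embeddings.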
