A Topic-Based Framework for Interpretable and Sparse Semantic Representations
Abstract
In computational approaches to language, the semantic representations produced by modern deep learning systems are typically encoded as latent variables that resist interpretation. By contrast, cognitive science places strong emphasis on interpretable semantic structures. Although researchers have explored concept representations through corpus-driven distributed embeddings and various feature-based paradigms, a scalable, interpretable semantic representation framework for diverse linguistic materials remains an open methodological challenge. Recent advances in pretrained language models have produced topic modeling methods such as BERTopic, which clusters sentence-level embeddings into human-interpretable topics. Building on this line of work, the present study introduces SISTR, a framework for representing Semantics as Interpretable and Sparse Topic-based Representations. SISTR extracts interpretable topics from a corpus and models the semantics of linguistic materials via their relations to the derived topics; it can be applied across multiple linguistic levels and languages. We validated the interpretability of the resulting dimensions through behavioral experiments, and we evaluated the representation on word association and semantic judgment tasks by comparing model predictions with human behavioral data. The results suggest that sparse topic-based representations may capture human semantic intuitions better than dense embeddings. Overall, SISTR provides a promising tool for cognitive science and neuroscience, offering a more interpretable bridge between linguistic data, conceptual structure, and human cognition.
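The pipeline the abstract describes, clustering sentence embeddings into topics and then representing each item by its sparsified relations to those topics, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for pretrained sentence embeddings, scikit-learn's KMeans stands in for BERTopic-style clustering, and the names `topic_representation` and `top_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings from a pretrained encoder;
# a real pipeline would embed corpus sentences with a model such
# as one from sentence-transformers. Shapes are illustrative.
corpus_embeddings = rng.normal(size=(1000, 64))

# Derive "topics" by clustering the corpus embeddings; each
# cluster centroid then acts as one interpretable topic dimension.
n_topics = 20
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=0)
kmeans.fit(corpus_embeddings)
centroids = kmeans.cluster_centers_  # shape: (n_topics, 64)

def topic_representation(embedding, centroids, top_k=5):
    """Represent an item by its cosine similarity to each topic
    centroid, keeping only the top_k strongest topics (sparsification)."""
    sims = centroids @ embedding
    sims = sims / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(embedding))
    sparse = np.zeros_like(sims)
    top = np.argsort(sims)[-top_k:]
    sparse[top] = sims[top]
    return sparse

# Any new linguistic item is mapped into the same topic space.
item = rng.normal(size=64)
rep = topic_representation(item, centroids)
```

Each nonzero dimension of `rep` is tied to a concrete cluster of corpus sentences, which is what makes the representation inspectable in a way a dense latent vector is not.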