Automatic quantification of lexical ambiguity using large-scale word association data
Abstract
Most words in a language are lexically ambiguous: they are associated with multiple meanings that vary in their frequency and relatedness. Although ambiguity is a fundamental property of language, existing measures of this construct have extensive shortcomings. For instance, dictionary-based classifications and subjective ratings of meaning number and frequency struggle to capture the graded nature of ambiguity and, by extension, its impact on cognition and performance in experimental tasks. Subjective measures are also difficult to scale to the full lexicon. We introduce a novel, automated framework for measuring lexical ambiguity based on word association data from the Small World of Words (SWOW) project. We apply community detection algorithms to association graphs to quantify both the number and the distribution of semantic communities for each word, which in turn allows us to derive graded representations of meaning frequency and relatedness. To better understand our new metrics, we compare them to previously published subjective norms, and we establish their validity by showing that they predict lexical decision performance in English and Rioplatense Spanish. Furthermore, our results reveal cross-linguistic differences in lexical ambiguity (Spanish is less ambiguous than English overall), which we hypothesize arise from typological differences between the languages. Our validated framework contributes novel insights for computational and psycholinguistic models of semantic processing, and it offers a scalable, automated, and language-independent approach to quantifying different facets of lexical ambiguity. We provide all of our code and ambiguity measures for approximately 7,000 words in both languages to facilitate their use by other researchers.
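As a rough illustration of the pipeline the abstract describes, the sketch below builds an association graph for a single cue from SWOW-style participant responses, detects communities, and reads the community sizes as a graded meaning distribution. This is a minimal sketch under our own assumptions: the co-occurrence-based graph construction, the choice of the Louvain algorithm, and the entropy summary are illustrative, not necessarily the exact method used in the paper.

    # Minimal sketch: quantify a cue word's ambiguity from association data.
    # Assumes SWOW-style input where each participant gives a few free
    # associates per cue. Graph construction, the Louvain algorithm, and the
    # entropy summary are illustrative assumptions, not the authors' method.
    from collections import Counter
    from itertools import combinations
    from math import log2

    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    def ambiguity_measures(response_sets):
        """response_sets: one iterable of responses per participant for one cue."""
        graph = nx.Graph()
        freq = Counter()
        for responses in response_sets:
            freq.update(responses)
            # Link associates produced together by the same participant;
            # the edge weight counts how often a pair co-occurred.
            for a, b in combinations(sorted(set(responses)), 2):
                weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
                graph.add_edge(a, b, weight=weight + 1)
        # Each detected community is read as one candidate meaning of the cue.
        communities = louvain_communities(graph, weight="weight", seed=0)
        # Graded meaning frequency: the share of all responses per community.
        total = sum(freq.values())
        shares = [sum(freq[w] for w in c) / total for c in communities]
        # Entropy summarizes how evenly usage spreads across the meanings.
        entropy = -sum(p * log2(p) for p in shares if p > 0)
        return len(communities), shares, entropy

    # Toy responses for an ambiguous cue such as "bank":
    n_meanings, shares, h = ambiguity_measures([
        ["money", "loan", "cash"],
        ["river", "shore", "water"],
        ["money", "cash", "loan"],
    ])
    print(n_meanings, shares, h)

On real association norms one would aggregate over many participants per cue; the community count then serves as a meaning-number measure, while the entropy over community shares captures how balanced or dominant the meanings are.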