Automatic quantification of lexical ambiguity using large-scale word association data
Abstract
Most words in a language are lexically ambiguous: they are associated with multiple meanings that vary in their frequency and relatedness. Although ambiguity is a fundamental property of language, existing measures of this construct have extensive shortcomings. For instance, dictionary-based classifications and subjective ratings of meaning number and frequency struggle to capture the graded nature of ambiguity and, by extension, its impact on cognition and performance in experimental tasks. Subjective measures are also difficult to scale to the full lexicon. We introduce a novel, automated framework for measuring lexical ambiguity based on word association data from the Small World of Words (SWOW) project. We apply community detection algorithms to association graphs to quantify both the number and the distribution of semantic communities for each word, which in turn allows us to derive graded representations of meaning frequency and relatedness. To better understand our new metrics, we compare them to previously published subjective norms, and we establish their validity by showing that they predict lexical decision performance in English and Rioplatense Spanish. Furthermore, our results reveal cross-linguistic differences in lexical ambiguity (Spanish is less ambiguous than English overall), which we hypothesize arise from typological differences between the languages. Our validated framework contributes novel insights for computational and psycholinguistic models of semantic processing, and it offers a scalable, automated, and language-independent approach to quantifying different facets of lexical ambiguity. We provide all of our code and ambiguity measures for approximately 7,000 words in both languages to facilitate their use by other researchers.
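As a rough illustration of the pipeline the abstract describes, the sketch below builds an association graph for a single cue from SWOW-style participant responses, detects communities, and reads the community sizes as a graded meaning distribution. This is a minimal sketch under our own assumptions: the co-occurrence-based graph construction, the choice of the Louvain algorithm, and the entropy summary are illustrative, not necessarily the exact method used in the paper.

    # Minimal sketch: quantify a cue word's ambiguity from association data.
    # Assumes SWOW-style input where each participant gives a few free
    # associates per cue. Graph construction, the Louvain algorithm, and the
    # entropy summary are illustrative assumptions, not the authors' method.
    from collections import Counter
    from itertools import combinations
    from math import log2

    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    def ambiguity_measures(response_sets):
        """response_sets: one iterable of responses per participant for one cue."""
        graph = nx.Graph()
        freq = Counter()
        for responses in response_sets:
            freq.update(responses)
            # Link associates produced together by the same participant;
            # the edge weight counts how often a pair co-occurred.
            for a, b in combinations(sorted(set(responses)), 2):
                weight = graph.get_edge_data(a, b, {"weight": 0})["weight"]
                graph.add_edge(a, b, weight=weight + 1)
        # Each detected community is read as one candidate meaning of the cue.
        communities = louvain_communities(graph, weight="weight", seed=0)
        # Graded meaning frequency: the share of all responses per community.
        total = sum(freq.values())
        shares = [sum(freq[w] for w in c) / total for c in communities]
        # Entropy summarizes how evenly usage spreads across the meanings.
        entropy = -sum(p * log2(p) for p in shares if p > 0)
        return len(communities), shares, entropy

    # Toy responses for an ambiguous cue such as "bank":
    n_meanings, shares, h = ambiguity_measures([
        ["money", "loan", "cash"],
        ["river", "shore", "water"],
        ["money", "cash", "loan"],
    ])
    print(n_meanings, shares, h)

On real association norms one would aggregate over many participants per cue; the community count then serves as a meaning-number measure, while the entropy over community shares captures how balanced or dominant the meanings are.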