Language selectively encodes atypical features of the world
Abstract
Language contains a wealth of information about the world. However, language does not necessarily reflect the world veridically; instead, communicative pressure may lead it to selectively encode surprising or atypical information. If language picks out the atypical features of things (e.g., “purple carrot”) more often than the typical features of things (e.g., “orange carrot”), learning about the world from language is not straightforward. Here, we test whether a bias to overrepresent atypical information is present and robust across a variety of sources: everyday conversations among adults, the language children hear from parents, and children’s own language. To do so, we extracted usage data for nearly 5,000 unique adjective-noun pairs and collected human typicality ratings for each pair. We found that adults speaking to other adults, parents speaking to children, and even children themselves predominantly use adjectives to mark atypical features of things. We also found that parents of very young children comment on typical features slightly more often than parents of older children. Thus, language is structured to emphasize what is atypical, which raises the question of how one can learn what things are typically like from language. Using large language models, we test how this bias shapes what can be learned from language alone. We find that even language models with extensive training data (word2vec and BERT) fail to capture the typicality of adjective-noun pairs well, and only a much more sophisticated large language model (GPT-3) succeeds. Though large language models have input unlike what human learners have access to, they provide useful bounds on the typicality information learnable from applying simple training objectives to language alone. In sum, we find that people talk more about the atypical than the typical, and we examine how this shapes the problem of learning about the world from language in children, adults, and language models.