A Simulation-Based Slope Metric for Anchor List Reliability in Word Embedding Spaces

Abstract

Inducing semantic relations in word vector spaces and analyzing how other words or entire documents discursively engage these relations is a popular form of cultural analysis. We propose a reliability metric that is easily interpretable and agnostic to the type of relation. The metric, which we call the anchor reliability coefficient (or relco), is computed by creating an artificial document-term matrix of simulated documents that sequentially shift more of their probability mass from relation-relevant anchor terms to non-anchor words, and then regressing the documents' similarity to an induced relation on the anchor inclusion score of the documents. We validate the metric at the word level with both expert- and crowd-sourced dictionaries and at the document level with expert-annotated social media posts. We also provide some heuristic baselines for assessing reliability effect sizes and for null hypothesis testing.
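
The simulate-then-regress procedure described above can be illustrated with a rough sketch. This is not the authors' implementation: the function name `simulated_slope`, the toy random embeddings, the choice of the anchor centroid as the induced relation, cosine similarity as the similarity measure, and a uniform spread of probability mass within the anchor and non-anchor groups are all assumptions made for illustration only.

```python
import numpy as np


def simulated_slope(embeddings, anchor_idx, n_docs=11):
    """Regress simulated documents' similarity to a relation on anchor inclusion.

    Hypothetical helper. Assumes the relation is the centroid of the anchor
    vectors and that similarity is cosine similarity.
    """
    n_words = embeddings.shape[0]
    anchor_mask = np.zeros(n_words, dtype=bool)
    anchor_mask[anchor_idx] = True

    # Assumed relation induction: mean of the anchor word vectors.
    relation = embeddings[anchor_mask].mean(axis=0)

    # Anchor inclusion score: share of each document's mass on anchor terms,
    # stepping from all-anchor (1.0) down to no-anchor (0.0).
    inclusion = np.linspace(1.0, 0.0, n_docs)

    # Artificial document-term matrix; each row is a probability distribution.
    dtm = np.zeros((n_docs, n_words))
    dtm[:, anchor_mask] = inclusion[:, None] / anchor_mask.sum()
    dtm[:, ~anchor_mask] = (1.0 - inclusion)[:, None] / (~anchor_mask).sum()

    # Document vectors are probability-weighted averages of word vectors.
    doc_vecs = dtm @ embeddings
    sims = (doc_vecs @ relation) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(relation)
    )

    # Ordinary least squares fit: similarity ~ intercept + slope * inclusion.
    slope, intercept = np.polyfit(inclusion, sims, 1)
    return slope, inclusion, sims, dtm


# Toy data (fabricated for illustration): 50 random word vectors in 20
# dimensions, with the first 5 words pushed along a shared direction so that
# they behave like a coherent anchor list.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 20))
direction = rng.normal(size=20)
embeddings[:5] += 5.0 * direction / np.linalg.norm(direction)

slope, inclusion, sims, dtm = simulated_slope(embeddings, anchor_idx=np.arange(5))
print(f"slope of similarity on anchor inclusion: {slope:.3f}")
```

Under these assumptions, a steeper positive slope means that similarity to the relation tracks how much of a document's mass sits on anchor terms. How the paper scales or normalizes this quantity into the reported coefficient is not specified in the abstract and is not reproduced here.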
