Substitute-Space Embeddings for Label-Free Syntax: Unsupervised AI for POS Discovery

Abstract

This paper reinterprets part-of-speech induction as an AI representation-learning problem: words are embedded alongside their probabilistic substitutes so that discrete syntactic categories emerge without labels. A spherical embedding objective maps target words, substitute distributions, and auxiliary orthographic and morphological cues into a shared space in which clusters align with syntactic function, enabling token- and type-level induction via simple clustering. Experiments on English and 17+ languages use the standardized PTB, MULTEXT-East, and CoNLL-X corpora, showing state-of-the-art many-to-one and V-measure scores and analyzing sensitivity to embedding dimension, substitute-set size, and feature augmentations. The approach highlights how classic language models and unsupervised embeddings can yield emergent structure, offering a scalable path to label-free linguistic analysis in low-resource settings.
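
To make the pipeline concrete, here is a minimal sketch of token-level induction from substitute distributions. It assumes per-token substitute probabilities have already been computed by some language model, and it stands in for the paper's spherical embedding objective with plain L2 normalization onto the unit sphere followed by k-means; the function names and toy data are hypothetical, not the authors' implementation.

```python
# Sketch: cluster tokens by their substitute distributions and score the
# result with many-to-one accuracy. Assumes substitute probabilities are
# precomputed; the spherical-embedding step is approximated by projecting
# each distribution onto the unit sphere so cosine geometry drives k-means.
import numpy as np
from sklearn.cluster import KMeans

def induce_pos(substitute_probs: np.ndarray, n_tags: int = 45, seed: int = 0):
    """substitute_probs: (n_tokens, vocab) array; row i holds the
    probability a language model assigns to each vocabulary word as a
    substitute for token i. Returns one induced tag id per token."""
    # Unit-normalize each distribution (a stand-in for the learned
    # spherical embedding described in the abstract).
    X = substitute_probs / np.linalg.norm(substitute_probs, axis=1, keepdims=True)
    return KMeans(n_clusters=n_tags, n_init=10, random_state=seed).fit_predict(X)

def many_to_one(pred: np.ndarray, gold: np.ndarray) -> float:
    """Map each induced cluster to its most frequent gold tag and
    score the resulting tagging (the many-to-one metric)."""
    correct = 0
    for c in np.unique(pred):
        correct += np.bincount(gold[pred == c]).max()
    return correct / len(gold)

# Toy usage with random data standing in for real LM substitutes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(100), size=500)  # 500 tokens, vocab of 100
gold = rng.integers(0, 12, size=500)           # 12 hypothetical gold tags
tags = induce_pos(probs, n_tags=12)
print(f"many-to-one = {many_to_one(tags, gold):.3f}")
```

In the paper's full setup the shared space also receives orthographic and morphological features, and a learned embedding objective replaces the normalization above; this sketch only illustrates why clustering substitute distributions on a sphere can recover tag-like categories.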