A diagnostic and evaluative analysis of PARSEME corpora complexity


Abstract

Verbal multiword expressions (VMWEs) constitute a major source of structural complexity in multilingual corpora. While previous work has focused primarily on their identification performance and label distributions, less attention has been paid to the structural properties of the corpora themselves. This paper presents a multi-dimensional diagnostic of structural complexity in the PARSEME VMWE corpora, with the aim of providing a systematic basis for cross-linguistic comparison and for the development and evaluation of automatic VMWE identification systems. We analyse all available languages and annotation splits (train, development, and test) along four complementary dimensions: sentence-level properties, token-level properties, MWE-level properties, and inter-MWE relations. In particular, we introduce a formally defined taxonomy of inter-MWE patterns based on token-index relations, distinguishing positional, boundary-sharing, and identity-based patterns, while treating token sharing as a transversal structural property. Our results reveal substantial cross-lingual asymmetries in sentence length distributions, VMWE density, token sharing, and higher-order interaction patterns. Although most VMWEs occur in isolation, certain languages exhibit non-negligible proportions of overlapping and multi-relation patterns. We further show that structural properties are not always evenly distributed across annotation splits, which may have implications for system evaluation. By quantifying structural complexity across languages and annotation splits, this study contributes to corpus diagnostics and resource evaluation in multilingual MWE research. The proposed framework makes structural properties explicit, supports reproducible corpus profiling, and provides an empirical basis for interpreting cross-lingual benchmarking results. Beyond VMWE identification, the methodology offers a generalisable approach to evaluating structural complexity in annotated linguistic resources.
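As a rough illustration of what a taxonomy "based on token-index relations" could look like in practice, the sketch below compares two VMWEs represented as sets of token indices within one sentence. The abstract does not give the formal definitions, so everything here is an assumption made for illustration: the function names `pair_relation` and `sentence_profile`, the positional labels (`sequential`, `span-nested`, `span-crossing`), and the specific tests for boundary sharing and identity are hypothetical and should not be read as the authors' taxonomy. Token sharing is reported separately from the other fields, mirroring the idea that it is a transversal property.

```python
from itertools import combinations


def pair_relation(a, b):
    """Describe how two VMWEs relate, given their token indices in one sentence.

    Illustrative only: the labels are assumptions, not the paper's definitions.
    """
    a, b = frozenset(a), frozenset(b)
    if not a or not b:
        raise ValueError("each VMWE needs at least one token index")

    # Positional relation, computed from the spans [min, max] alone.
    if max(a) < min(b) or max(b) < min(a):
        positional = "sequential"      # one expression ends before the other starts
    elif (min(a) <= min(b) and max(b) <= max(a)) or (
        min(b) <= min(a) and max(a) <= max(b)
    ):
        positional = "span-nested"     # one span lies inside the other
    else:
        positional = "span-crossing"   # spans overlap without containment

    return {
        "positional": positional,
        "shares_start": min(a) == min(b),   # boundary sharing: same first token
        "shares_end": max(a) == max(b),     # boundary sharing: same last token
        "identical": a == b,                # identity: exactly the same tokens
        "shared_tokens": sorted(a & b),     # transversal: any tokens in common
    }


def sentence_profile(vmwes):
    """Return the relation of every unordered pair of VMWEs in a sentence."""
    return [
        (i, j, pair_relation(vmwes[i], vmwes[j]))
        for i, j in combinations(range(len(vmwes)), 2)
    ]
```

For example, `pair_relation({1, 5}, {2, 3})` reports a `span-nested` pair with no shared tokens, while `pair_relation({1, 2}, {1, 3})` reports a pair that shares its first token. Counting such labels per language and per split would give one possible, simplified version of the corpus profile the abstract describes.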
