A diagnostic and evaluative analysis of PARSEME corpora complexity

Santiago Fernández Lanza
Víctor Manuel Darriba Bilbao
Daniel Fernández-González

Abstract

Verbal multiword expressions (VMWEs) constitute a major source of structural complexity in multilingual corpora. While previous work has focused primarily on their identification performance and label distributions, less attention has been paid to the structural properties of the corpora themselves. This paper presents a multi-dimensional diagnostic of structural complexity in the PARSEME VMWE corpora, with the aim of providing a systematic basis for cross-linguistic comparison and for the development and evaluation of automatic VMWE identification systems. We analyse all available languages and annotation splits (train, development, and test) along four complementary dimensions: sentence-level properties, token-level properties, MWE-level properties, and inter-MWE relations. In particular, we introduce a formally defined taxonomy of inter-MWE patterns based on token-index relations, distinguishing positional, boundary-sharing, and identity-based patterns, while treating token sharing as a transversal structural property. Our results reveal substantial cross-lingual asymmetries in sentence length distributions, VMWE density, token sharing, and higher-order interaction patterns. Although most VMWEs occur in isolation, certain languages exhibit non-negligible proportions of overlapping and multi-relation patterns. We further show that structural properties are not always evenly distributed across annotation splits, which may have implications for system evaluation. By quantifying structural complexity across languages and annotation splits, this study contributes to corpus diagnostics and resource evaluation in multilingual MWE research. The proposed framework makes structural properties explicit, supports reproducible corpus profiling, and provides an empirical basis for interpreting cross-lingual benchmarking results. Beyond VMWE identification, the methodology offers a generalisable approach to evaluating structural complexity in annotated linguistic resources.
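
The abstract does not give the formal definitions behind the taxonomy, so the following is only a minimal sketch of how inter-MWE patterns might be derived from token-index relations. The function name `classify_pair` and every label it returns are illustrative assumptions, not the authors' categories; each MWE is assumed to be represented as a set of token indices within one sentence.

```python
from typing import Set


def classify_pair(a: Set[int], b: Set[int]) -> Set[str]:
    """Return illustrative relation labels for two MWEs given as token-index sets.

    The labels are hypothetical and only loosely mirror the abstract's
    positional, boundary-sharing, and identity-based groups.
    """
    if not a or not b:
        raise ValueError("each MWE needs at least one token index")

    labels: Set[str] = set()

    # Identity-based: both MWEs cover exactly the same tokens.
    if a == b:
        labels.add("identical")

    # Token sharing is recorded independently, so it can co-occur
    # with any of the other labels (a "transversal" property).
    if a & b:
        labels.add("shares_tokens")

    lo_a, hi_a = min(a), max(a)
    lo_b, hi_b = min(b), max(b)

    # Boundary-sharing: the two MWEs start or end on the same token index.
    if lo_a == lo_b or hi_a == hi_b:
        labels.add("shares_boundary")

    # Positional: compare the spans [lo, hi] covered by each MWE.
    if hi_a < lo_b or hi_b < lo_a:
        labels.add("disjoint_spans")
    elif (lo_a <= lo_b and hi_b <= hi_a) or (lo_b <= lo_a and hi_a <= hi_b):
        labels.add("nested_spans")
    else:
        labels.add("crossing_spans")

    return labels
```

For example, `classify_pair({1, 4}, {2, 3})` returns `{"nested_spans"}`: the second MWE lies inside the gap of a discontinuous first MWE without sharing any token or boundary. Returning a set of labels rather than a single label is a design choice that lets one pair carry several relations at once, in the spirit of the multi-relation patterns the abstract mentions.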

Version published to 10.21203/rs.3.rs-9023725/v1 on Research Square
Mar 30, 2026

Related articles

Measuring the Information Density of Interlanguage: An Entropy Analysis
Mohamed Mekheimer
Latest version Apr 16, 2026

CrossLingBench: A Comprehensive Evaluation of Large Language Models on Multilingual NLP Tasks Across Languages and Prompting Strategies
Ahmed Cherif
Latest version Apr 17, 2026

Character Semantic-Phonetic Structure Enhance Language Models in Classical Chinese
Bolin Chang
Bin Li
Zhixing Xu
Shiyan Ou
Latest version Mar 16, 2026
