Informative Missingness in Nominal Data: A Graph-Theoretic Approach to Revealing Hidden Structure

Ehsan Zangene
Veit Schwämmle
Mohieddin Jafari

Abstract

Missing data is often treated as a nuisance, routinely imputed or excluded from statistical analyses, especially in nominal datasets where its structure cannot be easily modeled. However, the form of missingness itself can reveal hidden relationships, substructures, and biological or operational constraints within a dataset. In this study, we present a graph-theoretic approach that reinterprets missing values not as gaps to be filled, but as informative signals. By representing nominal variables as nodes and encoding observed or missing associations as edges, we construct both weighted and unweighted bipartite graphs to analyze modularity, nestedness, and projection-based similarities. This framework enables downstream clustering and structural characterization of nominal data based on the topology of observed and missing associations; edge prediction via multiple imputation strategies is included as an optional downstream analysis to evaluate how well inferred values preserve the structure identified in the non-missing data. Across a series of biological, ecological, and social case studies, including proteomics data, the BeatAML drug screening dataset, ecological pollination networks, and HR analytics, we demonstrate that the structure of missing values can be highly informative. These configurations often reflect meaningful constraints and latent substructures, providing signals that help distinguish between data missing at random and not at random. When analyzed with appropriate graph-based tools, these patterns can be leveraged to improve the structural understanding of data and provide complementary signals for downstream tasks such as clustering and similarity analysis. Our findings support a conceptual shift: missing values are not merely analytical obstacles but valuable sources of insight that, when properly modeled, can enrich our understanding of complex nominal systems across domains.
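As a rough illustration of the idea described in the abstract, the following minimal sketch builds a bipartite representation from a toy nominal dataset and compares row nodes by the columns they share, once using observed edges and once using missing ones. It is not the authors' implementation: the records, the function names (`build_bipartite`, `jaccard`, `project`), and the choice of Jaccard similarity for the one-mode projection are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy records: (row label, column label, value).
# None marks a missing value. Labels and numbers are invented.
records = [
    ("drug_A", "sample_1", 0.8),
    ("drug_A", "sample_2", None),
    ("drug_A", "sample_3", 0.4),
    ("drug_B", "sample_1", 0.7),
    ("drug_B", "sample_2", None),
    ("drug_B", "sample_3", None),
    ("drug_C", "sample_1", None),
    ("drug_C", "sample_2", 0.9),
    ("drug_C", "sample_3", 0.5),
]


def build_bipartite(records):
    """Split records into weighted observed edges and unweighted missing edges."""
    observed, missing = {}, {}
    for row, col, value in records:
        observed.setdefault(row, {})
        missing.setdefault(row, set())
        if value is None:
            missing[row].add(col)        # edge in the "missingness" graph
        else:
            observed[row][col] = value   # weighted edge in the observed graph
    return observed, missing


def jaccard(a, b):
    """Jaccard similarity of two neighbor sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def project(neighbors):
    """One-mode projection: pairwise similarity of row nodes via shared columns."""
    return {
        (u, v): jaccard(neighbors[u], neighbors[v])
        for u, v in combinations(sorted(neighbors), 2)
    }


observed, missing = build_bipartite(records)
observed_sets = {row: set(cols) for row, cols in observed.items()}

missing_similarity = project(missing)          # similarity by shared missing cells
observed_similarity = project(observed_sets)   # similarity by shared observed cells

for pair in sorted(missing_similarity):
    print(pair, round(observed_similarity[pair], 3), round(missing_similarity[pair], 3))
```

In this toy example, rows that lack values in the same columns score higher in the missingness projection, which is the kind of signal the abstract suggests can hint at structured (not-at-random) missingness. Modularity and nestedness analyses would need additional tooling and are not shown here.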

Abstract Figure

Shiny app address https://ehsan-zangene.shinyapps.io/nimaa_app/

Version published to 10.1101/2025.08.22.670516 on bioRxiv
Aug 27, 2025

Related articles

Evaluating Imputation Methods for Handling Missing Data in Complex Survey Designs: Evidence from the India DHS 2017–18

This article has 6 authors:
1. Mahfuzer Rohman
2. Md Sabbir Hossain
3. Md Fakrul Islam
4. Prosenjit Basak Arka
5. Md Rafi Hasan
6. Md Jamal Uddin
This article has no evaluations. Latest version Jan 23, 2026

Ordinal random forests in language data analysis

This article has 2 authors:
1. Michael Mühlbauer
2. Lukas Sönning
This article has no evaluations. Latest version Dec 23, 2025

Bayesian Network Structure Learning from Incomplete Breast Cancer Data Using Structural Expectation–Maximization

This article has 3 authors:
1. Navaee Lavasani Monireh
2. Rezaeitabar Vahid
3. Khayamzadeh Maryam
This article has no evaluations. Latest version Dec 10, 2025
