A systematic imputation framework for sparse, multimodal space biology datasets: application to retinal imaging and omics from the RR9 mission


Abstract

Missing data is a fundamental challenge in space biology, where high experimental costs, limited sample availability, and tissue allocation constraints produce datasets that are sparse, multimodal, and heterogeneous. We present a systematic four-stage framework for diagnosing, implementing, and validating data imputation strategies tailored to these characteristics, and demonstrate its application to retinal imaging and omics data from the NASA Rodent Research 9 (RR9) mission. Using logistic regression-based missingness diagnosis, we identify a Missing At Random (MAR) mechanism driven by experimental design constraints across nine assay modalities. We implement and optimize three imputation strategies: K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations with weak ElasticNet regularization (MICE-Elastic), and a per-column hybrid strategy (Hybrid), each evaluated against a random sample imputer baseline. Validation across seven complementary metrics, including supervised classification, unsupervised clustering, correlation structure preservation, masked value recovery, cross-dataset generalization, and permutation testing, reveals that MICE-Elastic and the Hybrid strategy preserve genuine biological signal in both RNA-seq and TUNEL modalities, whereas KNN and the random sample imputer do not, despite achieving comparable cross-validation accuracy. A critical finding is that imputation substantially improves supervised classification performance while consistently degrading unsupervised clustering structure, a trade-off that researchers must understand before applying these methods. This framework provides practical guidance for space biologists and data scientists managing sparse multimodal datasets, and represents a foundational step toward digital twin development for space medicine.
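To make the "masked value recovery" metric concrete, the following is a minimal sketch in plain Python and is not the authors' pipeline. It hides a set of known cells, imputes them, and scores the reconstruction by RMSE, comparing a simplified KNN imputer against a random sample imputer baseline. The function names (`knn_impute`, `random_sample_impute`, `masked_rmse`), the synthetic three-column data, and all parameter values are assumptions chosen for illustration. The KNN shown here is a bare-bones variant, and MICE-Elastic and the Hybrid strategy are omitted.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None cells with the mean of the k nearest rows that observe that column.

    Distance is the mean squared difference over columns observed in both rows.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    donors.append((sum(shared) / len(shared), other[j]))
            if donors:
                donors.sort(key=lambda d: d[0])
                nearest = [v for _, v in donors[:k]]
            else:
                # No comparable row: fall back to the observed column mean.
                nearest = [r[j] for r in rows if r[j] is not None]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


def random_sample_impute(rows, rng):
    """Baseline: fill each None cell with a random observed value from its column."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        for i, r in enumerate(rows):
            if r[j] is None:
                filled[i][j] = rng.choice(observed)
    return filled


def masked_rmse(complete, imputer, n_masked, seed=0):
    """Hide n_masked known cells, impute them, and return RMSE on the hidden cells."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(complete)) for j in range(len(complete[0]))]
    hidden = rng.sample(cells, n_masked)
    masked = [list(r) for r in complete]
    for i, j in hidden:
        masked[i][j] = None
    filled = imputer(masked)
    squared = [(filled[i][j] - complete[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic, strongly correlated features (illustrative only, not RR9 data).
data_rng = random.Random(42)
complete = []
for _ in range(40):
    x = data_rng.uniform(0, 10)
    complete.append([x, 2 * x + data_rng.gauss(0, 0.1), 5 - x + data_rng.gauss(0, 0.1)])

# The same seed hides the same cells for both imputers, so the scores are comparable.
knn_rmse = masked_rmse(complete, knn_impute, n_masked=18)
random_rmse = masked_rmse(
    complete, lambda rows: random_sample_impute(rows, random.Random(1)), n_masked=18
)
print(f"KNN RMSE: {knn_rmse:.3f}  random-sample RMSE: {random_rmse:.3f}")
```

Because the synthetic columns are strongly correlated, a neighbor-based imputer should recover hidden values far better than random draws here. As the abstract cautions, a good score on one metric does not by itself show that biological signal is preserved, which is why the framework combines several complementary metrics.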
