A systematic imputation framework for sparse, multimodal space biology datasets: application to retinal imaging and omics from the RR9 mission
Abstract
Missing data is a fundamental challenge in space biology, where high experimental costs, limited sample availability, and tissue allocation constraints produce datasets that are sparse, multimodal, and heterogeneous. We present a systematic four-stage framework for diagnosing, implementing, and validating data imputation strategies tailored to these characteristics, and demonstrate its application to retinal imaging and omics data from the NASA Rodent Research 9 (RR9) mission. Using logistic regression-based missingness diagnosis, we identify a Missing At Random (MAR) mechanism driven by experimental design constraints across nine assay modalities. We implement and optimize three imputation strategies: K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations with weak ElasticNet regularization (MICE-Elastic), and a per-column hybrid strategy (Hybrid). Each is evaluated against a random sample imputer baseline. Validation across seven complementary metrics, including supervised classification, unsupervised clustering, correlation structure preservation, masked value recovery, cross-dataset generalization, and permutation testing, shows that MICE-Elastic and the Hybrid strategy preserve genuine biological signal in both the RNA-seq and TUNEL modalities, whereas KNN and the random sample imputer do not, despite achieving comparable cross-validation accuracy. A critical finding is that imputation substantially improves supervised classification performance while consistently degrading unsupervised clustering structure, a trade-off researchers must understand before applying these methods. This framework provides practical guidance for space biologists and data scientists managing sparse multimodal datasets and represents a foundational step toward digital twin development for space medicine.
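To make the "masked value recovery" check concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a small synthetic table with correlated columns (not RR9 data), hides known values, and compares a simple pure-Python KNN imputer with a random sample imputer baseline by RMSE on the hidden cells. The function names (`knn_impute`, `random_sample_impute`, `masked_rmse`, `run_demo`), the column loadings, and the 30% masking rate are illustrative assumptions. MICE-Elastic and the Hybrid strategy are not reproduced here.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None cells with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed in both rows.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a donor row must have column j observed
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda t: t[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def random_sample_impute(rows, rng):
    """Baseline: fill each None cell with a random observed value from its column."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        for r in filled:
            if r[j] is None:
                r[j] = rng.choice(observed)
    return filled


def masked_rmse(truth, imputed, masked_cells):
    """Root mean squared error restricted to the cells that were hidden."""
    sq = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked_cells]
    return math.sqrt(sum(sq) / len(sq))


def run_demo(seed=0, n_rows=40):
    rng = random.Random(seed)
    loadings = [1.0, 2.0, -1.0, 0.5]  # hypothetical: every column tracks one latent factor
    truth = []
    for _ in range(n_rows):
        latent = rng.gauss(0.0, 1.0)
        truth.append([w * latent + rng.gauss(0.0, 0.1) for w in loadings])

    # Hide at most one known cell per row (about 30% of rows) so recovery can be scored.
    masked_cells = [(i, rng.randrange(len(loadings)))
                    for i in range(n_rows) if rng.random() < 0.3]
    observed = [list(r) for r in truth]
    for i, j in masked_cells:
        observed[i][j] = None

    knn_rmse = masked_rmse(truth, knn_impute(observed), masked_cells)
    rand_rmse = masked_rmse(truth, random_sample_impute(observed, rng), masked_cells)
    return knn_rmse, rand_rmse


if __name__ == "__main__":
    knn_rmse, rand_rmse = run_demo()
    print(f"KNN masked RMSE:           {knn_rmse:.3f}")
    print(f"Random sample masked RMSE: {rand_rmse:.3f}")
```

On this toy data the columns are strongly correlated, so a neighbor-based imputer recovers hidden values far better than random draws. That says nothing about how the methods rank on real assay data, and masked recovery is only one of the validation metrics the abstract lists.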