Tackling challenges in data pooling: missing data handling in latent variable models with continuous and categorical indicators
Abstract
Data pooling is a powerful strategy in empirical research, but combining multiple datasets often results in a large amount of missing data: variables that are not available in every dataset will contain missing values for entire groups of participants. Furthermore, data pooling typically leads to a mix of continuous and categorical items with nonnormal multivariate distributions. We investigated two popular approaches to handling missing data in this context: (1) applying direct maximum likelihood by treating the data as continuous (con-ML), and (2) applying categorical least squares using a polychoric correlation matrix computed with pairwise deletion (cat-LS). Both approaches are freely available and relatively straightforward for empirical researchers to implement. Through simulation studies with confirmatory factor analysis and latent mediation analysis, we found cat-LS to be unsuitable for pooled data analysis, whereas con-ML yielded acceptable performance for the estimation of latent path coefficients, barring severe nonnormality.
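The missing-data pattern described in the abstract can be made concrete with a small sketch. The Python example below is an illustration only, not the authors' simulation design: the two studies, the item names (`x1` to `x4`), and the helper functions `pool`, `missing_rate`, and `pairwise_n` are all hypothetical. It stacks two toy datasets that share only some items and then reports, for each item, the share of missing values, and for each pair of items, how many participants were observed on both.

```python
from itertools import combinations

# Hypothetical toy studies: each dict is one participant, keys are item names.
# Study A collected x1, x2, x3; study B collected x1, x2, x4.
study_a = [
    {"x1": 3.2, "x2": 1, "x3": 4},
    {"x1": 2.7, "x2": 0, "x3": 5},
]
study_b = [
    {"x1": 3.9, "x2": 1, "x4": 2},
    {"x1": 1.4, "x2": 0, "x4": 3},
    {"x1": 2.2, "x2": 1, "x4": 1},
]


def pool(*studies):
    """Stack studies row-wise; items a study did not collect become None."""
    variables = sorted({name for study in studies for row in study for name in row})
    pooled = [{v: row.get(v) for v in variables} for study in studies for row in study]
    return variables, pooled


def missing_rate(pooled, variable):
    """Proportion of pooled participants with no value on `variable`."""
    return sum(row[variable] is None for row in pooled) / len(pooled)


def pairwise_n(pooled, variables):
    """Number of participants observed on both items, for every item pair."""
    return {
        (a, b): sum(row[a] is not None and row[b] is not None for row in pooled)
        for a, b in combinations(variables, 2)
    }


variables, pooled = pool(study_a, study_b)
rates = {v: missing_rate(pooled, v) for v in variables}
coverage = pairwise_n(pooled, variables)

print(rates)     # per-item proportion missing after pooling
print(coverage)  # per-pair count of jointly observed participants
```

In this toy layout, `x3` is missing for every participant from study B and `x4` for every participant from study A, so the pair (`x3`, `x4`) has no jointly observed cases at all. Any statistic computed pair by pair from the available cases, such as a correlation under pairwise deletion, would have nothing to work with for that pair.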