Interpretable and predictive models to harness the life science data revolution

Abstract

The proliferation of high-dimensional biological data is kindling hope that life scientists will be able to fit statistical and machine learning models that are both highly predictive and interpretable. However, large biological datasets commonly carry an inherent trade-off: in-sample prediction improves as additional predictors are included in the model, but this may come at the cost of poor predictive accuracy and limited generalizability for future or unsampled observations (out-of-sample prediction). To confront this problem of overfitting, sparse models home in on the causal predictors by correctly placing low weight on unimportant variables. We compared nine methods to quantify their performance in variable selection and prediction using simulated data with different sample sizes, numbers of predictors, and strengths of effects. Overfitting was typical for many methods and simulation scenarios. Despite this, in-sample and out-of-sample prediction converged on the true predictive target for simulations with more observations, larger causal effects, and fewer predictors. Accurate variable selection to support process-based understanding will be unattainable for many realistic sampling schemes. We use our analyses to characterize data attributes in which statistical learning is possible, and illustrate how some sparse methods can achieve predictive accuracy while mitigating and learning the extent of overfitting.
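The in-sample versus out-of-sample trade-off described in the abstract can be illustrated with a minimal simulation. This sketch is not the paper's actual simulation design; the sample sizes, number of predictors, and effect size here are hypothetical. It fits ordinary least squares to nested sets of predictors (one causal, the rest noise) and records both training and test error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 50, 1000
p_total = 40  # mostly noise predictors (hypothetical values)

# One causal predictor (column 0) plus uninformative noise predictors.
X = rng.normal(size=(n_train + n_test, p_total))
y = 2.0 * X[:, 0] + rng.normal(size=n_train + n_test)

Xtr, Xte = X[:n_train], X[n_train:]
ytr, yte = y[:n_train], y[n_train:]

def mse(pred, obs):
    return float(np.mean((pred - obs) ** 2))

in_sample, out_sample = [], []
for p in (1, 10, 40):
    # Least-squares fit using the first p predictors (nested models).
    beta, *_ = np.linalg.lstsq(Xtr[:, :p], ytr, rcond=None)
    in_sample.append(mse(Xtr[:, :p] @ beta, ytr))
    out_sample.append(mse(Xte[:, :p] @ beta, yte))
```

Because the models are nested, in-sample error can only decrease as predictors are added, while out-of-sample error typically worsens once noise predictors dominate; a sparse method would instead shrink the noise coefficients toward zero.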
