Evaluating Genetic-Based Disease Prediction Approaches Through Simulation


Abstract

Common diseases exhibit substantial heritability, and genome-wide association studies (GWAS) of these diseases have revealed hundreds of thousands of high-frequency disease susceptibility variants throughout the genome. These studies offer the prospect of using genomic data to improve disease prediction and diagnosis; however, the relative performance of different predictive modeling approaches is not well characterized. To investigate this systematically, we constructed a Monte Carlo simulation that generates model genomes with large numbers of SNPs, a proportion of which carry risk alleles parameterized by the strength of their effects and by their mode of inheritance: additive, dominant, recessive, or combinations thereof. After generating genotypes for cases and controls, we applied several machine learning classifiers (logistic regression, naïve Bayes, random forests, and neural networks, each with and without feature selection) to predict disease phenotype from genotype. Each classifier's trade-off between false positives and false negatives was summarized and compared using the area under the ROC curve (AUC). We found that random forest models were the most accurate predictors of disease phenotype over the range of inheritance parameters, followed by logistic regression and naïve Bayes, while the feedforward multilayer neural network model had lower AUC. Furthermore, with the small fraction of null sites in our model, there was almost no difference in classifier performance with or without LASSO-based feature selection. We also investigated the association of AUC with the difference in polygenic risk score (PRS) between disease and control samples by comparing the AUC observed in the simulations to the values predicted from the PRS distributions under odds-risk and liability models.
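To make the simulation idea concrete, the following is a minimal sketch of one way such an experiment could be set up. It is not the authors' code, and every specific choice in it is an assumption made for illustration: genotypes are drawn under Hardy-Weinberg equilibrium, every SNP is treated as causal with the same effect size `beta`, disease status follows a logistic model in the PRS, and the PRS itself is used as the predictor rather than any of the classifiers named in the abstract. The function `normal_approx_auc` uses the common approximation AUC ≈ Φ(Δ / √(σ₁² + σ₀²)) for roughly normal score distributions, which may differ from the odds-risk and liability formulas used in the article.

```python
import math
import numpy as np


def simulate_genotypes(n_samples, n_snps, risk_allele_freq, rng):
    """Draw risk-allele counts (0, 1 or 2) per SNP, assuming Hardy-Weinberg equilibrium."""
    return rng.binomial(2, risk_allele_freq, size=(n_samples, n_snps))


def encode(genotypes, mode):
    """Map allele counts to per-SNP risk contributions for one mode of inheritance."""
    if mode == "additive":    # risk grows with each copy of the risk allele
        return genotypes.astype(float)
    if mode == "dominant":    # one copy is enough
        return (genotypes >= 1).astype(float)
    if mode == "recessive":   # two copies are required
        return (genotypes == 2).astype(float)
    raise ValueError(f"unknown mode of inheritance: {mode}")


def simulate_phenotypes(score, rng):
    """Assign case status with a logistic model in the mean-centered score (assumed model)."""
    prob = 1.0 / (1.0 + np.exp(-(score - score.mean())))
    return rng.random(score.shape[0]) < prob


def empirical_auc(case_scores, control_scores):
    """Probability that a random case outscores a random control; ties count as half."""
    controls = np.sort(control_scores)
    below = np.searchsorted(controls, case_scores, side="left")
    at_or_below = np.searchsorted(controls, case_scores, side="right")
    return float(np.mean((below + at_or_below) / 2.0) / controls.size)


def normal_approx_auc(case_scores, control_scores):
    """AUC predicted from the two score distributions, treating both as normal."""
    delta = case_scores.mean() - control_scores.mean()
    spread = math.sqrt(case_scores.var() + control_scores.var())
    return 0.5 * (1.0 + math.erf(delta / spread / math.sqrt(2.0)))


def run(mode, n_samples=20000, n_snps=200, freq=0.3, beta=0.15, seed=0):
    """Simulate one population and return (empirical AUC, normal-approximation AUC)."""
    rng = np.random.default_rng(seed)
    genotypes = simulate_genotypes(n_samples, n_snps, freq, rng)
    prs = beta * encode(genotypes, mode).sum(axis=1)
    is_case = simulate_phenotypes(prs, rng)
    cases, controls = prs[is_case], prs[~is_case]
    return empirical_auc(cases, controls), normal_approx_auc(cases, controls)


if __name__ == "__main__":
    for mode in ("additive", "dominant", "recessive"):
        observed, predicted = run(mode)
        print(f"{mode:9s}  empirical AUC={observed:.3f}  normal approx={predicted:.3f}")
```

In this sketch the three modes share the same allele frequency and effect size, so differences in AUC between them reflect only how much PRS variance each encoding produces. Replacing the PRS with the output of a fitted classifier would bring the setup closer to the comparison described above.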
