ABSTRACT
Common diseases exhibit substantial heritability, and GWAS of these diseases have revealed hundreds of thousands of high-frequency disease susceptibility variants throughout the genome. These studies offer the prospect of using genomic data to improve disease prediction and diagnosis; however, the relative performance of different predictive modeling approaches is not well characterized. To investigate this systematically, we constructed a Monte Carlo simulation that generates model genomes with large numbers of SNPs. A proportion of these SNPs carry risk alleles, which are parameterized by the strength of their effects and by their mode of inheritance: additive, dominant, recessive, and combinations thereof. After generating genotypes for cases and controls, we applied several machine learning classifiers (logistic regression, naïve Bayes, random forests, and neural networks, each with and without feature selection) to predict disease phenotype from genotype. Each classifier's rates of false positives and false negatives were evaluated and compared using the area under the ROC curve (AUC). Random forest models were the most accurate predictors of disease phenotype over the range of inheritance parameters, followed by logistic regression and naïve Bayes, while the feedforward multilayer neural network had lower AUC. Furthermore, with the small fraction of null sites in our model, there was almost no difference in classifier performance with or without LASSO-based feature selection. We also investigated the association of AUC with the difference in polygenic risk score (PRS) between disease and control samples by comparing the AUC observed in the simulations to the values predicted from the PRS distributions under odds-risk and liability models.
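The simulation design described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (their code is in R, at the repository listed under Data Availability), and every name and parameter value here (`simulate_genotypes`, `risk_score`, `run_demo`, 100 SNPs, allele frequency 0.3, effect sizes between 0.1 and 0.3) is a hypothetical choice made for illustration. The sketch assumes Hardy-Weinberg genotype proportions and a logistic link between the risk score and disease status. It compares the empirical AUC of the score with the value expected if case and control scores were normally distributed, AUC = Φ(Δ / sqrt(σ²_case + σ²_control)), a standard binormal result.

```python
import math
import random
import statistics


def simulate_genotypes(n_people, freqs, rng):
    """Draw 0/1/2 risk-allele counts per SNP, assuming Hardy-Weinberg proportions."""
    return [[int(rng.random() < p) + int(rng.random() < p) for p in freqs]
            for _ in range(n_people)]


def risk_score(genotype, betas, mode="additive"):
    """Sum per-SNP effects (log-odds scale) under one mode of inheritance."""
    total = 0.0
    for count, beta in zip(genotype, betas):
        if mode == "additive":
            dose = count                      # each risk allele contributes
        elif mode == "dominant":
            dose = 1 if count >= 1 else 0     # one copy is enough
        elif mode == "recessive":
            dose = 1 if count == 2 else 0     # two copies are required
        else:
            raise ValueError(f"unknown mode: {mode}")
        total += beta * dose
    return total


def empirical_auc(case_scores, control_scores):
    """Mann-Whitney estimate of P(case score > control score); ties count half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def binormal_auc(case_scores, control_scores):
    """AUC expected if both score distributions were normal."""
    delta = statistics.mean(case_scores) - statistics.mean(control_scores)
    spread = math.sqrt(statistics.variance(case_scores)
                       + statistics.variance(control_scores))
    z = delta / spread
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def run_demo(seed=1, n_people=1000, n_snps=100, freq=0.3):
    """Simulate one cohort and return (empirical AUC, binormal AUC)."""
    rng = random.Random(seed)
    freqs = [freq] * n_snps
    betas = [rng.uniform(0.1, 0.3) for _ in range(n_snps)]  # illustrative effects
    genotypes = simulate_genotypes(n_people, freqs, rng)
    scores = [risk_score(g, betas) for g in genotypes]
    center = statistics.mean(scores)  # centering gives roughly 50% prevalence
    cases, controls = [], []
    for s in scores:
        p_disease = 1.0 / (1.0 + math.exp(-(s - center)))
        (cases if rng.random() < p_disease else controls).append(s)
    return empirical_auc(cases, controls), binormal_auc(cases, controls)
```

Calling `run_demo()` returns the two AUC values for one simulated cohort. Passing `mode="dominant"` or `mode="recessive"` to `risk_score` shows how the same genotype yields different scores under different inheritance assumptions.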
Competing Interest Statement
Eric Parfitt is employed by Wolfram Research, Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Funding Statement
Start-up funds from the Laboratory of Genetics, School of Medicine and Public Health, Office of the Vice Chancellor for Research and Graduate Education, and the Center for Human Genomics and Precision Medicine at the University of Wisconsin at Madison were used to support this study.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
DATA AVAILABILITY
The R code for the Monte Carlo simulation is publicly available at https://github.com/mshpak76/Genetic_Disease_Simulation/