Constructing polygenic risk predictors of psychiatric disorders
A key foundation of the future of precision medicine is the genetic description of complex human traits. Much of the research effort in this area consists of genetic testing, data collection, and patient interaction; most critical, however, is the interpretation of the resulting genetic data, in particular its application to the prediction of disease risk.
Many important psychiatric disease conditions are known to be significantly heritable [1, 2]. This means that genomic predictors and risk estimates can be constructed for many diseases if enough case-control data are available. While genome-wide association studies (GWAS) have uncovered single nucleotide polymorphisms (SNPs) correlated with traits of interest, for many traits the total variability accounted for has been small. This problem of "missing heritability" demonstrates that predicting (and understanding) most complex traits requires uncovering hundreds, if not thousands, of SNPs, which in turn requires increasingly large case-control datasets. The Michigan State University Precision Genetics Group (MSUPGG) aims to develop machine learning techniques that handle large genetic data sets and build accurate predictors of disease risk for a wide variety of complex, heritable diseases. In this study we apply L1-penalized regression (LASSO) to case-control data from the Lifelines biobank to construct disease risk predictors.
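The core idea of L1-penalized prediction can be illustrated with a minimal sketch on simulated genotypes. Everything here is an illustrative assumption (panel size, number of causal SNPs, effect sizes, the regularization strength `C`), not values or code from the study; the point is only that the L1 penalty drives most SNP weights to exactly zero, selecting a sparse predictor from a large panel.

```python
# Hypothetical sketch: sparse (LASSO-style) case-control prediction on
# simulated genotype data. All parameters below are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_snps, n_causal = 2000, 500, 20

# Genotypes coded 0/1/2 (minor-allele counts); only a few SNPs are causal.
X = rng.binomial(2, 0.3, size=(n_samples, n_snps)).astype(float)
beta = np.zeros(n_snps)
beta[:n_causal] = rng.normal(0.0, 0.5, n_causal)

# Liability-threshold model: top ~30% of liability are labeled cases.
liability = X @ beta + rng.normal(0.0, 1.0, n_samples)
y = (liability > np.quantile(liability, 0.7)).astype(int)

# L1 penalty zeroes out most SNP weights, leaving a sparse predictor.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)
n_selected = np.count_nonzero(model.coef_)
```

In real applications the regularization strength would be tuned by cross-validation on the training set, and the sparse set of selected SNPs defines the polygenic score applied to held-out individuals.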
In earlier work, we applied these methods to quantitative traits such as height, bone density, and educational attainment. Our height predictor captured almost all the expected additive heritability for height and had a prediction error of a few centimetres. We also used these techniques to develop risk models for complex diseases such as hypothyroidism, type 1 and type 2 diabetes, breast cancer, atrial fibrillation, and heart attack. The standard method of evaluating the performance of a genomic predictor for binary outcomes is to construct the receiver operating characteristic (ROC) curve and compute the area under it (AUC). For the diseases listed above, using SNP data alone we obtained AUCs in the range 0.580-0.707. Substantially higher AUCs were obtained by incorporating additional variables such as age and sex. These scores were sufficient to identify risk outliers (e.g., an individual in the 99th percentile of PGS, who has 3 to 8 times higher risk than a typical individual).
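The evaluation described above can be sketched as follows. This is a hypothetical example on a simulated polygenic score (the logistic risk model, its intercept and slope, and the percentile cutoffs are all assumptions for illustration): compute the ROC AUC of the score, then compare disease prevalence in the top PGS percentile against the middle of the distribution to quantify outlier risk.

```python
# Hypothetical sketch: evaluating a polygenic score (PGS) for a binary
# outcome via ROC AUC and a top-percentile risk ratio. Simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 100_000
pgs = rng.normal(0.0, 1.0, n)                 # standardized PGS
p = 1.0 / (1.0 + np.exp(-(-3.0 + 0.6 * pgs)))  # assumed logistic risk model
y = rng.binomial(1, p)                         # simulated case/control labels

# Discrimination: area under the ROC curve.
auc = roc_auc_score(y, pgs)

# Outlier risk: prevalence in the 99th PGS percentile vs. the middle quintile.
top = pgs >= np.quantile(pgs, 0.99)
mid = (pgs >= np.quantile(pgs, 0.40)) & (pgs <= np.quantile(pgs, 0.60))
risk_ratio = y[top].mean() / y[mid].mean()
```

Under this simulated model the AUC falls well above the 0.5 chance level, and top-percentile individuals show a several-fold elevated prevalence relative to the middle of the distribution, mirroring the kind of outlier identification reported for the real predictors.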
The machine learning techniques we use perform better as the sample size grows, both in parameter optimization and in out-of-sample performance. This indicates that rapid improvement in predictive power is attainable if training sets with larger case populations are used. In addition, longitudinal data will allow investigation of the temporal stability of disease diagnosis and of the dependence of polygenic predictors on which cross-section of the data is used for model training.
Our research will produce improved disease predictors, which will identify many of the alleles associated with risk for a particular disease. Combined with the rapid decline in genotyping costs (roughly $50 per sample for an array genotype, which directly measures roughly a million SNPs and allows imputation of millions more), these polygenic predictors will allow genomic disease prediction to be applied broadly in a clinical setting [6, 7]. Potential applications include identification of individuals who are outliers in risk score, and hence are candidates for increased testing, close observation, or preventative intervention (e.g., behaviour modification). Additionally, early interventions such as preimplantation genetic testing present a useful application of these disease predictors, decreasing disease incidence and reducing health care costs. Finally, a deeper understanding of the underlying genetic architecture is important basic science and may lead to improved treatments (e.g., drug development).