class: center, middle, inverse, title-slide # Predicting complex diseases:
performance and robustness ###
Florian Privé
, Hugues Aschard, Michael G.B. Blum ### RECOMB-Genetics 2018 --- class: center, middle, inverse # Introduction --- ## Disease architectures <div class="figure" style="text-align: center"> <img src="figures/disease-archi.gif" alt="<small>Source: 10.1126/science.338.6110.1016</small>" width="75%" /> <p class="caption"><small>Source: 10.1126/science.338.6110.1016</small></p> </div> --- ## Disease architectures <img src="figures/disease-archi.png" width="76%" style="display: block; margin: auto;" /> .footnote2[How to derive a genetic risk score for common diseases based on common variants with small effects?] --- ## Interest in prediction: polygenic risk scores (PRS) - Wray, Naomi R., Michael E. Goddard, and Peter M. Visscher. "**Prediction of individual genetic risk** to disease from genome-wide association studies." Genome research 17.10 (**2007**): 1520-1528. - Wray, Naomi R., et al. "Pitfalls of **predicting complex traits** from SNPs." Nature Reviews Genetics 14.7 (**2013**): 507. - Dudbridge, Frank. "Power and **predictive accuracy of polygenic risk scores**." PLoS genetics 9.3 (**2013**): e1003348. - Chatterjee, Nilanjan, Jianxin Shi, and Montserrat García-Closas. "Developing and evaluating **polygenic risk prediction** models for stratified disease prevention." Nature Reviews Genetics 17.7 (**2016**): 392. - Martin, Alicia R., et al. "Human demographic history impacts **genetic risk prediction** across diverse populations." The American Journal of Human Genetics 100.4 (**2017**): 635-649. .footnote2[Still a gap between current predictions and clinical utility.</br>Need more optimal predictions + larger sample sizes.] --- ## Standard PRS - part 1: estimating effects ### Genome-wide association studies (GWAS) In a GWAS, each single-nucleotide polymorphism (SNP) is tested **independently**, resulting in one **effect size** `\(\hat\beta\)` and one **p-value** `\(p\)` for each SNP. <img src="figures/celiac-gwas-cut.png" width="75%" style="display: block; margin: auto;" /> Easy combining: `\(PRS_i = \sum \hat\beta_j \cdot G_{i,j}\)` --- ## Standard PRS - part 2: restricting predictors ### <span style="color:#38761D">Clumping</span> + <span style="color:#1515FF">Thresholding</span> ("C+T" or just "PRS") <br> <img src="figures/GWAS2PRS.png" width="90%" style="display: block; margin: auto;" /> `$$PRS_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}$$` --- ## A more optimal approach to computing PRS? In C+T: weights learned independently and heuristics for correlation and regularization. #### Statistical learning - joint models of all SNPs at once - use regularization to account for correlated and null effects - already proved useful in the litterature (Abraham et al. 2013; Okser et al. 2014; Spiliopoulou et al. 2015) #### Our contribution - a memory- and computation-efficient implementation to be used for biobank-scale data - an automatic choice of the regularization hyper-parameter - a comprehensive comparison for different disease architectures --- class: center, middle, inverse # Methods --- ## Penalized Logistic Regression <br> <Small>$$\arg\!\min_{\beta_0,~\beta}(\lambda, \alpha)\left\{ \underbrace{ -\sum_{i=1}^n \left( y_i \log\left(p_i\right) + (1 - y_i) \log\left(1 - p_i\right) \right) }_\text{Loss function} + \underbrace{ \lambda \left((1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1\right) }_\text{Penalization} \right\}$$</Small> <br> *** - `\(p_i=1/\left(1+\exp\left(-(\beta_0 + x_i^T\beta)\right)\right)\)` - `\(x\)` is denoting the genotypes and covariables (e.g. principal components), - `\(y\)` is the disease status we want to predict, - `\(\lambda\)` is a regularization parameter that needs to be determined and - `\(\alpha\)` determines relative parts of the regularization `\(0 \le \alpha \le 1\)`. --- ### Efficient algorithm - sequential strong rules for discarding predictors in lasso-type problems (Tibshirani et al. 2012; Zeng et al. 2017) - implemented in our R package {bigstatsr} <a href="https://doi.org/10.1093/bioinformatics/bty185" target="_blank"> <img src="figures/bty185.png" width="65%" style="display: block; margin: auto;" /> </a> <br> #### Our implementation - uses memory-mapping to matrices stored on disk - functioning choice of the hyper-parameter `\(\lambda\)` - early stopping criterion --- ### Choice of the hyper-parameter `\(\lambda\)` <img src="figures/simple-CMSA.png" width="80%" style="display: block; margin: auto;" /> --- ## Comprehensive simulations: varying many parameters #### Simulation models <img src="figures/table-simus.png" width="85%" style="display: block; margin: auto;" /> #### Methods - PRS: - PRS-all (no p-value thresholding), - PRS-stringent (GWAS threshold of significance) and - PRS-max (best prediction for all thresholds, considered as an upper-bound) - logit-simple: penalized logistic regression with automatic of selection `\(\lambda\)` --- ## Predictive performance measures **AUC** (Area Under the ROC Curve) and partial AUC (FPR < 10%) are used. <img src="https://i.stack.imgur.com/5x3Xj.png" width="55%" style="display: block; margin: auto;" /> `$$\text{AUC} = P(S_\text{case} > S_\text{control})$$` --- class: center, middle, inverse # Results --- ### Higher predictive performance with logit-simple <img src="figures/pres-AUC-logit.svg" width="75%" style="display: block; margin: auto;" /> .footnote2[Penalized logistic regression provides higher predictive performance in the cases that matter, especially when there are correlated variables.] --- ### Predictive performance of C+T method varies with threshold <img src="figures/pres-AUC-PRS.svg" width="75%" style="display: block; margin: auto;" /> .footnote2[Recall that prediction of PRS-max is an upper-bound of the prediction provided by the C+T method.] --- ### Prediction with logit-simple is improving faster <img src="figures/pres-AUC-ntrain.svg" width="75%" style="display: block; margin: auto;" /> .footnote2[Performance of methods improve with larger sample size. Yet, penalized logistic regression is improving faster than the C+T method.] --- ## Real data <br> #### Celiac disease - intolerance to gluten - only treatment: gluten-free diet - heritability: 57-87% (Nisticò et al. 2006) - prevalence: 1-6% <br> #### Case-control study for the celiac disease (WTCCC, Dubois et al. 2010) - ~15,000 individuals - ~280,000 SNPs - ~30% cases --- ### Results: real Celiac phenotypes <img src="figures/results-celiac-prs-logit.png" width="95%" style="display: block; margin: auto;" /> <img src="figures/celiac-roc.svg" width="55%" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Discussion --- ### Summary of our penalized regression as compared to the C+T method - A more **optimal** approach for predicting complex diseases - models that are **linear** and very **sparse** - very **fast** - **automatic choice** for the regularization parameter - can be extended to capture also recessive and dominant effects <br> ### Prospects: future work using the UK Biobank - use of external summary statistics to improve models - better generalization to multiple populations - integration of clinical and environmental data --- class: center, middle, inverse # Thanks! <br> Presentation: https://privefl.github.io/thesis-docs/recomb18.html R package {bigstatsr}: https://github.com/privefl/bigstatsr R package {bigsnpr}: https://github.com/privefl/bigsnpr <br>
[privefl](https://twitter.com/privefl)
[privefl](https://github.com/privefl)
[F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9) .footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).] --- ## Real genotype data Use real data from a case-control study for the Celiac disease. <img src="figures/PC1-4.png" width="95%" style="display: block; margin: auto;" /> Keep only **controls** from the **UK** and **not deviating from the robust Malahanobis distance**. --- ## Simulate new phenotypes ### The liability-threshold model <img src="figures/LTM.png" width="65%" style="display: block; margin: auto;" /> --- ### Two models of liability #### A "simple" model `$$y_i = \underbrace{\sum_{j\in S_\text{causal}} w_j \cdot \widetilde{G_{i,j}}}_\text{genetic effect} + \underbrace{\epsilon_i}_\text{environmental effect}$$` #### A "fancy" model `$$y_i = \underbrace{\sum_{j\in S_\text{causal}^{(1)}} w_j \cdot \widetilde{G_{i,j}}}_\text{linear} + \underbrace{\sum_{j\in S_\text{causal}^{(2)}} w_j \cdot \widetilde{D_{i,j}}}_\text{dominant} + \underbrace{\sum_{\substack{k=1 \\ j_1=e_k^{(3.1)} \\ j_2=e_k^{(3.2)}}}^{k=\left|S_\text{causal}^{(3.1)}\right|} w_{j_1} \cdot \widetilde{G_{i,j_1} G_{i,j_2}}}_\text{interaction} + \epsilon_i$$` *** - `\(w_j\)` are **weights** (generated with a Gaussian or a Laplace distribution) - `\(G_{i,j}\)` is the **allele count** of individual `\(i\)` for SNP `\(j\)` - `\(D_{i,j} = \mathbf{1}\left\{G_{i,j} \neq 0\right\}\)` --- ### Extension via feature engineering <br> We construct a separate dataset with, for each SNP variable, two more variables coding for recessive and dominant effects. <br> <img src="figures/triple.png" width="100%" style="display: block; margin: auto;" /> .footnote[We call these two methods "logit-simple" and "logit-triple".] --- ### Feature engineering improves prediction <img src="figures/pres-triple.svg" width="90%" style="display: block; margin: auto;" /> --- ### Prediction with logit-simple is improving faster <br> <img src="figures/pres-AUC-chr6.svg" width="80%" style="display: block; margin: auto;" /> <!-- .footnote[Aims at increasing the polygenicity of the simulated models and at virtually increasing the sample size.] -->