Aarhus

## Efficient statistical tools for analyzing omics data

### with a focus on polygenic risk prediction

### Florian Privé

#### Aarhus - January 2019

---

# Introduction & Motivation

---

## About

I'm a PhD Student (2016-2019) in **Predictive Human Genetics** in Grenoble.

`$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations} + \cdots}}$$`

---

# Introduction

---

## Data

**Matrices of** Single Nucleotide Polymorphisms (SNPs)

counting the number of alternative alleles (**0, 1, or 2**)

for each individual (row) and each genome position (column)

\+ some phenotype(s) (e.g. disease status you want to predict)

\+ other metadata

`$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations} + \cdots}}$$`

---

## From genome-wide association studies (GWAS) to polygenic risk scores (PRS)

`$$PRS_i = \sum_{\substack{j \in S \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}$$`

---

## Polygenic Risk Scores (PRS) for epidemiology

**One application: to provide evidence** for a polygenic contribution to a trait or a shared polygenic relationship between traits.

<div class="figure" style="text-align: center">
<img src="figures/genomic-profile.png" alt="&lt;small&gt;Source: 10.1111/jcpp.12295&lt;/small&gt;" width="90%" />
Source: 10.1111/jcpp.12295
</div>

---

## Polygenic Risk Scores (PRS) for epidemiology

<div class="figure" style="text-align: center">
<img src="figures/purcell2009.png" alt="&lt;small&gt;Source: 10.1038/nature08185&lt;/small&gt;" width="75%" />
Source: 10.1038/nature08185
</div>

---

## Polygenic Risk Scores (PRS) for predictive medicine

### Another application: to identify high risk individuals

<div class="figure" style="text-align: center">
<img src="figures/high-risk.jpg" alt="&lt;small&gt;Source: 10.1038/s41576-018-0018-x&lt;/small&gt;" width="85%" />
Source: 10.1038/s41576-018-0018-x
</div>

---

## Interest in prediction: polygenic risk scores (PRS)

- Wray, Naomi R., Michael E. Goddard, and Peter M. Visscher. "**Prediction of individual genetic risk** to disease from genome-wide association studies." Genome research 17.10 (**2007**): 1520-1528.

- Wray, Naomi R., et al. "Pitfalls of **predicting complex traits** from SNPs." Nature Reviews Genetics 14.7 (**2013**): 507.

- Dudbridge, Frank. "Power and **predictive accuracy of polygenic risk scores**." PLoS genetics 9.3 (**2013**): e1003348.

- Chatterjee, Nilanjan, Jianxin Shi, and Montserrat García-Closas. "Developing and evaluating **polygenic risk prediction** models for stratified disease prevention." Nature Reviews Genetics 17.7 (**2016**): 392.

- Martin, Alicia R., et al. "Human demographic history impacts **genetic risk prediction** across diverse populations." The American Journal of Human Genetics 100.4 (**2017**): 635-649.

.footnote2[Still a gap between current predictions and clinical utility.Need more optimal predictions + larger sample sizes.]

---

## Very large genotype matrices

- previously: 15K x 280K, [celiac disease](https://doi.org/10.1038/ng.543) (~30GB)

- currently: 500K x 500K, [UK Biobank](https://doi.org/10.1101/166298) (~2TB)
 
<img src="https://media.giphy.com/media/3o7bueyxGydy48Lwgo/giphy.gif" width="55%" style="display: block; margin: auto;" />

---

# How to analyze large genomic data?

---

## Our two R packages: bigstatsr and bigsnpr

### Statistical tools with big matrices stored on disk

- {bigstatsr} for many types of matrix, to be used by any field of research

- {bigsnpr} for functions that are specific to the analysis of genetic data

Package {bigstatsr} provides fast PCA, association and predictive models, etc.

---

## The solution I found

.footnote[Format `FBM` is very similar to format `filebacked.big.matrix` from package {bigmemory} (details in [this vignette](https://privefl.github.io/bigstatsr/articles/bigstatsr-and-bigmemory.html)).]

---

## Multiple association testing

The idea behind Genome-Wide Association Studies (GWAS) is simple: test each variant one by one for association with the phenotype of interest.
For a continuous phenotype (e.g. height), linear regression is used and a t-test is performed for each SNP `$j$` on `$\beta^{(j)}$` where
`$$\hat{y} = \alpha^{(j)} + \beta^{(j)} SNP^{(j)} + \gamma_1^{(j)} COV_1 + \cdots + \gamma_K^{(j)} COV_K~,$$`
and `$K$` is the number of covariables, including **first principal components** and other covariates such as age and gender.

Similarly, for a binary phenotype (e.g. disease status), logistic regression is used and a Z-test is performed for each SNP `$j$` on `$\beta^{(j)}$` where
`$$\log{\left(\frac{\hat{p}}{1-\hat{p}}\right)} = \alpha^{(j)} + \beta^{(j)} SNP^{(j)} + \gamma_1^{(j)} COV_1 + \cdots + \gamma_K^{(j)} COV_K~,$$`
and `$\hat{p} = \mathbb{P}(Y = 1)$` and `$Y$` denotes the binary phenotype.

---

### Which DNA mutations are associated with one disease?

---

## Partial Singular Value Decomposition

15K `$\times$` 100K -- 10 first PCs -- 6 cores -- **1 min** (vs 2h in base R)

---

## Benchmarks (GWAS)

---

## Precision (approximate PCA)

---

# How to predict disease status based on genotypes?

---

## Standard PRS - part 1: estimating effects

### Genome-wide association studies (GWAS)

In a GWAS, each single-nucleotide polymorphism (SNP) is tested **independently**, resulting in one **effect size** `$\hat\beta$` and one **p-value** `$p$` for each SNP.

Easy combining: `$PRS_i = \sum_j \hat\beta_j \cdot G_{i,j}$`

---

## Standard PRS - part 2: restricting predictors

### Clumping + Thresholding ("C+T" or just "PRS")

`$$PRS_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}$$`

---

## A more optimal approach to computing PRS?

In C+T: weights learned independently and heuristics for correlation and regularization.

#### Statistical learning

- joint models of all SNPs at once

- use regularization to account for correlated and null effects

- already proved useful in the litterature (Abraham et al. 2013; Okser et al. 2014; Spiliopoulou et al. 2015)

#### Our contribution

- a memory- and computation-efficient implementation to be used for biobank-scale data

- an automatic choice of the regularization hyper-parameter

- a comprehensive comparison for different disease architectures

---

## Penalized Logistic Regression (PLR)

$$\arg\!\min_{\beta_0,~\beta}(\lambda, \alpha)\left\{ \underbrace{ -\sum_{i=1}^n \left( y_i \log\left(p_i\right) + (1 - y_i) \log\left(1 - p_i\right) \right) }_\text{Loss function} + \underbrace{ \lambda \left((1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1\right) }_\text{Penalization} \right\}$$

***

- `$p_i=1/\left(1+\exp\left(-(\beta_0 + x_i^T\beta)\right)\right)$`

- `$x$` is denoting the genotypes and covariables (e.g. principal components),

- `$y$` is the disease status we want to predict,

- `$\lambda$` is a regularization parameter that needs to be determined and

- `$\alpha$` determines relative parts of the regularization `$0 \le \alpha \le 1$`.

---

### Prediction with PLR is improving faster

.foonote["Efficient implementation of penalized regression for genetic risk prediction" -- under revision]

---

## Real data

#### Celiac disease

- intolerance to gluten

- only treatment: gluten-free diet

- heritability: 57-87% (Nisticò et al. 2006)

- prevalence: 1-6%

#### Case-control study for the celiac disease (WTCCC, Dubois et al. 2010)

- ~15,000 individuals

- ~280,000 SNPs

- ~30% cases

---

### Results: real Celiac phenotypes

---

## LASSO for predicting height

- 350K individuals x 656K SNPs in less than one day

- Within each sex category, 65.5% of correlation between predicted and true height (56% with C+T-max)

---

# Using summary statistics

---

## Standard PRS: C+T

In a GWAS, each SNP is tested independently, resulting in one **effect size** `$\hat\beta$` and one **p-value** `$p$` for each SNP (**summary statistics**).

`$$PRS_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}$$`

---

## Predictive methods based on summary statistics

When you have only summary statistics (and a small reference panel), you can use:

- C+T

- LDpred (*Vilhjálmsson, Bjarni J., et al. "Modeling linkage disequilibrium increases accuracy of polygenic risk scores." The American Journal of Human Genetics 97.4 (2015): 576-592*).

- lassosum (*Mak, Timothy Shin Heng, et al. "Polygenic scores via penalized regression on summary statistics." Genetic epidemiology 41.6 (2017): 469-480.*)

- NPS (*Chun, Sung, et al. "Non-parametric polygenic risk prediction using partitioned GWAS summary statistics." bioRxiv (2018): 370064.*)

The idea of LDpred, lassosum and NPS is to use a reference panel to **account for correlation** between SNPs, instead of pruning (removing) SNPs. Lassosum also adds some sparsity.

---

## Could those models be improved?

- take into account quality of imputation?

- support PLINK bed files only; what about BGEN files such as for the UK Biobank?

- scalable to which extent?

- combine with individual-level datasets? (of possibly different populations)

---

## One idea: stacking

For a grid of combinations of 3 parameters:

- threshold of clumping ( `$r^2_{clumping}$` )

- p-value threshold ( `$p_T$` )

- threshold of imputation quality score ( `$r^2_{imputation}$` )

Get a C+T prediction for each combination of parameters.

Then use those predictions as variables in a (penalized) regression (stacking of models).

---

## Stacking: preliminary results

<div class="figure" style="text-align: center">
<img src="figures/stacking.png" alt="Prediction for breast cancer.&lt;br&gt;Facets are representing different imputation quality thresholds." width="90%" />
Prediction for breast cancer. Facets are representing different imputation quality thresholds.
</div>

---
class: center, middle, inverse

# Conclusion

---

## My thesis work

1. Developping two packages for the analysis of large-scale genomic data.

(https://doi.org/10.1093/bioinformatics/bty185) 
    
    Package bigstatsr can be used for any data encoded as matrices.

2. Including an implementation of penalized regression for very large individual-level datasets \+ assess the potential gain in prediction over the simple standard model (C+T).
    
    (https://doi.org/10.1101/403337)

3. Including summary statistics from large GWAS to improve prediction.

(TODO)

---

## Make sure to grab an hex sticker

---
class: center, middle, inverse

# Thanks!

Presentation available at

https://privefl.github.io/thesis-docs/aarhus.html

[privefl](https://twitter.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [privefl](https://github.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9)