Slides: https://privefl.github.io/thesis-docs/PLR-genetics.html
Polygenic scores (PGS) = (predictive) scores that combine many genetic variants (linearly here).
Polygenic risk scores (PRS) = PGS for disease risk (e.g. the probability of being diagnosed during one's lifetime).
Major benefit of PRS: genetic variants mostly do not change during one's lifetime, so such risk can be derived from birth.
What is hard: most common traits or diseases are associated with many genetic variants, usually with very small effect sizes.
Limitation: prediction from genetic variants is bounded by heritability.
Height is up to 80% heritable; here we can predict ~40% of the variance using genetics (400K individuals × 560K genetic variants).
Gluten intolerance is an autoimmune disease with large effects on chromosome 6.
We want to solve

$$y = \beta_0 + \beta_1 G_1 + \cdots + \beta_p G_p + \gamma_1 COV_1 + \cdots + \gamma_q COV_q + \epsilon~.$$

Let $\beta = (\beta_0, \beta_1, \ldots, \beta_p, \gamma_1, \ldots, \gamma_q)$ and $X = [1; G_1; \ldots; G_p; COV_1; \ldots; COV_q]$; then

$$y = X\beta + \epsilon~.$$

This is equivalent to minimizing

$$||y - X\beta||_2^2 = ||\epsilon||_2^2~,$$

whose solution is

$$\beta = (X^T X)^{-1} X^T y~.$$

What is the problem when analyzing genotype data? $n < p$.
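This closed form can be checked numerically. A minimal NumPy sketch with made-up toy data (illustrative only, not how genotype data is actually analyzed): when $n > p$ the normal equations are solvable, but when $n < p$ the $p \times p$ matrix $X^T X$ has rank at most $n$, so it is singular and the closed form breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)

# n > p: the ordinary least-squares solution is well defined.
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# beta = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # close to beta_true

# n < p: X^T X is p x p but has rank at most n, hence it is singular
# and (X^T X)^{-1} does not exist.
n, p = 5, 100
X = rng.standard_normal((n, p))
print(np.linalg.matrix_rank(X.T @ X))  # at most n = 5, far below p = 100
```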
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_2^2~,$$

whose solution is

$$\beta = (X^T X + \lambda I)^{-1} X^T y~.$$

This is L2-regularization ("ridge", Hoerl and Kennard, 1970); it shrinks the coefficients β towards 0.
https://doi.org/10.1080/00401706.1970.10488634
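As a quick numerical check (a NumPy sketch with made-up toy data, not the implementation used by any of the packages discussed here): the ridge system is solvable even when $n < p$, because adding $\lambda I$ makes $X^T X + \lambda I$ invertible, and larger $\lambda$ shrinks the coefficients more.

```python
import numpy as np

rng = np.random.default_rng(1)

# n < p: X^T X is singular, but X^T X + lambda * I is invertible,
# so the ridge solution always exists.
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, -0.5, 0.5]  # only a few real effects
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Larger lambda shrinks the coefficients more: ||beta|| decreases.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (1.0, 10.0, 100.0)]
print(norms)  # a decreasing sequence
```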
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_1~,$$

which has no closed form but can be solved using iterative algorithms.

This is L1-regularization ("lasso", Tibshirani, 1996); it forces some of the coefficients to be exactly 0 and can therefore be used for variable selection, leading to sparse models.
https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
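One common iterative algorithm for the lasso is cyclical coordinate descent with soft-thresholding. The didactic NumPy sketch below (toy data; not the optimized implementations in {glmnet} or {bigstatsr}) shows how the soft-threshold sets many coefficients to exactly 0:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: shrink z towards 0 by t, or set it to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # residual ignoring feature j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            # exact minimizer of the one-dimensional subproblem in beta_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]  # only 3 of 200 variables matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_lasso = lasso_cd(X, y, lam=50.0)
print(np.count_nonzero(beta_lasso))  # sparse: most coefficients are exactly 0
```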
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda \left( \alpha ||\beta||_1 + (1 - \alpha) ||\beta||_2^2 \right)~,$$

which has no closed form but can be solved using iterative algorithms ($0 \le \alpha \le 1$).

This is combined L1- and L2-regularization ("elastic net", Zou and Hastie, 2005); it is a compromise between the two previous penalties.
https://doi.org/10.1111/j.1467-9868.2005.00503.x
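The penalty itself is simple to evaluate; a tiny sketch (toy numbers) showing that α interpolates linearly between the lasso penalty (α = 1) and the ridge penalty (α = 0):

```python
import numpy as np

def enet_penalty(beta, lam, alpha):
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha) * ||b||_2^2)."""
    return lam * (alpha * np.abs(beta).sum() + (1 - alpha) * (beta ** 2).sum())

beta = np.array([1.0, -2.0, 0.0, 0.5])  # ||b||_1 = 3.5, ||b||_2^2 = 5.25
lam = 2.0

print(enet_penalty(beta, lam, 1.0))  # lasso penalty: 2 * 3.5 = 7.0
print(enet_penalty(beta, lam, 0.0))  # ridge penalty: 2 * 5.25 = 10.5
print(enet_penalty(beta, lam, 0.5))  # halfway between the two: 8.75
```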
Makes it possible to solve linear problems when n < p.
Generally prevents overfitting (because effects are shrunk to be smaller).
However, with e.g. the R package {glmnet}, fitting can take a long time on genetic data.
automatic choice of the two hyper-parameters λ and α
faster (mainly thanks to an early-stopping criterion and no need for refitting)
memory efficient (because data is stored on disk)
parallelization of fitting (easy because data is on disk)
So, it can easily be applied to huge data.
Also two new options:
use of a different scaling (the default divides variants by their SD)
adaptive lasso (larger marginal effects are penalized less)
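The exact weighting used in {bigstatsr} is described in the linked papers; below is only a generic NumPy sketch of the adaptive-lasso idea (the weighting scheme and toy data here are hypothetical): per-variable penalty factors are derived from the marginal effects, so variants with larger marginal effects are penalized less.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X[:, 0] * 2.0 + rng.standard_normal(n)  # only variant 0 has a real effect

# Marginal (univariate) effect of each variant: slope of y ~ x_j alone.
marginal = X.T @ y / (X ** 2).sum(axis=0)

# Adaptive-lasso idea: penalty weights inversely proportional to the
# marginal effect sizes, so strong variants are shrunk less.
weights = 1.0 / np.abs(marginal)
weights /= weights.mean()  # normalize around 1
print(weights)  # variant 0 gets the smallest penalty weight
```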
1,117,182 variants (HapMap3)
Training based on 434,868 individuals of European ancestry + testing in 8636 South Asians, 1803 East Asians and 6983 Africans
111 continuous + 129 binary = 240 phenotypes
Some examples:
multiple diseases: cancers, diabetes, autoimmune, etc.
body measures: height, BMI, BMD, etc.
blood biochemistry: cholesterol, vitamin D, etc.
ECG measures
misc
[Figure: robust slope of pcor_other ~ pcor_eur, squared, per ancestry group]
Running time is quadratic in the number of non-zero variables.
Using R package {bigstatsr}, you can fit penalized regressions on 100s of GBs of data (any matrix-like data).
E.g. to build polygenic scores for 240 traits in the UK Biobank
However, more work is needed (e.g. to improve PGS in other ancestries)
Paper describing R packages {bigstatsr} and {bigsnpr}:
https://doi.org/10.1093/bioinformatics/bty185
Paper specifically describing the penalized regression implementation:
https://doi.org/10.1534/genetics.119.302019
Tutorial:
https://privefl.github.io/bigstatsr/articles/penalized-regressions.html
Presentation available at
https://privefl.github.io/thesis-docs/PLR-genetics.html
Slides created via the R package xaringan.