

Efficient penalized regression methods
for genetic prediction


Florian Privé

Prediction Modelling Presentations -- October 14, 2020



Slides: https://privefl.github.io/thesis-docs/PLR-genetics.html

1 / 25

Context of application

2 / 25


Application to building polygenic scores


Polygenic scores (PGS) = (predictive) scores that combine many genetic variants (linearly here)

Polygenic risk scores (PRS) = PGS for disease risk (e.g. the probability of being diagnosed during one's lifetime)


Major benefit of PRS: genetic variants mostly do not change during one's lifetime, so such risk can be estimated from birth.

What is hard: most common traits and diseases are associated with many genetic variants, each usually with a very small effect size.

Limitation: prediction from genetic variants is limited by heritability.

3 / 25

Utility of polygenic scores

4 / 25

Application to predicting height

Height is up to 80% heritable. Here, we can predict ~40% of the variance using genetics (400K individuals × 560K genetic variants).

5 / 25

Application to predicting celiac disease

Gluten intolerance, an autoimmune disease with large genetic effects on chromosome 6.

6 / 25

Penalized regression models

A reminder

7 / 25

Multiple linear regression

We want to solve

$$y = \beta_0 + \beta_1 G_1 + \cdots + \beta_p G_p + \gamma_1 COV_1 + \cdots + \gamma_q COV_q + \epsilon~.$$

Let $\beta = (\beta_0, \beta_1, \ldots, \beta_p, \gamma_1, \ldots, \gamma_q)$ and $X = [1; G_1; \ldots; G_p; COV_1; \ldots; COV_q]$, then

$$y = X\beta + \epsilon~.$$

This is equivalent to minimizing

$$||y - X\beta||_2^2 = ||\epsilon||_2^2~,$$

whose solution is

$$\beta = (X^T X)^{-1} X^T y~.$$


What is the problem when analyzing genotype data?

$n < p$

8 / 25
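The closed-form solution above can be checked numerically; a small illustrative sketch in Python/NumPy (not from the original slides), with all names and data simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                       # n > p, so X^T X is invertible
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Closed-form OLS solution: beta = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# beta_hat recovers beta_true up to the noise level
```

When $n < p$, as with genotype data, $X^T X$ is singular and this solve fails, which is exactly the problem raised on this slide.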

Penalization term -- L2 regularization


Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_2^2~,$$

whose solution is

$$\beta = (X^T X + \lambda I)^{-1} X^T y~.$$


This is the L2 regularization ("ridge", Hoerl and Kennard, 1970); it shrinks the coefficients $\beta$ towards 0.

https://doi.org/10.1080/00401706.1970.10488634

9 / 25
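A quick numerical sketch (Python/NumPy, simulated data) of the two key properties of the ridge solution: it exists even when $n < p$, and a larger $\lambda$ shrinks the coefficients more:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 200                      # n < p: OLS is undefined, ridge still works
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(lam):
    # beta = (X^T X + lam I)^{-1} X^T y -- invertible for any lam > 0
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_1, beta_100 = ridge(1.0), ridge(100.0)
# the L2 norm of the solution decreases as lambda grows
```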

Penalization term -- L1 regularization


Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_1~,$$

which has no closed-form solution but can be solved using iterative algorithms.


This is the L1 regularization ("lasso", Tibshirani, 1996); it forces some of the coefficients to be exactly 0 and can therefore be used for variable selection, leading to sparse models.

https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

10 / 25
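One such iterative algorithm is cyclic coordinate descent with soft-thresholding (the approach used by {glmnet}); a minimal sketch in Python/NumPy on simulated data, using the objective exactly as written above:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||_2^2 + lam ||b||_1 by cyclic coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # residual excluding variable j
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return b

rng = np.random.default_rng(2)
n, p = 60, 100                      # n < p
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 2.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b_hat = lasso_cd(X, y, lam=20.0)
# most coefficients come out exactly 0 -> sparse model / variable selection
```

Each coordinate update has a closed form (soft-thresholding), which is what makes the whole iterative scheme cheap even without a global closed-form solution.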

Penalization term -- L1 and L2 regularization


Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda \left( \alpha ||\beta||_1 + (1 - \alpha) ||\beta||_2^2 \right)~,$$

which has no closed-form solution but can be solved using iterative algorithms ($0 \le \alpha \le 1$).


This is the combined L1 and L2 regularization ("elastic-net", Zou and Hastie, 2005); it is a compromise between the two previous penalties.

https://doi.org/10.1111/j.1467-9868.2005.00503.x

11 / 25
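The coordinate-descent update generalizes directly to the elastic-net penalty: the L1 part keeps the soft-thresholding, and the L2 part simply adds $\lambda(1-\alpha)$ to the denominator. An illustrative Python/NumPy sketch (simulated data, objective scaled as written above), with a sanity check that $\alpha = 0$ matches the ridge closed form:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_cd(X, y, lam, alpha, n_iter=500):
    """Minimize ||y - X b||_2^2 + lam (alpha ||b||_1 + (1 - alpha) ||b||_2^2)
    by cyclic coordinate descent (alpha = 1: lasso, alpha = 0: ridge)."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # residual excluding variable j
            z = X[:, j] @ r_j
            b[j] = soft_threshold(z, lam * alpha / 2) / (col_sq[j] + lam * (1 - alpha))
    return b

rng = np.random.default_rng(3)
n, p = 50, 30
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# sanity check: with alpha = 0 the iterates converge to the ridge closed form
b_cd = enet_cd(X, y, lam=2.0, alpha=0.0, n_iter=2000)
b_ridge = np.linalg.solve(X.T @ X + 2.0 * np.eye(p), X.T @ y)
```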

Advantages and drawbacks of penalization

Advantages

  • Makes it possible to solve linear problems when n < p

  • Generally prevents overfitting (because of smaller effects)

Drawback

  • Adds at least one hyper-parameter (λ) that needs to be chosen, and a second one (α) when using the elastic-net regularization

Alternative

  • Select a few variables before fitting the linear model (e.g. based on marginal significance/p-values); heuristic: keep p = n/10 variables

12 / 25

Cross-validation for choosing hyper-parameters



However, with e.g. the R package {glmnet}, this cross-validation can take a long time to run on genetic data.

13 / 25
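The K-fold cross-validation idea for choosing λ can be sketched as follows (illustrative Python/NumPy on simulated data, using the ridge closed form for brevity; fold splitting and grid are arbitrary choices, not the slides' actual procedure):

```python
import numpy as np

def cv_mse(X, y, lam, k=5):
    """Mean validation MSE of ridge regression over k folds, for one lambda."""
    n, p = X.shape
    folds = np.array_split(np.arange(n), k)
    errs = []
    for val in folds:
        mask = np.ones(n, dtype=bool)
        mask[val] = False                      # train on everything except this fold
        b = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(p),
                            X[mask].T @ y[mask])
        errs.append(np.mean((y[val] - X[val] @ b) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
n, p = 80, 120
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = [cv_mse(X, y, lam) for lam in grid]
best_lam = grid[int(np.argmin(scores))]        # keep the lambda with lowest CV error
```

For every candidate λ the model is refitted k times, which is why this becomes expensive on biobank-scale genotype matrices.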

A slightly different approach in {bigstatsr}

14 / 25

Cross-Model Selection and Averaging (CMSA)

15 / 25

Advantages of using {bigstatsr}


  • automatic choice of the two hyper-parameters λ and α

  • faster (mainly thanks to an early-stopping criterion and no refitting step)

  • memory-efficient (because the data is stored on disk)

  • parallelized fitting (easy because the data is on disk)


So, it can easily be applied to huge datasets.


  • also two new options:

    • use of a different scaling (the default divides variants by their SD)

    • adaptive lasso (variants with larger marginal effects are penalized less)

16 / 25

Application to 240 phenotypes
within the UK Biobank

17 / 25

Genetic data

  • 1,117,182 variants (HapMap3)

  • Training based on 434,868 individuals of European ancestry +
    testing in 8,636 South Asians, 1,803 East Asians and 6,983 Africans

18 / 25

Phenotypic data


111 continuous + 129 binary = 240 phenotypes


Some examples:

  • multiple diseases: cancers, diabetes, autoimmune, etc.

  • body measures: height, BMI, BMD, etc.

  • blood biochemistry: cholesterol, vitamin D, etc.

  • ECG measures

  • misc

19 / 25

Results for binary phenotypes


20 / 25

Results for continuous phenotypes


21 / 25

Limitation of polygenic scores: prediction in different ancestries



Robust slope of pcor_other ~ pcor_eur, squared (i.e. predictive performance relative to Europeans):

  • 62.3% for South Asians
  • 45.5% for East Asians
  • 18.8% for Africans
22 / 25

How fast is the LASSO implementation?



Running time is quadratic in the number of non-zero variables.

23 / 25

Conclusion


  • Using R package {bigstatsr}, you can fit penalized regressions on 100s of GBs of data (any matrix-like data).

  • E.g. to build polygenic scores for 240 traits in the UK Biobank

  • However, more work is needed (e.g. to improve PGS in other ancestries)

24 / 25


Thanks!


Presentation available at
https://privefl.github.io/thesis-docs/PLR-genetics.html



privefl · F. Privé

Slides created via the R package xaringan.

25 / 25
