Slides: https://privefl.github.io/thesis-docs/PLR-genetics.html
Polygenic scores (PGS) = (predictive) scores that combine many genetic variants (linearly here).
Polygenic risk scores (PRS) = PGS for disease risk (e.g. the probability of being diagnosed during one's lifetime).
Major benefit of PRS: genetic variants mostly do not change during one's lifetime, so such risk can be derived from birth.
What is hard: most common traits or diseases are associated with many genetic variants, usually with very small effect sizes.
Limitation: prediction from genetic variants is bounded by heritability.
Height is up to 80% heritable; here we can predict ~40% of the variance using genetics (400K individuals × 560K genetic variants).
Gluten intolerance is an autoimmune disease with large effects on chromosome 6.
We want to solve

$$y = \beta_0 + \beta_1 G_1 + \cdots + \beta_p G_p + \gamma_1 COV_1 + \cdots + \gamma_q COV_q + \epsilon~.$$

Let $\beta = (\beta_0, \beta_1, \ldots, \beta_p, \gamma_1, \ldots, \gamma_q)$ and $X = [1; G_1; \ldots; G_p; COV_1; \ldots; COV_q]$; then

$$y = X\beta + \epsilon~.$$

This is equivalent to minimizing

$$||y - X\beta||_2^2 = ||\epsilon||_2^2~,$$

whose solution is

$$\beta = (X^T X)^{-1} X^T y~.$$

What is the problem when analyzing genotype data? $n < p$.
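This closed form can be checked numerically. A minimal NumPy sketch with made-up toy data (illustrative only, not how genotype data is actually analyzed): when $n > p$ the normal equations are solvable, but when $n < p$ the $p \times p$ matrix $X^T X$ has rank at most $n$, so it is singular and the closed form breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)

# n > p: the ordinary least-squares solution is well defined.
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# beta = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 1))  # close to beta_true

# n < p: X^T X is p x p but has rank at most n, hence it is singular
# and (X^T X)^{-1} does not exist.
n, p = 5, 100
X = rng.standard_normal((n, p))
print(np.linalg.matrix_rank(X.T @ X))  # at most n = 5, far below p = 100
```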
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_2^2~,$$

whose solution is

$$\beta = (X^T X + \lambda I)^{-1} X^T y~.$$

This is L2-regularization ("ridge", Hoerl and Kennard, 1970); it shrinks the coefficients β towards 0.
https://doi.org/10.1080/00401706.1970.10488634
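As a quick numerical check (a NumPy sketch with made-up toy data, not the implementation used by any of the packages discussed here): the ridge system is solvable even when $n < p$, because adding $\lambda I$ makes $X^T X + \lambda I$ invertible, and larger $\lambda$ shrinks the coefficients more.

```python
import numpy as np

rng = np.random.default_rng(1)

# n < p: X^T X is singular, but X^T X + lambda * I is invertible,
# so the ridge solution always exists.
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, -0.5, 0.5]  # only a few real effects
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Larger lambda shrinks the coefficients more: ||beta|| decreases.
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (1.0, 10.0, 100.0)]
print(norms)  # a decreasing sequence
```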
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda ||\beta||_1~,$$

which has no closed form but can be solved using iterative algorithms.

This is L1-regularization ("lasso", Tibshirani, 1996); it forces some of the coefficients to be exactly 0 and can therefore be used for variable selection, leading to sparse models.
https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
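One common iterative algorithm for the lasso is cyclical coordinate descent with soft-thresholding. The didactic NumPy sketch below (toy data; not the optimized implementations in {glmnet} or {bigstatsr}) shows how the soft-threshold sets many coefficients to exactly 0:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: shrink z towards 0 by t, or set it to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # residual ignoring feature j's current contribution
            r_j = y - X @ beta + X[:, j] * beta[j]
            # exact minimizer of the one-dimensional subproblem in beta_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]  # only 3 of 200 variables matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_lasso = lasso_cd(X, y, lam=50.0)
print(np.count_nonzero(beta_lasso))  # sparse: most coefficients are exactly 0
```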
Instead, we can minimize

$$||y - X\beta||_2^2 + \lambda \left( \alpha ||\beta||_1 + (1 - \alpha) ||\beta||_2^2 \right)~,$$

which has no closed form but can be solved using iterative algorithms ($0 \le \alpha \le 1$).

This is combined L1- and L2-regularization ("elastic net", Zou and Hastie, 2005); it is a compromise between the two previous penalties.
https://doi.org/10.1111/j.1467-9868.2005.00503.x
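The penalty itself is simple to evaluate; a tiny sketch (toy numbers) showing that α interpolates linearly between the lasso penalty (α = 1) and the ridge penalty (α = 0):

```python
import numpy as np

def enet_penalty(beta, lam, alpha):
    """Elastic-net penalty: lam * (alpha * ||b||_1 + (1 - alpha) * ||b||_2^2)."""
    return lam * (alpha * np.abs(beta).sum() + (1 - alpha) * (beta ** 2).sum())

beta = np.array([1.0, -2.0, 0.0, 0.5])  # ||b||_1 = 3.5, ||b||_2^2 = 5.25
lam = 2.0

print(enet_penalty(beta, lam, 1.0))  # lasso penalty: 2 * 3.5 = 7.0
print(enet_penalty(beta, lam, 0.0))  # ridge penalty: 2 * 5.25 = 10.5
print(enet_penalty(beta, lam, 0.5))  # halfway between the two: 8.75
```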
Makes it possible to solve linear problems when n < p.
Generally prevents overfitting (because effects are shrunk to be smaller).
However, with e.g. the R package {glmnet}, fitting can take a long time on genetic data.
automatic choice of the two hyper-parameters λ and α
faster (mainly thanks to an early-stopping criterion and no need for refitting)
memory efficient (because data is stored on disk)
parallelization of fitting (easy because data is on disk)
So, it can easily be applied to huge data.
Also two new options:
use of a different scaling (the default divides variants by their SD)
adaptive lasso (larger marginal effects are penalized less)
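The exact weighting used in {bigstatsr} is described in the linked papers; below is only a generic NumPy sketch of the adaptive-lasso idea (the weighting scheme and toy data here are hypothetical): per-variable penalty factors are derived from the marginal effects, so variants with larger marginal effects are penalized less.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X[:, 0] * 2.0 + rng.standard_normal(n)  # only variant 0 has a real effect

# Marginal (univariate) effect of each variant: slope of y ~ x_j alone.
marginal = X.T @ y / (X ** 2).sum(axis=0)

# Adaptive-lasso idea: penalty weights inversely proportional to the
# marginal effect sizes, so strong variants are shrunk less.
weights = 1.0 / np.abs(marginal)
weights /= weights.mean()  # normalize around 1
print(weights)  # variant 0 gets the smallest penalty weight
```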
1,117,182 variants (HapMap3)
Training based on 434,868 individuals of European ancestry + testing in 8636 South Asians, 1803 East Asians and 6983 Africans
111 continuous + 129 binary = 240 phenotypes
Some examples:
multiple diseases: cancers, diabetes, autoimmune, etc.
body measures: height, BMI, BMD, etc.
blood biochemistry: cholesterol, vitamin D, etc.
ECG measures
misc
[Figure: robust slope of pcor_other ~ pcor_eur, squared, per ancestry group]
Running time is quadratic in the number of non-zero variables.
Using R package {bigstatsr}, you can fit penalized regressions on 100s of GBs of data (any matrix-like data).
E.g. to build polygenic scores for 240 traits in the UK Biobank
However, more work is needed (e.g. to improve PGS in other ancestries)
Paper describing R packages {bigstatsr} and {bigsnpr}:
https://doi.org/10.1093/bioinformatics/bty185
Paper specifically describing the penalized regression implementation:
https://doi.org/10.1534/genetics.119.302019
Tutorial:
https://privefl.github.io/bigstatsr/articles/penalized-regressions.html
Presentation available at
https://privefl.github.io/thesis-docs/PLR-genetics.html
Slides created via the R package xaringan.