class: center, middle, inverse, title-slide

# Thesis progress review no. 1
### Florian Privé
### September 28, 2017

---

## Outline

1. Main objective of the thesis

2. Data analyzed

3. R packages

4. Future work

---

class: center, middle, inverse

# Main objective

---

## Compute polygenic risk scores

### in order to differentiate a healthy person from a diseased person

<img src="figures/density-scores.jpeg" width="80%" style="display: block; margin: auto;" />

---

## Usefulness

### Precision medicine

<img src="https://www.ucdmc.ucdavis.edu/precision-medicine/images/pmSlide1.jpg" width="100%" style="display: block; margin: auto;" />

.footnote[Source: https://www.ucdmc.ucdavis.edu/precision-medicine/]

---

## Data analyzed for now

### Case/control cohort for celiac disease

.footnote[(Dubois et al., 2010)]

---

## Celiac disease

### Intolerance to gluten

<img src="http://www.strettoweb.com/wp-content/uploads/2016/12/celiaci.jpg" width="60%" style="display: block; margin: auto;" />

<center>A gluten-free diet is the only treatment.</center>

---

## Celiac disease

### Prevalence of 1% in western countries but...

<img src="https://www.beyondceliac.org/SiteData/images/FastFacts2/413d26a2a7026920/FastFacts_2.png" width="100%" style="display: block; margin: auto;" />

.footnote[Source: https://www.beyondceliac.org/celiac-disease/facts-and-figures/]

---

## Celiac disease

### The dataset: SNP array with

<br>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Population </th>
   <th style="text-align:center;"> UK </th>
   <th style="text-align:center;"> Finland </th>
   <th style="text-align:center;"> Netherlands </th>
   <th style="text-align:center;"> Italy </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Cases </td>
   <td style="text-align:center;"> 2586 </td>
   <td style="text-align:center;"> 647 </td>
   <td style="text-align:center;"> 803 </td>
   <td style="text-align:center;"> 497 </td>
   <td style="text-align:right;"> 4533 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Controls </td>
   <td style="text-align:center;"> 7532 </td>
   <td style="text-align:center;"> 1829 </td>
   <td style="text-align:center;"> 846 </td>
   <td style="text-align:center;"> 543 </td>
   <td style="text-align:right;"> 10750 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total </td>
   <td style="text-align:center;"> 10118 </td>
   <td style="text-align:center;"> 2476 </td>
   <td style="text-align:center;"> 1649 </td>
   <td style="text-align:center;"> 1040 </td>
   <td style="text-align:right;"> 15283 </td>
  </tr>
</tbody>
</table>

<br>

<center>over ~300K SNPs</center>

<br>

This data would take **32GB** if stored in RAM as a standard R matrix.

---

class: center, middle, inverse

# Polygenic Risk Scores

---

## Standard model used in the human genetics literature

### P+T procedure, begins with a GWAS

<img src="figures/celiac-gwas-cut.png" width="90%" style="display: block; margin: auto;" />

<center>+ Pruning and Thresholding</center>

---

### All steps required in the P+T procedure

<img src="figures/steps-PT.svg" width="80%" style="display: block; margin: auto;" />

---

### PCA in GWAS?

<br><br>

<div class="figure" style="text-align: center">
<img src="figures/qqplot1.png" alt="Inflated Q-Q plot when not correcting for population structure." width="70%" />
<p class="caption">Inflated Q-Q plot when not correcting for population structure.</p>
</div>

---

### PCA in GWAS?

Principal components are used to adjust for the confounding effect of population structure (Patterson, Price, and Reich 2006).

<div class="figure" style="text-align: center">
<img src="figures/qqplot2.png" alt="Less inflated Q-Q plot when correcting for population structure." width="70%" />
<p class="caption">Less inflated Q-Q plot when correcting for population structure.</p>
</div>

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix

<img src="figures/PC-1-2.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix

<img src="figures/PC-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix **with pruning** (Abdellaoui et al. 2013)

<img src="figures/PC2-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix **with pruning**

<img src="figures/PC3-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### Capture long-range LD regions of chromosomes 6 and 8

<img src="figures/load3-3-4.png" width="70%" style="display: block; margin: auto;" />

---

### Long-range LD regions for the human genome (Price et al. 2008)

.footnote[Source: https://goo.gl/wTPY7n]

---

## How to perform PCA for genetic data?

### When removing long-range LD regions and pruning

<img src="figures/PC4-3-4.png" width="70%" style="display: block; margin: auto;" />

.footnote[Source: https://goo.gl/wTPY7n]

---

## Not sure that people always do it correctly

- Importance of PCA in genetic association studies:

<img src="figures/price-cite.png" width="80%" style="display: block; margin: auto;" />

- Importance of pruning in computing PCA:

<img src="figures/pruning-cite.png" width="80%" style="display: block; margin: auto;" />

This may not warrant another paper, but it certainly deserves a vignette for the package.

---

class: center, middle, inverse

# Polygenic Risk Scores

## Another approach

---

## Reminder of what we want to achieve

### Predict a phenotype: pitfalls of the P+T model

- Weights learned independently

- Correlation taken care of heuristically (with pruning)

- Regularization taken care of heuristically (with thresholding)

### A better solution?

For example, for binary outcomes, why not use

- logistic regression

- Support Vector Machine (SVM)

on the whole matrix (+ PCs)?

---

class: center, middle, inverse

# Big Data

### Simpler solutions are easier to implement

---

## What I want to be able to do

### Data analysis on large-scale genotype matrices!

- Be fast to test many ideas quickly
    - code should be fast
    - I shouldn't have to make many conversions
    - easily combine multiple functions

- Not be restricted in my analysis
    - Basically use all I already know in R

- Work on my computer
    - I have 64 GB of RAM and 12 cores
    - Working on a server is not as easy as on my computer

<br><center>**Smooth and fast analysis!**</center>

---

## Memory problem when working in R

<br>

<img src="figures/memory-problem.svg" width="80%" style="display: block; margin: auto;" />

---

## Memory solution when working in R

<br>

<img src="figures/memory-solution.svg" width="80%" style="display: block; margin: auto;" />

.footnote[I no longer use **bigmemory**, but I still use something very similar.]

---

## My first paper (as a [preprint](https://www.biorxiv.org/content/early/2017/09/19/190926))

<img src="figures/mypaper.png" width="80%" style="display: block; margin: auto;" />

---

## Two R packages

### bigstatsr and bigsnpr

<br>

- **bigstatsr** for many types of matrix, to be used by any field of research

- **bigsnpr** for functions which are specific to the analysis of SNP arrays

<br>

<img src="figures/gad-abraham.png" width="50%" style="display: block; margin: auto;" />

.footnote[Gad Abraham would probably be one of the reviewers of my paper.]

---

## Comparative performance

### Computing partial SVD

<img src="figures/benchmark-pca.png" width="80%" style="display: block; margin: auto;" />

---

## Ease the development of new methods

### E.g. `snp_autoSVD`

<img src="figures/svd.svg" width="60%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Next paper

### Comparison of methods for computing PRS

---

## Assessing predictive performance

AUC (Area Under the ROC Curve) is often used.

<img src="https://i.stack.imgur.com/5x3Xj.png" width="40%" style="display: block; margin: auto;" />

***

> The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. (Fawcett, 2006)

`$$\text{AUC} = P(S_\text{case} > S_\text{control})$$`

---

### PRS with P+T on Celiac

<img src="figures/AUC-PRS.png" width="75%" style="display: block; margin: auto;" />

---

### PRS with bigstatsr's regularized logistic regression on Celiac

#### `big_spLog`

<img src="figures/AUC-spLog.png" width="80%" style="display: block; margin: auto;" />

---

## Choose the number of predictors

### Cross-Model Selection and Averaging (`big_CMSA`)

1. This function separates the training set into K folds (e.g. 10).

2. In turn,
    - each fold is considered as an inner validation set and the other (K - 1) folds form an inner training set,
    - the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
    - the vector of scores which maximizes `feval` is determined,
    - the vector of coefficients corresponding to the previous vector of scores is chosen.

3. The K resulting vectors of coefficients are then combined into one vector.
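
---

## CMSA: a minimal sketch in base R

The three steps of the previous slide can be sketched as follows. This is an illustration under stated assumptions, not the code of `big_CMSA`: ridge regression solved in closed form stands in for the regularized logistic regression, the K coefficient vectors are combined by a plain average, and `auc`, `ridge_path`, `cmsa_sketch` and the `lambdas` grid are hypothetical names chosen for this example.

```r
# AUC = P(S_case > S_control), computed from ranks (ties count as 1/2).
auc <- function(scores, y) {
  r  <- rank(scores)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Assumed stand-in model: ridge regression in closed form.
# Returns one column of coefficients per value of lambda.
ridge_path <- function(X, y, lambdas) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  do.call(cbind, lapply(lambdas, function(l) {
    drop(solve(XtX + l * diag(ncol(X)), Xty))
  }))
}

cmsa_sketch <- function(X, y, K = 10, lambdas = 10^(-2:3), feval = auc) {
  # 1. Separate the training set into K folds.
  folds <- sample(rep_len(seq_len(K), nrow(X)))
  # 2. Each fold in turn is the inner validation set.
  betas <- do.call(cbind, lapply(seq_len(K), function(k) {
    val    <- folds == k
    path   <- ridge_path(X[!val, , drop = FALSE], y[!val], lambdas)
    scores <- X[val, , drop = FALSE] %*% path   # one score vector per lambda
    perf   <- apply(scores, 2, feval, y = y[val])
    path[, which.max(perf)]                     # coefficients of the best scores
  }))
  # 3. Combine the K coefficient vectors (assumption: simple average).
  rowMeans(betas)
}

# Usage on simulated data (only the first variable carries signal).
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
y <- as.integer(X[, 1] + rnorm(200) > 0)
beta <- cmsa_sketch(X, y, K = 5)
auc(drop(X %*% beta), y)
```

Passing the criterion as `feval` keeps the selection step independent of the metric, so AUC could be swapped for another measure without touching the loop.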

---

## Using CMSA for `big_spLog`

### works really well

<img src="figures/AUC-spLog-CMSA.png" width="80%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Future work

### UK Biobank

---

## UK Biobank

<img src="figures/UKB.png" width="80%" style="display: block; margin: auto;" />

---

## UK Biobank

### Some possible prospects

- compare heritability with our predictive performance

- use SNPs AND environmental data to improve predictive performance

- find how to predict well on external populations

---

## Bonus

### When you earn 275 pts on Stack Overflow with your new package

<img src="figures/screenshot.png" width="69%" style="display: block; margin: auto;" />

<img src="http://weknowyourdreams.com/images/smile/smile-08.jpg" width="22%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Thanks!

### Presentation available at

### https://privefl.github.io/thesis-docs/suivi-these.html

.footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]