Thesis progress review no. 1

class: center, middle, inverse, title-slide

# Thesis progress review no. 1
### Florian Privé
### September 28, 2017

---

## Outline

1. Main objective of the thesis

2. Data analyzed

3. R packages

4. Future work

---

class: center, middle, inverse

# Main objective

---

## Compute polygenic risk scores 
### in order to differentiate a healthy person from a diseased person

<img src="figures/density-scores.jpeg" width="80%" style="display: block; margin: auto;" />

---

## Usefulness

### Precision medicine

<img src="https://www.ucdmc.ucdavis.edu/precision-medicine/images/pmSlide1.jpg" width="100%" style="display: block; margin: auto;" />

.footnote[Source: https://www.ucdmc.ucdavis.edu/precision-medicine/]

---

## Data analyzed so far

### Case/control cohort for celiac disease

.footnote[(Dubois et al., 2010)]

---

## Celiac disease

### Intolerance to gluten

<img src="http://www.strettoweb.com/wp-content/uploads/2016/12/celiaci.jpg" width="60%" style="display: block; margin: auto;" />

<center>A gluten-free diet is the only treatment.</center>

---

## Celiac disease

### Prevalence of 1% in Western countries, but...

<img src="https://www.beyondceliac.org/SiteData/images/FastFacts2/413d26a2a7026920/FastFacts_2.png" width="100%" style="display: block; margin: auto;" />

.footnote[Source: https://www.beyondceliac.org/celiac-disease/facts-and-figures/]

---

## Celiac disease

### The dataset: SNP array with

<br>
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Population </th>
   <th style="text-align:center;"> UK </th>
   <th style="text-align:center;"> Finland </th>
   <th style="text-align:center;"> Netherlands </th>
   <th style="text-align:center;"> Italy </th>
   <th style="text-align:right;"> Total </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Cases </td>
   <td style="text-align:center;"> 2586 </td>
   <td style="text-align:center;"> 647 </td>
   <td style="text-align:center;"> 803 </td>
   <td style="text-align:center;"> 497 </td>
   <td style="text-align:right;"> 4533 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Controls </td>
   <td style="text-align:center;"> 7532 </td>
   <td style="text-align:center;"> 1829 </td>
   <td style="text-align:center;"> 846 </td>
   <td style="text-align:center;"> 543 </td>
   <td style="text-align:right;"> 10750 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Total </td>
   <td style="text-align:center;"> 10118 </td>
   <td style="text-align:center;"> 2476 </td>
   <td style="text-align:center;"> 1649 </td>
   <td style="text-align:center;"> 1040 </td>
   <td style="text-align:right;"> 15283 </td>
  </tr>
</tbody>
</table>

<br>

<center>over ~300K SNPs</center>

<br>

These data would take **32 GB** if stored in RAM as a standard R matrix.
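
A back-of-the-envelope check of that order of magnitude, assuming every genotype is stored as an 8-byte double and rounding the SNP count to exactly 300,000 (the slide only says ~300K). The result depends on both assumptions and on whether GB or GiB is meant, so it is not expected to reproduce the quoted figure exactly:

```python
# Rough memory footprint of a dense numeric matrix of genotypes.
n_individuals = 15283      # total from the table above
n_snps = 300_000           # assumption: "~300K" rounded to 300,000
bytes_per_value = 8        # assumption: one double-precision value per genotype

total_bytes = n_individuals * n_snps * bytes_per_value
gigabytes = total_bytes / 1e9      # decimal gigabytes
gibibytes = total_bytes / 2**30    # binary gibibytes

print(f"{gigabytes:.1f} GB, {gibibytes:.1f} GiB")
```

Either way, the matrix is far too large to hold comfortably in RAM next to the working copies that an analysis creates.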

---

class: center, middle, inverse

# Polygenic Risk Scores

---

## Standard model used in the human genetics literature

### P+T procedure, begins with a GWAS

<img src="figures/celiac-gwas-cut.png" width="90%" style="display: block; margin: auto;" />

<center>+ Pruning and Thresholding</center>

---

### All steps required in the P+T procedure.

<img src="figures/steps-PT.svg" width="80%" style="display: block; margin: auto;" />
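
The final scoring step of P+T can be sketched as follows. This is a minimal illustration, not the pipeline shown in the figure: the function name `pt_score`, the toy effect sizes, the p-values and the set of SNPs kept after pruning are all hypothetical.

```python
def pt_score(genotypes, betas, pvalues, kept_after_pruning, p_threshold):
    """Polygenic score of one individual under P+T.

    Sums beta_j * allele count over the SNPs that survived pruning
    and whose GWAS p-value is below the chosen threshold.
    """
    return sum(
        betas[j] * genotypes[j]
        for j in kept_after_pruning
        if pvalues[j] < p_threshold
    )

# Hypothetical GWAS summary statistics for 4 SNPs.
betas = [0.30, -0.10, 0.25, 0.05]
pvalues = [1e-9, 0.20, 1e-6, 0.04]
kept = [0, 1, 3]           # assumption: SNP 2 was removed by pruning
individual = [2, 1, 0, 1]  # allele counts (0, 1 or 2) for one person

score = pt_score(individual, betas, pvalues, kept, p_threshold=0.05)
print(score)
```

Each weight comes from a separate marginal test, and both the pruned set and the threshold are chosen outside the model.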

---

### PCA in GWAS?

<br><br>

<div class="figure" style="text-align: center">
<img src="figures/qqplot1.png" alt="Inflated Q-Q plot when not correcting for population structure." width="70%" />
<p class="caption">Inflated Q-Q plot when not correcting for population structure.</p>
</div>

---

### PCA in GWAS?

Principal components are used to adjust for the confounding effect of population structure (Patterson,
Price, and Reich 2006).

<div class="figure" style="text-align: center">
<img src="figures/qqplot2.png" alt="Less inflated Q-Q plot when correcting for population structure." width="70%" />
<p class="caption">Less inflated Q-Q plot when correcting for population structure.</p>
</div>

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix

<img src="figures/PC-1-2.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix

<img src="figures/PC-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix **with pruning** (Abdellaoui et al. 2013)

<img src="figures/PC2-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### On the whole (scaled) matrix **with pruning**

<img src="figures/PC3-3-4.png" width="70%" style="display: block; margin: auto;" />

---

## How to perform PCA for genetic data?

### Capture long-range LD regions of chromosomes 6 and 8

<img src="figures/load3-3-4.png" width="70%" style="display: block; margin: auto;" />

---

### Long-range LD regions for the human genome (Price et al. 2008)

<div id="htmlwidget-77fda32265a3e9629152" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-77fda32265a3e9629152">{"x":{"filter":"none","data":[[1,2,2,2,3,3,3,5,5,5,5,6,6,6,7,8,8,8,10,11,11,12,12,20,23,23,23,23,23,23,23,23,23,23],[48060567,85941853,134382738,182882739,47500000,83500000,89000000,44500000,98000000,129000000,135500000,25500000,57000000,140000000,55193285,8000000,43000000,112000000,37000000,46000000,87500000,33000000,109521663,32000000,14150264,25650264,33150264,55133704,65133704,71633704,80080511,100580511,125602146,129102146],[52060567,100407914,137882738,189882739,50000000,87000000,97500000,50500000,100500000,132000000,138500000,33500000,64000000,142500000,66193285,12000000,50000000,115000000,43000000,57000000,90500000,40000000,112021663,34500000,16650264,28650264,35650264,60500000,67633704,77580511,86080511,103080511,128102146,131602146],["hild1","hild2","hild3","hild4","hild5","hild6","hild7","hild8","hild9","hild10","hild11","hild12","hild13","hild14","hild15","hild16","hild17","hild18","hild19","hild20","hild21","hild22","hild23","hild24","hild25","hild26","hild27","hild28","hild29","hild30","hild31","hild32","hild33","hild34"]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th>Chr<\/th>\n      <th>Start<\/th>\n      <th>Stop<\/th>\n      <th>ID<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"pageLength":7,"columnDefs":[{"className":"dt-right","targets":[0,1,2]}],"order":[],"autoWidth":false,"orderClasses":false,"lengthMenu":[7,10,25,50,100]}},"evals":[],"jsHooks":[]}</script>

.footnote[Source: https://goo.gl/wTPY7n]
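
Excluding SNPs that fall in such regions before computing the PCA can be sketched as below. Only two regions from the table are copied (chromosome 6 and chromosome 8); the helper name `in_long_range_ld`, the toy SNP positions and the inclusive boundaries are assumptions made for illustration.

```python
# (chromosome, start, stop) for two of the regions listed in the table.
LONG_RANGE_LD = [
    (6, 25_500_000, 33_500_000),
    (8, 8_000_000, 12_000_000),
]

def in_long_range_ld(chrom, pos, regions=LONG_RANGE_LD):
    """Return True if the SNP at (chrom, pos) lies inside one of the regions.

    Boundaries are treated as inclusive (an assumption).
    """
    return any(c == chrom and start <= pos <= stop for c, start, stop in regions)

# Hypothetical SNPs given as (chromosome, position in base pairs).
snps = [(6, 30_000_000), (6, 40_000_000), (8, 9_000_000), (1, 9_000_000)]

# Keep only the SNPs outside the long-range LD regions.
kept = [snp for snp in snps if not in_long_range_ld(*snp)]
print(kept)
```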

---

## How to perform PCA for genetic data?

### When removing long-range LD regions and pruning

<img src="figures/PC4-3-4.png" width="70%" style="display: block; margin: auto;" />

.footnote[Source: https://goo.gl/wTPY7n]

---

## Not sure that people always do it correctly

- Importance of PCA in genetic association studies:

<img src="figures/price-cite.png" width="80%" style="display: block; margin: auto;" />

- Importance of pruning in computing PCA:

<img src="figures/pruning-cite.png" width="80%" style="display: block; margin: auto;" />

This may not warrant another paper, but it certainly deserves a vignette in the package.

---

class: center, middle, inverse

# Polygenic Risk Scores

## Another approach

---

## Reminder of what we want to achieve

### Predict a phenotype: pitfalls of the P+T model

- Weights learned independently

- Correlation taken care of heuristically (with pruning)

- Regularization taken care of heuristically (with thresholding)

### A better solution?

For example, for binary outcomes, why not use

- logistic regression

- Support Vector Machine (SVM)

on the whole matrix (+ PCs)?

---

class: center, middle, inverse

# Big Data

### Simpler solutions are easier to implement

---

## What I want to be able to do

### Data analysis on large-scale genotype matrices!

- Test many ideas quickly

    - code should be fast
    - I shouldn't have to make many conversions
    - functions should be easy to combine
    
- Not be restricted in my analysis
   
    - basically, use everything I already know in R
    
- Work on my own computer

    - I have 64 GB of RAM and 12 cores
    - working on a server is not as easy as working on my computer

<br><center>**Smooth and fast analysis!**</center>

---

## Memory problem when working in R

<br>

<img src="figures/memory-problem.svg" width="80%" style="display: block; margin: auto;" />

---

## Memory solution when working in R

<br>

<img src="figures/memory-solution.svg" width="80%" style="display: block; margin: auto;" />

.footnote[I no longer use **bigmemory**, but I still use something very similar.]

---

## My first paper (as a [preprint](https://www.biorxiv.org/content/early/2017/09/19/190926))

<img src="figures/mypaper.png" width="80%" style="display: block; margin: auto;" />

---

## Two R packages

### bigstatsr and bigsnpr

<br>

- **bigstatsr** for many types of matrices, usable in any field of research

- **bigsnpr** for functions that are specific to the analysis of SNP arrays

<br>

<img src="figures/gad-abraham.png" width="50%" style="display: block; margin: auto;" />

.footnote[Gad Abraham will probably be one of the reviewers of my paper.]

---

## Comparative performance

### Computing partial SVD

<img src="figures/benchmark-pca.png" width="80%" style="display: block; margin: auto;" />

---

## Ease the development of new methods

### E.g. `snp_autoSVD`

<img src="figures/svd.svg" width="60%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Next paper

### Comparison of methods for computing PRS

---

## Assessing predictive performance

AUC (Area Under the ROC Curve) is often used.

<img src="https://i.stack.imgur.com/5x3Xj.png" width="40%" style="display: block; margin: auto;" />

***

> The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. (Fawcett, 2006)

`$$\text{AUC} = P(S_\text{case} > S_\text{control})$$`
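
This probabilistic definition can be checked directly by comparing every case score with every control score. The sketch below is only an illustration with made-up scores; counting ties as one half is the usual convention and is an assumption here.

```python
from itertools import product

def auc(case_scores, control_scores):
    """Estimate P(S_case > S_control) over all case/control pairs.

    Ties count for one half (a common convention, assumed here).
    """
    total = 0.0
    for s_case, s_ctrl in product(case_scores, control_scores):
        if s_case > s_ctrl:
            total += 1.0
        elif s_case == s_ctrl:
            total += 0.5
    return total / (len(case_scores) * len(control_scores))

# Made-up scores: 3 cases and 4 controls, hence 12 pairs.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2, 0.1]
print(auc(cases, controls))  # 11 of the 12 pairs are ranked correctly
```

The pairwise loop costs one comparison per case/control pair, so it is meant for understanding rather than for large cohorts.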

---

### PRS with P+T on Celiac

<img src="figures/AUC-PRS.png" width="75%" style="display: block; margin: auto;" />

---

### PRS with bigstatsr's regularized logistic regression on Celiac

#### `big_spLog`

<img src="figures/AUC-spLog.png" width="80%" style="display: block; margin: auto;" />

---

## Choose the number of predictors

### Cross-Model Selection and Averaging (`big_CMSA`)

1. This function splits the training set into K folds (e.g. 10).

2. In turn,

- each fold is considered as an inner validation set and the other (K - 1) folds form an inner training set,

- the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,

- the vector of scores which maximizes `feval` is determined,

- the vector of coefficients corresponding to the previous vector of scores is chosen.

3. The K resulting vectors of coefficients are then combined into one vector.
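
The steps above can be sketched in a few lines. This is not the code of `big_CMSA`: `candidate_models` is a hypothetical toy stand-in for a regularization path (it keeps the top-m predictors ranked by case/control mean difference), `feval` is taken to be the AUC, and combining the K vectors by a plain mean is an assumption.

```python
import random

def auc(scores, labels):
    """P(score of a case > score of a control); ties count for one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    ctrls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in ctrls)
    return wins / (len(cases) * len(ctrls))

def candidate_models(X, y):
    """Toy model path: one coefficient vector per number m of predictors kept."""
    p, n1 = len(X[0]), sum(y)
    n0 = len(y) - n1
    diff = [sum(r[j] for r, yi in zip(X, y) if yi == 1) / n1
            - sum(r[j] for r, yi in zip(X, y) if yi == 0) / n0
            for j in range(p)]
    order = sorted(range(p), key=lambda j: -abs(diff[j]))
    return [[diff[j] if j in order[:m] else 0.0 for j in range(p)]
            for m in range(1, p + 1)]

def predict(X, coef):
    """Linear scores for every row of X."""
    return [sum(c * x for c, x in zip(coef, row)) for row in X]

def cmsa(X, y, K=5, feval=auc, seed=1):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]          # step 1: K folds
    chosen = []
    for val in folds:                              # step 2: each fold in turn
        val_set = set(val)
        train = [i for i in idx if i not in val_set]
        models = candidate_models([X[i] for i in train], [y[i] for i in train])
        X_val, y_val = [X[i] for i in val], [y[i] for i in val]
        # keep the coefficients whose validation scores maximize feval
        chosen.append(max(models, key=lambda c: feval(predict(X_val, c), y_val)))
    p = len(X[0])
    return [sum(c[j] for c in chosen) / K for j in range(p)]  # step 3: mean

# Simulated data: 200 individuals, 6 predictors, only the first 2 informative.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(yi if j < 2 else 0.0, 1.0) for j in range(6)] for yi in y]

coef = cmsa(X, y)
print(coef, auc(predict(X, coef), y))
```

Each fold picks its own model size, so the combined vector averages over K possibly different selections instead of committing to a single one.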

---

## Using CMSA for `big_spLog`

### works really well

<img src="figures/AUC-spLog-CMSA.png" width="80%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Future work

### UK Biobank

---

## UK Biobank

<img src="figures/UKB.png" width="80%" style="display: block; margin: auto;" />

---

## UK Biobank

### Some possible prospects

- compare heritability with our predictive performance

- use SNPs AND environmental data to improve predictive performance

- find out how to predict well in external populations

---

## Bonus

### When you earn 275 pts on Stack Overflow with your new package

<img src="figures/screenshot.png" width="69%" style="display: block; margin: auto;" />

<img src="http://weknowyourdreams.com/images/smile/smile-08.jpg" width="22%" style="display: block; margin: auto;" />

---

class: center, middle, inverse

# Thanks!

### Presentation available at
### https://privefl.github.io/thesis-docs/suivi-these.html

.footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]