In this document, I show how to use some of the features of packages {bigsnpr} and {bigstatsr}. Note that many functions used here come from package {bigstatsr} and could therefore be used on other data encoded as matrix-like (outside of the field of genotype data).

Get data

Download data and unzip files. I store those files in a directory called "tmp-data" here.

You can see there how I generated these data from the 1000 Genomes project.

What you must do

You need to

For this, you can use whatever tools you want because the data is quite small. In the following section, I give some (scalable) solutions using using packages {bigstatsr} and {bigsnpr}.

Solution using {bigstatsr} and {bigsnpr}

Population structure: Principal Component Analysis

Let us compute first principal components of the scaled genotype matrix:

plot(svd, type = "scores", scores = 3:4) +
  aes(color = pop)

# Loadings (effects of each variable for each PC)
plot(svd, type = "loadings", loadings = 1:10, coeff = 0.4)

Looking at the loadings, we can see that the PCA captures some variation due to large correlation between variables. To learn more about this possible pitfall, please look at this vignette.

Association: Genome-Wide Association Study (GWAS)

# Q-Q plot
plot(gwas, type = "Q-Q") + xlim(1, NA)  # snp_qq(gwas) + xlim(1, NA)
## Warning: Removed 118148 rows containing missing values (geom_point).

# Manhattan plot
snp_manhattan(gwas, CHR, POS, npoints = 20e3) +
  geom_hline(yintercept = -log10(5e-8), color = "red")

Polygenic Risk Score (PRS)

with Clumping and Thresholding (C+T)

\[\rm{PRS}_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}~,\]

where \(\hat\beta_j\) (\(p_j\)) are the effect sizes (p-values) estimated from the GWAS and \(G_{i,j}\) is the allele count (genotype) for individual \(i\) and SNP \(j\).

sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
lpval <- -log10(sumstats$p)
ind.keep <- snp_clumping(G, CHR, ind.row = ind.train, S = lpval, infos.pos = POS, ncores = 3)
THR <- seq_log(1, 8, length.out = 20)
prs <- snp_PRS(G, sumstats$beta[ind.keep], ind.keep = ind.keep, 
               lpS.keep = lpval[ind.keep], thr.list = THR)
# Learn the optimal threshold on the training set
aucs <- apply(prs[ind.train, ], 2, AUC, target = y[ind.train])
plot(THR, aucs, xlab = "-log10(p-value)", ylab = "AUC", pch = 20)

## [1] 0.6890316

(TODO: ADD SCT)

with Penalized Logistic Regression (PLR)

\[\arg\!\min_{\beta_0,~\beta}(\lambda, \alpha)\left\{ \underbrace{ -\sum_{i=1}^n \left( y_i \log\left(p_i\right) + (1 - y_i) \log\left(1 - p_i\right) \right) }_\text{Loss function} + \underbrace{ \lambda \left((1-\alpha)\frac{1}{2}\|\beta\|_2^2 + \alpha \|\beta\|_1\right) }_\text{Penalization} \right\}\]

where

  • \(p_i=1/\left(1+\exp\left(-(\beta_0 + x_i^T\beta)\right)\right)\)

  • \(x\) is denoting the genotypes and covariables (e.g. principal components),

  • \(y\) is the disease status we want to predict,

  • \(\lambda\) is a regularization parameter that needs to be determined and

  • \(\alpha\) determines relative parts of the regularization \(0 \le \alpha \le 1\).


If you want to learn more about our implementation of PLR, please look at this paper.


## # A tibble: 5 x 7
##    alpha validation_loss intercept beta            nb_var message  all_conv
##    <dbl>           <dbl>     <dbl> <list>           <int> <list>   <lgl>   
## 1 0.0001           0.533     -1.53 <dbl [131,286]>  57528 <chr [5~ TRUE    
## 2 0.001            0.535     -1.82 <dbl [131,286]>  13438 <chr [5~ TRUE    
## 3 0.01             0.534     -1.97 <dbl [131,286]>   2615 <chr [5~ TRUE    
## 4 0.1              0.552     -2.06 <dbl [131,286]>    634 <chr [5~ TRUE    
## 5 1                0.564     -1.83 <dbl [131,286]>    277 <chr [5~ TRUE
## [1] 0.7278656