In this vignette, I show how to perform Principal Component Analysis (PCA) with packages bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). I also show why pruning and removing long-range Linkage Disequilibrium (LD) are two important steps before computing PCs in order to capture population structure (Abdellaoui et al. 2013).

Data

I use data from a case/control cohort on celiac disease (Dubois et al. 2010). The data has already been QCed, genotyped SNPs have been imputed and binary PLINK files have been converted to the “bigSNP” format used in bigsnpr (see these preprocessing steps).

library(bigsnpr)
## Loading required package: bigstatsr
library(ggplot2)

celiac <- snp_attach("backingfiles/celiacQC.rds")
G <- celiac$genotypes
CHR <- celiac$map$chromosome
POS <- celiac$map$physical.pos
NCORES <- nb_cores()

# "Verification" there is no missing value
big_counts(G, ind.col = 1:12) # OK
##       [,1] [,2]  [,3] [,4]  [,5]  [,6]  [,7] [,8]  [,9] [,10] [,11] [,12]
## 0      418 1166   425 1114   165   293   354 2044    81   601   109   190
## 1     4073 6054  4198 5968  2720  3553  3843 6837  2008  4620  2262  3190
## 2    10664 7935 10532 8073 12270 11309 10958 6274 13066  9934 12784 11775
## <NA>     0    0     0    0     0     0     0    0     0     0     0     0
# Get population from external files
pop.files <- list.files(path = "data", pattern = "cluster_*", full.names = TRUE)
pop <- snp_getSampleInfos(celiac, pop.files)[[1]]
pop.names <- c("Netherlands", "Italy", "UK1", "UK2", "Finland")

Principal Component Analysis

On the whole genotype matrix

svd1 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES)
plot(svd1, type = "scores") +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

First two PCs capture population structure.

plot(svd1, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")