In this vignette, I show how to perform Principal Component Analysis (PCA) with packages bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). I also show why pruning and removing long-range Linkage Disequilibrium (LD) are two important steps before computing PCs in order to capture population structure (Abdellaoui et al. 2013).
I use data from a case/control cohort on celiac disease (Dubois et al. 2010). The data has already been QCed, genotyped SNPs have been imputed and binary PLINK files have been converted to the “bigSNP” format used in bigsnpr (see these preprocessing steps).
library(bigsnpr)
## Loading required package: bigstatsr
library(ggplot2)
celiac <- snp_attach("backingfiles/celiacQC.rds")
G <- celiac$genotypes
CHR <- celiac$map$chromosome
POS <- celiac$map$physical.pos
NCORES <- nb_cores()
# "Verification" there is no missing value
big_counts(G, ind.col = 1:12) # OK
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## 0 418 1166 425 1114 165 293 354 2044 81 601 109 190
## 1 4073 6054 4198 5968 2720 3553 3843 6837 2008 4620 2262 3190
## 2 10664 7935 10532 8073 12270 11309 10958 6274 13066 9934 12784 11775
## <NA> 0 0 0 0 0 0 0 0 0 0 0 0
# Get population from external files
pop.files <- list.files(path = "data", pattern = "cluster_*", full.names = TRUE)
pop <- snp_getSampleInfos(celiac, pop.files)[[1]]
pop.names <- c("Netherlands", "Italy", "UK1", "UK2", "Finland")
svd1 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES)
plot(svd1, type = "scores") +
aes(color = pop.names[pop]) +
labs(color = "Population")
First two PCs capture population structure.
plot(svd1, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")