In this vignette, I show how to perform Principal Component Analysis (PCA) with packages bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). I also show why pruning and removing long-range Linkage Disequilibrium (LD) are two important steps before computing PCs in order to capture population structure (Abdellaoui et al. 2013).
I use data from a case/control cohort on celiac disease (Dubois et al. 2010). The data has already been QCed, genotyped SNPs have been imputed and binary PLINK files have been converted to the “bigSNP” format used in bigsnpr (see these preprocessing steps).
library(bigsnpr)
## Loading required package: bigstatsr
library(ggplot2)
celiac <- snp_attach("backingfiles/celiacQC.rds")
G <- celiac$genotypes
CHR <- celiac$map$chromosome
POS <- celiac$map$physical.pos
NCORES <- nb_cores()
# "Verification" there is no missing value
big_counts(G, ind.col = 1:12) # OK
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## 0 418 1166 425 1114 165 293 354 2044 81 601 109 190
## 1 4073 6054 4198 5968 2720 3553 3843 6837 2008 4620 2262 3190
## 2 10664 7935 10532 8073 12270 11309 10958 6274 13066 9934 12784 11775
## <NA> 0 0 0 0 0 0 0 0 0 0 0 0
# Get population from external files
pop.files <- list.files(path = "data", pattern = "cluster_*", full.names = TRUE)
pop <- snp_getSampleInfos(celiac, pop.files)[[1]]
pop.names <- c("Netherlands", "Italy", "UK1", "UK2", "Finland")
svd1 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES)
plot(svd1, type = "scores") +
aes(color = pop.names[pop]) +
labs(color = "Population")
First two PCs capture population structure.
plot(svd1, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")
PC3 and PC4 don’t capture population structure.
# The SNP with max loading for PC3
theone1.1 <- which.max(abs(svd1$v[, 3]))
# The SNP with max loading for PC4
theone1.2 <- which.max(abs(svd1$v[, 4]))
plot(svd1, type = "scores", scores = 3:4) +
aes(color = as.factor(paste(G[, theone1.1], G[, theone1.2]))) +
labs(color = "Genotypes") +
guides(colour = guide_legend(override.aes = list(size = rel(4))))
So basically, PC3 is capturing a variation of one SNP, as well as PC4. These two SNPs are located in long-range LD regions and corresponds to peaks in the next figure.
plot(svd1, type = "loadings", loadings = 1:4, coeff = 0.7)
In fact, I’m using clumping on the Minor Allele Frequencies (MAF) instead of pruning. You can see this vignette to know why.
ind.keep2 <- snp_clumping(G, CHR, ncores = NCORES)
svd2 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
ind.col = ind.keep2)
plot(svd2, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")
theone2 <- ind.keep2[which.max(abs(svd2$v[, 4]))]
plot(svd2, type = "scores", scores = 3:4) +
aes(color = as.factor(G[, theone2])) +
labs(color = "Genotype")
PC4 again captures variation at one SNP, which is possibly in a long-range LD region.
As recommended by Price et al. (2008), it is possible to remove a list of predetermined long-range LD regions.
ind.keep3 <- cols_along(G)[-snp_indLRLDR(CHR, POS)]
svd3 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
ind.col = ind.keep3)
plot(svd3, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")
This is quite better at capturing population structure. Yet..
plot(svd3, type = "loadings", loadings = 1:10, coeff = 0.6)
So, first 4 PCs are mostly capturing population structure, but the next PCs are likely to capture only LD structure.
plot(svd3, type = "scores", scores = 5:6) +
aes(color = pop.names[pop]) +
labs(color = "Population")
theone3 <- ind.keep3[which.max(abs(svd3$v[, 5]))]
plot(svd3, type = "scores", scores = 5:6) +
aes(color = as.factor(G[, theone3])) +
labs(color = "Genotype")
As recommend by Abdellaoui et al. (2013), I prune AND remove long-range LD-regions.
ind.keep4 <- snp_clumping(G, CHR, ncores = NCORES,
exclude = snp_indLRLDR(CHR, POS))
svd4 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
ind.col = ind.keep4)
plot(svd4, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")
plot(svd4, type = "scores", scores = 5:6) +
aes(color = pop.names[pop]) +
labs(color = "Population")
Maybe, PC5 is now capturing some non-European ancestry, or something else.
plot(svd4, type = "loadings", loadings = 1:10, coeff = 0.6)
The loadings are all approximately normally distributed (no huge peak), we’re good.
You can use either the previous method (for human data only) or try the following automatic procedure to prune and remove long-range LD regions (Privé, Aschard, and Blum 2017).
svd0 <- snp_autoSVD(G, CHR, POS, ncores = NCORES)
## Phase of clumping at r2 > 0.2.. keep 94517 SNPs.
##
## Iteration 1:
## Computing SVD..
## 3 long-range LD regions were detected..
##
## Iteration 2:
## Computing SVD..
## 2 long-range LD regions were detected..
##
## Iteration 3:
## Computing SVD..
##
## Converged!
attr(svd0, "lrldr")
## Chr Start Stop
## 1 2 134374210 138130055
## 2 6 23788094 35793692
## 3 8 6341567 13526256
## 4 3 163071168 165026940
## 5 14 46690298 47500807
plot(svd0, type = "scores", scores = 3:4) +
aes(color = pop.names[pop]) +
labs(color = "Population")
plot(svd0, type = "scores", scores = 5:6) +
aes(color = pop.names[pop]) +
labs(color = "Population")
plot(svd0, type = "loadings", loadings = 1:10, coeff = 0.6)
Always use both pruning and removing of long-range LD regions when computing the PCs, as recommended by Abdellaoui et al. (2013). To check that the results are capturing population structure, you should plot PCA scores. To check that PCs are not capturing LD, you should check PCA loadings.
Abdellaoui, Abdel, Jouke-Jan Hottenga, Peter De Knijff, Michel G Nivard, Xiangjun Xiao, Paul Scheet, Andrew Brooks, et al. 2013. “Population structure, migration, and diversifying selection in the Netherlands.” European Journal of Human Genetics 21 (10). Nature Publishing Group: 1277–85. doi:10.1038/ejhg.2013.48.
Dubois, Patrick C A, Gosia Trynka, Lude Franke, Karen A Hunt, Jihane Romanos, Alessandra Curtotti, Alexandra Zhernakova, et al. 2010. “Multiple common variants for celiac disease influencing immune gene expression.” Nature Genetics 42 (4). Nature Publishing Group: 295–302. doi:10.1038/ng.543.
Price, Alkes L, Michael E Weale, Nick Patterson, Simon R Myers, Anna C Need, Kevin V Shianna, Dongliang Ge, et al. 2008. “Long-Range LD Can Confound Genome Scans in Admixed Populations.” Elsevier. doi:10.1016/j.ajhg.2008.06.005.
Privé, Florian, Hugues Aschard, and Michael G B Blum. 2017. “Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.” BioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/190926.