How to capture population structure with PCA?

In this vignette, I show how to perform Principal Component Analysis (PCA) with packages bigstatsr and bigsnpr (Privé, Aschard, and Blum 2017). I also show why pruning and removing long-range Linkage Disequilibrium (LD) are two important steps before computing PCs in order to capture population structure (Abdellaoui et al. 2013).

Data

I use data from a case/control cohort on celiac disease (Dubois et al. 2010). The data has already been QCed, genotyped SNPs have been imputed and binary PLINK files have been converted to the “bigSNP” format used in bigsnpr (see these preprocessing steps).

library(bigsnpr)

## Loading required package: bigstatsr

library(ggplot2)

celiac <- snp_attach("backingfiles/celiacQC.rds")
G <- celiac$genotypes
CHR <- celiac$map$chromosome
POS <- celiac$map$physical.pos
NCORES <- nb_cores()

# "Verification" there is no missing value
big_counts(G, ind.col = 1:12) # OK

##       [,1] [,2]  [,3] [,4]  [,5]  [,6]  [,7] [,8]  [,9] [,10] [,11] [,12]
## 0      418 1166   425 1114   165   293   354 2044    81   601   109   190
## 1     4073 6054  4198 5968  2720  3553  3843 6837  2008  4620  2262  3190
## 2    10664 7935 10532 8073 12270 11309 10958 6274 13066  9934 12784 11775
## <NA>     0    0     0    0     0     0     0    0     0     0     0     0

# Get population from external files
pop.files <- list.files(path = "data", pattern = "cluster_*", full.names = TRUE)
pop <- snp_getSampleInfos(celiac, pop.files)[[1]]
pop.names <- c("Netherlands", "Italy", "UK1", "UK2", "Finland")

Principal Component Analysis

On the whole genotype matrix

svd1 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES)

plot(svd1, type = "scores") +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

First two PCs capture population structure.

plot(svd1, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

PC3 and PC4 don’t capture population structure.

# The SNP with max loading for PC3
theone1.1 <- which.max(abs(svd1$v[, 3]))
# The SNP with max loading for PC4
theone1.2 <- which.max(abs(svd1$v[, 4]))
plot(svd1, type = "scores", scores = 3:4) +
  aes(color = as.factor(paste(G[, theone1.1], G[, theone1.2]))) +
  labs(color = "Genotypes") +
  guides(colour = guide_legend(override.aes = list(size = rel(4))))

So basically, PC3 is capturing a variation of one SNP, as well as PC4. These two SNPs are located in long-range LD regions and corresponds to peaks in the next figure.

plot(svd1, type = "loadings", loadings = 1:4, coeff = 0.7)

When pruning only

In fact, I’m using clumping on the Minor Allele Frequencies (MAF) instead of pruning. You can see this vignette to know why.

ind.keep2 <- snp_clumping(G, CHR, ncores = NCORES)
svd2 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
                      ind.col = ind.keep2)

plot(svd2, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

theone2 <- ind.keep2[which.max(abs(svd2$v[, 4]))]
plot(svd2, type = "scores", scores = 3:4) +
  aes(color = as.factor(G[, theone2])) +
  labs(color = "Genotype")

PC4 again captures variation at one SNP, which is possibly in a long-range LD region.

When removing long-range LD regions only

As recommended by Price et al. (2008), it is possible to remove a list of predetermined long-range LD regions.

ind.keep3 <- cols_along(G)[-snp_indLRLDR(CHR, POS)]
svd3 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
                      ind.col = ind.keep3)

plot(svd3, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

This is quite better at capturing population structure. Yet..

plot(svd3, type = "loadings", loadings = 1:10, coeff = 0.6)

So, first 4 PCs are mostly capturing population structure, but the next PCs are likely to capture only LD structure.

plot(svd3, type = "scores", scores = 5:6) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

theone3 <- ind.keep3[which.max(abs(svd3$v[, 5]))]
plot(svd3, type = "scores", scores = 5:6) +
  aes(color = as.factor(G[, theone3])) +
  labs(color = "Genotype")

When pruning and removing long-range LD regions

As recommend by Abdellaoui et al. (2013), I prune AND remove long-range LD-regions.

ind.keep4 <- snp_clumping(G, CHR, ncores = NCORES,
                          exclude = snp_indLRLDR(CHR, POS))
svd4 <- big_randomSVD(G, snp_scaleBinom(), ncores = NCORES,
                      ind.col = ind.keep4)

plot(svd4, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

plot(svd4, type = "scores", scores = 5:6) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

Maybe, PC5 is now capturing some non-European ancestry, or something else.

plot(svd4, type = "loadings", loadings = 1:10, coeff = 0.6)

The loadings are all approximately normally distributed (no huge peak), we’re good.

An automatic procedure

You can use either the previous method (for human data only) or try the following automatic procedure to prune and remove long-range LD regions (Privé, Aschard, and Blum 2017).

svd0 <- snp_autoSVD(G, CHR, POS, ncores = NCORES)

## Phase of clumping at r2 > 0.2.. keep 94517 SNPs.
## 
## Iteration 1:
## Computing SVD..
## 3 long-range LD regions were detected..
## 
## Iteration 2:
## Computing SVD..
## 2 long-range LD regions were detected..
## 
## Iteration 3:
## Computing SVD..
## 
## Converged!

attr(svd0, "lrldr")

##   Chr     Start      Stop
## 1   2 134374210 138130055
## 2   6  23788094  35793692
## 3   8   6341567  13526256
## 4   3 163071168 165026940
## 5  14  46690298  47500807

plot(svd0, type = "scores", scores = 3:4) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

plot(svd0, type = "scores", scores = 5:6) +
  aes(color = pop.names[pop]) +
  labs(color = "Population")

plot(svd0, type = "loadings", loadings = 1:10, coeff = 0.6)

Conclusion

Always use both pruning and removing of long-range LD regions when computing the PCs, as recommended by Abdellaoui et al. (2013). To check that the results are capturing population structure, you should plot PCA scores. To check that PCs are not capturing LD, you should check PCA loadings.

References

Abdellaoui, Abdel, Jouke-Jan Hottenga, Peter De Knijff, Michel G Nivard, Xiangjun Xiao, Paul Scheet, Andrew Brooks, et al. 2013. “Population structure, migration, and diversifying selection in the Netherlands.” European Journal of Human Genetics 21 (10). Nature Publishing Group: 1277–85. doi:10.1038/ejhg.2013.48.

Dubois, Patrick C A, Gosia Trynka, Lude Franke, Karen A Hunt, Jihane Romanos, Alessandra Curtotti, Alexandra Zhernakova, et al. 2010. “Multiple common variants for celiac disease influencing immune gene expression.” Nature Genetics 42 (4). Nature Publishing Group: 295–302. doi:10.1038/ng.543.

Price, Alkes L, Michael E Weale, Nick Patterson, Simon R Myers, Anna C Need, Kevin V Shianna, Dongliang Ge, et al. 2008. “Long-Range LD Can Confound Genome Scans in Admixed Populations.” Elsevier. doi:10.1016/j.ajhg.2008.06.005.

Privé, Florian, Hugues Aschard, and Michael G B Blum. 2017. “Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.” BioRxiv. Cold Spring Harbor Laboratory. doi:10.1101/190926.