Chapter 5 Population structure
5.1 Principal Component Analysis (PCA)
PCA on the genotype matrix can be used to capture population structure. However, PCA can actually capture different kinds of structure (Privé, Luu, Blum, et al., 2020):
- population structure (what we want),
- LD structure, when there are two many correlated variants (e.g. within long-range LD regions) and not enough population structure (see this vignette),
- relatedness structure, when there are related individuals who usually cluster together in later PCs,
- noise, basically just circles when looking at PC scores.
Population structure is the second main topic of my research work (after polygenic scores).
In Privé et al. (2018), I introduced an algorithm to compute PCA for a
bigSNP
object while accounting for the LD problem by using clumping (not pruning, cf. this vignette) and an automatic detection and removal of long-range LD regions.In Privé, Luu, Vilhjálmsson, et al. (2020), I improved the implementation of an algorithm, pcadapt, to detect variants associated with population structure (could be considered as a GWAS for population structure).
In Privé, Luu, Blum, et al. (2020), I extended {bigsnpr} to also be able to run PCA on PLINK bed files directly with a small percentage of missing values, and investigated best practices for PCA in more detail.
In Privé, Aschard, et al. (2022) and Privé (2022), I showed how to use PCA for ancestry inference, including grouping individuals in homogeneous ancestry groups, and inferring ancestry proportions from allele frequencies only (see this vignette).
5.2 Example
There can be many steps to properly perform a PCA, which I try to showcase in the following example. You can find another detailed example of how to conduct a PCA in this vignette.
Let us reuse the data prepared in 4.3.
First, let us get an idea of the relatedness in the data using
library(bigsnpr)
#> Loading required package: bigstatsr
<- nb_cores())
(NCORES <- download_plink2("tmp-data")
plink2 <- snp_plinkKINGQC(plink2, "tmp-data/GWAS_data_sorted_QC.bed",
rel thr.king = 2^-4.5, make.bed = FALSE, ncores = NCORES)
hist(log2(rel$KINSHIP), "FD"); abline(v = c(-4.5, -3.5), col = "red")
When computing relatedness with KING, LD pruning is NOT recommended. However, it may be useful to filter out some variants that are highly associated with population structure, e.g. as performed in the UK Biobank (Bycroft et al., 2018). For example, see this code.
Relatedness should not be a major issue here. Let us now compute PCs.
All the code that follows could be run on the bigSNP object we made before. Nevertheless, to showcase the bed_*
functions here, we will run the following analyses on the bed file directly.
# Memory-map a bed file directly
<- bed("tmp-data/GWAS_data_sorted_QC.bed")
obj.bed str(obj.bed, max.level = 2)
#> Reference class 'bed' [package "bigsnpr"] with 13 fields
#> $ bedfile: chr "tmp-data/GWAS_data_sorted_QC.bed"
#> $ extptr :<externalptr>
#> $ .fam :'data.frame': 1401 obs. of 6 variables:
#> $ .map :'data.frame': 404663 obs. of 6 variables:
#> $ prefix : chr "tmp-data/GWAS_data_sorted_QC"
#> $ bimfile: chr "tmp-data/GWAS_data_sorted_QC.bim"
#> $ famfile: chr "tmp-data/GWAS_data_sorted_QC.fam"
#> $ fam :'data.frame': 1401 obs. of 6 variables:
#> $ map :'data.frame': 404663 obs. of 6 variables:
#> $ nrow : int 1401
#> $ ncol : int 404663
#> $ light :Reference class 'bed_light' [package "bigsnpr"] with 5 fields
#> ..and 15 methods, of which 1 is possibly relevant
#> $ address:<externalptr>
#> and 16 methods, of which 2 are possibly relevant:
#> initialize, show#envRefClass
# If you expect individuals to all come from the same continent,
# use e.g. k=10, otherwise you can try k=25.
<- runonce::save_run(
obj.svd bed_autoSVD(obj.bed, k = 12, ncores = NCORES),
file = "tmp-data/PCA_GWAS_data.rds")
#>
#> Phase of clumping (on MAC) at r^2 > 0.2.. keep 115258 variants.
#> Discarding 0 variant with MAC < 10.
#>
#> Iteration 1:
#> Computing SVD..
#> 309 outlier variants detected..
#> 2 long-range LD regions detected..
#>
#> Iteration 2:
#> Computing SVD..
#> 19 outlier variants detected..
#> 0 long-range LD region detected..
#>
#> Iteration 3:
#> Computing SVD..
#> 3 outlier variants detected..
#> 0 long-range LD region detected..
#>
#> Iteration 4:
#> Computing SVD..
#> 15 outlier variants detected..
#> 0 long-range LD region detected..
#>
#> Iteration 5:
#> Computing SVD..
#> 0 outlier variant detected..
#>
#> Converged!
#> user system elapsed
#> 26.58 1.34 139.57
plot(obj.svd)
plot(obj.svd, type = "scores", scores = 1:12, coeff = 0.5)
There is some population structure (maybe up to 6 PCs). You should also check loadings to make sure there is no LD structure (peaks on loadings):
plot(obj.svd, type = "loadings", loadings = 1:6, coeff = 0.5)
No peaks, but loadings of PC4 and PC5 are a bit odd.
If you expect the individuals to mostly come from one population (e.g. in a national biobank), you can simply use a robust distance to identify a homogeneous subset of individuals, then look at the histogram of log-distances to choose a threshold based on visual inspection (here I would probably choose 4.5).
<- predict(obj.svd)
PC <- log(bigutilsr::dist_ogk(PC))
ldist hist(ldist, "FD")
library(ggplot2)
::source_url(
devtools"https://raw.githubusercontent.com/privefl/paper4-bedpca/master/code/plot_grid2.R")
plot_grid2(plotlist = lapply(1:4, function(k) {
<- 2 * k - 1
k1 <- 2 * k
k2 qplot(PC[, k1], PC[, k2], color = ldist, size = I(2)) +
scale_color_viridis_c() +
theme_bigstatsr(0.6) +
labs(x = paste0("PC", k1), y = paste0("PC", k2), color = "log-distance") +
coord_equal()
nrow = 2, legend_ratio = 0.2, title_ratio = 0) }),
#> Warning: `qplot()` was deprecated in ggplot2 3.4.0.
It would be nice if we could get a better sense of the ancestry of these individuals. To achieve this, we can project this data onto the PCA space of many known population groups defined in Privé (2022).
<- bigreadr::fread2(
all_freq ::download_file(
runonce"https://figshare.com/ndownloader/files/38019027", # for the tutorial (46 MB)
# "https://figshare.com/ndownloader/files/31620968", # for real analyses (849 MB)
dir = "tmp-data", fname = "ref_freqs.csv.gz"))
<- bigreadr::fread2(
projection ::download_file(
runonce"https://figshare.com/ndownloader/files/38019024", # for the tutorial (44 MB)
# "https://figshare.com/ndownloader/files/31620953", # for real analyses (847 MB)
dir = "tmp-data", fname = "projection.csv.gz"))
# coefficients to correct for overfitting of PCA
<- c(1, 1, 1, 1.008, 1.021, 1.034, 1.052, 1.074, 1.099,
correction 1.123, 1.15, 1.195, 1.256, 1.321, 1.382, 1.443)
library(dplyr)
<- obj.bed$map %>%
matched transmute(chr = chromosome, pos = physical.pos, a1 = allele1, a0 = allele2) %>%
mutate(beta = 1) %>%
snp_match(all_freq[1:5]) %>%
print()
#> chr pos a0 a1 beta _NUM_ID_.ss rsid _NUM_ID_
#> 1 1 752566 G A -1 2 rs3094315 1
#> 2 1 785989 T C -1 4 rs2980300 2
#> 3 1 798959 G A 1 5 rs11240777 3
#> 4 1 947034 G A -1 6 rs2465126 4
#> 5 1 949608 G A 1 7 rs1921 5
#> 6 1 1018704 A G -1 8 rs9442372 6
#> [ reached 'max' / getOption("max.print") -- omitted 301150 rows ]
# further subsetting on missing values
<- bed_counts(obj.bed, ind.col = matched$`_NUM_ID_.ss`, ncores = NCORES)
counts # hist(nb_na <- counts[4, ])
<- which(counts[4, ] < (nrow(obj.bed) * 0.05))
ind
# project individuals (divided by 2) onto the PC space
<- as.matrix(projection[matched$`_NUM_ID_`[ind], -(1:5)])
PROJ <- apply(sweep(PROJ, 2, correction / 2, '*'), 2, function(x) {
all_proj bed_prodVec(obj.bed, x, ind.col = matched$`_NUM_ID_.ss`[ind], ncores = NCORES,
# scaling to get G if beta = 1 and (2 - G) if beta = -1
center = 1 - matched$beta[ind],
scale = matched$beta[ind])
})
We can then assign individuals to the closest center:
<- crossprod(as.matrix(all_freq[matched$`_NUM_ID_`[ind], -(1:5)]), PROJ)
all_centers <- apply(all_centers, 1, function(one_center) {
all_sq_dist rowSums(sweep(all_proj, 2, one_center, '-')^2)
})
<- 0.002 # you can adjust this threshold
THR <- max(dist(all_centers)^2) * THR / 0.16
thr_sq_dist
<- colnames(all_freq)[-(1:5)]
group %in% c("Scandinavia", "United Kingdom", "Ireland")] <- "Europe (North West)"
group[group %in% c("Europe (South East)", "Europe (North East)")] <- "Europe (East)"
group[group
<- apply(all_sq_dist, 1, function(sq_dist) {
cluster <- which.min(sq_dist)
ind if (sq_dist[ind] < thr_sq_dist) group[ind] else NA
})
table(cluster, exclude = NULL) # no NA -> all assigned
#> cluster
#> Ashkenazi Europe (East) Europe (North West)
#> 110 148 872
#> Europe (South West) Italy Middle East
#> 45 219 4
#> <NA>
#> 3
plot_grid2(plotlist = lapply(1:4, function(k) {
<- 2 * k - 1
k1 <- 2 * k
k2 qplot(PC[, k1], PC[, k2], color = cluster, size = I(2)) +
theme_bigstatsr(0.6) +
labs(x = paste0("PC", k1), y = paste0("PC", k2), color = "Assigned\npopulation") +
coord_equal()
nrow = 2, legend_ratio = 0.25, title_ratio = 0) }),
These are mostly European individuals. PC4-PC6 are definitively a bit odd.
Adapt all the previous code to rerun it on the bigSNP object we made before (instead of the bed object we have used here).