vignettes/pruning-vs-clumping.Rmd
pruning-vs-clumping.Rmd
In this vignette, I show why clumping (on MAF) should be preferred over pruning for single-nucleotide polymorphism (SNP) arrays.
Pruning and clumping are used to keep a subset of SNPs that are nearly uncorrelated with each other. For instance, pruning is used before Principal Component Analysis to avoid capturing too much variance of linkage disequilibrium (LD) regions. Clumping is used to keep only one representative SNP per region of LD.
I generate a (toy) SNP array with 500 individuals and 10 SNPs, where each SNP has a squared correlation > 0.2 with their direct neighbors and only with them. Moreover, the SNPs have an increasing MAF.
gen <- function(n, m) {
I <- 1:m
p <- I / (2 * m + 1)
mat <- outer(I, I, FUN = function(i, j) {
1 / (abs(i - j) + 1)
})
bindata::rmvbin(n, p, bincorr = mat) +
bindata::rmvbin(n, p, bincorr = mat)
}
set.seed(1)
X <- gen(500, 10)
print(head(X, 20))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 1 0 0 1 1 1 1
## [2,] 0 0 0 0 0 0 1 1 1 1
## [3,] 0 1 1 0 0 0 0 0 0 1
## [4,] 0 0 0 0 0 0 0 0 1 1
## [5,] 0 0 0 0 0 0 0 1 1 1
## [6,] 0 0 0 0 1 1 0 0 1 1
## [7,] 1 0 1 0 1 1 1 2 2 1
## [8,] 1 1 1 1 1 1 0 1 1 1
## [9,] 0 0 0 0 0 1 1 0 1 1
## [10,] 0 0 1 1 1 1 1 1 1 1
## [11,] 0 1 1 0 0 2 2 2 2 1
## [12,] 0 0 0 0 0 0 0 0 0 0
## [13,] 0 0 0 0 0 0 1 0 1 1
## [14,] 0 0 0 0 0 0 0 0 1 1
## [15,] 0 0 0 0 0 0 1 0 0 0
## [16,] 0 0 0 0 0 0 0 1 1 1
## [17,] 0 0 1 1 0 1 0 0 1 1
## [18,] 1 1 2 1 1 2 2 2 2 2
## [19,] 0 1 0 0 0 1 2 1 1 0
## [20,] 0 0 0 0 0 0 1 1 0 0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1.00 0.23 0.13 0.08 0.05 0.03 0.02 0.01 0.01 0.00
## [2,] 0.23 1.00 0.21 0.09 0.05 0.05 0.01 0.01 0.01 0.01
## [3,] 0.13 0.21 1.00 0.25 0.12 0.10 0.06 0.06 0.02 0.01
## [4,] 0.08 0.09 0.25 1.00 0.31 0.16 0.07 0.07 0.03 0.04
## [5,] 0.05 0.05 0.12 0.31 1.00 0.29 0.11 0.10 0.07 0.04
## [6,] 0.03 0.05 0.10 0.16 0.29 1.00 0.25 0.17 0.08 0.06
## [7,] 0.02 0.01 0.06 0.07 0.11 0.25 1.00 0.23 0.13 0.07
## [8,] 0.01 0.01 0.06 0.07 0.10 0.17 0.23 1.00 0.28 0.14
## [9,] 0.01 0.01 0.02 0.03 0.07 0.08 0.13 0.28 1.00 0.31
## [10,] 0.00 0.01 0.01 0.04 0.04 0.06 0.07 0.14 0.31 1.00
## [1] 0.043 0.091 0.154 0.177 0.234 0.283 0.338 0.372 0.436 0.450
Let’s convert this to our format bigSNP
in order to
write corresponding PLINK files.
## Loading required package: bigstatsr
fake <- snp_fake(nrow(X), ncol(X))
fake$genotypes[] <- X
tmp <- tempfile()
snp_writeBed(fake, paste0(tmp, ".bed"))
## [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpMPLB7m\\file57805eec32d.bed"
Pruning, as implemented in PLINK, sequentially scan the genomes for pairs of correlated SNPs, keeping only the one with the higher MAF (see this). Let’s use PLINK pruning on this simulated toy dataset.
library(glue)
plink <- download_plink()
system(glue("{plink} --bfile {tmp} --indep-pairwise 50 1 0.2 --out {tmp}"))
## [1] 0
read.table(glue("{tmp}.prune.out"))
## V1
## 1 snp_1
## 2 snp_2
## 3 snp_3
## 4 snp_4
## 5 snp_5
## 6 snp_6
## 7 snp_7
## 8 snp_8
## 9 snp_9
read.table(glue("{tmp}.prune.in"))
## V1
## 1 snp_10
The first SNP is pruned because of its correlation with the second. The second SNP is pruned because of its correlation with the third and so on. In the end, only the last SNP (10th) is kept with the LD pruning procedure of PLINK, which corresponds to less than 18% of the total variance.
You can also do the pruning directly in R with
snp_pruning()
:
snp_pruning(fake$genotypes, infos.chr = fake$map$chromosome)
## Error: Pruning is deprecated; please use clumping (on MAF) instead..
## See why at https://bit.ly/2uKo3MN.
A clumping approach would consider the SNPs sorted (in a decreasing order) by a statistic. This statistic is often a test statistic computed from a GWAS of a given phenotype. Yet, for example, for Principal Components Analysis (PCA), the thinning procedure should remain unsupervised (phenotype mustn’t be used!). So, we propose to also use the MAF as a statistic of importance, so that SNPs are sorted by decreasing MAF.
The benefit of the clumping procedure is that the index SNP is always the SNP that is kept (because it has the highest MAF). Let’s use PLINK again:
# Compute MAFs
write.table(data.frame(SNP = fake$map$marker.ID,
P = 1 - snp_MAF(fake$genotypes)),
file = paste0(tmp, ".frq"), row.names = FALSE, quote = FALSE)
# Clumping
system(glue("{plink} --bfile {tmp} --out {tmp}",
" --clump {tmp}.frq",
" --clump-p1 1 --clump-p2 1 --clump-r2 0.2"))
## [1] 0
read.table(glue("{tmp}.clumped"), header = TRUE)
## CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2
## 1 1 1 snp_10 10000 0.550 1 1 0 0 0 0 snp_9(1)
## 2 1 1 snp_8 8000 0.628 1 1 0 0 0 0 snp_7(1)
## 3 1 1 snp_6 6000 0.717 1 1 0 0 0 0 snp_5(1)
## 4 1 1 snp_4 4000 0.823 1 1 0 0 0 0 snp_3(1)
## 5 1 1 snp_2 2000 0.909 1 1 0 0 0 0 snp_1(1)
So the last SNP (10th) would be considered first and the 9th SNP would be pruned. Then the 8th SNP would be considered and the 7th SNP would be pruned and so on. So, the even SNPs would be kept and the odd SNPs would be pruned, which corresponds to 56.7% of the total variance.
You can also do the clumping (on MAF by default) directly in R with
snp_clumping()
:
snp_clumping(fake$genotypes, infos.chr = fake$map$chromosome)
## [1] 2 4 6 8 10