In this vignette, I show why clumping (on MAF) should be preferred over pruning for single-nucleotide polymorphism (SNP) arrays.

Pruning and clumping are used to keep a subset of SNPs that are nearly uncorrelated with each other. For instance, pruning is used before Principal Component Analysis to avoid capturing too much variance of linkage disequilibrium (LD) regions. Clumping is used to keep only one representative SNP per region of LD.

Simulation

I generate a (toy) SNP array with 500 individuals and 10 SNPs, where each SNP has a squared correlation > 0.2 with their direct neighbors and only with them. Moreover, the SNPs have an increasing MAF.

gen <- function(n, m) {
  I <- 1:m
  p <- I / (2 * m + 1)

  mat <- outer(I, I, FUN = function(i, j) {
    1 / (abs(i - j) + 1)
  })

  bindata::rmvbin(n, p, bincorr = mat) + 
    bindata::rmvbin(n, p, bincorr = mat)
}

set.seed(1)
X <- gen(500, 10)
print(head(X, 20))
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0    0    1    0    0    1    1    1     1
##  [2,]    0    0    0    0    0    0    1    1    1     1
##  [3,]    0    1    1    0    0    0    0    0    0     1
##  [4,]    0    0    0    0    0    0    0    0    1     1
##  [5,]    0    0    0    0    0    0    0    1    1     1
##  [6,]    0    0    0    0    1    1    0    0    1     1
##  [7,]    1    0    1    0    1    1    1    2    2     1
##  [8,]    1    1    1    1    1    1    0    1    1     1
##  [9,]    0    0    0    0    0    1    1    0    1     1
## [10,]    0    0    1    1    1    1    1    1    1     1
## [11,]    0    1    1    0    0    2    2    2    2     1
## [12,]    0    0    0    0    0    0    0    0    0     0
## [13,]    0    0    0    0    0    0    1    0    1     1
## [14,]    0    0    0    0    0    0    0    0    1     1
## [15,]    0    0    0    0    0    0    1    0    0     0
## [16,]    0    0    0    0    0    0    0    1    1     1
## [17,]    0    0    1    1    0    1    0    0    1     1
## [18,]    1    1    2    1    1    2    2    2    2     2
## [19,]    0    1    0    0    0    1    2    1    1     0
## [20,]    0    0    0    0    0    0    1    1    0     0
print(round(cor(X)^2, 2)) # squared correlation between SNPs
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] 1.00 0.23 0.13 0.08 0.05 0.03 0.02 0.01 0.01  0.00
##  [2,] 0.23 1.00 0.21 0.09 0.05 0.05 0.01 0.01 0.01  0.01
##  [3,] 0.13 0.21 1.00 0.25 0.12 0.10 0.06 0.06 0.02  0.01
##  [4,] 0.08 0.09 0.25 1.00 0.31 0.16 0.07 0.07 0.03  0.04
##  [5,] 0.05 0.05 0.12 0.31 1.00 0.29 0.11 0.10 0.07  0.04
##  [6,] 0.03 0.05 0.10 0.16 0.29 1.00 0.25 0.17 0.08  0.06
##  [7,] 0.02 0.01 0.06 0.07 0.11 0.25 1.00 0.23 0.13  0.07
##  [8,] 0.01 0.01 0.06 0.07 0.10 0.17 0.23 1.00 0.28  0.14
##  [9,] 0.01 0.01 0.02 0.03 0.07 0.08 0.13 0.28 1.00  0.31
## [10,] 0.00 0.01 0.01 0.04 0.04 0.06 0.07 0.14 0.31  1.00
print(colMeans(X) / 2) # MAF of SNPs
##  [1] 0.043 0.091 0.154 0.177 0.234 0.283 0.338 0.372 0.436 0.450

Let’s convert this to our format bigSNP in order to write corresponding PLINK files.

## Loading required package: bigstatsr
fake <- snp_fake(nrow(X), ncol(X))
fake$genotypes[] <- X
tmp <- tempfile()
snp_writeBed(fake, paste0(tmp, ".bed"))
## [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpMPLB7m\\file57805eec32d.bed"

Pruning

Pruning, as implemented in PLINK, sequentially scan the genomes for pairs of correlated SNPs, keeping only the one with the higher MAF (see this). Let’s use PLINK pruning on this simulated toy dataset.

library(glue)
plink <- download_plink()
system(glue("{plink} --bfile {tmp} --indep-pairwise 50 1 0.2 --out {tmp}"))
## [1] 0
read.table(glue("{tmp}.prune.out"))
##      V1
## 1 snp_1
## 2 snp_2
## 3 snp_3
## 4 snp_4
## 5 snp_5
## 6 snp_6
## 7 snp_7
## 8 snp_8
## 9 snp_9
read.table(glue("{tmp}.prune.in"))
##       V1
## 1 snp_10

The first SNP is pruned because of its correlation with the second. The second SNP is pruned because of its correlation with the third and so on. In the end, only the last SNP (10th) is kept with the LD pruning procedure of PLINK, which corresponds to less than 18% of the total variance.

You can also do the pruning directly in R with snp_pruning():

snp_pruning(fake$genotypes, infos.chr = fake$map$chromosome)
## Error: Pruning is deprecated; please use clumping (on MAF) instead..
## See why at https://bit.ly/2uKo3MN.

Clumping

A clumping approach would consider the SNPs sorted (in a decreasing order) by a statistic. This statistic is often a test statistic computed from a GWAS of a given phenotype. Yet, for example, for Principal Components Analysis (PCA), the thinning procedure should remain unsupervised (phenotype mustn’t be used!). So, we propose to also use the MAF as a statistic of importance, so that SNPs are sorted by decreasing MAF.

The benefit of the clumping procedure is that the index SNP is always the SNP that is kept (because it has the highest MAF). Let’s use PLINK again:

# Compute MAFs
write.table(data.frame(SNP = fake$map$marker.ID,
                       P = 1 - snp_MAF(fake$genotypes)),
            file = paste0(tmp, ".frq"), row.names = FALSE, quote = FALSE)
# Clumping
system(glue("{plink} --bfile {tmp} --out {tmp}",
              " --clump {tmp}.frq",
              " --clump-p1 1 --clump-p2 1 --clump-r2 0.2"))
## [1] 0
read.table(glue("{tmp}.clumped"), header = TRUE)
##   CHR F    SNP    BP     P TOTAL NSIG S05 S01 S001 S0001      SP2
## 1   1 1 snp_10 10000 0.550     1    1   0   0    0     0 snp_9(1)
## 2   1 1  snp_8  8000 0.628     1    1   0   0    0     0 snp_7(1)
## 3   1 1  snp_6  6000 0.717     1    1   0   0    0     0 snp_5(1)
## 4   1 1  snp_4  4000 0.823     1    1   0   0    0     0 snp_3(1)
## 5   1 1  snp_2  2000 0.909     1    1   0   0    0     0 snp_1(1)

So the last SNP (10th) would be considered first and the 9th SNP would be pruned. Then the 8th SNP would be considered and the 7th SNP would be pruned and so on. So, the even SNPs would be kept and the odd SNPs would be pruned, which corresponds to 56.7% of the total variance.

You can also do the clumping (on MAF by default) directly in R with snp_clumping():

snp_clumping(fake$genotypes, infos.chr = fake$map$chromosome)
## [1]  2  4  6  8 10

Conclusion

Pruning removes SNPs in a way that may leave regions of the genome with no representative SNP at all. Clumping is similar to pruning when sorting by descending MAF, but does not have this issue, and should therefore be preferred.