LD clumping — bed_clumping • bigsnpr

For a bigSNP:

snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).
snp_clumping() (and bed_clumping()): LD clumping. If you do not provide any statistic to rank SNPs, minor allele frequencies (MAFs) are used, which makes clumping similar to pruning.
snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.
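
To make the idea of clumping concrete, here is a minimal base-R sketch of a greedy clumping pass on an ordinary numeric matrix. It is an illustration only, not the implementation used by bigsnpr: the function name clump_sketch is hypothetical, the window is counted in number of SNPs, and chromosomes and missing values are ignored.

```r
# Illustrative greedy clumping (NOT bigsnpr's implementation).
# geno: numeric matrix, rows = individuals, columns = SNPs
# S:    one importance statistic per SNP (larger = more important)
clump_sketch <- function(geno, S, thr.r2 = 0.2, size = 100 / thr.r2) {
  m <- ncol(geno)
  keep    <- rep(FALSE, m)
  removed <- rep(FALSE, m)
  # Visit SNPs from most to least important
  for (j in order(S, decreasing = TRUE)) {
    if (removed[j]) next
    keep[j] <- TRUE
    # Neighbours within `size` SNPs that are still undecided
    win <- setdiff(max(1, j - size):min(m, j + size), j)
    win <- win[!keep[win] & !removed[win]]
    if (length(win) > 0) {
      r2 <- as.vector(cor(geno[, j], geno[, win, drop = FALSE]))^2
      # Drop neighbours too correlated with the SNP just kept
      removed[win[which(r2 > thr.r2)]] <- TRUE
    }
  }
  which(keep)
}

# Toy data: SNPs 1 and 2 are identical, SNPs 3 and 4 are independent
set.seed(1)
n <- 200
a <- rbinom(n, 2, 0.3)
geno <- cbind(a, a, rbinom(n, 2, 0.4), rbinom(n, 2, 0.5))
kept <- clump_sketch(geno, S = c(1, 2, 3, 4), thr.r2 = 0.2)
```

Of the two identical SNPs, only the one with the larger statistic survives; with S set to MAFs, the same procedure behaves much like pruning.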

bed_clumping(
  obj.bed,
  ind.row = rows_along(obj.bed),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  exclude = NULL,
  ncores = 1
)

snp_clumping(
  G,
  infos.chr,
  ind.row = rows_along(G),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  infos.pos = NULL,
  is.size.in.bp = NULL,
  exclude = NULL,
  ncores = 1
)

snp_pruning(
  G,
  infos.chr,
  ind.row = rows_along(G),
  size = 49,
  is.size.in.bp = FALSE,
  infos.pos = NULL,
  thr.r2 = 0.2,
  exclude = NULL,
  nploidy = 2,
  ncores = 1
)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)

Arguments

obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.
ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.
Don't use negative indices.
S: A vector of column statistics which express the importance of each SNP (the more important the SNP, the greater the corresponding statistic should be).
For example, if S follows the standard normal distribution, and "important" means significantly different from 0, you must use abs(S) instead.
If not specified, MAFs are computed and used.
thr.r2: Threshold over the squared correlation between two SNPs. Default is 0.2.
size: For one SNP, the window size around this SNP within which correlations are computed. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If infos.pos is not provided (NULL, the default), this is a window in number of SNPs; otherwise it is a window in kb (physical distance). I recommend that you provide the positions if available.
exclude: Vector of SNP indices to exclude anyway. For example, this can be used to exclude long-range LD regions (see Price2008). Another use is thresholding with respect to p-values associated with S.
ncores: Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().
G: A FBM.code256 (typically <bigSNP>$genotypes).
You shouldn't have missing values. Also, remember to do quality control, e.g. some algorithms in this package won't work if you use SNPs with 0 MAF.
infos.chr: Vector of integers specifying each SNP's chromosome.
Typically <bigSNP>$map$chromosome.
infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.
Typically <bigSNP>$map$physical.pos.
is.size.in.bp: Deprecated.
nploidy: Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.
LD.regions: A data.frame with columns "Chr", "Start" and "Stop". Defaults to LD.wiki34.
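
As a sketch of how S and exclude might be combined, the example below ranks SNPs by the absolute value of a z-score and excludes SNPs above a p-value threshold. The z-scores are simulated purely for illustration (in practice they would come from an association analysis), and the 0.5 threshold is an arbitrary assumption.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G <- test$genotypes

# Hypothetical z-scores, simulated only for this sketch
set.seed(1)
z <- rnorm(ncol(G))
S <- abs(z)              # importance = distance from 0, hence abs()
pval <- 2 * pnorm(-S)    # two-sided p-values, assuming z ~ N(0, 1) under the null

ind.keep <- snp_clumping(
  G,
  infos.chr = test$map$chromosome,
  infos.pos = test$map$physical.pos,
  S = S,
  thr.r2 = 0.2,                  # default size is then 100 / 0.2 = 500
  exclude = which(pval > 0.5)    # arbitrary threshold for illustration
)
```

Because excluded SNPs are never kept, every index in ind.keep should have a p-value at or below the chosen threshold.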

Value

snp_clumping() (and bed_clumping()): SNP indices that are kept.
snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().
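
For instance, the indices returned by snp_indLRLDR() can be passed directly as exclude. This is a minimal sketch on the package's example data; the variable names CHR, POS and ind.excl are only illustrative.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G   <- test$genotypes
CHR <- test$map$chromosome
POS <- test$map$physical.pos

# SNPs located in long-range LD regions (default table: LD.wiki34)
ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)

# Clumping that never keeps a SNP from those regions
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         thr.r2 = 0.2, exclude = ind.excl)
```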

References

Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005

Examples

test <- snp_attachExtdata()
G <- test$genotypes

# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)

# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
#> [1] 0.7919419
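
The same kind of call can be sketched for bed_clumping(), which works on a mapped bed file instead of an FBM.code256. This assumes the example file example.bed shipped in the package's extdata directory; no expected output is shown because it was not run here.

```r
library(bigsnpr)

# Example bed file assumed to ship with the package
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

# clumping (prioritizing higher MAF, since S is not provided)
ind.keep <- bed_clumping(obj.bed, thr.r2 = 0.1)

# proportion of SNPs kept
length(ind.keep) / ncol(obj.bed)
```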