For a bigSNP:

  • snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).

  • snp_clumping() (and bed_clumping()): LD clumping. If you do not provide a statistic to rank SNPs, minor allele frequencies (MAFs) are used, which makes clumping similar to pruning.

  • snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.
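To make the idea of clumping concrete, here is a minimal base-R sketch of greedy clumping on a tiny simulated genotype matrix. It is an illustration only, not the package's implementation: the names clump_sketch, geno and maf are hypothetical, the window is counted in number of SNPs, and the full correlation matrix is computed, which is only feasible for very small data.

```r
# Greedy clumping sketch (hypothetical helper, NOT the bigsnpr implementation).
# geno: numeric matrix, rows = individuals, columns = SNPs (no missing values).
# S:    one importance statistic per SNP (larger = more important).
clump_sketch <- function(geno, S, thr.r2 = 0.2, size = 500) {
  m <- ncol(geno)
  r2 <- cor(geno)^2                  # squared correlations between all SNP pairs
  remaining <- rep(TRUE, m)          # SNPs not yet kept or removed
  kept <- integer(0)
  for (j in order(S, decreasing = TRUE)) {
    if (!remaining[j]) next          # already removed by a more important SNP
    kept <- c(kept, j)
    in_window <- abs(seq_len(m) - j) <= size
    # remove SNP j itself and every SNP in the window that is too correlated with it
    remaining[in_window & r2[j, ] > thr.r2] <- FALSE
  }
  sort(kept)
}

set.seed(1)
n <- 200
base <- rbinom(n, 2, 0.3)
# columns 1 and 2 are identical (r2 = 1); columns 3 and 4 are independent draws
geno <- cbind(base, base, rbinom(n, 2, 0.4), rbinom(n, 2, 0.2))

# rank by MAF, mirroring what happens when no statistic S is provided
maf <- colMeans(geno) / 2
maf <- pmin(maf, 1 - maf)
kept <- clump_sketch(geno, S = maf, thr.r2 = 0.2)
kept   # only one of the two duplicated columns is kept
```

Because SNPs are visited from most to least important, the SNP that survives in each correlated group is the one with the largest statistic.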

bed_clumping(
  obj.bed,
  ind.row = rows_along(obj.bed),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  exclude = NULL,
  ncores = 1
)

snp_clumping(
  G,
  infos.chr,
  ind.row = rows_along(G),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  infos.pos = NULL,
  is.size.in.bp = NULL,
  exclude = NULL,
  ncores = 1
)

snp_pruning(
  G,
  infos.chr,
  ind.row = rows_along(G),
  size = 49,
  is.size.in.bp = FALSE,
  infos.pos = NULL,
  thr.r2 = 0.2,
  exclude = NULL,
  nploidy = 2,
  ncores = 1
)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)

Arguments

obj.bed

Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

ind.row

An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.
Don't use negative indices.

S

A vector of column statistics that express the importance of each SNP (the more important the SNP, the greater its statistic should be).
For example, if S follows the standard normal distribution and "important" means significantly different from 0, use abs(S) instead.
If not specified, MAFs are computed and used.
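The following base-R sketch shows why the sign matters when ranking. The values in S are made up for illustration (think of them as Z-scores, one per SNP).

```r
# Hypothetical Z-scores for 5 SNPs; SNP 1 has the strongest signal, but it is negative
S <- c(-4.1, 0.3, 2.5, -0.2, 1.0)

rank_raw <- order(S, decreasing = TRUE)       # ranks SNP 1 last
rank_abs <- order(abs(S), decreasing = TRUE)  # ranks SNP 1 first

rank_raw  # 3 5 2 4 1
rank_abs  # 1 3 5 2 4
```

Passing the raw S would therefore treat the strongly negative SNP as the least important one, whereas abs(S) ranks it first.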

thr.r2

Threshold over the squared correlation between two SNPs. Default is 0.2.

size

For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If infos.pos is not provided (NULL, the default), this is a window in number of SNPs; otherwise it is a window in kb (physical distance, since infos.pos is in base pairs). I recommend providing the positions if available.
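As a quick check of the default, and of what a window in kb means for positions given in base pairs, consider this base-R sketch. The positions are hypothetical, and treating the window boundary as inclusive is an assumption made here for illustration.

```r
# Default window size as a function of the r2 threshold
thr.r2 <- c(0.2, 0.1, 0.5)
default_size <- 100 / thr.r2
default_size  # 500, 1000, 200

# With positions in bp and a window in kb, convert kb to bp before comparing
pos <- c(100000, 350000, 700000)  # hypothetical physical positions (bp)
size_kb <- 500
in_window <- abs(pos - pos[1]) <= size_kb * 1000  # inclusive bound is an assumption
in_window  # TRUE TRUE FALSE
```

A stricter threshold (smaller thr.r2) thus comes with a wider default window.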

exclude

Vector of SNP indices that are always excluded. For example, this can be used to exclude long-range LD regions (see Price2008). Another use is thresholding on the p-values associated with S.

ncores

Number of cores used. Default is 1 (no parallelism). You may use bigstatsr::nb_cores().

G

A FBM.code256 (typically <bigSNP>$genotypes).
You shouldn't have missing values. Also, remember to do quality control, e.g. some algorithms in this package won't work if you use SNPs with 0 MAF.

infos.chr

Vector of integers specifying each SNP's chromosome.
Typically <bigSNP>$map$chromosome.

infos.pos

Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.
Typically <bigSNP>$map$physical.pos.

is.size.in.bp

Deprecated.

nploidy

Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.

LD.regions

A data.frame with columns "Chr", "Start" and "Stop". Defaults to LD.wiki34.
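The expected shape of LD.regions can be sketched in base R as follows. The region boundaries and SNP positions are made-up values, and the lookup only illustrates what "SNP indices of regions" means; it is not the code used by snp_indLRLDR(), and treating both boundaries as inclusive is an assumption.

```r
# Hypothetical custom regions, with the three required columns
my_regions <- data.frame(
  Chr   = c(6, 8),
  Start = c(25e6, 8e6),
  Stop  = c(35e6, 12e6)
)

# Hypothetical chromosome and position (bp) of 5 SNPs
chr <- c(6, 6, 8, 8, 1)
pos <- c(30e6, 40e6, 9e6, 13e6, 30e6)

# A SNP is flagged if it lies within any region on its own chromosome
in_region <- vapply(seq_along(chr), function(j) {
  any(my_regions$Chr == chr[j] &
        my_regions$Start <= pos[j] & pos[j] <= my_regions$Stop)
}, logical(1))

ind_in_region <- which(in_region)
ind_in_region  # 1 3
```

Note that SNP 5 has the same position as SNP 1 but sits on another chromosome, so it is not flagged.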

Value

  • snp_clumping() (and bed_clumping()): SNP indices that are kept.

  • snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().
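A short sketch of how the two return values can be combined, using only the functions listed in Usage and the example data from snp_attachExtdata(). It assumes that bigsnpr is installed; the variable names (CHR, POS, ind.excl) are illustrative.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G   <- test$genotypes
CHR <- test$map$chromosome
POS <- test$map$physical.pos

# indices of SNPs in long-range LD regions (may be empty for this small dataset)
ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)

# clumping that never keeps the excluded SNPs
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         thr.r2 = 0.2, exclude = ind.excl)

# none of the excluded SNPs should be among the kept ones
length(intersect(ind.keep, ind.excl))
```

Other index vectors (for example, SNPs failing a p-value threshold) can be merged into exclude with union().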

References

Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005

Examples

test <- snp_attachExtdata()
G <- test$genotypes

# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)

# most SNPs are kept -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
#> [1] 0.7919419