For a bigSNP:

• snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).

• snp_clumping() (and bed_clumping()): LD clumping. If you do not provide any statistic to rank SNPs, minor allele frequencies (MAFs) are used instead, which makes clumping similar to pruning.

• snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.

## Usage

bed_clumping(
  obj.bed,
  ind.row = rows_along(obj.bed),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  exclude = NULL,
  ncores = 1
)

snp_clumping(
  G,
  infos.chr,
  ind.row = rows_along(G),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  infos.pos = NULL,
  is.size.in.bp = NULL,
  exclude = NULL,
  ncores = 1
)

snp_pruning(
  G,
  infos.chr,
  ind.row = rows_along(G),
  size = 49,
  is.size.in.bp = FALSE,
  infos.pos = NULL,
  thr.r2 = 0.2,
  exclude = NULL,
  nploidy = 2,
  ncores = 1
)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)

## Arguments

• obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

• ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. Don't use negative indices.

• S: A vector of column statistics which express the importance of each SNP (the more important the SNP, the greater the corresponding statistic should be). For example, if S follows the standard normal distribution, and "important" means significantly different from 0, you must use abs(S) instead. If not specified, MAFs are computed and used.

• thr.r2: Threshold over the squared correlation between two SNPs. Default is 0.2.

• size: For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If infos.pos is not provided (NULL, the default), this is a window in number of SNPs; otherwise it is a window in kb (genetic distance). I recommend that you provide the positions if available.

• exclude: Vector of SNP indices to exclude anyway. For example, it can be used to exclude long-range LD regions (see Price2008). Another use is thresholding with respect to p-values associated with S.

• ncores: Number of cores used. Default doesn't use parallelism. You may use nb_cores().

• G: A FBM.code256 (typically <bigSNP>$genotypes). You shouldn't have missing values. Also, remember to do quality control, e.g. some algorithms in this package won't work if you use SNPs with 0 MAF.

• infos.chr: Vector of integers specifying each SNP's chromosome. Typically <bigSNP>$map$chromosome.

• infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP. Typically <bigSNP>$map$physical.pos.

• is.size.in.bp: Deprecated.

• nploidy: Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.

• LD.regions: A data.frame with columns "Chr", "Start" and "Stop". Default uses the table of 34 long-range LD regions (LD.wiki34).
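To make the roles of S, thr.r2 and size more concrete, here is a minimal base R sketch of greedy clumping on an ordinary genotype matrix. It is only an illustration under simplifying assumptions (a single chromosome, a window counted in number of SNPs, no missing values); the helper toy_clumping() and the toy data are hypothetical and are not part of the package, nor a description of how the package implements clumping.

```r
# Toy greedy clumping in base R (illustrative sketch, NOT the package's code).
# X: numeric matrix of genotypes (individuals x SNPs), coded 0/1/2, no NAs.
toy_clumping <- function(X, S = NULL, thr.r2 = 0.2, size = 100 / thr.r2) {
  m <- ncol(X)
  if (is.null(S)) {
    # No statistic provided: rank SNPs by minor allele frequency
    af <- colMeans(X) / 2
    S <- pmin(af, 1 - af)
  }
  keep <- rep(FALSE, m)
  removed <- rep(FALSE, m)
  # Visit SNPs from the most to the least important
  for (j in order(S, decreasing = TRUE)) {
    if (removed[j]) next
    keep[j] <- TRUE
    # Neighbors within `size` SNPs on each side that are still undecided
    window <- setdiff(max(1, j - size):min(m, j + size), j)
    window <- window[!keep[window] & !removed[window]]
    if (length(window) > 0) {
      r2 <- drop(cor(X[, j], X[, window, drop = FALSE]))^2
      # Discard neighbors too correlated with the kept SNP
      removed[window[which(r2 > thr.r2)]] <- TRUE
    }
  }
  which(keep)
}

# Hypothetical toy data: SNP 2 is an exact copy of SNP 1
set.seed(1)
n <- 200
x1 <- rbinom(n, 2, 0.3)
x2 <- x1
x3 <- rbinom(n, 2, 0.5)
X <- cbind(x1, x2, x3)

# SNP 2 duplicates SNP 1 and has a lower statistic, so it is not kept
toy_clumping(X, S = c(3, 1, 2))
```

The sketch keeps the highest-ranked SNP first and then discards its correlated neighbors, which is why the choice of S decides which SNP of a correlated pair survives.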
## Value

• snp_clumping() (and bed_clumping()): SNP indices that are kept.

• snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().

## References

Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. http://dx.doi.org/10.1016/j.ajhg.2008.06.005

## Examples

test <- snp_attachExtdata()
G <- test$genotypes

# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)

# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
#> [1] 0.7919419
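
The 'exclude' parameter can be combined with snp_indLRLDR() to leave long-range LD regions out of the clumping. The following is a sketch of that workflow on the same example data; the variable names (CHR, POS, ind.excl, ind.keep2) are only illustrative, and whether any SNP of this simulated dataset actually falls in a long-range LD region is not assumed here.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G <- test$genotypes
CHR <- test$map$chromosome
POS <- test$map$physical.pos

# SNP indices located in the default long-range LD regions (possibly none)
ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)

# clumping (prioritizing higher MAF), never keeping the excluded SNPs
ind.keep2 <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                          thr.r2 = 0.1, exclude = ind.excl)

# no excluded SNP should remain among the kept ones
length(intersect(ind.keep2, ind.excl))
```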