For a bigSNP:
snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).
snp_clumping() (and bed_clumping()): LD clumping. If you do not provide any statistic to rank SNPs, minor allele frequencies (MAFs) are used, making clumping similar to pruning.
snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.
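To make the clumping idea concrete, here is a minimal base-R sketch of one greedy clumping pass. It is an illustration under simplifying assumptions, not the implementation used by this package: clump_sketch() is a hypothetical helper, the window is counted in number of SNPs on a single chromosome, and the genotypes are held in a plain dense matrix.

```r
# Illustrative greedy clumping in base R (NOT the bigsnpr implementation).
# X: numeric genotype matrix (individuals x SNPs)
# S: importance statistic per SNP (larger = more important)
# size: window half-width, in number of SNPs
clump_sketch <- function(X, S, thr.r2 = 0.2, size = 500) {
  m <- ncol(X)
  keep <- rep(NA, m)                        # NA = not decided yet
  for (j in order(S, decreasing = TRUE)) {  # most important SNP first
    if (!is.na(keep[j])) next               # already discarded by a better SNP
    keep[j] <- TRUE
    lo <- max(1, j - size)
    hi <- min(m, j + size)
    for (k in setdiff(lo:hi, j)) {
      # discard undecided neighbours that are too correlated with SNP j
      r2 <- suppressWarnings(cor(X[, j], X[, k])^2)
      if (is.na(keep[k]) && isTRUE(r2 > thr.r2)) keep[k] <- FALSE
    }
  }
  which(keep)
}

# Toy data: SNP 1 is an exact copy of SNP 2 but has a lower statistic.
set.seed(1)
n <- 200
x1 <- rbinom(n, 2, 0.3)
X <- cbind(x1, x1, rbinom(n, 2, 0.4), rbinom(n, 2, 0.2))
S <- c(1, 3, 2, 0.5)
ind.keep <- clump_sketch(X, S, thr.r2 = 0.2, size = 2)
ind.keep
```

Because SNP 1 duplicates SNP 2 and ranks lower, it is discarded while SNP 2 is kept. Replacing S with MAFs gives the pruning-like behaviour described above.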
bed_clumping(
  obj.bed,
  ind.row = rows_along(obj.bed),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  exclude = NULL,
  ncores = 1
)

snp_clumping(
  G,
  infos.chr,
  ind.row = rows_along(G),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  infos.pos = NULL,
  is.size.in.bp = NULL,
  exclude = NULL,
  ncores = 1
)

snp_pruning(
  G,
  infos.chr,
  ind.row = rows_along(G),
  size = 49,
  is.size.in.bp = FALSE,
  infos.pos = NULL,
  thr.r2 = 0.2,
  exclude = NULL,
  nploidy = 2,
  ncores = 1
)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)
obj.bed: Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.
ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. Don't use negative indices.
S: A vector of column statistics which express the importance of each SNP (the more important the SNP, the greater its statistic should be). For example, if S follows the standard normal distribution, and "important" means significantly different from 0, you must use abs(S) instead. If not specified, MAFs are computed and used.
thr.r2: Threshold over the squared correlation between two SNPs. Default is 0.2.
size: For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If infos.pos is not provided (NULL, the default), this is a window in number of SNPs; otherwise it is a window in kb (physical distance). I recommend that you provide the positions if available.
exclude: Vector of SNP indices to exclude in any case. For example, this can be used to exclude long-range LD regions (see Price et al., 2008). Another use is thresholding with respect to p-values associated with S.
ncores: Number of cores used. The default (1) doesn't use parallelism. You may use bigstatsr::nb_cores().
G: An FBM.code256 (typically <bigSNP>$genotypes). It should not contain missing values. Also, remember to do quality control; e.g. some algorithms in this package won't work if you use SNPs with a MAF of 0.
infos.chr: Vector of integers specifying each SNP's chromosome. Typically <bigSNP>$map$chromosome.
infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP. Typically <bigSNP>$map$physical.pos.
is.size.in.bp: Deprecated.
nploidy: Number of trials, parameter of the binomial distribution. Default is 2, which corresponds to diploidy, such as for the human genome.
LD.regions: A data.frame with columns "Chr", "Start" and "Stop". Defaults to LD.wiki34.
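As an illustration of how a region table with columns "Chr", "Start" and "Stop" can be turned into SNP indices, here is a small base-R sketch. The region table and positions are made up, ind_in_regions() is a hypothetical helper, and treating both region boundaries as inclusive is an assumption; the package's own matching rule may differ.

```r
# Hypothetical region table with the documented column names (made-up values).
regions <- data.frame(Chr = c(1, 2), Start = c(100, 500), Stop = c(200, 900))

# Made-up chromosome and position for five SNPs.
infos.chr <- c(1, 1, 1, 2, 2)
infos.pos <- c(50, 150, 250, 600, 1000)

# Hypothetical helper: indices of SNPs falling in any region.
# Assumption: Start and Stop are both treated as inclusive.
ind_in_regions <- function(infos.chr, infos.pos, regions) {
  in_any <- mapply(function(chr, pos) {
    any(regions$Chr == chr & regions$Start <= pos & pos <= regions$Stop)
  }, infos.chr, infos.pos)
  which(in_any)
}

ind <- ind_in_regions(infos.chr, infos.pos, regions)
ind
```

Here the second SNP (chromosome 1, position 150) and the fourth SNP (chromosome 2, position 600) fall inside a region.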
snp_clumping() (and bed_clumping()): SNP indices that are kept.
snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().
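The two return values can be chained: the indices from snp_indLRLDR() go into the exclude argument of snp_clumping(). The sketch below uses only the signatures shown in the usage above together with the example data from snp_attachExtdata(). The requireNamespace() guard is an addition so that the snippet is skipped when bigsnpr is not installed, and the example data may contain few or no SNPs in these regions.

```r
# Runs only if bigsnpr is installed (guard added for this illustration).
if (requireNamespace("bigsnpr", quietly = TRUE)) {
  library(bigsnpr)

  test <- snp_attachExtdata()
  G    <- test$genotypes
  CHR  <- test$map$chromosome
  POS  <- test$map$physical.pos

  # Indices of SNPs located in long-range LD regions (default: LD.wiki34)
  ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)

  # Clumping that never keeps the excluded SNPs
  ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                           thr.r2 = 0.2, exclude = ind.excl)

  # Number of excluded SNPs among the kept ones
  length(intersect(ind.keep, ind.excl))
}
```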
Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005
test <- snp_attachExtdata()
G <- test$genotypes
# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)
# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
#> [1] 0.7919419
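The example above ranks SNPs by MAF. When association statistics are available, the S and exclude arguments can be prepared as sketched below in base R. The z-scores are made up for illustration, and the 0.05 cutoff is an arbitrary assumption, not a recommendation from the package.

```r
# Made-up GWAS summary statistics for six SNPs (illustration only).
z    <- c(-5.2, 0.4, 3.1, -0.3, 1.2, -2.4)  # signed z-scores
pval <- 2 * pnorm(-abs(z))                   # two-sided p-values

# "Important" means far from 0 in either direction, so rank on abs(z).
S <- abs(z)

# P-value thresholding: SNPs above the (arbitrary) cutoff are never kept.
ind.excl <- which(pval > 0.05)

order(S, decreasing = TRUE)  # order in which clumping would consider the SNPs
ind.excl
```

With real data, these vectors could then be passed as S = S and exclude = ind.excl in the snp_clumping() call.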