Compute the PCA of a reference dataset and project a target dataset onto it.
bed_projectPCA(
obj.bed.ref,
obj.bed.new,
k = 10,
ind.row.new = rows_along(obj.bed.new),
ind.row.ref = rows_along(obj.bed.ref),
ind.col.ref = cols_along(obj.bed.ref),
strand_flip = TRUE,
join_by_pos = TRUE,
match.min.prop = 0.5,
build.new = "hg19",
build.ref = "hg19",
liftOver = NULL,
...,
verbose = TRUE,
ncores = 1
)
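As a hedged illustration of the signature above, the following sketch splits the small example dataset shipped with bigsnpr into two halves and projects one half onto the PCA of the other. Using the same bed file for both roles is an assumption made only to keep the example self-contained; in practice the reference and target files would differ. `strand_flip = FALSE` is set here because both halves come from the same file, so there is nothing to flip. The names `ind.ref`, `ind.new` and `proj` are illustrative.

```r
library(bigsnpr)

# Example bed file shipped with the package. It is used for both roles here,
# which is an illustrative shortcut and not a realistic setup.
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
obj.bed <- bed(bedfile)

# Split individuals: odd rows act as the reference, even rows as the target.
ind.ref <- seq(1, nrow(obj.bed), by = 2)
ind.new <- seq(2, nrow(obj.bed), by = 2)

proj <- bed_projectPCA(
  obj.bed.ref = obj.bed,
  obj.bed.new = obj.bed,
  k = 5,
  ind.row.new = ind.new,
  ind.row.ref = ind.ref,
  strand_flip = FALSE,  # same file on both sides, nothing to flip
  verbose = FALSE,
  ncores = 1
)

# Inspect the top-level structure of the returned list.
str(proj, max.level = 1)
dim(proj$OADP_proj)
```

The remaining arguments are left at their defaults, so both builds are taken to be hg19 and no liftOver executable is needed.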
Object of type bed, which is the mapping of the bed file of the reference data. Use obj.bed <- bed(bedfile) to get this object.
Object of type bed, which is the mapping of the bed file of the target data. Use obj.bed <- bed(bedfile) to get this object.
Number of principal components to compute and project.
Rows to be used in the target data. Default uses them all.
Rows to be used in the reference data. Default uses them all.
Columns to be potentially used in the reference data. Default uses all the ones in common with target data.
Whether to try flipping strands (default is TRUE). If so, ambiguous alleles A/T and C/G are removed.
Whether to join by chromosome and position (default), or instead by rsid.
Minimum proportion of variants in the smallest data to be matched, otherwise stops with an error. Default is 0.5 (50%), as shown in the usage above.
Genome build of the target data. Default is hg19.
Genome build of the reference data. Default is hg19.
Path to liftOver executable. Binaries can be downloaded at https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/liftOver for Mac and at https://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver for Linux.
Arguments passed on to bed_autoSVD
fun.scaling
A function with parameters X (or obj.bed), ind.row and ind.col, and that returns a data.frame with $center and $scale for the columns corresponding to ind.col, to scale each of their elements as follows:
$$\frac{X_{i,j} - center_j}{scale_j}.$$
Default uses binomial scaling. You can also provide your own center and scale by using bigstatsr::as_scaling_fun().
roll.size
Radius of rolling windows to smooth log-p-values. Default is 50.
int.min.size
Minimum number of consecutive outlier variants in order to be reported as a long-range LD region. Default is 20.
thr.r2
Threshold over the squared correlation between two variants. Default is 0.2. Use NA if you want to skip the clumping step.
alpha.tukey
Default is 0.1. The type-I error rate in outlier detection (that is further corrected for multiple testing).
min.mac
Minimum minor allele count (MAC) for variants to be included. Default is 10. Can actually be higher because of min.maf.
min.maf
Minimum minor allele frequency (MAF) for variants to be included. Default is 0.02. Can actually be higher because of min.mac.
max.iter
Maximum number of iterations of outlier detection. Default is 5.
size
For one SNP, window size around this SNP to compute correlations. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If not providing infos.pos (NULL, the default), this is a window in number of SNPs, otherwise it is a window in kb (genetic distance). I recommend that you provide the positions if available.
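The scaling formula given for fun.scaling can be sketched in base R. The helper `binom_scaling` below is hypothetical and written only for illustration: it assumes a plain numeric matrix of 0/1/2 genotypes without missing values, and it is not the package's own implementation. It uses the usual binomial scaling, where the allele frequency is p_j = mean_j / 2, center_j = 2 p_j and scale_j = sqrt(2 p_j (1 - p_j)).

```r
# Toy genotype matrix: 4 individuals (rows) x 3 variants (columns),
# coded as 0/1/2 copies of an allele. Filled column by column.
G <- matrix(c(0, 1, 2, 1,
              2, 2, 1, 0,
              0, 0, 1, 1), nrow = 4)

# Hypothetical helper mimicking the interface described for fun.scaling:
# takes X, ind.row and ind.col, returns a data.frame with center and scale.
binom_scaling <- function(X,
                          ind.row = seq_len(nrow(X)),
                          ind.col = seq_len(ncol(X))) {
  p <- colMeans(X[ind.row, ind.col, drop = FALSE]) / 2  # allele frequencies
  data.frame(center = 2 * p, scale = sqrt(2 * p * (1 - p)))
}

sc <- binom_scaling(G)

# Apply (X[i, j] - center[j]) / scale[j] to every element.
G_scaled <- sweep(sweep(G, 2, sc$center, "-"), 2, sc$scale, "/")

sc
colMeans(G_scaled)
```

Each scaled column has mean zero by construction, since the center is the column mean.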
Output some information on the iterations? Default is TRUE.
Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().
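The variant-matching options described above (joining by position, strand flipping with removal of ambiguous alleles, and a minimum matched proportion) can be illustrated with a small base R sketch. The helpers `complement` and `is_ambiguous`, the toy tables, and the choice of denominator for the proportion are all assumptions made for illustration. The sketch also ignores swapped reference/alternate alleles, so it is a simplification and not the package's matching code.

```r
# Hypothetical helpers, for illustration only.
complement <- function(a) chartr("ACGT", "TGCA", a)
is_ambiguous <- function(a0, a1) {
  paste(a0, a1) %in% c("A T", "T A", "C G", "G C")
}

# Toy variant tables for the reference and the target data.
ref <- data.frame(chr = c(1, 1, 2, 2), pos = c(100, 200, 300, 400),
                  a0 = c("A", "A", "C", "G"), a1 = c("G", "T", "G", "A"),
                  stringsAsFactors = FALSE)
new <- data.frame(chr = c(1, 1, 2, 2), pos = c(100, 200, 300, 500),
                  a0 = c("T", "A", "C", "G"), a1 = c("C", "T", "G", "A"),
                  stringsAsFactors = FALSE)
n_smallest <- min(nrow(ref), nrow(new))  # assumed denominator

# For A/T and C/G variants, a strand flip cannot be told apart from
# an allele swap, so they are dropped before matching.
ref_ok <- ref[!is_ambiguous(ref$a0, ref$a1), ]
new_ok <- new[!is_ambiguous(new$a0, new$a1), ]

# Join by chromosome and position.
m <- merge(ref_ok, new_ok, by = c("chr", "pos"), suffixes = c(".ref", ".new"))

# Keep variants whose alleles agree directly or after complementing.
same    <- m$a0.ref == m$a0.new & m$a1.ref == m$a1.new
flipped <- m$a0.ref == complement(m$a0.new) &
           m$a1.ref == complement(m$a1.new)
matched <- m[same | flipped, ]

# Proportion matched, to be compared against a threshold such as 0.5.
prop <- nrow(matched) / n_smallest
prop >= 0.5
```

In this toy case only the variant at chromosome 1, position 100 survives, and it matches after a strand flip (T/C on one side, A/G on the other).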
A list of 3 elements:
$obj.svd.ref: big_SVD object computed from reference data.
$simple_proj: simple projection of new data into space of reference PCA.
$OADP_proj: Online Augmentation, Decomposition, and Procrustes (OADP) projection of new data into space of reference PCA.
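To make the idea of a simple projection concrete, here is a minimal base R sketch on simulated data. It assumes that a simple projection means scaling the new data with the reference centers and scales and multiplying by the reference right singular vectors; the variable names and the use of the column standard deviation for scaling are illustrative choices. The OADP projection is not sketched here.

```r
set.seed(1)

# Toy data: 20 reference and 5 new individuals, 50 variants coded 0/1/2.
ref <- matrix(rbinom(20 * 50, 2, 0.3), nrow = 20)
new <- matrix(rbinom(5 * 50, 2, 0.3), nrow = 5)

# Keep variants that vary in the reference so that scaling is defined.
keep <- apply(ref, 2, sd) > 0
ref <- ref[, keep, drop = FALSE]
new <- new[, keep, drop = FALSE]

k <- 3
ctr <- colMeans(ref)      # centers estimated on the reference only
scl <- apply(ref, 2, sd)  # scales estimated on the reference only

# PCA of the reference via a truncated SVD of the scaled matrix.
ref_scaled <- sweep(sweep(ref, 2, ctr, "-"), 2, scl, "/")
svd_ref <- svd(ref_scaled, nu = k, nv = k)

# Reference PC scores: U %*% diag(d).
pcs_ref <- svd_ref$u %*% diag(svd_ref$d[1:k])

# Simple projection: scale the new data with the reference parameters,
# then multiply by the reference loadings V.
new_scaled <- sweep(sweep(new, 2, ctr, "-"), 2, scl, "/")
simple_proj <- new_scaled %*% svd_ref$v

dim(simple_proj)
```

Projecting the reference data in the same way recovers its own PC scores, which is a quick sanity check on the construction.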