Fast truncated SVD with initial pruning, followed by iterative removal of long-range LD regions.

snp_autoSVD(
  G,
  infos.chr,
  infos.pos = NULL,
  ind.row = rows_along(G),
  ind.col = cols_along(G),
  fun.scaling = snp_scaleBinom(),
  thr.r2 = 0.2,
  size = 100/thr.r2,
  k = 10,
  roll.size = 50,
  int.min.size = 20,
  alpha.tukey = 0.05,
  min.mac = 10,
  max.iter = 5,
  is.size.in.bp = NULL,
  ncores = 1,
  verbose = TRUE
)

bed_autoSVD(
  obj.bed,
  ind.row = rows_along(obj.bed),
  ind.col = cols_along(obj.bed),
  fun.scaling = bed_scaleBinom,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  k = 10,
  roll.size = 50,
  int.min.size = 20,
  alpha.tukey = 0.05,
  min.mac = 10,
  max.iter = 5,
  ncores = 1,
  verbose = TRUE
)

Arguments

G

A FBM.code256 (typically <bigSNP>$genotypes).
G should not contain missing values. Also remember to perform quality control; for example, some algorithms in this package won't work with SNPs that have a MAF of 0.

infos.chr

Vector of integers specifying each SNP's chromosome.
Typically <bigSNP>$map$chromosome.

infos.pos

Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.
Typically <bigSNP>$map$physical.pos.

ind.row

An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.
Don't use negative indices.

ind.col

An optional vector of the column indices (SNPs) that are used. If not specified, all columns are used.
Don't use negative indices.

fun.scaling

A function that returns a named list of mean and sd for every column, used to scale each element as follows: $$\frac{X_{i,j} - mean_j}{sd_j}.$$ Default is snp_scaleBinom().
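
To make the scaling formula concrete, here is a minimal base-R sketch of binomial scaling on a toy genotype matrix. It assumes that the default scaling centers each SNP by 2p and divides by sqrt(2p(1 - p)), where p is the estimated allele frequency. The helper scale_binom_sketch() is hypothetical: it is not part of the package and is not a drop-in replacement for fun.scaling.

```r
# Toy genotype matrix: 6 individuals (rows) x 3 SNPs (columns), coded 0/1/2.
G_toy <- matrix(c(0, 1, 2, 1, 0, 1,
                  2, 2, 1, 1, 0, 2,
                  0, 0, 1, 0, 1, 0), nrow = 6)

# Hypothetical helper (illustration only): binomial scaling statistics.
scale_binom_sketch <- function(X) {
  p <- colMeans(X) / 2                          # estimated allele frequency
  list(center = 2 * p,                          # mean of Binomial(2, p)
       scale  = sqrt(2 * p * (1 - p)))          # sd of Binomial(2, p)
}

stats <- scale_binom_sketch(G_toy)

# Apply (X[i, j] - center[j]) / scale[j] to every column j.
G_scaled <- sweep(sweep(G_toy, 2, stats$center, "-"), 2, stats$scale, "/")

# Column means of the scaled matrix are zero up to floating-point error.
colMeans(G_scaled)
```

A SNP with p equal to 0 or 1 would get a scale of 0 and produce a division by zero, which is one reason monomorphic SNPs need to be removed beforehand.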

thr.r2

Threshold over the squared correlation between two SNPs. Default is 0.2. Use NA if you want to skip the clumping step.

size

For one SNP, the window size around this SNP in which correlations are computed. Default is 100 / thr.r2 for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If infos.pos is not provided (NULL, the default), this is a window in number of SNPs; otherwise it is a window in kb (physical distance). I recommend that you provide the positions if available.

k

Number of singular vectors/values to compute. Default is 10. This algorithm should be used to compute a few singular vectors/values.

roll.size

Radius of rolling windows to smooth log-p-values. Default is 50.

int.min.size

Minimum number of consecutive outlier SNPs required for them to be reported as a long-range LD region. Default is 20.

alpha.tukey

The type-I error rate in outlier detection (that is further corrected for multiple testing). Default is 0.05.

min.mac

Minimum minor allele count (MAC) for variants to be included. Default is 10.
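
As an illustration of the MAC filter, the following base-R sketch computes minor allele counts on a toy matrix, assuming diploid genotypes coded as 0/1/2 copies of one allele. The variable names and the toy threshold are made up for this example; the package's internal computation may differ.

```r
# Toy genotypes: 5 individuals (rows) x 3 variants (columns), coded 0/1/2.
G_toy <- matrix(c(0, 0, 1, 0, 0,
                  2, 2, 2, 1, 2,
                  1, 1, 0, 2, 1), nrow = 5)

n_alleles <- 2 * nrow(G_toy)              # diploid: two alleles per individual
allele_count <- colSums(G_toy)            # copies of the counted allele
mac <- pmin(allele_count, n_alleles - allele_count)  # count of the rarer allele

mac
#> [1] 1 1 5

min_mac_toy <- 2                          # toy threshold (the default here is 10)
ind_keep <- which(mac >= min_mac_toy)     # variants with MAC < threshold are discarded
ind_keep
#> [1] 3
```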

max.iter

Maximum number of iterations of outlier detection. Default is 5.

is.size.in.bp

Deprecated.

ncores

Number of cores used. Default is 1 (no parallelism). You may use nb_cores().

verbose

Output some information on the iterations? Default is TRUE.

obj.bed

Object of type bed, which is the mapping of some bed file. Use obj.bed <- bed(bedfile) to get this object.

Value

A named list (an S3 class "big_SVD") of

  • d, the singular values,

  • u, the left singular vectors,

  • v, the right singular vectors,

  • niter, the number of iterations of the algorithm,

  • nops, the number of matrix-vector multiplications used,

  • center, the centering vector,

  • scale, the scaling vector.

Note that to obtain the Principal Components, you must use predict on the result. See examples.
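
For intuition on how principal component scores relate to the elements listed above, here is a minimal base-R sketch using svd() on a small simulated matrix instead of a genotype matrix. It relies on the standard identity that the scores equal U multiplied by the singular values, which is also the projection of the scaled matrix onto the right singular vectors. My understanding is that calling predict on the result without new data returns the same quantity, but treat that as an assumption and check the predict method's documentation.

```r
set.seed(1)

# Toy scaled matrix: 50 observations x 8 variables (stands in for scaled genotypes).
X <- scale(matrix(rnorm(50 * 8), nrow = 50))

k <- 3                                   # number of components to keep
s <- svd(X, nu = k, nv = k)              # s$d holds all singular values

# PC scores as U %*% diag(d): multiply each column of u by its singular value.
pcs_ud <- sweep(s$u, 2, s$d[1:k], "*")

# The same scores, obtained by projecting X onto the right singular vectors.
pcs_xv <- X %*% s$v

all.equal(pcs_ud, pcs_xv, check.attributes = FALSE)
```

The columns of u alone have unit norm, so they lose the relative importance of each component; multiplying by d restores it.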

Details

If you don't have any information about SNPs, you can try using

  • infos.chr = rep(1, ncol(G)),

  • size = ncol(G) (if SNPs are not sorted),

  • roll.size = 0 (if SNPs are not sorted).

Examples

ex <- snp_attachExtdata()

obj.svd <- snp_autoSVD(G = ex$genotypes,
                       infos.chr = ex$map$chromosome,
                       infos.pos = ex$map$physical.position)
#> 
#> Phase of clumping (on MAF) at r^2 > 0.2.. keep 4270 SNPs.
#> Discarding 0 variant with MAC < 10.
#> 
#> Iteration 1:
#> Computing SVD..
#> 0 outlier variant detected..
#> 
#> Converged!

str(obj.svd)
#> List of 7
#>  $ d     : num [1:10] 235.4 148 105.5 96.4 94.9 ...
#>  $ u     : num [1:517, 1:10] 0.0801 0.0798 0.0646 0.0781 0.0818 ...
#>  $ v     : num [1:4270, 1:10] -0.00174 0.03142 -0.01527 0.0132 0.0154 ...
#>  $ niter : num 10
#>  $ nops  : num 170
#>  $ center: num [1:4270] 0.412 0.474 0.369 0.913 0.712 ...
#>  $ scale : num [1:4270] 0.572 0.601 0.549 0.704 0.677 ...
#>  - attr(*, "class")= chr "big_SVD"
#>  - attr(*, "subset")= int [1:4270] 2 3 4 5 6 7 8 9 10 11 ...
#>  - attr(*, "lrldr")='data.frame': 0 obs. of 3 variables:
#>   ..$ Chr  : int(0)
#>   ..$ Start: int(0)
#>   ..$ Stop : int(0)