Polygenic Risk Scores for a grid of clumping and thresholding parameters.

Stacking over many Polygenic Risk Scores, corresponding to a grid of many different parameters for clumping and thresholding.

Usage

snp_grid_clumping(
  G,
  infos.chr,
  infos.pos,
  lpS,
  ind.row = rows_along(G),
  grid.thr.r2 = c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95),
  grid.base.size = c(50, 100, 200, 500),
  infos.imp = rep(1, ncol(G)),
  grid.thr.imp = 1,
  groups = list(cols_along(G)),
  exclude = NULL,
  ncores = 1
)

snp_grid_PRS(
  G,
  all_keep,
  betas,
  lpS,
  n_thr_lpS = 50,
  grid.lpS.thr = 0.9999 * seq_log(max(0.1, min(lpS)), max(lpS), n_thr_lpS),
  ind.row = rows_along(G),
  backingfile = tempfile(),
  type = c("float", "double"),
  ncores = 1
)

snp_grid_stacking(
  multi_PRS,
  y.train,
  alphas = c(1, 0.01, 1e-04),
  ncores = 1,
  ...
)

Arguments

G

An FBM.code256 (typically <bigSNP>$genotypes).
It must not contain missing values. Also, remember to do quality control; e.g. some algorithms in this package won't work on SNPs with a minor allele frequency (MAF) of 0.

infos.chr

Vector of integers specifying each SNP's chromosome.
Typically <bigSNP>$map$chromosome.

infos.pos

Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.
Typically <bigSNP>$map$physical.pos.

lpS

Numeric vector of -log10(p-value) associated with betas.

ind.row

An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.
Don't use negative indices.

grid.thr.r2

Grid of thresholds over the squared correlation between two SNPs for clumping. Default is c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95).

grid.base.size

Grid of base window sizes. Window sizes are then computed as base.size / thr.r2 (in kb), so a stricter (smaller) thr.r2 gives a larger window. Default is c(50, 100, 200, 500).
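
As an illustration of the base.size / thr.r2 rule, the following base R sketch expands the two default grids and computes the resulting window sizes. The data frame `grid` and its column names are chosen here for illustration only; any rounding the function may apply internally is not reproduced.

```r
# Default grids, copied from the usage above
grid.thr.r2    <- c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95)
grid.base.size <- c(50, 100, 200, 500)

# All combinations of base size and r2 threshold (illustrative layout)
grid <- expand.grid(base.size = grid.base.size, thr.r2 = grid.thr.r2)

# Window size in kb, following base.size / thr.r2
grid$size <- grid$base.size / grid$thr.r2

nrow(grid)         # number of (base.size, thr.r2) combinations
range(grid$size)   # smallest and largest window, in kb
```

With the defaults, the smallest window comes from base.size = 50 with thr.r2 = 0.95, and the largest from base.size = 500 with thr.r2 = 0.01.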

infos.imp

Vector of imputation scores. Default is 1 for every variant.

grid.thr.imp

Grid of thresholds over infos.imp. Default is 1; change it (e.g. to c(0.3, 0.6, 0.9, 0.95)) if you provide infos.imp.

groups

List of vectors of indices to define your own categories. This could be used e.g. to derive C+T scores using two different GWAS summary statistics, or to include other information such as functional annotations. Default just makes one group with all variants.

exclude

Vector of indices of SNPs to exclude regardless of the other parameters.

ncores

Number of cores used. Default is 1 (no parallelism). You may use nb_cores().

all_keep

Output of snp_grid_clumping() (indices passing clumping).

betas

Numeric vector of weights (effect sizes from GWAS) associated with each variant (column of G). If alleles are reversed, make sure to multiply corresponding effects by -1.
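
A minimal sketch of the sign flip for reversed alleles, in base R. The vectors `betas` and `reversed` below are made-up illustrative values; how you detect reversed alleles depends on your own matching step and is not shown.

```r
# Hypothetical GWAS effect sizes, one per variant (column of G)
betas <- c(0.12, -0.05, 0.30, 0.08)

# Hypothetical flags: TRUE where the effect allele in the summary
# statistics is the opposite of the allele counted in G
reversed <- c(FALSE, TRUE, FALSE, TRUE)

# Multiply the effects of reversed variants by -1, leave the others as-is
betas_aligned <- ifelse(reversed, -betas, betas)
betas_aligned
```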

n_thr_lpS

Length of the default grid.lpS.thr. Default is 50.

grid.lpS.thr

Sequence of thresholds to apply on lpS. Default is a grid (of length n_thr_lpS) evenly spaced on a logarithmic scale, i.e. on a log-log scale for p-values.
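
The default grid can be sketched in base R as follows. The helper `seq_log_sketch()` is a stand-in written for this example (a sequence evenly spaced on the log scale); it is assumed, not guaranteed, to match the package's seq_log(). The `lpS` values are made up.

```r
# Stand-in for seq_log(): evenly spaced on a logarithmic scale (assumption)
seq_log_sketch <- function(from, to, length.out) {
  exp(seq(log(from), log(to), length.out = length.out))
}

# Made-up -log10(p-values) for six variants
lpS <- c(0.02, 0.5, 1.3, 2.7, 4.1, 8.6)
n_thr_lpS <- 50

# Same expression as the default in the usage above
grid.lpS.thr <- 0.9999 *
  seq_log_sketch(max(0.1, min(lpS)), max(lpS), n_thr_lpS)

# Consecutive thresholds differ by a constant ratio
ratios <- grid.lpS.thr[-1] / grid.lpS.thr[-n_thr_lpS]
range(ratios)
```

Note that the grid starts no lower than 0.1 (times 0.9999), and that the 0.9999 factor keeps the largest threshold strictly below max(lpS).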

backingfile

Prefix for the backingfiles in which to store the C+T scores. As we typically use a large grid, this can result in a large matrix, which is why it is stored on disk. Default uses a temporary file.

type

Type of backingfile values. Either "float" (the default) or "double". Using "float" requires half the disk space.
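
To get a rough idea of the disk space involved, here is a back-of-the-envelope estimate in base R. All counts are hypothetical, and the column count assumes one score per chromosome, per clumping parameter set, and per p-value threshold, which is an interpretation of the Value section rather than a documented formula. The only firm numbers are 4 bytes per "float" and 8 bytes per "double".

```r
# Hypothetical dimensions
n_ind      <- 10000    # individuals in ind.row
n_chr      <- 22       # chromosomes
n_clumping <- 7 * 4    # default grid.thr.r2 x grid.base.size combinations
n_thr      <- 50       # default n_thr_lpS

# Assumed number of C+T scores (columns of the on-disk matrix)
n_scores <- n_chr * n_clumping * n_thr

# Approximate backingfile sizes in gigabytes
size_float_gb  <- n_ind * n_scores * 4 / 1e9
size_double_gb <- n_ind * n_scores * 8 / 1e9
c(float = size_float_gb, double = size_double_gb)
```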

multi_PRS

Output of snp_grid_PRS().

y.train

Vector of phenotypes. If there are two levels (binary 0/1), it uses big_spLogReg() for stacking, otherwise big_spLinReg().

alphas

Vector of values for grid-search. See big_spLogReg(). Default for this function is c(1, 0.01, 0.0001).

...

Other parameters to be passed to big_spLogReg(). For example, using covar.train, you can add covariates in the model with all C+T scores. You can also use pf.covar if you do not want to penalize these covariates.

Value

snp_grid_PRS(): An FBM (matrix on disk) that stores the C+T scores for all parameters of the grid (and for each chromosome separately). It also stores as attributes the input parameters all_keep, betas, lpS and grid.lpS.thr that are also needed in snp_grid_stacking().