Stacked C+T (SCT) — SCT • bigsnpr

Polygenic Risk Scores for a grid of clumping and thresholding parameters.

Stacking over many Polygenic Risk Scores, corresponding to a grid of many different parameters for clumping and thresholding.

snp_grid_clumping(
  G,
  infos.chr,
  infos.pos,
  lpS,
  ind.row = rows_along(G),
  grid.thr.r2 = c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95),
  grid.base.size = c(50, 100, 200, 500),
  infos.imp = rep(1, ncol(G)),
  grid.thr.imp = 1,
  groups = list(cols_along(G)),
  exclude = NULL,
  ncores = 1
)

snp_grid_PRS(
  G,
  all_keep,
  betas,
  lpS,
  n_thr_lpS = 50,
  grid.lpS.thr = 0.9999 * seq_log(max(0.1, min(lpS, na.rm = TRUE)),
                                  max(lpS, na.rm = TRUE), n_thr_lpS),
  ind.row = rows_along(G),
  backingfile = tempfile(),
  type = c("float", "double"),
  ncores = 1
)

snp_grid_stacking(
  multi_PRS,
  y.train,
  alphas = c(1, 0.01, 1e-04),
  ncores = 1,
  ...
)

Arguments

G: An FBM.code256 (typically <bigSNP>$genotypes).
It should not contain missing values. Also remember to do quality control: for example, some algorithms in this package will not work with SNPs that have a MAF of 0.
infos.chr: Vector of integers specifying each SNP's chromosome.
Typically <bigSNP>$map$chromosome.
infos.pos: Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP.
Typically <bigSNP>$map$physical.pos.
lpS: Numeric vector of -log10(p-value) associated with betas.
ind.row: An optional vector of the row indices (individuals) that are used. If not specified, all rows are used.
Don't use negative indices.
grid.thr.r2: Grid of thresholds over the squared correlation between two SNPs for clumping. Default is c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95).
grid.base.size: Grid for base window sizes. Sizes are then computed as base.size / thr.r2 (in kb). Default is c(50, 100, 200, 500).
infos.imp: Vector of imputation scores. Default is all 1 if you do not provide it.
grid.thr.imp: Grid of thresholds over infos.imp (default is 1), but you should change it (e.g. c(0.3, 0.6, 0.9, 0.95)) if providing infos.imp.
groups: List of vectors of indices to define your own categories. This could be used e.g. to derive C+T scores using two different GWAS summary statistics, or to include other information such as functional annotations. Default just makes one group with all variants.
exclude: Vector of SNP indices to always exclude, whatever the grid parameters.
ncores: Number of cores used. Default doesn't use parallelism. You may use bigstatsr::nb_cores().
all_keep: Output of snp_grid_clumping() (indices passing clumping).
betas: Numeric vector of weights (effect sizes from GWAS) associated with each variant (column of G). If alleles are reversed for some variants, make sure to multiply the corresponding effects by -1.
n_thr_lpS: Length for default grid.lpS.thr. Default is 50.
grid.lpS.thr: Sequence of thresholds to apply on lpS. Default is a grid (of length n_thr_lpS) evenly spaced on a logarithmic scale, i.e. on a log-log scale for p-values.
backingfile: Prefix for the backingfiles in which the C+T scores are stored. Because the grid is typically large, the resulting matrix can be large too, so it is stored on disk. Default uses a temporary file.
type: Type of backingfile values. Either "float" (the default) or "double". Using "float" requires half the disk space.
multi_PRS: Output of snp_grid_PRS().
y.train: Vector of phenotypes. If there are two levels (binary 0/1), it uses bigstatsr::big_spLogReg() for stacking, otherwise bigstatsr::big_spLinReg().
alphas: Vector of values for grid-search. See bigstatsr::big_spLogReg(). Default for this function is c(1, 0.01, 0.0001).
...: Other parameters to be passed to bigstatsr::big_spLogReg() (or to bigstatsr::big_spLinReg() when y.train is not binary). For example, using covar.train, you can add covariates to the model alongside all C+T scores. You can also use pf.covar if you do not want to penalize these covariates.
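
The grid arguments above can be explored in plain R before running anything on genotype data. The following is a minimal sketch, not the package's internal code: it tabulates the clumping window sizes implied by base.size / thr.r2 for the default grids, and rebuilds the default grid.lpS.thr from the usage section. The helper seq_log_sketch() is a hypothetical base-R stand-in assumed to behave like seq_log(), and the lpS values are made up for illustration.

```r
# Default clumping grids, copied from the usage section
grid.thr.r2    <- c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.95)
grid.base.size <- c(50, 100, 200, 500)

# Window size (in kb) for every (base.size, thr.r2) pair: base.size / thr.r2
window_kb <- outer(grid.base.size, grid.thr.r2, FUN = "/")
dimnames(window_kb) <- list(base.size = grid.base.size, thr.r2 = grid.thr.r2)
print(round(window_kb))

# Hypothetical stand-in for seq_log(): values evenly spaced on a log scale
# (assumption: this mirrors what seq_log() returns)
seq_log_sketch <- function(from, to, length.out) {
  exp(seq(log(from), log(to), length.out = length.out))
}

# Made-up -log10(p-values), for illustration only
lpS <- c(0.2, 1.5, 3, 8, NA, 12)
n_thr_lpS <- 50

# Same expression as the default of grid.lpS.thr, with the stand-in helper
grid.lpS.thr <- 0.9999 * seq_log_sketch(
  max(0.1, min(lpS, na.rm = TRUE)),
  max(lpS, na.rm = TRUE),
  n_thr_lpS
)
print(range(grid.lpS.thr))
```

With the default grids, the smallest r2 threshold combined with the smallest base size gives a window of 50 / 0.01 = 5000 kb, so low r2 thresholds are paired with much wider windows than high ones.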

Value

snp_grid_PRS(): An FBM (matrix on disk) that stores the C+T scores for all parameters of the grid (and for each chromosome separately). It also stores as attributes the input parameters all_keep, betas, lpS and grid.lpS.thr that are also needed in snp_grid_stacking().
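
The three functions are meant to be chained: clumping, then scoring, then stacking. The following is a hedged sketch of that workflow, assuming bigsnpr is installed and using the small example dataset returned by snp_attachExtdata(). The summary statistics are computed here on the same individuals purely to keep the example self-contained; a real analysis would take betas and p-values from an external GWAS. The reduced grids, the training split and K = 4 are arbitrary choices made for speed, not recommendations.

```r
library(bigsnpr)  # also attaches bigstatsr

# Small example dataset shipped with bigsnpr
obj.bigSNP <- snp_attachExtdata()
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y01 <- obj.bigSNP$fam$affection - 1  # recode 1/2 as 0/1

# Illustration only: in practice, use effect sizes and p-values from an
# external GWAS rather than from the individuals analysed here.
gwas  <- big_univLogReg(G, y01)
betas <- gwas$estim
lpS   <- -predict(gwas)  # predict() returns log10(p-values) by default

# Arbitrary split between training and test individuals
set.seed(1)
ind.train <- sort(sample(nrow(G), 400))
ind.test  <- setdiff(rows_along(G), ind.train)

# Step 1: clumping over a (reduced) grid of r2 thresholds and window sizes
all_keep <- snp_grid_clumping(
  G, CHR, POS, lpS = lpS, ind.row = ind.train,
  grid.thr.r2 = c(0.05, 0.5), grid.base.size = c(100, 200),
  exclude = which(is.na(lpS))
)

# Step 2: C+T scores for every clumping set and every p-value threshold
multi_PRS <- snp_grid_PRS(
  G, all_keep, betas, lpS, ind.row = ind.train, n_thr_lpS = 10
)
dim(multi_PRS)  # one row per training individual, one column per score

# Step 3: stack all C+T scores with a penalized regression
# (K is passed on through `...`)
final_mod <- snp_grid_stacking(multi_PRS, y01[ind.train], K = 4)
str(final_mod, max.level = 1)
```

Keeping ind.test out of all three steps leaves a set of individuals on which the stacked model can be evaluated afterwards.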