Chapter 6 Genome-Wide Association Study (GWAS)
In {bigstatsr}, you can perform both standard linear and logistic regressions GWAS, using either big_univLinReg()
or big_univLogReg()
.
Function big_univLinReg()
should be very fast, while big_univLogReg()
is slower.
This type of association, where each variable is considered independently, can be performed for any type of FBM (i.e. it does not have to be a genotype matrix). This is why these two functions are in package {bigstatsr}, and not {bigsnpr}.
6.1 Example
Let us reuse the data prepared in 4.3 and the PCs in 5.2.
library(bigsnpr)
#> Loading required package: bigstatsr
<- snp_attach("tmp-data/GWAS_data_sorted_QC.rds") obj.bigsnp
<- readRDS("tmp-data/PCA_GWAS_data.rds")
obj.svd <- predict(obj.svd) PC
The clinical data includes age, sex, high-density lipoprotein (HDL)-cholesterol (hdl
), low-density lipoprotein (LDL)-cholesterol (ldl
), triglycerides (tg
) and coronary artery disease status (CAD
).
For the set of covariates, we will use sex, age, and the first 6 PCs:
<- cbind(as.matrix(obj.bigsnp$fam[c("sex", "age")]), PC[, 1:6]) covar
You probably should not account for other information such as cholesterol as they are heritable covariates (Aschard, Vilhjálmsson, Joshi, Price, & Kraft, 2015).
<- obj.bigsnp$genotypes
G <- obj.bigsnp$fam$CAD
y <- which(!is.na(y) & complete.cases(covar)) ind.gwas
To only use a subset of the data stored as an FBM (G
here), you should almost never make a copy of the data; instead, use parameters ind.row
(or ind.train
) and ind.col
to apply functions to a subset of the data.
Let us perform a case-control GWAS for CAD:
<- runonce::save_run(
gwas big_univLogReg(G, y[ind.gwas], ind.train = ind.gwas,
covar.train = covar[ind.gwas, ], ncores = nb_cores()),
file = "tmp-data/GWAS_CAD.rds")
#> For 1 columns, IRLS didn't converge; `glm` was used instead.
#> user system elapsed
#> 0.17 0.05 143.11
This takes about two minutes with 4 cores on my laptop. Note that big_univLinReg()
takes two seconds only, and should give very similar p-values, if you just need something quick.
plot(gwas)
<- obj.bigsnp$map$chromosome
CHR <- obj.bigsnp$map$physical.pos
POS snp_manhattan(gwas, CHR, POS, npoints = 50e3) +
::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red") ggplot2
Here, nothing is genome-wide significant because of the small sample size.
<- obj.bigsnp$fam$hdl
y2 <- which(!is.na(y2) & complete.cases(covar))
ind.gwas2 <- big_univLinReg(G, y2[ind.gwas2], ind.train = ind.gwas2,
gwas2 covar.train = covar[ind.gwas2, ], ncores = nb_cores())
snp_manhattan(gwas2, CHR, POS, npoints = 50e3) +
::geom_hline(yintercept = -log10(5e-8), linetype = 2, color = "red") ggplot2
Some other example code:
- GWAS in iPSYCH; you can perform the GWAS on multiple nodes in parallel that would each process a chunk of the variants only
- GWAS for very large data and multiple phenotypes; you should perform the GWAS for all phenotypes for a “small” chunk of columns to avoid repeated access from disk, and can process these chunks on multiple nodes in parallel
- some template for {future.batchtools} when using Slurm