vignettes/LDpred2.Rmd
Here we show how to compute polygenic risk scores using LDpred2.
This tutorial uses fake data, for educational purposes only. You should also look at the code of the paper, particularly the code to prepare summary statistics (including the quality control presented in the Methods section “Quality control of summary statistics” of the paper), the code to read BGEN files into the data format used by bigsnpr, the code to prepare LD matrices, and the code to run LDpred2 genome-wide.
In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in PRS-CS and the LDpred2 papers. Information about these variants can be retrieved with
info <- readRDS(url("https://github.com/privefl/bigsnpr/raw/master/data-raw/hm3_variants.rds"))
str(info)
## 'data.frame': 1120696 obs. of 8 variables:
## $ chr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rsid : chr "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
## $ pos : int 752721 754182 760912 768448 779322 838555 846808 853954 854250 861808 ...
## $ a1 : chr "A" "A" "C" "A" ...
## $ a0 : chr "G" "G" "T" "G" ...
## $ maf : num 0.161 0.1282 0.16 0.0925 0.1183 ...
## $ pos_hg19: int 762858 764319 771049 778585 789459 848692 856945 864091 864387 871945 ...
## $ pos_hg17: int 802721 804182 810912 818448 829322 888555 896808 903954 904250 912088 ...
Alternatively, we also provide an LD reference to be used directly, along with an example script showing how to use it, at https://doi.org/10.6084/m9.figshare.13034123.
Note that we now recommend running LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference this makes is in building the SFBM (the sparse LD matrix on disk): you need to build it so that it contains all variants genome-wide (see e.g. this code).
You can download data and unzip files in R. We store those files in a directory called "tmp-data" here.
You can see there how we generated these data from the 1000 Genomes project. Note that these data are for educational purposes only, not for use as a reference panel.
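As a minimal sketch of that step (the archive URL below is a placeholder, not the actual download link for these data):

dir.create("tmp-data", showWarnings = FALSE)
# Placeholder URL -- replace with the actual link given for the tutorial data
zip <- file.path("tmp-data", "public-data.zip")
download.file("https://example.com/public-data.zip", destfile = zip, mode = "wb")
unzip(zip, exdir = "tmp-data")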
First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics.
# Load packages bigsnpr and bigstatsr
library(bigsnpr)
## Loading required package: bigstatsr
## Warning: replacing previous import 'data.table:::=' by 'ggplot2:::=' when
## loading 'bigsnpr'
# Read from bed/bim/fam, it generates .bk and .rds files.
snp_readBed("tmp-data/public-data.bed")
## [1] "C:\\Users\\au639593\\Desktop\\bigsnpr\\tmp-data\\public-data.rds"
# Attach the "bigSNP" object in the R session
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
# See what the file looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
## List of 3
## $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 15 ..
## ..and 26 methods, of which 12 are possibly relevant:
## .. add_columns, as.FBM, bm, bm.desc, check_dimensions,
## .. check_write_permissions, copy#envRefClass, initialize,
## .. initialize#FBM, save, show#envRefClass, show#FBM
## $ fam :'data.frame': 559 obs. of 6 variables:
## ..$ family.ID : chr [1:559] "EUR_GBR" "EUR_GBR" "EUR_GBR" "EUR_GBR" ...
## ..$ sample.ID : chr [1:559] "HG00096" "HG00097" "HG00099" "HG00100" ...
## ..$ paternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ maternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ sex : int [1:559] 1 2 2 2 1 2 1 2 2 1 ...
## ..$ affection : int [1:559] 1 2 1 1 1 1 2 1 2 1 ...
## $ map :'data.frame': 130816 obs. of 6 variables:
## ..$ chromosome : int [1:130816] 2 2 2 2 2 2 2 2 2 2 ...
## ..$ marker.ID : chr [1:130816] "rs13400442" "rs7594567" "rs7597758" "..
## ..$ genetic.dist: int [1:130816] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ physical.pos: int [1:130816] 18506 21833 22398 28228 32003 32005 36..
## ..$ allele1 : chr [1:130816] "C" "G" "T" "A" ...
## ..$ allele2 : chr [1:130816] "T" "C" "C" "G" ...
## - attr(*, "class")= chr "bigSNP"
# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()
# Read external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
str(sumstats)
## 'data.frame': 130816 obs. of 10 variables:
## $ chromosome : int 2 2 2 2 2 2 2 2 2 2 ...
## $ marker.ID : chr "rs13400442" "rs7594567" "rs7597758" "rs13383216" ...
## $ physical.pos: int 18506 21833 22398 28228 32003 32005 36787 55237 56916 61687 ...
## $ allele1 : chr "C" "G" "T" "A" ...
## $ allele2 : chr "T" "C" "C" "G" ...
## $ beta : num -0.073 0.0439 -0.3325 -0.5445 -0.4881 ...
## $ beta_se : num 0.277 0.248 0.192 0.247 0.242 ...
## $ n_case : int 157 157 157 157 157 157 157 157 157 157 ...
## $ n_control : int 402 402 402 402 402 402 402 402 402 402 ...
## $ p : num 0.7925 0.8593 0.0846 0.028 0.0439 ...
We split the genotype data, using one part to choose hyper-parameters and the other part to evaluate statistical properties of the polygenic risk scores, such as the AUC. Here, 400 individuals are used as a validation set to tune hyper-parameters for LDpred2-grid; the remaining 159 individuals are used as a test set to evaluate the final models.
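The code below assumes two index vectors, ind.val and ind.test; a minimal sketch of one way to define them (the seed is arbitrary; rows_along() is from bigstatsr):

set.seed(1)
ind.val  <- sample(nrow(G), 400)             # 400 individuals for hyper-parameter tuning
ind.test <- setdiff(rows_along(G), ind.val)  # remaining 159 individuals for testing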
To match variants contained in genotype data and summary statistics, the variables "chr" (chromosome number), "pos" (genetic position), "a0" (reference allele) and "a1" (derived allele) should be available in the summary statistics and in the genotype data. These 4 variables are used to match variants between the two data frames.
sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
sumstats$n_case <- sumstats$n_control <- NULL
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "beta_se", "p", "n_eff")
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map)
## 130,816 variants to be matched.
## 18,932 ambiguous SNPs have been removed.
## Some duplicates were removed.
## 111,866 variants have been matched; 0 were flipped and 0 were reversed.
If no or few variants are actually flipped, you might want to disable the strand flipping option. Here, these are simulated data so all variants use the same strand and the same reference.
info_snp <- snp_match(sumstats, map, strand_flip = FALSE)
## 130,816 variants to be matched.
## Some duplicates were removed.
## 130,792 variants have been matched; 0 were flipped and 0 were reversed.
Some quality control on summary statistics is highly recommended (see paper).
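As a hedged sketch of the kind of check described in the paper (comparing standard deviations derived from the summary statistics with those of the genotypes; the thresholds follow the paper's code, and big_colstats() is from bigstatsr):

sd_val <- sqrt(big_colstats(G, ind.col = info_snp$`_NUM_ID_`, ncores = NCORES)$var)
sd_ss  <- with(info_snp, 2 / sqrt(n_eff * beta_se^2 + beta^2))
is_bad <- sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05
# Here the data are simulated, so we keep all variants and continue with `info_snp`.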
First, you need to compute correlations between variants. We recommend using a window size of 3 cM (see the LDpred2 paper).
POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = NCORES)
## indices in info_snp
ind.chr <- which(info_snp$chr == 2)
df_beta <- info_snp[ind.chr, c("beta", "beta_se", "n_eff")]
## indices in G
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                 infos.pos = POS2[ind.chr2], size = 3 / 1000)
tmp <- tempfile(tmpdir = "tmp-data")
if (packageVersion("bigsnpr") >= package_version("1.4.9") &&
    packageVersion("bigsparser") >= package_version("0.4.0")) {
  corr <- as_SFBM(corr0, tmp)
} else {
  corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"), tmp)
}
Here, we have built the LD matrix using variants from one chromosome only. In practice, you need to build it for variants from all chromosomes. Please look at the code linked at the beginning.
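The linked code is the authoritative version; purely as an illustration (not run in this tutorial, which continues with chromosome 2 only), a genome-wide construction could look roughly like this, assuming the add_columns() method of the SFBM as used in that code:

# Illustration only: build one SFBM containing variants from all chromosomes
tmp <- tempfile(tmpdir = "tmp-data")
corr <- NULL
for (chr in 1:22) {
  ind.chr <- which(info_snp$chr == chr)
  if (length(ind.chr) == 0) next  # this tutorial's data only has chromosome 2
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                   infos.pos = POS2[ind.chr2], size = 3 / 1000)
  if (is.null(corr)) {
    corr <- as_SFBM(corr0, tmp)
  } else {
    corr$add_columns(corr0, nrow(corr))
  }
}
# `df_beta` should then contain beta, beta_se and n_eff for *all* matched variants,
# in the same order as the columns of `corr`.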
(ldsc <- snp_ldsc2(corr0, df_beta))
## int h2
## 1.0000000 0.1146349
h2_est <- ldsc[["h2"]]
We can then run LDpred2-inf using this heritability estimate:
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_inf, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66347145 0.56739110 0.75341191 0.04784474
For LDpred2-grid, we recommend testing multiple values for h2 and p.
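The sequences whose values are printed below can be built, for example, like this (a sketch assuming candidate h2 values around the LDSC estimate and a log-spaced sequence for p; seq_log() is from bigsnpr):

(h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4))
(p_seq <- signif(seq_log(1e-4, 1, length.out = 17), 2))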
## [1] 0.0802 0.1146 0.1605
## [1] 0.00010 0.00018 0.00032 0.00056 0.00100 0.00180 0.00320 0.00560
## [9] 0.01000 0.01800 0.03200 0.05600 0.10000 0.18000 0.32000 0.56000
## [17] 1.00000
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
## p h2 sparse
## 1 0.00010 0.0802 FALSE
## 2 0.00018 0.0802 FALSE
## 3 0.00032 0.0802 FALSE
## 4 0.00056 0.0802 FALSE
## 5 0.00100 0.0802 FALSE
## 6 0.00180 0.0802 FALSE
## 7 0.00320 0.0802 FALSE
## 8 0.00560 0.0802 FALSE
## 9 0.01000 0.0802 FALSE
## 10 0.01800 0.0802 FALSE
## 11 0.03200 0.0802 FALSE
## 12 0.05600 0.0802 FALSE
## 13 0.10000 0.0802 FALSE
## 14 0.18000 0.0802 FALSE
## 15 0.32000 0.0802 FALSE
## 16 0.56000 0.0802 FALSE
## 17 1.00000 0.0802 FALSE
## 18 0.00010 0.1146 FALSE
## 19 0.00018 0.1146 FALSE
## 20 0.00032 0.1146 FALSE
## 21 0.00056 0.1146 FALSE
## 22 0.00100 0.1146 FALSE
## 23 0.00180 0.1146 FALSE
## 24 0.00320 0.1146 FALSE
## 25 0.00560 0.1146 FALSE
## 26 0.01000 0.1146 FALSE
## 27 0.01800 0.1146 FALSE
## 28 0.03200 0.1146 FALSE
## 29 0.05600 0.1146 FALSE
## 30 0.10000 0.1146 FALSE
## 31 0.18000 0.1146 FALSE
## 32 0.32000 0.1146 FALSE
## 33 0.56000 0.1146 FALSE
## 34 1.00000 0.1146 FALSE
## 35 0.00010 0.1605 FALSE
## 36 0.00018 0.1605 FALSE
## 37 0.00032 0.1605 FALSE
## 38 0.00056 0.1605 FALSE
## 39 0.00100 0.1605 FALSE
## 40 0.00180 0.1605 FALSE
## 41 0.00320 0.1605 FALSE
## 42 0.00560 0.1605 FALSE
## 43 0.01000 0.1605 FALSE
## 44 0.01800 0.1605 FALSE
## 45 0.03200 0.1605 FALSE
## 46 0.05600 0.1605 FALSE
## 47 0.10000 0.1605 FALSE
## 48 0.18000 0.1605 FALSE
## 49 0.32000 0.1605 FALSE
## 50 0.56000 0.1605 FALSE
## 51 1.00000 0.1605 FALSE
## 52 0.00010 0.0802 TRUE
## 53 0.00018 0.0802 TRUE
## 54 0.00032 0.0802 TRUE
## 55 0.00056 0.0802 TRUE
## 56 0.00100 0.0802 TRUE
## 57 0.00180 0.0802 TRUE
## 58 0.00320 0.0802 TRUE
## 59 0.00560 0.0802 TRUE
## 60 0.01000 0.0802 TRUE
## 61 0.01800 0.0802 TRUE
## 62 0.03200 0.0802 TRUE
## 63 0.05600 0.0802 TRUE
## 64 0.10000 0.0802 TRUE
## 65 0.18000 0.0802 TRUE
## 66 0.32000 0.0802 TRUE
## 67 0.56000 0.0802 TRUE
## 68 1.00000 0.0802 TRUE
## 69 0.00010 0.1146 TRUE
## 70 0.00018 0.1146 TRUE
## 71 0.00032 0.1146 TRUE
## 72 0.00056 0.1146 TRUE
## 73 0.00100 0.1146 TRUE
## 74 0.00180 0.1146 TRUE
## 75 0.00320 0.1146 TRUE
## 76 0.00560 0.1146 TRUE
## 77 0.01000 0.1146 TRUE
## 78 0.01800 0.1146 TRUE
## 79 0.03200 0.1146 TRUE
## 80 0.05600 0.1146 TRUE
## 81 0.10000 0.1146 TRUE
## 82 0.18000 0.1146 TRUE
## 83 0.32000 0.1146 TRUE
## 84 0.56000 0.1146 TRUE
## 85 1.00000 0.1146 TRUE
## 86 0.00010 0.1605 TRUE
## 87 0.00018 0.1605 TRUE
## 88 0.00032 0.1605 TRUE
## 89 0.00056 0.1605 TRUE
## 90 0.00100 0.1605 TRUE
## 91 0.00180 0.1605 TRUE
## 92 0.00320 0.1605 TRUE
## 93 0.00560 0.1605 TRUE
## 94 0.01000 0.1605 TRUE
## 95 0.01800 0.1605 TRUE
## 96 0.03200 0.1605 TRUE
## 97 0.05600 0.1605 TRUE
## 98 0.10000 0.1605 TRUE
## 99 0.18000 0.1605 TRUE
## 100 0.32000 0.1605 TRUE
## 101 0.56000 0.1605 TRUE
## 102 1.00000 0.1605 TRUE
# takes several minutes if you do not have many cores
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.col = ind.chr2)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val])$score
library(ggplot2)
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
  theme_bigstatsr() +
  geom_point() +
  geom_line() +
  scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
  facet_wrap(~ sparse, labeller = label_both) +
  labs(y = "GLM Z-Score", color = "h2") +
  theme(legend.position = "top", panel.spacing = unit(1, "lines"))

library(dplyr)
params %>%
  mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
  arrange(desc(score)) %>%
  mutate_at(c("score", "sparsity"), round, digits = 3) %>%
  slice(1:10)
## p h2 sparse score sparsity id
## 1 0.0180 0.1146 TRUE 4.372 0.565 78
## 2 0.0032 0.0802 TRUE 4.363 0.585 58
## 3 0.0018 0.0802 FALSE 4.361 0.000 6
## 4 0.0056 0.1146 TRUE 4.359 0.565 76
## 5 0.0018 0.1605 TRUE 4.358 0.560 91
## 6 0.0032 0.1146 TRUE 4.358 0.567 75
## 7 0.0560 0.0802 TRUE 4.357 0.586 63
## 8 0.0018 0.1146 FALSE 4.357 0.000 23
## 9 0.0056 0.0802 TRUE 4.357 0.578 59
## 10 0.0180 0.0802 TRUE 4.353 0.582 61
You can then choose the best model according to your preferred criterion (e.g. max AUC). Here, we use the Z-Score from the regression of the phenotype on the PRS, since we have found it more robust than the AUC. It also enables adjusting for covariates in this step, as sketched below.
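For instance, covariates could be included when scoring the grid models (covar.val below is a hypothetical matrix of covariates for the validation individuals; big_univLogReg() accepts covariates through its covar.train argument):

# Hypothetical covariates for the validation individuals (e.g. sex, age, PCs)
covar.val <- matrix(rnorm(2 * length(ind.val)), ncol = 2)
params$score_adj <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val],
                                   covar.train = covar.val)$score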
Also note that we separate sparse and non-sparse models here (and in the paper) to show that their predictive performance is similar. In practice, if you do not really care about sparsity, you could choose the best LDpred2-grid model among all sparse and non-sparse models.
best_grid_nosp <- params %>%
  mutate(id = row_number()) %>%
  filter(!sparse) %>%
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>%
  beta_grid[, .]
pred_nosp <- big_prodVec(G, best_grid_nosp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_nosp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66926124 0.57291055 0.76355229 0.04861706
best_grid_sp <- params %>%
  mutate(id = row_number()) %>%
  filter(sparse) %>%
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>%
  beta_grid[, .]
pred_sp <- big_prodVec(G, best_grid_sp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_sp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66839182 0.56919716 0.76002580 0.04851766
For LDpred2-auto, we recommend running many chains in parallel with different initial values for p (e.g. length.out = 30).
# takes a few minutes
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
                               ncores = NCORES)
str(multi_auto)
## List of 4
## $ :List of 5
## ..$ beta_est : num [1:52546] -1.66e-07 1.00e-07 -7.57e-07 -1.24e-06 -1.11e-06 ...
## ..$ p_est : num 0.00671
## ..$ h2_est : num 0.000265
## ..$ path_p_est : num [1:1500] 0.000165 0.000199 0.000146 0.000152 0.000202 ...
## ..$ path_h2_est: num [1:1500] 0.0455 0.0472 0.0405 0.0417 0.0482 ...
## $ :List of 5
## ..$ beta_est : num [1:52546] -7.62e-08 4.59e-08 -3.47e-07 -5.69e-07 -5.10e-07 ...
## ..$ p_est : num 0.032
## ..$ h2_est : num 0.000122
## ..$ path_p_est : num [1:1500] 0.00204 0.00218 0.00207 0.00199 0.00206 ...
## ..$ path_h2_est: num [1:1500] 0.11 0.132 0.112 0.106 0.121 ...
## $ :List of 5
## ..$ beta_est : num [1:52546] -2.90e-06 1.76e-06 -1.33e-05 -2.18e-05 -1.95e-05 ...
## ..$ p_est : num 0.0782
## ..$ h2_est : num 0.00465
## ..$ path_p_est : num [1:1500] 0.0438 0.0434 0.0444 0.0417 0.041 ...
## ..$ path_h2_est: num [1:1500] 0.1145 0.114 0.1029 0.1015 0.0977 ...
## $ :List of 5
## ..$ beta_est : num [1:52546] -1.75e-06 1.06e-06 -7.98e-06 -1.31e-05 -1.17e-05 ...
## ..$ p_est : num 0.889
## ..$ h2_est : num 0.00279
## ..$ path_p_est : num [1:1500] 0.898 0.901 0.901 0.902 0.9 ...
## ..$ path_h2_est: num [1:1500] 0.1144 0.1141 0.1105 0.1012 0.0974 ...
You should verify whether the chains “converged”; you can look at the path of each chain, as shown below. In the paper, we propose an automatic way to filter bad chains by comparing the scale of the resulting predictions (see this code, reproduced below).
The chains have not really converged here, probably because the data are so small.
auto <- multi_auto[[1]]
plot_grid(
  qplot(y = auto$path_p_est) +
    theme_bigstatsr() +
    geom_hline(yintercept = auto$p_est, col = "blue") +
    scale_y_log10() +
    labs(y = "p"),
  qplot(y = auto$path_h2_est) +
    theme_bigstatsr() +
    geom_hline(yintercept = auto$h2_est, col = "blue") +
    labs(y = "h2"),
  ncol = 1, align = "hv"
)

beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.chr2)
sc <- apply(pred_auto, 2, sd)
keep <- abs(sc - median(sc)) < 3 * mad(sc)
final_beta_auto <- rowMeans(beta_auto[, keep])
final_pred_auto <- rowMeans(pred_auto[, keep])
AUCBoot(final_pred_auto, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66237927 0.56728105 0.75282774 0.04728923
# Some cleaning
rm(corr); gc(); file.remove(paste0(tmp, ".sbk"))
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3433760 183.4 5787472 309.1 5787472 309.1
## Vcells 80517088 614.3 198228050 1512.4 182399515 1391.6
## [1] TRUE
We have seen how to run 3 versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome.
Note that we now recommend running LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference this makes is in building the SFBM (the sparse LD matrix on disk): you need to build it so that it contains all variants genome-wide (see e.g. this code).
Privé, F., Arbel, J., & Vilhjálmsson, B. J. (2020). LDpred2: better, faster, stronger. bioRxiv.