Here we show how to compute polygenic risk scores using LDpred2.

This tutorial uses fake data for educational purposes only. You should also look at the code of the paper, particularly the code to prepare summary statistics (including the quality control presented in the Methods section “Quality control of summary statistics” of the paper), the code to read BGEN files into the data format used by bigsnpr, the code to prepare LD matrices, and the code to run LDpred2 genome-wide.

In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in PRS-CS and the LDpred2 papers. Information about these variants can be retrieved with

info <- readRDS(url("https://github.com/privefl/bigsnpr/raw/master/data-raw/hm3_variants.rds"))
str(info)
## 'data.frame':    1120696 obs. of  8 variables:
##  $ chr     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ rsid    : chr  "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
##  $ pos     : int  752721 754182 760912 768448 779322 838555 846808 853954 854250 861808 ...
##  $ a1      : chr  "A" "A" "C" "A" ...
##  $ a0      : chr  "G" "G" "T" "G" ...
##  $ maf     : num  0.161 0.1282 0.16 0.0925 0.1183 ...
##  $ pos_hg19: int  762858 764319 771049 778585 789459 848692 856945 864091 864387 871945 ...
##  $ pos_hg17: int  802721 804182 810912 818448 829322 888555 896808 903954 904250 912088 ...
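For instance, the genotyped variants can be restricted to these HapMap3 variants by matching on chromosome and position. Below is a minimal base-R sketch with toy stand-ins for `info` and the genotype map (in practice, use the real objects and subset the genotype data with e.g. `snp_subset()`):

```r
# Toy stand-ins for the real 'info' (HapMap3 variants) and the genotype map
info <- data.frame(chr = c(1, 1, 2), pos = c(752721, 754182, 18506))
map  <- data.frame(chr = c(1, 2, 2), pos = c(752721, 18506, 99999))

# Match on a "chr pos" key; the resulting indices are the variants to keep
in_hm3 <- paste(map$chr, map$pos) %in% paste(info$chr, info$pos)
ind_keep <- which(in_hm3)
ind_keep  # variants 1 and 2 here
```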

Alternatively, we also provide an LD reference that can be used directly, along with an example script showing how to use it, at https://doi.org/10.6084/m9.figshare.13034123.

Note that we now recommend running LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference is when building the SFBM (the sparse LD matrix on disk): you need to build it so that it contains all variants genome-wide (see e.g. this code).
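Building such a genome-wide SFBM can be sketched as follows (an untested sketch using the objects `G`, `POS2`, `info_snp` and `NCORES` constructed later in this tutorial; the key point is appending the correlation matrix of each chromosome to the same on-disk SFBM):

```r
# Sketch: build ONE sparse LD matrix (SFBM) containing all variants genome-wide,
# by appending the per-chromosome correlation matrices to the same backing file.
tmp <- tempfile(tmpdir = "tmp-data")
corr <- NULL
for (chr in 1:22) {
  ind.chr  <- which(info_snp$chr == chr)    # indices in info_snp
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]  # corresponding columns of G
  corr0 <- snp_cor(G, ind.col = ind.chr2, size = 3 / 1000,
                   infos.pos = POS2[ind.chr2], ncores = NCORES)
  if (is.null(corr)) {
    corr <- as_SFBM(corr0, tmp)             # initialize with the first chromosome
  } else {
    corr$add_columns(corr0, nrow(corr))     # append the next chromosome
  }
}
```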

Downloading genotype data and summary statistics

You can download the data and unzip the files in R. We store those files in a directory called "tmp-data" here.

You can see there how we generated these data from the 1000 Genomes project. Note that these data are for educational purposes only, not for use as a reference panel.
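A minimal sketch of downloading and unzipping in R (the URL below is a placeholder, not the actual link to the tutorial data):

```r
# Download the tutorial data into "tmp-data" and unzip it there
# (hypothetical URL; replace with the actual download link)
dir.create("tmp-data", showWarnings = FALSE)
zip <- "tmp-data/public-data.zip"
if (!file.exists(zip))
  download.file("https://example.com/public-data.zip", destfile = zip)
unzip(zip, exdir = "tmp-data")
```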

First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics.

# Load packages bigsnpr and bigstatsr
library(bigsnpr)
## Loading required package: bigstatsr
## Warning: replacing previous import 'data.table:::=' by 'ggplot2:::=' when
## loading 'bigsnpr'
# Read from bed/bim/fam, it generates .bk and .rds files.
snp_readBed("tmp-data/public-data.bed")
## [1] "C:\\Users\\au639593\\Desktop\\bigsnpr\\tmp-data\\public-data.rds"
# Attach the "bigSNP" object in R session
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
# See what the file looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
## List of 3
##  $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 15 ..
##   ..and 26 methods, of which 12 are  possibly relevant:
##   ..  add_columns, as.FBM, bm, bm.desc, check_dimensions,
##   ..  check_write_permissions, copy#envRefClass, initialize,
##   ..  initialize#FBM, save, show#envRefClass, show#FBM
##  $ fam      :'data.frame':   559 obs. of  6 variables:
##   ..$ family.ID  : chr [1:559] "EUR_GBR" "EUR_GBR" "EUR_GBR" "EUR_GBR" ...
##   ..$ sample.ID  : chr [1:559] "HG00096" "HG00097" "HG00099" "HG00100" ...
##   ..$ paternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ maternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ sex        : int [1:559] 1 2 2 2 1 2 1 2 2 1 ...
##   ..$ affection  : int [1:559] 1 2 1 1 1 1 2 1 2 1 ...
##  $ map      :'data.frame':   130816 obs. of  6 variables:
##   ..$ chromosome  : int [1:130816] 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ marker.ID   : chr [1:130816] "rs13400442" "rs7594567" "rs7597758" "..
##   ..$ genetic.dist: int [1:130816] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ physical.pos: int [1:130816] 18506 21833 22398 28228 32003 32005 36..
##   ..$ allele1     : chr [1:130816] "C" "G" "T" "A" ...
##   ..$ allele2     : chr [1:130816] "T" "C" "C" "G" ...
##  - attr(*, "class")= chr "bigSNP"
# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()
# Read external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
str(sumstats)
## 'data.frame':    130816 obs. of  10 variables:
##  $ chromosome  : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ marker.ID   : chr  "rs13400442" "rs7594567" "rs7597758" "rs13383216" ...
##  $ physical.pos: int  18506 21833 22398 28228 32003 32005 36787 55237 56916 61687 ...
##  $ allele1     : chr  "C" "G" "T" "A" ...
##  $ allele2     : chr  "T" "C" "C" "G" ...
##  $ beta        : num  -0.073 0.0439 -0.3325 -0.5445 -0.4881 ...
##  $ beta_se     : num  0.277 0.248 0.192 0.247 0.242 ...
##  $ n_case      : int  157 157 157 157 157 157 157 157 157 157 ...
##  $ n_control   : int  402 402 402 402 402 402 402 402 402 402 ...
##  $ p           : num  0.7925 0.8593 0.0846 0.028 0.0439 ...

We split the genotype data, using one part to choose hyper-parameters and the other to evaluate statistical properties of the polygenic risk score, such as AUC. Here, 400 individuals are used as a validation set to tune hyper-parameters for LDpred2-grid, and the remaining 159 individuals are used as a test set to evaluate the final models.

set.seed(1)
ind.val <- sample(nrow(G), 400)
ind.test <- setdiff(rows_along(G), ind.val)

Matching variants between genotype data and summary statistics

To match the variants contained in the genotype data with those in the summary statistics, the variables "chr" (chromosome number), "pos" (physical position), "a0" (reference allele) and "a1" (derived allele) must be available in both the summary statistics and the genotype data. These four variables are used to match variants between the two data frames.

sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
sumstats$n_case <- sumstats$n_control <- NULL
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "beta_se", "p", "n_eff")
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map)
## 130,816 variants to be matched.
## 18,932 ambiguous SNPs have been removed.
## Some duplicates were removed.
## 111,866 variants have been matched; 0 were flipped and 0 were reversed.
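As a sanity check, the effective sample size formula used above gives, for this case-control GWAS with 157 cases and 402 controls:

```r
# Effective sample size for a case-control GWAS: 4 / (1/n_case + 1/n_control)
n_eff <- 4 / (1 / 157 + 1 / 402)
round(n_eff, 1)  # about 451.6, i.e. equivalent to ~226 cases and ~226 controls
```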

If no or few variants are actually flipped, you might want to disable the strand-flipping option. Here, these are simulated data, so all variants are on the same strand and use the same reference.

info_snp <- snp_match(sumstats, map, strand_flip = FALSE)
## 130,816 variants to be matched.
## Some duplicates were removed.
## 130,792 variants have been matched; 0 were flipped and 0 were reversed.

Computing LDpred2 scores for one chromosome

Some quality control on summary statistics is highly recommended (see paper).

Correlation

First, you need to compute correlations between variants. We recommend using a window size of 3 cM (see ref).

POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = NCORES)
## indices in info_snp
ind.chr <- which(info_snp$chr == 2)
df_beta <- info_snp[ind.chr, c("beta", "beta_se", "n_eff")]
## indices in G
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
                 infos.pos = POS2[ind.chr2], size = 3 / 1000)
tmp <- tempfile(tmpdir = "tmp-data")
if (packageVersion("bigsnpr") >= package_version("1.4.9") &&
    packageVersion("bigsparser") >= package_version("0.4.0")) {
  corr <- as_SFBM(corr0, tmp)
} else {
  corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"), tmp)
}

Here, we have built the LD matrix using variants from one chromosome only. In practice, you need to build it for variants from all chromosomes. Please look at the code linked at the beginning.

Infinitesimal model

(ldsc <- snp_ldsc2(corr0, df_beta))
##       int        h2 
## 1.0000000 0.1146349
h2_est <- ldsc[["h2"]]
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_inf, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66347145 0.56739110 0.75341191 0.04784474

Grid of models

In practice, we recommend testing multiple values for h2 and p.

(h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4))
## [1] 0.0802 0.1146 0.1605
(p_seq <- signif(seq_log(1e-4, 1, length.out = 17), 2))
##  [1] 0.00010 0.00018 0.00032 0.00056 0.00100 0.00180 0.00320 0.00560
##  [9] 0.01000 0.01800 0.03200 0.05600 0.10000 0.18000 0.32000 0.56000
## [17] 1.00000
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
##           p     h2 sparse
## 1   0.00010 0.0802  FALSE
## 2   0.00018 0.0802  FALSE
## 3   0.00032 0.0802  FALSE
## 4   0.00056 0.0802  FALSE
## 5   0.00100 0.0802  FALSE
## 6   0.00180 0.0802  FALSE
## 7   0.00320 0.0802  FALSE
## 8   0.00560 0.0802  FALSE
## 9   0.01000 0.0802  FALSE
## 10  0.01800 0.0802  FALSE
## 11  0.03200 0.0802  FALSE
## 12  0.05600 0.0802  FALSE
## 13  0.10000 0.0802  FALSE
## 14  0.18000 0.0802  FALSE
## 15  0.32000 0.0802  FALSE
## 16  0.56000 0.0802  FALSE
## 17  1.00000 0.0802  FALSE
## 18  0.00010 0.1146  FALSE
## 19  0.00018 0.1146  FALSE
## 20  0.00032 0.1146  FALSE
## 21  0.00056 0.1146  FALSE
## 22  0.00100 0.1146  FALSE
## 23  0.00180 0.1146  FALSE
## 24  0.00320 0.1146  FALSE
## 25  0.00560 0.1146  FALSE
## 26  0.01000 0.1146  FALSE
## 27  0.01800 0.1146  FALSE
## 28  0.03200 0.1146  FALSE
## 29  0.05600 0.1146  FALSE
## 30  0.10000 0.1146  FALSE
## 31  0.18000 0.1146  FALSE
## 32  0.32000 0.1146  FALSE
## 33  0.56000 0.1146  FALSE
## 34  1.00000 0.1146  FALSE
## 35  0.00010 0.1605  FALSE
## 36  0.00018 0.1605  FALSE
## 37  0.00032 0.1605  FALSE
## 38  0.00056 0.1605  FALSE
## 39  0.00100 0.1605  FALSE
## 40  0.00180 0.1605  FALSE
## 41  0.00320 0.1605  FALSE
## 42  0.00560 0.1605  FALSE
## 43  0.01000 0.1605  FALSE
## 44  0.01800 0.1605  FALSE
## 45  0.03200 0.1605  FALSE
## 46  0.05600 0.1605  FALSE
## 47  0.10000 0.1605  FALSE
## 48  0.18000 0.1605  FALSE
## 49  0.32000 0.1605  FALSE
## 50  0.56000 0.1605  FALSE
## 51  1.00000 0.1605  FALSE
## 52  0.00010 0.0802   TRUE
## 53  0.00018 0.0802   TRUE
## 54  0.00032 0.0802   TRUE
## 55  0.00056 0.0802   TRUE
## 56  0.00100 0.0802   TRUE
## 57  0.00180 0.0802   TRUE
## 58  0.00320 0.0802   TRUE
## 59  0.00560 0.0802   TRUE
## 60  0.01000 0.0802   TRUE
## 61  0.01800 0.0802   TRUE
## 62  0.03200 0.0802   TRUE
## 63  0.05600 0.0802   TRUE
## 64  0.10000 0.0802   TRUE
## 65  0.18000 0.0802   TRUE
## 66  0.32000 0.0802   TRUE
## 67  0.56000 0.0802   TRUE
## 68  1.00000 0.0802   TRUE
## 69  0.00010 0.1146   TRUE
## 70  0.00018 0.1146   TRUE
## 71  0.00032 0.1146   TRUE
## 72  0.00056 0.1146   TRUE
## 73  0.00100 0.1146   TRUE
## 74  0.00180 0.1146   TRUE
## 75  0.00320 0.1146   TRUE
## 76  0.00560 0.1146   TRUE
## 77  0.01000 0.1146   TRUE
## 78  0.01800 0.1146   TRUE
## 79  0.03200 0.1146   TRUE
## 80  0.05600 0.1146   TRUE
## 81  0.10000 0.1146   TRUE
## 82  0.18000 0.1146   TRUE
## 83  0.32000 0.1146   TRUE
## 84  0.56000 0.1146   TRUE
## 85  1.00000 0.1146   TRUE
## 86  0.00010 0.1605   TRUE
## 87  0.00018 0.1605   TRUE
## 88  0.00032 0.1605   TRUE
## 89  0.00056 0.1605   TRUE
## 90  0.00100 0.1605   TRUE
## 91  0.00180 0.1605   TRUE
## 92  0.00320 0.1605   TRUE
## 93  0.00560 0.1605   TRUE
## 94  0.01000 0.1605   TRUE
## 95  0.01800 0.1605   TRUE
## 96  0.03200 0.1605   TRUE
## 97  0.05600 0.1605   TRUE
## 98  0.10000 0.1605   TRUE
## 99  0.18000 0.1605   TRUE
## 100 0.32000 0.1605   TRUE
## 101 0.56000 0.1605   TRUE
## 102 1.00000 0.1605   TRUE
# takes several minutes if you do not have many cores
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.col = ind.chr2)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val])$score
library(ggplot2)
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
  theme_bigstatsr() +
  geom_point() +
  geom_line() +
  scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
  facet_wrap(~ sparse, labeller = label_both) +
  labs(y = "GLM Z-Score", color = "h2") +
  theme(legend.position = "top", panel.spacing = unit(1, "lines"))

library(dplyr)
params %>%
  mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
  arrange(desc(score)) %>%
  mutate_at(c("score", "sparsity"), round, digits = 3) %>%
  slice(1:10)
##         p     h2 sparse score sparsity id
## 1  0.0180 0.1146   TRUE 4.372    0.565 78
## 2  0.0032 0.0802   TRUE 4.363    0.585 58
## 3  0.0018 0.0802  FALSE 4.361    0.000  6
## 4  0.0056 0.1146   TRUE 4.359    0.565 76
## 5  0.0018 0.1605   TRUE 4.358    0.560 91
## 6  0.0032 0.1146   TRUE 4.358    0.567 75
## 7  0.0560 0.0802   TRUE 4.357    0.586 63
## 8  0.0018 0.1146  FALSE 4.357    0.000 23
## 9  0.0056 0.0802   TRUE 4.357    0.578 59
## 10 0.0180 0.0802   TRUE 4.353    0.582 61

You can then choose the best model according to your preferred criterion (e.g. max AUC). Here, we use the Z-score from the regression of the phenotype on the PRS, since we have found it more robust than the AUC. It also enables adjusting for covariates at this step.
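For instance, covariates can be passed to the scoring regression as follows (a hedged sketch; `covar_val` is a hypothetical matrix of covariates, e.g. sex and principal components, for the individuals in `ind.val`):

```r
# Score each candidate PRS while adjusting for covariates
# ('covar_val' is a hypothetical covariate matrix for ind.val)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val],
                               covar.train = covar_val)$score
```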

Also note that we separate sparse and non-sparse models here (and in the paper) to show that their predictive performance is similar. In practice, if you do not really care about sparsity, you could choose the best LDpred2-grid model among all sparse and non-sparse models.
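In that case, the selection simplifies to taking the top-scoring model over the whole grid (a sketch assuming `params`, `beta_grid` and `ind.chr2` as above):

```r
# Best model among ALL models (sparse and non-sparse)
best_id   <- which.max(params$score)
best_beta <- beta_grid[, best_id]
pred_best <- big_prodVec(G, best_beta, ind.row = ind.test, ind.col = ind.chr2)
```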

best_grid_nosp <- params %>%
  mutate(id = row_number()) %>%
  filter(!sparse) %>%
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>%
  beta_grid[, .]

pred_nosp <- big_prodVec(G, best_grid_nosp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_nosp, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66926124 0.57291055 0.76355229 0.04861706
best_grid_sp <- params %>%
  mutate(id = row_number()) %>%
  filter(sparse) %>%
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>%
  beta_grid[, .]

pred_sp <- big_prodVec(G, best_grid_sp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_sp, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66839182 0.56919716 0.76002580 0.04851766

Automatic model

We recommend running many chains in parallel with different initial values for p (e.g. length.out = 30).

# takes a few minutes
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
                               ncores = NCORES)
str(multi_auto)
## List of 4
##  $ :List of 5
##   ..$ beta_est   : num [1:52546] -1.66e-07 1.00e-07 -7.57e-07 -1.24e-06 -1.11e-06 ...
##   ..$ p_est      : num 0.00671
##   ..$ h2_est     : num 0.000265
##   ..$ path_p_est : num [1:1500] 0.000165 0.000199 0.000146 0.000152 0.000202 ...
##   ..$ path_h2_est: num [1:1500] 0.0455 0.0472 0.0405 0.0417 0.0482 ...
##  $ :List of 5
##   ..$ beta_est   : num [1:52546] -7.62e-08 4.59e-08 -3.47e-07 -5.69e-07 -5.10e-07 ...
##   ..$ p_est      : num 0.032
##   ..$ h2_est     : num 0.000122
##   ..$ path_p_est : num [1:1500] 0.00204 0.00218 0.00207 0.00199 0.00206 ...
##   ..$ path_h2_est: num [1:1500] 0.11 0.132 0.112 0.106 0.121 ...
##  $ :List of 5
##   ..$ beta_est   : num [1:52546] -2.90e-06 1.76e-06 -1.33e-05 -2.18e-05 -1.95e-05 ...
##   ..$ p_est      : num 0.0782
##   ..$ h2_est     : num 0.00465
##   ..$ path_p_est : num [1:1500] 0.0438 0.0434 0.0444 0.0417 0.041 ...
##   ..$ path_h2_est: num [1:1500] 0.1145 0.114 0.1029 0.1015 0.0977 ...
##  $ :List of 5
##   ..$ beta_est   : num [1:52546] -1.75e-06 1.06e-06 -7.98e-06 -1.31e-05 -1.17e-05 ...
##   ..$ p_est      : num 0.889
##   ..$ h2_est     : num 0.00279
##   ..$ path_p_est : num [1:1500] 0.898 0.901 0.901 0.902 0.9 ...
##   ..$ path_h2_est: num [1:1500] 0.1144 0.1141 0.1105 0.1012 0.0974 ...

You should verify whether the chains “converged” by looking at their paths, as shown below. In the paper, we propose an automatic way to filter out bad chains by comparing the scale of the resulting predictions (see this code, reproduced below).

Convergence is not reached here, probably because the dataset is so small.

auto <- multi_auto[[1]]
plot_grid(
  qplot(y = auto$path_p_est) +
    theme_bigstatsr() +
    geom_hline(yintercept = auto$p_est, col = "blue") +
    scale_y_log10() +
    labs(y = "p"),
  qplot(y = auto$path_h2_est) +
    theme_bigstatsr() +
    geom_hline(yintercept = auto$h2_est, col = "blue") +
    labs(y = "h2"),
  ncol = 1, align = "hv"
)

beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.chr2)
sc <- apply(pred_auto, 2, sd)
keep <- abs(sc - median(sc)) < 3 * mad(sc)
final_beta_auto <- rowMeans(beta_auto[, keep])
final_pred_auto <- rowMeans(pred_auto[, keep])
AUCBoot(final_pred_auto, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66237927 0.56728105 0.75282774 0.04728923
# Some cleaning
rm(corr); gc(); file.remove(paste0(tmp, ".sbk"))
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3433760 183.4    5787472  309.1   5787472  309.1
## Vcells 80517088 614.3  198228050 1512.4 182399515 1391.6
## [1] TRUE

Conclusion

We have seen how to run three versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome.

Note that we now recommend running LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference is when building the SFBM (the sparse LD matrix on disk): you need to build it so that it contains all variants genome-wide (see e.g. this code).

Reference

Privé, F., Arbel, J., & Vilhjálmsson, B. J. (2020). LDpred2: better, faster, stronger. bioRxiv.