Here we show how to compute polygenic risk scores using LDpred2.

This tutorial only uses fake data for educational purposes. You should also probably look at the code of the paper, particularly at the code to prepare summary statistics (including performing the quality control presented in the Methods section “Quality control of summary statistics” of the paper), at the code to read BGEN files into the data format used by bigsnpr, at the code to prepare LD matrices and at the code to run LDpred2 (genome-wide).

In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in PRS-CS and the LDpred2 papers. If you do not have enough data to use as LD reference, we provide an LD reference to be used directly, along with an example script on how to use it at https://doi.org/10.6084/m9.figshare.13034123. Note that before Nov. 16, 2020, there was an error in the conversion of positions between genome builds. Information about these variants can be retrieved with

# $pos is in build GRCh37 / hg19, but we provide positions in 3 other builds 
info <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
str(info)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1054330 obs. of  10 variables:
##  $ chr     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pos     : int  752721 754182 760912 768448 779322 838555 846808 853954 854250 864938 ...
##  $ a0      : chr  "A" "A" "C" "G" ...
##  $ a1      : chr  "G" "G" "T" "A" ...
##  $ rsid    : chr  "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
##  $ af_UKBB : num  0.841 0.87 0.84 0.106 0.128 ...
##  $ ld      : num  3.69 3.73 3.69 1.4 3.68 ...
##  $ pos_hg17: int  792584 794045 800775 808311 819185 878418 886671 893817 894113 904801 ...
##  $ pos_hg18: int  742584 744045 750775 758311 769185 828418 836671 843817 844113 854801 ...
##  $ pos_hg38: int  817341 818802 825532 833068 843942 903175 911428 918574 918870 929558 ...

Note that we now recommend to run LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference it makes is when building the SFBM (the sparse LD matrix on disk), you need to build it so that it contains all variants genome-wide (see e.g. this code).

Downloading genotype data and summary statistics

You can download data and unzip files in R. We store those files in a directory called "tmp-data" here.

You can see there how we generated these data from the 1000 Genomes project. Note that these data are for educational purposes only, not for use as a reference panel.

First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics.

# Load packages bigsnpr and bigstatsr
library(bigsnpr)
## Loading required package: bigstatsr
# Read from bed/bim/fam, it generates .bk and .rds files.
snp_readBed("tmp-data/public-data.bed")
## [1] "C:\\Users\\au639593\\Desktop\\bigsnpr\\tmp-data\\public-data.rds"
# Attach the "bigSNP" object in R session
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
# See how the file looks like
str(obj.bigSNP, max.level = 2, strict.width = "cut")
## List of 3
##  $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 ..
##   ..and 26 methods, of which 12 are  possibly relevant:
##   ..  add_columns, as.FBM, bm, bm.desc, check_dimensions,
##   ..  check_write_permissions, copy#envRefClass, initialize,
##   ..  initialize#FBM, save, show#envRefClass, show#FBM
##  $ fam      :'data.frame':   559 obs. of  6 variables:
##   ..$ family.ID  : chr [1:559] "EUR_GBR" "EUR_GBR" "EUR_GBR" "EUR_GBR" ...
##   ..$ sample.ID  : chr [1:559] "HG00096" "HG00097" "HG00099" "HG00100" ...
##   ..$ paternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ maternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ sex        : int [1:559] 1 2 2 2 1 2 1 2 2 1 ...
##   ..$ affection  : int [1:559] 1 2 1 1 1 1 2 1 2 1 ...
##  $ map      :'data.frame':   130816 obs. of  6 variables:
##   ..$ chromosome  : int [1:130816] 2 2 2 2 2 2 2 2 2 2 ...
##   ..$ marker.ID   : chr [1:130816] "rs13400442" "rs7594567" "rs7597758" "..
##   ..$ genetic.dist: int [1:130816] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ physical.pos: int [1:130816] 18506 21833 22398 28228 32003 32005 36..
##   ..$ allele1     : chr [1:130816] "C" "G" "T" "A" ...
##   ..$ allele2     : chr [1:130816] "T" "C" "C" "G" ...
##  - attr(*, "class")= chr "bigSNP"
# Get aliases for useful slots
G   <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y   <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()
# Read external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
str(sumstats)
## 'data.frame':    130816 obs. of  10 variables:
##  $ chromosome  : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ marker.ID   : chr  "rs13400442" "rs7594567" "rs7597758" "rs13383216" ...
##  $ physical.pos: int  18506 21833 22398 28228 32003 32005 36787 55237 56916 61687 ...
##  $ allele1     : chr  "C" "G" "T" "A" ...
##  $ allele2     : chr  "T" "C" "C" "G" ...
##  $ beta        : num  -0.073 0.0439 -0.3325 -0.5445 -0.4881 ...
##  $ beta_se     : num  0.277 0.248 0.192 0.247 0.242 ...
##  $ n_case      : int  157 157 157 157 157 157 157 157 157 157 ...
##  $ n_control   : int  402 402 402 402 402 402 402 402 402 402 ...
##  $ p           : num  0.7925 0.8593 0.0846 0.028 0.0439 ...

We split genotype data using part of the data to choose hyper-parameters and another part of the data to evaluate statistical properties of polygenic risk score such as AUC. Here we consider that there are 400 individuals to be used as validation set to tune hyper-parameters for LDpred2-grid. The other 159 individuals are used as test set to evaluate the final models.

set.seed(1)
ind.val <- sample(nrow(G), 400)
ind.test <- setdiff(rows_along(G), ind.val)

Matching variants between genotype data and summary statistics

To match variants contained in genotype data and summary statistics, the variables "chr" (chromosome number), "pos" (genetic position), "a0" (reference allele) and "a1" (derived allele) should be available in the summary statistics and in the genotype data. These 4 variables are used to match variants between the two data frames.

sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
sumstats$n_case <- sumstats$n_control <- NULL
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "beta_se", "p", "n_eff")
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map)
## 130,816 variants to be matched.
## 18,932 ambiguous SNPs have been removed.
## Some duplicates were removed.
## 111,866 variants have been matched; 0 were flipped and 0 were reversed.

If no or few variants are actually flipped, you might want to disable the strand flipping option. Here, these are simulated data so all variants use the same strand and the same reference.

info_snp <- snp_match(sumstats, map, strand_flip = FALSE)
## 130,816 variants to be matched.
## Some duplicates were removed.
## 130,792 variants have been matched; 0 were flipped and 0 were reversed.

Computing LDpred2 scores for one chromosome

Some quality control on summary statistics is highly recommended (see paper).

Correlation

First, you need to compute correlations between variants. We recommend to use a window size of 3 cM (see ref).

POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = NCORES)
## indices in info_snp
ind.chr <- which(info_snp$chr == 2)         
df_beta <- info_snp[ind.chr, c("beta", "beta_se", "n_eff")]
## indices in G
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]    
corr0 <- runonce::save_run(
  snp_cor(G, ind.col = ind.chr2, ncores = NCORES, 
          infos.pos = POS2[ind.chr2], size = 3 / 1000),
  file = "tmp-data/corr_ldpred2_tutorial.rds")
tmp <- tempfile(tmpdir = "tmp-data")
corr <- as_SFBM(corr0, tmp)

Here, we have built the LD matrix using variants from one chromosome only. In practice, you need to build it for variants from all chromosomes. Please look at the code linked at the beginning.

LDpred2-inf: infinitesimal model

(ldsc <- snp_ldsc2(corr0, df_beta))
##       int        h2 
## 1.0000000 0.1146349
h2_est <- ldsc[["h2"]]
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_inf, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66347145 0.56739110 0.75341191 0.04784474

LDpred2(-grid): grid of models

In practice, we recommend to test multiple values for h2 and p. 

(h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4))
## [1] 0.0802 0.1146 0.1605
(p_seq <- signif(seq_log(1e-4, 1, length.out = 17), 2))
##  [1] 0.00010 0.00018 0.00032 0.00056 0.00100 0.00180 0.00320 0.00560
##  [9] 0.01000 0.01800 0.03200 0.05600 0.10000 0.18000 0.32000 0.56000
## [17] 1.00000
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
##          p     h2 sparse
## 1  0.00010 0.0802  FALSE
## 2  0.00018 0.0802  FALSE
## 3  0.00032 0.0802  FALSE
## 4  0.00056 0.0802  FALSE
## 5  0.00100 0.0802  FALSE
## 6  0.00180 0.0802  FALSE
## 7  0.00320 0.0802  FALSE
## 8  0.00560 0.0802  FALSE
## 9  0.01000 0.0802  FALSE
## 10 0.01800 0.0802  FALSE
##  [ reached 'max' / getOption("max.print") -- omitted 92 rows ]
# takes several minutes if you do not have many cores
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.col = ind.chr2)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val])$score
## Warning: package 'ggplot2' was built under R version 3.6.3
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
  theme_bigstatsr() +
  geom_point() +
  geom_line() +
  scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
  facet_wrap(~ sparse, labeller = label_both) +
  labs(y = "GLM Z-Score", color = "h2") +
  theme(legend.position = "top", panel.spacing = unit(1, "lines"))

## Warning: package 'dplyr' was built under R version 3.6.3
params %>%
  mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
  arrange(desc(score)) %>%
  mutate_at(c("score", "sparsity"), round, digits = 3) %>%
  slice(1:10)
##        p     h2 sparse score sparsity id
## 1 0.0032 0.1146   TRUE 4.366    0.569 75
## 2 0.0018 0.1605  FALSE 4.363    0.000 40
## 3 0.0032 0.1605   TRUE 4.363    0.549 92
## 4 0.3200 0.1146   TRUE 4.358    0.576 83
## 5 0.1000 0.1605   TRUE 4.357    0.555 98
##  [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

You can then choose the best model according to your preferred criterion (e.g. max AUC). Here, we use the Z-Score from the regression of the phenotype by the PRS since we have found it more robust than using the AUC. It also enables adjusting for covariates in this step (using parameter covar.train in big_univLogReg() or big_univLinReg()).

Also note that we separate both sparse and non-sparse models here (and in the paper) to show that their predictive performance are similar. In practice, if you do not really care about sparsity, you could choose the best LDpred2-grid model among all sparse and non-sparse models.

best_grid_nosp <- params %>%
  mutate(id = row_number()) %>%
  filter(!sparse) %>% 
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>% 
  beta_grid[, .]

pred_nosp <- big_prodVec(G, best_grid_nosp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_nosp, y[ind.test])
##      Mean      2.5%     97.5%        Sd 
## 0.6720025 0.5747335 0.7652376 0.0489906
best_grid_sp <- params %>%
  mutate(id = row_number()) %>%
  filter(sparse) %>% 
  arrange(desc(score)) %>%
  slice(1) %>%
  pull(id) %>% 
  beta_grid[, .]

pred_sp <- big_prodVec(G, best_grid_sp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_sp, y[ind.test])
##      Mean      2.5%     97.5%        Sd 
## 0.6705315 0.5711805 0.7638674 0.0488126

LDpred2-auto: automatic model

We recommend to run many of them in parallel with different initial values for p (e.g. length.out = 30).

# takes a few minutes
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
                               vec_p_init = seq_log(1e-4, 0.5, length.out = NCORES),
                               ncores = NCORES)
str(multi_auto)
## List of 4
##  $ :List of 9
##   ..$ beta_est   : num [1:52546] -5.51e-07 3.35e-07 -2.52e-06 -4.13e-06 -3.70e-06 ...
##   ..$ postp_est  : num [1:52546] 0.0129 0.0129 0.013 0.013 0.013 ...
##   ..$ sample_beta: num[1:52546, 0 ] 
##   ..$ p_est      : num 0.013
##   ..$ h2_est     : num 0.000881
##   ..$ path_p_est : num [1:1500] 1.26e-04 8.66e-05 1.04e-04 1.36e-04 1.04e-04 ...
##   ..$ path_h2_est: num [1:1500] 0.03192 0.01093 0.00906 0.00713 0.00255 ...
##   ..$ h2_init    : num 0.115
##   ..$ p_init     : num 1e-04
##  $ :List of 9
##   ..$ beta_est   : num [1:52546] -1.86e-07 1.12e-07 -8.50e-07 -1.39e-06 -1.25e-06 ...
##   ..$ postp_est  : num [1:52546] 0.0122 0.0122 0.0122 0.0122 0.0122 ...
##   ..$ sample_beta: num[1:52546, 0 ] 
##   ..$ p_est      : num 0.0122
##   ..$ h2_est     : num 0.000297
##   ..$ path_p_est : num [1:1500] 0.001662 0.001523 0.001168 0.000811 0.001023 ...
##   ..$ path_h2_est: num [1:1500] 0.1107 0.0733 0.0344 0.0247 0.0287 ...
##   ..$ h2_init    : num 0.115
##   ..$ p_init     : num 0.00171
##  $ :List of 9
##   ..$ beta_est   : num [1:52546] -1.49e-07 8.99e-08 -6.80e-07 -1.11e-06 -9.99e-07 ...
##   ..$ postp_est  : num [1:52546] 0.0235 0.0235 0.0235 0.0235 0.0235 ...
##   ..$ sample_beta: num[1:52546, 0 ] 
##   ..$ p_est      : num 0.0235
##   ..$ h2_est     : num 0.000238
##   ..$ path_p_est : num [1:1500] 0.0284 0.0288 0.0283 0.0283 0.0288 ...
##   ..$ path_h2_est: num [1:1500] 0.112 0.116 0.111 0.112 0.109 ...
##   ..$ h2_init    : num 0.115
##   ..$ p_init     : num 0.0292
##  $ :List of 9
##   ..$ beta_est   : num [1:52546] -5.11e-06 3.16e-06 -2.33e-05 -3.83e-05 -3.43e-05 ...
##   ..$ postp_est  : num [1:52546] 0.51 0.51 0.51 0.51 0.51 ...
##   ..$ sample_beta: num[1:52546, 0 ] 
##   ..$ p_est      : num 0.51
##   ..$ h2_est     : num 0.0082
##   ..$ path_p_est : num [1:1500] 0.496 0.493 0.497 0.5 0.497 ...
##   ..$ path_h2_est: num [1:1500] 0.108 0.105 0.106 0.108 0.109 ...
##   ..$ h2_init    : num 0.115
##   ..$ p_init     : num 0.5

You should verify if the chains “converged”. You can look at the path of the chains, as shown below. In the paper, we propose an automatic way to filter bad chains by comparing the scale of the resulting predictions (see this code, reproduced below).

This is not the case here, which is probably because the data is so small.

auto <- multi_auto[[1]]
plot_grid(
  qplot(y = auto$path_p_est) + 
    theme_bigstatsr() + 
    geom_hline(yintercept = auto$p_est, col = "blue") +
    scale_y_log10() +
    labs(y = "p"),
  qplot(y = auto$path_h2_est) + 
    theme_bigstatsr() + 
    geom_hline(yintercept = auto$h2_est, col = "blue") +
    labs(y = "h2"),
  ncol = 1, align = "hv"
)

beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.chr2)
sc <- apply(pred_auto, 2, sd)
keep <- abs(sc - median(sc)) < 3 * mad(sc)
final_beta_auto <- rowMeans(beta_auto[, keep])
final_pred_auto <- rowMeans(pred_auto[, keep])
AUCBoot(final_pred_auto, y[ind.test])
##       Mean       2.5%      97.5%         Sd 
## 0.66330593 0.56894186 0.75215811 0.04732721
# Some cleaning
rm(corr); gc(); file.remove(paste0(tmp, ".sbk"))
##            used  (Mb) gc trigger   (Mb) max used  (Mb)
## Ncells  3408861 182.1    5798727  309.7  5798727 309.7
## Vcells 81840859 624.4  133092554 1015.5 92303119 704.3
## [1] TRUE

Conclusion

We have seen how to run 3 versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome.

Note that we now recommend to run LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial. The only difference it makes is when building the SFBM (the sparse LD matrix on disk), you need to build it so that it contains all variants genome-wide (see e.g. this code).

Reference

Privé, F., Arbel, J., & Vilhjálmsson, B. J. (2020). LDpred2: better, faster, stronger. BioRxiv.