Estimating uncertainty in polygenic scores using LDpred2

Here we show how to estimate the full posterior distribution of genetic value using LDpred2 as described in Large uncertainty in individual PRS estimation impacts PRS-based risk stratification.

This tutorial assumes that you have already tuned the hyperparameters using grid search. Please refer to Computing polygenic scores using LDpred2 for tutorials on data preprocessing and LDpred2-grid.

Approximating the posterior distribution of genetic value

Given the set of tuned hyperparameters (see above), we want to approximate the full posterior distribution of an individual’s genetic value, \(\small{GV_i = \mathrm{\bf{x}}_i^T \boldsymbol\beta}\). From the posterior, one can obtain the posterior mean, \(\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}\), and various metrics of uncertainty such as the posterior variance, \(\small{var(GV_i|\mathrm{Data})}\), where \(\small{GV_i}\) is the genetic value of individual \(i\) and \(\mathrm{\small{Data}}\) refers to a given GWAS, and \(\rho\)-credible interval, the interval within which the individual’s genetic value falls with probability \(\rho\).

An individual’s genetic value is their genotype vector multipled by the causal effect vector. We therefore need to obtain MCMC samples from the posterior of the causal effects. This is done by supplying the tuned hyperparameters (best_param in the example below) to snp_ldpred2_grid and setting return_sampling_betas = TRUE. In the example below, we return 500 MCMC samples from the posterior of the causal effects with the option num_iter = 500.

Note that, when using return_sampling_betas = TRUE, you must supply one set of parameters only, i.e. best_param must be a data frame with exactly 1 row (otherwise it will raise an error).

best_param <- data.frame(p = 0.01, h2 = 0.2, sparse = FALSE)

posterior_beta_samples <- snp_ldpred2_grid(
  corr, df_beta, best_param,
  return_sampling_betas = TRUE, num_iter = 500)
dim(posterior_beta_samples)

## [1] 45337   500

Posterior samples of the individual’s genetic value are then obtained by multiplying their genotype with the posterior samples of \(\boldsymbol{\beta}\).

posterior_gv_samples <- big_prodMat(G, posterior_beta_samples, ind.col = df_beta[["_NUM_ID_"]])
dim(posterior_gv_samples)

## [1] 503 500

The autocorrelation is generally weak, otherwise, you can perform some thinning.

acf(posterior_gv_samples[1, ], lag.max = 10, plot = TRUE)$acf

## , , 1
## 
##               [,1]
##  [1,]  1.000000000
##  [2,]  0.124746441
##  [3,]  0.108593634
##  [4,]  0.048078430
##  [5,]  0.008082020
##  [6,]  0.002583350
##  [7,] -0.013405861
##  [8,] -0.029479457
##  [9,]  0.004941942
## [10,] -0.002931033
## [11,] -0.002058727

rowMeans(apply(posterior_gv_samples, 1, function(x) {
  acf(x, lag.max = 10, plot = FALSE)$acf
}))

##  [1]  1.000000000  0.099738287  0.032833478  0.038316768 -0.006360946
##  [6]  0.011291062  0.011917135  0.008185207  0.009776595 -0.008539467
## [11]  0.034323505

Estimating the posterior mean and posterior variance

From the posterior samples of genetic value (see above), we can compute summary statistics such as the posterior mean and posterior variance. The posterior mean can be interpreted as the individual’s polygenic (risk) score, i.e. \(\small{\widehat{PRS}_i \equiv \mathbb{E}(GV_i|\mathrm{Data})}\). The posterior variance, \(\small{var(GV_i|\mathrm{Data})}\), is one metric of uncertainty; credible intervals are discussed below.

samples <- posterior_gv_samples[1, ]
posterior_gv_mean <- mean(samples)
posterior_gv_var <- var(samples)

Constructing credible intervals of genetic value

Another way to quantify uncertainty in an individual’s polygenic score is by constructing a \(\rho\)-credible interval of the individual’s genetic value, i.e. the range of values that contains the individual’s true genetic value with probability \(\rho\).

The example below demonstrates how one can obtain a 95% credible interval of the individual’s genetic value.

rho <- 0.95
bound <- (1 - rho) / 2
samples <- posterior_gv_samples[1, ]
mean <- mean(samples)
lower_ci <- quantile(samples, bound)
upper_ci <- quantile(samples, 1 - bound)
hist(samples, main = "Posterior distribution of genetic value", xlab = NULL) 
abline(v = c(lower_ci, mean, upper_ci), col = c("blue", "red", "blue"), lty = c(2,1,2))
legend("topright", legend = c("Credible Interval", "PRS"),
       col = c("blue", "red"), lty = c(2,1), cex = 0.8)

To get for all individuals:

means <- rowMeans(posterior_gv_samples)
limits <- t(apply(posterior_gv_samples, 1, quantile, probs = c(bound, 1 - bound)))

Yi Ding and colleagues

March 26, 2021

Approximating the posterior distribution of genetic value

Estimating the posterior mean and posterior variance

Constructing credible intervals of genetic value