Chapter 8 Polygenic (risk) scores (PGS, or PRS)
Improving PGS methods is the main topic of my research work. These are the main methods currently available in my packages:
- efficient penalized regressions, with individual-level data (Privé, Aschard, & Blum (2019) + tutorial)
- Clumping and Thresholding (C+T) and Stacked C+T (SCT), with GWAS summary statistics and individual-level data (Privé, Vilhjálmsson, Aschard, & Blum (2019) + tutorial)
- LDpred2, with summary statistics (Privé, Arbel, et al. (2020) + tutorial)
- lassosum2, with the same input data as LDpred2 (Privé, Arbel, et al. (2022) + tutorial)
You can now also use LDpred2-auto for inference, e.g. of the SNP-heritability (the proportion of phenotypic variance that can be explained by the set of genetic variants considered) and the polygenicity (the proportion of causal variants in this set) (Privé et al. (2023) + tutorial).
8.1 One use of PRS
For common complex diseases such as heart diseases, cancers, and diabetes:

- many common genetic variants are causal, but each with a small effect (they do have an impact on the probability of developing the disease, but often a small one) \(\Rightarrow\) a common causal variant is not useful as a standalone risk factor,
- this is in contrast to rare mutations causing rare monogenic diseases.

With polygenic risk scores (PRS):

- many genetic variants are aggregated in a joint predictive model,
- by aggregating many small effects, the PRS can have a large effect \(\Rightarrow\) the PRS is useful as a risk factor.
Then PRS can be used in public health to improve disease risk stratification beyond traditional risk factors (age, smoking, pollution, low socioeconomic status, diet, physical inactivity, family history, low-frequency large-effect genetic mutations, etc.).
PRS are complementary to these risk factors and improve prediction accuracy, for example by identifying more individuals at high risk for a given disease. The clinical validity of PRS has been shown in many studies, and their clinical utility has been demonstrated in a clinical trial conducted in the UK for cardiovascular disease (Fuat et al., 2024).
8.2 PGS from individual-level data
If you have individual-level data (i.e. genotypes and phenotypes), you can basically use any supervised learning (machine learning) method to train a PGS. However, because of the size of genetic data, you will quickly run into scalability issues with these models. Moreover, it has been shown that effects for most diseases and traits are small and essentially additive, and that fancy methods such as deep learning are not more effective than simpler additive models at constructing PGS (Kelemen et al., 2025).
Therefore, using penalized linear/logistic regression (PLR) can be a very efficient and effective method to train PGS. In my R package bigstatsr, I have developed a very fast implementation with automatic choice of the two hyper-parameters (Privé, Aschard, et al., 2019).
This is an example of using PLR for predicting height from genotypes in the UK Biobank (a minimal sketch of such training follows this list):

- training on 350K individuals x 656K variants took less than 24 hours,
- within both males and females, the PGS achieved a correlation of 65.5% (\(r^2\) of 42.9%) between genetically predicted and true height.
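Here is a minimal sketch of what such training looks like with bigstatsr; the objects G (an FBM genotype matrix), y, covar, and the train/test indices are assumptions for this illustration, not the exact code used for the UK Biobank analysis:
library(bigstatsr)
# penalized linear regression, with automatic choice of the two hyper-parameters
mod <- big_spLinReg(G, y[ind.train], ind.train = ind.train,
                    covar.train = covar[ind.train, ],  # e.g. sex, age, PCs
                    ncores = nb_cores())
# predict in a held-out test set and measure predictive performance
pred <- predict(mod, G, ind.row = ind.test, covar.row = covar[ind.test, ])
cor(pred, y[ind.test])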
8.3 PGS from GWAS summary statistics
Why is it suboptimal to derive a PGS like this: \(PGS_i = \sum_j \hat\beta_j \cdot G_{i,j}\), where the \(\hat\beta_j\) are the marginal GWAS effect sizes?
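For reference, computing this naive PGS is just a genotype-effect dot product; a minimal sketch with bigstatsr, where G and beta_gwas are assumed to be the genotype matrix and the vector of marginal GWAS effects:
library(bigstatsr)
pred_naive <- big_prodVec(G, beta_gwas)  # PGS_i = sum_j beta_j * G_ij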

Figure 6.2: Local correlation between variants causes redundant GWAS signals.
8.3.1 Clumping + Thresholding (C+T, or P+T): restricting predictors
C+T is the simplest and has long been the most widely used method for constructing PGS based on GWAS summary statistics. C+T is simple because it directly uses the GWAS effects as the predictive (PGS) effects. To reduce noise in the predictor, C+T only uses genetic variants that pass some chosen p-value threshold (the thresholding part). To avoid redundancy, C+T retains only one variant per group of correlated variants (the clumping step), as GWAS effects were learned independently rather than jointly.
\[PGS_i = \sum_{\substack{j \in S_\text{clumping} \\ p_j~<~p_T}} \hat\beta_j \cdot G_{i,j}\]
In GWAS, a p-value threshold of \(5 \times 10^{-8}\) is used when reporting significant findings, which corresponds to correcting for one million independent tests. Yet for prediction purposes, including less significant variants can substantially improve predictive performance of C+T. Therefore, the p-value threshold must be chosen carefully in C+T. A stringent threshold risks excluding informative variants, while a lenient threshold may introduce noise by including too many non-informative variants. For the clumping step, one must choose the threshold above which to discard correlated variants.
Hyper-parameters in C+T (and usual values); a minimal code sketch follows this list:

- threshold on squared correlation in clumping (e.g. 0.2),
- window size for LD computation in clumping (e.g. 500 kb),
- p-value threshold (e.g. test \(p_T\) between \(1\) and \(10^{-8}\) and choose the best one in a tuning/validation set),
- other parameters such as a threshold on the imputation quality score (e.g. \(INFO > 0.5\)) or on the minor allele frequency (e.g. \(MAF > 0.01\)).
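As announced above, here is a minimal C+T sketch with bigsnpr; the objects G, CHR, POS, lpval (the \(-\log_{10}\) p-values) and beta_gwas are assumptions for this illustration:
library(bigsnpr)
# clumping: keep the most significant variant in each group of correlated variants
ind_clump <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                          S = lpval,     # prioritize variants by significance
                          thr.r2 = 0.2,  # squared-correlation threshold
                          size = 500,    # window size in kb
                          ncores = nb_cores())
# thresholding: further restrict to variants passing the p-value threshold
ind_keep <- intersect(ind_clump, which(lpval > -log10(0.05)))  # p_T = 0.05 here
pred_CT <- big_prodVec(G, beta_gwas[ind_keep], ind.col = ind_keep)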
A popular software to derive C+T scores is PRSice-2 (Choi & O’Reilly, 2019).
My contributions in Privé, Vilhjálmsson, et al. (2019):
an implementation to efficiently compute thousands of C+T scores corresponding to different sets of hyper-parameters (to explore more values for more hyper-parameters, rather than simply a few values for \(p_T\) only)
going further by stacking with a (penalized) linear combination of all C+T models (instead of just choosing the best model) \(\Rightarrow\) SCT (Stacked C+T)
This can achieve much better predictive performance than the standard use of C+T (tuning \(p_T\) only). Over the past years, stacking has become a very popular way to improve PGS predictive performance, sometimes by combining many models from multiple methods; a simplified sketch follows.
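To give the flavor of stacking, here is a simplified sketch (not the exact SCT implementation); multi_PRS is assumed to be a matrix of many C+T scores (one column per combination of hyper-parameters) and y01 a binary outcome:
library(bigstatsr)
X <- as_FBM(multi_PRS)  # n individuals x K C+T scores
# learn (penalized) stacking weights on the training set
mod_stack <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
                          ncores = nb_cores())
pred_SCT <- predict(mod_stack, X, ind.row = ind.test)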
8.3.2 Other methods
There are methods that model LD better than C+T (which uses simple heuristics):

- lassosum (Mak, Porsch, Choi, Zhou, & Sham, 2017) and lassosum2 (Privé, Arbel, et al., 2022), which approximate a penalized linear regression based on GWAS summary statistics and some LD reference,
- Bayesian methods such as LDpred2 (Privé, Arbel, et al., 2020), SBayesR/RC (Lloyd-Jones et al., 2019; Zheng et al., 2024) and PRS-CS (Ge, Chen, Ni, Feng, & Smoller, 2019), which also approximate a penalized regression, with different priors and implementation details,
- many others.
My opinion on these methods:

- C+T is known to be suboptimal but works okay for traits with low polygenicity (the proportion of causal variants), and can be improved with proper tuning/stacking (cf. above),
- lassosum is very good at getting very sparse solutions (lots of variants have a zero effect, i.e. they are not used in the PGS); C+T can be good for that as well (when a low p-value threshold is used, e.g. with \(p_T < 0.01\), you can achieve 95–99% sparsity),
- LDpred2 is the best in my very biased opinion 😉; there is a “grid” option where you need to tune two hyper-parameters and an “auto” option to directly infer the hyper-parameters of the method from the data, and many developments have been made to improve its robustness,
- SBayesR is very good with in-sample LD (i.e. from the GWAS data), otherwise it lacks robustness,
- SBayesRC uses external functional annotations to help prioritize causal variants, which can improve prediction, especially when applying the resulting PGS to other populations,
- PRS-CS uses a lot of regularization internally, which makes it very robust, but sometimes at the expense of lower predictive ability; it is also usually slower than the other methods mentioned before.
![](https://github.com/privefl/thesis-docs/blob/master/figures/ldpred2-comparison.jpeg?raw=true)
Figure 6.4: Comparison of PGS methods from Privé, Arbel, et al. (2020). External comparisons have also been performed (Kulm, Marderstein, Mezey, & Elemento, 2020; Pain et al., 2021). Since then, developments have been made to ensure even better predictive performance of LDpred2 models.
8.4 Predictive ability of PGS
What influences the predictive power of PGS?

- Predictive power \(r^2\) is bounded by the heritability \(h^2\) captured by the set of variants used (called the SNP-heritability), except in some particular cases (X. Wang et al., 2023).
- \(r^2\) should increase with the GWAS sample size \(N\), even though it probably increases more slowly than expected (Henches et al., 2025), possibly due to meta-analyzing heterogeneous GWAS summary statistics.
- \(r^2\) decreases with polygenicity (the proportion of causal variants), because there are then many smaller effects, which are harder to detect and estimate.
Let’s denote \(M_c\) the number of causal variants.
The theoretical upper bound for the phenotypic variance that a PGS can explain is given by (Daetwyler, Villanueva, & Woolliams, 2008):
\[r^2_\text{max} = \dfrac{h^2}{1 + (1 - r^2_\text{max}) \dfrac{M_c}{N h^2}}\]
This implicit equation can be solved numerically in R:
uniroot(function(r2) r2 - h2 / (1 + (1 - r2) * M_c / (N * h2)), interval = c(0, h2))$root
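For instance, with illustrative values \(h^2 = 0.5\) and \(M_c = 10{,}000\) causal variants (these numbers are assumptions for this example):
h2 <- 0.5; M_c <- 10e3
r2_max <- function(N) uniroot(
  function(r2) r2 - h2 / (1 + (1 - r2) * M_c / (N * h2)),
  interval = c(0, h2))$root
r2_max(100e3)  # ~0.45
r2_max(1e6)    # ~0.49, i.e. getting closer to h2 as N increases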
A major limitation of current PGS is their poor portability across ancestries.
Here is an example across 245 phenotypes and 9 ancestry groups (Privé, Aschard, et al., 2022):

Figure 6.5: Partial correlations and 95% CIs in the UK test set versus in a test set from another ancestry group. Each point represents a phenotype and training has been performed with penalized regression on UK individuals and HapMap3 variants. The slope (in blue) is computed using Deming regression accounting for standard errors in both x and y, fixing the intercept at 0. The square of this slope is provided above each plot, which we report as the relative predictive performance compared to testing in the ‘United Kingdom’ ancestry group (see next figure).
Basically, predictive performance drops with genetic distance (captured by distance in the PCA space here) from the training population (Ding et al., 2023; Privé, Aschard, et al., 2022):
One possible explanation: different tagging, because correlations between genetic variants differ across populations (Y. Wang et al., 2020):
- the effect of a causal variant should be similar across populations (Hou et al., 2023; Hu et al., 2025),
- all 3 variants have the same effect in N.W. Europeans (perfect correlation),
- however, variant 1 retains only 60% of the causal effect in E. Asians, and only 40% in W. Africans.
8.5 Future of PGS
- Use ever-increasing GWAS summary statistics.
- Use more variants to include more causal variants (currently, most methods use around 1M variants).
- Integrate functional annotations and multi-ancestry data to prioritize causal variants and get better PGS for all populations.
8.6 Exercise with LDpred2 and lassosum2
First, let’s carefully look at the LDpred2 tutorial.
8.6.1 Preparing the data
Let’s first read the data produced in 3.3:
library(bigsnpr)
obj.bigsnp <- snp_attach("tmp-data/GWAS_data_sorted_QC.rds")
G <- obj.bigsnp$genotypes
NCORES <- nb_cores()
Let’s use some GWAS summary statistics for CAD that I derived from the UK Biobank (Bycroft et al., 2018):
gz <- runonce::download_file(
"https://figshare.com/ndownloader/files/38077323",
dir = "tmp-data", fname = "sumstats_CAD_tuto.csv.gz")
readLines(gz, n = 5)
#> [1] "chr,pos,rsid,allele1,allele2,freq,info,beta,se"
#> [2] "1,721290,rs12565286,C,G,0.035889027911808,0.941918079726998,0.0361758959140647,0.0290865883937757"
#> [3] "1,752566,rs3094315,A,G,0.840799909379283,0.997586884856296,-0.0340838522604864,0.0144572980122262"
#> [4] "1,777122,rs2980319,T,A,0.871069627596275,0.997302103009046,-0.018725387068563,0.0155088941489917"
#> [5] "1,785989,rs2980300,C,T,0.869926014169995,0.991323313568096,-0.0182890096248982,0.0154915306067785"
Read and prepare them in the format required by LDpred2 (columns “chr”, “pos”, “a0”, “a1”, “beta”, “beta_se”, and “n_eff”, as well as the additional columns “freq” and “info” for QC).
Note that there were 20,791 cases and 323,124 controls in the GWAS. Could you estimate the total effective sample size if you did not have this information?
Click to see solution
sumstats <- bigreadr::fread2(
gz,
select = c("chr", "pos", "allele2", "allele1", "beta", "se", "freq", "info"),
col.names = c("chr", "pos", "a0", "a1", "beta", "beta_se", "freq", "info"))
# GWAS effective sample size for binary traits (4 / (1 / n_case + 1 / n_control));
# for quantitative traits, just use the total sample size for `n_eff`.
(Neff <- 4 / (1 / 20791 + 1 / 323124))
#> [1] 78136.41
# Without the case/control counts, you can estimate Neff from the SEs:
# 8 / beta_se^2 approaches n_eff for variants with MAF close to 0.5
quantile(8 / sumstats$beta_se^2, 0.999)
#> 99.9%
#> 73780.49
Note that we recommend using imputed HapMap3(+) variants when available, for which you can download a precomputed LD reference for European individuals based on the UK Biobank. Here, we will use the genotyped variants for this tutorial. Try to use an LD reference with at least 2000 individuals (we have only 1401 in this example). The LDpred2 tutorial provides more information.
Match the variants in the GWAS summary statistics with the internal data we have here.
What should you do for the allele frequencies?
Click to see solution
library(dplyr)
map <- transmute(obj.bigsnp$map,
chr = chromosome, pos = physical.pos,
a0 = allele2, a1 = allele1)
info_snp <- snp_match(sumstats, map, return_flip_and_rev = TRUE) %>%
mutate(freq = ifelse(`_REV_`, 1 - freq, freq),
`_REV_` = NULL, `_FLIP_`= NULL) %>%
print()
#> chr pos a0 a1 beta beta_se freq info n_eff
#> 1 1 752566 T C 0.034083852 0.01445730 0.15920009 0.9975869 78136.41
#> 2 1 785989 G A 0.018289010 0.01549153 0.13007399 0.9913233 78136.41
#> 3 1 798959 G A 0.003331013 0.01307707 0.20524280 0.9734898 78136.41
#> 4 1 947034 T C -0.021202725 0.02838148 0.03595249 0.9924989 78136.41
#> _NUM_ID_.ss _NUM_ID_
#> 1 2 2
#> 2 4 4
#> 3 5 5
#> 4 6 6
#> [ reached 'max' / getOption("max.print") -- omitted 340206 rows ]
What QC could you perform on the GWAS summary statistics? Apply it here.
Click to see solution
Check the summary statistics; some quality control may be needed:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0000027 0.1100250 0.2261357 0.2365588 0.3580946 0.9998703
Then we can perform some quality control on the summary statistics by checking whether standard deviations (of genotypes) inferred from the external GWAS summary statistics are consistent with the ones in the internal data we have:
af_ref <- big_colstats(G, ind.col = info_snp$`_NUM_ID_`, ncores = NCORES)$sum / (2 * nrow(G))
sd_ref <- sqrt(2 * af_ref * (1 - af_ref))
sd_ss <- with(info_snp, 2 / sqrt(n_eff * beta_se^2 + beta^2))
is_bad <-
sd_ss < (0.5 * sd_ref) | sd_ss > (sd_ref + 0.1) |
sd_ss < 0.05 | sd_ref < 0.05 # basically filtering small MAF
library(ggplot2)
ggplot(slice_sample(data.frame(sd_ref, sd_ss, is_bad), n = 100e3)) +
geom_point(aes(sd_ref, sd_ss, color = is_bad), alpha = 0.5) +
theme_bigstatsr(0.9) +
scale_color_viridis_d(direction = -1) +
geom_abline(linetype = 2) +
labs(x = "Standard deviations in the reference set",
y = "Standard deviations derived from the summary statistics",
color = "To remove?")
When using quantitative traits (linear regression instead of logistic regression for the GWAS), you need to replace the 2 by sd(y) when computing sd_ss (equations (6.2) and (6.1)).
When allele frequencies are available in the GWAS summary statistics, you can use them (along with INFO scores) to get an even better match (equation (6.3)):
sd_af <- with(info_snp, sqrt(2 * freq * (1 - freq) * info))
ggplot(slice_sample(data.frame(sd_af, sd_ss), n = 100e3)) +
geom_point(aes(sd_af, sd_ss), alpha = 0.5) +
theme_bigstatsr(0.9) +
geom_abline(linetype = 2, color = "red") +
labs(x = "Standard deviations derived from allele frequencies",
y = "Standard deviations derived from the summary statistics")
You can still use the reference panel to do some quality control by comparing allele frequencies, which is useful for detecting allelic errors (an inversion of the alleles, so that the effect size should have the opposite sign) as well as the other issues detected before:
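This comparison is not shown here; a minimal sketch (which also defines af_diff, used in the filter below) could be:
af_diff <- af_ref - info_snp$freq
hist(af_diff, breaks = 100)  # large differences can indicate allelic errors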
Then you can filter:
is_bad2 <-
sd_ss < (0.7 * sd_af) | sd_ss > (sd_af + 0.1) |
sd_ss < 0.05 | sd_af < 0.05 |
info_snp$info < 0.7 | abs(af_diff) > 0.07 # based on visual inspection
mean(is_bad2)
#> [1] 0.002410276
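The filtered summary statistics are stored in df_beta, which is used in the rest of this tutorial; presumably:
df_beta <- info_snp[!is_bad2, ]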
Then, we compute the correlation matrix for each chromosome (note that we are using only 4 chromosomes here so that this tutorial runs faster):
# Precomputed genetic positions (in cM) to avoid downloading large files in this tuto
gen_pos <- readRDS(runonce::download_file(
"https://figshare.com/ndownloader/files/38247288",
dir = "tmp-data", fname = "gen_pos_tuto.rds"))
df_beta <- dplyr::filter(df_beta, chr %in% 1:4) # TO REMOVE (for speed here)
for (chr in 1:4) { # REPLACE BY 1:22
print(chr)
corr0 <- runonce::save_run({
# indices in 'sumstats'
ind.chr <- which(df_beta$chr == chr)
# indices in 'G'
ind.chr2 <- df_beta$`_NUM_ID_`[ind.chr]
# genetic positions (in cM)
# POS2 <- snp_asGeneticPos(map$chr[ind.chr2], map$pos[ind.chr2], dir = "tmp-data")
POS2 <- gen_pos[ind.chr2] # PRECOMPUTED HERE; USE snp_asGeneticPos() IN REAL CODE
# compute the banded correlation matrix in sparse matrix format
snp_cor(G, ind.col = ind.chr2, size = 3 / 1000, infos.pos = POS2,
ncores = NCORES)
}, file = paste0("tmp-data/corr_chr", chr, ".rds"))
# transform to SFBM (on-disk format) on the fly
if (chr == 1) {
ld <- Matrix::colSums(corr0^2)
corr <- as_SFBM(corr0, "tmp-data/corr", compact = TRUE)
} else {
ld <- c(ld, Matrix::colSums(corr0^2))
corr$add_columns(corr0, nrow(corr))
}
}
#> [1] 1
#> user system elapsed
#> 110.20 0.71 28.93
#> Code finished running at 2025-06-13 14:37:03 CEST
#> [1] 2
#> user system elapsed
#> 141.60 0.64 37.57
#> Code finished running at 2025-06-13 14:37:47 CEST
#> [1] 3
#> user system elapsed
#> 130.27 0.75 34.44
#> Code finished running at 2025-06-13 14:38:27 CEST
#> [1] 4
#> user system elapsed
#> 112.91 0.95 31.21
#> Code finished running at 2025-06-13 14:39:04 CEST
# size of the on-disk LD matrix (`corr$sbk` is the SFBM backing file), in GB
file.size(corr$sbk) / 1024^3
#> [1] 0.5756225
Note that you will need at least as much memory as this file size (to keep it cached for faster processing), plus some more memory for all the results returned by LDpred2. If you do not have enough memory, processing will be very slow (because the data would be read from disk all the time). When using HapMap3 variants, requesting 60 GB should be enough. For this small example, 8 GB of RAM on a laptop should be enough.
8.6.2 LDpred2
Following the LDpred2 tutorial, run the three versions of LDpred2 and test their predictive performance for obj.bigsnp$fam$CAD.
For speed here, you can use a smaller grid of hyper-parameters for LDpred2-grid, and a smaller number of burn-in/iterations for LDpred2-auto.
Click to see solution
We can now run LD score regression:
(ldsc <- with(df_beta, snp_ldsc(ld, length(ld), chi2 = (beta / beta_se)^2,
sample_size = n_eff, blocks = NULL)))
#> int h2
#> 0.9795999 0.0528327
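The SNP-heritability estimate from LDSC is then used below as ldsc_h2_est; presumably:
ldsc_h2_est <- ldsc[["h2"]]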
We can now run LDpred2-inf very easily:
# LDpred2-inf
beta_inf <- snp_ldpred2_inf(corr, df_beta, ldsc_h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.col = df_beta$`_NUM_ID_`)
AUCBoot(pred_inf, obj.bigsnp$fam$CAD)
#> Mean 2.5% 97.5% Sd
#> 0.55123396 0.51920429 0.58365525 0.01631818
For LDpred2(-grid), this is the grid we recommend to use:
(h2_seq <- round(ldsc_h2_est * c(0.3, 0.7, 1, 1.4), 4))
#> [1] 0.0158 0.0370 0.0528 0.0740
(p_seq <- signif(seq_log(1e-5, 1, length.out = 21), 2))
#> [1] 1.0e-05 1.8e-05 3.2e-05 5.6e-05 1.0e-04 1.8e-04 3.2e-04 5.6e-04 1.0e-03
#> [10] 1.8e-03 3.2e-03 5.6e-03 1.0e-02 1.8e-02 3.2e-02 5.6e-02 1.0e-01 1.8e-01
#> [19] 3.2e-01 5.6e-01 1.0e+00
params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))
dim(params)
#> [1] 168 3
Here, we will be using this smaller grid instead (for speed in this tutorial):
# smaller grid for tutorial only (USE PREVIOUS ONE IN REAL CODE)
(params <- expand.grid(p = signif(seq_log(1e-4, 0.5, length.out = 16), 2),
h2 = round(ldsc_h2_est, 4), sparse = TRUE))
#> p h2 sparse
#> 1 0.00010 0.0528 TRUE
#> 2 0.00018 0.0528 TRUE
#> 3 0.00031 0.0528 TRUE
#> 4 0.00055 0.0528 TRUE
#> 5 0.00097 0.0528 TRUE
#> 6 0.00170 0.0528 TRUE
#> 7 0.00300 0.0528 TRUE
#> 8 0.00530 0.0528 TRUE
#> 9 0.00940 0.0528 TRUE
#> 10 0.01700 0.0528 TRUE
#> 11 0.02900 0.0528 TRUE
#> 12 0.05200 0.0528 TRUE
#> 13 0.09100 0.0528 TRUE
#> 14 0.16000 0.0528 TRUE
#> 15 0.28000 0.0528 TRUE
#> 16 0.50000 0.0528 TRUE
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
params$sparsity <- colMeans(beta_grid == 0)
Then, we can compute the corresponding PGS for all these models, and visualize their performance:
pred_grid <- big_prodMat(G, beta_grid, ind.col = df_beta[["_NUM_ID_"]],
ncores = NCORES)
params$score <- apply(pred_grid, 2, function(.x) {
if (all(is.na(.x))) return(NA) # models that diverged substantially
summary(glm( # simply use `lm()` for quantitative traits
CAD ~ .x + sex + age, data = obj.bigsnp$fam, family = "binomial"
))$coef[".x", 3]
})
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
facet_wrap(~ sparse, labeller = label_both) +
labs(y = "GLM Z-Score", color = "h2") +
theme(legend.position = "top", panel.spacing = unit(1, "lines"))
Then you can use the best-performing model here. Note that, in practice, you should use only individuals from the validation set to compute the $score, and then evaluate the best model for the individuals in the test set (unrelated to those in the validation set).
library(dplyr)
best_beta_grid <- params %>%
mutate(id = row_number()) %>%
arrange(desc(score)) %>%
print() %>%
slice(1) %>%
pull(id) %>%
beta_grid[, .]
#> p h2 sparse sparsity score id
#> 1 0.00300 0.0528 TRUE 0.7668965 4.141109 7
#> 2 0.00530 0.0528 TRUE 0.6853274 3.994907 8
#> 3 0.00940 0.0528 TRUE 0.5946515 3.978679 9
#> 4 0.00170 0.0528 TRUE 0.8318286 3.956292 6
#> 5 0.00097 0.0528 TRUE 0.8793796 3.929743 5
#> 6 0.01700 0.0528 TRUE 0.5077261 3.842456 10
#> 7 0.00031 0.0528 TRUE 0.9383287 3.836808 3
#> 8 0.00010 0.0528 TRUE 0.9666869 3.818947 1
#> [ reached 'max' / getOption("max.print") -- omitted 8 rows ]
What is the best model among those using less than 5% of the variants?
To run LDpred2-auto, you can use:
# LDpred2-auto
multi_auto <- snp_ldpred2_auto(
corr, df_beta, h2_init = ldsc_h2_est,
vec_p_init = seq_log(1e-4, 0.2, 30),
burn_in = 100, num_iter = 100, # TO REMOVE, for speed here
allow_jump_sign = FALSE,
use_MLE = FALSE, # USE `TRUE` ONLY FOR GWAS WITH LARGE N AND M
shrink_corr = 0.95,
ncores = NCORES)
Perform some quality control on the chains:
# `range` should be between 0 and 2
(range <- sapply(multi_auto, function(auto) diff(range(auto$corr_est))))
#> [1] 0.05448907 0.05442862 0.05261028 0.05480762 0.05368479 0.05576853
#> [7] 0.05444713 0.05452793 0.05538609 0.05197896 0.05241466 0.05446092
#> [13] 0.05406081 0.05331133 0.05599592 0.05493762 0.05328035 0.05303112
#> [19] 0.05475263 0.05279201 0.05363907 0.05490972 0.05541536 0.05407016
#> [25] 0.05410121 0.05563394 0.05560854 0.05553329 0.05448200 0.05344541
(keep <- which(range > (0.95 * quantile(range, 0.95, na.rm = TRUE))))
#> [1] 1 2 4 5 6 7 8 9 12 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29
#> [26] 30
To get the final effects / predictions, you should only use chains that pass this filtering:
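For example, by averaging the variant effects over the kept chains (as in the LDpred2 tutorial):
final_beta_auto <- rowMeans(sapply(multi_auto[keep], function(auto) auto$beta_est))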
We can finally test the final prediction:
final_pred_auto <- big_prodVec(G, final_beta_auto,
ind.col = df_beta[["_NUM_ID_"]],
ncores = NCORES)
AUCBoot(final_pred_auto, obj.bigsnp$fam$CAD)
#> Mean 2.5% 97.5% Sd
#> 0.56399026 0.53174361 0.59614571 0.01638768
8.6.3 lassosum2: grid of models
lassosum2 is a re-implementation of the lassosum model (Mak et al., 2017) that now uses the exact same input parameters as LDpred2 (corr and df_beta). It can therefore be run next to LDpred2, and the best model can be chosen using the validation set.
Run lassosum2 and test its predictive performance for obj.bigsnp$fam$CAD.
For speed here, you can use the parameters nlambda = 10 and maxiter = 50.
Click to see solution
beta_lassosum2 <- snp_lassosum2(
corr, df_beta, ncores = NCORES,
nlambda = 10, maxiter = 50) # TO REMOVE, for speed here
As with LDpred2-grid, we can compute the corresponding PGS for all these models, and visualize their performance:
pred_grid2 <- big_prodMat(G, beta_lassosum2, ind.col = df_beta[["_NUM_ID_"]],
ncores = NCORES)
params2 <- attr(beta_lassosum2, "grid_param")
params2$score <- apply(pred_grid2, 2, function(.x) {
if (all(is.na(.x))) return(NA) # models that diverged substantially
summary(glm( # simply use `lm()` for quantitative traits
CAD ~ .x + sex + age, data = obj.bigsnp$fam, family = "binomial"
))$coef[".x", 3]
})
ggplot(params2, aes(x = lambda, y = score, color = as.factor(delta))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0)) +
labs(y = "GLM Z-Score", color = "delta")
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_point()`).
#> Warning: Removed 4 rows containing missing values or values outside the scale range
#> (`geom_line()`).
When are large delta values useful?
best_grid_lassosum2 <- params2 %>%
mutate(id = row_number()) %>%
arrange(desc(score)) %>%
print() %>%
slice(1) %>%
pull(id) %>%
beta_lassosum2[, .]
#> lambda delta num_iter time sparsity score id
#> 1 0.00996541 0.001 51 0.15 0.9937624 4.112203 3
#> 2 0.00996541 0.010 51 0.15 0.9936938 4.106545 13
#> 3 0.00996541 0.100 45 0.14 0.9930475 4.053764 23
#> 4 0.01579411 0.001 51 0.10 0.9997062 3.977849 2
#> 5 0.00996541 1.000 14 0.05 0.9902959 3.976200 33
#> 6 0.01579411 0.010 51 0.08 0.9997062 3.973049 12
#> 7 0.01579411 0.100 35 0.08 0.9996181 3.900467 22
#> [ reached 'max' / getOption("max.print") -- omitted 33 rows ]
We can choose the best overall model from both LDpred2-grid and lassosum2:
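For example, a minimal sketch using the validation scores computed above:
all_scores <- c(params$score, params2$score)
all_betas <- cbind(beta_grid, beta_lassosum2)
best_beta_overall <- all_betas[, which.max(all_scores)]  # NAs are ignored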