Chapter 9 Fine-mapping with SuSiE

Read more in this PowerPoint presentation.

9.1 Objective

SuSiE (G. Wang, Sarkar, Carbonetto, & Stephens, 2020; Zou et al., 2022) is a fine-mapping method used to determine which variants are most likely to be causal for a trait. It uses a Bayesian iterative approach to create “credible sets” of SNPs that are likely to each contain a causal variant.

In this tutorial we will first use SuSiE on simulated data where we know the true causal effects. We will then apply it to real data from (Ruth et al., 2020).

This tutorial was created largely by modifying pre-existing tutorials written by Alesha Hatton (https://cnsgenomics.com/data/teaching/GNGWS23/module1/11_fine-mapping.html) and Gao Wang (https://stephenslab.github.io/susie-paper/manuscript_results/pedagogical_example.html). Thank you Alesha and Gao for your excellent tutorials!

9.1.1 Requirements:

  • 1000 Genomes European reference panel in PLINK1 format (bed/bim/fam)
  • GWAS summary statistics from (Ruth et al., 2020) (GWAS catalog accession: GCST90012102)
  • PLINK 1.9
  • R packages susieR, dplyr, ggplot2, ggrepel, bigsnpr, and bigreadr.

9.1.2 Load libraries

library(susieR)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(bigsnpr)
library(bigreadr)
set.seed(7236)

9.2 Simulated Example

This example uses simulated data to illustrate the use of SuSiE for fine-mapping over stage-wise selection.

9.2.1 Load data

# This data is included in the susieR package
dat <- N2finemapping
str(dat)
#> List of 8
#>  $ X                : num [1:574, 1:1002] 0.703 0.703 -0.297 -0.297 0.703 ...
#>  $ chrom            : chr "chr8"
#>  $ pos              : int [1:1002(1d)] 38854423 38854585 38854614 38854829 38855770 38856373 38856591 38856612 38856903 38857229 ...
#>  $ true_coef        : num [1:1002, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ residual_variance: num [1:2(1d)] 1.1 1.01
#>  $ Y                : num [1:574, 1:2] 1.0446 -1.3656 -0.0235 -0.4344 0.4681 ...
#>  $ allele_freq      : num [1:1002, 1] 0.1483 0.0209 0.1368 0.2134 0.5627 ...
#>  $ V                : num [1:2, 1:2] 1.344 -0.0336 -0.0336 1.2541
# This simulated data-set has two replicates. Let's focus on the first replicate:
y <- dat$Y[, 1]

9.2.2 Fit SuSiE to the data

Also perform univariate regression so that PIPs can be compared with p-values

# Fit SuSiE with L=5 (maximum number of causal variants per replicate)
fitted <- susie(dat$X, y, L = 5,
                estimate_residual_variance = TRUE,
                scaled_prior_variance = 0.2,
                tol = 1e-3, track_fit = TRUE,
                compute_univariate_zscore = TRUE,
                coverage = 0.95, min_abs_corr = 0.1)
str(fitted, max.level = 1)
#> List of 21
#>  $ alpha                 : num [1:5, 1:1002] 9.98e-04 8.96e-32 9.32e-19 9.98e-04 9.98e-04 ...
#>  $ mu                    : num [1:5, 1:1002] 0 -0.0221 -0.0514 0 0 ...
#>  $ mu2                   : num [1:5, 1:1002] 0 0.00232 0.00447 0 0 ...
#>  $ Xr                    : num [1:574] -0.842 -0.274 -0.842 -0.842 1.209 ...
#>  $ KL                    : num [1:5] -1.49e-05 5.87 8.68 -1.49e-05 -1.49e-05
#>  $ lbf                   : num [1:5] 1.49e-05 6.23e+01 3.31e+01 1.49e-05 1.49e-05
#>  $ lbf_variable          : num [1:5, 1:1002] 0 -2.33 -1.49 0 0 ...
#>  $ sigma2                : num 1.06
#>  $ V                     : num [1:5] 0 0.251 0.154 0 0
#>  $ pi                    : num [1:1002] 0.000998 0.000998 0.000998 0.000998 0.000998 ...
#>  $ null_index            : num 0
#>  $ converged             : logi TRUE
#>  $ elbo                  : num [1:10] -867 -861 -858 -854 -850 ...
#>  $ niter                 : int 10
#>  $ intercept             : num -7.82e-17
#>  $ fitted                : num [1:574] -0.842 -0.274 -0.842 -0.842 1.209 ...
#>  $ trace                 :List of 10
#>  $ sets                  :List of 5
#>  $ pip                   : num [1:1002] 0 0 0 0 0 0 0 0 0 0 ...
#>  $ z                     : num [1:1002] -0.8091 1.1147 -0.5836 -0.0842 -2.1866 ...
#>  $ X_column_scale_factors: num [1:1002] 0.486 0.2 0.494 0.603 0.708 ...
#>  - attr(*, "class")= chr "susie"
# Let's have a look at the first iteration (by setting max_iter=1 this will give us the first iteration of fitted)
fitted.one.iter <- susie(dat$X, y, L = 5, max_iter = 1,
                         estimate_residual_variance = TRUE,
                         scaled_prior_variance = 0.2,
                         tol = 1e-3,
                         coverage = 0.95, min_abs_corr = 0.1)
#> Warning in susie(dat$X, y, L = 5, max_iter = 1, estimate_residual_variance =
#> TRUE, : IBSS algorithm did not converge in 1 iterations!
str(fitted.one.iter)
#> List of 19
#>  $ alpha                 : num [1:5, 1:1002] 5.22e-18 1.61e-04 6.10e-04 7.44e-04 8.03e-04 ...
#>  $ mu                    : num [1:5, 1:1002] -0.03868 -0.02164 -0.01568 -0.01138 -0.00922 ...
#>  $ mu2                   : num [1:5, 1:1002] 0.00381 0.002501 0.001695 0.001228 0.000992 ...
#>  $ Xr                    : num [1:574] -0.814 -0.187 -0.802 -0.748 0.675 ...
#>  $ KL                    : num [1:5] 6.021 2.322 0.713 0.404 0.295
#>  $ lbf                   : num [1:5] 31.0473 0.9323 0.0954 0.0364 0.0199
#>  $ lbf_variable          : num [1:5, 1:1002] -1.836 -0.893 -0.396 -0.257 -0.198 ...
#>  $ sigma2                : num 1.15
#>  $ V                     : num [1:5] 0.17389 0.01527 0.00379 0.00207 0.00148
#>  $ pi                    : num [1:1002] 0.000998 0.000998 0.000998 0.000998 0.000998 ...
#>  $ null_index            : num 0
#>  $ elbo                  : num -867
#>  $ niter                 : int 1
#>  $ converged             : logi FALSE
#>  $ intercept             : num -3.47e-17
#>  $ fitted                : num [1:574] -0.814 -0.187 -0.802 -0.748 0.675 ...
#>  $ sets                  :List of 5
#>   ..$ cs                :List of 1
#>   .. ..$ L1: int [1:24] 765 767 770 774 776 778 780 782 783 788 ...
#>   ..$ purity            :'data.frame':   1 obs. of  3 variables:
#>   .. ..$ min.abs.corr   : num 0.78
#>   .. ..$ mean.abs.corr  : num 0.922
#>   .. ..$ median.abs.corr: num 0.92
#>   ..$ cs_index          : int 1
#>   ..$ coverage          : num 0.955
#>   ..$ requested_coverage: num 0.95
#>  $ pip                   : num [1:1002] 0.00232 0.00219 0.0022 0.0022 0.00315 ...
#>  $ X_column_scale_factors: num [1:1002] 0.486 0.2 0.494 0.603 0.708 ...
#>  - attr(*, "class")= chr "susie"

9.2.3 Plot the SuSiE results

Plot both PIPs from SuSiE as well as the p-values from the regressions.

b <- dat$true_coef[, 1]
b[which(b != 0)] <- 1

# Run this code all at once to get side-by-side plots
par_saved <- par(mfrow = c(1, 3), cex.axis = 0.9)
# Plot the marginal associations 
susie_plot(fitted, y = "z", b = b, max_cs = 1, main = "Marginal associations", 
           xlab = "variable (SNP)", col = "#767676")
# Plot PIPs after the first iteration
susie_plot(fitted.one.iter, y = "PIP", b = b, max_cs = 0.4, 
           main = "IBSS after 1 iteration", add_legend = FALSE,
           ylim = c(0, 1), xlab = "variable (SNP)", col = "#767676") 
# Plot PIPs after convergence 
susie_plot(fitted, y = "PIP", b = b, max_cs = 0.4, 
           main = "IBSS after 10 iterations", add_legend = FALSE, 
           ylim = c(0, 1), xlab = "variable (SNP)", col = "#767676")

par(par_saved)  # back to as before

The “true” effects are highlighted in red. The strongest signal by p-value does not contain the causal variant, but is being tagged by two causal variants. The first iteration of SuSiE identifies the strongest signal by p-value, but by the 10th iteration the true causal variants are identified within two credible sets.

9.2.4 Let’s take a closer look at which variants are in the credible sets

fitted.one.iter$sets$cs
#> $L1
#>  [1] 765 767 770 774 776 778 780 782 783 788 790 791 792 814 817 824 827 834 837
#> [20] 838 847 849 868 869
fitted$sets$cs
#> $L2
#>  [1]  850  913  914  915  916  920  924  925  926  927  930  931  933  934  935
#> [16]  942  946  948  951  952  962  967  968  979  980  982  983  985  988  989
#> [31]  993  994  996  999 1000 1001 1002
#> 
#> $L3
#> [1] 337 379 440

9.2.5 How correlated are the variants in the credible sets?

fitted$sets$purity
#>    min.abs.corr mean.abs.corr median.abs.corr
#> L2    0.9722386     0.9938064       0.9947184
#> L3    0.8534981     0.8775989       0.8848378

9.3 Real data

In the simulation we used individual level data (genotypes and phenotypes). Often this is not available due to data access restrictions. Here we will use GWAS summary statistics and an LD reference panel. Let’s look at bioavailable testosterone in females from (Ruth et al., 2020).

9.3.1 Import GWAS data

Here we use (a subset of) the GWAS summary statistics originally accessible from the GWAS Catalog (accession number: GCST90012102).

# original data
# tgz <- runonce::download_file(
#   "https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90012001-GCST90013000/GCST90012102/GCST90012102_buildGRCh37.tsv.gz",
#   dir = "tmp-data")  # 419 MB

# small subset, for the tutorial
tgz <- runonce::download_file(
  "https://figshare.com/ndownloader/files/55262768", 
  dir = "tmp-data", fname = "FT_sumstats_small.tsv.gz")

readLines(tgz, n = 5)
#> [1] "variant_id\tchromosome\tbase_pair_location\teffect_allele\tother_allele\teffect_allele_frequency\timputation_quality\tbeta\tstandard_error\tp_value"
#> [2] "rs72858371\t1\t6409383\tT\tC\t0.986224\t0.977552\t-0.00615364\t0.0108886\t0.64"                                                                     
#> [3] "rs72633437\t1\t6409494\tC\tT\t0.906138\t0.988959\t0.00289667\t0.00434703\t0.39"                                                                     
#> [4] "rs151043271\t1\t6409495\tG\tA\t0.998537\t0.732147\t-0.0244749\t0.0386808\t0.4"                                                                      
#> [5] "rs58484986\t1\t6409653\tC\tG\t0.986228\t0.976894\t-0.00623617\t0.0108958\t0.63"

And some reference panel (also subsetted for this tutorial) to compute LD from:

# original data
# gzip <- runonce::download_file(
#   "https://vu.data.surfsara.nl/index.php/s/VZNByNwpD8qqINe/download",
#   fname = "g1000_eur.zip", dir = "tmp-data")  # 488 MB

# small subset, for the tutorial
gzip <- runonce::download_file(
  "https://figshare.com/ndownloader/files/55263098",
  fname = "g1000_eur_small.zip", dir = "tmp-data")

unzip(gzip, exdir = "tmp-data", overwrite = FALSE)
g1000_map <- fread2("tmp-data/g1000_eur_small.bim", select = c(1:2, 4:6),
                    col.names = c("chr", "rsid", "pos", "a1", "a0"))
# import GWAS summary statistics of bioavailable testosterone in females
FT <- fread2(tgz)
head(FT)
#>    variant_id chromosome base_pair_location effect_allele other_allele
#> 1  rs72858371          1            6409383             T            C
#> 2  rs72633437          1            6409494             C            T
#> 3 rs151043271          1            6409495             G            A
#> 4  rs58484986          1            6409653             C            G
#> 5  rs60772726          1            6409662             A            T
#>   effect_allele_frequency imputation_quality        beta standard_error p_value
#> 1                0.986224           0.977552 -0.00615364     0.01088860    0.64
#> 2                0.906138           0.988959  0.00289667     0.00434703    0.39
#> 3                0.998537           0.732147 -0.02447490     0.03868080    0.40
#> 4                0.986228           0.976894 -0.00623617     0.01089580    0.63
#> 5                0.986228           0.976844 -0.00625052     0.01089570    0.63
#>  [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
FT_cleaned <- FT %>%
  # QC on MAF >= 0.01
  filter(effect_allele_frequency >= 0.01 & effect_allele_frequency <= 0.99) %>%
  # rename some columns for compatibility with bigsnpr::snp_match()
  rename(chr = chromosome, pos = base_pair_location, 
         a1 = effect_allele, a0 = other_allele) %>%
  # align alleles to reference panel
  snp_match(g1000_map)  # match on chr/pos but can also match on chr/rsid
#> 25,328 variants to be matched.
#> 3,099 ambiguous SNPs have been removed.
#> 18,293 variants have been matched; 0 were flipped and 14,274 were reversed.

9.3.2 Find window around loci of interest

Let’s look at two loci:

  • Locus 1: rs1989147.

  • Locus 2: rs34954997, rs11879227, rs34255979. These variants in locus 2 are nearby so will be considered as one locus for the purpose of running SuSiE.

# find coordinates
filter(FT_cleaned, rsid == "rs1989147")  # on chr1
#>   chr     pos a0 a1 variant_id effect_allele_frequency imputation_quality
#> 1   1 7909373  C  T  rs1989147                0.806791            0.97999
#>         beta standard_error p_value _NUM_ID_.ss      rsid _NUM_ID_
#> 1 -0.0235512     0.00322554   7e-14        5329 rs1989147    12274
# Extract 1 Mb locus surrounding rs1989147
locus1 <- filter(FT_cleaned, chr == 1, pos > 7909373 - 5e5, pos < 7909373 + 5e5)

# Plot locus
ggplot(locus1, aes(x = pos, y = -log10(p_value))) +
  theme_bw(14) +
  geom_point(alpha = 0.8, size = 1.3) +
  geom_point(aes(x = pos, y = -log10(p_value)), color = "red", size = 2,
             data = filter(locus1, rsid == "rs1989147")) +
  geom_label_repel(aes(label = ifelse(rsid == "rs1989147", rsid, NA)), 
                   size = 4, min.segment.length = 0)
#> Warning: Removed 2815 rows containing missing values or values outside the scale range
#> (`geom_label_repel()`).

9.3.4 Read in data and fit SuSiE

susie_rss() is the function to fit SuSiE using summary statistics. Earlier we used susie() function which requires individual level data.

# Fit SuSiE using sample size from publication (n=188507)
fitted_rss_1 = susie_rss(bhat = locus1$beta, shat = locus1$standard_error, 
                         n = 188507, R = ld_mat)

9.3.5 Examine results

# Examine results
summary(fitted_rss_1)
#> 
#> Variables in credible sets:
#> 
#>  variable variable_prob cs
#>      1456    0.09199869  1
#>      1420    0.07451778  1
#>      1417    0.06848847  1
#>      1401    0.06379124  1
#>      1334    0.05783427  1
#>      1364    0.05781606  1
#>      1367    0.05778666  1
#>      1331    0.05777183  1
#>      1347    0.05429258  1
#>      1288    0.05367458  1
#>      1339    0.04928034  1
#>      1280    0.04659030  1
#>      1391    0.04472727  1
#>      1378    0.04098725  1
#>      1256    0.03098191  1
#>      1257    0.03085088  1
#>  [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
#> 
#> Credible sets summary:
#> 
#>  cs cs_log10bf cs_avg_r2 cs_min_r2
#>   1   7.455457 0.9186663 0.8268138
#>                                                                                                  variable
#>  1256,1257,1267,1270,1272,1280,1288,1314,1316,1331,1334,1339,1347,1364,1367,1378,1391,1401,1417,1420,1456
locus1_susie <- arrange(summary(fitted_rss_1)$vars, variable)
locus1 <- cbind(locus1, locus1_susie[-1])

# Plot
ggplot(locus1, aes(x = pos, y = -log10(p_value))) +
  theme_bw(14) +
  geom_point(alpha = 0.8, size = 1.3) +
  geom_point(color = "red", size = 2, data = filter(locus1, cs == 1)) +
  geom_label_repel(aes(label = rsid), size = 2, max.overlaps = 45,
                   data = filter(locus1, cs == 1))

How do the PIPs compare to the p-values? Generate a plot to show this.

9.3.6 Locus 2

locus2 <- filter(FT_cleaned, chr == 19, pos > 45417638 - 5e5, pos < 46384830 + 5e5)

Now perform similar analyses on locus2.

How do the PIPs compare to the p-values? Generate a plot to show this.

What if we changed to a 75% credible set for locus 2? Write some code to do this and plot. Can we trust this result?

Why might there be no result in fitted_rss_2 for the second credible set?

9.3.7 Reflections

  • What can you do next with these results?

  • What is the advantages of using SuSiE over conditional approaches (such as LD clumping or COJO)?

  • What assumptions are we making when using SuSiE?

References

Ruth, K.S., Day, F.R., Tyrrell, J., Thompson, D.J., Wood, A.R., Mahajan, A., et al. (2020). Using human genetics to understand the disease impacts of testosterone in men and women. Nature Medicine, 26, 252–258.
Wang, G., Sarkar, A., Carbonetto, P., & Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82, 1273–1300.
Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the "Sum of Single Effects" model. PLoS Genetics, 18, e1010299.