Chapter 9 Fine-mapping with SuSiE
Read more in this PowerPoint presentation.
9.1 Objective
SuSiE (G. Wang, Sarkar, Carbonetto, & Stephens, 2020; Zou et al., 2022) is a fine-mapping method used to determine which variants are most likely to be causal for a trait. It uses a Bayesian iterative approach to create “credible sets” of SNPs that are likely to each contain a causal variant.
In this tutorial we will first use SuSiE on simulated data where we know the true causal effects. We will then apply it to real data from (Ruth et al., 2020).
This tutorial was created largely by modifying pre-existing tutorials written by Alesha Hatton (https://cnsgenomics.com/data/teaching/GNGWS23/module1/11_fine-mapping.html) and Gao Wang (https://stephenslab.github.io/susie-paper/manuscript_results/pedagogical_example.html). Thank you Alesha and Gao for your excellent tutorials!
9.1.1 Requirements:
- 1000 Genomes European reference panel in PLINK1 format (bed/bim/fam)
- GWAS summary statistics from (Ruth et al., 2020) (GWAS catalog accession: GCST90012102)
- PLINK 1.9
- R packages susieR, dplyr, ggplot2, ggrepel, bigsnpr, and bigreadr.
9.2 Simulated Example
This example uses simulated data to illustrate the use of SuSiE for fine-mapping over stage-wise selection.
9.2.1 Load data
#> List of 8
#> $ X : num [1:574, 1:1002] 0.703 0.703 -0.297 -0.297 0.703 ...
#> $ chrom : chr "chr8"
#> $ pos : int [1:1002(1d)] 38854423 38854585 38854614 38854829 38855770 38856373 38856591 38856612 38856903 38857229 ...
#> $ true_coef : num [1:1002, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
#> $ residual_variance: num [1:2(1d)] 1.1 1.01
#> $ Y : num [1:574, 1:2] 1.0446 -1.3656 -0.0235 -0.4344 0.4681 ...
#> $ allele_freq : num [1:1002, 1] 0.1483 0.0209 0.1368 0.2134 0.5627 ...
#> $ V : num [1:2, 1:2] 1.344 -0.0336 -0.0336 1.2541
9.2.2 Fit SuSiE to the data
Also perform univariate regression so that PIPs can be compared with p-values
# Fit SuSiE with L=5 (maximum number of causal variants per replicate)
fitted <- susie(dat$X, y, L = 5,
estimate_residual_variance = TRUE,
scaled_prior_variance = 0.2,
tol = 1e-3, track_fit = TRUE,
compute_univariate_zscore = TRUE,
coverage = 0.95, min_abs_corr = 0.1)
str(fitted, max.level = 1)
#> List of 21
#> $ alpha : num [1:5, 1:1002] 9.98e-04 8.96e-32 9.32e-19 9.98e-04 9.98e-04 ...
#> $ mu : num [1:5, 1:1002] 0 -0.0221 -0.0514 0 0 ...
#> $ mu2 : num [1:5, 1:1002] 0 0.00232 0.00447 0 0 ...
#> $ Xr : num [1:574] -0.842 -0.274 -0.842 -0.842 1.209 ...
#> $ KL : num [1:5] -1.49e-05 5.87 8.68 -1.49e-05 -1.49e-05
#> $ lbf : num [1:5] 1.49e-05 6.23e+01 3.31e+01 1.49e-05 1.49e-05
#> $ lbf_variable : num [1:5, 1:1002] 0 -2.33 -1.49 0 0 ...
#> $ sigma2 : num 1.06
#> $ V : num [1:5] 0 0.251 0.154 0 0
#> $ pi : num [1:1002] 0.000998 0.000998 0.000998 0.000998 0.000998 ...
#> $ null_index : num 0
#> $ converged : logi TRUE
#> $ elbo : num [1:10] -867 -861 -858 -854 -850 ...
#> $ niter : int 10
#> $ intercept : num -7.82e-17
#> $ fitted : num [1:574] -0.842 -0.274 -0.842 -0.842 1.209 ...
#> $ trace :List of 10
#> $ sets :List of 5
#> $ pip : num [1:1002] 0 0 0 0 0 0 0 0 0 0 ...
#> $ z : num [1:1002] -0.8091 1.1147 -0.5836 -0.0842 -2.1866 ...
#> $ X_column_scale_factors: num [1:1002] 0.486 0.2 0.494 0.603 0.708 ...
#> - attr(*, "class")= chr "susie"
# Let's have a look at the first iteration (by setting max_iter=1 this will give us the first iteration of fitted)
fitted.one.iter <- susie(dat$X, y, L = 5, max_iter = 1,
estimate_residual_variance = TRUE,
scaled_prior_variance = 0.2,
tol = 1e-3,
coverage = 0.95, min_abs_corr = 0.1)
#> Warning in susie(dat$X, y, L = 5, max_iter = 1, estimate_residual_variance =
#> TRUE, : IBSS algorithm did not converge in 1 iterations!
#> List of 19
#> $ alpha : num [1:5, 1:1002] 5.22e-18 1.61e-04 6.10e-04 7.44e-04 8.03e-04 ...
#> $ mu : num [1:5, 1:1002] -0.03868 -0.02164 -0.01568 -0.01138 -0.00922 ...
#> $ mu2 : num [1:5, 1:1002] 0.00381 0.002501 0.001695 0.001228 0.000992 ...
#> $ Xr : num [1:574] -0.814 -0.187 -0.802 -0.748 0.675 ...
#> $ KL : num [1:5] 6.021 2.322 0.713 0.404 0.295
#> $ lbf : num [1:5] 31.0473 0.9323 0.0954 0.0364 0.0199
#> $ lbf_variable : num [1:5, 1:1002] -1.836 -0.893 -0.396 -0.257 -0.198 ...
#> $ sigma2 : num 1.15
#> $ V : num [1:5] 0.17389 0.01527 0.00379 0.00207 0.00148
#> $ pi : num [1:1002] 0.000998 0.000998 0.000998 0.000998 0.000998 ...
#> $ null_index : num 0
#> $ elbo : num -867
#> $ niter : int 1
#> $ converged : logi FALSE
#> $ intercept : num -3.47e-17
#> $ fitted : num [1:574] -0.814 -0.187 -0.802 -0.748 0.675 ...
#> $ sets :List of 5
#> ..$ cs :List of 1
#> .. ..$ L1: int [1:24] 765 767 770 774 776 778 780 782 783 788 ...
#> ..$ purity :'data.frame': 1 obs. of 3 variables:
#> .. ..$ min.abs.corr : num 0.78
#> .. ..$ mean.abs.corr : num 0.922
#> .. ..$ median.abs.corr: num 0.92
#> ..$ cs_index : int 1
#> ..$ coverage : num 0.955
#> ..$ requested_coverage: num 0.95
#> $ pip : num [1:1002] 0.00232 0.00219 0.0022 0.0022 0.00315 ...
#> $ X_column_scale_factors: num [1:1002] 0.486 0.2 0.494 0.603 0.708 ...
#> - attr(*, "class")= chr "susie"
9.2.3 Plot the SuSiE results
Plot both PIPs from SuSiE as well as the p-values from the regressions.
b <- dat$true_coef[, 1]
b[which(b != 0)] <- 1
# Run this code all at once to get side-by-side plots
par_saved <- par(mfrow = c(1, 3), cex.axis = 0.9)
# Plot the marginal associations
susie_plot(fitted, y = "z", b = b, max_cs = 1, main = "Marginal associations",
xlab = "variable (SNP)", col = "#767676")
# Plot PIPs after the first iteration
susie_plot(fitted.one.iter, y = "PIP", b = b, max_cs = 0.4,
main = "IBSS after 1 iteration", add_legend = FALSE,
ylim = c(0, 1), xlab = "variable (SNP)", col = "#767676")
# Plot PIPs after convergence
susie_plot(fitted, y = "PIP", b = b, max_cs = 0.4,
main = "IBSS after 10 iterations", add_legend = FALSE,
ylim = c(0, 1), xlab = "variable (SNP)", col = "#767676")
The “true” effects are highlighted in red. The strongest signal by p-value does not contain the causal variant, but is being tagged by two causal variants. The first iteration of SuSiE identifies the strongest signal by p-value, but by the 10th iteration the true causal variants are identified within two credible sets.
9.2.4 Let’s take a closer look at which variants are in the credible sets
#> $L1
#> [1] 765 767 770 774 776 778 780 782 783 788 790 791 792 814 817 824 827 834 837
#> [20] 838 847 849 868 869
#> $L2
#> [1] 850 913 914 915 916 920 924 925 926 927 930 931 933 934 935
#> [16] 942 946 948 951 952 962 967 968 979 980 982 983 985 988 989
#> [31] 993 994 996 999 1000 1001 1002
#>
#> $L3
#> [1] 337 379 440
9.3 Real data
In the simulation we used individual level data (genotypes and phenotypes). Often this is not available due to data access restrictions. Here we will use GWAS summary statistics and an LD reference panel. Let’s look at bioavailable testosterone in females from (Ruth et al., 2020).
9.3.1 Import GWAS data
Here we use (a subset of) the GWAS summary statistics originally accessible from the GWAS Catalog (accession number: GCST90012102).
# original data
# tgz <- runonce::download_file(
# "https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90012001-GCST90013000/GCST90012102/GCST90012102_buildGRCh37.tsv.gz",
# dir = "tmp-data") # 419 MB
# small subset, for the tutorial
tgz <- runonce::download_file(
"https://figshare.com/ndownloader/files/55262768",
dir = "tmp-data", fname = "FT_sumstats_small.tsv.gz")
readLines(tgz, n = 5)
#> [1] "variant_id\tchromosome\tbase_pair_location\teffect_allele\tother_allele\teffect_allele_frequency\timputation_quality\tbeta\tstandard_error\tp_value"
#> [2] "rs72858371\t1\t6409383\tT\tC\t0.986224\t0.977552\t-0.00615364\t0.0108886\t0.64"
#> [3] "rs72633437\t1\t6409494\tC\tT\t0.906138\t0.988959\t0.00289667\t0.00434703\t0.39"
#> [4] "rs151043271\t1\t6409495\tG\tA\t0.998537\t0.732147\t-0.0244749\t0.0386808\t0.4"
#> [5] "rs58484986\t1\t6409653\tC\tG\t0.986228\t0.976894\t-0.00623617\t0.0108958\t0.63"
And some reference panel (also subsetted for this tutorial) to compute LD from:
# original data
# gzip <- runonce::download_file(
# "https://vu.data.surfsara.nl/index.php/s/VZNByNwpD8qqINe/download",
# fname = "g1000_eur.zip", dir = "tmp-data") # 488 MB
# small subset, for the tutorial
gzip <- runonce::download_file(
"https://figshare.com/ndownloader/files/55263098",
fname = "g1000_eur_small.zip", dir = "tmp-data")
unzip(gzip, exdir = "tmp-data", overwrite = FALSE)
g1000_map <- fread2("tmp-data/g1000_eur_small.bim", select = c(1:2, 4:6),
col.names = c("chr", "rsid", "pos", "a1", "a0"))
#> variant_id chromosome base_pair_location effect_allele other_allele
#> 1 rs72858371 1 6409383 T C
#> 2 rs72633437 1 6409494 C T
#> 3 rs151043271 1 6409495 G A
#> 4 rs58484986 1 6409653 C G
#> 5 rs60772726 1 6409662 A T
#> effect_allele_frequency imputation_quality beta standard_error p_value
#> 1 0.986224 0.977552 -0.00615364 0.01088860 0.64
#> 2 0.906138 0.988959 0.00289667 0.00434703 0.39
#> 3 0.998537 0.732147 -0.02447490 0.03868080 0.40
#> 4 0.986228 0.976894 -0.00623617 0.01089580 0.63
#> 5 0.986228 0.976844 -0.00625052 0.01089570 0.63
#> [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
FT_cleaned <- FT %>%
# QC on MAF >= 0.01
filter(effect_allele_frequency >= 0.01 & effect_allele_frequency <= 0.99) %>%
# rename some columns for compatibility with bigsnpr::snp_match()
rename(chr = chromosome, pos = base_pair_location,
a1 = effect_allele, a0 = other_allele) %>%
# align alleles to reference panel
snp_match(g1000_map) # match on chr/pos but can also match on chr/rsid
#> 25,328 variants to be matched.
#> 3,099 ambiguous SNPs have been removed.
#> 18,293 variants have been matched; 0 were flipped and 14,274 were reversed.
9.3.2 Find window around loci of interest
Let’s look at two loci:
Locus 1: rs1989147.
Locus 2: rs34954997, rs11879227, rs34255979. These variants in locus 2 are nearby so will be considered as one locus for the purpose of running SuSiE.
#> chr pos a0 a1 variant_id effect_allele_frequency imputation_quality
#> 1 1 7909373 C T rs1989147 0.806791 0.97999
#> beta standard_error p_value _NUM_ID_.ss rsid _NUM_ID_
#> 1 -0.0235512 0.00322554 7e-14 5329 rs1989147 12274
# Extract 1 Mb locus surrounding rs1989147
locus1 <- filter(FT_cleaned, chr == 1, pos > 7909373 - 5e5, pos < 7909373 + 5e5)
# Plot locus
ggplot(locus1, aes(x = pos, y = -log10(p_value))) +
theme_bw(14) +
geom_point(alpha = 0.8, size = 1.3) +
geom_point(aes(x = pos, y = -log10(p_value)), color = "red", size = 2,
data = filter(locus1, rsid == "rs1989147")) +
geom_label_repel(aes(label = ifelse(rsid == "rs1989147", rsid, NA)),
size = 4, min.segment.length = 0)
#> Warning: Removed 2815 rows containing missing values or values outside the scale range
#> (`geom_label_repel()`).
9.3.3 Create an LD matrix
As we are using GWAS summary statistics, we need information of the correlation between variants. Here we are using 1000 Genomes European for convenience. If you are performing this analysis yourself, it is advised to use a larger sample. For fine-mapping, it is often recommended to use in-sample LD (LD computed from the original GWAS data).
Here we use PLINK, but remember that you can also use functions from bigsnpr for computing LD and actually also from bigstatsr when you need all correlation values for a set of variants (see big_cor()
).
# Export the SNPs in this locus
locus1_snps_txt <- tempfile(fileext = ".txt")
write(locus1$rsid, file = locus1_snps_txt)
# Compute correlation matrix (LD) with PLINK
plink <- download_plink("tmp-data")
system(glue::glue(
"{plink} --bfile tmp-data/g1000_eur_small",
" --extract {locus1_snps_txt}",
" --keep-allele-order",
" --r square", # correlations as a square matrix, not squared correlations
" --out tmp-data/ld_locus1"
))
#> PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024) cog-genomics.org/plink/1.9/
#> (C) 2005-2024 Shaun Purcell, Christopher Chang GNU General Public License v3
#> Logging to tmp-data/ld_locus1.log.
#> Options in effect:
#> --bfile tmp-data/g1000_eur_small
#> --extract C:\Users\au639593\AppData\Local\Temp\Rtmp6Th2Hp\file6ebc60487075.txt
#> --keep-allele-order
#> --out tmp-data/ld_locus1
#> --r square
#>
#> 32574 MB RAM detected; reserving 16287 MB for main workspace.
#> 56148 variants loaded from .bim file.
#> 503 people (240 males, 263 females) loaded from .fam.
#> --extract: 2816 variants remaining.
#> Using up to 8 threads (change this with --threads).
#> Before main variant filters, 503 founders and 0 nonfounders present.
#> Calculating allele frequencies... 0%1%2%3%4%5%6%7%8%9%10%11%12%13%14%15%16%17%18%19%20%21%22%23%24%25%26%27%28%29%30%31%32%33%34%35%36%37%38%39%40%41%42%43%44%45%46%47%48%49%50%51%52%53%54%55%56%57%58%59%60%61%62%63%64%65%66%67%68%69%70%71%72%73%74%75%76%77%78%79%80%81%82%83%84%85%86%87%88%89%90%91%92%93%94%95%96%97%98%99% done.
#> Total genotyping rate is in [0.9999995, 1).
#> 2816 variants and 503 people pass filters and QC.
#> Note: No phenotypes present.
#> --r square to tmp-data/ld_locus1.ld ... 0% [processing]writing] done.
9.3.4 Read in data and fit SuSiE
susie_rss()
is the function to fit SuSiE using summary statistics. Earlier we used susie()
function which requires individual level data.
9.3.5 Examine results
#>
#> Variables in credible sets:
#>
#> variable variable_prob cs
#> 1456 0.09199869 1
#> 1420 0.07451778 1
#> 1417 0.06848847 1
#> 1401 0.06379124 1
#> 1334 0.05783427 1
#> 1364 0.05781606 1
#> 1367 0.05778666 1
#> 1331 0.05777183 1
#> 1347 0.05429258 1
#> 1288 0.05367458 1
#> 1339 0.04928034 1
#> 1280 0.04659030 1
#> 1391 0.04472727 1
#> 1378 0.04098725 1
#> 1256 0.03098191 1
#> 1257 0.03085088 1
#> [ reached 'max' / getOption("max.print") -- omitted 5 rows ]
#>
#> Credible sets summary:
#>
#> cs cs_log10bf cs_avg_r2 cs_min_r2
#> 1 7.455457 0.9186663 0.8268138
#> variable
#> 1256,1257,1267,1270,1272,1280,1288,1314,1316,1331,1334,1339,1347,1364,1367,1378,1391,1401,1417,1420,1456
locus1_susie <- arrange(summary(fitted_rss_1)$vars, variable)
locus1 <- cbind(locus1, locus1_susie[-1])
# Plot
ggplot(locus1, aes(x = pos, y = -log10(p_value))) +
theme_bw(14) +
geom_point(alpha = 0.8, size = 1.3) +
geom_point(color = "red", size = 2, data = filter(locus1, cs == 1)) +
geom_label_repel(aes(label = rsid), size = 2, max.overlaps = 45,
data = filter(locus1, cs == 1))
How do the PIPs compare to the p-values? Generate a plot to show this.
9.3.6 Locus 2
Now perform similar analyses on locus2.
How do the PIPs compare to the p-values? Generate a plot to show this.
What if we changed to a 75% credible set for locus 2? Write some code to do this and plot. Can we trust this result?
Why might there be no result in fitted_rss_2 for the second credible set?