Chapter 6 GWAS summary statistics

When genome-wide association studies (GWAS) were first introduced to explore genetic factors associated with diseases, researchers may not have realized how valuable the resulting GWAS summary statistics would be. Over the past decade, GWAS summary statistics have become essential for many genetic analyses.

The summary statistics generated by GWAS do not include personal information, making GWAS summary statistics easy to share and use for further analysis. Collaboration among researchers in large consortia has facilitated meta-analyses of GWAS summary statistics from numerous study sites, resulting in unprecedentedly large GWAS sample sizes (Levey et al., 2021; Okbay et al., 2022; Suzuki et al., 2023; Yengo et al., 2022; Zhou et al., 2022).

6.1 Format

This an example of GWAS summary statistics (a subset of it at least):

tgz <- runonce::download_file(
  "https://figshare.com/ndownloader/files/55262768", 
  dir = "tmp-data", fname = "FT_sumstats_small.tsv.gz")

writeLines(readLines(tgz, n = 6))
#> variant_id   chromosome  base_pair_location  effect_allele   other_allele    effect_allele_frequency imputation_quality  beta    standard_error  p_value
#> rs72858371   1   6409383 T   C   0.986224    0.977552    -0.00615364 0.0108886   0.64
#> rs72633437   1   6409494 C   T   0.906138    0.988959    0.00289667  0.00434703  0.39
#> rs151043271  1   6409495 G   A   0.998537    0.732147    -0.0244749  0.0386808   0.4
#> rs58484986   1   6409653 C   G   0.986228    0.976894    -0.00623617 0.0108958   0.63
#> rs60772726   1   6409662 A   T   0.986228    0.976844    -0.00625052 0.0108957   0.63

Fields included in GWAS summary statistics and their specific names in the header of the file can change very much from one file to the other; this is why tools like MungeSumstats have been developed (Murphy, Schilder, & Skene, 2021).

They often include

  • Information to match variants (chromosome, physical position, alleles, rsid).

  • \(\hat{\beta}_j\) or \(\hat{\gamma}_j\) — the GWAS effect size of variant \(j\) (marginal effect),

  • \(\text{se}(\hat{\gamma}_j)\) — its standard error,

  • \(z_j = \frac{\hat{\gamma}_j}{\text{se}(\hat{\gamma}_j)}\) — the Z-score of variant \(j\),

  • p-values derived from Z-scores

but they more rarely include

  • \(n_j\) — the GWAS sample size associated with variant \(j\),

  • \(f_j\) — the allele frequency of variant \(j\),

  • \(\text{INFO}_j\) — the imputation INFO score (imputation quality) of variant \(j\)

Read-in these GWAS summary statistics and derive the p-values again, using the other columns. Recall that the Z-score squared is \(\chi^2\)-distributed with one degree of freedom.

Click to see solution
(sumstats <- bigreadr::fread2(tgz))
#>    variant_id chromosome base_pair_location effect_allele other_allele
#> 1  rs72858371          1            6409383             T            C
#> 2  rs72633437          1            6409494             C            T
#> 3 rs151043271          1            6409495             G            A
#> 4  rs58484986          1            6409653             C            G
#> 5  rs60772726          1            6409662             A            T
#>   effect_allele_frequency imputation_quality        beta standard_error p_value
#> 1                0.986224           0.977552 -0.00615364     0.01088860    0.64
#> 2                0.906138           0.988959  0.00289667     0.00434703    0.39
#> 3                0.998537           0.732147 -0.02447490     0.03868080    0.40
#> 4                0.986228           0.976894 -0.00623617     0.01089580    0.63
#> 5                0.986228           0.976844 -0.00625052     0.01089570    0.63
#>  [ reached 'max' / getOption("max.print") -- omitted 41805 rows ]
chi2 <- with(sumstats, (beta / standard_error)^2)
pval <- pchisq(chi2, df = 1, lower.tail = FALSE)
plot(pval, sumstats$p_value, log = "xy"); abline(0, 1, col = "red")

6.2 Quality control

However, pooling many GWAS summary statistics in large meta-analyses has been a double-edged sword. On the one hand, this has given much-needed power to many genetic analyses. On the other hand, the quality and standardization of GWAS summary statistics varies widely, and the more studies that are pooled together, the more issues arise when using these pooled summary data.

An overview of known issues in using GWAS summary statistics is presented in Figure 6.1. This includes weakening the construction of polygenic scores (Privé, Arbel, et al., 2022), biasing heritability estimates (Gazal et al., 2018; Grotzinger, Fuente, Privé, Nivard, & Tucker-Drob, 2023), identifying spurious causal genes (Kanai et al., 2022; Zou, Carbonetto, Wang, & Stephens, 2022), as well as undermining other genetic analyses (Chen et al., 2021; Julienne et al., 2021).

Overview of possible errors and misspecifications in GWAS summary statistics, with possible harmful consequences, as well as possible remedies [@prive2021identifying]. $\hat{\gamma}$: GWAS effect sizes; SE: standard errors; $n^{eff}$: effective GWAS sample sizes; SD: standard deviations of genotypes; INFO: measure of imputation quality.

Figure 6.1: Overview of possible errors and misspecifications in GWAS summary statistics, with possible harmful consequences, as well as possible remedies (Privé, Arbel, et al., 2022). \(\hat{\gamma}\): GWAS effect sizes; SE: standard errors; \(n^{eff}\): effective GWAS sample sizes; SD: standard deviations of genotypes; INFO: measure of imputation quality.


The QC I recommend to perform consist in comparing standard deviations of genotypes estimated in 2 ways (Privé, Arbel, et al., 2022; Privé, Arbel, & Vilhjálmsson, 2020):


    • When linear regression was used \[\begin{equation} \text{sd}(G_j) \approx \dfrac{\text{sd}(y)}{\sqrt{n_j \cdot \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}} \tag{6.1} \end{equation}\]

    • When logistic regression was used (case-control phenotype) \[\begin{equation}\label{eq:approx-sd-log} \text{sd}(G_j) \approx \dfrac{2}{\sqrt{n_j^\text{eff} \cdot \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}}~, \tag{6.2} \end{equation}\] where \(n_\text{eff} = \frac{4}{1 / n_\text{ca} + 1 / n_\text{co}}\)

  1. \[\begin{equation} \text{sd}(G_j) \approx \sqrt{2 \cdot f_j \cdot (1 - f_j) \cdot \text{INFO}_j} \tag{6.3} \end{equation}\]

Why would different variants have different GWAS sample sizes?

With this QC, you can detect differences in per-variant GWAS sample sizes

This is based on simulations.

Figure 6.2: This is based on simulations.

When \(n_j\) are missing, you can use this to perform some QC (e.g. at 70% of max n) or to impute \(n_j\).


You can detect bias in total effective GWAS sample size

This is based on CAD (coronary artery disease) GWAS summary statistics.

Figure 6.3: This is based on CAD (coronary artery disease) GWAS summary statistics.

Here I used \(N_\text{eff} = \frac{4}{1 / N_\text{ca} + 1 / N_\text{co}}\) in (6.2), where \(N_\text{ca}\) is the total number of cases and \(N_\text{co}\) is the total number of controls.

What is the problem with this approach when you have a meta-analysis?
(have a look at the following table with sample sizes for all studies comprising the meta-analysis)

For example, overestimating the sample size leads to underestimating the SNP-heritability in many methods such as LD score regression (Grotzinger et al., 2023).


You can detect and QC low imputation INFO scores, & other issues

This is based on breast cancer GWAS summary statistics.

Figure 6.4: This is based on breast cancer GWAS summary statistics.

Any idea what’s going on for outliers in the second figure?


Beware that multi-ancestry INFO scores are overestimated (e.g. in the UK Biobank), because the formula used to derive INFO scores assumes an homogeneous population.

This is based on the UK Biobank.

Figure 6.5: This is based on the UK Biobank.


Use the following GWAS results for HDL and allele frequencies to estimate standard deviations and compare both estimates.

How can you estimate sd(y)?

library(bigsnpr)
obj.bigsnp <- snp_attach("tmp-data/GWAS_data_sorted_QC.rds")
NCORES <- nb_cores()
obj.svd <- readRDS("tmp-data/PCA_GWAS_data.rds")
PC <- predict(obj.svd)
covar <- cbind(as.matrix(obj.bigsnp$fam[c("sex", "age")]), PC[, 1:6])
G <- obj.bigsnp$genotypes
y <- obj.bigsnp$fam$hdl
ind.gwas <- which(!is.na(y) & complete.cases(covar))
gwas <- big_univLinReg(G, y[ind.gwas], ind.train = ind.gwas,
                       covar.train = covar[ind.gwas, ], ncores = NCORES)
N <- length(ind.gwas)
af <- big_colstats(G, ind.row = ind.gwas, ncores = NCORES)$sum / (2 * N)
Click to see solution
sd_y <- with(gwas, quantile(sqrt(0.5 * (N * std.err^2 + estim^2)), 0.01))
c(sd(y[ind.gwas]), sd_y)
#>                1% 
#> 12.96917 11.84793
sd_ss <- sd_y / with(gwas, sqrt(N * std.err^2 + estim^2))
sd_af <- sqrt(2 * af * (1 - af))
library(ggplot2)
ggplot(dplyr::slice_sample(data.frame(sd_af, sd_ss), n = 100e3)) +
  geom_point(aes(sd_af, sd_ss), alpha = 0.5) +
  theme_bigstatsr(0.9) +
  geom_abline(linetype = 2, color = "red") +
  labs(x = "Standard deviations derived from the allele frequencies",
       y = "Standard deviations derived from the summary statistics")

nb_na <- big_counts(G$copy(CODE_012), ind.row = ind.gwas)[4, ]
ggplot(dplyr::slice_sample(data.frame(sd_af, sd_ss, nb_na), n = 100e3)) +
  geom_point(aes(sd_af, sd_ss, color = nb_na), alpha = 0.5) +
  theme_bigstatsr(0.9) +
  scale_color_viridis_c(direction = -1, trans = "sqrt") +
  geom_abline(linetype = 2, color = "red") +
  labs(x = "Standard deviations derived from the allele frequencies",
       y = "Standard deviations derived from the summary statistics")


We will practice more QC on GWAS summary statistics in 8.6.

I provide some example R script in the LDpred2 tutorial that implements this QC. You can also find some other scripts with examples how to prepare several GWAS summary statistics here.

I am currently working on implementing a method that performs a complementary QC, as well as the imputation of GWAS summary statistics (i.e. GWAS results for more variants).

6.3 Ancestry inference

In 4.5, we’ve seen how to infer ancestry for individual-level data using reference data provided in Privé (2022); we can also use this to infer the ancestry composition of a GWAS dataset using only allele frequencies from the GWAS summary statistics. The tutorial accompanying Privé (2022) is here.

Reuse some of the code from 4.5 (and/or from the tutorial) and allele frequencies (that you should compute) to get the ancestry composition of this dataset using function snp_ancestry_summary().

What do you need to do for allele frequencies after matching variants?

Click to see solution
library(bigsnpr)
obj.bed <- bed("tmp-data/GWAS_data_sorted_QC.bed")
NCORES <- nb_cores()
all_freq <- bigreadr::fread2(
  runonce::download_file(
    "https://figshare.com/ndownloader/files/38019027",  # subset for the tutorial (46 MB)
    # "https://figshare.com/ndownloader/files/31620968",  # for real analyses (849 MB)
    dir = "tmp-data", fname = "ref_freqs.csv.gz"))

projection <- bigreadr::fread2(
  runonce::download_file(
    "https://figshare.com/ndownloader/files/38019024",  # subset for the tutorial (44 MB)
    # "https://figshare.com/ndownloader/files/31620953",  # for real analyses (847 MB)
    dir = "tmp-data", fname = "projection.csv.gz"))

# coefficients to correct for overfitting of PCA
correction <- c(1, 1, 1, 1.008, 1.021, 1.034, 1.052, 1.074, 1.099,
                1.123, 1.15, 1.195, 1.256, 1.321, 1.382, 1.443)
# match variants between the two datasets
library(dplyr)
matched <- obj.bed$map %>%
  transmute(chr = chromosome, pos = physical.pos, a1 = allele1, a0 = allele2) %>%
  mutate(beta = 1) %>%
  snp_match(all_freq[1:5]) %>%
  print()
#>   chr     pos a0 a1 beta _NUM_ID_.ss       rsid _NUM_ID_
#> 1   1  752566  G  A   -1           2  rs3094315        1
#> 2   1  785989  T  C   -1           4  rs2980300        2
#> 3   1  798959  G  A    1           5 rs11240777        3
#> 4   1  947034  G  A   -1           6  rs2465126        4
#> 5   1  949608  G  A    1           7     rs1921        5
#> 6   1 1018704  A  G   -1           8  rs9442372        6
#>  [ reached 'max' / getOption("max.print") -- omitted 301150 rows ]

When matching between variants from two datasets, sometimes physical positions are in two different genome builds, usually it can be hg19 and hg38 (for newer datasets). You will need to either convert one of the two sets of positions with snp_modifyBuild() (that uses liftOver, not available for Windows) or by matching using rsIDs instead of positions (by using join_by_pos = FALSE).

# further subsetting on missing values
counts <- bed_counts(obj.bed, ind.col = matched$`_NUM_ID_.ss`, ncores = NCORES)
ind <- which((counts[4, ] / nrow(obj.bed)) < 0.05)
matched2 <- matched[ind, ]
# compute allele frequencies
af <- bed_MAF(obj.bed, ind.col = matched2$`_NUM_ID_.ss`)$af
af2 <- ifelse(matched2$beta > 0, af, 1 - af)

# get ancestry composition
res <- snp_ancestry_summary(
  freq = af2,
  info_freq_ref = all_freq[matched2$`_NUM_ID_`, -(1:5)],
  projection = projection[matched2$`_NUM_ID_`, -(1:5)],
  correction = correction
)
res
#>       Africa (West)      Africa (South)       Africa (East)      Africa (North) 
#>           0.0015918           0.0000000           0.0000000           0.0000000 
#>         Middle East           Ashkenazi               Italy Europe (South East) 
#>           0.0000000           0.0824672           0.2020482           0.1529422 
#> Europe (North East)             Finland         Scandinavia      United Kingdom 
#>           0.0000000           0.0051918           0.1891938           0.1510849 
#>             Ireland Europe (South West)       South America           Sri Lanka 
#>           0.2154802           0.0000000           0.0000000           0.0000000 
#>            Pakistan          Bangladesh         Asia (East)               Japan 
#>           0.0000000           0.0000000           0.0000000           0.0000000 
#>         Philippines 
#>           0.0000000 
#> attr(,"cor_each")
#>       Africa (West)      Africa (South)       Africa (East)      Africa (North) 
#>           0.6109985           0.6137032           0.8576549           0.9560451 
#>         Middle East           Ashkenazi               Italy Europe (South East) 
#>           0.9775514           0.9809802           0.9924371           0.9953612 
#> Europe (North East)             Finland         Scandinavia      United Kingdom 
#>           0.9902400           0.9782142           0.9946803           0.9975349 
#>             Ireland Europe (South West)       South America           Sri Lanka 
#>           0.9956773           0.9949990           0.9002633           0.8909370 
#>            Pakistan          Bangladesh         Asia (East)               Japan 
#>           0.9428874           0.8967786           0.7286470           0.7290332 
#>         Philippines 
#>           0.7343768 
#> attr(,"cor_pred")
#> [1] 0.9990652
# combine some close groups
group <- colnames(all_freq)[-(1:5)]
group[group %in% c("Scandinavia", "United Kingdom", "Ireland")]   <- "Europe (North West)"
group[group %in% c("Europe (South East)", "Europe (North East)")] <- "Europe (East)"

grp_fct <- factor(group, unique(group))  # use factors to keep order
final_res <- tapply(res, grp_fct, sum)
round(100 * final_res, 1)
#>       Africa (West)      Africa (South)       Africa (East)      Africa (North) 
#>                 0.2                 0.0                 0.0                 0.0 
#>         Middle East           Ashkenazi               Italy       Europe (East) 
#>                 0.0                 8.2                20.2                15.3 
#>             Finland Europe (North West) Europe (South West)       South America 
#>                 0.5                55.6                 0.0                 0.0 
#>           Sri Lanka            Pakistan          Bangladesh         Asia (East) 
#>                 0.0                 0.0                 0.0                 0.0 
#>               Japan         Philippines 
#>                 0.0                 0.0

References

Chen, W., Wu, Y., Zheng, Z., Qi, T., Visscher, P.M., Zhu, Z., & Yang, J. (2021). Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors. Nature Communications, 12, 1–10.
Gazal, S., Loh, P.-R., Finucane, H.K., Ganna, A., Schoech, A., Sunyaev, S., & Price, A.L. (2018). Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nature Genetics, 50, 1600–1607.
Grotzinger, A.D., Fuente, J. de la, Privé, F., Nivard, M.G., & Tucker-Drob, E.M. (2023). Pervasive downward bias in estimates of liability-scale heritability in genome-wide association study meta-analysis: A simple solution. Biological Psychiatry, 93, 29–36.
Julienne, H., Laville, V., McCaw, Z.R., He, Z., Guillemot, V., Lasry, C., et al. (2021). Multitrait GWAS to connect disease variants and biological mechanisms. PLoS Genetics, 17, e1009713.
Kanai, M., Elzur, R., Zhou, W., Wu, K.-H.H., Rasheed, H., Tsuo, K., et al. (2022). Meta-analysis fine-mapping is often miscalibrated at single-variant resolution. Cell Genomics, 2, 100210.
Levey, D.F., Stein, M.B., Wendt, F.R., Pathak, G.A., Zhou, H., Aslan, M., et al. (2021). Bi-ancestral depression GWAS in the Million Veteran Program and meta-analysis in> 1.2 million individuals highlight new therapeutic directions. Nature Neuroscience, 24, 954–963.
Murphy, A.E., Schilder, B.M., & Skene, N.G. (2021). MungeSumstats: A Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics, 37, 4593–4596.
Okbay, A., Wu, Y., Wang, N., Jayashankar, H., Bennett, M., Nehzati, S.M., et al. (2022). Polygenic prediction of educational attainment within and between families from genome-wide association analyses in 3 million individuals. Nature Genetics, 54, 437–449.
Privé, F. (2022). Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics, 38, 3477–3480.
Privé, F., Arbel, J., Aschard, H., & Vilhjálmsson, B.J. (2022). Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.
Privé, F., Arbel, J., & Vilhjálmsson, B.J. (2020). LDpred2: better, faster, stronger. Bioinformatics, 36, 5424–5431.
Suzuki, K., Hatzikotoulas, K., Southam, L., Taylor, H.J., Yin, X., Lorenz, K.M., et al. (2023). Multi-ancestry genome-wide study in > 2.5 million individuals reveals heterogeneity in mechanistic pathways of type 2 diabetes and complications. medRxiv. Retrieved from https://doi.org/10.1101/2023.03.31.23287839
Yengo, L., Vedantam, S., Marouli, E., Sidorenko, J., Bartell, E., Sakaue, S., et al. (2022). A saturated map of common genetic variants associated with human height. Nature, 610, 704–712.
Zhou, W., Kanai, M., Wu, K.-H.H., Rasheed, H., Tsuo, K., Hirbo, J.B., et al. (2022). Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease. Cell Genomics, 2, 100192.
Zou, Y., Carbonetto, P., Wang, G., & Stephens, M. (2022). Fine-mapping from summary data with the "Sum of Single Effects" model. PLoS Genetics, 18, e1010299.