23_May_2025

class: title-slide center middle inverse

<br>

# Quality Control and Imputation<br>of GWAS Summary Statistics

<br>

## Florian Privé

### Aarhus University, Denmark

#### <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 576 512" width="1em" height="1em"><path d="M407.8 294.7c-3.3-.4-6.7-.8-10-1.3c3.4 .4 6.7 .9 10 1.3zM288 227.1C261.9 176.4 190.9 81.9 124.9 35.3C61.6-9.4 37.5-1.7 21.6 5.5C3.3 13.8 0 41.9 0 58.4S9.1 194 15 213.9c19.5 65.7 89.1 87.9 153.2 80.7c3.3-.5 6.6-.9 10-1.4c-3.3 .5-6.6 1-10 1.4C74.3 308.6-9.1 342.8 100.3 464.5C220.6 589.1 265.1 437.8 288 361.1c22.9 76.7 49.2 222.5 185.6 103.4c102.4-103.4 28.1-156-65.8-169.9c-3.3-.4-6.7-.8-10-1.3c3.4 .4 6.7 .9 10 1.3c64.1 7.1 133.6-15.1 153.2-80.7C566.9 194 576 75 576 58.4s-3.3-44.7-21.6-52.9c-15.8-7.1-40-14.9-103.2 29.8C385.1 81.9 314.1 176.4 288 227.1z" fill="#0886FE"/></svg> <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg">  <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> privefl

---

### GWAS summary statistics

<br>

- `\(\hat{\gamma}_j\)` &mdash; the GWAS effect size of variant `\(j\)` (marginal effect),

- `\(\text{se}(\hat{\gamma}_j)\)` &mdash; its standard error,

- `\(z_j = \frac{\hat{\gamma}_j}{\text{se}(\hat{\gamma}_j)}\)` &mdash; the Z-score of variant `\(j\)`,

- `\(n_j\)` &mdash; the GWAS sample size associated with variant `\(j\)`,

- `\(f_j\)` &mdash; the allele frequency of variant `\(j\)`,

- `\(\text{INFO}_j\)` &mdash; the imputation INFO score of variant `\(j\)`

---

### The first quality control I already recommend (e.g. for LDpred2)

**Compare standard deviations** of genotypes estimated in 2 ways:

<br>

1.  - When linear regression was used 
    \begin{equation}
    \text{sd}(G_j) \approx \dfrac{\text{sd}(y)}{\sqrt{n_j \cdot \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}}
    \end{equation}

- When logistic regression was used (case-control phenotype)
    \begin{equation}\label{eq:approx-sd-log}
    \text{sd}(G_j) \approx \dfrac{2}{\sqrt{n_j^\text{eff} \cdot \text{se}(\hat{\gamma}_j)^2 + \hat{\gamma}_j^2}}
    \end{equation}
<br>
2. \begin{equation}\text{sd}(G_j) \approx \sqrt{2 \cdot f_j \cdot (1 - f_j) \cdot \text{INFO}_j}\end{equation}

---

### Detect differences in per-variant GWAS sample sizes

---

### Detect bias in total effective GWAS sample size

.pull_left[
<img src="figures/cad_quick_qc.png" width="100%" style="display: block; margin: auto;" />

<span class="footnote"> `\(N_\text{eff} = \frac{4}{1 / N_\text{ca} + 1 / N_\text{co}}\)` </span>
]

.pull_right[
<img src="figures/cad_neff_perstudy.png" width="70%" style="display: block; margin: auto;" />
]

---

### Detect low imputation INFO scores & other issues

<br>

---

### Multi-ancestry INFO scores are overestimated (e.g. in the UK Biobank)

<br>

---

### Overview of possible errors and misspecifications in GWAS sumstats<br>with possible harmful consequences, as well as possible remedies

<br>

.footnote[F. Privé et al (2022). Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. *Human Genetics and Genomics Advances*.]

---

### Read more about this

- Privé, F., et al. (2022) "Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores." *Human Genetics and Genomics Advances* 3.4.

- Gazal, S., et al. (2018) "Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations." *Nature Genetics* 50.11.

- Grotzinger, A.D., et al. (2023) "Pervasive downward bias in estimates of liability-scale heritability in genome-wide association study meta-analysis: a simple solution." *Biological Psychiatry* 93.1.

- Zou, Y., et al. (2022) "Fine-mapping from summary data with the “Sum of Single Effects” model." *PLoS Genetics* 18.7.

- Chen, W., et al. (2021) "Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors." *Nature Communications* 12.1.

- Privé, F. (2022) "Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics." *Bioinformatics* 38.13.

---

class: center middle inverse

<br>

# Ongoing project

<br>

## Disclaimer:

### Started 1.5y ago, but paused for a while

### Will resume after the ancestry deconvolution is submitted

---

### Synergy between quality control (QC) and imputation

<br>

- imputation is used for QC
    
- QC is needed before imputation so that errors don't propagate

- imputation can recover QCed variants

<br>

#### What I propose to do

---

### Additional (complementary) QC &#8212; DENTIST methodology

<br>

GCTA method which compares reported Z-scores with imputed Z-scores.

<br>

`\(\chi^2(1)\)` test statistic:
<img src="figures/eq-dentist.jpg" width="90%" style="display: block; margin: auto;" />

where `\(i\)` is the variant of interest, and `\(t\)` the variants used for imputing.

<br>

It is particularly good at detecting allelic errors (opposite effect).

.footnote[DENTIST citation: Chen, W., et al. (2021) "Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors." *Nature Communications* 12.1.]

---

### Quick simulation to check DENTIST

<br>

Design:

- Use 145K variants on chromosome 22 with MAF > 0.005 and INFO > 0.8

- Simulate some phenotype with heritability of 0.1 and polygenicity of 0.01

- Compute the GWAS summary statistics using N=50K    
(Z-scores in [-20; 20], mostly in [-10; 10])

- For 1000 variants at random, assign them an opposite effect (allelic error)

<br>

Results:

- 802 true positives (>80% power), but 3209 false positives

`\(\Rightarrow\)` improve the DENTIST methodology,    
to get less false positives (and ideally more power)    
    (as an <svg viewBox="0 0 581 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> implementation)

---

### Details about DENTIST

- Use a sliding window (2 Mb with 500 Kb overlap)

- Separate each window into two groups that are used to impute each other

- Invert `\(R_{tt}\)` using 40% of eigenvectors

- Do not use variants `\(t\)` with `\(R_{it}^2 > 0.95\)` to improve computational stability

- Do 10 iterations to remove variants (max top 0.5% for first 9 iterations)    
so that errors don't propagate

#### Possible issues (power and false positives)

- Loss of power by using only half of variants in each window

- Loss of power by not using highly correlated variants

- Eigenvectors can capture components that are not useful for imputation    
(e.g. LD blocks in weak LD with the variant we want to impute)

- Iterative removal and updating might not subtle enough

- Test statistic becomes larger with larger Z-scores

---

### My proposed approach (1/2)

<br>

For a particular variant `\(i\)` to test,

- consider variants `\(t\)` by decreasing order of `\(R_{it}^2\)`, up to `\(R_{it}^2 > 0.2\)`

`\(\Rightarrow\)` we need all `\(R_{jt}\)` such that `\(R_{it}^2 > 0.2\)` and `\(R_{ij}^2 > 0.2\)`    
    (not too large: 700 MB for 145K variants of chr22)
    
--

- pick them if their LD score with the variants already picked is <30

`\(\Rightarrow\)` the set of variants `\(t\)` to impute has maximum power,    
    without being too large nor too redundant

- do not consider variants with a potential error

`\(\Rightarrow\)` do not propagate errors, and faster (less updates when removed)

---

### My proposed approach (2/2)

For the iterations,

- Only remove one variant per iteration

- Update the statistics for those that used this variant for imputation

- Iterate until no variant is significant anymore

<br>

Other details:

- Constrain the denominator to be at least 0.01

- Solve the linear system `\(R_{tt}^{-1} R_{ti}\)` (with a bit of regularization)

- Scale the Z-statistic by the maximum Z-score divided by 4

- Compute `\(n_j^\text{imp} = N \cdot r^2_j\)`

- Compute `\(\text{se}(\hat{\gamma}_j)^\text{imp}\)` from `\(\text{sd}(G_j)\)`, `\(n_j^\text{imp}\)` and `\(\gamma_j^\text{imp}\)`

---

### More simulation results

- Use 145K variants on chromosome 22 with MAF > 0.005 and INFO > 0.8
- Simulate some phenotype with h2=0.01 or 0.1 and polygenicity of 0.01
- Compute the GWAS summary statistics using 356K individuals
- For 500 variants at random, assign them an opposite effect (allelic error)

<br>

---

### Real data analysis

<br>

Using type-1 diabetes (T1D) GWAS summary statistics with

- `\(N_\text{eff}\)` = 13.5K

- using 16K variants with very large effects

- from a long-range LD region (HLA on chromosome 6)

- genotypes imputed using the 1000G data (mediocre imputation)

<br>

`\(\Rightarrow\)` 12,310 "errors" detected with DENTIST

---

### Impossible imputation

---

### First issue: the assumption of the imputation model

<br>

- the model is assuming `\(Z \sim N(0, R)\)`,

- whereas it should be instead `\(Z \sim N(R \beta, R)\)`,    
where `\(\beta\)` are the (scaled) causal effects

- but we do not know the causal effects, and they are assumed to be small

<br>

I tried to estimate causal effects with both SuSiE and LDpred2-auto.

---

### Another impossible imputation

---

### Second issue: duplicates with different estimates

<br>

<br>

Idea: could somehow try to pick the best?

---

### Take-home messages

<br>

- There can be many issues in GWAS summary statistics

- You can detect many of them by comparing SDs estimated in two ways

- You can detect other (complementary) issues with DENTIST

- DENTIST is currently prone to false positives; it needs to be improved

- I hope to provide QCed GWAS summary statistics for everyone to use

<br>

.center[
### Thank you for your attention

Presentation available at [bit.ly/privefl230525](https://bit.ly/privefl230525)