Chapter 7 Linkage Disequilibrium (LD)

Most methods that rely on GWAS summary statistics also rely on an LD matrix (correlation matrix between genetic variants). It is used in algorithms to transform GWAS marginal (i.e. independent) effects into joint effects.

Indeed, if we take the simple example of a multiple linear regression \(y = X \beta + \epsilon\), the joint effects are estimated by \(\hat\beta = (X^T X)^{-1} X^T y\), where \(X^T y\) are the marginal effects and \(X^T X\) is the covariance matrix (up to some scaling). Both \(X^T y\) and \(X^T X\) are summary statistics.

The GWAS data from which the GWAS summary statistics were derived is often not accessible, so that the LD matrix is usually derived from another dataset. This reference dataset needs to be as close as possible as the GWAS data (in terms of genetic ancestry) to provide a close estimation of LD. This is why I got interested in estimating the ancestry composition of a GWAS based on allele frequencies reported in GWAS summary statistics (as we’ve seen in 6.3), at least to make sure that the reference data used to compute the LD is similar (in terms of genetic ancestry) to the original GWAS data.

7.1 Derivation of an LD matrix

Functions snp_cor() and bed_cor() from bigsnpr can be used to derive a sparse LD matrix. Both functions allow for some missing values in the input genotypes (although we will see that it might not be such a good idea). If we use e.g. one million variants, the LD matrix is a 1M x 1M matrix, which is too large to compute and also to store.

In practice, we know that LD between two genetic variants tend to decay with an increasing distance between them (Pritchard & Przeworski, 2001). In LDpred2 (Privé, Arbel, et al., 2020), I assume that variants distant from more than 3 cM are uncorrelated (same for variants on different chromosomes). Genetic positions in centimorgans (cM) are preferred over using physical positions in base pairs (bp) in this case. From this, it results in a windowed (sparse) LD matrix (i.e. only values close to the diagonal are computed and stored).

In 9.3.3, we will see how to alternatively use PLINK to derive an LD matrix.

Reuse the same data as before to compute the LD matrix for variants on chromosome 22.

Use snp_asGeneticPos() to convert positions to cM (do it for chromosome 22 only, to avoid downloading large files). And use size = 3 / 1000 (because positions are divided by 1000 internally in bigsnpr).

Do you get any different result using snp_cor() and bed_cor(); why?

Click to see solution

library(bigsnpr)
obj.bed <- bed("tmp-data/GWAS_data_sorted_QC.bed")
NCORES <- nb_cores()

library(dplyr)
ind_chr22 <- which(obj.bed$map$chromosome == 22)
POS_chr22 <- obj.bed$map$physical.pos[ind_chr22]  # in bp

# downloads 5.9 MB
POS2_chr22 <- snp_asGeneticPos(infos.chr = rep(22, length(POS_chr22)), 
                               infos.pos = POS_chr22, dir = "tmp-data")

cor1 <- bed_cor(obj.bed, ind.col = ind_chr22, ncores = NCORES,
                infos.pos = POS2_chr22, size = 3 / 1000)  # in cM

bigsnp <- snp_attach("tmp-data/GWAS_data_sorted_QC.rds")
G <- bigsnp$genotypes
stopifnot(all(dim(G) == dim(obj.bed)))  # quick check
cor2 <- snp_cor(G, ind.col = ind_chr22, ncores = NCORES,
                infos.pos = POS2_chr22, size = 3 / 1000)
cbind(cor2[1:5, 1:5], cor1[1:5, 1:5])  # not exactly the same; why?

#> 5 x 10 sparse Matrix of class "dgCMatrix"
#>                                                                             
#> [1,]  1.00000000 -0.05154381 -0.03629097 -0.02959808 -0.04231449  1.00000000
#> [2,] -0.05154381  1.00000000 -0.15134204 -0.15277138  0.15337533 -0.05251683
#> [3,] -0.03629097 -0.15134204  1.00000000  0.92384555 -0.50516808 -0.03645858
#> [4,] -0.02959808 -0.15277138  0.92384555  1.00000000 -0.45964191 -0.02979486
#> [5,] -0.04231449  0.15337533 -0.50516808 -0.45964191  1.00000000 -0.04304076
#>                                                     
#> [1,] -0.05251683 -0.03645858 -0.02979486 -0.04304076
#> [2,]  1.00000000 -0.15427297 -0.15639606  0.15833843
#> [3,] -0.15427297  1.00000000  0.92937159 -0.51078705
#> [4,] -0.15639606  0.92937159  1.00000000 -0.46606808
#> [5,]  0.15833843 -0.51078705 -0.46606808  1.00000000

G2 <- G$copy(CODE_012)  # go back to having missing values
cor3 <- snp_cor(G2, ind.col = ind_chr22, ncores = NCORES,
                infos.pos = POS2_chr22, size = 3 / 1000)
all.equal(cor3, cor1)

#> [1] TRUE

7.2 Subtleties about LD matrices

LD matrices are often singular (non invertible), often with negative eigenvalues even, which causes stability and divergence issues in algorithms using them (Zabad, Haryan, Gravel, Misra, & Li, 2025)
- if the LD is computed using less individuals than the number of variants, this leads to some eigenvalues being 0 (singular)
- computing LD from data with missing values often leads to the LD matrix having negative eigenvalues (because not the same individuals are used for all correlations, when using pairwise complete observations); imputing before computing LD can solve this issue (Zabad et al., 2025)
- windowing often results in having negative eigenvalues; defining independent LD blocks (Privé, 2021) in the LD matrix helps a lot (Privé, Arbel, et al., 2022)
- thresholding (discarding small absolute correlations) results in negative eigenvalues, so best to be avoided
one can regularize the LD matrix (e.g. by adding some positive value to the diagonal, or by scaling down off-diagonal elements, as done in many methods); but too much regularization will bias results (Privé, Albiñana, Arbel, Pasaniuc, & Vilhjálmsson, 2023)
computing LD from imputed data seems to provide an unbiased estimation (Privé, Arbel, et al., 2022)
should partial correlations be used instead? (e.g. when there is some population structure in the data); this is still an open question I want to investigate

References

Pritchard, J.K., & Przeworski, M. (2001). Linkage disequilibrium in humans: Models and data. The American Journal of Human Genetics, 69, 1–14.

Privé, F. (2021). Optimal linkage disequilibrium splitting. Bioinformatics, 38, 255–256.

Privé, F., Albiñana, C., Arbel, J., Pasaniuc, B., & Vilhjálmsson, B.J. (2023). Inferring disease architecture and predictive ability with LDpred2-auto. The American Journal of Human Genetics, 110, 2042–2055.

Privé, F., Arbel, J., Aschard, H., & Vilhjálmsson, B.J. (2022). Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.

Privé, F., Arbel, J., & Vilhjálmsson, B.J. (2020). LDpred2: better, faster, stronger. Bioinformatics, 36, 5424–5431.

Zabad, S., Haryan, C.A., Gravel, S., Misra, S., & Li, Y. (2025). Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms. The American Journal of Human Genetics. Retrieved from https://doi.org/10.1016/j.ajhg.2025.05.002