Independent LD blocks — snp_ldsplit • bigsnpr

Split a correlation matrix in blocks as independent as possible. This finds the splitting in blocks that minimizes the sum of squared correlation between these blocks (i.e. everything outside these blocks). In case of equivalent splits, it then minimizes the sum of squared sizes of the blocks.

snp_ldsplit(
  corr,
  thr_r2,
  min_size,
  max_size,
  max_K = 500,
  max_r2 = 0.3,
  max_cost = ncol(corr)/200,
  pos_scaled = rep(0, ncol(corr))
)

Arguments

corr: Sparse correlation matrix. Usually, the output of snp_cor().
thr_r2: Threshold under which squared correlations are ignored. This is useful to avoid counting noise, which should give clearer patterns of costs vs. number of blocks. It is therefore possible to have a splitting cost of 0. If this parameter is used, then corr can be computed using the same parameter in snp_cor() (to increase the sparsity of the resulting matrix).
min_size: Minimum number of variants in each block. This is used not to have a disproportionate number of small blocks.
max_size: Maximum number of variants in each block. This is used not to have blocks that are too large, e.g. to limit computational and memory requirements of applications that would use these blocks. For some long-range LD regions, it may be needed to allow for large blocks. You can now provide a vector of values to try.
max_K: Maximum number of blocks to consider. All optimal solutions for K from 1 to max_K will be returned. Some of these K might not have any corresponding solution due to the limitations in size of the blocks. For example, splitting 10,000 variants in blocks with at least 500 and at most 2000 variants implies that there are at least 5 and at most 20 blocks. Then, the choice of K depends on the application, but a simple solution is to choose the largest K for which the cost is lower than some threshold. Default is 500.
max_r2: Maximum squared correlation allowed for one pair of variants in two different blocks. This is used to make sure that strong correlations are not discarded and also to speed up the algorithm. Default is 0.3.
max_cost: Maximum cost reported. Default is ncol(corr) / 200.
pos_scaled: Vector of positions. The positions should be scaled so that limits of a block must be separated by a distance of 1 at the maximum. E.g. if the positions are in base pairs (bp), and you want a maximum distance of 10 Mbp, you need to provide the vector of positions divided by 10e6.

Value

Either NULL when no block splitting satisfies the conditions, or a tibble with seven columns:

$max_size: Input parameter, useful when providing a vector of values to try.
$n_block: Number of blocks.
$cost: The sum of squared correlations outside the blocks.
$cost2: The sum of squared sizes of the blocks.
$perc_kept: Percentage of initial non-zero values kept within the blocks defined.
$all_last: Last index of each block.
$all_size: Sizes of the blocks.
$block_num: Resulting block numbers for each variant. This is not reported anymore, but can be computed with rep(seq_along(all_size), all_size).

Examples

if (FALSE) {

  corr <- readRDS(url("https://www.dropbox.com/s/65u96jf7y32j2mj/spMat.rds?raw=1"))

  # adjust `THR_R2` depending on sample size used to compute corr
  # use e.g. 0.05 for small sample sizes, and 0.01 for large sample sizes
  THR_R2 <- 0.02
  m <- ncol(corr)
  (SEQ <- round(seq_log(m / 30, m / 5, length.out = 10)))
  # replace `min_size` by e.g. 100 for larger data
  (res <- snp_ldsplit(corr, thr_r2 = THR_R2, min_size = 10, max_size = SEQ))

  # add the variant block IDs corresponding to each split
  res$block_num <- lapply(res$all_size, function(.) rep(seq_along(.), .))

  library(ggplot2)
  # trade-off cost / number of blocks
  qplot(n_block, cost, color = factor(max_size, SEQ), data = res) +
    theme_bw(14) +
    scale_y_log10() +
    theme(legend.position = "top") +
    labs(x = "Number of blocks", color = "Maximum block size",
         y = "Sum of squared correlations outside blocks")

  # trade-off cost / number of non-zero values
  qplot(perc_kept, cost, color = factor(max_size, SEQ), data = res) +
    theme_bw(14) +
    # scale_y_log10() +
    theme(legend.position = "top") +
    labs(x = "Percentage of non-zero values kept", color = "Maximum block size",
         y = "Sum of squared correlations outside blocks")

  # trade-off cost / sum of squared sizes
  qplot(cost2, cost, color = factor(max_size, SEQ), data = res) +
    theme_bw(14) +
    scale_y_log10() +
    geom_vline(xintercept = 0)+
    theme(legend.position = "top") +
    labs(x = "Sum of squared blocks", color = "Maximum block size",
         y = "Sum of squared correlations outside blocks")


  ## Pick one solution and visualize blocks
  library(dplyr)
  all_ind <- res %>%
    arrange(cost2 * sqrt(5 + cost)) %>%
    print() %>%
    slice(1) %>%
    pull(all_last)

  ## Transform sparse representation into (i,j,x) triplets
  corrT <- as(corr, "dgTMatrix")
  upper <- (corrT@i <= corrT@j & corrT@x^2 >= THR_R2)
  df <- data.frame(
    i = corrT@i[upper] + 1L,
    j = corrT@j[upper] + 1L,
    r2 = corrT@x[upper]^2
  )
  df$y <- (df$j - df$i) / 2

  ggplot(df) +
    geom_point(aes(i + y, y, alpha = r2)) +
    theme_minimal() +
    theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(),
          strip.background = element_blank(), strip.text.x = element_blank()) +
    scale_alpha_continuous(range = 0:1) +
    scale_x_continuous(expand = c(0.02, 0.02), minor_breaks = NULL,
                       breaks = head(all_ind[[1]], -1) + 0.5) +
    facet_wrap(~ cut(i + y, 4), scales = "free", ncol = 1) +
    labs(x = "Position", y = NULL)
}