Function big_apply() enables to apply standard R functions to an FBM, by relying on split-apply-combine strategy where only a subset of the matrix is accessed at once. This allows for a compromise between the amount of maximum memory needed, and the speed of the operation. In this tutorial, I showcase two examples, the imputation of a ‘double’ FBM and the multiplication of two FBMs.

Example 1: Imputation of a ‘double’ FBM

Some data to work with:

library(bigstatsr)
X <- FBM(2000, 2000, init = rnorm(2000^2))
X[1, ] <- NA

To impute by the mean of each column, you can use the following code:

big_apply(X, function(X, ind) {
  # have an idea of progress
  print(ind[1])
  # access a subset of columns as a standard R matrix
  X.sub <- X[, ind, drop = FALSE]
  # get the location (i, j) of missing values
  ind_na <- which(is.na(X.sub), arr.ind = TRUE)
  # compute the corresponding mean for each column j
  means <- colMeans(X.sub, na.rm = TRUE)[ind_na[, 2]]
  # update j (relative to subset) to global 'ind'
  ind_na[, 2] <- ind[ind_na[, 2]]
  # fill positions with corresponding means
  X[ind_na] <- means
  # here we don't want to return anything, so `NULL`
  NULL
}, a.combine = 'c', block.size = 500)
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
## NULL

What is going on:

  • The function split indices (cols_along(X) by default, can be changed using parameter ind) to have maximum size of block.size.

  • Then, this is passed as ind into the function and can be used to access a subset of columns of the FBM with X[, ind, drop = FALSE] (which is then a standard R matrix).

  • Then you can compute the colmeans by applying a standard R function on this subset, using e.g. colMeans(X.sub, na.rm = TRUE).

  • Then, the missing values can be filled with the corresponding column means and assigned back to the FBM (be aware of the relative column indices ind versus the global ones cols_along(X)).

  • Then, we return NULL because we are not interested in returning anything here. We also use a.combine = 'c' to combine all NULL corresponding to each block to return only one NULL instead of a list of NULL.

  • If you would like to return e.g. the column means that you used for imputation, you could e.g. replace NULL by the result of colMeans(X.sub, na.rm = TRUE), and keep a.combine = 'c'.

# Verification
X[1, ]
##  [1] -0.034339519  0.034951955  0.050932074  0.003296804  0.008878715
##  [6] -0.024596088 -0.035180224 -0.015871200  0.006975371  0.001935426
## [11] -0.060124419  0.004442005  0.047367485 -0.046293716 -0.009476670
## [16] -0.011590082 -0.008450684  0.018243748  0.032336202 -0.010500991
##  [ reached getOption("max.print") -- omitted 1980 entries ]
big_scale()(X)$center
##  [1] -0.034339519  0.034951955  0.050932074  0.003296804  0.008878715
##  [6] -0.024596088 -0.035180224 -0.015871200  0.006975371  0.001935426
## [11] -0.060124419  0.004442005  0.047367485 -0.046293716 -0.009476670
## [16] -0.011590082 -0.008450684  0.018243748  0.032336202 -0.010500991
##  [ reached getOption("max.print") -- omitted 1980 entries ]

Example 2: multiplication of two FBMs

Imagine you have two large FBMs

# here they are not that large but it is just for the example
N <- 5000
M1 <- 1000
M2 <- 2000
library(bigstatsr)
X1 <- FBM(N, M1, init = 1)
X2 <- FBM(N, M2, init = 2)

How to compute the cross-product \(X_1^T X_2\)?

The first thing is to ask whether you really want to do this, and questioned the dimension of the resulting matrix (here M1 \(\times\) M2).

There are many solutions to this problem, that depends mainly of the size of your matrices.

The first simple solution when X2 is small is to access it as a standard R matrix and to use.

cprod1 <- big_cprodMat(X1, X2[])
dim(cprod1)
## [1] 1000 2000

If X1 is small and the resulting product is small, then you can also use

cprod2 <- t(big_cprodMat(X2, X1[]))
all.equal(cprod2, cprod1)
## [1] TRUE

If the matrices are larger, especially X2, then you can compute the cross-product only for a subset of columns of X2, which gives you only a subset of columns of the result, which you can then combine. This can be implemented using

cprod3 <- big_apply(X2, function(X, ind) {
  print(ind[1])
  big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500)
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
all.equal(cprod3, cprod1)
## [1] TRUE

To use parallelism, you can e.g. use bigparallelr::set_blas_ncores() to allow for parallel matrix operations if R is linked to some parallel matrix library such as MKL or OpenBLAS. You can also use the parallelism from big_apply(); this won’t work as is:

big_apply(X2, function(X, ind) {
  print(ind[1])
  big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500, ncores = nb_cores())
## Error in {: task 1 failed - "task 1 failed - "could not find function "big_cprodMat"""

You need two things here, which is basically telling the parallel clusters what are big_cprodMat and X1; you can do

cprod4 <- big_apply(X2, function(X, ind, X1) {
  print(ind[1])
  bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500, ncores = nb_cores(), X1 = X1)
all.equal(cprod4, cprod1)
## [1] TRUE

Basically, you need to tell explicitly in which package to find the function, and to pass the other variables as parameters in the function and in big_apply(). For more information on these matters, please have a look at this tutorial on parallelism with R. Also note that each core will use at most a block size of 500 here, so at most 500 x ncores, so you might need to reduce the block size a bit when using parallelism (this is done by default when not specifying block.size).

Another strategy, to save a bit of memory, is to preallocate a resulting FBM and fill it:

cprod5 <- FBM(ncol(X1), ncol(X2))
big_apply(X2, function(X, ind, X1, res) {
  res[, ind] <- bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
  NULL
}, a.combine  = "c", block.size = 500, ncores = nb_cores(), X1 = X1, res = cprod5)
## NULL
all.equal(cprod5[], cprod1)
## [1] TRUE