Function big_apply() enables to apply standard R functions to an FBM, by relying on split-apply-combine strategy where only a subset of the matrix is accessed at once. This allows for a compromise between the amount of maximum memory needed, and the speed of the operation. In this tutorial, I showcase two examples, the imputation of a ‘double’ FBM and the multiplication of two FBMs.

## Example 1: Imputation of a ‘double’ FBM

Some data to work with:

library(bigstatsr)
X <- FBM(2000, 2000, init = rnorm(2000^2))
X[1, ] <- NA

To impute by the mean of each column, you can use the following code:

big_apply(X, function(X, ind) {
# have an idea of progress
print(ind[1])
# access a subset of columns as a standard R matrix
X.sub <- X[, ind, drop = FALSE]
# get the location (i, j) of missing values
ind_na <- which(is.na(X.sub), arr.ind = TRUE)
# compute the corresponding mean for each column j
means <- colMeans(X.sub, na.rm = TRUE)[ind_na[, 2]]
# update j (relative to subset) to global 'ind'
ind_na[, 2] <- ind[ind_na[, 2]]
# fill positions with corresponding means
X[ind_na] <- means
# here we don't want to return anything, so NULL
NULL
}, a.combine = 'c', block.size = 500)
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
## NULL

What is going on:

• The function split indices (cols_along(X) by default, can be changed using parameter ind) to have maximum size of block.size.

• Then, this is passed as ind into the function and can be used to access a subset of columns of the FBM with X[, ind, drop = FALSE] (which is then a standard R matrix).

• Then you can compute the colmeans by applying a standard R function on this subset, using e.g. colMeans(X.sub, na.rm = TRUE).

• Then, the missing values can be filled with the corresponding column means and assigned back to the FBM (be aware of the relative column indices ind versus the global ones cols_along(X)).

• Then, we return NULL because we are not interested in returning anything here. We also use a.combine = 'c' to combine all NULL corresponding to each block to return only one NULL instead of a list of NULL.

• If you would like to return e.g. the column means that you used for imputation, you could e.g. replace NULL by the result of colMeans(X.sub, na.rm = TRUE), and keep a.combine = 'c'.

# Verification
X[1, ]
##  [1] -1.267660e-02  3.282387e-02  3.315054e-04 -2.392106e-02 -5.341066e-02
##  [6]  1.382311e-02 -1.692490e-02 -3.555750e-02  2.448234e-02 -2.054286e-02
## [11]  5.787257e-03 -2.372889e-02  9.233218e-03 -3.948486e-02 -1.765272e-05
## [16] -1.623886e-02  1.675917e-02  4.250739e-02 -4.906126e-03 -1.519135e-02
##  [ reached getOption("max.print") -- omitted 1980 entries ]
big_scale()(X)\$center
##  [1] -1.267660e-02  3.282387e-02  3.315054e-04 -2.392106e-02 -5.341066e-02
##  [6]  1.382311e-02 -1.692490e-02 -3.555750e-02  2.448234e-02 -2.054286e-02
## [11]  5.787257e-03 -2.372889e-02  9.233218e-03 -3.948486e-02 -1.765272e-05
## [16] -1.623886e-02  1.675917e-02  4.250739e-02 -4.906126e-03 -1.519135e-02
##  [ reached getOption("max.print") -- omitted 1980 entries ]

## Example 2: multiplication of two FBMs

Imagine you have two large FBMs

# here they are not that large but it is just for the example
N <- 5000
M1 <- 1000
M2 <- 2000
library(bigstatsr)
X1 <- FBM(N, M1, init = 1)
X2 <- FBM(N, M2, init = 2)

How to compute the cross-product $$X_1^T X_2$$?

The first thing is to ask whether you really want to do this, and questioned the dimension of the resulting matrix (here M1 $$\times$$ M2).

There are many solutions to this problem, that depends mainly of the size of your matrices.

The first simple solution when X2 is small is to access it as a standard R matrix and to use.

cprod1 <- big_cprodMat(X1, X2[])
dim(cprod1)
## [1] 1000 2000

If X1 is small and the resulting product is small, then you can also use

cprod2 <- t(big_cprodMat(X2, X1[]))
all.equal(cprod2, cprod1)
## [1] TRUE

If the matrices are larger, especially X2, then you can compute the cross-product only for a subset of columns of X2, which gives you only a subset of columns of the result, which you can then combine. This can be implemented using

cprod3 <- big_apply(X2, function(X, ind) {
print(ind[1])
big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500)
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
all.equal(cprod3, cprod1)
## [1] TRUE

To use parallelism, you can e.g. use bigparallelr::set_blas_ncores() to allow for parallel matrix operations if R is linked to some parallel matrix library such as MKL or OpenBLAS. You can also use the parallelism from big_apply(); this won’t work as is:

big_apply(X2, function(X, ind) {
print(ind[1])
big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500, ncores = nb_cores())
## Error in {: task 1 failed - "task 1 failed - "could not find function "big_cprodMat"""

You need two things here, which is basically telling the parallel clusters what are big_cprodMat and X1; you can do

cprod4 <- big_apply(X2, function(X, ind, X1) {
print(ind[1])
bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine  = "cbind", block.size = 500, ncores = nb_cores(), X1 = X1)
all.equal(cprod4, cprod1)
## [1] TRUE

Basically, you need to tell explicitly in which package to find the function, and to pass the other variables as parameters in the function and in big_apply(). For more information on these matters, please have a look at this tutorial on parallelism with R. Also note that each core will use at most a block size of 500 here, so at most 500 x ncores, so you might need to reduce the block size a bit when using parallelism (this is done by default when not specifying block.size).

Another strategy, to save a bit of memory, is to preallocate a resulting FBM and fill it:

cprod5 <- FBM(ncol(X1), ncol(X2))
big_apply(X2, function(X, ind, X1, res) {
res[, ind] <- bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
NULL
}, a.combine  = "c", block.size = 500, ncores = nb_cores(), X1 = X1, res = cprod5)
## NULL
all.equal(cprod5[], cprod1)
## [1] TRUE