The function `big_apply()` applies standard R functions to an FBM using a split-apply-combine strategy, where only a subset of the matrix is accessed at once. This allows for a compromise between the maximum amount of memory needed and the speed of the operation. In this tutorial, I showcase two examples: the imputation of a 'double' FBM and the multiplication of two FBMs.
Some data to work with:
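The chunk that generates the data did not survive extraction. Below is a minimal sketch consistent with the outputs further down: 2,000 columns (so that blocks of size 500 start at indices 1, 501, 1001 and 1501) and a fully missing first row (so that the later verification of `X[1, ]` against the column means holds). The number of rows and the distribution of the values are assumptions:

```r
library(bigstatsr)

# assumed data: the number of rows and the distribution are guesses; only
# the 2000 columns and the missing first row are implied by the outputs below
X <- FBM(1e3, 2e3, init = rnorm(2e6))
X[1, ] <- NA  # an entire row of missing values to impute
```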
To impute by the mean of each column, you can use the following code:
```r
big_apply(X, function(X, ind) {
  # have an idea of the progress
  print(ind[1])
  # access a subset of columns as a standard R matrix
  X.sub <- X[, ind, drop = FALSE]
  # get the location (i, j) of missing values
  ind_na <- which(is.na(X.sub), arr.ind = TRUE)
  # compute the corresponding mean for each column j
  means <- colMeans(X.sub, na.rm = TRUE)[ind_na[, 2]]
  # update j (relative to subset) to global 'ind'
  ind_na[, 2] <- ind[ind_na[, 2]]
  # fill positions with corresponding means
  X[ind_na] <- means
  # here we don't want to return anything, so `NULL`
  NULL
}, a.combine = 'c', block.size = 500)
```

```
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
## NULL
```
What is going on:

1. The function splits the indices (`cols_along(X)` by default; this can be changed using the parameter `ind`) into blocks of maximum size `block.size`.
2. Each block is passed as `ind` to the function, where it can be used to access the corresponding subset of columns of the FBM with `X[, ind, drop = FALSE]` (which is then a standard R matrix).
3. You can then compute the column means by applying a standard R function to this subset, e.g. `colMeans(X.sub, na.rm = TRUE)`.
4. The missing values can then be filled with the corresponding column means and assigned back to the FBM (be aware of the relative column indices `ind` versus the global ones `cols_along(X)`).
5. Finally, we return `NULL` because we are not interested in returning anything here. We also use `a.combine = 'c'` to combine the `NULL`s corresponding to each block into a single `NULL` instead of a list of `NULL`s.
If you would like to return e.g. the column means used for the imputation, you could replace `NULL` by the result of `colMeans(X.sub, na.rm = TRUE)` and keep `a.combine = 'c'`, as in the following sketch.
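A minimal sketch of this variant (reusing the block size from above; the per-block vectors of means are concatenated by `a.combine = 'c'`):

```r
col_means <- big_apply(X, function(X, ind) {
  # return the vector of means for this block of columns
  colMeans(X[, ind, drop = FALSE], na.rm = TRUE)
}, a.combine = 'c', block.size = 500)
```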
```r
# Verification
X[1, ]
```

```
## [1] -0.034339519 0.034951955 0.050932074 0.003296804 0.008878715
## [6] -0.024596088 -0.035180224 -0.015871200 0.006975371 0.001935426
## [11] -0.060124419 0.004442005 0.047367485 -0.046293716 -0.009476670
## [16] -0.011590082 -0.008450684 0.018243748 0.032336202 -0.010500991
## [ reached getOption("max.print") -- omitted 1980 entries ]
```

```r
big_scale()(X)$center
```

```
## [1] -0.034339519 0.034951955 0.050932074 0.003296804 0.008878715
## [6] -0.024596088 -0.035180224 -0.015871200 0.006975371 0.001935426
## [11] -0.060124419 0.004442005 0.047367485 -0.046293716 -0.009476670
## [16] -0.011590082 -0.008450684 0.018243748 0.032336202 -0.010500991
## [ reached getOption("max.print") -- omitted 1980 entries ]
```
Imagine you have two large FBMs:
```r
# here they are not that large, but it is just for the example
N <- 5000
M1 <- 1000
M2 <- 2000

library(bigstatsr)
X1 <- FBM(N, M1, init = 1)
X2 <- FBM(N, M2, init = 2)
```
How to compute the cross-product $X_1^T X_2$? The first thing is to ask whether you really want to do this, and to question the dimension of the resulting matrix (here `M1` $\times$ `M2`).
There are many solutions to this problem; which one to use depends mainly on the size of your matrices.
The first simple solution, when `X2` is small, is to access it as a standard R matrix and to use:
```r
cprod1 <- big_cprodMat(X1, X2[])
dim(cprod1)
```

```
## [1] 1000 2000
```
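Note that `big_cprodMat()` computes the cross-product of its first argument with its second one. As a sanity check, you can compare with base R's `crossprod()` on in-memory copies of the FBMs (fine at this toy scale):

```r
# crossprod(A, B) computes t(A) %*% B; this should return TRUE
all.equal(cprod1, crossprod(X1[], X2[]))
```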
If `X1` is small and the resulting product is small, then you can also use the fact that $(X_2^T X_1)^T = X_1^T X_2$:
```r
cprod2 <- t(big_cprodMat(X2, X1[]))
all.equal(cprod2, cprod1)
```

```
## [1] TRUE
```
If the matrices are larger, especially `X2`, then you can compute the cross-product for only a subset of the columns of `X2` at a time, which gives you the corresponding subset of the columns of the result; these subsets can then be combined with `cbind`. This can be implemented using:
```r
cprod3 <- big_apply(X2, function(X, ind) {
  print(ind[1])
  big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine = "cbind", block.size = 500)
```

```
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
```

```r
all.equal(cprod3, cprod1)
```

```
## [1] TRUE
```
To use parallelism, you can e.g. use `bigparallelr::set_blas_ncores()` to allow for parallel matrix operations, if R is linked to a parallel matrix library such as MKL or OpenBLAS (see the sketch just below).
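A minimal sketch of this option; `set_blas_ncores()` has an effect only when R is actually linked to a multithreaded BLAS:

```r
library(bigparallelr)

# let the BLAS use 4 threads for matrix operations (a no-op with the
# single-threaded reference BLAS); 4 is an arbitrary choice for this sketch
set_blas_ncores(4)
cprod_blas <- big_cprodMat(X1, X2[])
set_blas_ncores(1)  # back to a single-threaded BLAS
```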
You can also use the parallelism from `big_apply()`; this won't work as is:
```r
big_apply(X2, function(X, ind) {
  print(ind[1])
  big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine = "cbind", block.size = 500, ncores = nb_cores())
```

```
## Error in {: task 1 failed - "task 1 failed - "could not find function "big_cprodMat"""
```
You need two things here: basically, telling the parallel clusters where to find `big_cprodMat` and what `X1` is. You can do:
```r
cprod4 <- big_apply(X2, function(X, ind, X1) {
  print(ind[1])
  bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
}, a.combine = "cbind", block.size = 500, ncores = nb_cores(), X1 = X1)
all.equal(cprod4, cprod1)
```

```
## [1] TRUE
```
Basically, you need to say explicitly in which package to find the function, and to pass the other variables as parameters both to the function and to `big_apply()`. For more information on these matters, please have a look at this tutorial on parallelism with R. Also note that each core will process a block of at most 500 columns here, so up to 500 x `ncores` columns are accessed at once; you might therefore need to reduce the block size a bit when using parallelism (this is done by default when not specifying `block.size`).
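For reference, this default is computed by `bigstatsr::block_size()` from the number of rows and the number of cores; you can check what it would use here:

```r
# default block sizes for a matrix with nrow(X2) rows, on 1 core vs 4 cores;
# the block size shrinks as the number of cores grows
bigstatsr::block_size(nrow(X2), ncores = 1)
bigstatsr::block_size(nrow(X2), ncores = 4)
```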
Another strategy, to save a bit of memory, is to preallocate the resulting FBM and fill it:
```r
cprod5 <- FBM(ncol(X1), ncol(X2))
big_apply(X2, function(X, ind, X1, res) {
  res[, ind] <- bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE])
  NULL
}, a.combine = "c", block.size = 500, ncores = nb_cores(), X1 = X1, res = cprod5)
```

```
## NULL
```

```r
all.equal(cprod5[], cprod1)
```

```
## [1] TRUE
```
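This works because an FBM is a reference to data stored on disk: each block writes its result to its own disjoint set of columns of `cprod5`, so there is nothing to return and combine, and the full result never has to be held in memory at once.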