A Split-Apply-Combine strategy to apply common R functions to a Filebacked Big Matrix.

big_apply(X, a.FUN, a.combine = NULL, ind = cols_along(X),
  ncores = 1, block.size = block_size(nrow(X), ncores), ...)

Arguments

X

An FBM (Filebacked Big Matrix).

a.FUN

The function to be applied to each subset matrix. It must take a Filebacked Big Matrix as its first argument and also an argument ind, a vector of indices used to split the data. For example, if you want to apply a function to X[ind.row, ind.col], you may use X[ind.row, ind.col[ind]] in a.FUN.

a.combine

Function to combine the results with do.call. This function should accept multiple arguments (...). For example, you can use c, cbind, or rbind. This package also provides the function plus, which adds multiple arguments together. The default is NULL, in which case the results are not combined and are returned as a list, each element being the result of one block.

ind

Initial vector of subsetting indices. Default is the vector of all column indices.

ncores

Number of cores used. The default (1) does not use parallelism. You may use nb_cores.

block.size

Maximum number of columns (or rows, depending on how you use ind for subsetting) read at once. Default uses block_size.

...

Extra arguments to be passed to a.FUN.
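
To make the a.FUN contract concrete, here is a base R sketch of the X[ind.row, ind.col[ind]] pattern combined with a sum-type combiner. It is an illustration only: a plain matrix stands in for an FBM, plus_sketch is a stand-in written for this example (the package's own plus is not used), and the two fixed blocks are an assumption about how the indices might be split.

```r
set.seed(42)
X <- matrix(rnorm(6 * 8), nrow = 6, ncol = 8)   # plain matrix standing in for an FBM
ind.row <- c(1, 3, 5)                            # rows of interest
ind.col <- c(2, 4, 6, 8)                         # columns of interest
M <- matrix(rnorm(4 * 2), nrow = 4, ncol = 2)    # one row per element of ind.col

# `ind` holds positions within `ind.col`, so it subsets both the columns of X
# (through ind.col[ind]) and the matching rows of M.
prod_sub <- function(X, ind, M, ind.row, ind.col) {
  X[ind.row, ind.col[ind], drop = FALSE] %*% M[ind, , drop = FALSE]
}

# Stand-in combiner (hypothetical name): adds all of its arguments together.
plus_sketch <- function(...) Reduce(`+`, list(...))

# Two blocks of positions within ind.col; extra arguments are passed by name.
blocks <- list(1:2, 3:4)
parts <- lapply(blocks, function(ind) {
  prod_sub(X, ind = ind, M = M, ind.row = ind.row, ind.col = ind.col)
})
res <- do.call(plus_sketch, parts)
```

Because each block contributes a partial product over its own columns, summing the parts should give the same matrix as X[ind.row, ind.col] %*% M computed in one step.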

Details

This function splits the indices into parts, then applies a given function to each subset matrix, and finally combines the results. If parallelization is used, it first splits the indices into parts for parallelization, then splits each of these parts again on its core, applies the given function to each sub-part, and finally combines the results (first on each cluster, then across clusters).
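
The non-parallel case described above can be sketched in base R on an ordinary in-memory matrix. This is not the package's implementation: split_apply_combine and its default block size are hypothetical, chosen only to show the split, apply, and combine steps in sequence.

```r
# Hypothetical helper illustrating split-apply-combine (not bigstatsr's code).
split_apply_combine <- function(X, a.FUN, a.combine = NULL,
                                ind = seq_len(ncol(X)), block.size = 3, ...) {
  # Split: cut the indices into consecutive blocks of at most `block.size`.
  blocks <- split(ind, ceiling(seq_along(ind) / block.size))
  # Apply: call `a.FUN` on each block of indices.
  res <- lapply(blocks, function(ind.block) a.FUN(X, ind = ind.block, ...))
  # Combine: return the list unchanged, or merge it with do.call().
  if (is.null(a.combine)) res else do.call(a.combine, unname(res))
}

set.seed(1)
X <- matrix(rnorm(50), nrow = 5, ncol = 10)  # plain matrix standing in for an FBM

# drop = FALSE keeps a block of a single column as a matrix.
colMeans_sub <- function(X, ind) colMeans(X[, ind, drop = FALSE])
colmeans <- split_apply_combine(X, colMeans_sub, a.combine = "c")
as_list <- split_apply_combine(X, colMeans_sub)  # a.combine = NULL: one element per block
```

With 10 columns and blocks of 3, the indices are split into four blocks (3, 3, 3 and 1 columns), and combining with c should reproduce colMeans(X).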

Examples

X <- big_attachExtdata()

# get the means of each column
colMeans_sub <- function(X, ind) colMeans(X[, ind])
str(colmeans <- big_apply(X, a.FUN = colMeans_sub, a.combine = 'c'))
#> num [1:4542] 1.32 1.59 1.53 1.63 1.09 ...

# get the norms of each column
colNorms_sub <- function(X, ind) sqrt(colSums(X[, ind]^2))
str(colnorms <- big_apply(X, colNorms_sub, a.combine = 'c'))
#> num [1:4542] 33.6 38.4 37.5 39.2 29.6 ...

# get the sums of each row
# split along rows: need to change the "complete" `ind` parameter
str(rowsums <- big_apply(X, a.FUN = function(X, ind) rowSums(X[ind, ]),
                         ind = rows_along(X), a.combine = 'c',
                         block.size = 100))
#> num [1:517] 6243 6168 6242 6249 6212 ...

# it is usually preferred to split along columns
# because matrices are stored by column.
str(rowsums2 <- big_apply(X, a.FUN = function(X, ind) rowSums(X[, ind]),
                          a.combine = 'plus'))
#> num [1:517] 6243 6168 6242 6249 6212 ...

## Every extra parameter to `a.FUN` should be passed to `big_apply`
# get the crossproduct between X and a matrix A
# note that we don't explicitly pass `ind.col` to `a.FUN`
body(big_cprodMat)
#> {
#>     check_args()
#>     assert_lengths(ind.row, rows_along(A.row))
#>     if (length(ind.row) > 0 && length(ind.col) > 0) {
#>         big_apply(X, a.FUN = function(X, ind, M, ind.row) {
#>             crossprod(X[ind.row, ind, drop = FALSE], M)
#>         }, a.combine = "rbind", ind = ind.col, ncores = ncores,
#>             block.size = block.size, M = A.row, ind.row = ind.row)
#>     }
#>     else {
#>         matrix(0, length(ind.col), ncol(A.row))
#>     }
#> }

# get the product between X and a matrix B
# here, we must explicitly pass `ind.col` to `a.FUN`
# because the right matrix also needs to be subsetted.
body(big_prodMat)
#> {
#>     check_args()
#>     assert_lengths(ind.col, rows_along(A.col))
#>     if (length(ind.row) > 0 && length(ind.col) > 0) {
#>         big_apply(X, a.FUN = function(X, ind, M, ind.row, ind.col) {
#>             X[ind.row, ind.col[ind], drop = FALSE] %*% M[ind,
#>                 , drop = FALSE]
#>         }, a.combine = "plus", ind = seq_along(ind.col), ncores = ncores,
#>             block.size = block.size, M = A.col, ind.row = ind.row,
#>             ind.col = ind.col)
#>     }
#>     else {
#>         matrix(0, length(ind.row), ncol(A.col))
#>     }
#> }