A Split-Apply-Combine strategy to parallelize the evaluation of a function.
big_parallelize(
X,
p.FUN,
p.combine = NULL,
ind = cols_along(X),
ncores = nb_cores(),
...
)
An object of class FBM.
The function to be applied to each subset matrix. It must take a Filebacked Big Matrix as first argument and ind, a vector of indices, which are used to split the data. For example, if you want to apply a function to X[ind.row, ind.col], you may use X[ind.row, ind.col[ind]] in p.FUN.
Function to combine the results with do.call. This function should accept multiple arguments (...). For example, you can use c, cbind, rbind. This package also provides function plus to add multiple arguments together. The default is NULL, in which case the results are not combined and are returned as a list, each element being the result of a block.
Initial vector of subsetting indices. Default is the vector of all column indices.
Number of cores used. The default is nb_cores(), as shown in the usage above.
Extra arguments to be passed to p.FUN.
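The re-indexing rule described for p.FUN can be illustrated with a minimal base R sketch. It rests on assumptions made only for illustration: a plain matrix stands in for an FBM, and col_sums_sub is a hypothetical function that is not part of this package. The point is that ind holds positions within cols, so the function must subset with cols[ind] rather than ind itself.

```r
# A plain matrix stands in for an FBM here (assumption for illustration only).
X <- matrix(1:20, nrow = 4)  # 4 rows, 5 columns
rows <- c(1, 3)
cols <- c(2, 4, 5)

# Hypothetical p.FUN: `ind` indexes into `cols`, not into the columns of X.
col_sums_sub <- function(X, ind, rows, cols) {
  colSums(X[rows, cols[ind], drop = FALSE])
}

# One block covering positions 1:2 of `cols`, i.e. columns 2 and 4 of X.
block <- col_sums_sub(X, ind = 1:2, rows = rows, cols = cols)
```

With this convention, the initial vector of indices would be seq_along(cols) rather than the default vector of all column indices.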
Returns a list of ncores elements, each element being the result of one of the cores, computed on a block. The elements of this list are then combined with do.call(p.combine, .) if p.combine is given.
This function splits indices into parts, then applies a given function to each part, and finally combines the results.
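The three steps can be pictured with a sequential base R sketch. This is not the package's implementation: split_apply_combine and its nblocks argument are hypothetical names, a plain matrix stands in for an FBM, and nothing runs in parallel. It only shows how splitting the indices, applying a function per block, and combining with do.call fit together.

```r
# Hypothetical helper: a sequential, base R sketch of split-apply-combine.
# It is NOT the package's implementation and runs nothing in parallel.
split_apply_combine <- function(X, FUN, combine = NULL, ind, nblocks, ...) {
  # 1. Split `ind` into (at most) `nblocks` contiguous parts.
  groups <- sort(rep_len(seq_len(nblocks), length(ind)))
  parts <- unname(split(ind, groups))
  # 2. Apply FUN to each part.
  res <- lapply(parts, function(part) FUN(X, ind = part, ...))
  # 3. Combine the block results, or return them as a list.
  if (is.null(combine)) res else do.call(combine, res)
}

# A plain matrix stands in for an FBM (assumption for illustration only).
X <- matrix(1:20, nrow = 4)
col_sums_sub <- function(X, ind) colSums(X[, ind, drop = FALSE])

# Without a combine function: one list element per block.
as_list <- split_apply_combine(X, col_sums_sub,
                               ind = seq_len(ncol(X)), nblocks = 2)

# With `c` as the combine function: a single vector.
combined <- split_apply_combine(X, col_sums_sub, combine = c,
                                ind = seq_len(ncol(X)), nblocks = 2)
```

In this sketch, the combined result should match colSums(X) computed on the whole matrix, which is the same kind of check the examples perform with all.equal.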
# Not run on CRAN, where parallel code is very slow.
X <- big_attachExtdata()
### Computation on the whole matrix
true <- big_colstats(X)
big_colstats_sub <- function(X, ind) {
  big_colstats(X, ind.col = ind)
}
# 1. the computation is split along all the columns
# 2. for each part the computation is done, using `big_colstats`
# 3. the results (data.frames) are combined via `rbind`.
test <- big_parallelize(X, p.FUN = big_colstats_sub,
                        p.combine = 'rbind', ncores = 2)
all.equal(test, true)
### Computation on a part of the matrix
n <- nrow(X)
m <- ncol(X)
rows <- sort(sample(n, n/2)) # sort to provide some locality in accesses
cols <- sort(sample(m, m/2)) # idem
true2 <- big_colstats(X, ind.row = rows, ind.col = cols)
big_colstats_sub2 <- function(X, ind, rows, cols) {
  big_colstats(X, ind.row = rows, ind.col = cols[ind])
}
# This doesn't work because, by default, the computation is spread
# along all columns. We must explicitly specify the `ind` parameter.
tryCatch(big_parallelize(X, p.FUN = big_colstats_sub2,
                         p.combine = 'rbind', ncores = 2,
                         rows = rows, cols = cols),
         error = function(e) message(e))
# This now works, using `ind = seq_along(cols)`.
test2 <- big_parallelize(X, p.FUN = big_colstats_sub2,
                         p.combine = 'rbind', ncores = 2,
                         ind = seq_along(cols),
                         rows = rows, cols = cols)
all.equal(test2, true2)