`vignettes/big-apply.Rmd`

`big-apply.Rmd`

Function `big_apply()`

enables to apply standard R functions to an FBM, by relying on split-apply-combine strategy where only a subset of the matrix is accessed at once. This allows for a compromise between the amount of maximum memory needed, and the speed of the operation. In this tutorial, I showcase two examples, the imputation of a ‘double’ FBM and the multiplication of two FBMs.

Some data to work with:

To impute by the mean of each column, you can use the following code:

big_apply(X, function(X, ind) { # have an idea of progress print(ind[1]) # access a subset of columns as a standard R matrix X.sub <- X[, ind, drop = FALSE] # get the location (i, j) of missing values ind_na <- which(is.na(X.sub), arr.ind = TRUE) # compute the corresponding mean for each column j means <- colMeans(X.sub, na.rm = TRUE)[ind_na[, 2]] # update j (relative to subset) to global 'ind' ind_na[, 2] <- ind[ind_na[, 2]] # fill positions with corresponding means X[ind_na] <- means # here we don't want to return anything, so `NULL` NULL }, a.combine = 'c', block.size = 500)

```
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
```

`## NULL`

What is going on:

The function

**split**indices (`cols_along(X)`

by default, can be changed using parameter`ind`

) to have maximum size of`block.size`

.Then, this is passed as

`ind`

into the function and can be used to access a subset of columns of the FBM with`X[, ind, drop = FALSE]`

(which is then a standard R matrix).Then you can compute the colmeans by

**apply**ing a standard R function on this subset, using e.g.`colMeans(X.sub, na.rm = TRUE)`

.Then, the missing values can be filled with the corresponding column means and assigned back to the FBM (be aware of the relative column indices

`ind`

versus the global ones`cols_along(X)`

).Then, we return

`NULL`

because we are not interested in returning anything here. We also use`a.combine = 'c'`

to**combine**all`NULL`

corresponding to each block to return only one`NULL`

instead of a list of`NULL`

.If you would like to return e.g. the column means that you used for imputation, you could e.g. replace

`NULL`

by the result of`colMeans(X.sub, na.rm = TRUE)`

, and keep`a.combine = 'c'`

.

# Verification X[1, ]

```
## [1] -1.267660e-02 3.282387e-02 3.315054e-04 -2.392106e-02 -5.341066e-02
## [6] 1.382311e-02 -1.692490e-02 -3.555750e-02 2.448234e-02 -2.054286e-02
## [11] 5.787257e-03 -2.372889e-02 9.233218e-03 -3.948486e-02 -1.765272e-05
## [16] -1.623886e-02 1.675917e-02 4.250739e-02 -4.906126e-03 -1.519135e-02
## [ reached getOption("max.print") -- omitted 1980 entries ]
```

big_scale()(X)$center

```
## [1] -1.267660e-02 3.282387e-02 3.315054e-04 -2.392106e-02 -5.341066e-02
## [6] 1.382311e-02 -1.692490e-02 -3.555750e-02 2.448234e-02 -2.054286e-02
## [11] 5.787257e-03 -2.372889e-02 9.233218e-03 -3.948486e-02 -1.765272e-05
## [16] -1.623886e-02 1.675917e-02 4.250739e-02 -4.906126e-03 -1.519135e-02
## [ reached getOption("max.print") -- omitted 1980 entries ]
```

Imagine you have two large FBMs

# here they are not that large but it is just for the example N <- 5000 M1 <- 1000 M2 <- 2000 library(bigstatsr) X1 <- FBM(N, M1, init = 1) X2 <- FBM(N, M2, init = 2)

How to compute the cross-product \(X_1^T X_2\)?

The first thing is to ask whether you really want to do this, and questioned the dimension of the resulting matrix (here `M1`

\(\times\) `M2`

).

There are many solutions to this problem, that depends mainly of the size of your matrices.

The first simple solution when `X2`

is small is to access it as a standard R matrix and to use.

cprod1 <- big_cprodMat(X1, X2[]) dim(cprod1)

`## [1] 1000 2000`

If `X1`

is small and the resulting product is small, then you can also use

cprod2 <- t(big_cprodMat(X2, X1[])) all.equal(cprod2, cprod1)

`## [1] TRUE`

If the matrices are larger, especially `X2`

, then you can compute the cross-product only for a subset of columns of `X2`

, which gives you only a subset of columns of the result, which you can then combine. This can be implemented using

cprod3 <- big_apply(X2, function(X, ind) { print(ind[1]) big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500)

```
## [1] 1
## [1] 501
## [1] 1001
## [1] 1501
```

all.equal(cprod3, cprod1)

`## [1] TRUE`

To use parallelism, you can e.g. use `bigparallelr::set_blas_ncores()`

to allow for parallel matrix operations if R is linked to some parallel matrix library such as MKL or OpenBLAS. You can also use the parallelism from `big_apply()`

; this won’t work as is:

big_apply(X2, function(X, ind) { print(ind[1]) big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500, ncores = nb_cores())

`## Error in {: task 1 failed - "task 1 failed - "could not find function "big_cprodMat"""`

You need two things here, which is basically telling the parallel clusters what are `big_cprodMat`

and `X1`

; you can do

cprod4 <- big_apply(X2, function(X, ind, X1) { print(ind[1]) bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE]) }, a.combine = "cbind", block.size = 500, ncores = nb_cores(), X1 = X1) all.equal(cprod4, cprod1)

`## [1] TRUE`

Basically, you need to tell explicitly in which package to find the function, and to pass the other variables as parameters in the function and in `big_apply()`

. For more information on these matters, please have a look at this tutorial on parallelism with R. Also note that each core will use at most a block size of 500 here, so at most 500 x ncores, so you might need to reduce the block size a bit when using parallelism (this is done by default when not specifying `block.size`

).

Another strategy, to save a bit of memory, is to preallocate a resulting FBM and fill it:

cprod5 <- FBM(ncol(X1), ncol(X2)) big_apply(X2, function(X, ind, X1, res) { res[, ind] <- bigstatsr::big_cprodMat(X1, X[, ind, drop = FALSE]) NULL }, a.combine = "c", block.size = 500, ncores = nb_cores(), X1 = X1, res = cprod5)

`## NULL`

all.equal(cprod5[], cprod1)

`## [1] TRUE`