I’m a fan of package {bigmemory}. It’s by far the most convenient solution I found for analyzing large genomic data in R on my computer. I’ve been using it since early 2016 and have also contributed some features.

At first, package {bigstatsr} was using the big.matrix objects of package {bigmemory}. Yet, at some point, I felt the need to become independent of package {bigmemory}. As package {bigstatsr} will be a central tool of all my thesis work, I need to add whatever feature I want whenever I want to. Thus, I reimplemented an object very similar to the filebacked big.matrix object, called “FBM” (Filebacked Big Matrix, very original) in this package. These two formats are so similar that you can easily convert (without copying the data) between the two objects.

In this vignette, I explain the main differences between my package {bigstatsr} and the packages of the bigmemory family.

## Formats and types

### Format

Package {bigmemory} provides 3 types of big.matrix objects:

• a “RAM” big.matrix, which is not shared between processes and use directly random access memory,
• a shared big.matrix, which uses some shared memory (still a mystery for me),
• a filebacked big.matrix (so, shared between processes), which stores the data on disk and access it via memory-mapping.

I placed a lot of interest for shared matrices (filebacked or not). Yet, I encountered memory limitations with the shared big.matrix (non-filebacked). So, at some point, I was using only filebacked big.matrix objects. So, in {bigstatsr}, you will found only the FBM format, which is very similar to the filebacked big.matrix format. To prove it, let us convert from one to the other (without copying the backingfile).

library(bigmemory)
library(bigstatsr)
# Create a temporary FBM
fbm <- FBM(10, 10)
fbm$backingfile ## [1] "/tmp/Rtmp16lQtQ/file51622c3b858d.bk" fbm[] ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 0 0 0 0 0 0 0 0 0 0 ## [2,] 0 0 0 0 0 0 0 0 0 0 ## [3,] 0 0 0 0 0 0 0 0 0 0 ## [4,] 0 0 0 0 0 0 0 0 0 0 ## [5,] 0 0 0 0 0 0 0 0 0 0 ## [6,] 0 0 0 0 0 0 0 0 0 0 ## [7,] 0 0 0 0 0 0 0 0 0 0 ## [8,] 0 0 0 0 0 0 0 0 0 0 ## [9,] 0 0 0 0 0 0 0 0 0 0 ## [10,] 0 0 0 0 0 0 0 0 0 0 # Convert it to a big.matrix bm <- fbm$bm()
# Same backingfile
paste0(dir.name(bm), file.name(bm))
## [1] "/tmp/Rtmp16lQtQ/file51622c3b858d.bk"
# Changing values of one changes the value of the other
bm[]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    0    0    0    0    0    0    0    0    0     0
##  [2,]    0    0    0    0    0    0    0    0    0     0
##  [3,]    0    0    0    0    0    0    0    0    0     0
##  [4,]    0    0    0    0    0    0    0    0    0     0
##  [5,]    0    0    0    0    0    0    0    0    0     0
##  [6,]    0    0    0    0    0    0    0    0    0     0
##  [7,]    0    0    0    0    0    0    0    0    0     0
##  [8,]    0    0    0    0    0    0    0    0    0     0
##  [9,]    0    0    0    0    0    0    0    0    0     0
## [10,]    0    0    0    0    0    0    0    0    0     0
bm[1, 1] <- 2
fbm[1, 1]
## [1] 2
BM2FBM <- function(bm) {
FBM(nrow = nrow(bm), ncol = ncol(bm), type = typeof(bm),
backingfile = file.path(dir.name(bm), sub_bk(file.name(bm))),
create_bk = FALSE)
}

# Convert the filebacked big.matrix to a FBM
fbm2 <- BM2FBM(bm)
bm[, 3] <- 1
fbm2[]
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    2    0    1    0    0    0    0    0    0     0
##  [2,]    0    0    1    0    0    0    0    0    0     0
##  [3,]    0    0    1    0    0    0    0    0    0     0
##  [4,]    0    0    1    0    0    0    0    0    0     0
##  [5,]    0    0    1    0    0    0    0    0    0     0
##  [6,]    0    0    1    0    0    0    0    0    0     0
##  [7,]    0    0    1    0    0    0    0    0    0     0
##  [8,]    0    0    1    0    0    0    0    0    0     0
##  [9,]    0    0    1    0    0    0    0    0    0     0
## [10,]    0    0    1    0    0    0    0    0    0     0

### Types

Package {bigmemory} handles many types:

• unsigned char (1-byte unsigned integer)
• char (1-byte signed integer)
• short (2-byte signed integer)
• integer (4-byte signed integer)
• float (single precision floating-point numbers)
• double (double precision floating-point numbers)
• complex

For now, package {bigstatsr} handles the following types:

• unsigned char
• unsigned short
• integer
• float
• double

Additionally, the unsigned char type is used in the FBM.code256 format, which instead of accessing integer values ranging from 0 to 255, it uses some code to access 256 arbitrary different values. I make a lot of use of this format in my other R package {bigsnpr} in order to store genotype dosages.

## Class

A big.matrix is basically an S4 class object that stores a pointer to a C++ object (an external pointer). When you restart your R session, this pointer becomes Nil and it may make your R session crash. You’ll need a different object, a big.matrix.descriptor (using describe()) which stores enough information to make it possible to create this external pointer again (using attach.big.matrix()). Therefore, one has to often switch between descriptors and big.matrix objects.

For FBMs, I use the nice idea of package bigmemoryExtras. Basically, I use a Reference Class (RC) object with active binding. In this object, I store the external pointer and the information needed to create the pointer to the C++ object. The active binding makes this automatic so that the user never need to use attach.big.matrix() or describe() anymore (and no more session crash!).

What this also means is that you can now serialize a FBM (for example, saving it in an rds file with saveRDS() or using it in a parallel algorithm). For instance, with a standard big.matrix object, you’ll need to pass the descriptor object in paralell algorithms:

X <- FBM(10, 10); X[] <- rnorm(length(X))
bm <- X$bm() library(foreach) cl <- parallel::makeCluster(2) doParallel::registerDoParallel(cl) # Won't work because bm will be Nil when copied to the cluster tryCatch({ foreach(j = 1:10, .combine = 'c') %dopar% { sum(bm[, j]) } }, error = function(e) print(e)) ## <simpleError in unserialize(socklist[[n]]): error reading from connection> parallel::stopCluster(cl) cl <- parallel::makeCluster(2) doParallel::registerDoParallel(cl) # Need to pass the descriptor instead and to reattach bm.desc <- describe(bm) foreach(j = 1:10, .combine = 'c') %dopar% { x <- bigmemory::attach.big.matrix(bm.desc) sum(x[, j]) } ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 # You can directly pass FBMs, the address will be reattached automatically foreach(j = 1:10, .combine = 'c') %dopar% { sum(X[, j]) } ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 parallel::stopCluster(cl) ## C++ accessors Let us compute the column sums of a big.matrix object in Rcpp. // [[Rcpp::depends(BH, bigmemory)]] #include <bigmemory/MatrixAccessor.hpp> #include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] NumericVector colsums_bm(SEXP pBigMat) { XPtr<BigMatrix> xpMat(pBigMat); MatrixAccessor<double> macc(*xpMat); int n = macc.nrow(); int m = macc.ncol(); NumericVector res(m); for (int j = 0; j < m; j++) { for (int i = 0; i < n; i++) { res[j] += macc[j][i]; } } return res; } colsums_bm(bm@address) ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 Now, let us do it for an FBM. // [[Rcpp::depends(bigstatsr, rmio, RcppArmadillo)]] #include <bigstatsr/BMAcc.h> // [[Rcpp::export]] NumericVector colsums_fbm(Environment fbm) { XPtr<FBM> xpMat = fbm["address"]; BMAcc<double> macc(xpMat); int n = macc.nrow(); int m = macc.ncol(); NumericVector res(m); for (int j = 0; j < m; j++) { for (int i = 0; i < n; i++) { res[j] += macc(i, j); } } return res; } colsums_fbm(X) ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 So, the main difference is that {bigmemory} uses macc[j][i] whereas FBM objects use the same accessor in C++ as standard Rcpp matrices, macc(i, j). So, it is easier to adapt existing Rcpp algorithms to be used for FBM objects, e.g. using templates. Note that there is also a sub-FBM accessor, so that you can also use the same algorithms on a subset of the FBM object. For example: // [[Rcpp::depends(bigstatsr, rmio, RcppArmadillo)]] #include <bigstatsr/BMAcc.h> template <class C> NumericVector colsums_fbm_templated(C macc) { int n = macc.nrow(); int m = macc.ncol(); NumericVector res(m); for (int j = 0; j < m; j++) { for (int i = 0; i < n; i++) { res[j] += macc(i, j); } } return res; } // [[Rcpp::export]] NumericVector colsums_matrix(const NumericMatrix& x) { return colsums_fbm_templated(x); } // [[Rcpp::export]] NumericVector colsums_fbm2(Environment fbm) { XPtr<FBM> xpMat = fbm["address"]; BMAcc<double> macc(xpMat); return colsums_fbm_templated(macc); } // [[Rcpp::export]] NumericVector colsums_fbm2_sub(Environment fbm, const IntegerVector& ind_row, const IntegerVector& ind_col) { XPtr<FBM> xpMat = fbm["address"]; SubBMAcc<double> macc(xpMat, ind_row - 1, ind_col - 1); return colsums_fbm_templated(macc); } class(mat <- X[])  ## [1] "matrix" colsums_matrix(mat) ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 colsums_fbm2(X) ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 1.643720 ## [8] 1.171187 2.363379 2.418518 colsums_fbm2_sub(X, rows_along(X), 1:6) ## [1] -3.392500 1.960299 3.498428 -2.621476 -1.756033 2.695954 ## Apply an R function m <- matrix(nrow = 1e5, ncol = 50) m[] <- rnorm(length(m)) m <- as.big.matrix(m) # Brute force solution (if you have enough RAM) system.time( true <- sqrt(rowSums(m[]^2)) ) ## user system elapsed ## 0.050 0.035 0.085 # Using package biganalytics (of the bigmemory family) system.time( test1 <- biganalytics::apply(m, 1, function(x) { sqrt(sum(x^2)) }) ) ## user system elapsed ## 3.313 0.034 3.348 all.equal(test1, true) ## [1] TRUE The {biganalytics} strategy is to make a loop, which is slow because there are a lot of elements to loop through. Package {bigstatsr} uses a trade-off between accessing all the matrix at once and accessing only one column/row at each iteration. You can access blocks of the big matrix and apply efficient vectorized R functions to each block, and then combine the results. m2 <- big_copy(m) # Here, I split the rows, which is NOT the default system.time( test2 <- big_apply(m2, a.FUN = function(X, ind) { sqrt(rowSums(X[ind, , drop = FALSE]^2)) }, a.combine = 'c', ind = rows_along(m2), block.size = 1000) ) ## user system elapsed ## 0.130 0.024 0.154 all.equal(test2, true) ## [1] TRUE # Here, I split the columns (the default) system.time( test3 <- big_apply(m2, a.FUN = function(X, ind) { rowSums(X[, ind, drop = FALSE]^2) }, a.combine = 'plus', block.size = 10) ) ## user system elapsed ## 0.112 0.044 0.155 all.equal(sqrt(test3), true) ## [1] TRUE ## Matrix operations m <- matrix(nrow = 10e3, ncol = 2000) m[] <- rnorm(length(m)) m <- as.big.matrix(m) a <- matrix(rnorm(20 * ncol(m)), ncol(m), 20) system.time( true <- m[] %*% a ) ## user system elapsed ## 0.231 0.076 0.370 library(bigalgebra) system.time( test <- m %*% a ) ## user system elapsed ## 0.070 0.001 0.072 stopifnot(identical(test[], true)) m2 <- big_copy(m) # Function built on top of big_apply system.time( test2 <- big_prodMat(m2, a) ) ## user system elapsed ## 0.191 0.089 0.285 stopifnot(identical(test2, true)) # Standard products (without subsetting or scaling) # are implemented for 'double' FBMs system.time( test3 <- m2 %*% a ) ## user system elapsed ## 0.063 0.000 0.063 stopifnot(identical(test3[], true)) # Additionally, you could use {bigalgebra} if you don't need subsetting system.time( test4 <- m2$bm() %*% a
)
##    user  system elapsed
##   0.144   0.071   0.219
stopifnot(identical(test4[], true))

Making functions (not operators) makes it possible to use subsetting.