vignettes/bigstatsr-and-bigmemory.Rmd
bigstatsr-and-bigmemory.Rmd
I’m a fan of package {bigmemory}. It’s by far the most convenient solution I found for analyzing large genomic data in R on my computer. I’ve started using it in early 2016 and have also contributed some features.
At first, package {bigstatsr} was using the big.matrix
objects of package {bigmemory}. Yet, at some point, I felt the need to
become independent of package {bigmemory}. As package {bigstatsr} was a
central tool of all my thesis work, I needed to add whatever feature I
wanted whenever I wanted to. Thus, I reimplemented an object very
similar to the filebacked big.matrix
object, called “FBM”
(Filebacked Big Matrix, very original) in package {bigstatsr}. These two
formats are so similar that you can easily convert (without copying the
data) between the two objects.
In this vignette, I explain the main differences between my package {bigstatsr} and the packages of the bigmemory family.
Package {bigmemory} provides 3 types of big.matrix
objects:
big.matrix
, which is not shared between
processes and use directly random access memory,big.matrix
, which uses some shared memory
(still a mystery for me),big.matrix
(so, shared between processes),
which stores the data on disk and access it via memory-mapping.I placed a lot of interest for shared matrices (filebacked or not).
Yet, I encountered memory limitations with the shared
big.matrix
(non-filebacked). So, at some point, I was using
only filebacked big.matrix
objects. So, in {bigstatsr}, you
will found only the FBM
format, which is very similar to
the filebacked big.matrix
format. To prove it, let us
convert from one to the other (without copying the backingfile).
## Warning: package 'bigmemory' was built under R version 4.2.3
# Create a temporary FBM
fbm <- FBM(10, 10)
fbm$backingfile
## [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpUdr0fH\\file54fc15d93a92.bk"
fbm[]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0
## [1] "C:/Users/au639593/AppData/Local/Temp/RtmpUdr0fH/file54fc15d93a92.bk"
# Changing values of one changes the value of the other
bm[]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 0 0 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0
## [7,] 0 0 0 0 0 0 0 0 0 0
## [8,] 0 0 0 0 0 0 0 0 0 0
## [9,] 0 0 0 0 0 0 0 0 0 0
## [10,] 0 0 0 0 0 0 0 0 0 0
bm[1, 1] <- 2
fbm[1, 1]
## [1] 2
BM2FBM <- function(bm) {
FBM(nrow = nrow(bm), ncol = ncol(bm), type = typeof(bm),
backingfile = file.path(dir.name(bm), sub_bk(file.name(bm))),
create_bk = FALSE)
}
# Convert the filebacked big.matrix to a FBM
fbm2 <- BM2FBM(bm)
bm[, 3] <- 1
fbm2[]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 0 1 0 0 0 0 0 0 0
## [2,] 0 0 1 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 1 0 0 0 0 0 0 0
## [5,] 0 0 1 0 0 0 0 0 0 0
## [6,] 0 0 1 0 0 0 0 0 0 0
## [7,] 0 0 1 0 0 0 0 0 0 0
## [8,] 0 0 1 0 0 0 0 0 0 0
## [9,] 0 0 1 0 0 0 0 0 0 0
## [10,] 0 0 1 0 0 0 0 0 0 0
Package {bigmemory} handles many types:
For now, package {bigstatsr} handles the following types:
Additionally, the unsigned char type is used in the
FBM.code256
format, which instead of accessing integer
values ranging from 0 to 255, it uses some code to access 256 arbitrary
different values. I use this format in my other R package {bigsnpr} to
store genotype dosages.
A big.matrix
is basically an S4 class object that stores
a pointer to a C++ object (an external pointer). When you restart your R
session, this pointer becomes Nil
and it may make your R
session crash. You need to use a different object, a
big.matrix.descriptor
(using describe()
) which
stores enough information to make it possible to create this external
pointer again (using attach.big.matrix()
). Therefore, one
has to often switch between descriptors and big.matrix
objects.
For FBMs, I use the nice idea of package bigmemoryExtras.
Basically, I use a Reference Class (RC) object with active binding. In
this object, I store the external pointer and the information
needed to create the pointer to the C++ object. The active binding makes
this automatic so that the user never need to use
attach.big.matrix()
or describe()
anymore (and
no more session crash!).
What this also means is that you can now serialize a FBM (for
example, saving it in an rds file with saveRDS()
or using
it in a parallel algorithm). For instance, with a standard
big.matrix
object, you’ll need to pass the descriptor
object in parallel algorithms:
library(foreach)
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
# Won't work because bm will be Nil when copied to the cluster
foreach(j = 1:10, .combine = 'c') %dopar% {
sum(bm[, j])
}
## Error in unserialize(socklist[[n]]): error reading from connection
parallel::stopCluster(cl)
## Error in serialize(data, node$con): error writing to connection
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
# Need to pass the descriptor instead and to reattach
bm.desc <- describe(bm)
foreach(j = 1:10, .combine = 'c') %dopar% {
x <- bigmemory::attach.big.matrix(bm.desc)
sum(x[, j])
}
## [1] -4.0808142 -0.6354808 -3.0578644 -2.7322787 1.4110698 -1.4794974
## [7] 0.9597882 -1.7836563 1.3499387 8.0163466
# You can directly pass FBMs, the address will be reattached automatically
In contrast, you can directly use FBMs in parallel algorithms, as the address will be reattached automatically thanks to the active bindings.
## [1] -4.0808142 -0.6354808 -3.0578644 -2.7322787 1.4110698 -1.4794974
## [7] 0.9597882 -1.7836563 1.3499387 8.0163466
parallel::stopCluster(cl)
Let us compute the column sums of a big.matrix
object in
Rcpp.
// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector colsums_bm(SEXP pBigMat) {
XPtr<BigMatrix> xpMat(pBigMat);
MatrixAccessor<double> macc(*xpMat);
int n = macc.nrow();
int m = macc.ncol();
NumericVector res(m);
for (int j = 0; j < m; j++) {
for (int i = 0; i < n; i++) {
res[j] += macc[j][i];
}
}
return res;
}
colsums_bm(bm@address)
## [1] -4.0808142 -0.6354808 -3.0578644 -2.7322787 1.4110698 -1.4794974
## [7] 0.9597882 -1.7836563 1.3499387 8.0163466
Now, let us do it for an FBM
.
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>
// [[Rcpp::export]]
NumericVector colsums_fbm(Environment fbm) {
XPtr<FBM> xpMat = fbm["address"];
BMAcc<double> macc(xpMat);
int n = macc.nrow();
int m = macc.ncol();
NumericVector res(m);
for (int j = 0; j < m; j++)
for (int i = 0; i < n; i++)
res[j] += macc(i, j);
return res;
}
colsums_fbm(X)
## [1] -4.0808142 -0.6354808 -3.0578644 -2.7322787 1.4110698 -1.4794974
## [7] 0.9597882 -1.7836563 1.3499387 8.0163466
So, the main difference is that {bigmemory} uses
macc[j][i]
whereas FBM
objects use the same
accessor in C++ as standard Rcpp matrices, macc(i, j)
. So,
it is easier to adapt existing Rcpp algorithms to be used for
FBM
objects, e.g. using templates. Note that there is also
a sub-FBM accessor, so that you can also use the same algorithms on a
subset of the FBM object. For example:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>
// [[Rcpp::export]]
NumericVector colsums_fbm_sub(Environment fbm,
const IntegerVector& ind_row,
const IntegerVector& ind_col) {
XPtr<FBM> xpMat = fbm["address"];
SubBMAcc<double> macc(xpMat, ind_row, ind_col, 1);
int n = macc.nrow();
int m = macc.ncol();
NumericVector res(m);
for (int j = 0; j < m; j++)
for (int i = 0; i < n; i++)
res[j] += macc(i, j);
return res;
}
colsums_fbm_sub(X, rows_along(X), 1:6)
## [1] -4.0808142 -0.6354808 -3.0578644 -2.7322787 1.4110698 -1.4794974
m <- matrix(nrow = 1e5, ncol = 50)
m[] <- rnorm(length(m))
m <- as.big.matrix(m)
# Brute force solution (if you have enough RAM)
system.time(
true <- sqrt(rowSums(m[]^2))
)
## user system elapsed
## 0.03 0.00 0.03
# Using package biganalytics (of the bigmemory family)
system.time(
test1 <- biganalytics::apply(m, 1, function(x) {
sqrt(sum(x^2))
})
)
## user system elapsed
## 2.09 0.12 2.34
all.equal(test1, true)
## [1] TRUE
The {biganalytics} strategy is to make a loop, which is slow because there are a lot of elements to loop through.
Package {bigstatsr} uses a trade-off between accessing all the matrix at once and accessing only one column/row at each iteration. You can access blocks of the big matrix and apply efficient vectorized R functions to each block, and then combine the results.
m2 <- big_copy(m)
# Here, I split the rows, which is NOT the default
system.time(
test2 <- big_apply(m2, a.FUN = function(X, ind) {
sqrt(rowSums(X[ind, , drop = FALSE]^2))
}, a.combine = 'c', ind = rows_along(m2), block.size = 1000)
)
## user system elapsed
## 0.06 0.03 0.09
all.equal(test2, true)
## [1] TRUE
# Here, I split the columns (the default)
system.time(
test3 <- big_apply(m2, a.FUN = function(X, ind) {
rowSums(X[, ind, drop = FALSE]^2)
}, a.combine = 'plus', block.size = 10)
)
## user system elapsed
## 0.04 0.00 0.05
## [1] TRUE
m <- matrix(nrow = 10e3, ncol = 2000)
m[] <- rnorm(length(m))
m <- as.big.matrix(m)
a <- matrix(rnorm(20 * ncol(m)), ncol(m), 20)
system.time(
true <- m[] %*% a
)
## user system elapsed
## 0.33 0.00 0.34
## Warning: package 'bigalgebra' was built under R version 4.2.3
system.time(
test <- m %*% a
)
## user system elapsed
## 0.33 0.00 0.31
identical(test[], true)
## [1] TRUE
m2 <- big_copy(m)
# Function that allows subsetting and scaling of the matrix
system.time(
test2 <- big_prodMat(m2, a)
)
## user system elapsed
## 0.41 0.15 0.64
identical(test2, true)
## [1] TRUE
# Standard products (without subsetting or scaling)
# are implemented for 'double' FBMs
system.time(
test3 <- m2 %*% a
)
## user system elapsed
## 0.31 0.00 0.31
identical(test3[], true)
## [1] TRUE
# Additionally, you could use {bigalgebra} if you don't need subsetting
system.time(
test4 <- m2$bm() %*% a
)
## user system elapsed
## 0.30 0.05 0.35
identical(test4[], true)
## [1] TRUE
Making functions (not operators) makes it possible to use subsetting and scaling.