Package bigstatsr: Statistics with matrices on disk (useR 2017)

Written on July 21, 2017

Edit: I’ve edited this post (August 29, 2017) to reflect the major change in package bigstatsr: it doesn’t depend on the bigmemory package anymore (for multiple reasons with the first one being able to put this package on CRAN). So, the Filebacked Big Matrices (FBMs) mentioned hereinafter aren’t big.matrix objects anymore, but something very similar to filebacked big.matrix objects.

In this post, I will talk about my package bigstatsr, which I’ve just presented in a lightning talk of 5 minutes at useR!2017. You can listen to me in action there. I should have chosen a longer talk to explain more about this package, maybe next time. I will use this post to give you a more detailed version of the talk I gave in Brussels.

Motivation behind bigstatsr

I’m a PhD student in predictive human genetics. I’m basically trying to predict someone’s risk of disease based on their DNA mutations. These DNA mutations are in the form of large matrices so that I’m currently working with a matrix of 15K rows and 300K columns. This matrix would take approximately 32GB of RAM if stored as a standard R matrix.

When I began studying this dataset, I had only 8GB of RAM on my computer. I now have 64GB of RAM but it would take only copying this matrix once to make my computer begin swapping and therefore slowing down. I found a convenient solution by using the object big.matrix provided by the R package bigmemory (Kane, Emerson, and Weston 2013). With this solution, you can access a matrix that is stored on disk almost as if it were a standard R matrix in memory.

Introduction to Filebacked Big Matrices (FBMs)

# loading package bigstatsr
library(bigstatsr)
# initializing some matrix on disk
mat <- FBM(5e3, 10e3)
class(mat)

## [1] "FBM"
## attr(,"package")
## [1] "bigstatsr"

dim(mat)

## [1]  5000 10000

mat$backingfile

## [1] "/tmp/RtmpQ6hv3K/file7dfdeb9bed2.bk"

mat[1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[1, 1] <- 2
mat[1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

mat[2:4] <- 3
mat[1:5, 1:5]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    0    0    0    0
## [2,]    3    0    0    0    0
## [3,]    3    0    0    0    0
## [4,]    3    0    0    0    0
## [5,]    0    0    0    0    0

mat[, 2:3] <- rnorm(2 * nrow(mat))
mat[1:5, 1:5]

##      [,1]        [,2]       [,3] [,4] [,5]
## [1,]    2 -0.13852423 -3.2909757    0    0
## [2,]    3  2.43311765  0.6808002    0    0
## [3,]    3 -0.32046739 -1.2042780    0    0
## [4,]    3  0.01934478  0.1682615    0    0
## [5,]    0  0.46431363  0.9199029    0    0

What we can see is that FBMs can be accessed (read/write) almost as if they were standard R matrices, but you have to be cautious. For example, doing mat[1, ] isn’t recommended. Indeed, FBMs, as standard R matrices, are stored by column so that it is in fact a big vector with columns stored one after the other, contiguously. So, accessing the first row would access elements that are not stored contiguously in memory, which is slow. One should always access columns rather than rows.

Apply an R function to a FBM

An easy strategy to apply an R function to a FBM would be the split-apply-combine strategy (Wickham 2011). For example, you could access only a block of columns at a time, apply a (vectorized) function to this block, and then combine the results of all blocks. This is implemented in function big_apply().

# Compute the sums of the first 1000 columns
colsums_1 <- colSums(mat[, 1:1000])
# Compute the sums of the second block of 1000 columns
colsums_2 <- colSums(mat[, 1001:2000])
# Combine the results
colsums_1_2 <- c(colsums_1, colsums_2)
# Do this automatically with big_apply()
colsums_all <- big_apply(mat, a.FUN = function(X, ind) colSums(X[, ind]), 
                         a.combine = 'c')

When the split-apply-combine strategy can be used for a given function, you could use big_apply() to get the results, while accessing only small blocks of columns (or rows) at a time. You can find more examples of applications for big_apply() there.

Use Rcpp with a FBM

Using Rcpp with a FBM is super easy. Let’s use the previous example, i.e. the computation of the colsums of a FBM. We will do it in 3 different ways.

1. Using the simple accessor

Note: A FBM is a reference class, which is also an environment.

// [[Rcpp::depends(bigstatsr, BH)]]
#include <bigstatsr/BMAcc.h>
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector bigcolsums(Environment BM) {
  
  XPtr<FBM> xpBM = BM["address"];
  BMAcc<double> macc(xpBM);
  
  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;

  for (j = 0; j < m; j++) 
    for (i = 0; i < n; i++) 
      res[j] += macc(i, j);

  return res;
}

colsums_all3 <- bigcolsums(mat)
all.equal(colsums_all3, colsums_all)

## [1] TRUE

2. Using the bigstatsr way

// [[Rcpp::depends(bigstatsr, BH)]]
#include <bigstatsr/BMAcc.h>
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector bigcolsums2(Environment BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {
  
  XPtr<FBM> xpBM = BM["address"];
  // C++ indices begin at 0
  // An accessor of only part of the big.matrix
  SubBMAcc<double> macc(xpBM, rowInd - 1, colInd - 1);
  
  int n = macc.nrow();
  int m = macc.ncol();

  NumericVector res(m); // vector of m zeros
  int i, j;

  for (j = 0; j < m; j++) 
    for (i = 0; i < n; i++) 
      res[j] += macc(i, j);

  return res;
}

colsums_all4 <- bigcolsums2(mat, rows_along(mat), cols_along(mat))
all.equal(colsums_all4, colsums_all)

## [1] TRUE

In bigstatsr, most of the functions have parameters for subsetting rows and columns because it is often useful.

3. Use an already implemented function

str(colsums_all5 <- big_colstats(mat))

## 'data.frame':    10000 obs. of  2 variables:
##  $ sum: num  11 150.3 75.7 0 0 ...
##  $ var: num  0.0062 1.0047 1.0099 0 0 ...

all.equal(colsums_all5$sum, colsums_all)

## [1] TRUE

Principal Component Analysis

Let’s begin by filling the matrix with random numbers in a tricky way.

U <- sweep(matrix(rnorm(nrow(mat) * 10), ncol = 10), 2, 1:10, "/")
V <- matrix(rnorm(ncol(mat) * 10), ncol = 10)
big_apply(mat, a.FUN = function(X, ind) {
  X[, ind] <- tcrossprod(U, V[ind, ]) + rnorm(nrow(X) * length(ind))
  NULL
}, a.combine = 'c')

## NULL

Let’s say we want the first 10 PCs of the (scaled) matrix.

system.time(
  small_svd <- svd(scale(mat[, 1:2000]), nu = 10, nv = 10)
)

##    user  system elapsed 
##   8.175   0.195   8.368

system.time(
  small_svd2 <- big_SVD(mat, big_scale(), ind.col = 1:2000)
)

## (2)

##    user  system elapsed 
##   1.718   0.195   1.912

plot(small_svd2$u, small_svd$u)

system.time(
  small_svd3 <- big_randomSVD(mat, big_scale(), ind.col = 1:2000)
)

##    user  system elapsed 
##   0.368   0.000   0.368

plot(small_svd3$u, small_svd$u)

system.time(
  svd_all <- big_randomSVD(mat, big_scale())
)

##    user  system elapsed 
##   1.810   0.056   1.862

plot(svd_all)

Function big_randomSVD() uses Rcpp and package Rpsectra to implement a fast Singular Value Decomposition for a FBM that is linear in all dimensions (standard PCA algorithm is quadratic in the smallest dimension) which makes it very fast even for large datasets (that have both dimensions that are large).

Some linear models

M <- 100 # number of causal variables
set <- sample(ncol(mat), M)
y <- mat[, set] %*% rnorm(M)
y <- y + rnorm(length(y), sd = 2 * sd(y))

ind.train <- sort(sample(nrow(mat), size = 0.8 * nrow(mat)))
ind.test <- setdiff(rows_along(mat), ind.train)

mult_test <- big_univLinReg(mat, y[ind.train], ind.train = ind.train, 
                            covar.train = svd_all$u[ind.train, ])

library(ggplot2)
plot(mult_test) + 
  aes(color = cols_along(mat) %in% set) +
  labs(color = "Causal?")

train <- big_spLinReg(mat, y[ind.train], ind.train = ind.train, 
                      covar.train = svd_all$u[ind.train, ],
                      alpha = 0.5)
pred <- predict(train, X = mat, ind.row = ind.test, covar.row = svd_all$u[ind.test, ])
plot(apply(pred, 2, cor, y = y[ind.test]))

The functions big_spLinReg(), big_spLogReg() and big_spSVM() all use lasso (L1) or elastic-net (L1 & L2) regularizations in order to limit the number of predictors and to accelerate computations thanks to strong rules (R. Tibshirani et al. 2012). The implementation of these functions are based on modifications from packages sparseSVM and biglasso (Zeng and Breheny 2017). Yet, these models give predictions for a range of 100 different regularization parameters whereas we are only interested in one prediction.

So, that’s why I came up with the idea of Cross-Model Selection and Averaging (CMSA), which principle is:

This function separates the training set in K folds (e.g. 10).
In turn,
- each fold is considered as an inner validation set and the others (K - 1) folds form an inner training set,
- the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
- the vector of scores which maximizes feval is determined,
- the vector of coefficients corresponding to the previous vector of scores is chosen.
The K resulting vectors of coefficients are then combined into one vector.

train2 <- big_CMSA(big_spLinReg, feval = function(pred, target) cor(pred, target), 
                   X = mat, y.train = y[ind.train], ind.train = ind.train, 
                      covar.train = svd_all$u[ind.train, ],
                      alpha = 0.5, ncores = nb_cores())
mean(train2 != 0) # percentage of predictors

## [1] 0.1285714

pred2 <- predict(train2, X = mat, ind.row = ind.test, covar.row = svd_all$u[ind.test, ])
cor(pred2, y[ind.test])

## [1] 0.3353398

Some matrix computations

For example, let’s compute the correlation of the first 2000 columns.

system.time(
  corr <- cor(mat[, 1:2000])
)

##    user  system elapsed 
##  11.015   0.006  11.018

system.time(
  corr2 <- big_cor(mat, ind.col = 1:2000)
)

##    user  system elapsed 
##   0.539   0.022   0.561

class(corr2)

## [1] "FBM"
## attr(,"package")
## [1] "bigstatsr"

all.equal(corr2[], corr)

## [1] TRUE

Advantages of using FBM objects

you can apply algorithms on 100GB of data,
you can easily parallelize your algorithms because the data on disk is shared,
you write more efficient algorithms,
you can use different types of data, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See the documentation fo the FBM class for details.

In a next post, I’ll try to talk about good practices on how to use parallelism in R.

References

Kane, Michael J, John W Emerson, and Stephen Weston. 2013. “Scalable Strategies for Computing with Massive Data.” Journal of Statistical Software 55 (14): 1–19. doi:10.18637/jss.v055.i14.

Tibshirani, Robert, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J Tibshirani. 2012. “Strong rules for discarding predictors in lasso-type problems.” Journal of the Royal Statistical Society. Series B, Statistical Methodology 74 (2): 245–66. doi:10.1111/j.1467-9868.2011.01004.x.

Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software.” Journal of Statistical Software 40 (1): 1–29. doi:10.1039/np9971400083.

Zeng, Yaohui, and Patrick Breheny. 2017. “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R,” January. http://arxiv.org/abs/1701.05936.