class: title-slide center middle inverse

# The R
package {bigstatsr}:<br/>memory- and computation-efficient tools<br/>for big matrices stored on disk <br> ## Florian Privé (@privefl) <br> **Slides:** `https://privefl.github.io/R-presentation/bigstatsr.html` **Installation:** `remotes::install_github("privefl/bigstatsr")` --- class: center middle inverse # Motivation --- ## My thesis work I'm a postdoc in **Predictive Human Genetics**. <br> `$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations} + \cdots}}$$` <img src="https://privefl.github.io/thesis-docs/figures/PRS.png" width="83%" style="display: block; margin: auto;" /> --- ## Very large genotype matrices - previously: 15K x 280K, [celiac disease](https://doi.org/10.1038/ng.543) (~30GB) - currently: 500K x 500K, [UK Biobank](https://doi.org/10.1101/166298) (~2TB) <img src="https://media.giphy.com/media/3o7bueyxGydy48Lwgo/giphy.gif" width="55%" style="display: block; margin: auto;" /> .footnote[But I still want to use
..] --- ## The solution I found <img src="memory-solution.svg" width="90%" style="display: block; margin: auto;" /> .footnote[Format `FBM` is very similar to format `filebacked.big.matrix` from package {bigmemory} (details in [this vignette](https://privefl.github.io/bigstatsr/articles/bigstatsr-and-bigmemory.html)).] --- class: center middle inverse # Simple accessors --- ## Similar accessor as R matrices ```r X <- FBM(2, 5, init = 1:10, backingfile = "test") ``` ```r X$backingfile ``` ``` ## [1] "C:\\Users\\au639593\\Desktop\\R-presentation\\test.bk" ``` ```r X[, 1] ## ok ``` ``` ## [1] 1 2 ``` ```r X[1, ] ## bad ``` ``` ## [1] 1 3 5 7 9 ``` ```r X[] ## super bad ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] 1 3 5 7 9 ## [2,] 2 4 6 8 10 ``` --- ## Similar accessor as R matrices ```r colSums(X[]) ## super bad ``` ``` ## [1] 3 7 11 15 19 ``` </br> <img src="caution.jpg" width="70%" style="display: block; margin: auto;" /> --- ## Split-(par)Apply-Combine Strategy ### Apply standard R functions to big matrices (in parallel) <img src="split-apply-combine.svg" width="95%" style="display: block; margin: auto;" /> .footnote[Implemented in `big_apply()`.] --- ## Similar accessor as Rcpp matrices ```cpp // [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(bigstatsr, rmio)]] #include <bigstatsr/BMAcc.h> // [[Rcpp::export]] NumericVector big_colsums(Environment BM) { XPtr<FBM> xpBM = BM["address"]; BMAcc<double> macc(xpBM); * size_t n = macc.nrow(); * size_t m = macc.ncol(); NumericVector res(m); for (size_t j = 0; j < m; j++) for (size_t i = 0; i < n; i++) * res[j] += macc(i, j); return res; } ``` --- class: center middle inverse # Some examples # from my work --- ## Partial Singular Value Decomposition 15K `\(\times\)` 100K -- 10 first PCs -- 6 cores -- **1 min** (vs 2h in base R) </br> <img src="PC1-4.png" width="90%" style="display: block; margin: auto;" /> .footnote[Implemented in `big_randomSVD()`, powered by R packages {RSpectra} and {Rcpp}.] --- ## Sparse linear models ### Predicting complex diseases with a penalized logistic regression 15K `\(\times\)` 280K -- 6 cores -- **2 min** (10x faster than {glmnet}) Automatic (parallel) grid-search for the two hyper-parameters of elastic-net. 
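As an illustration, a call could look like the minimal sketch below (not run in these slides). The objects `y01` (a 0/1 disease status), `ind.train`/`ind.test` and the matrix of PCs `U` used as covariates are assumptions here; `big_spLogReg()` is the penalized logistic regression counterpart of `big_spLinReg()` shown later.

```r
# Minimal sketch, assuming an FBM `X`, a binary outcome `y01`,
# train/test index vectors, and PCs `U` used as covariates
train <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
                      covar.train = U[ind.train, ],
                      alphas = c(1, 0.1, 0.01), ncores = nb_cores())
pred <- predict(train, X = X, ind.row = ind.test,
                covar.row = U[ind.test, ])
AUC(pred, y01[ind.test])  # prediction accuracy on the test set
```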
<img src="density-scores.svg" width="70%" style="display: block; margin: auto;" /> --- class: center middle inverse # Let us try # some functions --- ## Create an FBM object ```r X <- FBM(10e3, 1000, backingfile = "test2") object.size(X) ``` ``` ## 680 bytes ``` ```r file.size(X$backingfile) ## 8 x 1e4 x 1e3 ``` ``` ## [1] 8e+07 ``` ```r typeof(X) ``` ``` ## [1] "double" ``` --- ## Fill it with random values <img src="split-apply-combine.svg" width="65%" style="display: block; margin: auto;" /> ```r big_apply(X, a.FUN = function(X, ind) { X[, ind] <- rnorm(nrow(X) * length(ind)) NULL ## Here, you don't want to return anything }, a.combine = 'c') ``` ``` ## NULL ``` ```r X[1:5, 1] ``` ``` ## [1] 1.6308495 0.8228180 -0.3483567 0.5665315 0.4108701 ``` --- ## Correlation matrix ```r mat <- X[] system.time(corr1 <- cor(mat)) ``` ``` ## user system elapsed ## 5.37 0.00 5.39 ``` ```r system.time(corr2 <- big_cor(X)) ``` ``` ## user system elapsed ## 5.41 0.02 5.42 ``` ```r all.equal(corr1, corr2[]) ``` ``` ## [1] TRUE ``` --- ## Partial Singular Value Decomposition ```r system.time(svd1 <- svd(scale(mat), nu = 10, nv = 10)) ``` ``` ## user system elapsed ## 22.31 0.14 22.53 ``` ```r # Quadratic in the smallest dimension, linear in the other one system.time(svd2 <- big_SVD(X, fun.scaling = big_scale(), k = 10)) ``` ``` ## user system elapsed ## 6.45 0.07 6.55 ``` ```r # Linear in both dimensions # Extremely useful if both dimensions are very large system.time(svd3 <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)) ``` ``` ## user system elapsed ## 1.86 0.00 1.86 ``` --- ## Multiple association ```r M <- 100 # number of causal variables set <- sample(ncol(X), M) y <- scale(X[, set]) %*% rnorm(M) y <- y + rnorm(length(y), sd = 2 * sd(y)) mult_test <- big_univLinReg(X, y, covar.train = svd2$u) plot(mult_test) ``` <img src="bigstatsr_files/figure-html/unnamed-chunk-22-1.svg" width="63%" style="display: block; margin: auto;" /> --- ## Multiple association ```r library(ggplot2) plot(mult_test, type = "Manhattan") + aes(color = cols_along(X) %in% set) + labs(color = "Causal?") ``` <img src="bigstatsr_files/figure-html/unnamed-chunk-23-1.svg" width="85%" style="display: block; margin: auto;" /> --- ## Prediction ```r # Split the indices in train/test sets ind.train <- sort(sample(nrow(X), size = 0.8 * nrow(X))) ind.test <- setdiff(rows_along(X), ind.train) # Train a linear model with elastic-net regularization # and automatic choice of hyper-parameter lambda train <- big_spLinReg(X, y[ind.train], ind.train = ind.train, covar.train = svd2$u[ind.train, ], alphas = c(1, 0.1, 0.01)) ``` ```r # Get predictions for the test set pred <- predict(train, X = X, ind.row = ind.test, covar.row = svd2$u[ind.test, ]) ``` --- ## Prediction ```r # Plot true value vs prediction qplot(pred, y[ind.test]) + geom_abline(intercept = 0, slope = 1, color = "red") + theme_bigstatsr() ``` <img src="bigstatsr_files/figure-html/unnamed-chunk-26-1.svg" width="60%" style="display: block; margin: auto;" /> --- class: center middle inverse # Toy case: ## Compute the sum for each column --- ## Brute force solution ```r sums1 <- colSums(X[]) ## /!\ access all the data in memory ``` </br> <img src="caution.jpg" width="70%" style="display: block; margin: auto;" /> --- ## Do it by blocks ```r sums2 <- big_apply(X, a.FUN = function(X, ind) colSums(X[, ind]), a.combine = 'c') all.equal(sums2, sums1) ``` ``` ## [1] TRUE ``` <br> <img src="split-apply-combine.svg" width="80%" style="display: block; margin: auto;" /> --- ## Using Rcpp (1/3) 
```cpp // [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(bigstatsr, rmio)]] #include <bigstatsr/BMAcc.h> // [[Rcpp::export]] NumericVector bigcolsums(Environment BM) { XPtr<FBM> xpBM = BM["address"]; // get the external pointer BMAcc<double> macc(xpBM); // create an accessor to the data size_t i, j, n = macc.nrow(), m = macc.ncol(); NumericVector res(m); // vector of m zeros for (j = 0; j < m; j++) for (i = 0; i < n; i++) res[j] += macc(i, j); return res; } ``` --- ## Using Rcpp (1/3) ```r sums3 <- bigcolsums(X) all.equal(sums3, sums1) ``` ``` ## [1] TRUE ``` --- ## Using Rcpp (2/3): the bigstatsr way ```cpp // [[Rcpp::plugins(cpp11)]] // [[Rcpp::depends(bigstatsr, rmio)]] #include <bigstatsr/BMAcc.h> // [[Rcpp::export]] NumericVector bigcolsums2(Environment BM, * const IntegerVector& rowInd, * const IntegerVector& colInd) { XPtr<FBM> xpBM = BM["address"]; * SubBMAcc<double> macc(xpBM, rowInd, colInd, 1); size_t i, j, n = macc.nrow(), m = macc.ncol(); NumericVector res(m); // vector of m zeros for (j = 0; j < m; j++) for (i = 0; i < n; i++) * res[j] += macc(i, j); return res; } ``` --- ## Using Rcpp (2/3): the bigstatsr way ```r sums4 <- bigcolsums2(X, rows_along(mat), cols_along(mat)) all.equal(sums4, sums1) ``` ``` ## [1] TRUE ``` ```r sums5 <- bigcolsums2(X, rows_along(mat), 1:10) all.equal(sums5, sums1[1:10]) ``` ``` ## [1] TRUE ``` --- ## Using Rcpp (3/3): already implemented ```r sums6 <- big_colstats(X) str(sums6) ``` ``` ## 'data.frame': 1000 obs. of 2 variables: ## $ sum: num 58.2 22.4 101.3 176.7 120.2 ... ## $ var: num 0.993 0.998 0.986 1.024 1.001 ... ``` ```r all.equal(sums6$sum, sums1) ``` ``` ## [1] TRUE ``` --- class: center middle inverse # Parallelism --- ## Most of the functions are parallelized ```r ind.rep <- rep(cols_along(X), each = 100) ## size: 100,000 (NCORES <- nb_cores()) ``` ``` ## [1] 4 ``` ```r system.time( mult_test2 <- big_univLinReg(X, y, covar.train = svd2$u, ind.col = ind.rep) ) ``` ``` ## user system elapsed ## 13.44 0.00 13.45 ``` ```r system.time( mult_test3 <- big_univLinReg(X, y, covar.train = svd2$u, ind.col = ind.rep, ncores = NCORES) ) ``` ``` ## user system elapsed ## 0.02 0.08 7.66 ``` --- ## Parallelize your own functions ```r system.time( mult_test4 <- big_parallelize( X, p.FUN = function(X, ind, y, covar) { bigstatsr::big_univLinReg(X, y, covar.train = covar, ind.col = ind) }, p.combine = "rbind", ind = ind.rep, ncores = NCORES, y = y, covar = svd2$u) ) ``` ``` ## user system elapsed ## 0.05 0.03 7.20 ``` ```r all.equal(mult_test4, mult_test3) ``` ``` ## [1] TRUE ``` --- class: center middle inverse # Conclusion --- class: inverse, center, middle # I'm able to run algorithms # on 100GB of data # in
R on my computer

---

## Advantages of using FBM objects

<br>

- you can apply algorithms on **data larger than your RAM**,

- you can easily **parallelize** your algorithms because the data on disk is shared,

- you write **more efficient algorithms** (you make fewer copies and think more about what you're doing),

- you can use **different types of data**, for example, in my field, I'm storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See [the documentation of the FBM class](https://privefl.github.io/bigstatsr/reference/FBM-class.html) for details.

---

## Check the publications for details

<br>

<a href="https://doi.org/10.1093/bioinformatics/bty185" target="_blank">
<img src="bty185.png" width="80%" style="display: block; margin: auto;" />
</a>

<br>

<a href="https://doi.org/10.1534/genetics.119.302019" target="_blank">
<img src="paper2.png" width="80%" style="display: block; margin: auto;" />
</a>

---

## Contributions are welcome!

<img src="cat-help.jpg" width="80%" style="display: block; margin: auto;" />

---

## Make sure to grab a hex sticker

<br>

<img src="https://raw.githubusercontent.com/privefl/bigstatsr/master/bigstatsr.png" width="50%" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Thanks!

<br/><br/>

Presentation: https://privefl.github.io/R-presentation/bigstatsr.html

Package's website: https://privefl.github.io/bigstatsr/

DOIs: [10.1093/bioinformatics/bty185](https://doi.org/10.1093/bioinformatics/bty185) and [10.1534/genetics.119.302019](https://doi.org/10.1534/genetics.119.302019)

<br/>
[privefl](https://twitter.com/privefl)
[privefl](https://github.com/privefl)
[F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9) .footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]