Chapter 3 Working with an FBM
3.1 Similar accessor as R matrices
#> [1] "C:\\Users\\au639593\\OneDrive - Aarhus universitet\\Desktop\\bigsnpr-extdoc\\test.bk"
You can access the whole FBM as an R matrix in memory using X[].
However, if the matrix is too large to fit in memory, you should always access only a subset of columns.
Note that the elements of an FBM are stored column-wise (as in a standard R matrix). Therefore, avoid accessing a subset of rows: doing so reads non-contiguous elements scattered across the whole matrix on disk.
#> [1] 1 2
#> [1] 1 3 5 7 9
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    3    5    7    9
#> [2,]    2    4    6    8   10
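The cost difference between column and row access follows from the column-wise layout described above. The following standalone C++ sketch is an illustration only, not bigstatsr code: the helper names `offset`, `column_offsets` and `row_offsets` are hypothetical, and indices are zero-based, unlike in R. It computes which positions of the underlying buffer are touched in each case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of element (i, j) in a column-major buffer with `nrow` rows
// (zero-based indices; hypothetical helper, not part of bigstatsr).
std::size_t offset(std::size_t i, std::size_t j, std::size_t nrow) {
  assert(i < nrow);
  return i + j * nrow;
}

// Positions read when accessing column j: `nrow` consecutive elements.
std::vector<std::size_t> column_offsets(std::size_t j, std::size_t nrow) {
  std::vector<std::size_t> out(nrow);
  for (std::size_t i = 0; i < nrow; i++) out[i] = offset(i, j, nrow);
  return out;
}

// Positions read when accessing row i: one element per column,
// each `nrow` positions away from the previous one.
std::vector<std::size_t> row_offsets(std::size_t i, std::size_t nrow,
                                     std::size_t ncol) {
  std::vector<std::size_t> out(ncol);
  for (std::size_t j = 0; j < ncol; j++) out[j] = offset(i, j, nrow);
  return out;
}
```

For the 2 x 5 matrix shown above, `column_offsets(0, 2)` gives positions 0 and 1 (one contiguous read), whereas `row_offsets(0, 2, 5)` gives positions 0, 2, 4, 6 and 8, so reading a row jumps across the whole buffer. With many rows, the gap between consecutive reads grows accordingly.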
3.2 Split-(par)Apply-Combine Strategy
#> [1] 3 7 11 15 19
This strategy shows how to apply standard R functions to big matrices (possibly in parallel); it is implemented in big_apply().
Learn more with this tutorial on big_apply().
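To make the three steps of the strategy concrete, here is a minimal standalone C++ sketch. It is a sequential illustration on an in-memory column-major buffer, not how big_apply() is implemented; the function name `blocked_colsums` and the parameter `block_size` are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Column sums of a column-major matrix, computed one block of columns
// at a time. `block_size` (hypothetical) is the number of columns per block.
std::vector<double> blocked_colsums(const std::vector<double>& data,
                                    std::size_t nrow, std::size_t ncol,
                                    std::size_t block_size) {
  assert(block_size > 0 && data.size() == nrow * ncol);
  std::vector<double> combined;
  combined.reserve(ncol);

  // Split: walk over consecutive blocks of columns.
  for (std::size_t start = 0; start < ncol; start += block_size) {
    std::size_t end = std::min(start + block_size, ncol);

    // Apply: compute the result for the columns of this block only.
    std::vector<double> partial(end - start, 0.0);
    for (std::size_t j = start; j < end; j++)
      for (std::size_t i = 0; i < nrow; i++)
        partial[j - start] += data[i + j * nrow];

    // Combine: append the block's result to the overall result.
    combined.insert(combined.end(), partial.begin(), partial.end());
  }
  return combined;
}
```

Only one block of columns needs to be held in memory at a time, and the blocks are independent of each other, which is what makes it possible to process them in parallel. On the 2 x 5 example matrix (values 1 to 10 stored column-wise), any block size yields the column sums 3, 7, 11, 15, 19.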
Compute the sum of each column of X <- big_attachExtdata() using big_apply().
3.3 Similar accessor as Rcpp matrices
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>

// [[Rcpp::export]]
NumericVector bigcolsums(Environment BM) {

  XPtr<FBM> xpBM = BM["address"];  // get the external pointer
  BMAcc<double> macc(xpBM);        // create an accessor to the data

  size_t n = macc.nrow();          // similar code as for an Rcpp::NumericMatrix
  size_t m = macc.ncol();          // similar code as for an Rcpp::NumericMatrix

  NumericVector res(m);

  for (size_t j = 0; j < m; j++)
    for (size_t i = 0; i < n; i++)
      res[j] += macc(i, j);        // similar code as for an Rcpp::NumericMatrix

  return res;
}
For a subset of the data:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>

// [[Rcpp::export]]
NumericVector bigcolsums2(Environment BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {

  XPtr<FBM> xpBM = BM["address"];
  // accessor to a sub-view of the data -> the only line of code that should change
  // (the last argument, 1, is subtracted from the R indices to make them 0-based)
  SubBMAcc<double> macc(xpBM, rowInd, colInd, 1);

  size_t n = macc.nrow();
  size_t m = macc.ncol();

  NumericVector res(m);

  for (size_t j = 0; j < m; j++)
    for (size_t i = 0; i < n; i++)
      res[j] += macc(i, j);

  return res;
}
3.4 Some summary functions are already implemented
#>   sum var
#> 1   3 0.5
#> 2   7 0.5
#> 3  11 0.5
#> 4  15 0.5
#> 5  19 0.5

#>   center     scale
#> 1    1.5 0.7071068
#> 2    3.5 0.7071068
#> 3    5.5 0.7071068
#> 4    7.5 0.7071068
#> 5    9.5 0.7071068
To use only a subset of the data stored as an FBM, you should almost never make a copy of the data; instead, use the parameters ind.row (or ind.train) and ind.col to apply functions to a subset of the data.
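The idea behind passing indices rather than copying can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not bigstatsr's implementation: the function `subset_colsums` and its parameters `row_ind` and `col_ind` are hypothetical, and the indices are zero-based here, whereas ind.row and ind.col are one-based in R.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column sums restricted to selected rows and columns. The original
// column-major buffer is read in place through the index vectors;
// no smaller copy of the matrix is created first.
std::vector<double> subset_colsums(const std::vector<double>& data,
                                   std::size_t nrow, std::size_t ncol,
                                   const std::vector<std::size_t>& row_ind,
                                   const std::vector<std::size_t>& col_ind) {
  assert(data.size() == nrow * ncol);
  std::vector<double> res(col_ind.size(), 0.0);

  for (std::size_t jj = 0; jj < col_ind.size(); jj++) {
    assert(col_ind[jj] < ncol);
    for (std::size_t ii = 0; ii < row_ind.size(); ii++) {
      assert(row_ind[ii] < nrow);
      // Translate the position in the subset to a position in the full data.
      res[jj] += data[row_ind[ii] + col_ind[jj] * nrow];
    }
  }
  return res;
}
```

The only extra memory used is the two index vectors and the result, which matters when the subset itself would be too large to duplicate on disk or in memory.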