Chapter 3 Working with an FBM
3.1 Similar accessor as R matrices
#> [1] "C:\\Users\\au639593\\OneDrive - Aarhus universitet\\Desktop\\bigsnpr-extdoc\\test.bk"
#> [1] "C:\\Users\\au639593\\OneDrive - Aarhus universitet\\Desktop\\bigsnpr-extdoc\\test.rds"
You can access the whole FBM as an R matrix in memory using X[]
.
However, if the matrix is too large to fit in memory, you should always access only a subset of columns.
Note that the elements of the FBM are stored column-wise (as for a standard R matrix). Therefore, be careful not to access a subset of rows, since it would read non-contiguous elements from the whole matrix from disk.
#> [1] 1 2
#> [1] 1 3 5 7 9
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 3 5 7 9
#> [2,] 2 4 6 8 10
3.1.1 Split-(par)Apply-Combine Strategy
#> [1] 3 7 11 15 19
Instead, the split-apply-combine strategy works well for applying standard R functions to FBMs (possibly in parallel), as implemented in big_apply()
.
Learn more with this tutorial on big_apply()
.
Compute the sum of each column of X <- big_attachExtdata()
using big_apply()
.
3.2 Similar accessor as Rcpp matrices
In case you want to develop new R functions for this format while coding in C++ (which is the case for many bigstatsr/bigsnpr functions).
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>
// [[Rcpp::export]]
NumericVector bigcolsums(Environment BM) {
XPtr<FBM> xpBM = BM["address"]; // get the external pointer
BMAcc<double> macc(xpBM); // create an accessor to the data
size_t n = macc.nrow(); // similar code as for an Rcpp::NumericMatrix
size_t m = macc.ncol(); // similar code as for an Rcpp::NumericMatrix
NumericVector res(m);
for (size_t j = 0; j < m; j++)
for (size_t i = 0; i < n; i++)
res[j] += macc(i, j); // similar code as for an Rcpp::NumericMatrix
return res;
}
For a subset of the data:
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>
// [[Rcpp::export]]
NumericVector bigcolsums2(Environment BM,
const IntegerVector& rowInd,
const IntegerVector& colInd) {
XPtr<FBM> xpBM = BM["address"];
// accessor to a sub-view of the data -> the only line of code that should change
SubBMAcc<double> macc(xpBM, rowInd, colInd, 1);
size_t n = macc.nrow();
size_t m = macc.ncol();
NumericVector res(m);
for (size_t j = 0; j < m; j++)
for (size_t i = 0; i < n; i++)
res[j] += macc(i, j);
return res;
}
3.3 Some summary functions are already implemented
#> A Filebacked Big Matrix of type 'code 256' with 517 rows and 4542 columns.
#> sum var
#> 1 354 0.4604831
#> 2 213 0.3241195
#> 3 245 0.3932122
#> 4 191 0.3109247
#> 5 472 0.5176030
#> 6 368 0.4613528
#> 7 132 0.2292594
#> 8 497 0.4791207
#> 9 498 0.5160886
#> 10 481 0.5416535
#> 11 460 0.5168908
#> 12 393 0.4772465
#> 13 80 0.1465521
#> 14 195 0.3245168
#> 15 468 0.5084417
#> 16 287 0.4489901
#> 17 257 0.3745071
#> 18 371 0.4356004
#> 19 330 0.4444994
#> 20 479 0.5178430
#> 21 158 0.2707630
#> 22 389 0.4540881
#> 23 335 0.4882371
#> 24 394 0.4801104
#> 25 430 0.4968213
#> [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]
#> center scale
#> 1 0.6847195 0.6785891
#> 2 0.4119923 0.5693149
#> 3 0.4738878 0.6270663
#> 4 0.3694391 0.5576062
#> 5 0.9129594 0.7194463
#> 6 0.7117988 0.6792295
#> 7 0.2553191 0.4788104
#> 8 0.9613153 0.6921855
#> 9 0.9632495 0.7183931
#> 10 0.9303675 0.7359712
#> 11 0.8897485 0.7189512
#> 12 0.7601547 0.6908303
#> 13 0.1547389 0.3828213
#> 14 0.3771760 0.5696638
#> 15 0.9052224 0.7130510
#> 16 0.5551257 0.6700673
#> 17 0.4970986 0.6119698
#> 18 0.7176015 0.6600003
#> 19 0.6382979 0.6667079
#> 20 0.9264990 0.7196131
#> 21 0.3056093 0.5203490
#> 22 0.7524178 0.6738606
#> 23 0.6479691 0.6987397
#> 24 0.7620890 0.6929000
#> 25 0.8317215 0.7048555
#> [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]
#> center scale
#> 1 0.6847195 0.6710433
#> 2 0.4119923 0.5719471
#> 3 0.4738878 0.6013343
#> 4 0.3694391 0.5488137
#> 5 0.9129594 0.7044231
#> 6 0.7117988 0.6771042
#> 7 0.2553191 0.4719377
#> 8 0.9613153 0.7065775
#> 9 0.9632495 0.7066291
#> 10 0.9303675 0.7053904
#> 11 0.8897485 0.7027961
#> 12 0.7601547 0.6864671
#> 13 0.1547389 0.3778450
#> 14 0.3771760 0.5532135
#> 15 0.9052224 0.7039237
#> 16 0.5551257 0.6332799
#> 17 0.4970986 0.6111834
#> 18 0.7176015 0.6783256
#> 19 0.6382979 0.6592312
#> 20 0.9264990 0.7051942
#> 21 0.3056093 0.5088327
#> 22 0.7524178 0.6850923
#> 23 0.6479691 0.6618437
#> 24 0.7620890 0.6868036
#> 25 0.8317215 0.6970231
#> [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]
To only use a subset of the data stored as an FBM, you should almost never make a copy of the data; instead, use parameters ind.row
(or ind.train
) and ind.col
to apply functions to a subset of the data.