Chapter 3 Working with an FBM

3.1 Similar accessor as R matrices

library(bigstatsr)
X <- FBM(2, 5, init = 1:10, backingfile = "test")$save()
X$backingfile                # the file where numeric data is stored
#> [1] "C:\\Users\\au639593\\OneDrive - Aarhus universitet\\Desktop\\bigsnpr-extdoc\\test.bk"
X$rds                        # the file where information about the data is stored
#> [1] "C:\\Users\\au639593\\OneDrive - Aarhus universitet\\Desktop\\bigsnpr-extdoc\\test.rds"
X <- big_attach("test.rds")  # can get the FBM from any R session

You can access the whole FBM as an R matrix in memory using X[]. However, if the matrix is too large to fit in memory, you should always access only a subset of columns. Note that the elements of the FBM are stored column-wise (as for a standard R matrix). Therefore, be careful not to access a subset of rows, since it would read non-contiguous elements from the whole matrix from disk.

X[, 1]  # ok (must read first column only)
#> [1] 1 2
X[1, ]  # bad (must read all data from disk)
#> [1] 1 3 5 7 9
X[]     # super bad (standard R matrix in memory)
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    3    5    7    9
#> [2,]    2    4    6    8   10

3.1.1 Split-(par)Apply-Combine Strategy

colSums(X[])  # super bad
#> [1]  3  7 11 15 19

Instead, the split-apply-combine strategy works well for applying standard R functions to FBMs (possibly in parallel), as implemented in big_apply().

Learn more with this tutorial on big_apply().

Compute the sum of each column of X <- big_attachExtdata() using big_apply().

3.2 Similar accessor as Rcpp matrices

In case you want to develop new R functions for this format while coding in C++ (which is the case for many bigstatsr/bigsnpr functions).

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>

// [[Rcpp::export]]
NumericVector bigcolsums(Environment BM) {
  
  XPtr<FBM> xpBM = BM["address"];   // get the external pointer
  BMAcc<double> macc(xpBM);         // create an accessor to the data
  
  size_t n = macc.nrow();           // similar code as for an Rcpp::NumericMatrix
  size_t m = macc.ncol();           // similar code as for an Rcpp::NumericMatrix
  
  NumericVector res(m);
  
  for (size_t j = 0; j < m; j++) 
    for (size_t i = 0; i < n; i++)
      res[j] += macc(i, j);         // similar code as for an Rcpp::NumericMatrix
  
  return res;
}

For a subset of the data:

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(bigstatsr, rmio)]]
#include <bigstatsr/BMAcc.h>

// [[Rcpp::export]]
NumericVector bigcolsums2(Environment BM,
                          const IntegerVector& rowInd,
                          const IntegerVector& colInd) {
  
  XPtr<FBM> xpBM = BM["address"];   
  // accessor to a sub-view of the data -> the only line of code that should change
  SubBMAcc<double> macc(xpBM, rowInd, colInd, 1);  
  
  size_t n = macc.nrow();
  size_t m = macc.ncol();
  
  NumericVector res(m);
  
  for (size_t j = 0; j < m; j++) 
    for (size_t i = 0; i < n; i++)
      res[j] += macc(i, j); 
  
  return res;
}

3.3 Some summary functions are already implemented

library(bigsnpr)
(X2 <- snp_attachExtdata()$genotypes)
#> A Filebacked Big Matrix of type 'code 256' with 517 rows and 4542 columns.
big_colstats(X2)      # sum and var (for each column)
#>    sum       var
#> 1  354 0.4604831
#> 2  213 0.3241195
#> 3  245 0.3932122
#> 4  191 0.3109247
#> 5  472 0.5176030
#> 6  368 0.4613528
#> 7  132 0.2292594
#> 8  497 0.4791207
#> 9  498 0.5160886
#> 10 481 0.5416535
#> 11 460 0.5168908
#> 12 393 0.4772465
#> 13  80 0.1465521
#> 14 195 0.3245168
#> 15 468 0.5084417
#> 16 287 0.4489901
#> 17 257 0.3745071
#> 18 371 0.4356004
#> 19 330 0.4444994
#> 20 479 0.5178430
#> 21 158 0.2707630
#> 22 389 0.4540881
#> 23 335 0.4882371
#> 24 394 0.4801104
#> 25 430 0.4968213
#>  [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]
big_scale()(X2)       # mean and sd (for each column)
#>       center     scale
#> 1  0.6847195 0.6785891
#> 2  0.4119923 0.5693149
#> 3  0.4738878 0.6270663
#> 4  0.3694391 0.5576062
#> 5  0.9129594 0.7194463
#> 6  0.7117988 0.6792295
#> 7  0.2553191 0.4788104
#> 8  0.9613153 0.6921855
#> 9  0.9632495 0.7183931
#> 10 0.9303675 0.7359712
#> 11 0.8897485 0.7189512
#> 12 0.7601547 0.6908303
#> 13 0.1547389 0.3828213
#> 14 0.3771760 0.5696638
#> 15 0.9052224 0.7130510
#> 16 0.5551257 0.6700673
#> 17 0.4970986 0.6119698
#> 18 0.7176015 0.6600003
#> 19 0.6382979 0.6667079
#> 20 0.9264990 0.7196131
#> 21 0.3056093 0.5203490
#> 22 0.7524178 0.6738606
#> 23 0.6479691 0.6987397
#> 24 0.7620890 0.6929000
#> 25 0.8317215 0.7048555
#>  [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]
snp_scaleBinom()(X2)  # 2 * af (mean) and sqrt(2 * af * (1 - af)) (~sd)
#>       center     scale
#> 1  0.6847195 0.6710433
#> 2  0.4119923 0.5719471
#> 3  0.4738878 0.6013343
#> 4  0.3694391 0.5488137
#> 5  0.9129594 0.7044231
#> 6  0.7117988 0.6771042
#> 7  0.2553191 0.4719377
#> 8  0.9613153 0.7065775
#> 9  0.9632495 0.7066291
#> 10 0.9303675 0.7053904
#> 11 0.8897485 0.7027961
#> 12 0.7601547 0.6864671
#> 13 0.1547389 0.3778450
#> 14 0.3771760 0.5532135
#> 15 0.9052224 0.7039237
#> 16 0.5551257 0.6332799
#> 17 0.4970986 0.6111834
#> 18 0.7176015 0.6783256
#> 19 0.6382979 0.6592312
#> 20 0.9264990 0.7051942
#> 21 0.3056093 0.5088327
#> 22 0.7524178 0.6850923
#> 23 0.6479691 0.6618437
#> 24 0.7620890 0.6868036
#> 25 0.8317215 0.6970231
#>  [ reached 'max' / getOption("max.print") -- omitted 4517 rows ]

To only use a subset of the data stored as an FBM, you should almost never make a copy of the data; instead, use parameters ind.row (or ind.train) and ind.col to apply functions to a subset of the data.