The R package bigstatsr provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory. It relies on the big.matrix format provided by the R package bigmemory.
The package bigstatsr enables users with a laptop to perform statistical analysis of dozens of gigabytes of data. The package is fast and efficient for four reasons. First, bigstatsr is memory-efficient because it uses only small chunks of data at a time. Second, special care has been taken to implement effective algorithms. Third, big.matrix objects use memory-mapping, which provides efficient access to matrices. Finally, as matrices are stored on disk, many processes can easily access them in parallel.
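The chunking idea can be sketched in base R. The example below is only an illustration under stated assumptions: col_sums_by_block and block_size are hypothetical names, not part of bigstatsr, and an ordinary in-memory matrix stands in for a bigmatrix. Only one block of columns is extracted at a time, so peak memory depends on the block size rather than on the full matrix.

```r
# Hypothetical helper (not a bigstatsr function): column sums computed
# block by block, so that at most `block_size` columns are extracted at once.
col_sums_by_block <- function(X, block_size = 2) {
  m <- ncol(X)
  sums <- numeric(m)
  for (start in seq(1, m, by = block_size)) {
    # Column indices of the current block (the last block may be shorter)
    ind <- start:min(start + block_size - 1, m)
    # Only this chunk is materialized as a regular R matrix
    chunk <- X[, ind, drop = FALSE]
    sums[ind] <- colSums(chunk)
  }
  sums
}

# A small ordinary matrix stands in for a bigmatrix here
X <- matrix(as.numeric(1:20), nrow = 4)
block_sums <- col_sums_by_block(X, block_size = 2)
```

The same loop applies to any object that supports X[, ind] and ncol(X), which is the case for big.matrix objects.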
Note that most of the algorithms of this package don’t handle missing values.
For now, you can install this package from its GitHub repository (privefl/bigstatsr), for example with devtools::install_github.
As inputs, the package bigstatsr can use either big.matrix.descriptor objects or simply big.matrix objects (hereinafter referred to as ‘bigmatrices’). Using filebacked bigmatrices seems a convenient solution as it uses only disk storage. Descriptors may be preferred for several reasons:
- They guard against loading the whole matrix into memory by mistake (e.g. by typing X[,]; we recall that this package aims at handling matrices that are too large to fit in memory).
- As a big.matrix object is an external pointer to a C++ data structure, R can’t re-attach it (e.g. when restarting the R session) without any further information. The big.matrix.descriptor object provides this information.
- Parallel algorithms have to attach the bigmatrix in each process, so they need big.matrix.descriptor objects at a given point in time.
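The relationship between a filebacked bigmatrix and its descriptor can be sketched with bigmemory's own functions. This is a minimal sketch, assuming a recent bigmemory version in which the descriptor records the backing directory. It uses filebacked.big.matrix, describe and attach.big.matrix from bigmemory rather than any bigstatsr wrapper, and the file names are made up for the example.

```r
library(bigmemory)

# Unique (made-up) file names in the session's temporary directory
stem <- basename(tempfile("demo"))

# A small filebacked bigmatrix: the data live in a file on disk
X <- filebacked.big.matrix(
  nrow = 5, ncol = 3, type = "double", init = 0,
  backingfile = paste0(stem, ".bk"),
  backingpath = tempdir(),
  descriptorfile = paste0(stem, ".desc")
)
X[, 1] <- as.numeric(1:5)

# The descriptor is a plain R object holding what is needed to find the data
desc <- describe(X)

# Attaching the descriptor gives a second handle on the same backing file
X2 <- attach.big.matrix(desc)
value_seen <- X2[3, 1]
```

Because the descriptor is an ordinary R object, it can be saved or sent to another R process, which then attaches the matrix itself.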
Moreover, a new class is introduced: BM.code. It is a bigmatrix of type raw (one-byte unsigned integer) with an embedded lookup table (the slot code). This enables you to efficiently store a very large matrix with up to 256 different values. For example, this is used in the package bigsnpr to store genotype matrices.
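The lookup-table idea can be illustrated in base R without any bigmatrix. The sketch below is an assumption-laden illustration, not the BM.code implementation: decode is a hypothetical helper, and the table contents (0, 1, 2 and missing otherwise) are chosen only to resemble genotype data.

```r
# Lookup table with 256 entries, one per possible byte value (0 to 255).
# The contents are illustrative: bytes 0, 1, 2 decode to 0, 1, 2 and
# every other byte decodes to NA.
code <- rep(NA_real_, 256)
code[1:3] <- c(0, 1, 2)

# Stored data: one byte per entry
stored <- matrix(as.raw(c(0, 1, 2, 3, 2, 0)), nrow = 2)

# Hypothetical helper: byte b is decoded as code[b + 1] (R is 1-indexed)
decode <- function(raw_mat, code) {
  out <- code[as.integer(raw_mat) + 1L]
  dim(out) <- dim(raw_mat)
  out
}

decoded <- decode(stored, code)
```

Each entry costs one byte on disk, whatever the type of the decoded values, which is where the storage saving comes from.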
To facilitate the manipulation of descriptors and BM.code objects, some methods have been added or extended:
- dim, nrow, ncol and length of a descriptor object access the underlying dimensions of the described bigmatrix (use typeof to get the storage mode).
- describe and attach.BM are used to switch between descriptors and bigmatrices. Note that, in order to standardize algorithms, describing a descriptor or attaching a bigmatrix simply returns the same object.
- as.BM.code converts a bigmatrix to a BM.code (by specifying its lookup table).
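The "returns the same object" convention can be sketched with two small helpers built on bigmemory. These are hypothetical stand-ins written for illustration (as_bigmatrix and as_descriptor are not bigstatsr functions, and this is not how attach.BM is implemented). The point is that an algorithm can call them on either kind of input without checking which one it received.

```r
library(bigmemory)
library(methods)

# Hypothetical helper: always returns a bigmatrix
as_bigmatrix <- function(x) {
  if (is.big.matrix(x)) x else attach.big.matrix(x)
}

# Hypothetical helper: always returns a descriptor
as_descriptor <- function(x) {
  if (is(x, "big.matrix.descriptor")) x else describe(x)
}

# A small filebacked bigmatrix with made-up file names
stem <- basename(tempfile("same"))
X <- filebacked.big.matrix(
  nrow = 2, ncol = 2, type = "double", init = 0,
  backingfile = paste0(stem, ".bk"),
  backingpath = tempdir(),
  descriptorfile = paste0(stem, ".desc")
)
desc <- describe(X)

# Either input can be normalized to either representation
from_desc <- as_bigmatrix(desc)
from_mat  <- as_descriptor(X)
```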
Please open an issue if you find a bug. If you want help using bigmemory or bigstatsr, please post on Stack Overflow with the tag r-bigmemory, and see ‘How to make a great R reproducible example?’ before posting.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.