The R package bigstatsr provides functions for fast statistical analysis of large-scale data encoded as matrices. The package can handle matrices that are too large to fit in memory. It relies on the big.matrix format provided by the R package bigmemory.
The package bigstatsr lets users with a laptop perform statistical analyses on several dozen gigabytes of data. It is fast and efficient for four reasons. First, bigstatsr is memory-efficient because it processes only small chunks of data at a time. Second, special care has been taken to implement efficient algorithms. Third, big.matrix objects use memory-mapping, which provides efficient access to matrices. Finally, as matrices are stored on disk, many processes can easily access them in parallel.
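The chunking idea can be sketched with bigmemory alone. This is a minimal illustration, not bigstatsr's own implementation: the helper `chunked_colsums`, the block size, and the file names are hypothetical, and the matrix is kept tiny so the example runs quickly.

```r
library(bigmemory)

# Use a fresh temporary directory so the backing files never collide.
tmp <- tempfile("bm_chunks_")
dir.create(tmp)

n <- 1000
p <- 50

# A file-backed big.matrix: the data live on disk and are memory-mapped.
X <- filebacked.big.matrix(
  nrow = n, ncol = p, type = "double",
  backingfile = "demo.bk", backingpath = tmp,
  descriptorfile = "demo.desc"
)

# Fill the matrix one column at a time.
for (j in seq_len(p)) X[, j] <- rnorm(n)

# Hypothetical helper: column sums computed one block of columns at a time,
# so that at most `block_size` columns are copied into R memory at once.
chunked_colsums <- function(X, block_size = 10) {
  sums <- numeric(ncol(X))
  for (start in seq(1, ncol(X), by = block_size)) {
    ind <- start:min(start + block_size - 1, ncol(X))
    sums[ind] <- colSums(X[, ind, drop = FALSE])
  }
  sums
}

res <- chunked_colsums(X)
```

Only the columns in `ind` are copied into an ordinary R matrix at each step, so peak memory use is governed by `block_size` rather than by the full size of `X`.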
Note that most of the algorithms of this package don’t handle missing values.
For now, you can install this package using:
devtools::install_github("privefl/bigstatsr")
As inputs, the package bigstatsr can use either big.matrix.descriptor objects or simply big.matrix objects (hereinafter referred to as ‘bigmatrices’). Using filebacked bigmatrices seems a convenient solution, as they use only disk storage. Descriptors may be preferred for several reasons:

- […] X[,] (recall that this package aims at handling matrices that are too large to fit in memory).
- Because a big.matrix object is an external pointer to a C++ data structure, R can’t re-attach it (e.g. when restarting the R session) without further information. The big.matrix.descriptor object provides this information.
- […] big.matrix.descriptor objects at a given point in time.

Moreover, a new class is introduced: BM.code. It is a bigmatrix of type raw (one-byte unsigned integer) with an embedded lookup table (the slot code). This lets you efficiently store a very large matrix with up to 256 different values. For example, the package bigsnpr uses it to store genotype matrices.
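The lookup-table idea behind BM.code can be illustrated in base R, without bigstatsr. This is only a sketch of the concept: the `decode` helper and the genotype-like coding (0, 1, 2, with every other byte mapped to NA) are made up for illustration and are not the package's actual implementation.

```r
# A 256-entry lookup table: byte value b decodes to code[b + 1].
# The coding below is hypothetical: bytes 0, 1, 2 keep their value,
# all other bytes decode to NA.
code <- rep(NA_real_, 256)
code[1:3] <- c(0, 1, 2)

# Store the data as raw bytes (one byte per entry), filled column by column.
X_raw <- matrix(as.raw(c(0, 1, 2, 3, 2, 0)), nrow = 2)

# Hypothetical decoder: map each byte through the lookup table
# and restore the matrix dimensions.
decode <- function(x, code) {
  out <- code[as.integer(x) + 1L]
  dim(out) <- dim(x)
  out
}

decoded <- decode(X_raw, code)
```

Each entry costs one byte on disk or in memory, while the decoded values can be any of up to 256 doubles chosen through `code`.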
To facilitate the manipulation of descriptors and BM.code objects, some methods have been added or extended:

- nrow, ncol, dim and length of a descriptor object access the underlying dimensions of the described bigmatrix (use typeof to get the storage mode).
- describe and attach.BM are used to switch between descriptors and bigmatrices. Note that, in order to standardize algorithms, describing a descriptor or attaching a bigmatrix simply returns the same object.
- as.BM.code converts a bigmatrix to a BM.code (by specifying its lookup table).

Please open an issue if you find a bug. If you want help using bigmemory or bigstatsr, please post on Stack Overflow with the tag r-bigmemory. See also: How to make a great R reproducible example?
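Returning to the switch between descriptors and bigmatrices described in the methods list, the round trip can be sketched with bigmemory's own `describe` and `attach.big.matrix`. The example deliberately avoids `attach.BM` so as not to assume its exact signature; the file names and matrix sizes are illustrative only.

```r
library(bigmemory)

# Use a fresh temporary directory so the backing files never collide.
tmp <- tempfile("bm_desc_")
dir.create(tmp)

# A small file-backed big.matrix, initialized to 0.
X <- filebacked.big.matrix(
  nrow = 5, ncol = 3, type = "integer", init = 0L,
  backingfile = "roundtrip.bk", backingpath = tmp,
  descriptorfile = "roundtrip.desc"
)
X[, 1] <- 1:5

# The descriptor holds the information needed to find the data again.
desc <- describe(X)

# Attaching the descriptor yields a second handle on the same on-disk data.
X2 <- attach.big.matrix(desc)

# A write through one handle is visible through the other,
# because both map the same backing file.
X2[2, 2] <- 10L
value_seen_by_X <- X[2, 2]
```

The descriptor, unlike the external pointer inside `X`, is plain information that can be saved or passed to another R process, which then attaches the matrix on its side.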
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.