Chapter 2 Inputs and formats
2.1 In {bigstatsr}
The format provided in package {bigstatsr} is called a Filebacked Big Matrix (FBM). It is an on-disk matrix format which is accessed through memory-mapping.
How memory-mapping works:
- when you access the 1st element (1st row, 1st col), a block (say the first column) is read from disk into memory (RAM)
- when you access the 2nd element (2nd row, 1st col), it is already in memory so it is accessed very fast
- when you access the second column, data is read from disk again (once)
- you can access many columns like that, until you do not have enough memory anymore
- some space is then freed automatically so that new columns can be loaded into memory
- everything is seamlessly managed by the operating system (OS)
- it is also very convenient for parallelism as data is shared between processes
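The behaviour described above can be sketched as follows. This is only a minimal illustration, assuming {bigstatsr} is installed; the dimensions, the value 42 and the variable names (`bk_base`, `X`, `X2`) are arbitrary choices made for this example.

```r
library(bigstatsr)

# Hypothetical backing-file path (without extension), chosen for this sketch
bk_base <- tempfile()

# Create a 1000 x 20 FBM of doubles, filled with 0 and stored in '<bk_base>.bk'
X <- FBM(1000, 20, init = 0, backingfile = bk_base)
file.exists(X$backingfile)

# Elements are accessed as with a standard R matrix;
# the OS decides which parts of the file are kept in RAM
X[1, 1]
X[1:5, 2]

# Writes go to the same memory-mapped file
X[1, 1] <- 42

# A second object mapping the same file (here in the same session, but it
# could be in another R process) accesses the same data without copying it
X2 <- FBM(1000, 20, backingfile = bk_base, create_bk = FALSE)
X2[1, 1]
```

Because both objects point to the same file on disk, nothing is duplicated in memory, which is what makes this format convenient for parallel processing.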
All the elements of an FBM have the same type; supported types are:
- "double" (the default, double precision, 8 bytes per element)
- "float" (single precision, 4 bytes)
- "integer" (signed, so between \(-2^{31}\) and \(2^{31} - 1\), 4 bytes)
- "unsigned short": can store integer values from \(0\) to \(65535\) (2 bytes)
- "raw" or "unsigned char": can store integer values from \(0\) to \(255\) (1 byte). It is the basis for class FBM.code256 in order to access 256 arbitrary different numeric values. It is used in package {bigsnpr} (see below).
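As a rough check of the sizes listed above, one can create a small FBM of each type and divide the size of its backing file by its number of elements. This sketch assumes {bigstatsr} is installed; the 100 x 10 dimensions and the name `bytes_per_element` are arbitrary.

```r
library(bigstatsr)

types <- c("double", "float", "integer", "unsigned short", "unsigned char")

# For each type, create a 100 x 10 FBM (in a temporary file) and
# compute the number of bytes used on disk per element
bytes_per_element <- sapply(types, function(type) {
  X <- FBM(100, 10, type = type)
  file.size(X$backingfile) / length(X)
})
bytes_per_element
```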
2.2 In {bigsnpr}
Package {bigsnpr} uses a class called bigSNP for representing SNP data. A bigSNP object is merely a list with the following elements:
- $genotypes: An FBM.code256. Rows are samples and columns are genetic variants. This stores genotype calls or dosages (rounded to 2 decimal places).
- $fam: A data.frame with some information on the samples.
- $map: A data.frame with some information on the genetic variants.
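To see this structure on real data, one possibility is to attach the small example dataset shipped with {bigsnpr}. This is a sketch assuming {bigsnpr} is installed; the names `obj` and `G` are only used here for illustration.

```r
library(bigsnpr)

# Attach the example dataset shipped with {bigsnpr} (copied to a temporary file)
obj <- snp_attachExtdata()

names(obj)             # elements of the list
class(obj$genotypes)   # class of the genotype matrix
dim(obj$genotypes)     # samples x variants
head(obj$fam)          # information on the samples
head(obj$map)          # information on the genetic variants

# The genotype matrix can be subset like a standard R matrix
G <- obj$genotypes
G[1:5, 1:8]
```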
The code used in class FBM.code256 for imputed data is e.g.
bigsnpr::CODE_DOSAGE
#> [1] 0.00 1.00 2.00 NA 0.00 1.00 2.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06
#> [15] 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20
#> [29] 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
#> [43] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
#> [57] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62
#> [71] 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76
#> [85] 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90
#> [99] 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.01 1.02 1.03 1.04
#> [113] 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18
#> [127] 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32
#> [141] 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.40 1.41 1.42 1.43 1.44 1.45 1.46
#> [155] 1.47 1.48 1.49 1.50 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60
#> [169] 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.70 1.71 1.72 1.73 1.74
#> [183] 1.75 1.76 1.77 1.78 1.79 1.80 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88
#> [197] 1.89 1.90 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.00 NA NA
#> [211] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [225] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [239] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [253] NA NA NA NA
where the first four elements are used to store genotype calls (0, 1, 2 and NA), the next three to store imputed allele counts, and the next 201 values to store dosages rounded to 2 decimal places. This allows for handling many types of data while storing each element using one byte only (4 times more than bed files, which use 2 bits per genotype, but 8 times less than doubles).
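The decoding can be illustrated with a tiny FBM.code256 built by hand. This is a sketch assuming {bigstatsr} and {bigsnpr} are installed; the six raw bytes and the names `bytes` and `G` are arbitrary choices for this example. A stored byte `b` (between 0 and 255) is decoded as `code[b + 1]` when the matrix is accessed.

```r
library(bigstatsr)

code <- bigsnpr::CODE_DOSAGE

# The three parts of the code
code[1:4]            # genotype calls: 0, 1, 2 and NA
code[5:7]            # imputed allele counts: 0, 1, 2
range(code[8:208])   # range of the 201 dosage values

# Raw bytes are what is actually stored on disk
bytes <- as.raw(c(1, 3, 5, 57, 107, 207))
G <- FBM.code256(nrow = 2, ncol = 3, code = code, init = bytes)

# Accessing the matrix returns the decoded values, i.e. code[b + 1] for each byte b
G[]

# Number of bytes used on disk per element
file.size(G$backingfile) / length(G)
```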
Since v1.0, package {bigsnpr} also provides functions for directly working on bed files with a small percentage of missing values (Privé, Luu, Blum, et al., 2020).
If there is a demand for it, I might extend functions in {bigsnpr} to handle more types of FBMs than only FBM.code256. We have started talking about this in this issue.
2.3 Getting an FBM or bigSNP object
- The easiest way to get an FBM is to use the constructor function FBM() or the converter as_FBM().
- To read an FBM from a large text file, you can use function big_read() (see this vignette).
- To read a bigSNP object from bed/bim/fam files, you can use functions snp_readBed() and snp_readBed2() (the second can read a subset of individuals/variants and use parallelism).
- To read dosages from BGEN files, you can use function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores. Note that this function works only for BGEN v1.2 with probabilities stored as 8 bits (cf. this issue), which is the case for e.g. the UK Biobank files.
- To read any format used in genetics, you can always convert blocks of the data to text files using PLINK, read these using bigreadr::fread2(), and fill part of the resulting FBM. For example, see the code I used to convert the iPSYCH imputed data from the RICOPILI pipeline to my bigSNP format.
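Some of the options above can be sketched on small data. This example assumes {bigstatsr} and {bigsnpr} are installed; it uses the `iris` dataset and the example bed file shipped with {bigsnpr}, and all object names (`X1`, `X2`, `X3`, `obj`, `obj2`) and subset sizes are arbitrary. Reading BGEN files is not shown because it requires external files.

```r
library(bigstatsr)
library(bigsnpr)

# 1. Constructor: a 10 x 5 FBM of doubles, initialized with zeros
X1 <- FBM(10, 5, init = 0)

# 2. Converter: from a standard R matrix
mat <- matrix(rnorm(20), nrow = 4)
X2 <- as_FBM(mat)

# 3. From a text file: write 'iris' to a temporary CSV, then read its
#    first four (numeric) columns as an FBM
csv <- tempfile(fileext = ".csv")
write.csv(iris, csv, row.names = FALSE)
X3 <- big_read(csv, select = 1:4, type = "double")
dim(X3)

# 4. From bed/bim/fam files: the example files shipped with {bigsnpr}
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
rds <- snp_readBed(bedfile, backingfile = tempfile())  # returns the path to an .rds file
obj <- snp_attach(rds)                                 # loads the bigSNP object

# Read only a subset of individuals (rows) and variants (columns)
rds2 <- snp_readBed2(bedfile, backingfile = tempfile(),
                     ind.row = 1:100, ind.col = 1:500)
obj2 <- snp_attach(rds2)
dim(obj2$genotypes)
```

Note that snp_readBed() and snp_readBed2() return the path to the stored object, which then needs to be loaded with snp_attach(); the same .rds file can be attached again in later sessions without re-reading the bed file.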