Chapter 2 Inputs and formats

2.1 In {bigstatsr}

The format provided in package {bigstatsr} is called a Filebacked Big Matrix (FBM). It is an on-disk matrix format which is accessed through memory-mapping.

How memory-mapping works:

  • when you access the 1st element (1st row, 1st col), it accesses a block (say the first column) from disk into memory (RAM)
  • when you access the 2nd element (2nd row, 1st col), it is already in memory so it is accessed very fast
  • when you access the second column, you access from disk again (once)
  • you can access many columns like that, until you do not have enough memory anymore
  • some space is freed automatically so that new columns can be accessed into memory
  • everything is seamlessly managed by the operating system (OS)
  • it is also very convenient for parallelism as data is shared between processes

All the elements of an FBM have the same type; supported types are:

  • "double" (the default, double precision – 8 bytes per element)
  • "float" (single precision – 4 bytes)
  • "integer" (signed, so between \(\text{-}2^{31}\) and (\(2^{31} \text{ - } 1\)) – 4 bytes)
  • "unsigned short": can store integer values from \(0\) to \(65535\) (2 bytes)
  • "raw" or "unsigned char": can store integer values from \(0\) to \(255\) (1 byte). It is the basis for class FBM.code256 in order to access 256 arbitrary different numeric values. It is used in package {bigsnpr} (see below).

2.2 In {bigsnpr}

Package {bigsnpr} uses a class called bigSNP for representing SNP data. A bigSNP object is merely a list with the following elements:

  • $genotypes: A FBM.code256. Rows are samples and columns are genetic variants. This stores genotype calls or dosages (rounded to 2 decimal places).
  • $fam: A data.frame with some information on the samples.
  • $map: A data.frame with some information on the genetic variants.

The code used in class FBM.code256 for imputed data is e.g. 

bigsnpr::CODE_DOSAGE
#>   [1] 0.00 1.00 2.00   NA 0.00 1.00 2.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06
#>  [15] 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20
#>  [29] 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
#>  [43] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
#>  [57] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62
#>  [71] 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76
#>  [85] 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90
#>  [99] 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.01 1.02 1.03 1.04
#> [113] 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18
#> [127] 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32
#> [141] 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.40 1.41 1.42 1.43 1.44 1.45 1.46
#> [155] 1.47 1.48 1.49 1.50 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60
#> [169] 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.70 1.71 1.72 1.73 1.74
#> [183] 1.75 1.76 1.77 1.78 1.79 1.80 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88
#> [197] 1.89 1.90 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.00   NA   NA
#> [211]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [225]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [239]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [253]   NA   NA   NA   NA

where the first four elements are used to store genotype calls, the next three to store imputed allele counts, and the next 201 values to store dosages rounded to 2 decimal places. This allows for handling many types of data while storing each elements using one byte only (x4 compared to bed files, but /8 compared to doubles).

Since v1.0, package {bigsnpr} also provides functions for directly working on bed files with a small percentage of missing values (Privé, Luu, Blum, et al., 2020).

If there is a demand for it, I might extend functions in {bigsnpr} to handle more types of FBMs than only FBM.code256. We have started talking about this in this issue.

2.3 Getting an FBM or bigSNP object

  • The easiest way to get an FBM is to use the constructor function FBM() or the converter as_FBM().

  • To read an FBM from a large text file, you can use function big_read() (see this vignette).

  • To read a bigSNP object from bed/bim/fam files, you can use functions snp_readBed() and snp_readBed2() (the second can read a subset of individuals/variants and use parallelism).

  • To read dosages from BGEN files, you can use function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores. Note that this function works only for BGEN v1.2 with probabilities stored as 8 bits (cf. this issue), which is the case for e.g. the UK Biobank files.

  • To read any format used in genetics, you can always convert blocks of the data to text files using PLINK, read these using bigreadr::fread2(), and fill part of the resulting FBM. For example, see the code I used to convert the iPSYCH imputed data from the RICOPILI pipeline to my bigSNP format.

References

Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.