Chapter 2 Inputs and formats

2.1 In {bigstatsr}

The format provided in package {bigstatsr} is called a Filebacked Big Matrix (FBM). It is an on-disk matrix format which is accessed through memory-mapping.

How memory-mapping works:

  • when you access the 1st element (1st row, 1st col), it accesses a block (say the first column) from disk into memory (RAM)
  • when you access the 2nd element (2nd row, 1st col), it is already in memory so it is accessed very fast
  • when you access the second column, you access from disk again (once)
  • you can access many columns like that, until you do not have enough memory anymore
  • some space is freed automatically so that new columns can be accessed into memory
  • everything is seamlessly managed by the operating system (OS)
  • it is also very convenient for parallelism as data is shared between processes

All the elements of an FBM have the same type; supported types are:

  • "double" (the default, double precision – 8 bytes per element)
  • "float" (single precision – 4 bytes)
  • "integer" (signed, so between \(\text{-}2^{31}\) and (\(2^{31} \text{ - } 1\)) – 4 bytes)
  • "unsigned short": can store integer values from \(0\) to \(65535\) (2 bytes)
  • "raw" or "unsigned char": can store integer values from \(0\) to \(255\) (1 byte). It is the basis for class FBM.code256 that can access 256 arbitrary different numeric values (decoded using a CODE_*), which is used in package {bigsnpr} (see below).

2.2 In {bigsnpr}

Package {bigsnpr} uses a class called bigSNP for representing SNP data. A bigSNP object is merely a list with the following elements:

  • $genotypes: A FBM.code256. Rows are samples and columns are genetic variants. This stores genotype calls or dosages (rounded to 2 decimal places).
  • $fam: A data.frame with some information on the samples.
  • $map: A data.frame with some information on the genetic variants.

The code used in class FBM.code256 for imputed data is e.g. 

bigsnpr::CODE_DOSAGE
#>   [1] 0.00 1.00 2.00   NA 0.00 1.00 2.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06
#>  [15] 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20
#>  [29] 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
#>  [43] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
#>  [57] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62
#>  [71] 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76
#>  [85] 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90
#>  [99] 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.01 1.02 1.03 1.04
#> [113] 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18
#> [127] 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32
#> [141] 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.40 1.41 1.42 1.43 1.44 1.45 1.46
#> [155] 1.47 1.48 1.49 1.50 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60
#> [169] 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.70 1.71 1.72 1.73 1.74
#> [183] 1.75 1.76 1.77 1.78 1.79 1.80 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88
#> [197] 1.89 1.90 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.00   NA   NA
#> [211]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [225]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [239]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [253]   NA   NA   NA   NA

where the first four elements are used to store genotype calls, the next three to store imputed allele counts, and the next 201 values to store dosages rounded to 2 decimal places. This allows for handling many data types (genotype calls and dosages) while storing each element using one byte only (x4 compared to bed files, but /8 compared to double-precision floating-point numbers).

Package {bigsnpr} also provides functions for directly working on bed files with a small percentage of missing values (Privé, Luu, Blum, et al., 2020).

2.3 Getting an FBM or bigSNP object

  • The easiest way to get an FBM is to use the constructor function FBM() or the converter as_FBM().

  • To read an FBM from a large text file, you can use function big_read() (see this vignette).

  • To read a bigSNP object from bed/bim/fam files, you can use functions snp_readBed() and snp_readBed2() (the second can read a subset of individuals/variants and use parallelism).

  • To read dosages from BGEN files, you can use function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores. Note that this function currently works only for BGEN v1.2 with probabilities stored as 8 bits (cf. this issue), which is the case for e.g. the UK Biobank files.

  • To read any format used in genetics, you can always convert blocks of the data to text files using PLINK, read these using bigreadr::fread2(), and fill part of the resulting FBM. For example, see the code I used to convert the iPSYCH imputed data from the RICOPILI pipeline to my bigSNP format.

Example converting a bed file to bigSNP:

library(bigsnpr)
#> Loading required package: bigstatsr
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
(rds <- snp_readBed2(bedfile, backingfile = tempfile()))  # get path to new file
#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file64047c5a1e58.rds"
bigsnp <- snp_attach(rds)  # can then read in the bigSNP object in any R session
(G <- bigsnp$genotypes)
#> A Filebacked Big Matrix of type 'code 256' with 517 rows and 4542 columns.
G[1:5, 1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    0    2    0    1
#> [2,]    1    0    1    0    0
#> [3,]    0    1    1    0    2
#> [4,]    0    0    2    0    2
#> [5,]    1    0    0    0    2
str(bigsnp$fam)
#> 'data.frame':    517 obs. of  6 variables:
#>  $ family.ID  : chr  "POP1" "POP1" "POP1" "POP1" ...
#>  $ sample.ID  : chr  "IND0" "IND1" "IND2" "IND3" ...
#>  $ paternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ maternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ sex        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ affection  : int  1 1 2 1 1 1 1 1 1 1 ...
str(bigsnp$map)
#> 'data.frame':    4542 obs. of  6 variables:
#>  $ chromosome  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ marker.ID   : chr  "SNP0" "SNP1" "SNP2" "SNP3" ...
#>  $ genetic.dist: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ physical.pos: int  112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#>  $ allele1     : chr  "A" "T" "T" "T" ...
#>  $ allele2     : chr  "T" "A" "A" "A" ...

Example directly mapping the bed file:

obj.bed <- bed(bedfile)
obj.bed[1:5, 1:5]
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    0    2    0    1
#> [2,]    1    0    1    0    0
#> [3,]    0    1    1    0    2
#> [4,]    0    0    2    0    2
#> [5,]    1    0    0    0    2
str(obj.bed$fam)
#> 'data.frame':    517 obs. of  6 variables:
#>  $ family.ID  : chr  "POP1" "POP1" "POP1" "POP1" ...
#>  $ sample.ID  : chr  "IND0" "IND1" "IND2" "IND3" ...
#>  $ paternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ maternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ sex        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ affection  : int  1 1 2 1 1 1 1 1 1 1 ...
str(obj.bed$map)
#> 'data.frame':    4542 obs. of  6 variables:
#>  $ chromosome  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ marker.ID   : chr  "SNP0" "SNP1" "SNP2" "SNP3" ...
#>  $ genetic.dist: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ physical.pos: int  112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#>  $ allele1     : chr  "A" "T" "T" "T" ...
#>  $ allele2     : chr  "T" "A" "A" "A" ...

Example converting a bgen file to bigSNP:

bgen <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/3ec770a829a753282b5cb45afc3f4eda036b246705b76f9037b6cc98c41a4194?at=example.8bits.bgen",
  fname = "example.bgen")
bgi <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/dc7276e0f0e2e096f58d2dac645aa5711de2cd64c3b29a07a80575e175344f78?at=example.8bits.bgen.bgi",
  fname = "example.bgen.bgi")
sample <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/a3c4d8e4c132048a502dc00a3e51362f98eda5a2889df695ba260dc48c327fd9?at=example.sample",
  fname = "example.sample")
# What should you be careful about when reading this file?
readLines(sample, n = 5)
#> [1] "ID_1"       "0"          "sample_001" "sample_002" "sample_003"
library(bigsnpr)
(var_info <- snp_readBGI(bgi))
#> # A tibble: 199 × 8
#>    chromosome position rsid     number_of_alleles allele1 allele2
#>    <chr>         <int> <chr>                <int> <chr>   <chr>  
#>  1 01             1001 RSID_101                 2 A       G      
#>  2 01             2000 RSID_2                   2 A       G      
#>  3 01             2001 RSID_102                 2 A       G      
#>  4 01             3000 RSID_3                   2 A       G      
#>  5 01             3001 RSID_103                 2 A       G      
#>  6 01             4000 RSID_4                   2 A       G      
#>  7 01             4001 RSID_104                 2 A       G      
#>  8 01             5000 RSID_5                   2 A       G      
#>  9 01             5001 RSID_105                 2 A       G      
#> 10 01             6000 RSID_6                   2 A       G      
#> # ℹ 189 more rows
#> # ℹ 2 more variables: file_start_position <dbl>, size_in_bytes <int>
(snp_id <- with(var_info[1:10, ], paste(chromosome, position, allele1, allele2, sep = "_")))
#>  [1] "01_1001_A_G" "01_2000_A_G" "01_2001_A_G" "01_3000_A_G" "01_3001_A_G"
#>  [6] "01_4000_A_G" "01_4001_A_G" "01_5000_A_G" "01_5001_A_G" "01_6000_A_G"
(rds <- snp_readBGEN(bgen, backingfile = tempfile(), list_snp_id = list(snp_id)))
#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file640451833652.rds"
bigsnp <- snp_attach(rds)
(G <- bigsnp$genotypes)
#> A Filebacked Big Matrix of type 'code 256' with 500 rows and 10 columns.
str(bigsnp$map)  # `$freq` and `$info` are computed when reading the data
#> tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ chromosome  : chr [1:10] "01" "01" "01" "01" ...
#>  $ marker.ID   : chr [1:10] "SNPID_101" "SNPID_2" "SNPID_102" "SNPID_3" ...
#>  $ rsid        : chr [1:10] "RSID_101" "RSID_2" "RSID_102" "RSID_3" ...
#>  $ physical.pos: int [1:10] 1001 2000 2001 3000 3001 4000 4001 5000 5001 6000
#>  $ allele1     : chr [1:10] "A" "A" "A" "A" ...
#>  $ allele2     : chr [1:10] "G" "G" "G" "G" ...
#>  $ freq        : num [1:10] 0.583 0.802 0.198 0.483 0.517 ...
#>  $ info        : num [1:10] 0.975 0.668 0.666 0.952 0.952 ...

There is no $fam information in this case, which should be read from other data and match using IDs from the sample file.

References

Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.