Chapter 2 Inputs and formats

2.1 In {bigstatsr}

The format provided in package {bigstatsr} is called a Filebacked Big Matrix (FBM). It is an on-disk matrix format which is accessed through memory-mapping.

How memory-mapping works:

when you access the 1st element (1st row, 1st col), it accesses a block (say the first column) from disk into memory (RAM)
when you access the 2nd element (2nd row, 1st col), it is already in memory so it is accessed very fast
when you access the second column, you access from disk again (once)
you can access many columns like that, until you do not have enough memory anymore
some space is freed automatically so that new columns can be accessed into memory
everything is seamlessly managed by the operating system (OS)
it is also very convenient for parallelism as data is shared between processes

All the elements of an FBM have the same type; supported types are:

"double" (the default, double precision – 8 bytes per element)
"float" (single precision – 4 bytes)
"integer" (signed, so between $\text{-}2^{31}$ and ($2^{31} \text{ - } 1$) – 4 bytes)
"unsigned short": can store integer values from $0$ to $65535$ (2 bytes)
"raw" or "unsigned char": can store integer values from $0$ to $255$ (1 byte). It is the basis for class FBM.code256 that can access 256 arbitrary different numeric values (decoded using a CODE_*), which is used in package {bigsnpr} (see below).

2.2 In {bigsnpr}

Package {bigsnpr} uses a class called bigSNP for representing SNP data. A bigSNP object is merely a list with the following elements:

$genotypes: A FBM.code256. Rows are samples and columns are genetic variants. This stores genotype calls or dosages (rounded to 2 decimal places).
$fam: A data.frame with some information on the samples.
$map: A data.frame with some information on the genetic variants.

The code used in class FBM.code256 for imputed data is e.g.

bigsnpr::CODE_DOSAGE

#>   [1] 0.00 1.00 2.00   NA 0.00 1.00 2.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06
#>  [15] 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20
#>  [29] 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
#>  [43] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
#>  [57] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62
#>  [71] 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76
#>  [85] 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90
#>  [99] 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.01 1.02 1.03 1.04
#> [113] 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18
#> [127] 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32
#> [141] 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.40 1.41 1.42 1.43 1.44 1.45 1.46
#> [155] 1.47 1.48 1.49 1.50 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60
#> [169] 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.70 1.71 1.72 1.73 1.74
#> [183] 1.75 1.76 1.77 1.78 1.79 1.80 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88
#> [197] 1.89 1.90 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.00   NA   NA
#> [211]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [225]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [239]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
#> [253]   NA   NA   NA   NA

where the first four elements are used to store genotype calls, the next three to store imputed allele counts, and the next 201 values to store dosages rounded to 2 decimal places. This allows for handling many data types (genotype calls and dosages) while storing each element using one byte only (x4 compared to bed files, but /8 compared to double-precision floating-point numbers).

Package {bigsnpr} also provides functions for directly working on bed files with a small percentage of missing values (Privé, Luu, Blum, et al., 2020).

2.3 Getting an FBM or bigSNP object

The easiest way to get an FBM is to use the constructor function FBM() or the converter as_FBM().
To read an FBM from a large text file, you can use function big_read() (see this vignette).
To read a bigSNP object from bed/bim/fam files, you can use functions snp_readBed() and snp_readBed2() (the second can read a subset of individuals/variants and use parallelism).
To read dosages from BGEN files, you can use function snp_readBGEN(). This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores. Note that this function currently works only for BGEN v1.2 with probabilities stored as 8 bits (cf. this issue), which is the case for e.g. the UK Biobank files.
To read any format used in genetics, you can always convert blocks of the data to text files using PLINK, read these using bigreadr::fread2(), and fill part of the resulting FBM. For example, see the code I used to convert the iPSYCH imputed data from the RICOPILI pipeline to my bigSNP format.

Example converting a bed file to bigSNP:

library(bigsnpr)

#> Loading required package: bigstatsr

bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
(rds <- snp_readBed2(bedfile, backingfile = tempfile()))  # get path to new file

#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file64047c5a1e58.rds"

bigsnp <- snp_attach(rds)  # can then read in the bigSNP object in any R session
(G <- bigsnp$genotypes)

#> A Filebacked Big Matrix of type 'code 256' with 517 rows and 4542 columns.

G[1:5, 1:5]

#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    0    2    0    1
#> [2,]    1    0    1    0    0
#> [3,]    0    1    1    0    2
#> [4,]    0    0    2    0    2
#> [5,]    1    0    0    0    2

str(bigsnp$fam)

#> 'data.frame':    517 obs. of  6 variables:
#>  $ family.ID  : chr  "POP1" "POP1" "POP1" "POP1" ...
#>  $ sample.ID  : chr  "IND0" "IND1" "IND2" "IND3" ...
#>  $ paternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ maternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ sex        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ affection  : int  1 1 2 1 1 1 1 1 1 1 ...

str(bigsnp$map)

#> 'data.frame':    4542 obs. of  6 variables:
#>  $ chromosome  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ marker.ID   : chr  "SNP0" "SNP1" "SNP2" "SNP3" ...
#>  $ genetic.dist: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ physical.pos: int  112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#>  $ allele1     : chr  "A" "T" "T" "T" ...
#>  $ allele2     : chr  "T" "A" "A" "A" ...

Example directly mapping the bed file:

obj.bed <- bed(bedfile)
obj.bed[1:5, 1:5]

#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    0    2    0    1
#> [2,]    1    0    1    0    0
#> [3,]    0    1    1    0    2
#> [4,]    0    0    2    0    2
#> [5,]    1    0    0    0    2

str(obj.bed$fam)

#> 'data.frame':    517 obs. of  6 variables:
#>  $ family.ID  : chr  "POP1" "POP1" "POP1" "POP1" ...
#>  $ sample.ID  : chr  "IND0" "IND1" "IND2" "IND3" ...
#>  $ paternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ maternal.ID: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ sex        : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ affection  : int  1 1 2 1 1 1 1 1 1 1 ...

str(obj.bed$map)

#> 'data.frame':    4542 obs. of  6 variables:
#>  $ chromosome  : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ marker.ID   : chr  "SNP0" "SNP1" "SNP2" "SNP3" ...
#>  $ genetic.dist: int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ physical.pos: int  112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#>  $ allele1     : chr  "A" "T" "T" "T" ...
#>  $ allele2     : chr  "T" "A" "A" "A" ...

Example converting a bgen file to bigSNP:

bgen <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/3ec770a829a753282b5cb45afc3f4eda036b246705b76f9037b6cc98c41a4194?at=example.8bits.bgen",
  fname = "example.bgen")
bgi <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/dc7276e0f0e2e096f58d2dac645aa5711de2cd64c3b29a07a80575e175344f78?at=example.8bits.bgen.bgi",
  fname = "example.bgen.bgi")
sample <- runonce::download_file(
  "https://enkre.net/cgi-bin/code/bgen/raw/a3c4d8e4c132048a502dc00a3e51362f98eda5a2889df695ba260dc48c327fd9?at=example.sample",
  fname = "example.sample")

# What should you be careful about when reading this file?
readLines(sample, n = 5)

#> [1] "ID_1"       "0"          "sample_001" "sample_002" "sample_003"

library(bigsnpr)
(var_info <- snp_readBGI(bgi))

#> # A tibble: 199 × 8
#>    chromosome position rsid     number_of_alleles allele1 allele2
#>    <chr>         <int> <chr>                <int> <chr>   <chr>  
#>  1 01             1001 RSID_101                 2 A       G      
#>  2 01             2000 RSID_2                   2 A       G      
#>  3 01             2001 RSID_102                 2 A       G      
#>  4 01             3000 RSID_3                   2 A       G      
#>  5 01             3001 RSID_103                 2 A       G      
#>  6 01             4000 RSID_4                   2 A       G      
#>  7 01             4001 RSID_104                 2 A       G      
#>  8 01             5000 RSID_5                   2 A       G      
#>  9 01             5001 RSID_105                 2 A       G      
#> 10 01             6000 RSID_6                   2 A       G      
#> # ℹ 189 more rows
#> # ℹ 2 more variables: file_start_position <dbl>, size_in_bytes <int>

(snp_id <- with(var_info[1:10, ], paste(chromosome, position, allele1, allele2, sep = "_")))

#>  [1] "01_1001_A_G" "01_2000_A_G" "01_2001_A_G" "01_3000_A_G" "01_3001_A_G"
#>  [6] "01_4000_A_G" "01_4001_A_G" "01_5000_A_G" "01_5001_A_G" "01_6000_A_G"

(rds <- snp_readBGEN(bgen, backingfile = tempfile(), list_snp_id = list(snp_id)))

#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file640451833652.rds"

bigsnp <- snp_attach(rds)
(G <- bigsnp$genotypes)

#> A Filebacked Big Matrix of type 'code 256' with 500 rows and 10 columns.

str(bigsnp$map)  # `$freq` and `$info` are computed when reading the data

#> tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ chromosome  : chr [1:10] "01" "01" "01" "01" ...
#>  $ marker.ID   : chr [1:10] "SNPID_101" "SNPID_2" "SNPID_102" "SNPID_3" ...
#>  $ rsid        : chr [1:10] "RSID_101" "RSID_2" "RSID_102" "RSID_3" ...
#>  $ physical.pos: int [1:10] 1001 2000 2001 3000 3001 4000 4001 5000 5001 6000
#>  $ allele1     : chr [1:10] "A" "A" "A" "A" ...
#>  $ allele2     : chr [1:10] "G" "G" "G" "G" ...
#>  $ freq        : num [1:10] 0.583 0.802 0.198 0.483 0.517 ...
#>  $ info        : num [1:10] 0.975 0.668 0.666 0.952 0.952 ...

There is no $fam information in this case, which should be read from other data and match using IDs from the sample file.

References

Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.