Chapter 2 Inputs and formats
2.1 In {bigstatsr}
The format provided in package {bigstatsr} is called a Filebacked Big Matrix (FBM). It is an on-disk matrix format which is accessed through memory-mapping.
How memory-mapping works:
- when you access the 1st element (1st row, 1st col), it accesses a block (say the first column) from disk into memory (RAM)
- when you access the 2nd element (2nd row, 1st col), it is already in memory so it is accessed very fast
- when you access the second column, you access from disk again (once)
- you can access many columns like that, until you do not have enough memory anymore
- some space is freed automatically so that new columns can be accessed into memory
- everything is seamlessly managed by the operating system (OS)
- it is also very convenient for parallelism as data is shared between processes
All the elements of an FBM have the same type; supported types are:
"double"
(the default, double precision – 8 bytes per element)"float"
(single precision – 4 bytes)"integer"
(signed, so between \(\text{-}2^{31}\) and (\(2^{31} \text{ - } 1\)) – 4 bytes)"unsigned short"
: can store integer values from \(0\) to \(65535\) (2 bytes)"raw"
or"unsigned char"
: can store integer values from \(0\) to \(255\) (1 byte). It is the basis for class FBM.code256 that can access 256 arbitrary different numeric values (decoded using aCODE_*
), which is used in package {bigsnpr} (see below).
2.2 In {bigsnpr}
Package {bigsnpr} uses a class called bigSNP
for representing SNP data. A bigSNP
object is merely a list with the following elements:
$genotypes
: AFBM.code256
. Rows are samples and columns are genetic variants. This stores genotype calls or dosages (rounded to 2 decimal places).$fam
: Adata.frame
with some information on the samples.$map
: Adata.frame
with some information on the genetic variants.
The code used in class FBM.code256 for imputed data is e.g.
#> [1] 0.00 1.00 2.00 NA 0.00 1.00 2.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06
#> [15] 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20
#> [29] 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34
#> [43] 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
#> [57] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62
#> [71] 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76
#> [85] 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90
#> [99] 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00 1.01 1.02 1.03 1.04
#> [113] 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18
#> [127] 1.19 1.20 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28 1.29 1.30 1.31 1.32
#> [141] 1.33 1.34 1.35 1.36 1.37 1.38 1.39 1.40 1.41 1.42 1.43 1.44 1.45 1.46
#> [155] 1.47 1.48 1.49 1.50 1.51 1.52 1.53 1.54 1.55 1.56 1.57 1.58 1.59 1.60
#> [169] 1.61 1.62 1.63 1.64 1.65 1.66 1.67 1.68 1.69 1.70 1.71 1.72 1.73 1.74
#> [183] 1.75 1.76 1.77 1.78 1.79 1.80 1.81 1.82 1.83 1.84 1.85 1.86 1.87 1.88
#> [197] 1.89 1.90 1.91 1.92 1.93 1.94 1.95 1.96 1.97 1.98 1.99 2.00 NA NA
#> [211] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [225] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [239] NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#> [253] NA NA NA NA
where the first four elements are used to store genotype calls, the next three to store imputed allele counts, and the next 201 values to store dosages rounded to 2 decimal places. This allows for handling many data types (genotype calls and dosages) while storing each element using one byte only (x4 compared to bed files, but /8 compared to double-precision floating-point numbers).
Package {bigsnpr} also provides functions for directly working on bed files with a small percentage of missing values (Privé, Luu, Blum, et al., 2020).
2.3 Getting an FBM or bigSNP object
The easiest way to get an FBM is to use the constructor function
FBM()
or the converteras_FBM()
.To read an FBM from a large text file, you can use function
big_read()
(see this vignette).To read a bigSNP object from bed/bim/fam files, you can use functions
snp_readBed()
andsnp_readBed2()
(the second can read a subset of individuals/variants and use parallelism).To read dosages from BGEN files, you can use function
snp_readBGEN()
. This function takes around 40 minutes to read 1M variants for 400K individuals using 15 cores. Note that this function currently works only for BGEN v1.2 with probabilities stored as 8 bits (cf. this issue), which is the case for e.g. the UK Biobank files.To read any format used in genetics, you can always convert blocks of the data to text files using PLINK, read these using
bigreadr::fread2()
, and fill part of the resulting FBM. For example, see the code I used to convert the iPSYCH imputed data from the RICOPILI pipeline to my bigSNP format.
Example converting a bed file to bigSNP:
#> Loading required package: bigstatsr
bedfile <- system.file("extdata", "example.bed", package = "bigsnpr")
(rds <- snp_readBed2(bedfile, backingfile = tempfile())) # get path to new file
#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file64047c5a1e58.rds"
bigsnp <- snp_attach(rds) # can then read in the bigSNP object in any R session
(G <- bigsnp$genotypes)
#> A Filebacked Big Matrix of type 'code 256' with 517 rows and 4542 columns.
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 0 2 0 1
#> [2,] 1 0 1 0 0
#> [3,] 0 1 1 0 2
#> [4,] 0 0 2 0 2
#> [5,] 1 0 0 0 2
#> 'data.frame': 517 obs. of 6 variables:
#> $ family.ID : chr "POP1" "POP1" "POP1" "POP1" ...
#> $ sample.ID : chr "IND0" "IND1" "IND2" "IND3" ...
#> $ paternal.ID: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ maternal.ID: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ sex : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ affection : int 1 1 2 1 1 1 1 1 1 1 ...
#> 'data.frame': 4542 obs. of 6 variables:
#> $ chromosome : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ marker.ID : chr "SNP0" "SNP1" "SNP2" "SNP3" ...
#> $ genetic.dist: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ physical.pos: int 112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#> $ allele1 : chr "A" "T" "T" "T" ...
#> $ allele2 : chr "T" "A" "A" "A" ...
Example directly mapping the bed file:
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 0 2 0 1
#> [2,] 1 0 1 0 0
#> [3,] 0 1 1 0 2
#> [4,] 0 0 2 0 2
#> [5,] 1 0 0 0 2
#> 'data.frame': 517 obs. of 6 variables:
#> $ family.ID : chr "POP1" "POP1" "POP1" "POP1" ...
#> $ sample.ID : chr "IND0" "IND1" "IND2" "IND3" ...
#> $ paternal.ID: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ maternal.ID: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ sex : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ affection : int 1 1 2 1 1 1 1 1 1 1 ...
#> 'data.frame': 4542 obs. of 6 variables:
#> $ chromosome : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ marker.ID : chr "SNP0" "SNP1" "SNP2" "SNP3" ...
#> $ genetic.dist: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ physical.pos: int 112 1098 2089 3107 4091 5091 6107 7103 8090 9074 ...
#> $ allele1 : chr "A" "T" "T" "T" ...
#> $ allele2 : chr "T" "A" "A" "A" ...
Example converting a bgen file to bigSNP:
bgen <- runonce::download_file(
"https://enkre.net/cgi-bin/code/bgen/raw/3ec770a829a753282b5cb45afc3f4eda036b246705b76f9037b6cc98c41a4194?at=example.8bits.bgen",
fname = "example.bgen")
bgi <- runonce::download_file(
"https://enkre.net/cgi-bin/code/bgen/raw/dc7276e0f0e2e096f58d2dac645aa5711de2cd64c3b29a07a80575e175344f78?at=example.8bits.bgen.bgi",
fname = "example.bgen.bgi")
sample <- runonce::download_file(
"https://enkre.net/cgi-bin/code/bgen/raw/a3c4d8e4c132048a502dc00a3e51362f98eda5a2889df695ba260dc48c327fd9?at=example.sample",
fname = "example.sample")
#> [1] "ID_1" "0" "sample_001" "sample_002" "sample_003"
#> # A tibble: 199 × 8
#> chromosome position rsid number_of_alleles allele1 allele2
#> <chr> <int> <chr> <int> <chr> <chr>
#> 1 01 1001 RSID_101 2 A G
#> 2 01 2000 RSID_2 2 A G
#> 3 01 2001 RSID_102 2 A G
#> 4 01 3000 RSID_3 2 A G
#> 5 01 3001 RSID_103 2 A G
#> 6 01 4000 RSID_4 2 A G
#> 7 01 4001 RSID_104 2 A G
#> 8 01 5000 RSID_5 2 A G
#> 9 01 5001 RSID_105 2 A G
#> 10 01 6000 RSID_6 2 A G
#> # ℹ 189 more rows
#> # ℹ 2 more variables: file_start_position <dbl>, size_in_bytes <int>
#> [1] "01_1001_A_G" "01_2000_A_G" "01_2001_A_G" "01_3000_A_G" "01_3001_A_G"
#> [6] "01_4000_A_G" "01_4001_A_G" "01_5000_A_G" "01_5001_A_G" "01_6000_A_G"
#> [1] "C:\\Users\\au639593\\AppData\\Local\\Temp\\RtmpITf9T6\\file640451833652.rds"
#> A Filebacked Big Matrix of type 'code 256' with 500 rows and 10 columns.
#> tibble [10 × 8] (S3: tbl_df/tbl/data.frame)
#> $ chromosome : chr [1:10] "01" "01" "01" "01" ...
#> $ marker.ID : chr [1:10] "SNPID_101" "SNPID_2" "SNPID_102" "SNPID_3" ...
#> $ rsid : chr [1:10] "RSID_101" "RSID_2" "RSID_102" "RSID_3" ...
#> $ physical.pos: int [1:10] 1001 2000 2001 3000 3001 4000 4001 5000 5001 6000
#> $ allele1 : chr [1:10] "A" "A" "A" "A" ...
#> $ allele2 : chr [1:10] "G" "G" "G" "G" ...
#> $ freq : num [1:10] 0.583 0.802 0.198 0.483 0.517 ...
#> $ info : num [1:10] 0.975 0.668 0.666 0.952 0.952 ...
There is no $fam
information in this case, which should be read from other data and match using IDs from the sample file.