Chapter 1 Introduction
1.1 Main motivation for developing {bigstatsr} and {bigsnpr}
The main motivation was to be able to run all my analyses within R. I was frustrated by having to use many different pieces of software, each with its own input format and each requiring text files for parameters. This made it hard to build a chain of analyses, to perform exploratory analyses, or to use familiar packages. I also wanted to develop new methods, which seemed very hard to do without a simple matrix-like format. Thus I started developing the {bigsnpr} package at the beginning of my thesis.
At some point, I realized that many functions (e.g. to perform GWAS, PCA, or compute summary statistics) were not really specific to genotype data. Indeed, a TWAS or an EWAS is not conceptually very different from a GWAS; one can also perform PCA on e.g. DNA methylation data. Therefore I decided to move all the functions that could be used on any data stored as a matrix into a new package, {bigstatsr}. This is why there are two packages: {bigstatsr} can be used in any field working with data stored as large matrices, while {bigsnpr} provides tools that are more specific to genotype data and largely builds on top of {bigstatsr}. The initial description of the two packages is available in Privé, Aschard, Ziyatdinov, & Blum (2018).
Functions starting with big_ are part of {bigstatsr}, while functions starting with snp_ or bed_ are part of {bigsnpr}.
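As a minimal sketch of this split (assuming both packages are installed), the prefixes can be checked against each package's exported functions, and a few big_ functions can be applied to a matrix that has nothing to do with genotypes. The simulated data and the object names (X, y, obj_svd, pcs, assoc) are only illustrative assumptions, not taken from the papers.

```r
library(bigstatsr)

# Exported functions, grouped by the prefixes described above
# (getNamespaceExports() is base R; both packages must be installed)
big_funs <- grep("^big_", getNamespaceExports("bigstatsr"), value = TRUE)
snp_funs <- grep("^(snp|bed)_", getNamespaceExports("bigsnpr"), value = TRUE)
head(sort(big_funs))
head(sort(snp_funs))

# {bigstatsr} on generic (non-genotype) data: a simulated 200 x 50 matrix
set.seed(1)
X <- as_FBM(matrix(rnorm(200 * 50), nrow = 200))  # Filebacked Big Matrix
y <- rnorm(200)

# PCA via a partial SVD of the scaled matrix, keeping 5 components
obj_svd <- big_SVD(X, fun.scaling = big_scale(), k = 5)
pcs <- predict(obj_svd)  # matrix of PC scores, one row per sample

# One linear regression per column (the same idea as a GWAS, EWAS or TWAS)
assoc <- big_univLinReg(X, y.train = y)
head(assoc)
```

Here the matrix is small enough to fit in memory; the point is only that nothing in these calls assumes the columns are genetic variants.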
1.2 Features
Many features are now implemented in the packages. You can find a comprehensive list of available functions on the package websites of {bigstatsr} and {bigsnpr}.
The following table presents an overview of common genetic analyses that are already implemented in {bigstatsr} and {bigsnpr}. This listing is inspired by Table 1 of Visscher et al. (2017).
Analysis | Available in {bigstatsr} and {bigsnpr} | Still missing | Citations |
---|---|---|---|
Polygenic risk scores | penalized regressions; (stacked) C+T; LDpred2; lassosum2 | multi-ancestry training | Privé, Aschard, & Blum (2019); Privé, Vilhjálmsson, Aschard, & Blum (2019); Privé, Arbel, & Vilhjálmsson (2020); Privé, Arbel, Aschard, & Vilhjálmsson (2022) |
Population structure | principal component analysis; ancestry inference; fixation index (\(F_{ST}\)); local adaptation | | Privé et al. (2018); Privé, Luu, Blum, McGrath, & Vilhjálmsson (2020); Privé, Aschard, et al. (2022); Privé (2022); Privé, Luu, Vilhjálmsson, & Blum (2020) |
GWAS | linear and logistic | mixed models; rare variant association | Privé et al. (2018) |
Genome-wide assessment of LD | sparse correlation matrix; optimal LD splitting | | Privé et al. (2018); Privé (2021) |
Estimation of SNP heritability | LD score regression; LDpred2-auto | | Privé, Arbel, et al. (2020); Privé, Albiñana, Pasaniuc, & Vilhjálmsson (2022) |
Estimation of polygenicity | LDpred2-auto | | Privé, Albiñana, et al. (2022) |
Estimation of genetic correlation | | need to extend LDpred2-auto | |
Fine-mapping | LDpred2-auto | | Privé, Albiñana, et al. (2022) |
GWAS summary imputation | | need to extend LDpred2-auto | |
Mendelian randomization | | completely missing | |
Miscellaneous | integration with PLINK; format conversion; imputation of genotyped variants; matrix operations; summaries | | Privé et al. (2018) |
1.3 Example code
When you want to use a function for the first time, check the documentation and the examples in there (usually they are very short).
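For instance, the help page and its examples can be reached directly from the R console. The sketch below uses only base R helpers; big_SVD() is picked arbitrarily as the function to look up, and {bigstatsr} is assumed to be installed.

```r
# Open the help page of a function (here big_SVD() from {bigstatsr})
help("big_SVD", package = "bigstatsr")

# Run the examples section of that help page, if it has one
example("big_SVD", package = "bigstatsr")
```

The same two calls work for any other function in either package by changing the topic and the package name.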
There are also many (longer) tutorials available (usually one with each paper), which are linked from here or available on the packages’ websites.
Some other examples are provided in this extended documentation (in the next chapters).
All the code used in all my papers is available on GitHub. It mostly consists of R scripts based on {bigsnpr}, {bigstatsr}, the tidyverse, and the futureverse (Bengtsson, 2021; Privé et al., 2018; Wickham et al., 2019).
1.4 Installation
Both packages are available on CRAN, so you can use install.packages():
install.packages("bigstatsr")
install.packages("bigsnpr")
To install the latest versions (from GitHub), you can use {remotes}:
# install.packages("remotes")
remotes::install_github("privefl/bigstatsr")
remotes::install_github("privefl/bigsnpr")
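After installing, a quick check along these lines confirms that both packages load and shows which versions are in use. This is only a sketch using base R's packageVersion(); the version numbers printed will depend on whether you installed from CRAN or from GitHub.

```r
# Load both packages; an error here means the installation did not succeed
library(bigstatsr)
library(bigsnpr)

# Report the installed versions
packageVersion("bigstatsr")
packageVersion("bigsnpr")
```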
1.5 Correct spellings
A friendly reminder:
Florian Privé (@privefl), November 18, 2020
The correct spelling is
- bigstatsr – not bigstatr / BIGstatsR
- bigsnpr – not BIGsnpR / bigSNPr
- pcadapt – not PCAdapt
- LDpred – not LDPred
Thank you
The kittens thank you too pic.twitter.com/S8wyE4G6BG