Chapter 1 Introduction
1.1 Main motivation for developing {bigstatsr} and {bigsnpr}
When I started this work, there was a notable lack of user-friendly and efficient R packages for genetic analyses. Existing workflows often required disparate software tools with inconsistent input formats, relied on text files for parameter settings, and offered limited compatibility with exploratory data analysis and familiar R packages. In addition, the development of new methods was hindered by the absence of tools supporting a simple matrix-like data structure. To address these challenges, I started developing the R package {bigsnpr} in 2016, reimplementing the statistical methods commonly used in genetic analyses within a cohesive and accessible R framework.
At some point, I realized that many functions (e.g. to perform genome-wide association studies (GWAS), principal component analysis (PCA), or compute summary statistics) were not specific to genotype data. Indeed, both association studies and PCA are applicable to other omics data, such as transcriptomic or epigenetic datasets. Therefore, I decided to move all the functions that can be used on any data stored as a matrix into a new R package, {bigstatsr}. This is why there are two packages: {bigstatsr} can be used in any field that works with data stored as large numeric matrices, while {bigsnpr} provides tools more specific to genotype data, largely building on top of {bigstatsr}. The initial description of the two packages is available in Privé, Aschard, Ziyatdinov, & Blum (2018).
Functions starting with `big_` are part of {bigstatsr}, while functions starting with `snp_` or `bed_` are part of {bigsnpr}.
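As a rough illustration of this naming convention, the sketch below calls one function from each package. It assumes both packages are installed; the matrix dimensions and random values are arbitrary choices for the example, and the genotype data is the small example dataset shipped with {bigsnpr}.

```r
library(bigstatsr)
library(bigsnpr)

# {bigstatsr}: generic functions (prefix big_) on any large numeric matrix,
# here a small file-backed big matrix (FBM) filled with random values
X <- FBM(100, 20, init = rnorm(100 * 20))
stats <- big_colstats(X)      # data frame of column sums and variances
head(stats)

# {bigsnpr}: genotype-specific functions (prefix snp_)
obj <- snp_attachExtdata()    # example genotype data shipped with {bigsnpr}
G <- obj$genotypes            # genotype matrix (individuals x variants)
maf <- snp_MAF(G)             # minor allele frequency of each variant
summary(maf)
```

The same split applies throughout: anything that only needs a matrix lives in {bigstatsr}, while anything that interprets the values as genotypes lives in {bigsnpr}.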
1.2 Features
There are now many functions implemented in the packages. You can find a comprehensive list of available functions on the package website of {bigstatsr} and of {bigsnpr}.
The following table presents an overview of common genetic analyses that are already implemented in {bigstatsr} and {bigsnpr}. This listing is inspired by Table 1 of Visscher et al. (2017).
Analysis | Available in {bigstatsr} and {bigsnpr} | Still missing | Citations |
---|---|---|---|
Polygenic risk scores | penalized regressions on individual-level data; (stacked) C+T; LDpred2; lassosum2 | multi-ancestry training | Privé, Aschard, & Blum (2019); Privé, Vilhjálmsson, Aschard, & Blum (2019); Privé, Arbel, & Vilhjálmsson (2020); Privé, Arbel, Aschard, & Vilhjálmsson (2022) |
Population structure | principal component analysis (with automatic removal of LD); ancestry inference; fixation index (\(F_{ST}\)); local adaptation | | Privé et al. (2018); Privé, Luu, Blum, McGrath, & Vilhjálmsson (2020); Privé, Aschard, et al. (2022); Privé (2022); Privé, Luu, Vilhjálmsson, & Blum (2020) |
GWAS | linear and logistic | mixed models; rare variant association | Privé et al. (2018) |
Genome-wide assessment of LD | sparse correlation matrix; optimal LD splitting | | Privé et al. (2018); Privé (2021) |
Estimation of SNP heritability | LD score regression; LDpred2-auto | | Privé, Arbel, et al. (2020); Privé, Albiñana, Arbel, Pasaniuc, & Vilhjálmsson (2023) |
Estimation of polygenicity | LDpred2-auto | | Privé et al. (2023) |
Estimation of genetic correlation | | need to extend LDpred2-auto | |
Fine-mapping | LDpred2-auto | using millions of variants; integrating functional annotations | Privé et al. (2023) |
Imputation of GWAS summary statistics | | in development | |
Mendelian randomization | | completely missing | |
Miscellaneous | integration with PLINK; format conversion; imputation of genotyped variants; matrix operations; summaries | | Privé et al. (2018) |
1.3 Example code
When you want to use a function for the first time, check the documentation and the examples in there (usually they are very short).
There are also many (longer) tutorials available (usually one with each paper), which will be linked from here or are available at the packages’ websites.
Some other examples are provided in this extended documentation (i.e. in the next chapters).
All the code used in all my papers is available on GitHub. It mostly consists of R scripts based on {bigsnpr}, {bigstatsr}, the tidyverse, and the futureverse (Bengtsson, 2021; Privé et al., 2018; Wickham et al., 2019).
1.4 Installation
Both packages are available on CRAN, so you can install them with `install.packages()`.
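A minimal sketch of the CRAN installation, assuming the packages are published under the names used throughout this chapter:

```r
# Install the released versions of both packages from CRAN
install.packages(c("bigstatsr", "bigsnpr"))
```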
To install the latest versions (from GitHub), you can use {remotes}.
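A sketch of the GitHub installation follows. The repository paths are an assumption here (the author's GitHub account, `privefl`, with one repository per package); check the package websites for the authoritative location.

```r
# install.packages("remotes")  # uncomment if {remotes} is not installed yet

# Assumed repository paths: privefl/bigstatsr and privefl/bigsnpr
remotes::install_github("privefl/bigstatsr")
remotes::install_github("privefl/bigsnpr")
```

Installing {bigstatsr} first is a deliberate choice in this sketch, since {bigsnpr} builds on top of it.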
1.5 Correct spellings
A friendly reminder from Florian Privé (@privefl), November 18, 2020:
The correct spelling is
- bigstatsr – not bigstatr / BIGstatsR
- bigsnpr – not BIGsnpR / bigSNPr
- pcadapt – not PCAdapt
- LDpred – not LDPred
Thank you
The kittens thank you too pic.twitter.com/S8wyE4G6BG