Chapter 1 Introduction

1.1 Main motivation for developing {bigstatsr} and {bigsnpr}

Before these packages existed, there was a notable lack of user-friendly and efficient R packages for genetic analyses. Existing workflows often required combining disparate software tools with inconsistent input formats, relied on text files for parameter settings, and had limited compatibility with exploratory data analysis and familiar R packages. The development of new methods was also hindered by the absence of tools supporting a simple matrix-like data structure. To address these challenges, I started developing the R package {bigsnpr} in 2016, reimplementing the statistical methods commonly used in genetic analyses within a cohesive and accessible R framework.

I later realized that many functions (e.g. for genome-wide association studies (GWAS), principal component analysis (PCA), and summary statistics) were not specific to genotype data. Indeed, both association studies and PCA are applicable to other omics data, such as transcriptomic or epigenetic datasets. I therefore moved all the functions that could be used on any data stored as a matrix into a new R package, {bigstatsr}. This is why there are two packages: {bigstatsr} can be used in any field working with data stored as large numeric matrices, while {bigsnpr} provides tools more specific to genotype data, largely building on top of {bigstatsr}. The initial description of the two packages is available in Privé, Aschard, Ziyatdinov, & Blum (2018).
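
To illustrate that {bigstatsr} is not tied to genotypes, here is a minimal sketch on simulated data standing in for any large numeric matrix. The dimensions, the random values, and the variable names are assumptions made for illustration only, and the code assumes {bigstatsr} is installed.

```r
library(bigstatsr)

set.seed(1)
n <- 200  # number of samples (illustrative)
m <- 50   # number of variables (illustrative)

# A Filebacked Big Matrix (FBM): the values are stored on disk
X <- FBM(n, m, init = rnorm(n * m))
y <- rnorm(n)  # a simulated continuous outcome

# Column-wise summaries (sums and variances)
col_stats <- big_colstats(X)

# Association testing: one linear regression per column of X
assoc <- big_univLinReg(X, y)

# Partial SVD (PCA) on the centered and scaled matrix
svd <- big_SVD(X, fun.scaling = big_scale(), k = 5)

dim(col_stats)
head(assoc)
length(svd$d)
```

Nothing in this sketch is specific to genetics; the same calls would apply to, for instance, an expression matrix stored as an FBM.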

Functions starting with big_ are part of {bigstatsr}, while functions starting with snp_ or bed_ are part of {bigsnpr}.
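
One way to see this naming convention is to list the exported functions of each package by prefix. The sketch below uses only base R; the helper `prefix_exports()` is a hypothetical name introduced here for illustration, and it returns an empty vector when the package is not installed.

```r
# Hypothetical helper: exported objects of a package whose names match a pattern
prefix_exports <- function(pkg, pattern) {
  if (!requireNamespace(pkg, quietly = TRUE)) return(character(0))
  sort(grep(pattern, getNamespaceExports(pkg), value = TRUE))
}

big_funs <- prefix_exports("bigstatsr", "^big_")
snp_funs <- prefix_exports("bigsnpr", "^(snp|bed)_")

head(big_funs)
head(snp_funs)
```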

1.2 Features

Many functions are now implemented in the two packages. You can find a comprehensive list of the available functions on the package websites of {bigstatsr} and {bigsnpr}.

The table below presents an overview of common genetic analyses that are already implemented in {bigstatsr} and {bigsnpr}. This listing is inspired by Table 1 of Visscher et al. (2017).

| Analysis | Available in {bigstatsr} and {bigsnpr} | Still missing | Citations |
|---|---|---|---|
| Polygenic risk scores | penalized regressions on individual-level data; (stacked) C+T; LDpred2; lassosum2 | multi-ancestry training | Privé, Aschard, & Blum (2019); Privé, Vilhjálmsson, Aschard, & Blum (2019); Privé, Arbel, & Vilhjálmsson (2020); Privé, Arbel, Aschard, & Vilhjálmsson (2022) |
| Population structure | principal component analysis (with automatic removal of LD); ancestry inference; fixation index (\(F_{ST}\)); local adaptation | | Privé et al. (2018); Privé, Luu, Blum, McGrath, & Vilhjálmsson (2020); Privé, Aschard, et al. (2022); Privé (2022); Privé, Luu, Vilhjálmsson, & Blum (2020) |
| GWAS | linear and logistic | mixed models; rare variant association | Privé et al. (2018) |
| Genome-wide assessment of LD | sparse correlation matrix; optimal LD splitting | | Privé et al. (2018); Privé (2021) |
| Estimation of SNP heritability | LD score regression; LDpred2-auto | | Privé, Arbel, et al. (2020); Privé, Albiñana, Arbel, Pasaniuc, & Vilhjálmsson (2023) |
| Estimation of polygenicity | LDpred2-auto | | Privé et al. (2023) |
| Estimation of genetic correlation | | need to extend LDpred2-auto | |
| Fine-mapping | LDpred2-auto | using millions of variants; integrating functional annotations | Privé et al. (2023) |
| Imputation of GWAS summary statistics | | in development | |
| Mendelian randomization | | completely missing | |
| Miscellaneous | integration with PLINK; format conversion; imputation of genotyped variants; matrix operations; summaries | | Privé et al. (2018) |

1.3 Example code

  • When you want to use a function for the first time, check the documentation and the examples in there (usually they are very short).

  • There are also many (longer) tutorials available (usually one with each paper); these are linked from this documentation or available on the packages’ websites.

  • Some other examples are provided in this extended documentation (i.e. in the next chapters).

  • The code used in all my papers is available on GitHub. It mostly consists of R scripts based on {bigsnpr}, {bigstatsr}, the tidyverse, and the futureverse (Bengtsson, 2021; Privé et al., 2018; Wickham et al., 2019).
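
As a small sketch of the first point above, the help page of a function and its examples can be reached with base R tools. `big_colstats()` is used here only as an illustrative choice of function, and the code assumes {bigstatsr} is installed.

```r
# Open the help page of a function (most useful in an interactive session)
help("big_colstats", package = "bigstatsr")

# Run the examples that come with that help page
example("big_colstats", package = "bigstatsr", ask = FALSE)
```

The same pattern should work for any other documented function of the two packages.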

1.4 Installation

Both packages are available on CRAN, so you can use install.packages():

install.packages("bigstatsr")
install.packages("bigsnpr")

To install the latest versions (from GitHub), you can use {remotes}:

# install.packages("remotes")
remotes::install_github("privefl/bigstatsr")
remotes::install_github("privefl/bigsnpr")
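
After installing, it can be useful to check which versions are available in the current library. This is a minimal sketch using base R only; the variable names are illustrative.

```r
pkgs <- c("bigstatsr", "bigsnpr")

# TRUE for each package that can be loaded from the current library
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report the version of each installed package
for (pkg in pkgs[installed]) {
  message(pkg, " ", as.character(utils::packageVersion(pkg)))
}

# Report packages that still need to be installed
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```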

1.5 Correct spellings

References

Bengtsson, H. (2021). A unifying framework for parallel and distributed processing in R using futures. The R Journal, 13, 273–291.
Privé, F. (2021). Optimal linkage disequilibrium splitting. Bioinformatics, 38, 255–256.
Privé, F. (2022). Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics, 38, 3477–3480.
Privé, F., Albiñana, C., Arbel, J., Pasaniuc, B., & Vilhjálmsson, B.J. (2023). Inferring disease architecture and predictive ability with LDpred2-auto. The American Journal of Human Genetics, 110, 2042–2055.
Privé, F., Arbel, J., Aschard, H., & Vilhjálmsson, B.J. (2022). Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.
Privé, F., Arbel, J., & Vilhjálmsson, B.J. (2020). LDpred2: better, faster, stronger. Bioinformatics, 36, 5424–5431.
Privé, F., Aschard, H., & Blum, M.G. (2019). Efficient implementation of penalized regression for genetic risk prediction. Genetics, 212, 65–74.
Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.
Privé, F., Aschard, H., Ziyatdinov, A., & Blum, M.G.B. (2018). Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 34, 2781–2787.
Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.
Privé, F., Luu, K., Vilhjálmsson, B.J., & Blum, M.G. (2020). Performing highly efficient genome scans for local adaptation with R package pcadapt version 4. Molecular Biology and Evolution, 37, 2153–2154.
Privé, F., Vilhjálmsson, B.J., Aschard, H., & Blum, M.G.B. (2019). Making the most of clumping and thresholding for polygenic scores. The American Journal of Human Genetics, 105, 1213–1221.
Visscher, P.M., Wray, N.R., Zhang, Q., Sklar, P., McCarthy, M.I., Brown, M.A., & Yang, J. (2017). 10 years of GWAS discovery: Biology, function, and translation. The American Journal of Human Genetics, 101, 5–22.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4, 1686.