Chapter 1 Introduction

1.1 Main motivation for developing {bigstatsr} and {bigsnpr}

The main motivation was for me to be able to run all my analyses within R. I was frustrated by having to use so many different pieces of software, each with its own input format and often requiring text files for parameters. This made it hard for me to build a chain of analyses, to perform exploratory analyses, or to use familiar packages. I also wanted to develop new methods, which seemed very hard to do without a simple matrix-like format. Thus I started developing the {bigsnpr} package at the beginning of my thesis.

At some point, I realized that many functions (to perform e.g. GWAS, PCA, summary statistics) were not really specific to genotype data. Indeed, a TWAS or an EWAS is not conceptually very different from a GWAS; one can also perform PCA on e.g. DNA methylation data. Therefore I decided to move all the functions that could be used on any data stored as a matrix into a new package, {bigstatsr}. This is why there are two packages: {bigstatsr} can be used in any field working with data stored as large matrices, while {bigsnpr} provides tools that are specific to genotype data and largely builds on top of {bigstatsr}. The initial description of the two packages is available in Privé, Aschard, Ziyatdinov, & Blum (2018).

Functions starting with big_ are part of {bigstatsr}, while functions starting with snp_ or bed_ are part of {bigsnpr}.
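
As a small illustration of this naming convention, and of the fact that {bigstatsr} is not tied to genotype data, the following sketch applies a few big_ functions to a simulated matrix. The matrix, its dimensions, and the phenotype y are made up for this example; only as_FBM(), big_colstats(), big_scale(), big_SVD(), and big_univLinReg() come from {bigstatsr}.

```r
library(bigstatsr)

# Simulated data standing in for any large numeric matrix
# (e.g. expression or methylation data); purely illustrative.
set.seed(1)
mat <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

# Store the matrix on disk as a Filebacked Big Matrix (FBM)
X <- as_FBM(mat)

# Summary statistics: column sums and variances
stats <- big_colstats(X)

# PCA: partial SVD of the centered and scaled matrix, keeping 5 components
svd <- big_SVD(X, fun.scaling = big_scale(), k = 5)
pcs <- predict(svd)  # PC scores (one row per sample, one column per PC)

# Association testing: one linear regression of y on each column of X
y <- rnorm(200)
assoc <- big_univLinReg(X, y)
```

The same calls would apply to a genotype matrix handled by {bigsnpr}, which is what makes the split into two packages convenient.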

1.2 Features

Many features are now implemented in the packages. You can find a comprehensive list of available functions on the package websites of {bigstatsr} and {bigsnpr}.

The following table presents an overview of common genetic analyses that are already implemented in {bigstatsr} and {bigsnpr}. This listing is inspired by Table 1 of Visscher et al. (2017).

| Analysis | Available in {bigstatsr} and {bigsnpr} | Still missing | Citations |
|---|---|---|---|
| Polygenic risk scores | penalized regressions; (stacked) C+T; LDpred2; lassosum2 | multi-ancestry training | Privé, Aschard, & Blum (2019); Privé, Vilhjálmsson, Aschard, & Blum (2019); Privé, Arbel, & Vilhjálmsson (2020); Privé, Arbel, Aschard, & Vilhjálmsson (2022) |
| Population structure | principal component analysis; ancestry inference; fixation index (\(F_{ST}\)); local adaptation | | Privé et al. (2018); Privé, Luu, Blum, McGrath, & Vilhjálmsson (2020); Privé, Aschard, et al. (2022); Privé (2022); Privé, Luu, Vilhjálmsson, & Blum (2020) |
| GWAS | linear and logistic | mixed models; rare variant association | Privé et al. (2018) |
| Genome-wide assessment of LD | sparse correlation matrix; optimal LD splitting | | Privé et al. (2018); Privé (2021) |
| Estimation of SNP heritability | LD score regression; LDpred2-auto | | Privé, Arbel, et al. (2020); Privé, Albiñana, Pasaniuc, & Vilhjálmsson (2022) |
| Estimation of polygenicity | LDpred2-auto | | Privé, Albiñana, et al. (2022) |
| Estimation of genetic correlation | | need to extend LDpred2-auto | |
| Fine-mapping | LDpred2-auto | | Privé, Albiñana, et al. (2022) |
| GWAS summary imputation | | need to extend LDpred2-auto | |
| Mendelian randomization | | completely missing | |
| Miscellaneous | integration with PLINK; format conversion; imputation of genotyped variants; matrix operations; summaries | | Privé et al. (2018) |

1.3 Example code

  • When you want to use a function for the first time, check the documentation and the examples in there (usually they are very short).

  • There are also many (longer) tutorials available (usually one with each paper), which are linked from here or are available on the packages’ websites.

  • Some other examples are provided in this extended documentation (in the next chapters).

  • All the code used in all my papers is available on GitHub. It mostly consists of R scripts based on {bigsnpr}, {bigstatsr}, the tidyverse, and the futureverse (Bengtsson, 2021; Privé et al., 2018; Wickham et al., 2019).

1.4 Installation

Both packages are available on CRAN, so you can use install.packages():

install.packages("bigstatsr")
install.packages("bigsnpr")

To install the latest versions (from GitHub), you can use {remotes}:

# install.packages("remotes")
remotes::install_github("privefl/bigstatsr")
remotes::install_github("privefl/bigsnpr")
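
After installing, it can be useful to confirm that both packages can be found and to see which versions are in use. The following is a minimal sketch using base R only; the variable names pkgs and installed are arbitrary choices for this example.

```r
# Check whether each package can be loaded, without attaching it
pkgs <- c("bigstatsr", "bigsnpr")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report the installed version of each package that was found
for (pkg in pkgs[installed]) {
  message(pkg, " ", as.character(utils::packageVersion(pkg)))
}
```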

1.5 Correct spellings

References

Bengtsson, H. (2021). A unifying framework for parallel and distributed processing in R using futures. The R Journal, 13, 273–291.
Privé, F. (2021). Optimal linkage disequilibrium splitting. Bioinformatics, 38, 255–256.
Privé, F. (2022). Using the UK Biobank as a global reference of worldwide populations: application to measuring ancestry diversity from GWAS summary statistics. Bioinformatics, 38, 3477–3480.
Privé, F., Albiñana, C., Pasaniuc, B., & Vilhjálmsson, B.J. (2022). Inferring disease architecture and predictive ability with LDpred2-auto. bioRxiv. Retrieved from https://doi.org/10.1101/2022.10.10.511629
Privé, F., Arbel, J., Aschard, H., & Vilhjálmsson, B.J. (2022). Identifying and correcting for misspecifications in GWAS summary statistics and polygenic scores. Human Genetics and Genomics Advances, 3, 100136.
Privé, F., Arbel, J., & Vilhjálmsson, B.J. (2020). LDpred2: better, faster, stronger. Bioinformatics, 36, 5424–5431.
Privé, F., Aschard, H., & Blum, M.G. (2019). Efficient implementation of penalized regression for genetic risk prediction. Genetics, 212, 65–74.
Privé, F., Aschard, H., Carmi, S., Folkersen, L., Hoggart, C., O’Reilly, P.F., & Vilhjálmsson, B.J. (2022). Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. The American Journal of Human Genetics, 109, 12–23.
Privé, F., Aschard, H., Ziyatdinov, A., & Blum, M.G.B. (2018). Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr. Bioinformatics, 34, 2781–2787.
Privé, F., Luu, K., Blum, M.G., McGrath, J.J., & Vilhjálmsson, B.J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36, 4449–4457.
Privé, F., Luu, K., Vilhjálmsson, B.J., & Blum, M.G. (2020). Performing highly efficient genome scans for local adaptation with R package pcadapt version 4. Molecular Biology and Evolution, 37, 2153–2154.
Privé, F., Vilhjálmsson, B.J., Aschard, H., & Blum, M.G.B. (2019). Making the most of clumping and thresholding for polygenic scores. The American Journal of Human Genetics, 105, 1213–1221.
Visscher, P.M., Wray, N.R., Zhang, Q., Sklar, P., McCarthy, M.I., Brown, M.A., & Yang, J. (2017). 10 years of GWAS discovery: Biology, function, and translation. The American Journal of Human Genetics, 101, 5–22.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4, 1686.