class: title-slide center middle inverse

# The R package {bigstatsr}:<br/>memory- and computation-efficient tools<br/>for big matrices stored on disk

## Florian Privé (@privefl)

### Rencontres R 2018

**Slides:** https://privefl.github.io/RR18/bigstatsr.html

---
class: center middle inverse

# Introduction & Motivation

---

## About

I'm a PhD student (2016-2019) in **Predictive Human Genetics** in Grenoble.

`$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations} + \cdots}}$$`

<img src="https://r-in-grenoble.github.io/cover.jpg" style="display: block; margin: auto;" />

---

## Analyze very large genotype matrices

- previously: 15K x 280K, [celiac disease](https://doi.org/10.1038/ng.543) (~30GB)

- currently: 500K x 500K, [UK Biobank](https://doi.org/10.1101/166298) (~2TB)

<img src="https://media.giphy.com/media/3o7bueyxGydy48Lwgo/giphy.gif" width="55%" style="display: block; margin: auto;" />

.footnote[But I still want to use R...]

---

## The solution I found

<img src="memory-solution.svg" width="90%" style="display: block; margin: auto;" />

.footnote[Format `FBM` is very similar to format `filebacked.big.matrix` from package {bigmemory} (details in [this vignette](https://privefl.github.io/bigstatsr/articles/bigstatsr-and-bigmemory.html)).]

---
class: center middle inverse

# Simple accessors

---

## Accessors similar to R matrices

```r
X <- FBM(2, 5, init = 1:10, backingfile = "test")
```

```r
X$backingfile
```

```
## [1] "/home/privef/Bureau/RR18/test.bk"
```

```r
X[, 1]  ## ok
```

```
## [1] 1 2
```

```r
X[1, ]  ## bad
```

```
## [1] 1 3 5 7 9
```

```r
X[]     ## super bad
```

```
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
```

---

## Accessors similar to R matrices

```r
colSums(X[])  ## super bad
```

```
## [1]  3  7 11 15 19
```

<br>

<img src="caution.jpg" width="70%" style="display: block; margin: auto;" />

---

## Split-(par)Apply-Combine Strategy

### Apply standard R functions to big matrices (in parallel)

<img src="split-apply-combine.svg" width="95%" style="display: block; margin: auto;" />

.footnote[Implemented in `big_apply()`.]

---

## Accessors similar to Rcpp matrices

```cpp
// [[Rcpp::depends(BH, bigstatsr)]]
#include <bigstatsr/BMAcc.h>

// [[Rcpp::export]]
NumericVector big_colsums(Environment BM) {

  XPtr<FBM> xpBM = BM["address"];
  BMAcc<double> macc(xpBM);

* size_t n = macc.nrow();
* size_t m = macc.ncol();

  NumericVector res(m);

  for (size_t j = 0; j < m; j++)
    for (size_t i = 0; i < n; i++)
*     res[j] += macc(i, j);

  return res;
}
```

---
class: center middle inverse

# Some examples

# from my work

---

## Partial Singular Value Decomposition

15K `\(\times\)` 100K -- first 10 PCs -- 6 cores -- **1 min** (vs 2h in base R)

<br>

<img src="PC1-4.png" width="90%" style="display: block; margin: auto;" />

.footnote[Implemented in `big_randomSVD()`, powered by R packages {RSpectra} and {Rcpp}.]
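---

## Partial SVD: a minimal sketch

A small, hedged illustration of the call pattern behind the previous slide. The simulated 200 x 50 matrix, the seed, and `k = 5` are assumptions chosen for illustration, not the genotype data or settings used in the talk.

```r
library(bigstatsr)

set.seed(1)
# Hypothetical toy data standing in for a genotype matrix
X <- FBM(200, 50, init = rnorm(200 * 50))

# Partial SVD: only the first k singular vectors/values are computed.
# big_scale() centers and scales the columns without copying the matrix.
obj_svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 5)

obj_svd$d                   # the 5 largest singular values
scores <- predict(obj_svd)  # PC scores (U times diag(d))
dim(scores)
```

Only `k` components are computed, which is what keeps the decomposition tractable on matrices that do not fit in RAM.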
---

## Multiple association testing

### Which DNA mutations are associated with one disease?

<br>

<img src="celiac-gwas-cut.png" width="90%" style="display: block; margin: auto;" />

---

## Sparse linear models

### Predicting complex diseases via penalized logistic regression

15K `\(\times\)` 280K -- 6 cores -- **2 min**

<img src="density-scores.svg" width="75%" style="display: block; margin: auto;" />

---
class: center middle inverse

# Conclusion

---
class: inverse, center, middle

# I'm able to run algorithms

# on 100GB of data

# in R on my computer

---

## Advantages of using FBM objects

<br>

- you can apply algorithms on **data larger than your RAM**,

- you can easily **parallelize** your algorithms because the data on disk is shared,

- you write **more efficient algorithms** (you make fewer copies and think more about what you're doing),

- you can use **different types of data**, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See [the documentation of the FBM class](https://privefl.github.io/bigstatsr/reference/FBM-class.html) for details.

---

## R Packages

<br>

<a href="https://doi.org/10.1093/bioinformatics/bty185" target="_blank">
<img src="bty185.png" width="70%" style="display: block; margin: auto;" />
</a>

<br>

- {bigstatsr}: to be used by any field of research

- {bigsnpr}: algorithms specific to my field of research

---

## Contributors are welcome!

<img src="cat-help.jpg" width="80%" style="display: block; margin: auto;" />

---

## Make sure to grab a hex sticker

<br>

<img src="https://raw.githubusercontent.com/privefl/bigstatsr/master/bigstatsr.png" width="45%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Thanks!

<br/><br/>

Presentation: https://privefl.github.io/RR18/bigstatsr.html

Package's website: https://privefl.github.io/bigstatsr/

DOI: [10.1093/bioinformatics/bty185](https://doi.org/10.1093/bioinformatics/bty185)

<br/>

Twitter: [privefl](https://twitter.com/privefl)
GitHub: [privefl](https://github.com/privefl)
Stack Overflow: [F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9)

.footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]
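
---

## Appendix: column sums without `X[]`

The split-apply-combine slide can be sketched with `big_apply()` on the same toy 2 x 5 matrix used for the accessors. This is a minimal sketch, not the code from the talk: `block.size = 2` is an assumption chosen so that the tiny matrix is really processed in several blocks, and the matrix is backed by a temporary file rather than `"test"`.

```r
library(bigstatsr)

# Same toy matrix as in the accessor slides (backed by a temporary file)
X <- FBM(2, 5, init = 1:10)

# Split: columns are cut into blocks of at most 2 columns.
# Apply: colSums() on each block, so only that block is read into memory.
# Combine: per-block results are concatenated with c().
col_sums <- big_apply(
  X,
  a.FUN = function(X, ind) colSums(X[, ind, drop = FALSE]),
  a.combine = "c",
  block.size = 2
)

col_sums
```

Unlike `colSums(X[])`, this never loads the whole matrix at once; `ncores` can be increased to process the blocks in parallel.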