class: center, middle, inverse, title-slide

# The R package bigstatsr: Memory- and Computation-Efficient Tools for Big Matrices
## useR!2017 lightning talk
### Florian Privé (@privefl)
### July 6, 2017

---

## About

I'm a PhD student (2016-2019) in **Predictive Human Genetics** in Grenoble.

`$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations}}}$$`

<img src="https://r-in-grenoble.github.io/cover.jpg" style="display: block; margin: auto;" />

---

## Very large genotype matrices

- currently: 15K x 300K, [celiac disease](http://www.nature.com/ng/journal/v42/n4/abs/ng.543.html)

- soon: 500K x 800K, [UK Biobank](https://doi.org/10.1371/journal.pmed.1001779)

<img src="https://media.giphy.com/media/3o7bueyxGydy48Lwgo/giphy.gif" width="65%" style="display: block; margin: auto;" />

---

## Problem I had

<img src="memory-problem.svg" width="100%" style="display: block; margin: auto;" />

---

## Solution I found

<img src="memory-solution.svg" width="100%" style="display: block; margin: auto;" />

.footnote[Michael J. Kane, John Emerson, Stephen Weston (2013).]

---

## Accessors similar to those of R matrices

<br/>

<img src="http://i.ebayimg.com/images/i/200955927319-0-1/s-l1000.jpg" width="80%" style="display: block; margin: auto;" />

---

## Split-(par)Apply-Combine Strategy

### Apply standard R functions to big matrices (in parallel)

<img src="split-apply-combine.svg" width="100%" style="display: block; margin: auto;" />

.footnote[Strategy coined by Hadley Wickham (2011).]

---

<!-- ## Matrix operations -->

<!-- - (cross-)products with matrices/vectors -->

<!-- - special tricks for handling scaling ([vignette](https://privefl.github.io/bigstatsr/articles/operations-with-scaling.html) and [blog post](https://goo.gl/L8cNbo)) -->

<!-- <br/> -->

<!-- ### Example: computation of correlation of a 100,000 x 5000 matrix -->

<!-- - `cor`: 22 minutes -->

<!-- - `big_cor`: 1 minute -->

<!-- --- -->

## Accessors similar to those of Rcpp matrices

<img src="rcpp-trust.svg" width="100%" style="display: block; margin: auto;" />

---

## Partial Singular Value Decomposition

15K x 100K `big.matrix`, 6 cores, K = 10, **1 min** (vs 2 h in base R)

<br/>

<img src="you-are-lightning-fast.jpg" width="60%" style="display: block; margin: auto;" />

.footnote[Based on the R package **RSpectra**.]

---

## Sparse linear models: **biglasso**

<img src="https://raw.githubusercontent.com/YaohuiZeng/biglasso/master/vignettes/2016-11-20_vary_p_pkgs.png" width="70%" style="display: block; margin: auto;" />

.footnote[Zeng, Y., and Breheny, P. (2017).]

---

## Other functions

- matrix operations (Split-Apply-Combine strategy)

- association of each variable with an output (RcppArmadillo)

- plotting functions (ggplot2)

- reading from text files

- others...

---

class: inverse, center, middle

# I'm now able
# to run algorithms
# on 100 GB of data

---

## R Packages

<img src="recap-packages.svg" width="100%" style="display: block; margin: auto;" />

.footnote[Paper in preparation: "Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr".]

---

## Contributors are welcome!

<img src="cat-help.jpg" width="80%" style="display: block; margin: auto;" />

---

class: inverse, center, middle

# Thanks!

<br/><br/>

Package's website: https://privefl.github.io/bigstatsr/

Twitter and GitHub: [@privefl](https://twitter.com/privefl)

Presentation available online: https://goo.gl/cv7L5s

.footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]
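
---

## Appendix: Split-(par)Apply-Combine sketch

A minimal base-R sketch of the strategy from the earlier slide, assuming an ordinary in-memory matrix. `block_apply` and its arguments are hypothetical names chosen for illustration; this is not the bigstatsr API.

```r
# Hypothetical helper (not part of bigstatsr):
# split the columns into blocks, apply `fun` to each block, combine the results.
block_apply <- function(X, fun, combine, block_size = 1000, ncores = 1) {
  starts <- seq(1, ncol(X), by = block_size)
  blocks <- lapply(starts, function(s) s:min(s + block_size - 1, ncol(X)))
  # With ncores = 1 this is a plain lapply(); forking (ncores > 1) needs a Unix-like OS.
  res <- parallel::mclapply(
    blocks,
    function(ind) fun(X[, ind, drop = FALSE]),
    mc.cores = ncores
  )
  do.call(combine, res)
}

set.seed(1)
X <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
sums <- block_apply(X, fun = colSums, combine = c, block_size = 8)
stopifnot(isTRUE(all.equal(sums, colSums(X))))
```

Only one block of columns is processed at a time per core, which is what keeps memory use bounded when the matrix itself lives on disk.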