# The R package bigstatsr: Memory- and Computation-Efficient Tools for Big Matrices
## useR!2017 lightning talk
### Florian Privé (@privefl)
### July 6, 2017

---

## About

I'm a PhD student (2016-2019) in **Predictive Human Genetics** in Grenoble.

`$$\boxed{\Large{\text{Disease} \sim \text{DNA mutations}}}$$`

---

## Very large genotype matrices

- currently: 15K x 300K, [celiac disease](http://www.nature.com/ng/journal/v42/n4/abs/ng.543.html)

- soon: 500K x 800K, [UK Biobank](https://doi.org/10.1371/journal.pmed.1001779)
 
<img src="https://media.giphy.com/media/3o7bueyxGydy48Lwgo/giphy.gif" width="65%" style="display: block; margin: auto;" />

---

## Problem I had

---

## Solution I found

---

## Accessors similar to those of R matrices
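
A minimal sketch of what this can look like, assuming a recent bigstatsr release in which the on-disk matrix class is `FBM` (these slides refer to `big.matrix`, so the class used in the talk differs). The dimensions and values are made up for illustration:

```r
library(bigstatsr)

# Small file-backed matrix; the values live in a temporary file on disk
X <- FBM(10, 5)

dim(X)                # dimensions, as for a base R matrix
X[] <- 1:50           # fill the whole matrix (column-major order)
X[, 1]                # read one column into memory
sub <- X[1:3, 1:2]    # read a sub-matrix as a regular R matrix
```

Only the parts that are indexed are read into memory, which is why the same syntax stays usable on matrices larger than RAM.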

---

## Split-(par)Apply-Combine Strategy

### Apply standard R functions to big matrices (in parallel)
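
As a hedged illustration of the strategy, the sketch below assumes a recent bigstatsr release (class `FBM`, function `big_apply()`); the matrix is a small simulated one, not the data from the talk. Columns are split into blocks, a standard R function is applied to each block once it is in memory, and the results are combined:

```r
library(bigstatsr)

set.seed(1)
X <- FBM(100, 20)
X[] <- rnorm(length(X))

# Split: column indices are cut into blocks (one set per core).
# Apply: each block X[, ind] is read as a regular matrix and given to colSums().
# Combine: the per-block results are concatenated with c().
sums <- big_apply(X, a.FUN = function(X, ind) colSums(X[, ind]),
                  a.combine = 'c', ncores = 2)

# Same computation with everything loaded in memory, for comparison
all.equal(sums, colSums(X[]))
```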

---

## Accessors similar to those of Rcpp matrices

---

## Partial Singular Value Decomposition

15K x 100K `big.matrix`, 6 cores, K = 10, **1 min** (vs 2h in base R)
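
A small sketch of a partial SVD, assuming the `big_randomSVD()` and `big_scale()` functions of a recent bigstatsr release and a simulated matrix far smaller than the one timed above. Only the first `k = 10` singular values and vectors are computed, and they are compared with a full SVD in base R:

```r
library(bigstatsr)

set.seed(1)
X <- FBM(200, 50)
X[] <- rnorm(length(X))

# Partial SVD of the centered and scaled matrix: first k = 10 components only
svd_part <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)

# Reference: full SVD of the same scaled matrix, computed in memory
svd_full <- svd(scale(X[]))

# Largest absolute difference between the two sets of singular values
# (expected to be small, up to the tolerance of the iterative solver)
max(abs(svd_part$d - svd_full$d[1:10]))
```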

---

## Sparse linear models: **biglasso**
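
A hedged sketch of fitting a sparse linear model, assuming `big_spLinReg()` from a recent bigstatsr release (documented as a modified version of a biglasso function). The data, the train/test split and the choice `K = 5` are illustrative assumptions, not the setup from the talk:

```r
library(bigstatsr)

set.seed(1)
N <- 200; M <- 500
X <- FBM(N, M, init = rnorm(N * M))
# Simulated outcome: only the first 10 variables have an effect
y <- rowSums(X[, 1:10]) + rnorm(N)

ind.train <- sort(sample(N, 150))
ind.test  <- setdiff(seq_len(N), ind.train)

# Penalized (sparse) linear regression on the training rows;
# K = 5 sets the number of internal cross-validation folds
mod <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 5)

# Predict for the held-out rows and check agreement with the true outcome
pred <- predict(mod, X, ind.row = ind.test)
cor(pred, y[ind.test])
```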

---

## Other functions

- matrix operations (Split-Apply-Combine strategy)

- association of each variable with an output (RcppArmadillo)

- plotting functions (ggplot2)

- read from text files

- others...
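
Two of the items above can be sketched as follows, assuming the `big_prodVec()` and `big_univLinReg()` functions of a recent bigstatsr release and simulated data (all object names here are illustrative):

```r
library(bigstatsr)

set.seed(1)
X <- FBM(100, 30, init = rnorm(100 * 30))
y <- X[, 1] + rnorm(100)

# Matrix operation: product of the on-disk matrix with a vector
v <- rnorm(30)
xv <- big_prodVec(X, v)
all.equal(xv, drop(X[] %*% v))

# Association of each variable with an output:
# one simple linear regression (with intercept) per column of X
assoc <- big_univLinReg(X, y)
head(assoc)   # columns: estim, std.err, score
```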

---

# I'm now able 
# to run algorithms
# on 100GB of data

---

## R Packages

.footnote[Paper in preparation: "Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr".]

---

## Contributors are welcome!

---

# Thanks!

Package's website: https://privefl.github.io/bigstatsr/

Twitter and GitHub: [@privefl](https://twitter.com/privefl)

Presentation available online: https://goo.gl/cv7L5s