---
title: "Getting My Colleagues Hooked on R"
author: "Florian Privé"
date: "`r Sys.Date()`"
output:
ioslides_presentation:
css: styles.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
cache = TRUE,
warning = FALSE, message = FALSE,
fig.align = 'center', comment = "")
```
```{r, include=FALSE}
library(pacman)
p_load(magrittr, longurl, gsheet)
responses <- "goo.gl/4zYmrw" %>% expand_urls %>% {gsheet2tbl(.$expanded_url)[, 2]}
```
## What `r nrow(responses)` of you wanted to learn
```{r, include=FALSE}
p_load(gsubfn, stringr)
questions <-
"https://goo.gl/forms/LREeX5NORBJlCrcC3" %>%
readLines(encoding = "UTF-8") %>%
strapply(pattern = "\\[\"([^\"]*)\",,,,0\\]") %>%
unlist
counts <- str_count(responses, coll(questions))
counts.lvl <- counts %>% unique %>% sort(decreasing = TRUE) %>% setdiff(0)
```
```{r, echo=FALSE, results='asis'}
printf <- function(...) cat(sprintf(...))
for (n in counts.lvl) {
if (n == 2) printf("\n***\n")
printf("- for **%d** of you:\n", n)
q.tmp <- questions[counts == n]
for (q in q.tmp) {
printf(" - %s\n", q)
}
}
```
## Overview
We will try to see a bit of everything.
- This is only a (small) part of what R can do
- We will only see introductions to each topic, with some links to learn more
- I'm not an expert in everything in R (yet :D)
Contents:
- some stats about R
- data manipulation and visualization
- Rcpp
- bigmemory
- RStudio
- learn more
## Some facts about the growth of R:
- R is #5 of all programming languages ([IEEE Spectrum, July 2016](https://www.r-bloggers.com/r-moves-up-to-5th-place-in-ieee-language-rankings/))
```{r, echo=FALSE}
knitr::include_graphics("http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb092485d1970d-800wi")
```
---
```{r, echo = FALSE}
n <- readLines('https://cran.r-project.org/web/packages/') %>%
gsubfn::strapply(
paste("Currently, the CRAN package repository",
"features ([0-9]+) available packages.")) %>%
unlist
```
- There are now `r n` available packages on CRAN ([CRAN: Contributed Packages, `r Sys.Date()`](https://cran.r-project.org/web/packages/))
```{r, echo=FALSE, out.height="450px"}
knitr::include_graphics("http://a3.typepad.com/6a017d41eeee1a970c01bb08ef2103970d-pi")
```
---
- There are many R conferences:
- useR!: 900+ people in 2016,
- eRum: european R users meeting,
- EARL: many people from the Industry,
- Rencontres R: Grenoble in 2015,
- satRdays,
- R/Finance & R in Insurance.
- The R blogosphere is huge: [R-bloggers](https://www.r-bloggers.com/) has
- nearly 600 bloggers,
- 36K followers on Twitter,
- 39K on Facebook,
- very interesting posts every day!
## Manipulating data? Ask Hadley Wickham!
R packages that he has developped (from [his website](http://hadley.nz/)):
- Data science
- ggplot2 for visualising data.
- dplyr for manipulating data.
- tidyr for tidying data.
- stringr for working with strings.
- lubridate for working with date/times.
---
- Data import
- readr for reading .csv and fwf files.
- readxl for reading .xls and .xlsx files.
- haven for SAS, SPSS, and Stata files.
- httr for talking to web APIs.
- rvest for scraping websites.
- xml2 for importing XML files.
- Software engineering
- devtools for general package development.
- roxygen2 for in-line documentation.
- testthat for unit testing.
## Introduction to dplyr (from its vignette)
```{r, collapse=TRUE}
p_load(nycflights13)
dim(flights)
head(flights)
```
***
```{r}
p_load(dplyr)
```
Dplyr aims to provide a function for each basic verb of data manipulation:
- ``filter()`` (and ``slice()``)
- ``arrange()``
- ``select()`` (and ``rename()``)
- ``distinct()``
- ``mutate()`` (and ``transmute()``)
- ``summarise()``
- ``sample_n()`` (and ``sample_frac()``)
***
```{r}
filter(flights, month == 1, day == 1)
```
***
```{r}
arrange(flights, desc(dep_delay))
```
***
```{r}
mutate(flights, gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
```
***
```{r}
flights2 <- flights %>%
filter(month == 1, day == 1) %>%
arrange(desc(dep_delay)) %>%
mutate(gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
print(flights2, n = 6)
```
## Elegant visualization tools: [ggplot2](http://ggplot2.org/)
```{r, out.height=380, out.width=600}
p_load(ggplot2)
p <- qplot(dep_delay, arr_delay, data = flights2,
main = "Flights which take off late arrive late. Surprising!")
print(p)
```
## Adding layers
```{r}
p + geom_smooth()
```
## More: go check this book
```{r}
citation("ggplot2")
```
## Some extensions are available [here](https://www.ggplot2-exts.org/)
```{r}
p_load(ggExtra)
ggMarginal(p, type = "histogram")
```
## [ggmap](https://github.com/dkahle/ggmap): maps with ggplot2
```{r, echo=FALSE}
knitr::include_graphics("http://revolution-computing.typepad.com/.a/6a010534b1db25970b0167689d5031970b-800wi")
```
## Interactive visualizations tools: [plotly](https://plot.ly/ggplot2/)
```{r}
p_load(plotly)
ggplotly(p)
```
## More
Looking for inspiration or help concerning data visualisation with R? Go check the [R graph gallery](http://www.r-graph-gallery.com/)!
## Interactive apps: [Shiny](http://shiny.rstudio.com/)
Live demo!
- From the Shiny website
- My own shiny app: `shiny::runGitHub("privefl/repartitions_equipes")`
- A game: [Lights Out](https://daattali.com/shiny/lightsout/)
More advanced usage: [Advanced tips and tricks](https://github.com/daattali/advanced-shiny)
## Use of C++ code when needed
More infos [there](http://adv-r.had.co.nz/Rcpp.html)
Typical bottlenecks that C++ can address include:
- Recursive functions, or problems which involve calling functions **millions of times**.
- Loops that **can’t be easily vectorised** because subsequent iterations depend on previous ones.
- Problems that require **advanced data structures** and algorithms that R doesn’t provide.
## Sum
```{r}
sumR <- function(x) {
total <- 0
for (i in seq_along(x)) {
total <- total + x[i]
}
total
}
```
![](http://medienwoche.ch/wp_live/wp-content/uploads/2016/01/vomit.jpg)
***
In Rcpp:
```{r engine='Rcpp'}
#include
using namespace Rcpp;
// [[Rcpp::export]]
double sumC(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}
```
***
In [Rcpp Sugar](http://dirk.eddelbuettel.com/code/rcpp/Rcpp-sugar.pdf):
```{r engine='Rcpp'}
#include
using namespace Rcpp;
// [[Rcpp::export]]
double sumCS(NumericVector x) {
return sum(x);
}
```
***
Microbenchmark:
```{r}
p_load(microbenchmark)
x <- runif(1e3)
microbenchmark(
sum(x),
sumC(x),
sumCS(x),
sumR(x)
)
```
## Gibbs sampler
```{r}
gibbs_r <- function(N, thin) {
mat <- matrix(nrow = 2, ncol = N)
x <- y <- 0
for (i in 1:N) {
for (j in 1:thin) {
x <- rgamma(1, 3, y * y + 4)
y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
}
mat[, i] <- c(x, y)
}
mat
}
```
***
```{r engine='Rcpp'}
#include
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix gibbs_cpp(int N, int thin) {
NumericMatrix mat(2, N);
double x = 0, y = 0;
for(int i = 0; i < N; i++) {
for(int j = 0; j < thin; j++) {
x = rgamma(1, 3, 1 / (y * y + 4))[0];
y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
}
mat(0, i) = x;
mat(1, i) = y;
}
return(mat);
}
```
***
```{r}
microbenchmark(
gibbs_r(100, 10),
gibbs_cpp(100, 10)
)
```
## Bigmemory
- On-disk matrices
- types: ``char``, ``short``, ``int``, ``float``, ``double``
- Access with `[i, j]` as a matrix
- Access via C++ code with `[j][i]`
- Easy use of parallelisation with shared matrices
## Example with foreach and bigmemory
> - Say you have:
- A SNP big.matrix X stored on-disk in directory _backingfiles_,
- Infos on the positions of the SNPs (the first 40,000 SNPs are in chromosome 1, then 38,000 are in chromosome 2, etc.),
> - And you have to do some computations which are independent with respect to chromosomes. You want to use __Parallel Computing__!
> - How to do use Parallel Computing on massive genotype matrices?
***
```{r, eval = FALSE, out.height=300}
DO_all <- function(X, infos, ncores) {
DO_chr <- function(X.desc, lims) {
X.chr <- sub.big.matrix(X.desc,
firstCol = lims[1],
lastCol = lims[2],
backingpath = "backingfiles")
## Do something with X.chr (such as imputing)
}
range.chr <- LimsChr(infos)
X.desc <- describe(X)
obj <- foreach(chr = 1:nrow(range.chr),
.packages = "bigmemory")
expr_fun <- function(chr) {
DO_chr(X.desc, range.chr[chr, ])
}
res <- foreach2(obj, expr_fun, ncores)
}
```
***
```{r, eval = FALSE}
LimsChr <- function(infos) {
map.rle <- rle(infos$map$chromosome)
upper <- cumsum(map.rle$length)
lower <- c(1, upper[-length(upper)] + 1)
cbind(lower, upper, "chr" = map.rle$values)
}
foreach2 <- function(obj, expr_fun, ncores) {
if (is.seq <- (ncores == 1)) {
foreach::registerDoSEQ()
} else {
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
}
res <- eval(parse(
text = sprintf("foreach::`%%dopar%%`(obj, expr_fun(%s))",
obj$argnames)))
if (!is.seq) parallel::stopCluster(cl)
return(res)
}
```
## We have [RStudio](https://www.rstudio.com/)
Live demo!
- Code highlighting/autocompletion
- Help > Cheatsheets
- Panels (Git, ...)
- debugger
- [Notebooks](https://www.r-bloggers.com/r-notebooks/)
More tips: [RStudio Tips](https://twitter.com/rstudiotips) on Twitter
## Free books to learn about R:
- Advanced R programming:
- [Efficient R Programming](https://csgillespie.github.io/efficientR/preface.html)
- [Advanced R](http://adv-r.had.co.nz/)
- Reporting:
- [Getting used to R, RStudio, and R Markdown](https://ismayc.github.io/rbasics-book/index.html)
- Data analysis:
- [R for Data Science](http://r4ds.had.co.nz/)
- [An Introduction to Statistical Learning, with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) (Trevor Hastie is one of the 4 authors)
- Package development:
- [R packages](http://r-pkgs.had.co.nz/)
Learn: [R Course Finder](http://r-exercises.com/r-courses/)
## References and further reading
- [7 Tips For Getting Your Colleagues Hooked on R](http://scl.io/QZxZZl6u#gs.zMhz76Q)
- [Video: What is R?](https://www.youtube.com/watch?v=TR2bHSJ_eck)
- [How Companies Use R to Compete in a Data-Driven World](http://data-informed.com/companies-use-r-compete-data-driven-world/)
- [How the growth of R helps data-driven organizations succeed](http://www.slideshare.net/RevolutionAnalytics/how-the-growth-of-r-helps-datadriven-organizations-succeed)
- [A segmented model of CRAN package growth](https://www.r-bloggers.com/a-segmented-model-of-cran-package-growth/)
- [Coke vs Soda vs Pop : Linguistic trends analyzed with Twitter and R](https://www.r-bloggers.com/coke-vs-soda-vs-pop-linguistic-trends-analyzed-with-twitter-and-r/)
- [rPython R package](http://rpython.r-forge.r-project.org/)
- [Feather: A Fast On-Disk Format for Data Frames for R and Python](https://www.r-bloggers.com/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/)