---
title: "Getting My Colleagues Hooked on R"
author: "Florian Privé"
date: "`r Sys.Date()`"
output:
  ioslides_presentation:
    css: styles.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  cache = TRUE,
  warning = FALSE,
  message = FALSE,
  fig.align = 'center',
  comment = "")
```

```{r, include=FALSE}
library(pacman)
p_load(magrittr, longurl, gsheet)
# Expand the short URL, then read the survey answers (second column of the sheet)
responses <- "goo.gl/4zYmrw" %>%
  expand_urls %>%
  {gsheet2tbl(.$expanded_url)[, 2]}
```

## What `r nrow(responses)` of you wanted to learn

```{r, include=FALSE}
p_load(gsubfn, stringr)
# Extract the question labels from the source of the Google Form
questions <- "https://goo.gl/forms/LREeX5NORBJlCrcC3" %>%
  readLines(encoding = "UTF-8") %>%
  strapply(pattern = "\\[\"([^\"]*)\",,,,0\\]") %>%
  unlist
# Count how many times each question appears in the answers
counts <- str_count(responses, coll(questions))
counts.lvl <- counts %>% unique %>% sort(decreasing = TRUE) %>% setdiff(0)
```

```{r, echo=FALSE, results='asis'}
printf <- function(...) cat(sprintf(...))
# Print the questions as a nested Markdown list, most requested first
for (n in counts.lvl) {
  if (n == 2) printf("\n***\n")
  printf("- for **%d** of you:\n", n)
  q.tmp <- questions[counts == n]
  for (q in q.tmp) {
    printf("    - %s\n", q)
  }
}
```

## Overview

We will try to see a bit of everything.

- This is only a (small) part of what R can do
- We will only see introductions to each topic, with some links to learn more
- I'm not an expert in everything in R (yet :D)

Contents:

- some stats about R
- data manipulation and visualization
- Rcpp
- bigmemory
- RStudio
- learn more

## Some facts about the growth of R:

- R is #5 of all programming languages ([IEEE Spectrum, July 2016](https://www.r-bloggers.com/r-moves-up-to-5th-place-in-ieee-language-rankings/))

```{r, echo=FALSE}
knitr::include_graphics("http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb092485d1970d-800wi")
```

---

```{r, echo = FALSE}
# Read the current number of packages from the CRAN web page
n <- readLines('https://cran.r-project.org/web/packages/') %>%
  gsubfn::strapply(
    paste("Currently, the CRAN package repository",
          "features ([0-9]+) available packages.")) %>%
  unlist
```

- There are now `r n` available packages on CRAN ([CRAN: Contributed Packages, `r Sys.Date()`](https://cran.r-project.org/web/packages/))

```{r, echo=FALSE, out.height="450px"}
knitr::include_graphics("http://a3.typepad.com/6a017d41eeee1a970c01bb08ef2103970d-pi")
```

---

- There are many R conferences:
    - useR!: 900+ people in 2016,
    - eRum: European R users meeting,
    - EARL: many people from industry,
    - Rencontres R: Grenoble in 2015,
    - satRdays,
    - R/Finance & R in Insurance.

- The R blogosphere is huge: [R-bloggers](https://www.r-bloggers.com/) has
    - nearly 600 bloggers,
    - 36K followers on Twitter,
    - 39K on Facebook,
    - very interesting posts every day!

## Manipulating data? Ask Hadley Wickham!

R packages that he has developed (from [his website](http://hadley.nz/)):

- Data science
    - ggplot2 for visualising data.
    - dplyr for manipulating data.
    - tidyr for tidying data.
    - stringr for working with strings.
    - lubridate for working with date/times.

---

- Data import
    - readr for reading .csv and fwf files.
    - readxl for reading .xls and .xlsx files.
    - haven for SAS, SPSS, and Stata files.
    - httr for talking to web APIs.
    - rvest for scraping websites.
    - xml2 for importing XML files.
- Software engineering
    - devtools for general package development.
    - roxygen2 for in-line documentation.
    - testthat for unit testing.

## Introduction to dplyr (from its vignette)

```{r, collapse=TRUE}
p_load(nycflights13)
dim(flights)
head(flights)
```

***

```{r}
p_load(dplyr)
```

dplyr aims to provide a function for each basic verb of data manipulation:

- ``filter()`` (and ``slice()``)
- ``arrange()``
- ``select()`` (and ``rename()``)
- ``distinct()``
- ``mutate()`` (and ``transmute()``)
- ``summarise()``
- ``sample_n()`` (and ``sample_frac()``)

***

```{r}
filter(flights, month == 1, day == 1)
```

***

```{r}
arrange(flights, desc(dep_delay))
```

***

```{r}
mutate(flights,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)
```

***

```{r}
flights2 <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(desc(dep_delay)) %>%
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60)

print(flights2, n = 6)
```

## Elegant visualization tools: [ggplot2](http://ggplot2.org/)

```{r, out.height=380, out.width=600}
p_load(ggplot2)
p <- qplot(dep_delay, arr_delay, data = flights2,
           main = "Flights which take off late arrive late. Surprising!")
print(p)
```

## Adding layers

```{r}
p + geom_smooth()
```

## More: go check this book

```{r}
citation("ggplot2")
```

## Some extensions are available [here](https://www.ggplot2-exts.org/)

```{r}
p_load(ggExtra)
ggMarginal(p, type = "histogram")
```

## [ggmap](https://github.com/dkahle/ggmap): maps with ggplot2

```{r, echo=FALSE}
knitr::include_graphics("http://revolution-computing.typepad.com/.a/6a010534b1db25970b0167689d5031970b-800wi")
```

## Interactive visualization tools: [plotly](https://plot.ly/ggplot2/)

```{r}
p_load(plotly)
ggplotly(p)
```

## More

Looking for inspiration or help concerning data visualisation with R?

Go check the [R graph gallery](http://www.r-graph-gallery.com/)!

## Interactive apps: [Shiny](http://shiny.rstudio.com/)

Live demo!

- From the Shiny website
- My own shiny app: `shiny::runGitHub("privefl/repartitions_equipes")`
- A game: [Lights Out](https://daattali.com/shiny/lightsout/)

More advanced usage: [Advanced tips and tricks](https://github.com/daattali/advanced-shiny)

## Use of C++ code when needed

More info [here](http://adv-r.had.co.nz/Rcpp.html)

Typical bottlenecks that C++ can address include:

- Recursive functions, or problems which involve calling functions **millions of times**.
- Loops that **can’t be easily vectorised** because subsequent iterations depend on previous ones.
- Problems that require **advanced data structures** and algorithms that R doesn’t provide.

## Sum

```{r}
sumR <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
  }
  total
}
```

![](http://medienwoche.ch/wp_live/wp-content/uploads/2016/01/vomit.jpg)

***

In Rcpp:

```{r engine='Rcpp'}
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sumC(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}
```

***

In [Rcpp Sugar](http://dirk.eddelbuettel.com/code/rcpp/Rcpp-sugar.pdf):

```{r engine='Rcpp'}
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sumCS(NumericVector x) {
  return sum(x);
}
```

***

Microbenchmark:

```{r}
p_load(microbenchmark)
x <- runif(1e3)
microbenchmark(
  sum(x),
  sumC(x),
  sumCS(x),
  sumR(x)
)
```

## Gibbs sampler

```{r}
gibbs_r <- function(N, thin) {
  mat <- matrix(nrow = 2, ncol = N)
  x <- y <- 0

  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4)
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[, i] <- c(x, y)
  }
  mat
}
```

***

```{r engine='Rcpp'}
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix gibbs_cpp(int N, int thin) {
  NumericMatrix mat(2, N);
  double x = 0, y = 0;

  for(int i = 0; i < N; i++) {
    for(int j = 0; j < thin; j++) {
      // Rcpp's rgamma() takes a scale where R's takes a rate, hence the 1 / (...)
      x = rgamma(1, 3, 1 / (y * y + 4))[0];
      y = rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))[0];
    }
    mat(0, i) = x;
    mat(1, i) = y;
  }

  return mat;
}
```

***

```{r}
microbenchmark(
  gibbs_r(100, 10),
  gibbs_cpp(100, 10)
)
```

## Bigmemory

- On-disk matrices
- Types: ``char``, ``short``, ``int``, ``float``, ``double``
- Access with `[i, j]`, as with a regular matrix
- Access via C++ code with `[j][i]`
- Easy use of parallelisation with shared matrices

## Example with foreach and bigmemory

> - Say you have:
    - A SNP big.matrix X stored on-disk in directory _backingfiles_,
    - Information on the positions of the SNPs (the first 40,000 SNPs are in chromosome 1, then 38,000 are in chromosome 2, etc.),
> - And you have to do some computations which are independent with respect to chromosomes. You want to use __Parallel Computing__!
> - How do you use parallel computing on massive genotype matrices?

***

```{r, eval = FALSE, out.height=300}
DO_all <- function(X, infos, ncores) {

  # Work on one chromosome: attach only its columns of the on-disk matrix
  DO_chr <- function(X.desc, lims) {
    X.chr <- sub.big.matrix(X.desc, firstCol = lims[1], lastCol = lims[2],
                            backingpath = "backingfiles")
    ## Do something with X.chr (such as imputing)
  }

  range.chr <- LimsChr(infos)
  # Pass a descriptor (not the matrix itself) to the workers
  X.desc <- describe(X)

  obj <- foreach(chr = 1:nrow(range.chr), .packages = "bigmemory")
  expr_fun <- function(chr) {
    DO_chr(X.desc, range.chr[chr, ])
  }
  res <- foreach2(obj, expr_fun, ncores)
}
```

***

```{r, eval = FALSE}
# First and last column index of each chromosome
LimsChr <- function(infos) {
  map.rle <- rle(infos$map$chromosome)
  upper <- cumsum(map.rle$lengths)
  lower <- c(1, upper[-length(upper)] + 1)
  cbind(lower, upper, "chr" = map.rle$values)
}

# Run expr_fun over obj, sequentially or on a cluster of ncores workers
foreach2 <- function(obj, expr_fun, ncores) {
  if (is.seq <- (ncores == 1)) {
    foreach::registerDoSEQ()
  } else {
    cl <- parallel::makeCluster(ncores)
    doParallel::registerDoParallel(cl)
  }
  res <- eval(parse(
    text = sprintf("foreach::`%%dopar%%`(obj, expr_fun(%s))", obj$argnames)))
  if (!is.seq) parallel::stopCluster(cl)
  return(res)
}
```

## We have [RStudio](https://www.rstudio.com/)

Live demo!

- Code highlighting/autocompletion
- Help > Cheatsheets
- Panels (Git, ...)
- Debugger
- [Notebooks](https://www.r-bloggers.com/r-notebooks/)

More tips: [RStudio Tips](https://twitter.com/rstudiotips) on Twitter

## Free books to learn about R:

- Advanced R programming:
    - [Efficient R Programming](https://csgillespie.github.io/efficientR/preface.html)
    - [Advanced R](http://adv-r.had.co.nz/)
- Reporting:
    - [Getting used to R, RStudio, and R Markdown](https://ismayc.github.io/rbasics-book/index.html)
- Data analysis:
    - [R for Data Science](http://r4ds.had.co.nz/)
    - [An Introduction to Statistical Learning, with Applications in R](http://www-bcf.usc.edu/~gareth/ISL/) (Trevor Hastie is one of the 4 authors)
- Package development:
    - [R packages](http://r-pkgs.had.co.nz/)

Learn: [R Course Finder](http://r-exercises.com/r-courses/)

## References and further reading

- [7 Tips For Getting Your Colleagues Hooked on R](http://scl.io/QZxZZl6u#gs.zMhz76Q)
- [Video: What is R?](https://www.youtube.com/watch?v=TR2bHSJ_eck)
- [How Companies Use R to Compete in a Data-Driven World](http://data-informed.com/companies-use-r-compete-data-driven-world/)
- [How the growth of R helps data-driven organizations succeed](http://www.slideshare.net/RevolutionAnalytics/how-the-growth-of-r-helps-datadriven-organizations-succeed)
- [A segmented model of CRAN package growth](https://www.r-bloggers.com/a-segmented-model-of-cran-package-growth/)
- [Coke vs Soda vs Pop : Linguistic trends analyzed with Twitter and R](https://www.r-bloggers.com/coke-vs-soda-vs-pop-linguistic-trends-analyzed-with-twitter-and-r/)
- [rPython R package](http://rpython.r-forge.r-project.org/)
- [Feather: A Fast On-Disk Format for Data Frames for R and Python](https://www.r-bloggers.com/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/)
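
## Appendix: what `LimsChr()` computes

The foreach and bigmemory example splits the columns of the matrix by chromosome using `rle()` and `cumsum()`. Here is a minimal sketch of that step on a toy input. The object `infos.toy` and the name `lims_chr_toy` are made up for illustration; only the `infos$map$chromosome` layout is taken from the slides.

```{r}
# Toy input (made up): chromosome label of each SNP, in column order
infos.toy <- list(map = data.frame(chromosome = c(1, 1, 1, 2, 2, 3)))

# Same logic as LimsChr(): one row per chromosome, giving the first and
# last column index of its SNPs
lims_chr_toy <- function(infos) {
  map.rle <- rle(infos$map$chromosome)      # runs of identical labels
  upper <- cumsum(map.rle$lengths)          # last column of each run
  lower <- c(1, upper[-length(upper)] + 1)  # first column of each run
  cbind(lower, upper, chr = map.rle$values)
}

lims.toy <- lims_chr_toy(infos.toy)
lims.toy
```

With this toy vector, chromosome 1 covers columns 1 to 3, chromosome 2 covers columns 4 to 5, and chromosome 3 covers column 6 only. Each row can then be passed as `lims` to a per-chromosome worker.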