R beyond statistics

# beyond statistics

## Florian Privé

### April 8, 2019

**Slides:** `privefl.github.io/R-presentation/beyond-stats.html`

---

## Contents

- Statistics & Data Science

- Visualization

- Reporting

- Web

- High Performance Computing

- Improvements to the language

- RStudio IDE

- Community

- Learn R

---

# Statistics & Data Science

---

## Just one example in Statistics: 
### glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models

Extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for

- linear regression,

- logistic and multinomial regression models,

- Poisson regression and

- the Cox model.

Two recent additions are the

- multiple-response Gaussian, and

- the grouped multinomial regression.

**With so many options!**

---

## Work with many kinds of data

- tabular tidy data (see [this book](http://r4ds.had.co.nz/))

- spatial (see [this book](https://bookdown.org/robinlovelace/geocompr/) and [this blog](https://statnmap.com/))

- temporal (see [this book](https://otexts.org/fpp2/))

- textual (see [this book](https://www.tidytextmining.com/))

- networks (see [this book](https://sites.fas.harvard.edu/~airoldi/pub/books/BookDraft-CsardiNepuszAiroldi2016.pdf))

- etc

---

## Data manipulation

- [{data.table}](https://github.com/Rdatatable/data.table/wiki) is very fast

- [{dplyr}](https://github.com/tidyverse/dplyr) is often fast enough and has a very intuitive syntax

---

## Data manipulation with {dplyr}

```r
library(dplyr)
(flights <- nycflights13::flights)
```

```
# A tibble: 336,776 x 19
 year month day dep_time sched_dep_time dep_delay arr_time
 <int> <int> <int> <int> <int> <dbl> <int>
 1 2013 1 1 517 515 2 830
 2 2013 1 1 533 529 4 850
 3 2013 1 1 542 540 2 923
 4 2013 1 1 544 545 -1 1004
 5 2013 1 1 554 600 -6 812
 6 2013 1 1 554 558 -4 740
 7 2013 1 1 555 600 -5 913
 8 2013 1 1 557 600 -3 709
 9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
```

---

## Data manipulation with {dplyr}

R package {dplyr} aims to provide a function for each basic verb of data manipulation:

- `filter()`

- `arrange()`

- `select()`

- `mutate()`

- `group_by()`

- `summarise()`

- and many others..

---

## Filtering observations

```r
filter(flights, month == 1, day == 1)
```

```
# A tibble: 842 x 19
 year month day dep_time sched_dep_time dep_delay arr_time
 <int> <int> <int> <int> <int> <dbl> <int>
 1 2013 1 1 517 515 2 830
 2 2013 1 1 533 529 4 850
 3 2013 1 1 542 540 2 923
 4 2013 1 1 544 545 -1 1004
 5 2013 1 1 554 600 -6 812
 6 2013 1 1 554 558 -4 740
 7 2013 1 1 555 600 -5 913
 8 2013 1 1 557 600 -3 709
 9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 832 more rows, and 12 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

## Sorting

```r
arrange(flights, desc(dep_delay))
```

```
# A tibble: 336,776 x 19
 year month day dep_time sched_dep_time dep_delay arr_time
 <int> <int> <int> <int> <int> <dbl> <int>
 1 2013 1 9 641 900 1301 1242
 2 2013 6 15 1432 1935 1137 1607
 3 2013 1 10 1121 1635 1126 1239
 4 2013 9 20 1139 1845 1014 1457
 5 2013 7 22 845 1600 1005 1044
 6 2013 4 10 1100 1900 960 1342
 7 2013 3 17 2321 810 911 135
 8 2013 6 27 959 1900 899 1236
 9 2013 7 22 2257 759 898 121
10 2013 12 5 756 1700 896 1058
# … with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
```

---

## Adding/replacing variables

```r
mutate(flights, speed = distance / air_time * 60)
```

```
# A tibble: 336,776 x 20
 year month day dep_time sched_dep_time dep_delay arr_time
 <int> <int> <int> <int> <int> <dbl> <int>
 1 2013 1 1 517 515 2 830
 2 2013 1 1 533 529 4 850
 3 2013 1 1 542 540 2 923
 4 2013 1 1 544 545 -1 1004
 5 2013 1 1 554 600 -6 812
 6 2013 1 1 554 558 -4 740
 7 2013 1 1 555 600 -5 913
 8 2013 1 1 557 600 -3 709
 9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# … with 336,766 more rows, and 13 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, speed <dbl>
```

---

## Piping operations

```r
flights2 <- flights %>%
 filter(month == 1, day == 1) %>%
 arrange(desc(dep_delay)) %>%
 mutate(speed = distance / air_time * 60)
print(flights2, n = 6)
```

```
# A tibble: 842 x 20
 year month day dep_time sched_dep_time dep_delay arr_time
 <int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 848 1835 853 1001
2 2013 1 1 2343 1724 379 314
3 2013 1 1 1815 1325 290 2120
4 2013 1 1 2205 1720 285 46
5 2013 1 1 1842 1422 260 1958
6 2013 1 1 2115 1700 255 2330
# … with 836 more rows, and 13 more variables: sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, speed <dbl>
```

---

## Summarizing by group

```r
flights %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay)) %>%
  left_join(nycflights13::airlines)
```

```
Joining, by = "carrier"
```

```
# A tibble: 16 x 3
 carrier avg_arr_delay name 
 <chr> <dbl> <chr> 
 1 F9 21.9 Frontier Airlines Inc. 
 2 FL 20.1 AirTran Airways Corporation
 3 EV 15.8 ExpressJet Airlines Inc. 
 4 YV 15.6 Mesa Airlines Inc. 
 5 OO 11.9 SkyWest Airlines Inc. 
 6 MQ 10.8 Envoy Air 
 7 WN 9.65 Southwest Airlines Co. 
 8 B6 9.46 JetBlue Airways 
 9 9E 7.38 Endeavor Air Inc. 
10 UA 3.56 United Air Lines Inc. 
11 US 2.13 US Airways Inc. 
12 VX 1.76 Virgin America 
13 DL 1.64 Delta Air Lines Inc. 
14 AA 0.364 American Airlines Inc. 
15 HA -6.92 Hawaiian Airlines Inc. 
16 AS -9.93 Alaska Airlines Inc. 
```

---

## {dplyr} also works with databases

Connect to DBs such as MySQL, MariaDB, Postgres, Redshift, SQLite, Google’s BigQuery... and use {dplyr} instead of SQL.

.footnote[Learn more at https://db.rstudio.com/dplyr/ and with [this webinar](https://www.rstudio.com/resources/videos/best-practices-for-working-with-databases-webinar/).]

---

## Machine Learning & Deep Learning

### Package {caret}

The caret package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process for creating predictive models (see [the full documentation](http://topepo.github.io/caret/index.html)). The package contains tools for:

- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation

### Keras & TensorFlow in R

Keras & TensorFlow are integrated in R

- [TensorFlow for R](https://TensorFlow.rstudio.com/)

- [TensorFlow for R blog](https://blogs.rstudio.com/TensorFlow/)

---

# Visualization

---

## Package {ggplot2} and extensions

---

## Animate graphics with [{gganimate}](https://github.com/thomasp85/gganimate)

---

## Fancy graphics: [alluvial diagrams](https://github.com/mbojan/alluvial)

---

## Image processing

- [{magick}](https://github.com/ropensci/magick)

- [{imager}](https://github.com/dahtah/imager) (by Simon Barthelmé, GIPSA-lab)

---

# Reporting

---

## R Markdown

- Reports (analysis, etc) with text, code and results in the same place! With many possible output formats including HTML, PDF, MS Word, beamer, etc.

- Many supported languages: 
    
    ```
    awk, bash, coffee, gawk, groovy, haskell, lein, mysql, node,
    octave, perl, psql, Rscript, ruby, sas, scala, sed, sh, stata,
    zsh, highlight, Rcpp, tikz, dot, c, fortran, fortran95, asy, cat,
    asis, stan, block, block2, js, css, sql, go, python, julia
    ```
    
- Notebooks

- HTML presentations (like this one! -- see [source code](https://github.com/privefl/R-presentation/blob/master/beyond-stats.Rmd))

- websites (such as [the website of our R user group](https://r-in-grenoble.github.io/))

- books (e.g. my online [advanced R course](https://privefl.github.io/advr38book/) or even [a thesis](https://keurcien.github.io/book/introduction.html))

- [WIP] [{pagedown}](https://github.com/rstudio/pagedown): Paginate the HTML Output of R Markdown with CSS for Print

---

# Web

---

## Web scrapping

(Remember to be nice on the web with [{polite}](https://github.com/dmi3kno/polite))

```r
library(rvest)

read_html("https://r-in-grenoble.github.io/sessions.html") %>%
  html_nodes(".schedule") %>%
  html_nodes(".center-title") %>%
  html_text() %>%
  gsub("\n", "", .) %>%
  writeLines()
```

```
What R can do for you
Image processing with package {imager}
Linear models in R
Manage your workflow with package {drake}
Deep Learning with package {keras}
Machine Learning with package {caret}
Best coding practices
R Markdown
Data manipulation with package {dplyr}
Data vizualisation with package {ggplot2}
```

---

## R clients to interact with APIs

- libcurl: https://cran.r-project.org/package=RCurl

- Twitter: https://github.com/mkearney/rtweet

- Gmail: https://github.com/jimhester/gmailr

- Google Spreadsheets: https://github.com/jennybc/googlesheets

- Google Analytics: https://github.com/MarkEdmondson1234/googleAnalyticsR

- GitHub: https://github.com/r-lib/gh

- Cloud computing / Web services: https://github.com/cloudyr

- dbSNP / openSNP: https://github.com/ropensci/rsnps

- OpenStreetMap: https://github.com/ropensci/osmdata

- [Géo](https://api.gouv.fr/api/api-geo.html): https://github.com/ColinFay/rgeoapi

- etc

---

## Shiny apps: web apps in R

- Example 1: [Airbnb visualization in New York](https://yuyuhan0306.shinyapps.io/airbnb_yuhan/)

- Example 2: [Make pixel art models](https://florianprive.shinyapps.io/pixelart/)

[Learn more](https://privefl.github.io/advr38book/shiny.html)

---

# High Performance Computing

---

## [Integrate C++ code with {Rcpp}](https://privefl.github.io/R-presentation/Rcpp.html)

Rcpp lives between R and C++, so that you can get

- the *performance of C++* and

- the *convenience of R*.

- I love *performance* and

- I also enjoy *simplicity*,

Rcpp might be my favorite R package.

---

## Easy parallelism with [{future}](https://github.com/HenrikBengtsson/future)

<blockquote class="twitter-tweet" data-lang="en" align="center" width="55%">future 1.0.0 on CRAN - cross-platform parallel evaluation via a single unified API <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://t.co/uxIozDAWHA">https://t.co/uxIozDAWHA</a> <a href="https://t.co/wV5vhcgpMJ">pic.twitter.com/wV5vhcgpMJ</a>&mdash; Henrik Bengtsson (@henrikbengtsson) <a href="https://twitter.com/henrikbengtsson/status/746906359484973057?ref_src=twsrc%5Etfw">26 juin 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

.footnote[Also see [my intro to parallelism with {foreach}](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/).]

---

## Scalable reproducible workflow with [{drake}](https://ropensci.github.io/drake/)

---

## Large matrices with [{bigstatsr}](https://github.com/privefl/bigstatsr)

### Advantages of using Filebacked Big Matrix (FBM) objects

- you can apply algorithms on **data larger than your RAM**,

- you can easily **parallelize** your algorithms because the data on disk is shared,

- you write **more efficient algorithms** (you do less copies and think more about what you're doing),

- you can use **different types of data**, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See [the documentation of the FBM class](https://privefl.github.io/bigstatsr/reference/FBM-class.html) for details.

---

# Improvements to the language

---

## Just-in-time compiler

```r
compiler::enableJIT(0)
```

```
[1] 3
```

```r
system.time({
 N <- 1e6
 res <- numeric(N)
 for (i in 2:N) {
 res[i] <- res[i - 1] + 1;
 }
})
```

```
   user  system elapsed 
  1.406   0.000   1.406 
```
]

```r
compiler::enableJIT(3)
```

```
[1] 0
```

```r
system.time({
 N <- 1e6
 res <- numeric(N)
 for (i in 2:N) {
 res[i] <- res[i - 1] + 1;
 }
})
```

```
   user  system elapsed 
  0.111   0.000   0.111 
```

Since R 3.4
]

---

## Growing vectors

```r
system.time({
 N <- 3e4
 res <- c(0)
 for (i in 2:N) {
 new_val <- res[i - 1] + 1
 res <- c(res, new_val)
 }
})
```

```
   user  system elapsed 
  1.440   0.041   1.483 
```
]

```r
system.time({
 N <- 3e4
 res <- c(0)
 for (i in 2:N) {
 new_val <- res[i - 1] + 1
 res[i] <- new_val;
 }
})
```

```
   user  system elapsed 
  0.013   0.000   0.013 
```

Since R 3.4
]

---

## ALTREP (since R 3.5)

### An alternative way of representing R objects internally

Before :

```r
x <- 1:1e10
```

```
Error: cannot allocate vector of size 74.5 Gb
```

After:

```r
x <- 1:1e10
.Internal(inspect(x))
```

```
@67ddb58 14 REALSXP g0c0 [NAM(3)]  1 : 10000000000 (compact)
```

```r
sum(x)
```

```
[1] 5e+19
```

---

# RStudio

---

## RStudio IDE really helps

- console / scripts / environment / plots

- code diagnostics

- projects (+ git panel)

- viewer / debugger / profiler

- interactive import / connection

- integrated terminal / HTML viewer

- support many programming languages

- RStudio server to open RStudio in a web brower, with code running on a distant server

---

## RStudio for Python?

<blockquote class="twitter-tweet" data-lang="en" align="center">What is one of the better interactive data analysis editor / environments for <a href="https://twitter.com/hashtag/Python?src=hash&amp;ref_src=twsrc%5Etfw">#Python</a>? <a href="https://twitter.com/rstudio?ref_src=twsrc%5Etfw">@rstudio</a> :-) The new RSTUDIO 1.2 release will have support for python REPL <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://twitter.com/hashtag/MachineLearning?src=hash&amp;ref_src=twsrc%5Etfw">#MachineLearning</a> <a href="https://twitter.com/hashtag/visualization?src=hash&amp;ref_src=twsrc%5Etfw">#visualization</a> <a href="https://t.co/lUvP3qYd4z">pic.twitter.com/lUvP3qYd4z</a>&mdash; Longhow Lam 林 隆 豪 (@longhowlam) <a href="https://twitter.com/longhowlam/status/1022821624318447616?ref_src=twsrc%5Etfw">27 juillet 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

# Community

---

## A welcoming and inclusive community

<blockquote class="twitter-tweet" data-lang="en" align="center">I&#39;m curious as to how the R community came to be so supportive and welcoming (as opposed to so much of the tech world). Anyone have ideas? <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a>&mdash; David Keyes (@dgkeyes) <a href="https://twitter.com/dgkeyes/status/1113547867984027648?ref_src=twsrc%5Etfw">3 avril 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

- [R Ladies](https://rladies.org/)

- [R Forwards](https://forwards.github.io/)

- etc

---

## Open Science: rOpenSci community

Packages in scope: data retrieval & extraction, scientific software wrappers, database access, data munging, data deposition, reproducibility, geospatial data, and text analysis.

Packages are openly reviewed to unsure quality.

---

## Open to many other languages

- C++ via [{Rcpp}](https://github.com/RcppCore/Rcpp)

- Python via [{reticulate}](https://rstudio.github.io/reticulate/)

- Java vua [{rJava}](https://github.com/s-u/rJava)

- Julia via [{JuliaCall}](https://github.com/Non-Contradiction/JuliaCall)

- JavaScript via [{V8}](https://github.com/jeroen/v8)

- JavaScript visualization via [{htmlwidgets}](https://www.htmlwidgets.org/)

- etc

- Beyond: [Apache Arrow](https://github.com/apache/arrow): cross-language development platform for in-memory data. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

---

# Where to learn R?

---

## Where to learn R?

- [An Introduction to R](https://colinfay.me/intro-to-r/) by the R core team

- [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r) by DataCamp

- [R for Data Science](http://r4ds.had.co.nz/index.html) by Garrett Grolemund & Hadley Wickham, and [some solutions](https://jrnold.github.io/r4ds-exercise-solutions/)

- [Advanced R](http://adv-r.had.co.nz/) by Hadley Wickham, and [some solutions](https://bookdown.org/Tazinho/Advanced-R-Solutions/)

- [Useful packages for Data Science](https://github.com/rstudio/RStartHere)

- [CRAN Task Views](https://cran.r-project.org/web/views/)

- [Advanced R course](https://privefl.github.io/advr38book/index.html)

- Read code, documentation, blog posts, etc. And PRACTICE.

- Learn from others

- [join the French-speaking R community](https://join.slack.com/t/r-grrr/shared_invite/enQtMzI4MzgwNTc4OTAxLWZlOGZiZTBiMWU0NDQ3OTYzOGE1YThiODgwZWNhNWEyYjI4ZDJiNmNhY2YyYWI5YzFiOTFkNDYxYzkwODUwNWM)
    - [join the R-Ladies community](https://rladies-community-slack.herokuapp.com/)

---

# Thanks!

**Slides:** `privefl.github.io/R-presentation/beyond-stats.html`

[privefl](https://twitter.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [privefl](https://github.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9)