What R can do - Grenoble RUG

# What R can do for you

## Florian Privé

### Grenoble RUG - September 13, 2018

**Slides:** `bit.ly/RUGgre11`

---

## Contents

- Statistics & Data Science

- Visualization

- High Performance Computing

- Web

- Reporting

- RStudio IDE

- Community

- Learn R

- Program for this year

---

# Statistics & Data Science

---

## Statistics

> R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
>
> -- https://www.r-project.org/about.html

---

## Work with many kinds of data

- tabular tidy data (see [this book](http://r4ds.had.co.nz/))

- spatial (see [this book](https://bookdown.org/robinlovelace/geocompr/) and [this blog](https://statnmap.com/))

- temporal (see [this book](https://otexts.org/fpp2/))

- textual (see [this book](https://www.tidytextmining.com/))

- networks (see [this book](https://sites.fas.harvard.edu/~airoldi/pub/books/BookDraft-CsardiNepuszAiroldi2016.pdf))

- etc.

---

## CRAN task views

Browse https://cran.r-project.org/web/views/.

> CRAN task views aim to provide some guidance which packages on CRAN are relevant for tasks related to a certain topic.

They are very useful for discovering the packages used in a given field of research.

## Bioconductor

Search engine: https://www.bioconductor.org/packages/devel/BiocViews.html

---

### Simple example

```r
plot(iris, pch = 20, col = iris$Species)
```

---

### Simple example

```r
pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
plot(pca$x, pch = 20, col = iris$Species)
```

---

### Simple example (November session)

```r
summary(fit <- lm(Petal.Length ~ ., data = iris))
```

```

Call:
lm(formula = Petal.Length ~ ., data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78396 -0.15708  0.00193  0.14730  0.65418

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared: 0.9786,	Adjusted R-squared: 0.9778 
F-statistic: 1317 on 5 and 144 DF, p-value: < 2.2e-16
```

---

## Data manipulation with {dplyr} (May session)

```r
library(dplyr)
(flights <- nycflights13::flights)
```

```
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
```

---

## Data manipulation with {dplyr}

The R package {dplyr} aims to provide a function for each basic verb of data manipulation:

- `filter()`

- `arrange()`

- `select()`

- `mutate()`

- `group_by()`

- `summarise()`

- and many others.
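
Of the verbs listed above, `select()` is the only one not demonstrated in the following slides. A minimal sketch, using the built-in `mtcars` data set (an assumption made here so that the example runs without {nycflights13}):

```r
library(dplyr)

# Keep only three columns (mtcars ships with base R)
cars_small <- select(mtcars, mpg, cyl, hp)
names(cars_small)

# Select columns by name pattern with a helper function
cars_d <- select(mtcars, starts_with("d"))
names(cars_d)

# Drop a column with a minus sign
cars_no_am <- select(mtcars, -am)
ncol(cars_no_am)
```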

---

## Filtering observations

```r
filter(flights, month == 1, day == 1)
```

```
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 832 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
```

---

## Sorting

```r
arrange(flights, desc(dep_delay))
```

```
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     9      641            900      1301     1242
 2  2013     6    15     1432           1935      1137     1607
 3  2013     1    10     1121           1635      1126     1239
 4  2013     9    20     1139           1845      1014     1457
 5  2013     7    22      845           1600      1005     1044
 6  2013     4    10     1100           1900       960     1342
 7  2013     3    17     2321            810       911      135
 8  2013     6    27      959           1900       899     1236
 9  2013     7    22     2257            759       898      121
10  2013    12     5      756           1700       896     1058
# ... with 336,766 more rows, and 12 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
```

---

## Adding/replacing variables

```r
mutate(flights, speed = distance / air_time * 60)
```

```
# A tibble: 336,776 x 20
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 13 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, speed <dbl>
```

---

## Piping operations

```r
flights2 <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(desc(dep_delay)) %>%
  mutate(speed = distance / air_time * 60)
print(flights2, n = 6)
```

```
# A tibble: 842 x 20
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      848           1835       853     1001
2  2013     1     1     2343           1724       379      314
3  2013     1     1     1815           1325       290     2120
4  2013     1     1     2205           1720       285       46
5  2013     1     1     1842           1422       260     1958
6  2013     1     1     2115           1700       255     2330
# ... with 836 more rows, and 13 more variables:
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, speed <dbl>
```

---

## Summarizing by group

```r
flights %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay)) %>%
  left_join(nycflights13::airlines)
```

```
Joining, by = "carrier"
```

```
# A tibble: 16 x 3
   carrier avg_arr_delay name
   <chr>           <dbl> <chr>
 1 F9             21.9   Frontier Airlines Inc.
 2 FL             20.1   AirTran Airways Corporation
 3 EV             15.8   ExpressJet Airlines Inc.
 4 YV             15.6   Mesa Airlines Inc.
 5 OO             11.9   SkyWest Airlines Inc.
 6 MQ             10.8   Envoy Air
 7 WN              9.65  Southwest Airlines Co.
 8 B6              9.46  JetBlue Airways
 9 9E              7.38  Endeavor Air Inc.
10 UA              3.56  United Air Lines Inc.
11 US              2.13  US Airways Inc.
12 VX              1.76  Virgin America
13 DL              1.64  Delta Air Lines Inc.
14 AA              0.364 American Airlines Inc.
15 HA             -6.92  Hawaiian Airlines Inc.
16 AS             -9.93  Alaska Airlines Inc.
```

---

## {dplyr} also works with databases

.footnote[Learn more with [this webinar](https://www.rstudio.com/resources/videos/best-practices-for-working-with-databases-webinar/).]
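
As a rough sketch of what this looks like in practice, the same verbs can be applied to a table living in a database, with {dbplyr} translating them to SQL. The in-memory SQLite database and the `mtcars` table below are assumptions chosen for illustration; they are not from the original slides, and the packages {DBI}, {RSQLite} and {dbplyr} need to be installed.

```r
library(dplyr)

# Assumption: an in-memory SQLite database stands in for a real one
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")

# A lazy reference to the remote table; nothing is loaded into R yet
cars_db <- tbl(con, "mtcars")

query <- cars_db %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

show_query(query)         # print the SQL generated from the dplyr verbs
result <- collect(query)  # run the query and fetch the result as a tibble
result

DBI::dbDisconnect(con)
```

The query is only sent to the database when `collect()` (or printing) asks for the result, so the heavy lifting stays on the database side.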

---

## Machine Learning & Deep Learning

### Package {caret} (February session)

The caret package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process for creating predictive models (see [the full documentation](http://topepo.github.io/caret/index.html)). The package contains tools for:

- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation
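
A minimal sketch of three of the tools listed above (data splitting, pre-processing and model tuning using resampling) on the `iris` data. The choice of a k-nearest-neighbours model, the 80/20 split and 5-fold cross-validation are illustrative assumptions, not recommendations from the slides:

```r
library(caret)

set.seed(1)

# Data splitting: keep 80% of each species for training
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# Pre-processing (centering and scaling) and tuning of k
# with 5-fold cross-validation over 3 candidate values
fit <- train(
  Species ~ ., data = training,
  method     = "knn",
  preProcess = c("center", "scale"),
  trControl  = trainControl(method = "cv", number = 5),
  tuneLength = 3
)

# Predict on the held-out data and compute the accuracy
pred <- predict(fit, newdata = testing)
mean(pred == testing$Species)
```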

### Keras & TensorFlow in R (January session)

Keras and TensorFlow can be used directly from R:

- [TensorFlow for R](https://tensorflow.rstudio.com/)

- [TensorFlow for R blog](https://blogs.rstudio.com/tensorflow/)

---

# Visualization

---

## Package {ggplot2} and extensions (June session)
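
A minimal sketch of the {ggplot2} grammar, redrawing the earlier `iris` scatter plot for two of its variables (the choice of variables, layers and theme is illustrative only):

```r
library(ggplot2)

# Map variables to aesthetics, then add layers one by one
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +                            # one point per flower
  geom_smooth(method = "lm", se = FALSE) +  # one regression line per species
  labs(x = "Petal length (cm)", y = "Petal width (cm)") +
  theme_bw()

print(p)
```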

---

## Animate graphics with {[gganimate](https://github.com/thomasp85/gganimate)}

---

## Fancy graphics: [alluvial diagrams](https://github.com/mbojan/alluvial)

---

## Image processing

- {[magick](https://github.com/ropensci/magick)}

- {[imager](https://github.com/dahtah/imager)} (October session)

---

# Reporting

---

## R Markdown (April session)

- Reports (analyses, etc.) with text, code and results in the same place, in many possible output formats including HTML, PDF, MS Word and Beamer.

- HTML presentations (like this one! -- see [source code](https://github.com/privefl/R-presentation/blob/master/whatrcando.Rmd))

- websites (such as [the website of our R user group](https://r-in-grenoble.github.io/))

- books (or even [a thesis](https://keurcien.github.io/book/introduction.html))

---

# Web

---

## Web scraping

```r
library(rvest)

read_html("https://r-in-grenoble.github.io/sessions.html") %>%
  html_nodes(".schedule") %>%
  html_nodes(".center-title") %>%
  html_text() %>%
  gsub("\n", "", .) %>%
  writeLines()
```

```
What R can do for you
Image processing with package {imager}
Linear models in R
Manage your workflow with package {drake}
Deep Learning with package {tensorflow}
Machine Learning with package {caret}
Best coding practices
R Markdown
Data manipulation with package {dplyr}
Data vizualisation with package {ggplot2}
```

---

## Shiny apps: web apps in R

- Example 1: [Airbnb visualization in New York](https://yuyuhan0306.shinyapps.io/airbnb_yuhan/)

- Example 2: [Make pixel art models](https://florianprive.shinyapps.io/pixelart/)

[Learn more](https://privefl.github.io/advr38book/shiny.html)
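
For a sense of how little code a Shiny app needs, here is a minimal sketch. The app itself (a histogram of random values controlled by a slider) is a made-up example and is unrelated to the two apps linked above:

```r
library(shiny)

# User interface: one slider input and one plot output
ui <- fluidPage(
  titlePanel("Histogram of random values"),
  sliderInput("n", "Number of observations:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = NULL, xlab = "Value")
  })
}

app <- shinyApp(ui = ui, server = server)
# To launch the app in a browser, run: runApp(app)
```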

---

# High Performance Computing

---

## [Integrate C++ code with {Rcpp}](https://privefl.github.io/R-presentation/Rcpp.html)

Rcpp lives between R and C++, so that you can get

- the *performance of C++* and

- the *convenience of R*.

Since I love *performance* and also enjoy *simplicity*, Rcpp might be my favorite R package.
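
A minimal sketch of what this looks like, assuming a working C++ compiler is available. The function `sum_cpp()` is a hypothetical example written for this slide; it simply re-implements `sum()` for numeric vectors:

```r
library(Rcpp)

# Compile a C++ function and make it callable from R under the same name
cppFunction("
double sum_cpp(NumericVector x) {
  double total = 0;
  int n = x.size();
  for (int i = 0; i < n; i++) {
    total += x[i];
  }
  return total;
}
")

x <- c(1, 2, 3.5)
sum_cpp(x)  # same result as sum(x)
```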

---

## Easy parallelism with {[future](https://github.com/HenrikBengtsson/future)}

<blockquote class="twitter-tweet" data-lang="en" align="center">future 1.0.0 on CRAN - cross-platform parallel evaluation via a single unified API <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> <a href="https://t.co/uxIozDAWHA">https://t.co/uxIozDAWHA</a> <a href="https://t.co/wV5vhcgpMJ">pic.twitter.com/wV5vhcgpMJ</a>&mdash; Henrik Bengtsson (@henrikbengtsson) <a href="https://twitter.com/henrikbengtsson/status/746906359484973057?ref_src=twsrc%5Etfw">June 26, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

.footnote[Also see [my intro to parallelism with {foreach}](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/).]
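
A minimal sketch of the "single unified API" mentioned in the tweet, assuming two background R sessions are enough for illustration (the computations themselves are placeholders):

```r
library(future)

# Evaluate futures in 2 background R sessions
plan(multisession, workers = 2)

# Each future() call returns immediately; the expression runs in the background
f1 <- future(sum(1:10))
f2 <- future(mean(c(2, 4, 6)))

# value() blocks until the corresponding result is available
res1 <- value(f1)
res2 <- value(f2)

# Switch back to ordinary sequential evaluation
plan(sequential)
```

Changing only the `plan()` line is enough to run the same code sequentially, on several local processes, or on a cluster.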

---

## Scalable reproducible workflow with {[drake](https://ropensci.github.io/drake/)} (December session)

---

## Large matrices with {[bigstatsr](https://github.com/privefl/bigstatsr)}

### Advantages of using FBM objects

- you can apply algorithms on **data larger than your RAM**,

- you can easily **parallelize** your algorithms because the data on disk is shared,

- you write **more efficient algorithms** (you make fewer copies and think more about what you're doing),

- you can use **different types of data**; for example, in my field, I store my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See [the documentation of the FBM class](https://privefl.github.io/bigstatsr/reference/FBM-class.html) for details.
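
A small sketch of the storage point above, assuming toy dimensions (1000 x 50) and temporary backing files. It compares the on-disk size of a `"double"` FBM and an `"unsigned char"` FBM, then computes column statistics without loading the whole matrix in memory:

```r
library(bigstatsr)

# Stored on disk with 8 bytes per element, like a standard R matrix
X_dbl <- FBM(1000, 50, type = "double", init = 1)

# Same dimensions, but only 1 byte per element
X_byte <- FBM(1000, 50, type = "unsigned char", init = 0)

# Size in bytes of the two backing files
file.size(X_dbl$backingfile)
file.size(X_byte$backingfile)

# Standard matrix accessors read from the file on disk
X_dbl[1:3, 1:2]

# Column sums and variances, computed by the package on the FBM
stats <- big_colstats(X_dbl)
head(stats)
```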

---

# RStudio

---

## RStudio IDE really helps

- console / scripts / environment / plots

- code diagnostics

- projects (+ git panel)

- viewer / debugger / profiler

- interactive import / connection

- integrated terminal / HTML viewer

- support for many programming languages

---

# Where to learn R?

---

## Where to learn R?

- [An Introduction to R](https://colinfay.me/intro-to-r/) by the R core team

- [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r) by DataCamp

- [R for Data Science](http://r4ds.had.co.nz/index.html) by Garrett Grolemund & Hadley Wickham, and [some solutions](https://jrnold.github.io/r4ds-exercise-solutions/)

- [Advanced R](http://adv-r.had.co.nz/) by Hadley Wickham, and [some solutions](https://bookdown.org/Tazinho/Advanced-R-Solutions/)

- [Useful packages for Data Science](https://github.com/rstudio/RStartHere)

- [CRAN Task Views](https://cran.r-project.org/web/views/)

- Course: [Advanced R course](https://privefl.github.io/advr38book/index.html) for PhD students in Grenoble (and 5 other open spots). **In French, but may be given in English if there is enough demand.**

- Read code, documentation, blog posts, etc. And PRACTICE.

- Learn from others

    - [join the French-speaking R community](https://join.slack.com/t/r-grrr/shared_invite/enQtMzI4MzgwNTc4OTAxLWZlOGZiZTBiMWU0NDQ3OTYzOGE1YThiODgwZWNhNWEyYjI4ZDJiNmNhY2YyYWI5YzFiOTFkNDYxYzkwODUwNWM)
    - [join the R-Ladies community](https://rladies-community-slack.herokuapp.com/)

---

<blockquote class="twitter-tweet" data-lang="en" align="center" size="50%">New <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> post: &quot;Where to get help with your R question?&quot; <a href="https://t.co/ilIarU1518">https://t.co/ilIarU1518</a> ❓❔⁉️ <a href="https://t.co/u0FB7FtAla">pic.twitter.com/u0FB7FtAla</a>&mdash; Maëlle Salmon 🐟 (@ma_salmon) <a href="https://twitter.com/ma_salmon/status/1021052562580045824?ref_src=twsrc%5Etfw">July 22, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## Schedule

---

# Thanks Grenoble Alpes Data Institute

## for food, ecocups and stickers

---

# Thanks!

**Slides:** `bit.ly/RUGgre11`

[privefl](https://twitter.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [privefl](https://github.com/privefl) &nbsp;&nbsp;&nbsp;&nbsp; [F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9)