class: title-slide center middle inverse

# What R can do for you

<br>

## Florian Privé

### Grenoble RUG - September 13, 2018

<br>

**Slides:** `bit.ly/RUGgre11`

---

## Contents

<br>

- Statistics & Data Science
- Visualization
- High Performance Computing
- Web
- Reporting
- RStudio IDE
- Community
- Learn R
- Program for this year

---

class: center middle inverse

# Statistics & Data Science

---

## Statistics

<br>

> R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
>
> -- https://www.r-project.org/about.html

---

## Work with many kinds of data

<br>

- tabular tidy data (see [this book](http://r4ds.had.co.nz/))
- spatial (see [this book](https://bookdown.org/robinlovelace/geocompr/) and [this blog](https://statnmap.com/))
- temporal (see [this book](https://otexts.org/fpp2/))
- textual (see [this book](https://www.tidytextmining.com/))
- networks (see [this book](https://sites.fas.harvard.edu/~airoldi/pub/books/BookDraft-CsardiNepuszAiroldi2016.pdf))
- etc
- etc
- etc

---

## CRAN task views

<br>

Browse https://cran.r-project.org/web/views/.

> CRAN task views aim to provide some guidance which packages on CRAN are relevant for tasks related to a certain topic.

They are very useful for discovering the packages used in a field of research.

<br>

## Bioconductor

<br>

Search engine: https://www.bioconductor.org/packages/devel/BiocViews.html

---

### Simple example

```r
plot(iris, pch = 20, col = iris$Species)
```

<img src="whatrcando_files/figure-html/unnamed-chunk-1-1.svg" width="80%" style="display: block; margin: auto;" />

---

### Simple example

```r
pca <- prcomp(iris[, -5], center = TRUE, scale. = TRUE)
plot(pca$x, pch = 20, col = iris$Species)
```

<img src="whatrcando_files/figure-html/unnamed-chunk-2-1.svg" width="80%" style="display: block; margin: auto;" />

---

### Simple example (November session)

```r
summary(fit <- lm(Petal.Length ~ ., data = iris))
```

```

Call:
lm(formula = Petal.Length ~ ., data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.78396 -0.15708  0.00193  0.14730  0.65418

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       -1.11099    0.26987  -4.117 6.45e-05 ***
Sepal.Length       0.60801    0.05024  12.101  < 2e-16 ***
Sepal.Width       -0.18052    0.08036  -2.246   0.0262 *
Petal.Width        0.60222    0.12144   4.959 1.97e-06 ***
Speciesversicolor  1.46337    0.17345   8.437 3.14e-14 ***
Speciesvirginica   1.97422    0.24480   8.065 2.60e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2627 on 144 degrees of freedom
Multiple R-squared:  0.9786, Adjusted R-squared:  0.9778
F-statistic:  1317 on 5 and 144 DF,  p-value: < 2.2e-16
```

---

## Data manipulation with {dplyr} (May session)

```r
library(dplyr)
(flights <- nycflights13::flights)
```

```
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
```

---

## Data manipulation with {dplyr}

The R package {dplyr} aims to provide a function for each basic verb of data manipulation:

- `filter()`
- `arrange()`
- `select()`
- `mutate()`
- `group_by()`
- `summarise()`
- and many others...
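
As a minimal sketch of how some of these verbs combine, here is `select()` chained with `group_by()` and `summarise()`. It uses the built-in `iris` data rather than `flights` so that it runs without {nycflights13}; the names `petal_summary`, `n` and `mean_petal_length` are chosen for this illustration only.

```r
library(dplyr)

# Built-in `iris` data; `petal_summary` is a name chosen for this illustration
petal_summary <- iris %>%
  select(Species, Petal.Length) %>%   # keep only two columns
  group_by(Species) %>%               # one group per species
  summarise(
    n = n(),                                # number of flowers per group
    mean_petal_length = mean(Petal.Length)  # average petal length per group
  )

petal_summary
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes chaining with `%>%` possible.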
---

## Filtering observations

```r
filter(flights, month == 1, day == 1)
```

```
# A tibble: 842 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 832 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
```

---

## Sorting

```r
arrange(flights, desc(dep_delay))
```

```
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     9      641            900      1301     1242
 2  2013     6    15     1432           1935      1137     1607
 3  2013     1    10     1121           1635      1126     1239
 4  2013     9    20     1139           1845      1014     1457
 5  2013     7    22      845           1600      1005     1044
 6  2013     4    10     1100           1900       960     1342
 7  2013     3    17     2321            810       911      135
 8  2013     6    27      959           1900       899     1236
 9  2013     7    22     2257            759       898      121
10  2013    12     5      756           1700       896     1058
# ... with 336,766 more rows, and 12 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>
```

---

## Adding/replacing variables

```r
mutate(flights, speed = distance / air_time * 60)
```

```
# A tibble: 336,776 x 20
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ... with 336,766 more rows, and 13 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, speed <dbl>
```

---

## Piping operations

```r
flights2 <- flights %>%
  filter(month == 1, day == 1) %>%
  arrange(desc(dep_delay)) %>%
  mutate(speed = distance / air_time * 60)

print(flights2, n = 6)
```

```
# A tibble: 842 x 20
   year month   day dep_time sched_dep_time dep_delay arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>
1  2013     1     1      848           1835       853     1001
2  2013     1     1     2343           1724       379      314
3  2013     1     1     1815           1325       290     2120
4  2013     1     1     2205           1720       285       46
5  2013     1     1     1842           1422       260     1958
6  2013     1     1     2115           1700       255     2330
# ... with 836 more rows, and 13 more variables:
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, speed <dbl>
```

---

## Summarizing by group

```r
flights %>%
  group_by(carrier) %>%
  summarize(avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay)) %>%
  left_join(nycflights13::airlines)
```

```
Joining, by = "carrier"
```

```
# A tibble: 16 x 3
   carrier avg_arr_delay name
   <chr>           <dbl> <chr>
 1 F9             21.9   Frontier Airlines Inc.
 2 FL             20.1   AirTran Airways Corporation
 3 EV             15.8   ExpressJet Airlines Inc.
 4 YV             15.6   Mesa Airlines Inc.
 5 OO             11.9   SkyWest Airlines Inc.
 6 MQ             10.8   Envoy Air
 7 WN              9.65  Southwest Airlines Co.
 8 B6              9.46  JetBlue Airways
 9 9E              7.38  Endeavor Air Inc.
10 UA              3.56  United Air Lines Inc.
11 US              2.13  US Airways Inc.
12 VX              1.76  Virgin America
13 DL              1.64  Delta Air Lines Inc.
14 AA              0.364 American Airlines Inc.
15 HA             -6.92  Hawaiian Airlines Inc.
16 AS             -9.93  Alaska Airlines Inc.
```

---

## {dplyr} also works with databases

<img src="https://d33wubrfki0l68.cloudfront.net/738885c8f54f3ab6118545469c28cd6635fcd656/96e0d/homepage/interact.png" width="90%" style="display: block; margin: auto;" />

.footnote[Learn more with [this webinar](https://www.rstudio.com/resources/videos/best-practices-for-working-with-databases-webinar/).]

---

## Machine Learning & Deep Learning

### Package {caret} (February session)

The caret package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process for creating predictive models (see [the full documentation](http://topepo.github.io/caret/index.html)). The package contains tools for:

- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation

### Keras & TensorFlow in R (January session)

Keras & TensorFlow are integrated in R

- [TensorFlow for R](https://TensorFlow.rstudio.com/)
- [TensorFlow for R blog](https://blogs.rstudio.com/TensorFlow/)

---

class: center middle inverse

# Visualization

---

## Package {ggplot2} and extensions (June session)

<img src="http://www.sthda.com/english/rpkgs/ggpubr/tools/README-ggpubr-box-plot-dot-plots-strip-charts-3.png" width="75%" style="display: block; margin: auto;" />

---

## Animate graphics with {[gganimate](https://github.com/thomasp85/gganimate)}

<img src="whatrcando_files/figure-html/unnamed-chunk-12-1.gif" width="70%" style="display: block; margin: auto;" />

---

## Fancy graphics: [alluvial diagrams](https://github.com/mbojan/alluvial)

<img src="whatrcando_files/figure-html/unnamed-chunk-13-1.svg" width="80%" style="display: block; margin: auto;" />

.footnote[More nice plots in [the R Graph Gallery](https://www.r-graph-gallery.com/).]
---

## Image processing

- {[magick](https://github.com/ropensci/magick)}
- {[imager](https://github.com/dahtah/imager)} (October session)

<img src="parrots.gif" width="80%" style="display: block; margin: auto;" />

---

class: center middle inverse

# Reporting

---

## R Markdown (April session)

<br>

- Reports (analysis, etc.) with text, code and results in the same place! With many possible output formats including HTML, PDF, MS Word, beamer, etc.
- HTML presentations (like this one! -- see [source code](https://github.com/privefl/R-presentation/blob/master/whatrcando.Rmd))
- websites (such as [the website of our R user group](https://r-in-grenoble.github.io/))
- books (or even [a thesis](https://keurcien.github.io/book/introduction.html))

---

class: center middle inverse

# Web

---

## Web scraping

```r
library(rvest)

# Extract the session titles from the schedule page of the R user group
read_html("https://r-in-grenoble.github.io/sessions.html") %>%
  html_nodes(".schedule") %>%
  html_nodes(".center-title") %>%
  html_text() %>%
  gsub("\n", "", .) %>%
  writeLines()
```

```
What R can do for you
Image processing with package {imager}
Linear models in R
Manage your workflow with package {drake}
Deep Learning with package {tensorflow}
Machine Learning with package {caret}
Best coding practices
R Markdown
Data manipulation with package {dplyr}
Data vizualisation with package {ggplot2}
```

---

## Shiny apps: web apps in R

<br>

- Example 1: [Airbnb visualization in New York](https://yuyuhan0306.shinyapps.io/airbnb_yuhan/)
- Example 2: [Make pixel art models](https://florianprive.shinyapps.io/pixelart/)

<br>

[Learn more](https://privefl.github.io/advr38book/shiny.html)

---

class: center middle inverse

# High Performance Computing

---

## [Integrate C++ code with {Rcpp}](https://privefl.github.io/R-presentation/Rcpp.html)

<br>

Rcpp lives between R and C++, so that you can get

- the *performance of C++* and
- the *convenience of R*.

<br>

As

- I love *performance* and
- I also enjoy *simplicity*,

Rcpp might be my favorite R package.
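
A minimal sketch of what this looks like in practice, assuming a working C++ toolchain is installed: `Rcpp::cppFunction()` compiles a C++ function from a string and makes it callable from R. The function name `sum_squares` is hypothetical and chosen for this illustration only.

```r
library(Rcpp)

# Compile a small C++ function and expose it to R under the same name.
# `sum_squares` is a hypothetical name chosen for this illustration.
cppFunction('
double sum_squares(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i] * x[i];  // explicit loop, cheap in C++
  }
  return total;
}')

x <- c(1, 2, 3)
sum_squares(x)  # C++ version, called like any R function
sum(x^2)        # pure R equivalent, for comparison
```

For a vectorized operation like this one, base R is already fast; the gain from C++ usually shows up in loops that cannot easily be vectorized.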
---

## Easy parallelism with {[future](https://github.com/HenrikBengtsson/future)}

<blockquote class="twitter-tweet" data-lang="en" align="center"><p lang="en" dir="ltr">future 1.0.0 on CRAN - cross-platform parallel evaluation via a single unified API <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> <a href="https://t.co/uxIozDAWHA">https://t.co/uxIozDAWHA</a> <a href="https://t.co/wV5vhcgpMJ">pic.twitter.com/wV5vhcgpMJ</a></p>— Henrik Bengtsson (@henrikbengtsson) <a href="https://twitter.com/henrikbengtsson/status/746906359484973057?ref_src=twsrc%5Etfw">June 26, 2016</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

.footnote[Also see [my intro to parallelism with {foreach}](https://privefl.github.io/blog/a-guide-to-parallelism-in-r/).]

---

## Scalable reproducible workflow with {[drake](https://ropensci.github.io/drake/)} (December session)

<br>

<img src="https://ropensci.github.io/drake/images/graph.png" width="85%" style="display: block; margin: auto;" />

---

## Large matrices with {[bigstatsr](https://github.com/privefl/bigstatsr)}

### Advantages of using FBM objects

<br>

- you can apply algorithms on **data larger than your RAM**,
- you can easily **parallelize** your algorithms because the data on disk is shared,
- you write **more efficient algorithms** (you make fewer copies and think more about what you're doing),
- you can use **different types of data**, for example, in my field, I’m storing my data with only 1 byte per element (rather than 8 bytes for a standard R matrix). See [the documentation of the FBM class](https://privefl.github.io/bigstatsr/reference/FBM-class.html) for details.
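
A minimal sketch of the last point, assuming {bigstatsr} is installed: an FBM created with `type = "unsigned char"` stores one byte per element in a file on disk. The dimensions and values below are arbitrary and chosen for illustration only.

```r
library(bigstatsr)

# Create a 1000 x 10 matrix stored on disk (in a temporary file by default),
# using 1 byte per element instead of the 8 bytes of a standard R double.
X <- FBM(1000, 10, type = "unsigned char", init = 0)

# Fill the first column with the usual matrix-like accessor
X[, 1] <- rep(1:4, 250)

dim(X)                    # same interface as a regular matrix
X[1:5, 1:2]               # only the requested part is read into memory
file.size(X$backingfile)  # size of the backing file on disk, in bytes
```

Values stored as `"unsigned char"` must fit in the range 0 to 255, which is the price to pay for the smaller footprint.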
---

class: center middle inverse

# RStudio

---

## RStudio IDE really helps

<br>

- console / scripts / environment / plots
- code diagnostics
- projects (+ git panel)
- viewer / debugger / profiler
- interactive import / connection
- integrated terminal / HTML viewer
- support for many programming languages

---

class: center middle inverse

# Where to learn R?

---

## Where to learn R?

- [An Introduction to R](https://colinfay.me/intro-to-r/) by the R core team
- [Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r) by DataCamp
- [R for Data Science](http://r4ds.had.co.nz/index.html) by Garrett Grolemund & Hadley Wickham, and [some solutions](https://jrnold.github.io/r4ds-exercise-solutions/)
- [Advanced R](http://adv-r.had.co.nz/) by Hadley Wickham, and [some solutions](https://bookdown.org/Tazinho/Advanced-R-Solutions/)
- [Useful packages for Data Science](https://github.com/rstudio/RStartHere)
- [CRAN Task Views](https://cran.r-project.org/web/views/)
- Course: [Advanced R course](https://privefl.github.io/advr38book/index.html) for PhD students in Grenoble (and 5 other open spots). **In French, but may be given in English if there is enough demand.**
- Read code, documentation, blog posts, etc. And PRACTICE.
- Learn from others
    - [join the French-speaking R community](https://join.slack.com/t/r-grrr/shared_invite/enQtMzI4MzgwNTc4OTAxLWZlOGZiZTBiMWU0NDQ3OTYzOGE1YThiODgwZWNhNWEyYjI4ZDJiNmNhY2YyYWI5YzFiOTFkNDYxYzkwODUwNWM)
    - [join the R-Ladies community](https://rladies-community-slack.herokuapp.com/)

---

<blockquote class="twitter-tweet" data-lang="en" align="center" size="50%"><p lang="en" dir="ltr">New <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> post: "Where to get help with your R question?" <a href="https://t.co/ilIarU1518">https://t.co/ilIarU1518</a><br><br>❓❔⁉️ <a href="https://t.co/u0FB7FtAla">pic.twitter.com/u0FB7FtAla</a></p>— Maëlle Salmon 🐟 (@ma_salmon) <a href="https://twitter.com/ma_salmon/status/1021052562580045824?ref_src=twsrc%5Etfw">July 22, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## Schedule

<br>

<img src="whatrcando_files/figure-html/unnamed-chunk-18-1.png" width="90%" style="display: block; margin: auto;" />

---

class: center middle

# Thanks Grenoble Alpes Data Institute

<img src="http://www.sciencespo-grenoble.fr/wp-content/uploads/2018/04/logo-data-institute.jpg" width="80%" style="display: block; margin: auto;" />

## for food, ecocups and stickers

---

class: inverse, center, middle

# Thanks!

<br>

**Slides:** `bit.ly/RUGgre11`

<br>
[privefl](https://twitter.com/privefl)
[privefl](https://github.com/privefl)
[F. Privé](https://stackoverflow.com/users/6103040/f-priv%c3%a9)

.footnote[Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan).]