# Chapter 5 Performance

Some resources used here or for further reading:

The people who say that “R is just always slow” are usually not great R programmers. It is true that writing inefficient R code is easy, yet writing efficient R code is also possible when you know what you’re doing. In this chapter, you will learn how to write R(cpp) code that is fast.

## 5.1 R’s memory management

## 5.2 Early advice

### 5.2.1 NEVER GROW A VECTOR

Example computing the cumulative sums of a vector:

```
x <- rnorm(1e4) # Try also with n = 1e5
system.time({
current_sum <- 0
res <- c()
for (x_i in x) {
current_sum <- current_sum + x_i
res <- c(res, current_sum)
}
})
```

```
#> user system elapsed
#> 0.156 0.000 0.156
```

Here, at each iteration, you reallocating a vector (of increasing size). Allocation of memory takes time as well as computations. This makes your code quadratic with the size of `x`

(if you multiply the size of `x`

by 2, you can expect the execution time to be multiplied by 4, for large sample sizes), whereas it should be only linear.

A good solution is to always pre-allocate your results (if you know the size):

```
system.time({
current_sum <- 0
res2 <- double(length(x))
for (i in seq_along(x)) {
current_sum <- current_sum + x[i]
res2[i] <- current_sum
}
})
```

```
#> user system elapsed
#> 0.004 0.000 0.003
```

`all.equal(res2, res)`

`#> [1] TRUE`

If you don’t know the size of the results, you can store them in a list and merge them afterwards:

```
system.time({
current_sum <- 0
res3 <- list()
for (i in seq_along(x)) {
current_sum <- current_sum + x[i]
res3[[i]] <- current_sum
}
})
```

```
#> user system elapsed
#> 0.008 0.000 0.005
```

`all.equal(unlist(res3), res)`

`#> [1] TRUE`

With recent versions of R (>= 3.4), you can efficiently grow a vector using

```
system.time({
current_sum <- 0
res4 <- c()
for (i in seq_along(x)) {
current_sum <- current_sum + x[i]
res4[i] <- current_sum
}
})
```

```
#> user system elapsed
#> 0.004 0.000 0.005
```

`all.equal(res4, res)`

`#> [1] TRUE`

Assigning to an element of a vector beyond the current length now over-allocates by a small fraction. The new vector is marked internally as growable, and the true length of the new vector is stored in the truelength field. This makes building up a vector result by assigning to the next element beyond the current length more efficient, though pre-allocating is still preferred. The implementation is subject to change and not intended to be used in packages at this time. (NEWS)

An even better solution would be to avoid the loop by using a vectorized function:

`system.time(res5 <- cumsum(x))`

```
#> user system elapsed
#> 0 0 0
```

`all.equal(res5, res)`

`#> [1] TRUE`

```
x <- rnorm(1e7)
system.time(cumsum(x))
```

```
#> user system elapsed
#> 0.036 0.012 0.050
```

As a second example, let us generate a matrix of uniform values (max changing for every column):

```
n <- 1e3
max <- 1:1000
system.time({
mat <- NULL
for (m in max) {
mat <- cbind(mat, runif(n, max = m))
}
})
```

```
#> user system elapsed
#> 0.560 0.028 0.588
```

`apply(mat, 2, max)[1:10]`

```
#> [1] 0.9999165 1.9979255 2.9954357 3.9937308 4.9959766 5.9940244 6.9865828 7.9811018
#> [9] 8.9987889 9.9838704
```

So, we can either pre-allocate a list or a matrix:

```
system.time({
l <- vector("list", length(max))
for (i in seq_along(max)) {
l[[i]] <- runif(n, max = max[i])
}
mat2 <- do.call("cbind", l)
})
```

```
#> user system elapsed
#> 0.028 0.000 0.028
```

`apply(mat2, 2, max)[1:10]`

```
#> [1] 0.997812 1.995219 2.996246 3.998122 4.998680 5.994925 6.984907 7.999305 8.992457
#> [10] 9.971088
```

```
system.time({
mat3 <- matrix(0, n, length(max))
for (i in seq_along(max)) {
mat3[, i] <- runif(n, max = max[i])
}
})
```

```
#> user system elapsed
#> 0.028 0.000 0.030
```

`apply(mat3, 2, max)[1:10]`

```
#> [1] 0.9997669 1.9946318 2.9963005 3.9843856 4.9987015 5.9957843 6.9919149 7.9928761
#> [9] 8.9948675 9.9859396
```

What is nice with using a list is that you don’t need to pre-allocate. Indeed, as opposed to atomic vectors, each element of a list is in different places in memory so that you don’t have to reallocate all the data when you add an element to a list.

```
system.time({
l <- list()
for (i in seq_along(max)) {
l[[i]] <- runif(n, max = max[i])
}
mat4 <- do.call("cbind", l)
})
```

```
#> user system elapsed
#> 0.028 0.000 0.029
```

`apply(mat4, 2, max)[1:10]`

```
#> [1] 0.9997064 1.9988569 2.9924517 3.9925108 4.9973405 5.9958494 6.9949641 7.9949342
#> [9] 8.9962179 9.9982783
```

Instead of pre-allocating yourself, you can use `sapply`

(or `lapply`

and calling `do.call()`

after, as previously done):

```
system.time(
mat4 <- sapply(max, function(m) runif(n, max = m))
)
```

```
#> user system elapsed
#> 0.032 0.000 0.030
```

`apply(mat4, 2, max)[1:10]`

```
#> [1] 0.9994166 1.9990283 2.9995770 3.9993237 4.9938192 5.9988797 6.9906763 7.9759771
#> [9] 8.9949382 9.9847357
```

Don’t listen to people telling you that `sapply()`

is a vectorized operation that is so much faster than loops.

### 5.2.2 Access columns of a matrix

When you do computations on a matrix, recall that a matrix is just a vector with some dimensions.

```
vec <- 1:20
dim(vec) <- c(4, 5)
vec
```

```
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 5 9 13 17
#> [2,] 2 6 10 14 18
#> [3,] 3 7 11 15 19
#> [4,] 4 8 12 16 20
```

So, as you can see in this example, R matrices are column-oriented, which means that elements of the same column are stored contiguously in memory. Therefore, accessing elements of the same column is fast.

### 5.2.3 Use the right function

Often, in order to optimize your code, you can simply find the right function to do what you need to do.

For example, using `rowMeans(x)`

instead of `apply(x, 1, mean)`

can save you a lot of time. For example, if you want more efficient functions that apply to rows and columns of matrices, you can check package {matrixStats}.

Another example is when reading large text files; in such cases, prefer using `data.table::fread()`

rather than `read.table()`

.

Generally, packages that uses C/Rcpp are efficient.

### 5.2.4 Do not try to optimize everything

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.”

— Donald Knuth.

If you try to optimize each and every part of your code, you will end up losing a lot of time writing it and it will probably less readable.

R is great at prototyping quickly because you can write code in a concise and easy way. Begin by doing just that. If performance matters, then profile your code to see which part of your code is taking too much time and optimize only this part!

Learn more on how to profile your code in RStudio in this article.

## 5.3 Vectorization

See this great blog post by Noam Ross to understand vectorization.

### 5.3.1 Exercise

Monte-Carlo integration (example from book Efficient R programming)

Suppose we wish to estimate the integral \(\int_0^1 x^2 dx\) using a Monte-Carlo method. Essentially, we throw darts at the curve and count the number of darts that fall below the curve (as in the following figure).

*Monte Carlo Integration pseudo-code*

- Initialize:
`hits = 0`

**for i in 1:N**- \(~~\) Generate two random numbers, \(U_1\) and \(U_2\), between 0 and 1
- \(~~\) If \(U_2 < U_1^2\), then
`hits = hits + 1`

**end for**- Area estimate =
`hits / N`

Naively implementing this Monte-Carlo algorithm in R would typically lead to something like:

```
monte_carlo <- function(N) {
hits <- 0
for (i in seq_len(N)) {
u1 <- runif(1)
u2 <- runif(1)
if (u1 ^ 2 > u2) {
hits <- hits + 1
}
}
hits / N
}
```

This takes a few seconds for `N = 1e6`

:

```
N <- 1e6
system.time(monte_carlo(N))
```

```
#> user system elapsed
#> 2.500 0.008 2.512
```

**Your task: Find a vectorized solution for this problem:**

`system.time(monte_carlo_vec(N))`

```
#> user system elapsed
#> 0.052 0.000 0.060
```

## 5.4 Algorithms & data structures

Sometimes, getting the right data structure (e.g. using a matrix instead of a data frame or integers instead of characters) can save you some computation time.

Is your algorithm doing some redundant computations making it e.g. quadratic instead of linear with respect to the dimension of your data?

See exercises (section 5.7) for some insights.

You can also find a detailed example in this blog post.

## 5.5 Rcpp

See this presentation.

You have this data and this working code (a loop) that is slow

```
mydf <- readRDS(system.file("extdata/one-million.rds", package = "advr38pkg"))
QRA_3Dmatrix <- array(0, dim = c(max(mydf$ID), max(mydf$Volume), 2))
for (i in seq_len(nrow(mydf))) {
# Row corresponds to IDcell
row <- mydf[[i, 1]]
# Column corresponds to the volume class
column <- mydf[[i, 3]]
# Number of events, initially zero, then +1
QRA_3Dmatrix[row, column, 1] <- QRA_3Dmatrix[row, column, 1] + 1
# Sum energy
QRA_3Dmatrix[row, column, 2] <- QRA_3Dmatrix[row, column, 2] +
1 - 1.358 / (1 + exp( (1000 * mydf[[i, 2]] - 129000) / 120300 ))
}
```

Rewrite this for-loop with Rcpp.

You can also try to use {dplyr} for this problem.

## 5.6 Linear algebra

In R, prefer using `crossprod(X)`

and `tcrossprod(X)`

instead of `t(X) %*% X`

and `X %*% t(X)`

. Also using `A %*% (B %*% y)`

and `solve(A, y)`

will be faster than `A %*% B %*% y`

and `solve(A) %*% y`

.

Don’t re-implement linear algebra operations (such as matrix products) yourself. There exist some highly optimized libraries for this. If you want to use linear algebra in Rcpp, try RcppArmadillo or RcppEigen.

If you want to use some optimized multi-threaded linear library, you can try Microsoft R Open.

### 5.6.1 Exercise

Rewrite this problem to use linear algebra instead of a loop:

```
N <- 1e5
x <- rnorm(N*3*3); dim(x) <- c(N,3,3)
y <- rnorm(N*3*3); dim(y) <- c(N,3,3)
system.time({
gg <- 0
for (n in 1:dim(x)[1]){
gg <- gg + t(x[n,,]) %*% y[n,,]
}
})
```

```
#> user system elapsed
#> 0.556 0.000 0.796
```

A possible faster solution takes

```
#> user system elapsed
#> 0.004 0.000 0.004
```

## 5.7 Exercises

Generate \(10^8\) (begin with \(10^4\)) steps of the process described by the formula:\[X(0)=0\]\[X(t+1)=X(t)+Y(t)\] where \(Y(t)\) are independent random variables with the distribution \(N(0,1)\). Then, calculate in what percentage of indices \(t\) the value of \(X(t)\) was negative. You don’t need to store values of \(X\) if you don’t want to. What would be the benefit of writing an Rcpp function over a simple vectorized R function?

```
set.seed(1)
system.time(p <- advr38pkg::random_walk_neg_prop(1e7))
```

```
#> user system elapsed
#> 0.412 0.000 0.413
```

`p`

`#> [1] 0.3400444`

```
mat <- as.matrix(mtcars)
ind <- seq_len(nrow(mat))
mat_big <- mat[rep(ind, 1000), ] ## 1000 times bigger dataset
last_row <- mat_big[nrow(mat_big), ]
```

Speed up these loops:

```
system.time({
for (j in 1:ncol(mat_big)) {
for (i in 1:nrow(mat_big)) {
mat_big[i, j] <- 10 * mat_big[i, j] * last_row[j]
}
}
})
```

```
#> user system elapsed
#> 0.408 0.000 0.408
```

Why `colSums()`

on a whole matrix is faster than on only half of it?

```
m0 <- matrix(rnorm(1e6), 1e3, 1e3)
microbenchmark::microbenchmark(
colSums(m0[, 1:500]),
colSums(m0)
)
```

```
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> colSums(m0[, 1:500]) 2296.281 2455.3760 2679.9311 2549.4040 2645.037 5700.785 100
#> colSums(m0) 787.159 805.2085 836.8259 827.3865 866.385 941.303 100
```

Try to speed up this code by vectorizing it first. Then, recode it in Rcpp and benchmark all the solutions you came up with.

```
M <- 50
step1 <- runif(M)
A <- rnorm(M)
N <- 1e4
tau <- matrix(0, N + 1, M)
tau[1, ] <- A
for (j in 1:M) {
for (i in 2:nrow(tau)) {
tau[i, j] <- tau[i - 1, j] + step1[j] * 1.0025^(i - 2)
}
}
```

Make a fast function that counts the number of elements between a sequence of breaks. Can you do it in base R? Try also implementing it in Rcpp. How can you implement a solution whose computation time doesn’t depend on the number of breaks? [Which are the special cases that you should consider?]

```
x <- sample(10, size = 1e4, replace = TRUE)
breaks <- c(1, 3, 9, 9.5, 10)
table(cut(x, breaks))
```

```
#>
#> (1,3] (3,9] (9,9.5] (9.5,10]
#> 2082 5925 0 994
```

`hist(x, breaks, plot = FALSE)$counts # includes first break`

`#> [1] 3081 5925 0 994`

`advr38pkg::count_by_breaks(x, breaks)`

`#> [1] 2082 5925 0 994`

`advr38pkg::count_by_breaks_fast(x, breaks)`

`#> [1] 2082 5925 0 994`

```
microbenchmark::microbenchmark(
table(cut(x, breaks)),
hist(x, breaks, plot = FALSE)$counts,
advr38pkg::count_by_breaks(x, breaks),
advr38pkg::count_by_breaks_fast(x, breaks)
)
```

```
#> Unit: microseconds
#> expr min lq mean median
#> table(cut(x, breaks)) 1894.059 1937.9325 2007.9992 1999.4865
#> hist(x, breaks, plot = FALSE)$counts 291.928 304.8845 326.1458 316.4815
#> advr38pkg::count_by_breaks(x, breaks) 251.511 263.4230 315.9483 277.8140
#> advr38pkg::count_by_breaks_fast(x, breaks) 151.654 162.1280 177.2954 168.2860
#> uq max neval
#> 2041.5885 3228.013 100
#> 351.6005 395.499 100
#> 399.6690 437.452 100
#> 197.0705 224.374 100
```

```
x2 <- sample(10, size = 1e5, replace = TRUE)
breaks2 <- seq(0, 10, length.out = 100)
microbenchmark::microbenchmark(
advr38pkg::count_by_breaks(x2, breaks),
advr38pkg::count_by_breaks_fast(x2, breaks),
advr38pkg::count_by_breaks(x2, breaks2),
advr38pkg::count_by_breaks_fast(x2, breaks2)
)
```

```
#> Unit: milliseconds
#> expr min lq mean median
#> advr38pkg::count_by_breaks(x2, breaks) 2.222262 2.274528 2.336109 2.311901
#> advr38pkg::count_by_breaks_fast(x2, breaks) 1.201984 1.239275 1.267131 1.264028
#> advr38pkg::count_by_breaks(x2, breaks2) 53.195051 89.826053 90.227292 90.150534
#> advr38pkg::count_by_breaks_fast(x2, breaks2) 1.228221 1.264827 1.309104 1.300410
#> uq max neval
#> 2.366786 3.284371 100
#> 1.295619 1.380691 100
#> 90.524636 126.812186 100
#> 1.326923 2.283741 100
```

An R user wants to implement some sampling on a sparse matrix and provides this working code:

```
library(Matrix)
N <- 100
m <- Matrix(0, nrow = N, ncol = N)
for (j in 1:N) {
cols <- sample((1:N)[-j], 2) # 2 columns != j
m[j, cols] <- 1
}
```

This code is slow; can you find two major reasons why?

How can you more efficiently assign 1s?

Can you use sampling with replacement (which can be easily vectorized) in this example?

Implement faster solutions in R and Rcpp.

Make a fast function that returns all prime numbers up to a number `N`

.

```
N <- 1e6
system.time(
primes <- advr38pkg::AllPrimesUpTo(N)
)
```

```
#> user system elapsed
#> 0.032 0.000 0.032
```

`plot(primes, pch = 20, cex = 0.5)`

Find a fast method to compute pairwise distances between 2 matrices. A naive R function would be:

```
naive_pdist <- function(A, B) {
# A: matrix with observation vectors (nrow = number of observations)
# B: matrix with another set of vectors (e.g. cluster centers)
result = matrix(ncol = nrow(B), nrow = nrow(A))
for (i in 1:nrow(A))
for (j in 1:nrow(B))
result[i,j] = sqrt(sum( (A[i,] - B[j,])^2 ))
result
}
```

To see a comparison of different computation strategies, see this nice blog post.

## 5.8 Parallel

I basically always use `foreach`

and recommend to do so. See my guide to parallelism in R with `foreach`

.

**Just remember to optimize your code before trying to parallelize it.**

Try to parallelize some of your best solutions for the previous exercises.