Why I rarely use apply
In this short post, I explain why I’m moving away from using the function apply.
With matrices
It’s okay to use apply with a dense matrix, although you can often use an equivalent that is faster.
N <- M <- 8000
X <- matrix(rnorm(N * M), N)
system.time(res1 <- apply(X, 2, mean))
## user system elapsed
## 0.73 0.05 0.78
system.time(res2 <- colMeans(X))
## user system elapsed
## 0.05 0.00 0.05
stopifnot(isTRUE(all.equal(res2, res1)))
“Yeah, there are colSums and colMeans, but what about computing standard deviations?”
There are lots of apply-like functions in package {matrixStats}.
system.time(res3 <- apply(X, 2, sd))
## user system elapsed
## 0.96 0.01 0.97
system.time(res4 <- matrixStats::colSds(X))
## user system elapsed
## 0.2 0.0 0.2
stopifnot(isTRUE(all.equal(res4, res3)))
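If adding a dependency is not an option, the same idea can be sketched with base R's vectorized helpers. The function below is only an illustration (the name col_sds is mine, it is not part of any package): it centers each column and then uses colSums, so no R function is called once per column. It assumes a numeric matrix without missing values.

```r
# Hypothetical helper: column standard deviations using only base R.
col_sds <- function(X) {
  centered <- sweep(X, 2, colMeans(X))        # subtract each column's mean
  sqrt(colSums(centered^2) / (nrow(X) - 1))   # n - 1 denominator, like sd()
}

set.seed(1)
X_small <- matrix(rnorm(200 * 50), 200)
stopifnot(isTRUE(all.equal(col_sds(X_small), apply(X_small, 2, sd))))
```

Note that sweep allocates a full copy of the matrix, so a dedicated implementation such as the one in {matrixStats} is probably still the better choice for large data.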
With data frames
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
apply(head(iris), 2, identity)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 "5.1" "3.5" "1.4" "0.2" "setosa"
## 2 "4.9" "3.0" "1.4" "0.2" "setosa"
## 3 "4.7" "3.2" "1.3" "0.2" "setosa"
## 4 "4.6" "3.1" "1.5" "0.2" "setosa"
## 5 "5.0" "3.6" "1.4" "0.2" "setosa"
## 6 "5.4" "3.9" "1.7" "0.4" "setosa"
A DATA FRAME IS NOT A MATRIX (it’s a list).
The first thing that apply does is convert the object to a matrix, which consumes memory and, in the previous example, turns all the data into strings (because a matrix can hold only one type).
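A quick way to see this conversion for yourself is to call as.matrix directly, which, as far as I can tell from the source of apply, is what it does with a data frame. This is just a small check on head(iris); the object names are arbitrary.

```r
df <- head(iris)

is.list(df)                   # TRUE: a data frame is a list of columns

m_all <- as.matrix(df)        # the factor column forces a character matrix
typeof(m_all)                 # "character"

m_num <- as.matrix(df[1:4])   # numeric columns only: the matrix stays numeric
typeof(m_num)                 # "double"
```

So the problem shows up as soon as one column is not numeric, and even in the all-numeric case the data is still copied into a new matrix.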
What can you use as a replacement for apply with a data frame?
If you want to operate on all columns, since a data frame is just a list, you can use sapply instead (or map* if you are a purrrist).
sapply(iris, typeof)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "double" "double" "double" "double" "integer"
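Because sapply decides how to simplify its result based on what the function returns, a stricter base R alternative is vapply, where you declare the expected type and length of each result. Here is a minimal sketch that computes the mean of the numeric columns only (num_cols and col_means are names chosen for this example):

```r
# Flag the numeric columns; each call must return exactly one logical.
num_cols <- vapply(iris, is.numeric, logical(1))

# One mean per numeric column; each call must return exactly one number.
col_means <- vapply(iris[num_cols], mean, numeric(1))
col_means
```

Each column keeps its own type here, and vapply stops with an error if a result does not match the declared template.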
If you want to operate on all rows, I recommend watching this webinar.
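I won't try to summarize the webinar here. As a rough sketch, here are two base R patterns for row-wise work that avoid apply; they are my own choice of examples, not necessarily what the webinar recommends. Both walk the columns in parallel, so every value keeps its original type.

```r
num <- iris[1:4]

# Row-wise maximum: pmax is vectorized across the columns passed to it.
row_max <- do.call(pmax, num)

# Arbitrary row-wise function: mapply passes the i-th element of each
# column to the function, i.e. the values of row i.
label <- mapply(
  function(species, len) paste0(species, ":", len),
  as.character(iris$Species), iris$Sepal.Length,
  USE.NAMES = FALSE
)

head(row_max)
head(label)
```

When a vectorized expression exists (for example with(iris, Sepal.Length * Sepal.Width)), it is usually simpler and faster than either pattern.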
With sparse matrices
The memory problem is even more serious when using apply with sparse matrices, which makes apply very slow for such data.
library(Matrix)
X.sp <- rsparsematrix(N, M, density = 0.01)
## X.sp is converted to a dense matrix when using `apply`
system.time(res5 <- apply(X.sp, 2, mean))
## user system elapsed
## 0.78 0.46 1.25
system.time(res6 <- Matrix::colMeans(X.sp))
## user system elapsed
## 0.01 0.00 0.02
stopifnot(isTRUE(all.equal(res6, res5)))
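To get a feel for the memory cost of that conversion, you can compare object sizes on a smaller matrix. This is a sketch with arbitrary dimensions (S, D and the size_* variables are names made up for this example); the exact ratio depends on the density of the matrix.

```r
library(Matrix)

set.seed(1)
S <- rsparsematrix(1000, 1000, density = 0.01)  # only about 1% of the cells are stored
D <- as.matrix(S)                               # the dense copy stores all 1,000,000 cells

size_sparse <- as.numeric(object.size(S))
size_dense  <- as.numeric(object.size(D))
size_dense / size_sparse   # how many times larger the dense copy is
```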
You could implement your own apply-like function for sparse matrices by viewing a sparse matrix as a data frame with 3 columns (i and j storing the positions of the non-zero elements, and x storing the values of these elements). Then, you could use a group_by-summarize approach.
For instance, for the previous example, you can do this in base R:
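apply2_sp <- function(X, FUN) {
  # Columns without any stored value keep the default 0
  res <- numeric(ncol(X))
  # Triplet form: slots i and j (0-based positions) and x (values)
  X2 <- as(X, "TsparseMatrix")
  tmp <- tapply(X2@x, X2@j, FUN)
  res[as.integer(names(tmp)) + 1] <- tmp
  res
}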
system.time(res7 <- apply2_sp(X.sp, sum) / nrow(X.sp))
## user system elapsed
## 0.03 0.00 0.03
stopifnot(isTRUE(all.equal(res7, res5)))
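To make the triplet view more concrete, here is a small sketch on a 3 x 3 matrix (all object names are made up for this example). It also illustrates a caveat of this approach: the function only sees the stored values, which is why the example above uses sum and divides by the number of rows instead of passing mean directly.

```r
library(Matrix)

# 3 x 3 matrix with three non-zero values; column 2 is entirely zero.
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(2, 4, 5), dims = c(3, 3))
trip <- as(m, "TsparseMatrix")

# The "data frame with 3 columns" view (i and j are 0-based in the slots).
triplets <- data.frame(i = trip@i, j = trip@j, x = trip@x)

# Sum per column, leaving 0 for columns with no stored values.
sums <- numeric(ncol(m))
by_col <- tapply(triplets$x, triplets$j, sum)
sums[as.integer(names(by_col)) + 1] <- by_col

stopifnot(isTRUE(all.equal(sums / nrow(m), Matrix::colMeans(m))))

# Caveat: a per-group mean ignores the zeros.
tapply(triplets$x, triplets$j, mean)   # 3 and 5, not the column means 2 and 5/3
```

So this trick is safe for functions where zeros do not change the result (such as sum), but it needs a correction for anything that depends on the number of elements.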
Conclusion
Using apply with a dense matrix is fine, but try to avoid it if you have a data frame or a sparse matrix.