When should you use a data frame in R?
In this post, I show in which situations using a data frame is appropriate, and in which it is not.
Learn more with the Advanced R book.
What is a data frame?
A data frame is just a list of vectors of the same length, each vector being a column.
This may convince you:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
is.list(iris)
## [1] TRUE
length(iris)
## [1] 5
sapply(iris, typeof)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "double" "double" "double" "double" "integer"
sapply(iris, length)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 150 150 150 150 150
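To make the "list of vectors" idea concrete, here is a minimal sketch that builds a data frame by hand from a plain list. The toy data and the names cols and df are made up for illustration; only base R functions are used.

```r
# A plain list of equal-length vectors (hypothetical toy data).
cols <- list(id = 1:3, score = c(2.5, 3.0, 4.5), label = c("a", "b", "c"))

# as.data.frame() keeps the list and adds the attributes
# (class and row names) that make it a data frame.
df <- as.data.frame(cols, stringsAsFactors = FALSE)

is.list(df)             # a data frame is still a list
names(attributes(df))   # the attributes added on top of the list
str(unclass(df))        # dropping the class shows the underlying list

# Each column is just one element of that list.
identical(df[["score"]], cols[["score"]])
```

Seen this way, a column is simply one element of the list, which is why column-wise operations are the natural ones for a data frame.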
What is a list?
A list is just a vector of references to objects in memory.
x <- 1:1e6
pryr::object_size(x)
## 4 MB
y <- list(x, x, x)
pryr::object_size(y)
## 4 MB
address <- data.table::address
address(x)
## [1] "000000001E49C530"
sapply(y, address)
## [1] "000000001E49C530" "000000001E49C530" "000000001E49C530"
So, basically, here y is a vector of 3 references, each pointing to the same object x in memory. This is very efficient because there is no need to copy x 3 times when creating y.
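Sharing references like this is safe because R copies an object when it is modified through one of its references, so the other references never see the change. A small sketch with base R only (the values are arbitrary):

```r
x <- c(10, 20, 30)
y <- list(x, x, x)   # three references to the same vector

# Modify the first element of y only.
y[[1]][1] <- 99

# x and the other elements of y still hold the original values,
# because the modified element was copied before being changed.
print(x)
print(y[[1]])
print(y[[2]])
```

The copy happens only for the element that is modified; the untouched elements can keep pointing to the original vector.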
Using package {dplyr}
Using {dplyr} operations such as mutate or select is very efficient.
select:
library(dplyr)
mydf <- iris
mydf2 <- select(mydf, -Species)
sapply(mydf, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"
sapply(mydf2, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758"
So, when you use select, you get a new object. This object is a new data frame (a new list). Yet, remember that a list is nothing but a vector of references. So, this is extremely efficient because it creates only a new vector of 4 references pointing to objects already in memory.
mutate:
mydf3 <- mutate(iris, Species = as.character(Species))
sapply(mydf, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"
sapply(mydf3, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"
The same holds when using mutate. You get a new object, yet you modified only the fifth variable. So, the first 4 variables don’t have to be copied; your new data frame (list) can just point to the same 4 vectors in memory. R only creates a new character vector and points to it in the new object.
So, adding/removing/modifying one variable of a data frame is efficient because R doesn’t have to copy the other variables.
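The same reasoning should apply to column operations written in base R. The sketch below uses a made-up data frame and base R counterparts of select and mutate. The address check is optional and assumes the {data.table} package is installed; if the untouched column is shared, both comparisons are expected to be TRUE, but the exact behaviour may depend on your R version.

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# Drop a column: roughly what select(df, -b) does.
df2 <- df[setdiff(names(df), "b")]

# Replace one column: roughly what mutate(df, b = b * 2) does.
df3 <- df
df3$b <- df3$b * 2

# Optional check (assumption: {data.table} is available).
# Compare the address of the untouched column `a` across objects.
if (requireNamespace("data.table", quietly = TRUE)) {
  address <- data.table::address
  print(address(df[["a"]]) == address(df2[["a"]]))
  print(address(df[["a"]]) == address(df3[["a"]]))
}
```

In both cases only the list of references changes (plus the one new column for the replacement), while the original df is left untouched.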
What about modifying one row of a data frame?
If you modify the first row of a data frame, then you modify the first element of each variable. If there are other references to these vectors, R copies them all, giving you a full copy of the data frame.
mydf4 <- mydf3
sapply(mydf3, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"
sapply(mydf4, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"
mydf4[1, ] <- mydf3[1, ]
sapply(mydf4, address)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "0000000029BAB238" "0000000029BAB718" "000000002841AB70" "000000002841B050" "000000002841B530"
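When a row-by-row update can be expressed per variable, one way to stay on the efficient path is to compute the whole column at once and replace it in a single assignment. The following is only a sketch on made-up data, not a general recipe for every row-wise task:

```r
df <- data.frame(id = 1:5, value = c(1.5, 2.5, 3.5, 4.5, 5.5))

# Row-wise: one data frame assignment per row.
by_row <- df
for (i in seq_len(nrow(by_row))) {
  by_row[i, "value"] <- by_row[i, "value"] * 2
}

# Column-wise: one vectorized computation, one column replaced.
by_col <- df
by_col$value <- by_col$value * 2

# Both approaches give the same values.
print(all.equal(by_row, by_col))
```

The column-wise version touches a single variable, so the other variables of the data frame do not need to be copied.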
Conclusion
It is appropriate to use data frames when you want to operate on variables, but not when you want to operate on rows. If you still want or need to do so, I recommend that you watch this webinar.