When should you use a data frame in R?

Written on July 20, 2018

In this post, I try to show when a data frame is the appropriate data structure, and when it’s not.

What is a data frame?

A data frame is just a list of vectors of the same length, each vector being a column.

This may convince you:

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

is.list(iris)

## [1] TRUE

length(iris)

## [1] 5

sapply(iris, typeof)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##     "double"     "double"     "double"     "double"    "integer"

sapply(iris, length)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##          150          150          150          150          150
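
To make the "list of vectors" idea concrete, here is a minimal sketch that builds a small data frame from a plain list and then handles it with ordinary list tools. The names cols and df and the toy values are made up for illustration.

```r
# A plain named list of equal-length vectors (toy data, for illustration only)
cols <- list(id = 1:3, name = c("a", "b", "c"))

# Turn the list into a data frame; stringsAsFactors = FALSE keeps `name` as character
df <- as.data.frame(cols, stringsAsFactors = FALSE)

is.data.frame(df)   # TRUE
is.list(df)         # TRUE: a data frame is still a list

# List tools work column by column
vapply(df, class, character(1))

# Removing the class leaves a plain list with the same column names
names(unclass(df))
```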

What is a list?

A list is just a vector of references to objects in memory.

x <- 1:1e6
pryr::object_size(x)

## 4 MB

y <- list(x, x, x)
pryr::object_size(y)

## 4 MB

address <- data.table::address
address(x)

## [1] "000000001E49C530"

sapply(y, address)

## [1] "000000001E49C530" "000000001E49C530" "000000001E49C530"

So here y is a vector of 3 references, each pointing to the same object x in memory. This is very efficient because x does not need to be copied 3 times when creating y.
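
The flip side of sharing is what happens on modification. The sketch below changes one element of such a list and checks which addresses moved; only the modified element should get a new one. It reuses the data.table::address helper, and the names x, y, before and after are illustrative.

```r
address <- data.table::address  # same helper as in the post

x <- c(1.5, 2.5, 3.5)
y <- list(a = x, b = x, c = x)  # three references to the same vector
before <- sapply(y, address)

y$b[1] <- 0                     # modify only the second element

after <- sapply(y, address)
after == before                 # a: TRUE, b: FALSE, c: TRUE
x                               # x itself is untouched: 1.5 2.5 3.5
```

R copied the vector behind y$b before writing to it, because that vector was still referenced elsewhere; y$a and y$c keep pointing to the original x.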

Using package {dplyr}

Using {dplyr} operations such as mutate or select is very efficient, because they copy only the columns they change.

select:

library(dplyr)
mydf <- iris
mydf2 <- select(mydf, -Species)
sapply(mydf, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"

sapply(mydf2, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758"

So, when you use select, you get a new object. This object is a new data frame (a new list). Yet, remember that a list is nothing but a vector of references. So, this is extremely efficient because it creates only a new vector of 4 references pointing to objects already in memory.

mutate:

mydf3 <- mutate(mydf, Species = as.character(Species))
sapply(mydf, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "000000000B356168"

sapply(mydf3, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"

The same holds for mutate. You get a new object, yet you modified only the fifth variable. So the first 4 variables don’t have to be copied; your new data frame (list) can just point to the same 4 vectors in memory. R only creates a new character vector and points to it in the new object.

So, adding/removing/modifying one variable of a data frame is efficient because R doesn’t have to copy the other variables.
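
This is not specific to {dplyr}. As a rough sketch, assuming R 3.1.0 or later (where lists are copied shallowly), the same check can be run with base R column operations. The data frame df, the Ratio column and the kept vector are made up for this example.

```r
address <- data.table::address  # same helper as in the post

df <- iris
df$Species <- NULL                             # remove a column
df$Ratio <- df$Sepal.Length / df$Sepal.Width   # add a column
df$Petal.Width <- df$Petal.Width * 10          # replace a column

# Columns that were never touched still live at their original addresses
kept <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
sapply(df, address)[kept] == sapply(iris, address)[kept]   # all TRUE

# The replaced column is a new vector
address(df$Petal.Width) == address(iris$Petal.Width)       # FALSE
```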

What about modifying one row of a data frame?

If you modify the first row of a data frame, you modify the first element of every variable. If other references to these vectors exist, R copies all of them, which gives you a full copy of the data frame.

mydf4 <- mydf3
sapply(mydf3, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"

sapply(mydf4, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "000000001CE852F8" "000000001BC64EB8" "000000000B965428" "000000000B39A758" "0000000020451AB0"

mydf4[1, ] <- mydf3[1, ]
sapply(mydf4, address)

##       Sepal.Length        Sepal.Width       Petal.Length        Petal.Width            Species 
## "0000000029BAB238" "0000000029BAB718" "000000002841AB70" "000000002841B050" "000000002841B530"
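
To put a number on the difference, the following sketch changes a single cell through its column, then overwrites a whole row, and counts how many columns end up at a new address in each case. It uses the same data.table::address helper; df_col, df_row and the two counters are names made up here.

```r
address <- data.table::address  # same helper as in the post

df_col <- iris
df_row <- iris

df_col$Sepal.Length[1] <- 0     # change one cell, going through its column
df_row[1, ] <- iris[1, ]        # overwrite a whole row

# Count the columns whose address differs from the original
n_copied_col <- sum(sapply(df_col, address) != sapply(iris, address))
n_copied_row <- sum(sapply(df_row, address) != sapply(iris, address))

n_copied_col   # 1: only Sepal.Length was copied
n_copied_row   # 5: every column was copied
```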

Conclusion

It is appropriate to use data frames when you want to operate on variables, but not when you want to operate on rows. If you still want or need to do so, I recommend watching this webinar.