R package to operate on data frames stored on disk

LIST OF FUNCTIONS

Example

# devtools::install_github("privefl/bigdfr")
library(bigdfr)

# Create a temporary file of ~349 MB (just as an example)
csv <- bigreadr::fwrite2(iris[rep(seq_len(nrow(iris)), 1e5), ], 
                         tempfile(fileext = ".csv"))
format(file.size(csv), big.mark = ",")

# Read the csv file in FDF format
(X <- FDF_read(csv))
head(X)
file.size(X$backingfile)
X$types

# Standard {dplyr} operations
X2 <- X %>% 
  filter(Species == "virginica", Sepal.Length < 5) %>%
  mutate(Sepal.Length = Sepal.Length + 1) %>%
  arrange(desc(Sepal.Length))
  
# Export as tibble (fully in memory, e.g. after sufficient filtering)
as_tibble(X2)

# Another way to get a tibble is to use summarize()
X %>%
  group_by(Species) %>%
  summarize(min_length = min(Sepal.Length))

How does it work?

I use a binary file on disk to store variables. Operations like mutate grow the file to add new columns. Operations like subset, filter and arrange just use indices to access a subset of the file. Data are actually read into memory when (and only when) some columns are needed for a computation.
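
The idea can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions (double columns only, stored back to back in one file), not the actual {bigdfr} implementation; the names read_col, append_col and view are hypothetical.

```r
# Hypothetical sketch: a binary backing file plus row indices acting as a "view"
backingfile <- tempfile(fileext = ".bin")
n <- nrow(iris)

# Store two numeric columns back to back (8 bytes per double)
con <- file(backingfile, "wb")
writeBin(iris$Sepal.Length, con)
writeBin(iris$Sepal.Width, con)
close(con)

# Read column j from disk, then keep only the rows of the current view
read_col <- function(path, j, n, ind_row = seq_len(n)) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (j - 1) * n * 8)  # jump to the start of column j
  readBin(con, what = "double", n = n)[ind_row]
}

# Grow the file by appending one new column (what a mutate would need)
append_col <- function(path, x) {
  con <- file(path, "ab")
  on.exit(close(con))
  writeBin(as.double(x), con)
}

view <- list(path = backingfile, n = n, ind_row = seq_len(n))

# "filter(Sepal.Length < 5)": one column is read, only the indices change
sl <- read_col(view$path, 1, view$n, view$ind_row)
view$ind_row <- view$ind_row[sl < 5]

# "arrange(desc(Sepal.Length))": again, only the indices are reordered
sl <- read_col(view$path, 1, view$n, view$ind_row)
view$ind_row <- view$ind_row[order(sl, decreasing = TRUE)]

# "mutate(...)": the file grows by one column, read back through the view
append_col(view$path, iris$Sepal.Length + 1)
new_col <- read_col(view$path, 3, view$n, view$ind_row)
```

In this sketch, filtering and sorting never rewrite the file; they only update ind_row, which is why such operations can stay cheap even when the backing file is large.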

Differences with {dplyr}

  • In group_by, variables are passed the same way as in select. If you want to group by a temporary (computed) variable, create it with mutate first.
  • You are allowed to summarize data with a function that returns a value of length > 1 (you’ll get a list-column).
  • When adding columns to an FDF (e.g. with mutate), these columns always go last, even if columns with the same names already existed. This means that you can do FDF(iris) %>% mutate(Sepal.Width = Sepal.Width + 10) %>% pull() to get the newly created “Sepal.Width” variable.
  • You can’t have list-columns stored in an FDF.

TODO

  1. optimize when possible
  2. support factors
  3. implement n()
  4. implement fresh backingfile? (when subview is too small -> just use as_tibble()?)
  5. support dates