R package to operate with data frames stored on disk
```r
# devtools::install_github("privefl/bigdfr")
library(bigdfr)

# Create a temporary file of ~349 MB (just as an example)
csv <- bigreadr::fwrite2(iris[rep(seq_len(nrow(iris)), 1e5), ],
                         tempfile(fileext = ".csv"))
format(file.size(csv), big.mark = ",")

# Read the csv file in FDF format
(X <- FDF_read(csv))
head(X)
file.size(X$backingfile)
X$types

# Standard {dplyr} operations
X2 <- X %>%
  filter(Species == "virginica", Sepal.Length < 5) %>%
  mutate(Sepal.Length = Sepal.Length + 1) %>%
  arrange(desc(Sepal.Length))

# Export as a tibble (fully in memory, e.g. after sufficient filtering)
as_tibble(X2)

# Another way to get a tibble is to use summarize()
X %>%
  group_by(Species) %>%
  summarize(min_length = min(Sepal.Length))
```
I use a binary file on disk to store variables. Operations like `mutate` grow the file to add new columns. Operations like `subset`, `filter` and `arrange` just use indices to access a subset of the file's rows. When (and only when) some columns are needed for a computation, the corresponding data are actually read into memory.
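For instance, because `filter` only records row indices, the backing file on disk should stay the same size — a minimal sketch, assuming {bigdfr} is installed and that a filtered FDF shares its parent's backing file:

```r
library(bigdfr)

X <- FDF(iris)  # write iris to a disk-backed FDF
size_before <- file.size(X$backingfile)

# Filtering stores indices; no data is copied or rewritten on disk
X_sub <- filter(X, Species == "setosa")
size_after <- file.size(X_sub$backingfile)

size_before == size_after  # expected to be TRUE under the assumption above
```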
- In `group_by`, variables are passed the same way as in `select`. If you want to use temporary variables, use `mutate` first.
- You can `summarize` data with a function that returns a value of length > 1 (you'll get a list-column).
- For variables modified with `mutate`, these columns always go last, even if they existed before. This means that you can do `FDF(iris) %>% mutate(Sepal.Width = Sepal.Width + 10) %>% pull()` to get the newly created "Sepal.Width" variable.
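To illustrate the list-column behavior mentioned above, here is a hypothetical sketch (assuming {bigdfr} is installed; `range()` returns two values per group):

```r
library(bigdfr)

FDF(iris) %>%
  group_by(Species) %>%
  summarize(range_length = range(Sepal.Length))
# Since range() returns a length-2 vector for each group,
# 'range_length' is a list-column: each cell holds c(min, max)
```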