Functions big_(c)prodMat() and big_(t)crossprodSelf() now use much less memory, and may be faster.
Add covar_from_df() to convert a data frame with factors/characters to a numeric matrix using one-hot encoding.
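
A minimal sketch of calling covar_from_df(), assuming {bigstatsr} is installed. The toy data frame and its column names are made up for illustration.

```r
library(bigstatsr)

# Hypothetical toy data: one numeric, one factor and one character column
df <- data.frame(
  age   = c(23, 35, 41, 29),
  sex   = factor(c("F", "M", "M", "F")),
  group = c("a", "b", "c", "a"),
  stringsAsFactors = FALSE
)

# Convert to a numeric matrix; non-numeric columns become indicator columns
covar <- covar_from_df(df)
dim(covar)
colnames(covar)
```

The result is a plain numeric matrix, so it can be supplied wherever covariates are expected as a matrix, for example as covar.train in big_spLinReg().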
Add a new column $all_conv to the output of summary() for big_spLinReg() and big_spLogReg() to check whether all models have stopped because of “no more improvement”. Also add a new parameter sort to summary().
Now warn (enabled by default) if some models may not have reached a minimum when using big_spLinReg() and big_spLogReg().
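
The summary() additions can be tried on simulated data, as in the sketch below. The data, the seed and the chosen alphas are illustrative assumptions, not taken from the package documentation.

```r
library(bigstatsr)

set.seed(1)
# Simulated data (illustrative only): 200 samples, 50 variables
X <- FBM(200, 50, init = rnorm(200 * 50))
y <- X[, 1] - X[, 2] + rnorm(200)

# One set of K = 5 models is fitted per value of alpha
mod <- big_spLinReg(X, y, alphas = c(1, 0.1), K = 5)

# One row per alpha; sort = TRUE orders rows by validation loss
summ <- summary(mod, sort = TRUE)
summ[c("alpha", "validation_loss", "all_conv")]
```

A value of FALSE in all_conv flags a set of models where at least one fit stopped for a reason other than “no more improvement”, which is the situation the new warning reports.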
Make two different memory-mappings: one that is read-only (using $address) and one where it is possible to write (using $address_rw). This makes it possible to use file permissions to prevent modifying data.
Also add a new field $is_read_only to be used to prevent modifying data (at least with <-) even when you have write permissions to it. Functions creating an FBM now gain a parameter is_read_only.
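
A short sketch of the read-only flag, assuming a recent {bigstatsr}. The exact error raised on assignment is not quoted here, so the code only records whether the write was refused.

```r
library(bigstatsr)

# Create a small FBM flagged as read-only (left uninitialized on purpose)
X <- FBM(10, 10, is_read_only = TRUE)
X$is_read_only

# Reading still works
X[1:3, 1]

# Writing with `<-` is expected to be refused; catch the error to show it
outcome <- tryCatch({
  X[1, 1] <- 1
  "modified"
}, error = function(e) "refused")
outcome
```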
Make vector accessors (e.g. X[1:10]) faster.
Move some code to new packages {bigassertr} and {bigparallelr}.
big_randomSVD() gains arguments related to matrix-vector multiplication.
assert_noNA() is faster.
big_increment().
In plot.big_SVD(),
Can now plot many PCA scores (more than two) at once.
Use coord_fixed() when plotting PCA scores because it is good practice.
Use log-scale in scree plot to better see small differences in singular values.
Reexport cowplot::plot_grid() to merge multiple ggplots.
AUCBoot() is now 6-7 times faster.
big_univLogReg() for variables with no variation. IRLS was not converging, so glm() was used instead. The problem is that glm() drops dimensions causing singularities so that Z-score of the first covariate (or intercept) was used instead of a missing value.
Use mio instead of boost for memory-mapping.
Add a parameter base.row to predict.big_sp_list() and automatically detect whether it is needed (as well as for covar.row).
Possibility to subset a big_sp_list without losing attributes, so that one can access one model (corresponding to one alpha) even if it is not the ‘best’.
Add parameters pf.X and pf.covar in big_sp***Reg() to provide different penalization for each variable (possibly no penalization at all).
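
As an illustration of per-variable penalization, the sketch below leaves the first variable unpenalized. The simulated data and the choice of which variable to exempt are assumptions made for the example.

```r
library(bigstatsr)

set.seed(1)
# Simulated data (illustrative only)
X <- FBM(200, 50, init = rnorm(200 * 50))
y <- X[, 1] + rnorm(200)

# Penalty factors, one per column: 0 = not penalized, 1 = standard penalty
pf <- c(0, rep(1, ncol(X) - 1))

mod <- big_spLinReg(X, y, K = 5, pf.X = pf)
summary(mod)
```

pf.covar plays the same role for the columns of a covariate matrix passed through covar.train.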
Add %*%, crossprod and tcrossprod operations for ‘double’ FBMs.
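
A quick way to see these operators at work is to compare them with the same operations on the in-memory matrix X[]. This is a sketch on small random data, not a benchmark.

```r
library(bigstatsr)

set.seed(1)
X <- FBM(5, 3, init = rnorm(15))   # type "double" by default
A <- matrix(rnorm(6), nrow = 3, ncol = 2)

# Each FBM operation should agree with its base R counterpart on X[]
same_prod   <- all.equal(X %*% A, X[] %*% A)
same_cross  <- all.equal(crossprod(X), crossprod(X[]))
same_tcross <- all.equal(tcrossprod(X), tcrossprod(X[]))
c(same_prod, same_cross, same_tcross)
```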
Now also returns the number of non-zero variables ($nb_active) and the number of candidate variables ($nb_candidate) for each step of the regularization paths of big_spLinReg() and big_spLogReg().
warn and return.all of big_spLinReg() and big_spLogReg() are deprecated; now always return the maximum information. Now provide two methods (summary and plot) to get a quick assessment of the fitted models.
Check of missing values for input vectors (indices and targets) and matrices (covariables).
AUC() is now stricter: it accepts only 0s and 1s for target.
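
In practice this means recoding any other labelling to 0/1 before calling AUC(). The sketch below uses simulated predictions, and the "case"/"control" labels are hypothetical.

```r
library(bigstatsr)

set.seed(1)
target <- rep(0:1, each = 50)   # already coded as 0s and 1s
pred   <- target + rnorm(100)   # simulated predictions

auc_direct <- AUC(pred, target)

# Hypothetical labels in another coding: recode to 0/1 first
labels <- ifelse(target == 1, "case", "control")
auc_recoded <- AUC(pred, as.integer(labels == "case"))

c(auc_direct, auc_recoded)
```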
$bm() and $bm.desc() have been added in order to get an FBM as a filebacked.big.matrix. This enables using {bigmemory} functions.
big_read now has a filter argument to filter rows, and argument nrow has been removed because it is now determined when reading the first block of data.
Removed the save argument from FBM (and others); now, you must use FBM(...)$save() instead of FBM(..., save = TRUE).
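
A minimal sketch of the new pattern, assuming the default temporary backing file. The .rds path is read from the object rather than typed by hand.

```r
library(bigstatsr)

# Create an FBM, then explicitly save its .rds file next to the backing file
X <- FBM(10, 10, init = 0)$save()
file.exists(X$rds)

# The saved object can be re-attached later (e.g. in another session)
X2 <- big_attach(X$rds)
dim(X2)
```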
You can now fill an FBM using a data frame. Note that factors will be used as integers.
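
For example, with the built-in iris data set (a sketch; the FBM uses the default type "double"):

```r
library(bigstatsr)

X <- FBM(nrow(iris), ncol(iris))
X[] <- iris   # the factor column Species is stored as its integer codes

# One row from each species: the last column holds the integer codes
X[c(1, 51, 101), ]
```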
Package {bigreadr} has been developed and is now used by big_read.
options(bigstatsr.downcast.warning = FALSE), or you can use without_downcast_warning() to disable this warning for one call.
Possibility to add a “base predictor” for big_spLinReg and big_spLogReg.
Don’t store the whole regularization path (as a sparse matrix) in big_spLinReg and big_spLogReg anymore because it caused major slowdowns.
Directly average the K predictions in predict.big_sp_best_list.
Only use the “PSOCK” type of cluster because “FORK” can leave zombies behind. You can change this with options(bigstatsr.cluster.type = "PSOCK").
Fix a bug in big_spLinReg related to the computation of summaries.
Now provides function plus to be used as the combine argument in big_apply and big_parallelize instead of '+'.
options(bigstatsr.cluster.type = "PSOCK"). Uses “PSOCK” in 0.4.0.
big_spLinReg and big_spLogReg. One will be chosen by grid-search.
big_crossprod, big_tcrossprod, big_SVD and big_randomSVD (before, there was no default at all)
Integrate Cross-Model Selection and Averaging (CMSA) directly in big_spLinReg and big_spLogReg, a procedure that automatically chooses the value of the $\lambda$ hyper-parameter.
Speed up big_spLinReg and big_spLogReg (issue #12)