big_spLogReg.Rd
Fit lasso penalized logistic regression path for a Filebacked Big Matrix. Covariates can be added to correct for confounders.
big_spLogReg(
  X,
  y01.train,
  ind.train = rows_along(X),
  ind.col = cols_along(X),
  covar.train = NULL,
  base.train = NULL,
  pf.X = NULL,
  pf.covar = NULL,
  alphas = 1,
  K = 10,
  ind.sets = NULL,
  nlambda = 200,
  nlam.min = 50,
  n.abort = 10,
  dfmax = 50000,
  warn = TRUE,
  ncores = 1,
  ...
)
X  An object of class FBM. 

y01.train  Vector of responses, corresponding to ind.train. Must be only 0s and 1s.
ind.train  An optional vector of the row indices that are used, for the training part. If not specified, all rows are used. Don't use negative indices. 
ind.col  An optional vector of the column indices that are used. If not specified, all columns are used. Don't use negative indices. 
covar.train  Matrix of covariables to be added in each model to correct for confounders (e.g. the scores of PCA), corresponding to ind.train.
base.train  Vector of base predictions. The model will be learned starting from these predictions. This can be useful if you have first fitted a model with large-effect variables that you don't want to penalize.
pf.X  A multiplicative factor for the penalty applied to each coefficient. If supplied, pf.X must be a numeric vector of the same length as ind.col.
pf.covar  Same as pf.X, but for covar.train.
alphas  The elastic-net mixing parameter that controls the relative contribution of the lasso (L1) and ridge (L2) penalties. The penalty is defined as $$ \alpha \|\beta\|_1 + (1 - \alpha)/2 \|\beta\|_2^2. $$ alpha = 1 gives the lasso penalty and alpha in (0, 1) the elastic-net penalty. Default is 1.
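As a quick numeric illustration of this penalty formula (a standalone sketch; `enet_penalty` is a hypothetical helper, not a bigstatsr function):

```r
# Hypothetical helper spelling out the elastic-net penalty above;
# alpha = 1 reduces it to the lasso (L1) term, alpha = 0 to the ridge (L2) term.
enet_penalty <- function(beta, alpha) {
  alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2)
}

beta <- c(1, -2, 0.5)
enet_penalty(beta, alpha = 1)  # pure lasso: sum(|beta|) = 3.5
enet_penalty(beta, alpha = 0)  # pure ridge: sum(beta^2) / 2 = 2.625
```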

K  Number of sets used in the Cross-Model Selection and Averaging (CMSA) procedure. Default is 10.
ind.sets  Integer vectors of values between 1 and K specifying which set each index of the training set is in. Default randomly assigns these values.
nlambda  The number of lambda values. Default is 200.
nlam.min  Minimum number of lambda values to investigate. Default is 50.
n.abort  Number of lambda values for which prediction on the validation set must decrease before stopping. Default is 10.
dfmax  Upper bound for the number of nonzero coefficients. Default is 50000.
warn  Whether to warn if some models may not have reached a minimum. Default is TRUE.
ncores  Number of cores used. Default doesn't use parallelism. You may use nb_cores().
...  Arguments passed on to the underlying fitting function.
Return an object of class big_sp_list (a list of length(alphas) x K) that has 3 methods: predict, summary and plot.
This is a modified version of one function of package biglasso. It adds the possibility to train models with covariables and to use many types of FBM (not only double ones). Yet, it only corresponds to screen = "SSR" (Sequential Strong Rules).
Also, to remove the choice of the lambda parameter, we introduce the Cross-Model Selection and Averaging (CMSA) procedure:

1. This function separates the training set into K folds (e.g. 10).
2. In turn,
   - each fold is considered as an inner validation set and the other (K - 1) folds form an inner training set,
   - the model is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed,
   - the vector of scores which maximizes log-likelihood is determined,
   - the vector of coefficients corresponding to the previous vector of scores is chosen.
3. The K resulting vectors of coefficients are then averaged into one final vector of coefficients.
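The final averaging step can be sketched as follows (an illustrative toy, assuming the K per-fold coefficient vectors are stored as columns of a matrix; this is not the internal bigstatsr implementation):

```r
# Toy sketch of the last CMSA step: average the K fold-specific
# coefficient vectors (columns of `betas`) into one final vector.
cmsa_average <- function(betas) {
  rowMeans(betas)  # betas: p x K matrix, one column per fold
}

# e.g. K = 3 folds, p = 4 coefficients
betas <- cbind(c(1, 0, 2, 0),
               c(1, 0, 0, 0),
               c(1, 3, 1, 0))
cmsa_average(betas)  # -> 1 1 1 0
```

Note that variables selected in only some folds get shrunk (not dropped) by the averaging, which is what makes the final vector a compromise between the K models.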
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J. and Tibshirani, R. J. (2012), Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 245-266. https://doi.org/10.1111/j.1467-9868.2011.01004.x.
Zeng, Y., and Breheny, P. (2017). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. arXiv preprint arXiv:1701.05936. https://arxiv.org/abs/1701.05936.
Privé, F., Aschard, H., and Blum, M. G. B. (2019). Efficient implementation of penalized regression for genetic risk prediction. Genetics, 212: 65-74. https://doi.org/10.1534/genetics.119.302019.
set.seed(2)

# simulating some data
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y01 <- as.numeric((rowSums(X[, 1:10]) + 2 * rnorm(N)) > 0)
covar <- matrix(rnorm(N * 3), N)

ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)

# fitting model for multiple lambdas and alphas
test <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
                     covar.train = covar[ind.train, ],
                     alphas = c(1, 0.5, 0.1, 0.01), warn = FALSE)

# peek at the models
plot(test)
summary(test)
#> # A tibble: 4 x 7
#>   alpha validation_loss intercept beta        nb_var message    all_conv
#>   <dbl>           <dbl>     <dbl> <list>       <int> <list>     <lgl>
#> 1  1              0.425    0.0615 <dbl [733]>    154 <chr [10]> TRUE
#> 2  0.5            0.435    0.0758 <dbl [733]>    164 <chr [10]> TRUE
#> 3  0.1            0.482    0.139  <dbl [733]>    215 <chr [10]> TRUE
#> 4  0.01           0.554    0.258  <dbl [733]>    454 <chr [10]> TRUE
summary(test)$message
#> [[1]]
#>  [1] "No more improvement" "No more improvement" "No more improvement"
#>  [4] "No more improvement" "No more improvement" "No more improvement"
#>  [7] "No more improvement" "No more improvement" "No more improvement"
#> [10] "No more improvement"
#>
#> [[2]]
#>  [1] "No more improvement" "No more improvement" "No more improvement"
#>  [4] "No more improvement" "No more improvement" "No more improvement"
#>  [7] "No more improvement" "No more improvement" "No more improvement"
#> [10] "No more improvement"
#>
#> [[3]]
#>  [1] "No more improvement" "No more improvement" "No more improvement"
#>  [4] "No more improvement" "No more improvement" "No more improvement"
#>  [7] "No more improvement" "No more improvement" "No more improvement"
#> [10] "No more improvement"
#>
#> [[4]]
#>  [1] "No more improvement" "No more improvement" "No more improvement"
#>  [4] "No more improvement" "No more improvement" "No more improvement"
#>  [7] "No more improvement" "No more improvement" "No more improvement"
#> [10] "No more improvement"
summary(test, best.only = TRUE)
#> # A tibble: 1 x 7
#>   alpha validation_loss intercept beta        nb_var message    all_conv
#>   <dbl>           <dbl>     <dbl> <list>       <int> <list>     <lgl>
#> 1     1           0.425    0.0615 <dbl [733]>    154 <chr [10]> TRUE

pred <- predict(test, X, ind.row = ind.test, covar.row = covar[ind.test, ])
AUC(pred, y01[ind.test])
#> [1] 0.8693182

library(ggplot2)
qplot(pred, fill = as.logical(y01[ind.test]), geom = "density",
      alpha = I(0.4)) +
  labs(fill = "Case?") +
  theme_bigstatsr() +
  theme(legend.position = c(0.52, 0.8))