`penalized-regressions.Rmd`

In R package {bigstatsr}, you can fit efficient penalized (linear and logistic) regressions using functions `big_spLinReg()` and `big_spLogReg()`. A similar implementation of Cox regression is an area of future development.

You might want to look at the corresponding paper and cite it:

Privé, Florian, Hugues Aschard, and Michael GB Blum. “Efficient implementation of penalized regression for genetic risk prediction.” Genetics (2019). [Open access]

To illustrate how to use `big_spLinReg()` and `big_spLogReg()`, let us use some simulated data:

```
library(bigstatsr)
set.seed(1)
# simulating some data
N <- 530
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y <- rowSums(X[, 1:100]) + 20 * rnorm(N)
ind.train <- sort(sample(nrow(X), 400))
ind.test <- setdiff(rows_along(X), ind.train)
```

Functions `big_spLinReg()` and `big_spLogReg()` automatically perform a procedure similar to cross-validation to choose the hyper-parameters \(\lambda\) and \(\alpha\) of the elastic net regularization:

- The first (maximum) value of the lambda sequence (\(\lambda_{max}\)) is computed automatically and corresponds to enough regularization to have no variable entering the model. Then, a sequence of `nlambda` values (200 by default) is used, equally spaced on a log-scale between \(\lambda_{max}\) and \(\lambda_{max} \times\) `lambda.min`.
- A sequence of \(\alpha\) values can be defined by the user using `alphas` (default is `1`). We recommend using a grid on a log-scale, e.g. `10^(-(0:4))` (see the sketch after this list).
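For illustration, here is a sketch of a call that sets both the \(\alpha\) grid and a smaller `lambda.min`; the parameter names are the ones documented above, but the values are purely illustrative:

```
# a sketch only: explore a grid of alpha values and a longer lambda path
mod_grid <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                         alphas = 10^(-(0:4)),  # alpha grid on a log-scale
                         lambda.min = 0.001,    # go further down the path
                         K = 4)
```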

There are three main reasons for which regularization paths can end:

- the current model includes too many non-zero variables (`dfmax` equals `50e3` by default; you can use `Inf`),
- the early stopping criterion is reached, which means that models are getting worse for the inner validation set; you can control this using `nlam.min` (the minimal number of \(\lambda\) values to try before stopping) and `n.abort` (the number of \(\lambda\) values for which the model is getting worse),
- the `nlambda` \(\lambda\) values have all been used; you might want to reduce `lambda.min` to go further down the path (see the sketch after this list).
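As a sketch of how these stopping controls could be adjusted (the parameter names are the ones listed above; the values are illustrative only):

```
# a sketch only: adjust the stopping rules of the regularization path
mod_path <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 4,
                         dfmax = Inf,    # no cap on the number of non-zero variables
                         nlam.min = 30,  # try at least 30 lambda values before stopping
                         n.abort = 5)    # stop after 5 consecutive worsening lambda values
```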

It is possible to fit the `K` folds for each value of `alphas` in parallel using parameter `ncores`.
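For example, a parallel fit over several \(\alpha\) values could look like this sketch (using `nb_cores()` from {bigstatsr} to detect the number of available cores; adjust to your machine):

```
# a sketch: fit the K folds (for each alpha value) in parallel
mod_par <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                        alphas = c(1, 0.1, 0.01), K = 4,
                        ncores = nb_cores())
```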

You *should* check why regularization paths stop. For this, you can use both methods `plot()` and `summary()`. Let us look at an example below.

```
mod <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 4)
summary(mod)
```

```
## # A tibble: 1 x 7
##   alpha validation_loss intercept beta        nb_var message   all_conv
##   <dbl>           <dbl>     <dbl> <list>       <int> <list>    <lgl>
## 1     1           1794.     -2.23 <dbl [730]>    369 <chr [4]> TRUE
```

```
summary(mod)$message
```

```
## [[1]]
## [1] "No more improvement" "No more improvement" "No more improvement"
## [4] "No more improvement"
```

Here, `"No more improvement"`

means that the early stopping criterion has been reached. We can see the validation loss getting worse at some point:

`plot(mod)`

For small datasets, you might want to reduce the number of folds (`K` = 10 by default). For large datasets, you might want to decrease `n.abort` (i.e. stop the regularization path earlier), since the path is usually much smoother and fitting becomes more and more demanding as we advance along the regularization path (because more and more variables are used in the model).

Use `predict()` to apply the model to the data in the test set:
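A minimal sketch of what this could look like; passing the FBM together with `ind.row` is an assumption made for this example:

```
# a sketch: predictions for the test individuals
pred <- predict(mod, X, ind.row = ind.test)

# compare predictions with the simulated outcome
plot(pred, y[ind.test], pch = 20)
abline(0, 1, col = "red", lwd = 2)
```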

- You can add covariates as a standard R matrix using parameter `covar.train`; it will fit the model as if `X[ind.train, ]` and `covar.train` were `cbind`ed.
- You can add an offset to your model using `base.train` (on the linear scale, do not provide probabilities!).
- You can apply some multiplicative penalization factors (`pf.X` and `pf.covar`) to penalize variables differently (possibly no penalization for *some* variables using `0`).
- Functions `big_spLinReg()` and `big_spLogReg()` are not deterministic because folds are chosen randomly. One way that should ensure reproducibility is to define your own folds using parameter `ind.sets` (using e.g. `sample(rep_len(1:K, length(ind.train)))`). See the sketch after this list.
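To make the covariate and reproducibility points concrete, here is a sketch with two simulated covariates and user-defined folds (the covariate matrix and the seed are made up for illustration):

```
# a sketch: add two simulated covariates and fix the folds for reproducibility
covar <- matrix(rnorm(N * 2), ncol = 2)
set.seed(2)
sets <- sample(rep_len(1:4, length(ind.train)))
mod_covar <- big_spLinReg(X, y[ind.train], ind.train = ind.train,
                          covar.train = covar[ind.train, ],
                          ind.sets = sets, K = 4)
```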

The computation time of our PLR implementation mainly depends on the sample size and the number of candidate variables (variables that are included in the gradient descent). Indeed, the algorithm is composed of **two steps**: first, for each variable, the algorithm computes a univariate statistic that is used to decide if the variable is included in the model (for each value of \(\lambda\)). This first step is very fast (if data can be quickly read from disk). Then, the algorithm iterates over a regularization path of decreasing values of \(\lambda\), which progressively enables variables to enter the model (see first figure). In the second step, the number of variables increases and computations stop when an early stopping criterion is reached (when prediction is getting worse on the corresponding validation set, see first figure).

For outcomes that require many variables to predict, such as height, and when using huge datasets such as the UK Biobank, the algorithm might iterate over >100,000 variables, which is computationally demanding. By contrast, for outcomes that require only a few variables to predict, such as autoimmune diseases, the number of variables included in the model is much smaller, so that fitting is very fast (only 13 min for 150K women of the UK Biobank for breast cancer).

**Memory requirements are tightly linked to computation time.** Indeed, variables are accessed in memory thanks to memory-mapping when they are used. When there is not enough memory left, the operating system (OS) frees some memory for new incoming variables. Yet, if too many variables are used in the gradient descent, the OS would regularly swap memory between disk and RAM, severely slowing down computations.