# Groupyr: Sparse Group Lasso in Python

Groupyr is a scikit-learn compatible implementation of the sparse group lasso linear model. It is intended for high-dimensional supervised learning problems where related covariates can be assigned to predefined groups.

## The Sparse Group Lasso

The sparse group lasso [1] is a penalized regression approach that combines the group lasso penalty with the ordinary lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable $\hat{y}$ from a feature matrix $\mathbf{X}$, using

$$\hat{y} = \mathbf{X} \hat{\beta},$$

where the coefficients in $\hat{\beta}$ characterize the relationship between the features and the target and must satisfy [1]

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2} \left|\left| y - \sum_{\ell = 1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} \right|\right|_2^2 + (1 - \alpha) \lambda \sum_{\ell = 1}^{G} \sqrt{p_{\ell}} ||\beta^{(\ell)}||_2 + \alpha \lambda ||\beta||_1,$$

where $G$ is the total number of groups, $\mathbf{X}^{(\ell)}$ is the submatrix of $\mathbf{X}$ with columns belonging to group $\ell$, $\beta^{(\ell)}$ is the coefficient vector of group $\ell$, and $p_{\ell}$ is the length of $\beta^{(\ell)}$. The model hyperparameter $\alpha$ controls the combination of the group lasso and the lasso penalties, with $\alpha=0$ giving the group lasso fit and $\alpha=1$ yielding the lasso fit. The hyperparameter $\lambda$ controls the strength of the regularization.
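To make the objective concrete, here is a minimal NumPy sketch that evaluates it for a given coefficient vector. The function name and the encoding of groups as a list of index arrays are illustrative, not part of groupyr's API:

```python
import numpy as np


def sgl_objective(beta, X, y, groups, lam, alpha):
    """Evaluate the sparse group lasso objective defined above.

    ``groups`` is a list of integer index arrays, one per group;
    ``lam`` and ``alpha`` play the roles of lambda and alpha.
    """
    # (1/2) ||y - X beta||_2^2; summing X^(l) beta^(l) over a partition
    # of the columns is the same as the full product X beta.
    residual = y - X @ beta
    loss = 0.5 * residual @ residual

    # Group lasso penalty: sqrt(p_l)-weighted l2 norm of each group.
    group_penalty = sum(
        np.sqrt(len(idx)) * np.linalg.norm(beta[idx]) for idx in groups
    )

    # Ordinary lasso penalty: l1 norm of the full coefficient vector.
    lasso_penalty = np.linalg.norm(beta, 1)

    return loss + (1 - alpha) * lam * group_penalty + alpha * lam * lasso_penalty
```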

## Installation

See the installation guide for instructions.

## Usage

Groupyr is compatible with the scikit-learn API and its estimators offer the same instantiate, fit, predict workflow that will be familiar to scikit-learn users. See the API and examples for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with groupyr.

For syntactic parallelism with scikit-learn's `ElasticNet` estimator, we use the keyword `l1_ratio` to refer to SGL's $\alpha$ hyperparameter above, which controls the mixture of group lasso and lasso penalties. In addition to the keyword parameters shared with scikit-learn's `ElasticNet`, `ElasticNetCV`, `LogisticRegression`, and `LogisticRegressionCV` estimators, users must specify the group assignments for the columns of the feature matrix `X`. This is done during estimator instantiation using the `groups` parameter, which accepts a list of numpy arrays, where the $i$-th array specifies the feature indices of the $i$-th group. If no grouping information is provided, the default behavior assigns all features to one group.
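For example, here is a minimal sketch that fits an `SGL` regressor to synthetic data with ten predefined groups. The data setup and group layout are illustrative; `groups` and `l1_ratio` are the groupyr parameters described above, and `alpha` is assumed (following the `ElasticNet` parallel) to be the overall regularization strength $\lambda$:

```python
import numpy as np
from sklearn.datasets import make_regression

import groupyr as gpr

# Synthetic regression problem: 90 features in 10 contiguous groups of 9.
X, y = make_regression(n_samples=100, n_features=90, n_informative=20, random_state=0)
groups = [np.arange(9 * i, 9 * (i + 1)) for i in range(10)]

# l1_ratio is the mixing parameter alpha from the objective above;
# alpha here is the regularization strength (lambda above).
model = gpr.SGL(groups=groups, l1_ratio=0.5, alpha=1.0)
model.fit(X, y)

print(model.coef_.shape)     # (90,)
print(model.predict(X[:5]))  # predictions for the first five samples
```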

Groupyr also offers cross-validation estimators that automatically select the best values of the hyperparameters $\alpha$ and $\lambda$ using either an exhaustive grid search (with `tuning_strategy="grid"`) or sequential model-based optimization (SMBO) using the scikit-optimize library (with `tuning_strategy="bayes"`). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn's `GridSearchCV` because it makes use of warm-starting, where the model is fit along a pre-defined regularization path and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm start strategy; it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies, and a brief sketch follows below.
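A minimal sketch of both tuning strategies using the `SGLCV` estimator, reusing the illustrative data setup from above. The `tuning_strategy` values come from the description above; the `n_alphas`, `n_bayes_iter`, and fitted-attribute names (`alpha_`, `l1_ratio_`) are assumptions based on the `ElasticNetCV` parallel and may differ in your version:

```python
import numpy as np
from sklearn.datasets import make_regression

import groupyr as gpr

X, y = make_regression(n_samples=100, n_features=90, n_informative=20, random_state=0)
groups = [np.arange(9 * i, 9 * (i + 1)) for i in range(10)]

# Exhaustive grid search, warm-started along the regularization path.
grid_cv = gpr.SGLCV(
    groups=groups, l1_ratio=[0.1, 0.5, 0.9], n_alphas=20,
    tuning_strategy="grid", cv=3,
)
grid_cv.fit(X, y)

# SMBO via scikit-optimize; n_bayes_iter (assumed name) caps the number
# of hyperparameter evaluations.
bayes_cv = gpr.SGLCV(
    groups=groups, tuning_strategy="bayes", n_bayes_iter=30, cv=3,
)
bayes_cv.fit(X, y)

# Selected hyperparameters, mirroring ElasticNetCV's fitted attributes.
print(grid_cv.alpha_, grid_cv.l1_ratio_)
```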

## API Documentation

See the API Documentation for a detailed reference.

## Examples

See the example gallery for a set of introductory examples.

## Citing groupyr

If you use groupyr in a scientific publication, we would appreciate citations. Please see our citation instructions for the latest reference and a BibTeX entry.

## Acknowledgements

Groupyr development is supported through a grant from the Gordon and Betty Moore Foundation and from the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).

The API design of groupyr was facilitated by the scikit-learn project template, and it therefore borrows heavily from scikit-learn [2]. Groupyr relies on the copt optimization library [3] for its solver. The groupyr logo is a flipped silhouette of an image from J. E. Randall and is licensed CC BY-SA.