Groupyr: Sparse Group Lasso in Python

Groupyr is a scikit-learn compatible implementation of the sparse group lasso linear model. It is intended for high-dimensional supervised learning problems where related covariates can be assigned to predefined groups.

The Sparse Group Lasso

The sparse group lasso [1] is a penalized regression approach that combines the group lasso penalty with the lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable \(\hat{y}\) from a feature matrix \(\mathbf{X}\), using

\[\hat{y} = \mathbf{X} \hat{\beta},\]

where the coefficients in \(\hat{\beta}\) characterize the relationship between the features and the target and must satisfy [1]

\[\hat{\beta} = \arg\min_{\beta} \frac{1}{2} || y - \sum_{\ell = 1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} ||_2^2 + (1 - \alpha) \lambda \sum_{\ell = 1}^{G} \sqrt{p_{\ell}} ||\beta^{(\ell)}||_2 + \alpha \lambda ||\beta||_1,\]

where \(G\) is the total number of groups, \(\mathbf{X}^{(\ell)}\) is the submatrix of \(\mathbf{X}\) with columns belonging to group \(\ell\), \(\beta^{(\ell)}\) is the coefficient vector of group \(\ell\), and \(p_{\ell}\) is the length of \(\beta^{(\ell)}\). The model hyperparameter \(\alpha\) controls the combination of the group-lasso and the lasso, with \(\alpha=0\) giving the group lasso fit and \(\alpha=1\) yielding the lasso fit. The hyperparameter \(\lambda\) controls the strength of the regularization.
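The objective above can be evaluated directly with NumPy. The sketch below is illustrative only (the function name `sgl_objective` is ours, not part of groupyr); it follows groupyr's convention of representing groups as a list of column-index arrays:

```python
import numpy as np

def sgl_objective(X, y, beta, groups, lam, alpha):
    """Evaluate the sparse group lasso objective for a candidate beta.

    `groups` is a list of integer index arrays, one per group, matching
    the convention groupyr uses for its `groups` parameter.
    """
    residual = y - X @ beta
    loss = 0.5 * np.sum(residual ** 2)  # (1/2) ||y - X beta||_2^2
    group_penalty = sum(
        np.sqrt(len(idx)) * np.linalg.norm(beta[idx])  # sqrt(p_l) * ||beta^(l)||_2
        for idx in groups
    )
    lasso_penalty = np.sum(np.abs(beta))  # ||beta||_1
    return loss + (1 - alpha) * lam * group_penalty + alpha * lam * lasso_penalty

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # only group 0 is active
y = X @ beta_true
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]

# With alpha=1 the penalty reduces to the lasso; with alpha=0, to the group lasso.
print(sgl_objective(X, y, beta_true, groups, lam=0.1, alpha=0.5))
```

Note how setting \(\alpha\) to either extreme recovers the pure lasso or pure group-lasso objective, matching the description above.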

Installation

See the installation guide for installation instructions.

Usage

Groupyr is compatible with the scikit-learn API and its estimators offer the same instantiate, fit, predict workflow that will be familiar to scikit-learn users. See the API and examples for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with groupyr.

For syntactic parallelism with the scikit-learn ElasticNet estimator, we use the keyword l1_ratio to refer to SGL’s \(\alpha\) hyperparameter above that controls the mixture of group lasso and lasso penalties. In addition to keyword parameters shared with scikit-learn’s ElasticNet, ElasticNetCV, LogisticRegression, and LogisticRegressionCV estimators, users must specify the group assignments for the columns of the feature matrix X. This is done during estimator instantiation using the groups parameter, which accepts a list of numpy arrays, where the \(i\)-th array specifies the feature indices of the \(i\)-th group. If no grouping information is provided, the default behavior assigns all features to one group.
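A minimal sketch of constructing the groups parameter for contiguous column blocks (the `SGL` estimator name and its keyword arguments in the trailing comment reflect our reading of groupyr's API and assume groupyr is installed):

```python
import numpy as np

# Feature matrix with 9 columns split into three predefined groups of
# sizes 2, 3, and 4. Each entry of `groups` is an array of column indices.
group_sizes = [2, 3, 4]
indices = np.arange(sum(group_sizes))
groups = np.split(indices, np.cumsum(group_sizes)[:-1])

print(groups)
# [array([0, 1]), array([2, 3, 4]), array([5, 6, 7, 8])]

# The list is passed at estimator instantiation, e.g.:
# from groupyr import SGL
# model = SGL(groups=groups, l1_ratio=0.5, alpha=1.0).fit(X, y)
```

Groups need not be contiguous or equally sized; any partition of the column indices into a list of arrays works.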

Groupyr also offers cross-validation estimators that automatically select the best values of the hyperparameters \(\alpha\) and \(\lambda\) using either an exhaustive grid search (with tuning_strategy="grid") or sequential model based optimization (SMBO) using the scikit-optimize library (with tuning_strategy="bayes"). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn’s GridSearchCV because it makes use of warm-starting, where the model is fit along a pre-defined regularization path and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm start strategy; it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies.
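The warm-starting idea can be illustrated with a plain lasso path, here using scikit-learn's `Lasso` as a stand-in for groupyr's path fitting (the variable names and data are ours; the same pattern of reusing the previous solution applies to groupyr's grid-search estimators):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.standard_normal((50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(50)

# Fit along a decreasing regularization path. With warm_start=True, each
# call to fit() starts from the previous solution rather than from zero,
# so later fits on the path converge in far fewer iterations.
model = Lasso(warm_start=True, max_iter=10000)
path = {}
for lam in np.logspace(0, -3, 10):
    model.set_params(alpha=lam)  # scikit-learn calls the strength "alpha"
    model.fit(X, y)
    path[lam] = np.count_nonzero(model.coef_)
```

As expected for a lasso path, weaker regularization leaves more coefficients nonzero, and the warm-started fits reuse work from the previous grid point.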

API Documentation

See the API Documentation for detailed descriptions of groupyr's estimators, parameters, and methods.

Examples

See the example gallery for a set of introductory examples.

Citing groupyr

If you use groupyr in a scientific publication, we would appreciate citations. Please see our citation instructions for the latest reference and a bibtex entry.

Acknowledgements

Groupyr development is supported through a grant from the Gordon and Betty Moore Foundation and from the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).

The API design of groupyr was facilitated by the scikit-learn project template and it therefore borrows heavily from scikit-learn [2]. Groupyr relies on the copt optimization library [3] for its solver. The groupyr logo is a flipped silhouette of an image from J. E. Randall and is licensed CC BY-SA.

References

[1] Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245.

[2] Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830; Buitinck et al. (2013). API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108-122.

[3] Pedregosa et al. (2020). copt: composite optimization in Python. DOI:10.5281/zenodo.1283339.