# *Groupyr*: Sparse Group Lasso in Python

*Groupyr* is a scikit-learn compatible implementation of the sparse group lasso
linear model. It is intended for high-dimensional supervised learning
problems where related covariates can be assigned to predefined groups.

## The Sparse Group Lasso

The sparse group lasso [1] is a penalized regression approach that combines the group lasso with the ordinary lasso penalty to promote both global sparsity and group-wise sparsity. It estimates a target variable \(\hat{y}\) from a feature matrix \(\mathbf{X}\), using

\[
\hat{y} = \mathbf{X} \hat{\beta},
\]

where the coefficients in \(\hat{\beta}\) characterize the relationship between the features and the target and must satisfy [1]

\[
\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \frac{1}{2} \left\| y - \sum_{\ell=1}^{G} \mathbf{X}^{(\ell)} \beta^{(\ell)} \right\|_2^2 + (1 - \alpha) \lambda \sum_{\ell=1}^{G} \sqrt{p_{\ell}} \left\| \beta^{(\ell)} \right\|_2 + \alpha \lambda \left\| \beta \right\|_1,
\]

where \(G\) is the total number of groups, \(\mathbf{X}^{(\ell)}\) is the submatrix of \(\mathbf{X}\) with columns belonging to group \(\ell\), \(\beta^{(\ell)}\) is the coefficient vector of group \(\ell\), and \(p_{\ell}\) is the length of \(\beta^{(\ell)}\). The model hyperparameter \(\alpha\) controls the combination of the group lasso and the lasso, with \(\alpha=0\) giving the group lasso fit and \(\alpha=1\) yielding the lasso fit. The hyperparameter \(\lambda\) controls the strength of the regularization.
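To make the penalty term concrete, the following sketch evaluates it in plain numpy for a hypothetical coefficient vector and grouping. All numbers here are made up for illustration; this is not library code.

```python
import numpy as np

# Hypothetical grouping of 9 features into G = 3 groups; the i-th array
# holds the column indices of the i-th group (illustrative values only).
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5, 6, 7, 8])]

rng = np.random.default_rng(42)
beta = rng.standard_normal(9)  # a made-up coefficient vector
alpha, lam = 0.5, 0.1          # mixing and regularization hyperparameters

# Group-lasso term: sum over groups of sqrt(p_l) * ||beta^(l)||_2
group_term = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
# Lasso term: ||beta||_1
lasso_term = np.linalg.norm(beta, 1)

penalty = lam * ((1 - alpha) * group_term + alpha * lasso_term)
print(penalty)
```

Setting `alpha = 0` leaves only the group-lasso term, and `alpha = 1` leaves only the lasso term, matching the limits described above.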

## Installation

See the installation guide for installation instructions.

## Usage

*Groupyr* is compatible with the scikit-learn API and its estimators offer the same instantiate, `fit`, `predict` workflow that will be familiar to scikit-learn users. See the API and examples for full details. Here, we describe only the key differences necessary for scikit-learn users to get started with *groupyr*.

For syntactic parallelism with the scikit-learn `ElasticNet` estimator, we use the keyword `l1_ratio` to refer to SGL's \(\alpha\) hyperparameter above, which controls the mixture of group lasso and lasso penalties. In addition to keyword parameters shared with scikit-learn's `ElasticNet`, `ElasticNetCV`, `LogisticRegression`, and `LogisticRegressionCV` estimators, users must specify the group assignments for the columns of the feature matrix `X`. This is done during estimator instantiation using the `groups` parameter, which accepts a list of numpy arrays, where the \(i\)-th array specifies the feature indices of the \(i\)-th group. If no grouping information is provided, the default behavior assigns all features to one group.
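A minimal sketch of building the `groups` argument follows. The group boundaries and feature labels are invented for illustration, and the commented estimator call is an assumption about how it would be passed to a *groupyr* estimator; consult the API documentation for the authoritative signature.

```python
import numpy as np

# Nine feature columns split into three predefined groups; the i-th array
# lists the column indices belonging to the i-th group (hypothetical split).
groups = [
    np.array([0, 1, 2]),     # e.g. one set of related covariates
    np.array([3, 4]),        # e.g. a second set
    np.array([5, 6, 7, 8]),  # e.g. a third set
]

# Hedged usage sketch (assumes groupyr is installed; hyperparameter
# values are arbitrary):
#
#   from groupyr import SGL
#   model = SGL(groups=groups, l1_ratio=0.5, alpha=0.1)
#   model.fit(X, y)

# Sanity check: every column index appears in exactly one group.
all_idx = np.concatenate(groups)
assert len(all_idx) == len(set(all_idx.tolist()))
```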

*Groupyr* also offers cross-validation estimators that automatically select the best values of the hyperparameters \(\alpha\) and \(\lambda\) using either an exhaustive grid search (with `tuning_strategy="grid"`) or sequential model-based optimization (SMBO) using the scikit-optimize library (with `tuning_strategy="bayes"`). For the grid search strategy, our implementation is more efficient than using the base estimator with scikit-learn's `GridSearchCV` because it makes use of warm-starting: the model is fit along a pre-defined regularization path, and the solution from the previous fit is used as the initial guess for the current hyperparameter value. The randomness associated with SMBO complicates the use of a warm-start strategy; it can be difficult to determine which of the previously attempted hyperparameter combinations should provide the initial guess for the current evaluation. However, even without warm-starting, we find that the SMBO strategy usually outperforms grid search because far fewer evaluations are needed to arrive at the optimal hyperparameters. We provide examples of both strategies.
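The warm-starting idea can be illustrated without *groupyr* itself (whose solver comes from copt). The sketch below fits a plain lasso with a hand-written ISTA loop along a decreasing regularization path, once from a cold start and once reusing the previous solution, and compares total iteration counts. This is a toy stand-in for the technique, not the library's implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, beta0=None, max_iter=10_000, tol=1e-8):
    """Solve min_b (1/2n)||y - Xb||^2 + lam*||b||_1 by ISTA.

    Returns the solution and the number of iterations used.
    """
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant
    for it in range(max_iter):
        grad = X.T @ (X @ beta - y) / n
        new = soft_threshold(beta - step * grad, step * lam)
        if np.max(np.abs(new - beta)) < tol:
            return new, it + 1
        beta = new
    return beta, max_iter

# Synthetic sparse regression problem (illustrative values).
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 1.5]
y = X @ beta_true + 0.1 * rng.standard_normal(100)

# Fit along a decreasing regularization path, strong to weak.
lambdas = np.logspace(0, -3, 10)
cold_iters, warm_iters, beta = 0, 0, None
for lam in lambdas:
    _, it_cold = lasso_ista(X, y, lam)           # cold start from zeros
    beta, it_warm = lasso_ista(X, y, lam, beta)  # warm start from previous fit
    cold_iters += it_cold
    warm_iters += it_warm

print(warm_iters, cold_iters)
```

Because each solution along the path is close to its neighbor, the warm-started fits typically need far fewer total iterations than restarting from zero at every grid point, which is the efficiency gain described above.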

## API Documentation

See the API Documentation for detailed documentation of the API.

## Examples

See the example gallery for a set of introductory examples.

## Citing groupyr

If you use *groupyr* in a scientific publication, we would appreciate
citations. Please see our citation instructions for the latest
reference and a bibtex entry.

## Acknowledgements

*Groupyr* development is supported through a grant from the Gordon and Betty
Moore Foundation and from the Alfred P. Sloan
Foundation to the University of Washington eScience
Institute, as well as NIMH BRAIN
Initiative grant 1RF1MH121868-01
to Ariel Rokem (University of Washington).

The API design of *groupyr* was facilitated by the scikit-learn project template and it therefore borrows heavily from scikit-learn [2]. *Groupyr* relies on the copt optimization library [3] for its solver. The *groupyr* logo is a flipped silhouette of an image from J. E. Randall and is licensed CC BY-SA.

## References

- [1] Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2), 231-245.

- [2] Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830; Buitinck et al. (2013). API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108-122.

- [3] Pedregosa et al. (2020). copt: composite optimization in Python. DOI:10.5281/zenodo.1283339.