Regression Shrinkage and Selection via the Lasso

These are my notes from the paper Regression Shrinkage and Selection via the Lasso by Robert Tibshirani.

Overview

The paper introduces the Lasso (Least Absolute Shrinkage and Selection Operator), a method for estimating the coefficients of a Linear Regression model.

It builds on Ordinary Least Squares by constraining the sum of the absolute values of the coefficients to be at most some value $t$:

$$ (\hat{\alpha}, \hat{\beta}) = \arg \min \left\{ \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j} \beta_j x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \leq t. $$

Unlike Variable Subset Selection, it tends to generalise better, and unlike Ridge Regression, it produces more interpretable models: it typically sets some coefficients exactly to 0 and keeps only the key features, so we can see which predictors impact the outcome the most.
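
To see the contrast concretely, here is a small synthetic sketch (my own illustration, not from the paper) comparing scikit-learn's Lasso and Ridge estimators. scikit-learn uses the penalised (Lagrangian) form with a strength parameter alpha rather than the explicit bound $t$, but for a suitable alpha the two formulations give the same solution; the data and parameter values below are made up.

```python
# Illustrative sketch: Lasso sets some coefficients exactly to 0, Ridge only
# shrinks them. Synthetic data; alpha values chosen arbitrarily for the demo.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # only 3 predictors matter
y = X @ true_beta + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso:", np.round(lasso.coef_, 3))  # several coefficients are exactly 0
print("Ridge:", np.round(ridge.coef_, 3))  # small but non-zero everywhere
```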

Lasso Algorithm

Given data $(\mathbf{x}^i, y_i)$ for $i = 1, 2, \ldots, N$, where $\mathbf{x}^i = (x_{i1}, \ldots, x_{ip})^T$ are the predictor variables and $y_i$ are the labels.

The observations are assumed to be independent, or the $y_i$ to be conditionally independent given the $x_{ij}$. The predictors are assumed to be standardised so that $\sum_{i} x_{ij}/N = 0$ and $\sum_{i} x_{ij}^2 / N = 1$.
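
As a quick sketch of that standardisation (a hypothetical helper, not something defined in the paper):

```python
# Centre and scale each predictor column so that sum_i x_ij / N = 0 and
# sum_i x_ij^2 / N = 1 (note the divisor N rather than N - 1).
import numpy as np

def standardise(X):
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    X = X - X.mean(axis=0)                        # column means become 0
    return X / np.sqrt((X ** 2).sum(axis=0) / N)  # column second moments become 1
```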

The Lasso estimate $(\hat{\alpha}, \hat{\beta})$ is defined as:

$$ (\hat{\alpha}, \hat{\beta}) = \arg \min \left\{ \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j} \beta_j x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \leq t. $$

Here $t \geq 0$ is a tuning parameter.

For all $t$, the solution for $\alpha$ is $\hat{\alpha} = \bar{y}$. We can therefore assume without loss of generality that $\bar{y} = 0$ and omit the bias term $\alpha$, focusing on the coefficients $\hat{\beta}$.

Computation of the solution is a quadratic programming problem with linear inequality constraints.
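
As a rough sketch of that computation, the constrained problem can be handed to a general-purpose solver. This is not the procedure used in the paper (which rewrites the constraint as a set of linear inequalities and solves a quadratic program); the function name lasso_constrained and the choice of SciPy's SLSQP method are my own, for illustration only.

```python
# Minimal sketch: minimise the residual sum of squares subject to
# sum(|beta_j|) <= t, assuming y is centred and X standardised so the
# intercept can be dropped.
import numpy as np
from scipy.optimize import minimize

def lasso_constrained(X, y, t):
    p = X.shape[1]

    def rss(beta):
        r = y - X @ beta
        return r @ r

    # SLSQP inequality constraints require fun(beta) >= 0.
    budget = {"type": "ineq", "fun": lambda beta: t - np.abs(beta).sum()}
    res = minimize(rss, x0=np.zeros(p), method="SLSQP", constraints=[budget])
    return res.x
```

The absolute values make the constraint non-differentiable, so a more careful implementation would split each $\beta_j$ into positive and negative parts (or follow the paper's approach); the sketch above is only meant to show the shape of the optimisation problem.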

The parameter $t \geq 0$ controls the amount of shrinkage applied to the estimates.

Let $\hat{\beta}^0_j$ be the full least squares estimates and let $t_0 = \sum_j |\hat{\beta}^0_j|$. Values of $t < t_0$ will cause shrinkage of the solutions towards 0, and some coefficients may be exactly equal to 0. For example, if $t = t_0/2$, the effect will be roughly similar to finding the best subset of size $p/2$. Note also that the design matrix does not need to be of full rank.
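
To make the scale of $t$ concrete, a hypothetical numeric sketch (synthetic data, my own illustration):

```python
# Compute t0, the L1 norm of the full least squares fit; any budget t < t0
# forces shrinkage of the lasso solution.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=50)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # full least squares estimates
t0 = np.abs(beta_ols).sum()
print(t0)
# Solving the constrained problem with t = t0 / 2 (e.g. with the
# lasso_constrained sketch above) should behave roughly like selecting the
# best subset of about p/2 predictors.
```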

The motivation for the Lasso came from Breiman's Non-negative Garotte, which minimises:

$$ \sum_{i=1}^{N} \left( y_i - \alpha - \sum_{j} c_j \hat{\beta}_j x_{ij} \right)^2 \quad \text{subject to} \quad c_j \geq 0, \quad \sum_j c_j \leq t. $$
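
Here is a minimal sketch of the garotte in the same style as the earlier solver sketch (my own illustration, not Breiman's or the paper's algorithm): the OLS estimates are scaled by non-negative factors $c_j$ whose sum is bounded by $t$.

```python
# Non-negative garotte sketch: shrink the OLS fit by factors c_j >= 0 with
# sum(c_j) <= t, assuming y is centred so the intercept can be dropped.
import numpy as np
from scipy.optimize import minimize

def garotte(X, y, t):
    p = X.shape[1]
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit

    def rss(c):
        r = y - X @ (c * beta_ols)
        return r @ r

    budget = [{"type": "ineq", "fun": lambda c: t - c.sum()}]
    bounds = [(0.0, None)] * p                        # enforce c_j >= 0
    res = minimize(rss, x0=np.ones(p), method="SLSQP",
                   bounds=bounds, constraints=budget)
    return res.x * beta_ols                           # shrunken coefficients
```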

A drawback of the garotte is that its solution depends on both the sign and the magnitude of the Ordinary Least Squares estimates.

In overfit or highly correlated settings where the OLS estimates behave poorly, the garotte may suffer as a result. In contrast, the Lasso avoids the explicit use of the OLS estimates.