These are my notes from the paper Regression Shrinkage and Selection via the Lasso by Robert Tibshirani.
Overview
The paper introduces a method for estimating coefficients of a Linear Regression model called Lasso or Least Absolute Shrinkage and Selection Operator.
It builds on Ordinary Least Squares by ensuring the sum of the absolute values of the coefficients is below some value t.
(α^,β^​)=argmin⎩⎪⎨⎪⎧​i=1∑N​(yi​−α−j∑​βj​xij​)2⎭⎪⎬⎪⎫​subject toj∑​∣βj​∣≤t.
Unlike alternative methods Variable Subset Selection it tends to generalise better, and Ridge Regression tends to make more interpretable models, since it typically makes some coefficients exactly 0 and prioritises key features, making models more interpretable, i.e. we can understand which coefficients impact the outcome the most.
Lasso Algorithm
Given data (xi,yi​) for i=1,2,…,N, where xi=(xi1​​,…,xip​​)T are the predictor variables, and yi​ are the labels.
The observations are assumed to be independent, or the yi​ are conditionally independent given the xij​. Assumes predictors are standardised so that ∑i​xij​/N=0 and ∑i​xij2​/N=1.
The LASSO estimate (α^,β^​) is defined as:
(α^,β^​)=argmin⎩⎪⎨⎪⎧​i=1∑N​(yi​−α−j∑​βj​xij​)2⎭⎪⎬⎪⎫​subject toj∑​∣βj​∣≤t.
t>0 is a tuning parameter.
For all t, the solution for α is α^=yˉ​, We can assume without loss of generality that y^​=0 and hence omit the bias term α, focusing on the coefficients β^​
Computation of the solution is a quadratic programming problem with linear inequality constraints.
The parameter t≥0 controls the amount of shrinkage applied to the estimates.
Let β^​j0​ be the full least squares estimates and let t0​=∑j​∣β^​j0​∣. Values of t<t0​ will cause shrinkage of the solutions towards 0, and some coefficients may be exactly equal to 0. For example, if t=t0​/2, the effect will be roughly similar to finding the best subset of size p/2. Note also that the design matrix does not need to be of full rank.
The motivation for the Lasso came from Breiman's Non-negative Garrotte, which minimises:
i=1∑N​(yi​−α−j∑​cj​βj​^​xij​)2subject tocj​≥0,j∑​cj​≤t.
A drawback of the garotte is that depends on both the sign and the magnitude of the Ordinary Least Squares estimates.
In overfit or highly correlated settings where the OLS estimates behave poorly, the garotte may suffer as a result. In contrast, the Lasso avoids the explicit use of the OLS estimates.