Lasso and Elastic Net

Dr. D’Agostino McGowan

Ridge Review

What are we minimizing with Ridge Regression?

$R S S + λ \sum_{j = 1}^{p} β_{j}^{2}$

What is the resulting estimate for ${\hat{β}}_{r i d g e}$ ?

${\hat{β}}_{r i d g e} = (X^{T} X + λ I)^{- 1} X^{T} y$

Why is this useful?
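
One possible answer, sketched in code: for $λ > 0$ the matrix $X^{T} X + λ I$ is invertible even when $p > n$, where plain least squares has no unique solution. The sketch below is an illustration only; it assumes no intercept and no standardization, and the function name `ridge_closed_form` and the tiny data set are made up for this example. In practice a linear algebra library would be used instead of hand-written elimination.

```python
def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Build A = X^T X + lam * I and b = X^T y.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    # Augmented matrix [A | b], reduced with partial pivoting.
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; try lam > 0")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[j][p] / M[j][j] for j in range(p)]


# Hypothetical data with n = 1 observation and p = 2 predictors (p > n).
X = [[1.0, 2.0]]
y = [3.0]
beta = ridge_closed_form(X, y, lam=1.0)  # lam = 0.0 would raise ValueError here
```

With `lam=0.0` the same call fails because $X^{T} X$ is singular for this data; any positive `lam` removes the problem.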

Ridge Review

How is $λ$ determined?

$R S S + λ \sum_{j = 1}^{p} β_{j}^{2}$

What is the bias-variance trade-off?
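
A minimal sketch of choosing $λ$ by cross-validation, the approach these slides name later for the lasso. To keep it short it assumes a single predictor with no intercept, where the ridge estimate reduces to $\sum x_{i} y_{i} / (\sum x_{i}^{2} + λ)$. The function names, the simulated data, and the $λ$ grid are all made up for illustration.

```python
import random


def ridge_1d(x, y, lam):
    """Ridge slope for one predictor, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)


def cv_error(x, y, lam, k=5):
    """Mean squared prediction error of ridge_1d over k interleaved folds."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th observation is held out
        x_tr = [x[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        beta = ridge_1d(x_tr, y_tr, lam)
        total += sum((y[i] - beta * x[i]) ** 2 for i in test_idx)
    return total / n


# Simulated data (an assumption for this sketch): true slope 2, unit noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [2.0 * a + rng.gauss(0, 1) for a in x]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam) for lam in grid}
best_lam = min(errors, key=errors.get)  # lambda with the smallest CV error
```

Small $λ$ keeps bias low but variance high; large $λ$ does the reverse. The cross-validation error is one way to see where the two balance on a given data set.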

Dr. Lucy D'Agostino McGowan adapted from slides by Hastie & Tibshirani

Ridge Regression

Pros
Can be used when $p > n$
Can be used to help with multicollinearity
Will decrease variance (as $λ \to \infty$)

Cons
Will have increased bias (compared to least squares)
Does not really help with variable selection (all variables are included in some regard, even if their $β$ coefficients are really small)

Lasso!

The lasso is similar to ridge, but it actually drives some $β$ coefficients to 0! (So it helps with variable selection)

$R S S + λ \sum_{j = 1}^{p} | β_{j} |$

We say lasso uses an $ℓ_{1}$ penalty, ridge uses an $ℓ_{2}$ penalty
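$\| β \|_{1} = \sum_{j = 1}^{p} | β_{j} |$
$\| β \|_{2} = \sqrt{\sum_{j = 1}^{p} β_{j}^{2}}$ (the ridge penalty is the squared norm, $\| β \|_{2}^{2} = \sum_{j = 1}^{p} β_{j}^{2}$)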
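
A small numeric check of the two penalties, using a made-up coefficient vector. Note that the ridge penalty is the sum of squared coefficients, which is the square of the $ℓ_{2}$ norm rather than the norm itself.

```python
import math

beta = [3.0, -4.0, 0.0]  # hypothetical coefficients, for illustration only

l1_norm = sum(abs(b) for b in beta)             # lasso penalty (before scaling by lambda)
l2_norm = math.sqrt(sum(b * b for b in beta))   # Euclidean length of beta
ridge_penalty = sum(b * b for b in beta)        # squared l2 norm, as in the ridge objective
```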

Lasso

Like Ridge regression, lasso shrinks the coefficients towards 0
In lasso, the $ℓ_{1}$ penalty forces some of the coefficient estimates to be exactly zero when the tuning parameter $λ$ is sufficiently large
Therefore, lasso can be used for variable selection
The lasso can help create smaller, simpler models
Choosing $λ$ is again done via cross-validation
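
To see how the $ℓ_{1}$ penalty produces exact zeros, here is a sketch of coordinate descent with soft-thresholding for the objective $R S S + λ \sum | β_{j} |$. This is one common way to fit the lasso, not necessarily the one used in the course. It assumes no intercept and no all-zero columns; the function names and the toy data are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward 0 by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for RSS + lam * sum(|beta_j|), no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of predictor j with the residual that leaves j out.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            # The threshold is lam / 2 because RSS carries no factor of 1/2.
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Toy data: the first predictor matters a lot, the second only weakly.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2, 3.0, 0.2]

beta_ols = lasso_cd(X, y, lam=0.0)    # no penalty: both coefficients nonzero
beta_lasso = lasso_cd(X, y, lam=2.0)  # second coefficient is set exactly to 0.0
```

The `return 0.0` branch in `soft_threshold` is what separates lasso from ridge: ridge would only shrink the weak coefficient, while lasso removes it once $λ$ is large enough.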

Lasso

Pros
Can be used when $p > n$
Can be used to help with multicollinearity
Will decrease variance (as $λ \to \infty$)
Can be used for variable selection, since it will make some $β$ coefficients exactly 0

Cons
Will have increased bias (compared to least squares)
If $p > n$ the lasso can select at most $n$ variables

Ridge versus lasso

Neither Ridge nor lasso will universally dominate
Cross-validation can also be used to determine which method (Ridge or lasso) should be used
Cross-validation is also used to select $λ$ in either method. You choose the $λ$ value for which the cross-validation error is the smallest

What if we want to do both?

Elastic net!