
Ridge Regression

Dr. D’Agostino McGowan

1 / 47

📖 Canvas

2 / 47

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

3 / 47

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!

$(y - X\hat{\beta})^T(y - X\hat{\beta})$

3 / 47

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!

$(y - X\hat{\beta})^T(y - X\hat{\beta})$

What is the solution ($\hat{\beta}$) to this?

3 / 47

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!

$(y - X\hat{\beta})^T(y - X\hat{\beta})$

What is the solution ($\hat{\beta}$) to this?

$(X^TX)^{-1}X^Ty$

3 / 47
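This closed form is easy to check numerically. A minimal sketch, assuming a small made-up design (the data below are invented for illustration):

# sketch: compare the closed-form solution to lm()
set.seed(1)
x <- rnorm(10)
X <- cbind(1, x)                  # design matrix: intercept column plus one predictor
y <- 2 + 3 * x + rnorm(10)
solve(t(X) %*% X) %*% t(X) %*% y  # (X^T X)^{-1} X^T y
coef(lm(y ~ x))                   # lm() returns the same estimates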

Linear Regression Review

What is X?

4 / 47

Linear Regression Review

What is X?

  • the design matrix!
4 / 47

Matrix fact

$C = AB \Rightarrow C^T = B^TA^T$

5 / 47

Matrix fact

$C = AB \Rightarrow C^T = B^TA^T$

Try it!

  • Distribute (FOIL / get rid of the parentheses) the RSS equation

$\text{RSS} = (y - X\hat{\beta})^T(y - X\hat{\beta})$

5 / 47

Matrix fact

$C = AB \Rightarrow C^T = B^TA^T$

Try it!

  • Distribute (FOIL / get rid of the parentheses) the RSS equation

$\begin{aligned} \text{RSS} &= (y - X\hat{\beta})^T(y - X\hat{\beta}) \\ &= y^Ty - \hat{\beta}^TX^Ty - y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} \end{aligned}$

6 / 47

Matrix fact

  • the transpose of a scalar is a scalar
7 / 47

Matrix fact

  • the transpose of a scalar is a scalar
  • $\hat{\beta}^TX^Ty$ is a scalar

Why? What are the dimensions of $\hat{\beta}^T$? What are the dimensions of $X$? What are the dimensions of $y$?

7 / 47

Matrix fact

  • the transpose of a scalar is a scalar
  • $\hat{\beta}^TX^Ty$ is a scalar

Why? What are the dimensions of $\hat{\beta}^T$? What are the dimensions of $X$? What are the dimensions of $y$?

  • $(y^TX\hat{\beta})^T = \hat{\beta}^TX^Ty$
7 / 47

Matrix fact

  • the transpose of a scalar is a scalar
  • $\hat{\beta}^TX^Ty$ is a scalar

Why? What are the dimensions of $\hat{\beta}^T$? What are the dimensions of $X$? What are the dimensions of $y$?

  • $(y^TX\hat{\beta})^T = \hat{\beta}^TX^Ty$

$\begin{aligned} \text{RSS} &= (y - X\hat{\beta})^T(y - X\hat{\beta}) \\ &= y^Ty - \hat{\beta}^TX^Ty - y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} \\ &= y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta} \end{aligned}$

7 / 47
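The dimension-counting argument is easy to verify numerically. A quick sketch with made-up values (both products below are $1 \times 1$):

# sketch: beta_hat^T X^T y and its transpose y^T X beta_hat are the same scalar
beta_hat <- c(1, 2)
X <- matrix(c(1, 1, 1, 2, 3, 4), nrow = 3)  # 3 x 2 design matrix
y <- c(1, 0, 2)
t(beta_hat) %*% t(X) %*% y                  # 1 x 1
t(y) %*% X %*% beta_hat                     # the same 1 x 1 value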

Linear Regression Review

To find the $\hat{\beta}$ that minimizes this RSS, what do we do? Why?

$\begin{aligned} \text{RSS} &= (y - X\hat{\beta})^T(y - X\hat{\beta}) \\ &= y^Ty - \hat{\beta}^TX^Ty - y^TX\hat{\beta} + \hat{\beta}^TX^TX\hat{\beta} \\ &= y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta} \end{aligned}$

8 / 47

Matrix fact

  • When $a$ and $b$ are $p \times 1$ vectors

$\frac{\partial a^Tb}{\partial b} = \frac{\partial b^Ta}{\partial b} = a$

9 / 47

Matrix fact

  • When $a$ and $b$ are $p \times 1$ vectors

$\frac{\partial a^Tb}{\partial b} = \frac{\partial b^Ta}{\partial b} = a$

  • When $A$ is a symmetric matrix

$\frac{\partial b^TAb}{\partial b} = 2Ab = 2b^TA$

9 / 47

Matrix fact

  • When $a$ and $b$ are $p \times 1$ vectors

$\frac{\partial a^Tb}{\partial b} = \frac{\partial b^Ta}{\partial b} = a$

  • When $A$ is a symmetric matrix

$\frac{\partial b^TAb}{\partial b} = 2Ab = 2b^TA$

Try it!

$\frac{\partial \text{RSS}}{\partial \hat{\beta}} = $

  • $\text{RSS} = y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}$
9 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\text{RSS} = y^Ty - 2\hat{\beta}^TX^Ty + \hat{\beta}^TX^TX\hat{\beta}$

$\frac{\partial \text{RSS}}{\partial \hat{\beta}} = -2X^Ty + 2X^TX\hat{\beta} = 0$

10 / 47

Matrix fact

$AA^{-1} = I$

11 / 47

Matrix fact

$AA^{-1} = I$

What is $I$?

11 / 47

Matrix fact

$AA^{-1} = I$

What is $I$?

  • identity matrix

$I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

$AI = A$

11 / 47
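In R, `diag(n)` builds the $n \times n$ identity. A quick sketch of the $AI = A$ fact:

# sketch: multiplying by the identity returns the matrix unchanged
A <- matrix(c(2, 0, 1, 5, 3, 4, 7, 6, 8), nrow = 3)
I <- diag(3)  # 3 x 3 identity matrix
all.equal(A %*% I, A)
## [1] TRUE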

Try it!

  • Solve for $\hat{\beta}$

$-2X^Ty + 2X^TX\hat{\beta} = 0$

12 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\begin{aligned} -2X^Ty + 2X^TX\hat{\beta} &= 0 \\ 2X^TX\hat{\beta} &= 2X^Ty \\ X^TX\hat{\beta} &= X^Ty \end{aligned}$

13 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\begin{aligned} -2X^Ty + 2X^TX\hat{\beta} &= 0 \\ 2X^TX\hat{\beta} &= 2X^Ty \\ X^TX\hat{\beta} &= X^Ty \\ (X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned}$

14 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\begin{aligned} -2X^Ty + 2X^TX\hat{\beta} &= 0 \\ 2X^TX\hat{\beta} &= 2X^Ty \\ X^TX\hat{\beta} &= X^Ty \\ (X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \\ \underbrace{(X^TX)^{-1}X^TX}_{I}\hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned}$

15 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\begin{aligned} -2X^Ty + 2X^TX\hat{\beta} &= 0 \\ 2X^TX\hat{\beta} &= 2X^Ty \\ X^TX\hat{\beta} &= X^Ty \\ (X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \\ \underbrace{(X^TX)^{-1}X^TX}_{I}\hat{\beta} &= (X^TX)^{-1}X^Ty \\ I\hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned}$

16 / 47

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$?

$\begin{aligned} -2X^Ty + 2X^TX\hat{\beta} &= 0 \\ 2X^TX\hat{\beta} &= 2X^Ty \\ X^TX\hat{\beta} &= X^Ty \\ (X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \\ \underbrace{(X^TX)^{-1}X^TX}_{I}\hat{\beta} &= (X^TX)^{-1}X^Ty \\ I\hat{\beta} &= (X^TX)^{-1}X^Ty \\ \hat{\beta} &= (X^TX)^{-1}X^Ty \end{aligned}$

17 / 47
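The same algebra, carried out numerically on a small made-up design. A sketch; note that `solve(A, b)` solves $A\beta = b$ directly, mirroring the step $X^TX\hat{\beta} = X^Ty$, and is numerically preferable to forming the inverse explicitly:

# sketch: solve the normal equations X^T X beta = X^T y directly
X <- cbind(1, c(3, 4, 5, 2))   # intercept plus one predictor
y <- c(1, 2, 3, 2)
solve(t(X) %*% X, t(X) %*% y)  # same answer as solve(t(X) %*% X) %*% t(X) %*% y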

Linear Regression Review

Let's try to find an $X$ for which it would be impossible to calculate $\hat{\beta}$

18 / 47

Ridge

  • Go to RStudio Pro: rstudio.hpc.ar53.wfu.edu:8787
  • pw: R2D2Star!
19 / 47

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

20 / 47

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

  • when we can't invert $X^TX$
20 / 47

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

  • when we can't invert $X^TX$
    • $p > n$
    • multicollinearity
20 / 47

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

  • when we can't invert $X^TX$
    • $p > n$
    • multicollinearity

A guaranteed way to check whether a square matrix is not invertible is to check whether its determinant is equal to zero

20 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 2 & 3 & 1 \\ 1 & 3 & 4 & 0 \end{bmatrix}$

What is $n$ here? What is $p$?

21 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 2 & 3 & 1 \\ 1 & 3 & 4 & 0 \end{bmatrix}$

What is $n$ here? What is $p$?

Is $X^TX$ going to be invertible?

21 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 2 & 3 & 1 \\ 1 & 3 & 4 & 0 \end{bmatrix}$

What is $n$ here? What is $p$?

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
## [1] 0
21 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

22 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

22 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
22 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
23 / 47

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

24 / 47

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!
24 / 47

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

24 / 47

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a $2 \times 2$ matrix:

$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \qquad |A| = ad - bc$

24 / 47

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a $3 \times 3$ matrix:

$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$

25 / 47

Determinants

It looks funky, but it follows a nice pattern!

$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$

26 / 47

Determinants

It looks funky, but it follows a nice pattern!

$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$

  • (1) multiply $a$ by the determinant of the submatrix whose entries are not in $a$'s row or column
  • do the same for $b$ (2) and $c$ (3)
  • put it together as plus (1) minus (2) plus (3)
26 / 47

Determinants

It looks funky, but it follows a nice pattern!

$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$

  • (1) multiply $a$ by the determinant of the submatrix whose entries are not in $a$'s row or column
  • do the same for $b$ (2) and $c$ (3)
  • put it together as plus (1) minus (2) plus (3)

$|A| = a\begin{vmatrix} e & f \\ h & i \end{vmatrix} - b\begin{vmatrix} d & f \\ g & i \end{vmatrix} + c\begin{vmatrix} d & e \\ g & h \end{vmatrix}$

26 / 47
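A sketch checking the expansion against R's det() on a made-up $3 \times 3$ matrix:

# sketch: first-row cofactor expansion agrees with det()
A <- matrix(c(2, 1, 0,   # column 1: a, d, g
              3, 5, 2,   # column 2: b, e, h
              1, 4, 6),  # column 3: c, f, i
            nrow = 3)
a <- A[1, 1]; b <- A[1, 2]; c <- A[1, 3]
d <- A[2, 1]; e <- A[2, 2]; f <- A[2, 3]
g <- A[3, 1]; h <- A[3, 2]; i <- A[3, 3]
a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
## [1] 28
det(A)
## [1] 28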

Determinants

  • Calculate the determinant of the following matrices in R using the det() function:

$A = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$

$B = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7 \end{bmatrix}$

  • Are these both invertible?
27 / 47
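If you want to check your answers afterward, a sketch of the matching calls (the matrices are entered column by column):

A <- matrix(c(1, 4, 2, 5), nrow = 2)
B <- matrix(c(1, 3, 2, 2, 6, 5, 3, 9, 7), nrow = 3)
det(A)
det(B)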

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

28 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

28 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0.0056
cor(X[, 2], X[, 3])
## [1] 0.999993
28 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 1.285714
## [2,] -114.285714
## [3,] 57.285714
29 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

Is $X^TX$ going to be invertible?

$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 1.28 \\ -114.29 \\ 57.29 \end{bmatrix}$

30 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

What is the equation for the variance of $\hat{\beta}$?

$\text{var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$

31 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

What is the equation for the variance of $\hat{\beta}$?

$\text{var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$

  • $\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1}$
31 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

What is the equation for the variance of $\hat{\beta}$?

$\text{var}(\hat{\beta}) = \sigma^2(X^TX)^{-1}$

  • $\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1}$

$\text{var}(\hat{\beta}) = \begin{bmatrix} 0.918 & -24.489 & 12.132 \\ -24.489 & 4081.571 & -2038.745 \\ 12.132 & -2038.745 & 1018.367 \end{bmatrix}$

31 / 47
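A sketch that reproduces this variance matrix from the pieces above (here $p = 2$ predictors, so $\hat{\sigma}^2 = \text{RSS}/(n - p - 1)$; with so little data this close to singularity, the printed values can wobble numerically):

# sketch: var(beta_hat) = sigma^2 (X^T X)^{-1}, with sigma^2 estimated from the RSS
X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
rss <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- rss / (nrow(X) - 2 - 1)  # n - p - 1
sigma2_hat * solve(t(X) %*% X)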

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

$\text{var}(\hat{\beta}) = \begin{bmatrix} 0.918 & -24.489 & 12.132 \\ -24.489 & 4081.571 & -2038.745 \\ 12.132 & -2038.745 & 1018.367 \end{bmatrix}$

What is the variance for $\hat{\beta}_0$?

32 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

$\text{var}(\hat{\beta}) = \begin{bmatrix} 0.918 & -24.489 & 12.132 \\ -24.489 & 4081.571 & -2038.745 \\ 12.132 & -2038.745 & 1018.367 \end{bmatrix}$

What is the variance for $\hat{\beta}_0$?

33 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

$\text{var}(\hat{\beta}) = \begin{bmatrix} 0.918 & -24.489 & 12.132 \\ -24.489 & 4081.571 & -2038.745 \\ 12.132 & -2038.745 & 1018.367 \end{bmatrix}$

What is the variance for $\hat{\beta}_1$?

34 / 47

Estimating $\hat{\beta}$

$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$

$\text{var}(\hat{\beta}) = \begin{bmatrix} 0.918 & -24.489 & 12.132 \\ -24.489 & 4081.571 & -2038.745 \\ 12.132 & -2038.745 & 1018.367 \end{bmatrix}$

What is the variance for $\hat{\beta}_1$? 😱

35 / 47

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$

Why?

36 / 47

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
    • $X^TX$ is not invertible
37 / 47

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
    • $X^TX$ is not invertible
    • We have more variables than observations ($p > n$)
    • The variables are linear combinations of one another
37 / 47

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
    • $X^TX$ is not invertible
    • We have more variables than observations ($p > n$)
    • The variables are linear combinations of one another
  • Even when we can invert $X^TX$, things can go wrong
37 / 47

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
    • $X^TX$ is not invertible
    • We have more variables than observations ($p > n$)
    • The variables are linear combinations of one another
  • Even when we can invert $X^TX$, things can go wrong
    • The variance can blow up, like we just saw!
37 / 47

What can we do about this?

38 / 47

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
39 / 47

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing RSS, like we do with linear regression, let's minimize RSS PLUS some penalty function
39 / 47

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing RSS, like we do with linear regression, let's minimize RSS PLUS some penalty function

$\text{RSS} + \underbrace{\lambda\sum_{j=1}^p \beta_j^2}_{\text{shrinkage penalty}}$

39 / 47

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing RSS, like we do with linear regression, let's minimize RSS PLUS some penalty function

$\text{RSS} + \underbrace{\lambda\sum_{j=1}^p \beta_j^2}_{\text{shrinkage penalty}}$

What happens when $\lambda = 0$? What happens as $\lambda \to \infty$?

39 / 47

Ridge Regression

Let's solve for the $\hat{\beta}$ coefficients using Ridge Regression. What are we minimizing?

40 / 47

Ridge Regression

Let's solve for the $\hat{\beta}$ coefficients using Ridge Regression. What are we minimizing?

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$

40 / 47

Try it!

  • Find $\hat{\beta}$ that minimizes this:

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$
41 / 47

Ridge Regression

$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$

42 / 47

Ridge Regression

$\hat{\beta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$

  • Not only does this help with the variance, it solves our problem when $X^TX$ isn't invertible!
42 / 47
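A sketch using the earlier rank-deficient design, where $(X^TX)^{-1}$ did not exist: once $\lambda I$ is added, the matrix becomes invertible (the value of $\lambda$ here is arbitrary, just for illustration):

# sketch: ridge estimates exist even though det(X^T X) = 0 here
X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
y <- c(1, 2, 3, 2)
lambda <- 0.1
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y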

Choosing λ

  • $\lambda$ is known as a tuning parameter and is selected using cross validation
  • For example, choose the $\lambda$ that results in the smallest estimated test error
43 / 47
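One common workflow (a sketch, assuming the glmnet package; `alpha = 0` is glmnet's setting for ridge, and the simulated data are made up):

# sketch: pick lambda by cross validation with glmnet
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- x[, 1] + rnorm(100)
cv_fit <- cv.glmnet(x, y, alpha = 0)  # 10-fold CV over a grid of lambda values
cv_fit$lambda.min                     # lambda with the smallest estimated test error
coef(cv_fit, s = "lambda.min")        # ridge coefficients at that lambda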

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

44 / 47

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
44 / 47

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\text{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$
44 / 47

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\text{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$

    What would this be if $\lambda$ were 0?

44 / 47

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\text{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$

    What would this be if $\lambda$ were 0?

  • Var($\hat{\beta}_{\text{ridge}}$) $= \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$
44 / 47

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\text{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$

    What would this be if $\lambda$ were 0?

  • Var($\hat{\beta}_{\text{ridge}}$) $= \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$

    Is this bigger or smaller than $\sigma^2(X^TX)^{-1}$? What is this when $\lambda = 0$? As $\lambda \to \infty$, does this go up or down?

44 / 47
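A sketch tracing both formulas as $\lambda$ grows, using the near-singular design from earlier and a made-up true $\beta$ and $\sigma^2$:

# sketch: squared bias rises and total variance falls as lambda grows
X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
beta <- c(1, 2, 3)  # made-up true coefficients
sigma2 <- 1         # made-up error variance
for (lambda in c(0, 0.1, 1, 10)) {
  S <- solve(t(X) %*% X + lambda * diag(3))
  bias <- -lambda * S %*% beta
  v <- sigma2 * S %*% (t(X) %*% X) %*% S
  cat("lambda =", lambda,
      " sum(bias^2) =", round(sum(bias^2), 3),
      " sum of variances =", round(sum(diag(v)), 3), "\n")
}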

Ridge Regression

  • IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)
45 / 47

Ridge Regression

  • IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)

Why?

45 / 47
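Because the penalty $\lambda\sum_{j=1}^p \beta_j^2$ treats every coefficient alike, a predictor measured on a large scale gets a small coefficient and is barely penalized, while the same predictor in small units gets a large coefficient and is penalized hard. A sketch of the fix with R's scale() (the columns and their scales below are made up; note that glmnet's `standardize` argument defaults to TRUE, so it does this for you):

# sketch: put all columns on a common scale before penalizing
set.seed(1)
x <- sweep(matrix(rnorm(20 * 3), ncol = 3), 2, c(1, 10, 100), "*")  # wildly different scales
x_std <- scale(x)    # center each column and divide by its standard deviation
apply(x_std, 2, sd)  # every column now has standard deviation 1
## [1] 1 1 1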
