
Logistic regression, LDA, QDA

Dr. D’Agostino McGowan

1 / 41

☝️ Reminders

2 / 41

📖 Canvas

  • use Google Chrome
3 / 41


Recap

  • Last class we had a linear regression refresher
  • We covered how to write a linear model in matrix form
  • We learned how to minimize RSS to calculate $\hat\beta$ with $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (see the sketch below)
  • Linear regression is a great tool when we have a continuous outcome
  • We are going to learn some fancy ways to do even better in the future
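A quick check of that closed-form solution in R, against lm() (a minimal sketch; the simulated x and y are made up for illustration):

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)                    # simulated continuous outcome

X <- cbind(1, x)                               # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ x))                                # matches the closed-form estimate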
4 / 41

Classification

5 / 41


Classification

What are some examples of classification problems?

  • Qualitative response variable in an unordered set, $\mathcal{C}$
    • eye color $\in$ {blue, brown, green}
    • email $\in$ {spam, not spam}
  • Response, $Y$, takes on values in $\mathcal{C}$
  • Predictors are a vector, $X$
  • The task: build a function $C(X)$ that takes $X$ and predicts $Y$, with $C(X) \in \mathcal{C}$
  • Many times we are actually more interested in the probabilities that $X$ belongs to each category in $\mathcal{C}$
6 / 41

Example: Credit card default

7 / 41


Can we use linear regression?

We can code Default as

$$Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}$$

Can we fit a linear regression of $Y$ on $X$ and classify as Yes if $\hat{Y} > 0.5$?

  • In this case of a binary outcome, linear regression is okay (it is equivalent to linear discriminant analysis; we'll get to that soon!)
  • $E[Y|X=x] = P(Y=1|X=x)$, so it seems like this is a pretty good idea!
  • The problem: Linear regression can produce probabilities less than 0 or greater than 1 😱

    What may do a better job?

  • Logistic regression!
8 / 41

Linear versus logistic regression

Which does a better job at predicting the probability of default?

  • The orange marks represent the response $Y \in \{0, 1\}$
9 / 41


Linear Regression

What if we have >2 possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms

$$Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$$

What could go wrong here?

  • The coding implies an ordering
  • The coding implies equal spacing (that is, the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure)
10 / 41

Linear Regression

What if we have >2 possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms

$$Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$$

  • Linear regression is not appropriate here
  • Multiclass logistic regression or discriminant analysis are more appropriate
11 / 41


Logistic Regression

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

  • Note: $p(X)$ is shorthand for $P(Y=1|X)$
  • No matter what values $\beta_0$, $\beta_1$, or $X$ take, $p(X)$ will always be between 0 and 1
  • We can rearrange this into the following form: $\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X$

What is this transformation called?

  • This is a log odds or logit transformation of p(X)
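R has this transformation built in: qlogis() is the logit and plogis() is its inverse (a small sketch, not tied to any particular data):

p <- 0.25
qlogis(p)           # log(p / (1 - p)), the log odds: about -1.1
plogis(qlogis(p))   # back to 0.25; plogis() maps any real number into (0, 1)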
12 / 41

Linear versus logistic regression

Logistic regression ensures that our estimates for $p(X)$ are between 0 and 1 🎉
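A sketch of how you might see this yourself, assuming the Default data from the ISLR package:

library(ISLR)

linear_fit   <- lm(as.numeric(default == "Yes") ~ balance, data = Default)
logistic_fit <- glm(default ~ balance, data = Default, family = "binomial")

range(fitted(linear_fit))     # dips below 0 for small balances
range(fitted(logistic_fit))   # stays strictly between 0 and 1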

13 / 41


Maximum Likelihood

Refresher: How did we estimate $\hat\beta$ in linear regression?

In logistic regression, we use maximum likelihood to estimate the parameters

$$\ell(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i:y_i=0} (1 - p(x_i))$$

  • This likelihood gives the probability of the observed ones and zeros in the data
  • We pick β0 and β1 to maximize the likelihood
  • We'll let R do the heavy lifting here
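For intuition, though, here is a rough sketch of that heavy lifting done by hand: write the negative log likelihood and minimize it with optim() (assuming the ISLR Default data; glm() uses a more robust fitting algorithm than this):

neg_log_lik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  # log p(x) and log(1 - p(x)), computed stably via plogis(..., log.p = TRUE)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

y   <- as.numeric(Default$default == "Yes")
fit <- optim(c(0, 0), neg_log_lik, x = Default$balance, y = y,
             control = list(maxit = 5000))
fit$par   # roughly (-10.65, 0.0055), matching glm() on the next slide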
14 / 41

Let's see it in R

glm(default ~ balance, data = Default, family = "binomial") %>%
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.7     0.361        -29.5 3.62e-191
## 2 balance       0.00550 0.000220      25.0 1.98e-137
  • Use the glm() function in R with the family = "binomial" argument
15 / 41


Making predictions

What is our estimated probability of default for someone with a balance of $1000?

term         estimate     std.error  statistic  p.value
(Intercept)  -10.6513306  0.3611574  -29.49221  0
balance        0.0054989  0.0002204   24.95309  0

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}} = \frac{e^{-10.65 + 0.0055 \times 1000}}{1 + e^{-10.65 + 0.0055 \times 1000}} = 0.006$$

16 / 41


Making predictions

What is our estimated probability of default for someone with a balance of $2000?

term         estimate     std.error  statistic  p.value
(Intercept)  -10.6513306  0.3611574  -29.49221  0
balance        0.0054989  0.0002204   24.95309  0

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}} = \frac{e^{-10.65 + 0.0055 \times 2000}}{1 + e^{-10.65 + 0.0055 \times 2000}} = 0.586$$
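Rather than plugging into the formula by hand, we can let predict() do it (a sketch, assuming the model fit from the earlier slide):

fit <- glm(default ~ balance, data = Default, family = "binomial")
predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
# type = "response" returns probabilities (about 0.006 and 0.586, matching the
# hand calculations); the default, type = "link", returns the log odds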

17 / 41


Logistic regression example

Let's refit the model to predict the probability of default given the customer is a student

term         estimate    std.error  statistic   p.value
(Intercept)  -3.5041278  0.0707130  -49.554219  0.0000000
studentYes    0.4048871  0.1150188    3.520181  0.0004313

$$P(\text{default = Yes}\,|\,\text{student = Yes}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431$$

How will this change if student = No?

$$P(\text{default = Yes}\,|\,\text{student = No}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292$$

18 / 41


Multiple logistic regression

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

term         estimate     std.error  statistic   p.value
(Intercept)  -10.8690452  0.4922555  -22.080088  0.0000000
balance        0.0057365  0.0002319   24.737563  0.0000000
income         0.0000030  0.0000082    0.369815  0.7115203
studentYes    -0.6467758  0.2362525   -2.737646  0.0061881
  • Why is the coefficient for student negative now when it was positive before?
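One way to see the flip is to fit the marginal and adjusted models side by side (a sketch, assuming the Default data and broom's tidy() as before):

glm(default ~ student, data = Default, family = "binomial") %>%
  tidy()   # studentYes is positive: students default more overall
glm(default ~ balance + income + student, data = Default, family = "binomial") %>%
  tidy()   # studentYes is negative once balance and income are held fixed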
19 / 41

Confounding

What is going on here?

20 / 41


Confounding

  • Students tend to have higher balances than non-students
    • Their marginal default rate is higher
  • For each level of balance, students default less
    • Their conditional default rate is lower
21 / 41

Logistic regression for more than two classes

  • So far we've discussed binary outcome data
  • We can generalize this to situations with multiple classes

$$P(Y=k|X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{pl} X_p}}$$

  • Here we have a linear function for each of the K classes
  • This is known as multinomial logistic regression
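One way to fit such a model in R is nnet::multinom(); a sketch, where the er data frame and its diagnosis and symptoms variables are hypothetical:

library(nnet)

# diagnosis has K = 3 levels: stroke, drug overdose, epileptic seizure
fit <- multinom(diagnosis ~ symptoms, data = er)
predict(fit, type = "probs")   # one estimated probability per class, summing to 1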
22 / 41

Discriminant Analysis

  • Another way to model multiple classes

💡 Big idea:

  • Model the distribution of $X$ in each class separately, $P(X|Y)$
  • Use Bayes theorem to flip things around to get $P(Y|X)$
23 / 41

Bayes Theorem

What is Bayes theorem?

$$P(Y=k|X=x) = \frac{P(X=x|Y=k) \times P(Y=k)}{P(X=x)}$$

Each piece of the theorem has a name:

$$\underbrace{P(Y=k|X=x)}_{\text{posterior}} = \frac{\overbrace{P(X=x|Y=k)}^{\text{likelihood}} \times \overbrace{P(Y=k)}^{\text{prior}}}{P(X=x)}$$

29 / 41


Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+|\text{Sick})\,P(\text{Sick}) + P(+|\text{Healthy})\,P(\text{Healthy})}$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • Finally, let's suppose the disease is fairly common: 20% of people in the population have it.

    What is my probability of having the disease given I tested positive?

30 / 41

Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{0.99 \times 0.2}{0.99 \times 0.2 + 0.01 \times 0.8} = 0.96$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • Finally, let's suppose the disease is fairly common: 20% of people in the population have it.

What is my probability of having the disease given I tested positive?

31 / 41


Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = 0.09$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • If the disease is rare (let's say 0.1% have it), how does that change my probability of having it given a positive test?

What is my probability of having the disease given I tested positive?
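Since this is just arithmetic, a quick sketch in R covering both scenarios:

# P(Sick | +) via Bayes theorem
posterior <- function(sens, p_pos_healthy, prev) {
  sens * prev / (sens * prev + p_pos_healthy * (1 - prev))
}

posterior(0.99, 0.01, 0.2)     # common disease: about 0.96, as above
posterior(0.99, 0.01, 0.001)   # rare disease: about 0.09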

33 / 41


Bayes Theorem and Discriminant Analysis

$$P(Y|X) = \frac{P(X|Y) \times P(Y)}{P(X)}$$

This same equation is used for discriminant analysis with slightly different notation:

$$P(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

  • $f_k(x) = P(X=x|Y=k)$ is the density for $X$ in class $k$
    • For linear discriminant analysis we will use the normal distribution to represent this density
  • $\pi_k = P(Y=k)$ is the marginal or prior probability for class $k$
34 / 41


Discriminant analysis

  • Here there are two classes
  • We classify new points based on which density is highest
  • On the left, the priors for the two classes are the same
  • On the right, we favor the orange class, making the decision boundary shift to the left
35 / 41


Why discriminant analysis?

  • When the classes are well separated, logistic regression is unstable; linear discriminant analysis (LDA) is not
  • When n is small and the distribution of predictors ( X ) is approximately normal in each class, the linear discriminant model is more stable than the logistic model
  • When we have more than 2 classes, LDA also provides a nice low dimensional way to visualize data
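In R, LDA is available via MASS::lda(); a sketch on the Default data (this particular model is our choice for illustration):

library(MASS)

lda_fit <- lda(default ~ balance + student, data = Default)
lda_fit$prior                      # the estimated prior probabilities, pi_k
head(predict(lda_fit)$posterior)   # posterior probabilities for No / Yes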
36 / 41


Linear Discriminant Analysis p = 1

The density for the normal distribution is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

  • $\mu_k$ is the mean in class $k$
  • $\sigma_k^2$ is the variance in class $k$ (we will assume the $\sigma_k = \sigma$ are the same for all classes)
37 / 41


Linear Discriminant Analysis p = 1

The density for the normal distribution is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

  • We can plug this into Bayes formula

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma_l} e^{-\frac{1}{2}\left(\frac{x - \mu_l}{\sigma_l}\right)^2}}$$

😅 Luckily things cancel!

38 / 41


Discriminant functions

  • To classify an observation where $X = x$ we need to determine which of the $p_k(x)$ is the largest
  • It turns out this is equivalent to assigning $x$ to the class with the largest discriminant score

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

  • This discriminant score, $\delta_k(x)$, is a function of $p_k(x)$ (we took some logs and discarded terms that don't include $k$)
  • $\delta_k(x)$ is a linear function of $x$

If $K=2$, how do you think we would calculate the decision boundary?

39 / 41


Discriminant functions

$$\delta_1(x) = \delta_2(x)$$

  • Let's set $\pi_1 = \pi_2 = 0.5$

$$\begin{aligned}
x \frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(0.5) &= x \frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(0.5) \\
x \frac{\mu_1}{\sigma^2} - x \frac{\mu_2}{\sigma^2} &= -\frac{\mu_2^2}{2\sigma^2} + \log(0.5) + \frac{\mu_1^2}{2\sigma^2} - \log(0.5) \\
x(\mu_1 - \mu_2) &= \frac{\mu_1^2 - \mu_2^2}{2} \\
x &= \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} \\
x &= \frac{(\mu_1 - \mu_2)(\mu_1 + \mu_2)}{2(\mu_1 - \mu_2)} \\
x &= \frac{\mu_1 + \mu_2}{2}
\end{aligned}$$
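A quick numeric check of this result (the values for the means, variance, and prior here are made up):

mu_1 <- 0; mu_2 <- 3; sigma <- 1; pi_k <- 0.5

delta <- function(x, mu) x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)

x_star <- (mu_1 + mu_2) / 2   # the boundary derived above: 1.5
delta(x_star, mu_1)           # the two discriminant scores are equal here...
delta(x_star, mu_2)           # ...so the midpoint is the decision boundary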

40 / 41