
Logistic regression, LDA, QDA

Dr. D’Agostino McGowan

1 / 41

☝️ Reminders

2 / 41

📖 Canvas

  • use Google Chrome
3 / 41


Recap

  • Last class we had a linear regression refresher
  • We covered how to write a linear model in matrix form
  • We learned how to minimize RSS to calculate $\hat\beta$ with $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ (see the sketch below)
  • Linear regression is a great tool when we have a continuous outcome
  • We are going to learn some fancy ways to do even better in the future
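A quick check of that closed-form solution in R, against lm() (a minimal sketch; the simulated x and y are made up for illustration):

set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)                    # simulated continuous outcome

X <- cbind(1, x)                               # design matrix with intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X^T X)^{-1} X^T y
beta_hat
coef(lm(y ~ x))                                # matches the closed-form estimate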
4 / 41

Classification

5 / 41


Classification

What are some examples of classification problems?

  • Qualitative response variable in an unordered set, $\mathcal{C}$
    • eye color $\in$ {blue, brown, green}
    • email $\in$ {spam, not spam}
  • Response, $Y$, takes on values in $\mathcal{C}$
  • Predictors are a vector, $X$
  • The task: build a function $C(X)$ that takes $X$ and predicts $Y$, with $C(X) \in \mathcal{C}$
  • Many times we are actually more interested in the probabilities that $X$ belongs to each category in $\mathcal{C}$
6 / 41

Example: Credit card default

7 / 41


Can we use linear regression?

We can code Default as

$$Y = \begin{cases} 0 & \text{if No} \\ 1 & \text{if Yes} \end{cases}$$

Can we fit a linear regression of $Y$ on $X$ and classify as Yes if $\hat{Y} > 0.5$?

  • In this case of a binary outcome, linear regression is okay (it is equivalent to linear discriminant analysis; we'll get to that soon!)
  • $E[Y|X=x] = P(Y=1|X=x)$, so it seems like this is a pretty good idea!
  • The problem: Linear regression can produce probabilities less than 0 or greater than 1 😱

    What may do a better job?

  • Logistic regression!
8 / 41

Linear versus logistic regression

Which does a better job at predicting the probability of default?

  • The orange marks represent the response $Y \in \{0, 1\}$
9 / 41


Linear Regression

What if we have >2 possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms

$$Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$$

What could go wrong here?

  • The coding implies an ordering
  • The coding implies equal spacing (that is, the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure)
10 / 41

Linear Regression

What if we have >2 possible outcomes? For example, someone comes to the emergency room and we need to classify them according to their symptoms

$$Y = \begin{cases} 1 & \text{if stroke} \\ 2 & \text{if drug overdose} \\ 3 & \text{if epileptic seizure} \end{cases}$$

  • Linear regression is not appropriate here
  • Multiclass logistic regression or discriminant analysis are more appropriate
11 / 41


Logistic Regression

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

  • Note: $p(X)$ is shorthand for $P(Y=1|X)$
  • No matter what values $\beta_0$, $\beta_1$, or $X$ take, $p(X)$ will always be between 0 and 1
  • We can rearrange this into the following form: $\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X$

What is this transformation called?

  • This is a log odds or logit transformation of p(X)
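R has this transformation built in: qlogis() is the logit and plogis() is its inverse (a small sketch, not tied to any particular data):

p <- 0.25
qlogis(p)           # log(p / (1 - p)), the log odds: about -1.1
plogis(qlogis(p))   # back to 0.25; plogis() maps any real number into (0, 1)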
12 / 41

Linear versus logistic regression

Logistic regression ensures that our estimates for $p(X)$ are between 0 and 1 🎉
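A sketch of how you might see this yourself, assuming the Default data from the ISLR package:

library(ISLR)

linear_fit   <- lm(as.numeric(default == "Yes") ~ balance, data = Default)
logistic_fit <- glm(default ~ balance, data = Default, family = "binomial")

range(fitted(linear_fit))     # dips below 0 for small balances
range(fitted(logistic_fit))   # stays strictly between 0 and 1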

13 / 41


Maximum Likelihood

Refresher: How did we estimate $\hat\beta$ in linear regression?

In logistic regression, we use maximum likelihood to estimate the parameters

$$\ell(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i:y_i=0} (1 - p(x_i))$$

  • This likelihood gives the probability of the observed ones and zeros in the data
  • We pick β0 and β1 to maximize the likelihood
  • We'll let R do the heavy lifting here
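For intuition, though, here is a rough sketch of that heavy lifting done by hand: write the negative log likelihood and minimize it with optim() (assuming the ISLR Default data; glm() uses a more robust fitting algorithm than this):

neg_log_lik <- function(beta, x, y) {
  eta <- beta[1] + beta[2] * x
  # log p(x) and log(1 - p(x)), computed stably via plogis(..., log.p = TRUE)
  -sum(y * plogis(eta, log.p = TRUE) + (1 - y) * plogis(-eta, log.p = TRUE))
}

y   <- as.numeric(Default$default == "Yes")
fit <- optim(c(0, 0), neg_log_lik, x = Default$balance, y = y,
             control = list(maxit = 5000))
fit$par   # roughly (-10.65, 0.0055), matching glm() on the next slide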
14 / 41

Let's see it in R

glm(default ~ balance, data = Default, family = "binomial") %>%
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -10.7     0.361        -29.5 3.62e-191
## 2 balance       0.00550 0.000220      25.0 1.98e-137
  • Use the glm() function in R with the family = "binomial" argument
15 / 41


Making predictions

What is our estimated probability of default for someone with a balance of $1000?

term         estimate     std.error  statistic  p.value
(Intercept)  -10.6513306  0.3611574  -29.49221  0
balance        0.0054989  0.0002204   24.95309  0

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}} = \frac{e^{-10.65 + 0.0055 \times 1000}}{1 + e^{-10.65 + 0.0055 \times 1000}} = 0.006$$

16 / 41


Making predictions

What is our estimated probability of default for someone with a balance of $2000?

term         estimate     std.error  statistic  p.value
(Intercept)  -10.6513306  0.3611574  -29.49221  0
balance        0.0054989  0.0002204   24.95309  0

$$\hat{p}(X) = \frac{e^{\hat\beta_0 + \hat\beta_1 X}}{1 + e^{\hat\beta_0 + \hat\beta_1 X}} = \frac{e^{-10.65 + 0.0055 \times 2000}}{1 + e^{-10.65 + 0.0055 \times 2000}} = 0.586$$
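Rather than plugging into the formula by hand, we can let predict() do it (a sketch, assuming the model fit from the earlier slide):

fit <- glm(default ~ balance, data = Default, family = "binomial")
predict(fit, newdata = data.frame(balance = c(1000, 2000)), type = "response")
# type = "response" returns probabilities (about 0.006 and 0.586, matching the
# hand calculations); the default, type = "link", returns the log odds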

17 / 41


Logistic regression example

Let's refit the model to predict the probability of default given the customer is a student

term         estimate    std.error  statistic   p.value
(Intercept)  -3.5041278  0.0707130  -49.554219  0.0000000
studentYes    0.4048871  0.1150188    3.520181  0.0004313

$$P(\text{default = Yes}\,|\,\text{student = Yes}) = \frac{e^{-3.5041 + 0.4049 \times 1}}{1 + e^{-3.5041 + 0.4049 \times 1}} = 0.0431$$

How will this change if student = No?

$$P(\text{default = Yes}\,|\,\text{student = No}) = \frac{e^{-3.5041 + 0.4049 \times 0}}{1 + e^{-3.5041 + 0.4049 \times 0}} = 0.0292$$

18 / 41


Multiple logistic regression

$$\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

term         estimate     std.error  statistic   p.value
(Intercept)  -10.8690452  0.4922555  -22.080088  0.0000000
balance        0.0057365  0.0002319   24.737563  0.0000000
income         0.0000030  0.0000082    0.369815  0.7115203
studentYes    -0.6467758  0.2362525   -2.737646  0.0061881
  • Why is the coefficient for student negative now when it was positive before?
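One way to see the flip is to fit the marginal and adjusted models side by side (a sketch, assuming the Default data and broom's tidy() as before):

glm(default ~ student, data = Default, family = "binomial") %>%
  tidy()   # studentYes is positive: students default more overall
glm(default ~ balance + income + student, data = Default, family = "binomial") %>%
  tidy()   # studentYes is negative once balance and income are held fixed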
19 / 41

Confounding

What is going on here?

20 / 41


Confounding

  • Students tend to have higher balances than non-students
    • Their marginal default rate is higher
  • For each level of balance, students default less
    • Their conditional default rate is lower
21 / 41

Logistic regression for more than two classes

  • So far we've discussed binary outcome data
  • We can generalize this to situations with multiple classes

$$P(Y=k|X) = \frac{e^{\beta_{0k} + \beta_{1k} X_1 + \cdots + \beta_{pk} X_p}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_{1l} X_1 + \cdots + \beta_{pl} X_p}}$$

  • Here we have a linear function for each of the K classes
  • This is known as multinomial logistic regression
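One way to fit such a model in R is nnet::multinom(); a sketch, where the er data frame and its diagnosis and symptoms variables are hypothetical:

library(nnet)

# diagnosis has K = 3 levels: stroke, drug overdose, epileptic seizure
fit <- multinom(diagnosis ~ symptoms, data = er)
predict(fit, type = "probs")   # one estimated probability per class, summing to 1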
22 / 41

Discriminant Analysis

  • Another way to model multiple classes

💡 Big idea:

  • Model the distribution of $X$ in each class separately, $P(X|Y)$
  • Use Bayes theorem to flip things around to get $P(Y|X)$
23 / 41

Bayes Theorem

What is Bayes theorem?

$$P(Y=k|X=x) = \frac{P(X=x|Y=k) \times P(Y=k)}{P(X=x)}$$

Each piece of the theorem has a name:

$$\underbrace{P(Y=k|X=x)}_{\text{posterior}} = \frac{\overbrace{P(X=x|Y=k)}^{\text{likelihood}} \times \overbrace{P(Y=k)}^{\text{prior}}}{P(X=x)}$$

29 / 41


Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+|\text{Sick})\,P(\text{Sick}) + P(+|\text{Healthy})\,P(\text{Healthy})}$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • Finally, let's suppose the disease is fairly common: 20% of people in the population have it.

    What is my probability of having the disease given I tested positive?

30 / 41

Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{0.99 \times 0.2}{0.99 \times 0.2 + 0.01 \times 0.8} = 0.96$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • Finally, let's suppose the disease is fairly common: 20% of people in the population have it.

What is my probability of having the disease given I tested positive?

31 / 41


Bayes Theorem Example

$$P(\text{Sick}|+) = \frac{P(+|\text{Sick})\,P(\text{Sick})}{P(+)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = 0.09$$

  • Often when a test is created its sensitivity is calculated, that is, the true positive rate $P(+|\text{Sick})$. Let's say in this case that is 99%
  • Let's suppose the probability of a positive test if you are healthy is small, 1%
  • If the disease is rare (let's say 0.1% have it), how does that change my probability of having it given a positive test?

What is my probability of having the disease given I tested positive?
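Since this is just arithmetic, a quick sketch in R covering both scenarios:

# P(Sick | +) via Bayes theorem
posterior <- function(sens, p_pos_healthy, prev) {
  sens * prev / (sens * prev + p_pos_healthy * (1 - prev))
}

posterior(0.99, 0.01, 0.2)     # common disease: about 0.96, as above
posterior(0.99, 0.01, 0.001)   # rare disease: about 0.09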

33 / 41


Bayes Theorem and Discriminant Analysis

$$P(Y|X) = \frac{P(X|Y) \times P(Y)}{P(X)}$$

This same equation is used for discriminant analysis with slightly different notation:

$$P(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

  • $f_k(x) = P(X=x|Y=k)$ is the density for $X$ in class $k$
    • For linear discriminant analysis we will use the normal distribution to represent this density
  • $\pi_k = P(Y=k)$ is the marginal or prior probability for class $k$
34 / 41


Discriminant analysis

  • Here there are two classes
  • We classify new points based on which density is highest
  • On the left, the priors for the two classes are the same
  • On the right, we favor the orange class, making the decision boundary shift to the left
35 / 41


Why discriminant analysis?

  • When the classes are well separated, logistic regression is unstable; linear discriminant analysis (LDA) is not
  • When n is small and the distribution of predictors ( X ) is approximately normal in each class, the linear discriminant model is more stable than the logistic model
  • When we have more than 2 classes, LDA also provides a nice low dimensional way to visualize data
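In R, LDA is available via MASS::lda(); a sketch on the Default data (this particular model is our choice for illustration):

library(MASS)

lda_fit <- lda(default ~ balance + student, data = Default)
lda_fit$prior                      # the estimated prior probabilities, pi_k
head(predict(lda_fit)$posterior)   # posterior probabilities for No / Yes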
36 / 41


Linear Discriminant Analysis p = 1

The density for the normal distribution is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

  • $\mu_k$ is the mean in class $k$
  • $\sigma_k^2$ is the variance in class $k$ (we will assume the $\sigma_k = \sigma$ are the same for all classes)
37 / 41


Linear Discriminant Analysis p = 1

The density for the normal distribution is

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$

  • We can plug this into Bayes formula

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma_l} e^{-\frac{1}{2}\left(\frac{x - \mu_l}{\sigma_l}\right)^2}}$$

😅 Luckily things cancel!

38 / 41


Discriminant functions

  • To classify an observation where $X = x$ we need to determine which of the $p_k(x)$ is the largest
  • It turns out this is equivalent to assigning $x$ to the class with the largest discriminant score

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

  • This discriminant score, $\delta_k(x)$, is a function of $p_k(x)$ (we took some logs and discarded terms that don't include $k$)
  • $\delta_k(x)$ is a linear function of $x$

If $K=2$, how do you think we would calculate the decision boundary?

39 / 41


Discriminant functions

$$\delta_1(x) = \delta_2(x)$$

  • Let's set $\pi_1 = \pi_2 = 0.5$

$$\begin{aligned}
x \frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(0.5) &= x \frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(0.5) \\
x \frac{\mu_1}{\sigma^2} - x \frac{\mu_2}{\sigma^2} &= -\frac{\mu_2^2}{2\sigma^2} + \log(0.5) + \frac{\mu_1^2}{2\sigma^2} - \log(0.5) \\
x(\mu_1 - \mu_2) &= \frac{\mu_1^2 - \mu_2^2}{2} \\
x &= \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)} \\
x &= \frac{(\mu_1 - \mu_2)(\mu_1 + \mu_2)}{2(\mu_1 - \mu_2)} \\
x &= \frac{\mu_1 + \mu_2}{2}
\end{aligned}$$
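A quick numeric check of this result (the values for the means, variance, and prior here are made up):

mu_1 <- 0; mu_2 <- 3; sigma <- 1; pi_k <- 0.5

delta <- function(x, mu) x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)

x_star <- (mu_1 + mu_2) / 2   # the boundary derived above: 1.5
delta(x_star, mu_1)           # the two discriminant scores are equal here...
delta(x_star, mu_2)           # ...so the midpoint is the decision boundary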

40 / 41