
Trade-offs: Accuracy and interpretability, bias and variance

Dr. D’Agostino McGowan

1 / 52

Study Sessions

  • Monday 7-9p
  • Manchester 122
2 / 52

Lab 01

  • Knit, Commit, Push often
  • Commit and Push all files
  • Check on GitHub.com to make sure everything is updating
  • You won't see a rendered file
3 / 52

📖 Canvas

  • use Google Chrome
4 / 52

Regression and Classification

  • Regression: quantitative response
  • Classification: qualitative (categorical) response

What would be an example of a regression problem? What would be an example of a classification problem?
7 / 52

Regression

8 / 52

Auto data

Above are mpg vs horsepower, weight, and acceleration, with a blue linear-regression line fit separately to each. Can we predict mpg using these three?

Maybe we can do better using a model:

$\text{mpg} \approx f(\text{horsepower}, \text{weight}, \text{acceleration})$

9 / 52

Notation

  • mpg is the response variable, the outcome variable, we refer to this as Y
  • horsepower is a feature, input, predictor, we refer to this as X1
  • weight is X2
  • acceleration is X3
  • Our input vector is

$X = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$

  • Our model is

$Y = f(X) + \epsilon$

  • ϵ is our error
10 / 52

Why do we care about f(X)?

  • We can use f(X) to make predictions of Y for new values of X = x
  • We can gain a better understanding of which components of $X = (X_1, X_2, \ldots, X_p)$ are important for explaining Y
  • Depending on how complex f is, maybe we can understand how each component $X_j$ of $X$ affects Y
11 / 52

How do we choose f(X)?

What is a good value for f(X) at a selected value of X, say X = 100? There can be many Y values at X = 100. A good value is

$f(100) = E(Y \mid X = 100)$

$E(Y \mid X = 100)$ means the expected value (average) of Y given X = 100

This ideal $f(x) = E(Y \mid X = x)$ is called the regression function

12 / 52

Regression function, f(X)

  • Also works for a vector, $X$, for example,

$f(x) = f(x_1, x_2, x_3) = E[Y \mid X_1 = x_1, X_2 = x_2, X_3 = x_3]$

  • This is the optimal predictor of $Y$ in terms of mean-squared prediction error:

    $f(x) = E(Y \mid X = x)$ is the function that minimizes $E[(Y - g(X))^2 \mid X = x]$ over all functions $g$ at all points $X = x$

  • $\epsilon = Y - f(x)$ is the irreducible error
  • Even if we knew $f(x)$, we would still make errors in prediction, since at each $X = x$ there is typically a distribution of possible $Y$ values
13 / 52


Using these points, how would I calculate the regression function?

  • Take the average! $f(100) = E[\text{mpg} \mid \text{horsepower} = 100] = 19.6$
15 / 52
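To make this concrete, a conditional mean like this is just an average over the matching rows. A minimal R sketch, assuming the Auto data from the ISLR package (the 19.6 above comes from the slide's plotted points, so the number here may differ slightly):

```r
# Estimate f(100) = E[mpg | horsepower = 100] by averaging the mpg values
# of the cars whose horsepower is exactly 100.
library(ISLR)

mean(Auto$mpg[Auto$horsepower == 100])
```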

This point has a Y value of 32.9. What is ϵ?

  • $\epsilon = Y - f(X) = 32.9 - 19.6 = 13.3$
16 / 52

The error

For any estimate, $\hat{f}(x)$, of $f(x)$, we have

$E[(Y - \hat{f}(x))^2 \mid X = x] = \underbrace{[f(x) - \hat{f}(x)]^2}_{\text{reducible error}} + \underbrace{\text{Var}(\epsilon)}_{\text{irreducible error}}$

17 / 52
  • Assume for a moment that both $\hat{f}$ and $X$ are fixed
  • $E(Y - \hat{Y})^2$ represents the average, or expected value, of the squared difference between the predicted and actual value of $Y$, and $\text{Var}(\epsilon)$ represents the variance associated with the error term
  • The focus of this class is on techniques for estimating $f$ with the aim of minimizing the reducible error
  • The irreducible error will always provide an upper bound on the accuracy of our prediction for $Y$
  • This bound is almost always unknown in practice
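A small simulation sketch of this decomposition (my own illustration, with a made-up f): even predicting with the true f, the mean squared prediction error cannot fall below Var(ϵ), while a biased estimate adds reducible error on top.

```r
set.seed(1)
n   <- 10000
f   <- function(x) 5 + 2 * x         # the "true" f, known here by construction
x   <- runif(n, 0, 10)
eps <- rnorm(n, sd = 2)              # irreducible error, Var(eps) = 4
y   <- f(x) + eps

mean((y - f(x))^2)                   # ~ 4: only the irreducible error remains
f_hat <- function(x) 4 + 2.2 * x     # a deliberately biased estimate
mean((y - f_hat(x))^2)               # larger: reducible error is added on top
```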

Estimating f

  • Typically we have very few (if any!) data points at $X = x$ exactly, so we cannot compute $E[Y \mid X = x]$
  • For example, what if we were interested in estimating miles per gallon when horsepower is 104?

💡 We can relax the definition and let

$\hat{f}(x) = E[Y \mid X \in N(x)]$

  • Where N(x) is some neighborhood of x
19 / 52
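A minimal sketch of this relaxed definition in R, again assuming the ISLR Auto data; the half-width of the neighborhood is an arbitrary choice for illustration:

```r
library(ISLR)

x0    <- 104
width <- 5                                   # N(x0) = [x0 - 5, x0 + 5]
in_nbhd <- abs(Auto$horsepower - x0) <= width

mean(Auto$mpg[in_nbhd])                      # f_hat(104): average mpg in N(104)
```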

Notation pause!

$\hat{f}(x) = \underbrace{E}_{\text{The expectation}}[\underbrace{Y}_{\text{of } Y} \underbrace{\mid}_{\text{given}} \underbrace{X \in N(x)}_{X \text{ is in the neighborhood of } x}]$

If you need a notation pause at any point during this class, please let me know!

20 / 52

Estimating f

💡 We can relax the definition and let

$\hat{f}(x) = E[Y \mid X \in N(x)]$

  • Nearest neighbor averaging does pretty well with small $p$ ($p \le 4$) and large $n$
  • Nearest neighbor is not great when $p$ is large because of the curse of dimensionality: nearest neighbors tend to be far away in high dimensions (a quick simulation follows below)

    What do I mean by p? What do I mean by n?

21 / 52
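A quick simulation sketch (my own, not from the slides) of why nearest neighbors break down: with n = 1000 points spread uniformly in [−1, 1]^p, the distance from the origin to its nearest neighbor grows rapidly with p.

```r
set.seed(1)
n <- 1000
for (p in c(1, 2, 5, 10, 50)) {
  X <- matrix(runif(n * p, -1, 1), nrow = n)   # n points in [-1, 1]^p
  d <- sqrt(rowSums(X^2))                      # distance of each point to 0
  cat("p =", p, "  nearest-neighbor distance:", round(min(d), 3), "\n")
}
```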

Parametric models

A common parametric model is a linear model

$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$

  • A linear model has $p + 1$ parameters, $\beta_0, \ldots, \beta_p$
  • We estimate these parameters by fitting the model to training data (a sketch with lm() follows below)
  • Although this model is almost never correct, it can often be a good interpretable approximation to the unknown true function, $f(X)$
22 / 52
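Fitting this linear model in R is one line with lm(); a minimal sketch assuming the ISLR Auto data, where p = 3 predictors give p + 1 = 4 estimated parameters:

```r
library(ISLR)

fit <- lm(mpg ~ horsepower + weight + acceleration, data = Auto)
coef(fit)   # beta_0 (intercept), beta_1, beta_2, beta_3
```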

Let's look at a simulated example

23 / 52
  • The red points are simulated values for income from the model:

$\text{income} = f(\text{education, seniority}) + \epsilon$

  • f is the blue surface
24 / 52

Linear regression model fit to the simulated data

$\hat{f}_L(\text{education, seniority}) = \hat{\beta}_0 + \hat{\beta}_1 \times \text{education} + \hat{\beta}_2 \times \text{seniority}$

25 / 52
  • More flexible regression model $\hat{f}_S(\text{education, seniority})$ fit to the simulated data
  • Here we use a technique called a thin-plate spline to fit a flexible surface
26 / 52
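For reference, one way to fit such a thin-plate spline surface in R is the mgcv package; a sketch assuming a hypothetical data frame income_df with columns income, education, and seniority (the book's simulated Income data is not bundled in a package):

```r
library(mgcv)

# s(..., bs = "tp") requests a thin-plate regression spline surface over the
# two predictors; income_df is a hypothetical data frame, as noted above.
fit_tps <- gam(income ~ s(education, seniority, bs = "tp"), data = income_df)
summary(fit_tps)
```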

And even MORE flexible 😱 model $\hat{f}(\text{education, seniority})$

  • Here we've basically drawn the surface to hit every point, minimizing the error, but completely overfitting
27 / 52

🤹 Finding balance

  • Prediction accuracy versus interpretability
    • Linear models are easy to interpret, thin-plate splines are not
  • Good fit versus overfit or underfit
    • How do we know when the fit is just right?
  • Parsimony versus black-box
    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all
28 / 52


Accuracy

  • We've fit a model $\hat{f}(x)$ to some training data: $\text{train} = \{x_i, y_i\}_1^N$
  • We can measure accuracy as the average squared prediction error over that train data

$\text{MSE}_{\text{train}} = \text{Ave}_{i \in \text{train}}[y_i - \hat{f}(x_i)]^2$

What can go wrong here?

  • This may be biased towards overfit models
30 / 52
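Computing the training MSE in R is a one-liner once a model is fit; a minimal sketch treating the full ISLR Auto data as the training set:

```r
library(ISLR)

fit <- lm(mpg ~ horsepower + weight + acceleration, data = Auto)
mean((Auto$mpg - predict(fit, newdata = Auto))^2)   # MSE_train
```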

Accuracy

I have some train data, plotted above. What $\hat{f}(x)$ would minimize $\text{MSE}_{\text{train}}$?

$\text{MSE}_{\text{train}} = \text{Ave}_{i \in \text{train}}[y_i - \hat{f}(x_i)]^2$

31 / 52

Accuracy

What is wrong with this?

It's overfit!

33 / 52

Accuracy

If we get a new sample, that overfit model is probably going to be terrible!

34 / 52

Accuracy

  • We've fit a model $\hat{f}(x)$ to some training data: $\text{train} = \{x_i, y_i\}_1^N$
  • Instead of measuring accuracy as the average squared prediction error over that train data, we can compute it using fresh test data: $\text{test} = \{x_i, y_i\}_1^M$

$\text{MSE}_{\text{test}} = \text{Ave}_{i \in \text{test}}[y_i - \hat{f}(x_i)]^2$

35 / 52
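A sketch of this in R with a random 70/30 train/test split of the Auto data (the split proportion is my choice, not from the slides):

```r
library(ISLR)
set.seed(1)

idx   <- sample(nrow(Auto), size = floor(0.7 * nrow(Auto)))
train <- Auto[idx, ]
test  <- Auto[-idx, ]

fit <- lm(mpg ~ horsepower + weight + acceleration, data = train)
mean((test$mpg - predict(fit, newdata = test))^2)   # MSE_test on fresh data
```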

Black curve is the "truth" on the left. Red curve on the right is $\text{MSE}_{\text{test}}$, grey curve is $\text{MSE}_{\text{train}}$. Orange, blue, and green curves/squares correspond to fits of different flexibility.

36 / 52

Here the truth is smoother, so the smoother fit and linear model do really well

37 / 52

Here the truth is wiggly and the noise is low, so the more flexible fits do the best

38 / 52

Bias-variance trade-off

  • We've fit a model, $\hat{f}(x)$, to some training data
  • Let's pull a test observation from this population, $(x_0, y_0)$
  • The true model is $Y = f(x) + \epsilon$, where $f(x) = E[Y \mid X = x]$

$E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)$

The expectation averages over the variability of $y_0$ as well as the variability of the training data, and $\text{Bias}(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$

  • As the flexibility of $\hat{f}$ increases, its variance increases and its bias decreases
  • Choosing the flexibility based on average test error amounts to a bias-variance trade-off
39 / 52
  • That U-shape we see for the test MSE curves is due to this bias-variance trade-off
  • The expected test MSE for a given $x_0$ can be decomposed into three components: the variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$, and the variance of the error term $\epsilon$
  • Here the notation $E[y_0 - \hat{f}(x_0)]^2$ defines the expected test MSE, and refers to the average test MSE that we would obtain if we repeatedly estimated $f$ using a large number of training sets and tested each at $x_0$
  • The overall expected test MSE can be computed by averaging $E[y_0 - \hat{f}(x_0)]^2$ over all possible values of $x_0$ in the test set
  • So we want to minimize the expected test error; to do that, we need to pick a statistical learning method that simultaneously achieves low bias and low variance
  • Since both of these quantities are non-negative, the expected test MSE can never fall below $\text{Var}(\epsilon)$

Bias-variance trade-off

40 / 52
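A simulation sketch of this decomposition at a single point x0 (my own made-up f, not from the slides): refit a deliberately inflexible model on many training sets and check that the expected test MSE matches variance + bias² + Var(ϵ).

```r
set.seed(1)
f      <- function(x) sin(2 * x)   # the "true" f in this made-up example
x0     <- 1
sigma  <- 0.3                      # sd of the irreducible error
n_sims <- 2000
f_hat0 <- numeric(n_sims)

for (s in 1:n_sims) {
  x   <- runif(50, 0, 3)
  y   <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)                 # an inflexible (biased) linear fit
  f_hat0[s] <- predict(fit, newdata = data.frame(x = x0))
}

y0 <- f(x0) + rnorm(n_sims, sd = sigma)            # fresh test responses at x0
mean((y0 - f_hat0)^2)                              # expected test MSE at x0
var(f_hat0) + (mean(f_hat0) - f(x0))^2 + sigma^2   # Var + Bias^2 + Var(eps)
```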

Classification

41 / 52

Notation

  • Y is the response variable. It is qualitative
  • C(X) is the classifier that assigns a class C to some future unlabeled observation, X
  • Examples:
    • Email can be classified as C = (spam, not spam)
    • Written number is one of C = {0, 1, 2, …, 9}
42 / 52

Classification Problem

What is the goal?

  • Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X
  • Assess the uncertainty in each classification
  • Understand the roles of the different predictors among $X = (X_1, X_2, \ldots, X_p)$
43 / 52

Suppose there are K elements in C, numbered 1, 2, …, K

$p_k(x) = P(Y = k \mid X = x),\quad k = 1, 2, \ldots, K$

These are conditional class probabilities at x.

How do you think we could calculate this?

  • In the plot, you could examine the mini-barplot at x=5
44 / 52

Suppose there are K elements in C, numbered 1, 2, …, K

  • The Bayes optimal classifier at x is

$C(x) = j \text{ if } p_j(x) = \max\{p_1(x), p_2(x), \ldots, p_K(x)\}$

45 / 52
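The rule itself is just an argmax over the class probabilities; a tiny R sketch with made-up probabilities at a single point x:

```r
# p_k(x) for K = 3 classes at one point x (made-up numbers for illustration)
p_x <- c(orange = 0.2, blue = 0.7, green = 0.1)

names(p_x)[which.max(p_x)]   # C(x): the Bayes classifier picks "blue"
```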
  • Notice that this probability is a conditional probability
  • It is the probability that Y equals k given the observed predictor vector, x
  • Say we were using a Bayes classifier for a two-class problem, where Y is 1 or 2: we would predict class 1 if $P(Y = 1 \mid X = x_0) > 0.5$ and class 2 otherwise

What if this was our data and there were no points at exactly (x = 5)? Then how could we calculate this?

  • Nearest neighbor like before!
  • This does break down as the dimensions grow, but the impact on $\hat{C}(x)$ is less than on $\hat{p}_k(x)$, $k = 1, 2, \ldots, K$
46 / 52

Accuracy

  • Misclassification error rate

$\text{Err}_{\text{test}} = \text{Ave}_{i \in \text{test}} I[y_i \neq \hat{C}(x_i)]$

  • The Bayes classifier using the true $p_k(x)$ has the smallest error
  • Some of the methods we will learn build structured models for $C(x)$ (support vector machines, for example)
  • Some build structured models for $p_k(x)$ (logistic regression, for example)
47 / 52
  • The test error rate $\text{Ave}_{i \in \text{test}} I[y_i \neq \hat{C}(x_i)]$ is minimized, on average, by a very simple classifier that assigns each observation to the most likely class given its predictor values (that's the Bayes classifier)
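The misclassification error rate is just the proportion of wrong predictions; a tiny sketch with made-up labels:

```r
y_test <- c("spam", "spam", "not spam", "not spam", "spam")  # true classes
y_hat  <- c("spam", "not spam", "not spam", "not spam", "spam")  # predictions

mean(y_test != y_hat)   # Err_test = 0.2
```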

K-Nearest-Neighbors example

48 / 52
  • Here is a simulated dataset of 100 observations in two groups, blue and orange
  • The purple dashed line represents the Bayes decision boundary
  • The orange background grid indicates the region where test observations will be classified as orange, and the blue grid the region where they will be classified as blue
  • We'd love to be able to use the Bayes classifier, but for real data we don't know the conditional distribution of Y given X, so computing the Bayes classifier is impossible
  • A lot of methods try to estimate the conditional distribution of Y given X and then classify a given observation to the class with the highest estimated probability
  • One method to do this is K-nearest neighbors

KNN (K = 10)

49 / 52
  • Again, the way KNN works: if K = 10, it finds the 10 closest observations, calculates the probability of being orange or blue, and classifies the point accordingly
  • So here is an example of K-nearest neighbors where K is 10
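A sketch of KNN with K = 10 in R using the class package, on made-up two-predictor data standing in for the slides' simulated orange/blue example:

```r
library(class)
set.seed(1)

n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(X$x1 + X$x2 + rnorm(n, sd = 0.5) > 0, "orange", "blue"))

x_new <- data.frame(x1 = 0.25, x2 = -0.1)     # a new point to classify
knn(train = X, test = x_new, cl = y, k = 10)  # majority vote of 10 neighbors
```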

KNN

50 / 52
  • Because this dataset has 100 data points, K can range from 1 to 100. At K = 1, the error rate in the training data will be 0, but the test error rate may be really high, so we are trying to find the happy medium. The test error has that same U-shaped relationship; you want to find the bottom of that U
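A sketch of that U-shape, continuing the made-up data from the previous chunk: as K grows from 1, training error starts at 0 and rises, while test error typically falls and then rises again.

```r
library(class)
set.seed(1)

n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(X$x1 + X$x2 + rnorm(n, sd = 0.5) > 0, "orange", "blue"))
tr <- sample(n, 100)   # 100 training points, 100 held-out test points

for (k in c(1, 5, 10, 25, 50, 100)) {
  pred_tr <- knn(X[tr, ], X[tr, ],  y[tr], k = k)   # predictions on train
  pred_te <- knn(X[tr, ], X[-tr, ], y[tr], k = k)   # predictions on test
  cat("K =", k,
      " train error:", mean(pred_tr != y[tr]),
      " test error:",  mean(pred_te != y[-tr]), "\n")
}
```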

Trade-offs

51 / 52

📖 Canvas

  • use Google Chrome
52 / 52
