Getting started

Go to our class’s GitHub organization sta-363-s20
Find the GitHub repository (which we’ll refer to as “repo” going forward) for this lab, lab-03-cv-YOUR-GITHUB-HANDLE. This repo contains a template you can build on to complete your assignment.
On GitHub, click on the green Clone or download button, select Use HTTPS. Click on the clipboard icon to copy the repo URL.
Go to RStudio Cloud and into the course workspace. Create a New Project from Git Repo. You will need to click on the down arrow next to the New Project button to see this option.
Copy and paste the URL of your assignment repo into the dialog box.
Hit OK, and you’re good to go!

Packages

In this lab we will work with two packages: tidyverse which is a collection of packages for doing data analysis in a “tidy” way and the ISLR a package that contains our data.

Install these packages by running the following in the console.

install.packages("tidyverse")
install.packages("ISLR")

Now that the necessary packages are installed, you should be able to Knit your document and see the results.

If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.

library(tidyverse) 
library(ISLR)

Note that the packages are also loaded with the same commands in your R Markdown document.

Housekeeping

Git configuration

⊕Your email address is the address tied to your GitHub account and your name should be first and last name.

Go to the Terminal pane
Type the following two lines of code, replacing the information in the quotation marks with your info:

git config --global user.email "your email"
git config --global user.name "your name"

To confirm that the changes have been implemented, run the following

git config --global user.email
git config --global user.name

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

git config --global credential.helper 'cache --timeout 604800'

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 03 - Cross Validation”.

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.

Commiting and pushing changes:

Go to the Git pane in your RStudio.
View the Diff and confirm that you are happy with the changes.
Add a commit message like “Update team name” in the Commit message box and hit Commit.
Click on Push. This will prompt a dialogue box where you first need to enter your user name, and then your password.

Data

For this lab, we are using the Auto data from the ISLR package.

Exercises

Conceptual questions

Explain how k-fold Cross Validation is implemented.
What are the advantages / disadvantages of k-fold Cross Validation compared to the Validation Set approach? What are the advantages / disadvantages of k-fold Cross Validation compared to Leave-one-out Cross Validation?

K-fold cross validation

We are trying to decide between three models of varying flexibility:

Model 1:
Model 2:
Model 3:

Using the Auto data, shuffle the rows and add a column, k, that groups the data into two groups, the first group, k = 1, will consist of 196 rows of the data to train the model, and k = 2 will consist of 196 rows to test the model. Make sure you set a seed so you end up with the same answer each time you run this.

set.seed(1)
Auto_2 <- Auto %>%
  --- %>% ## shuffle the data 
  --- ## add a column, k, to split the data into two groups

⊕You can use the poly() function to fit a model with a polynomial term. For example, to fit the model , you would run lm(y ~ poly(x, 3), data = data)

Filter the data frame to the training data only. Fit the three models detailed above on this data. Then filter the data frame to the testing data only. Using the model created on the training data, predict mpg in the test data set. What is the test MSE for the three models? Which model would you choose?
Using the Auto data, shuffle the rows and add a column, k, that groups the data into five groups of approximately equal size. Make sure you set a seed so you end up with the same answer each time you run this.
Create a function that generalizes the approach in Exercises 3 and 4, with K as a parameter that can vary. The function should output a data frame that contains four columns:

MSE_1 - the mean squared error for Model 1
MSE_2 - the mean squared error for Model 2
MSE_3 - the mean squared error for Model 3
n_k - the sample size for the fold

Map this function from Exercise 6 over the 5 folds and calculate the overall 5-fold Cross Validation error for each of the three models. Which model would you chose?

Lab 03 - Cross validation

Introduction