library(tidymodels)
library(ISLR)
library(rpart.plot)
Using the College data from the ISLR package, predict the number of applications received from a subset of the variables of your choice using a decision tree. (Not sure about the variables? Run ?College in the console after loading the ISLR package.)
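Before modeling, it can help to skim the available columns. As a quick sketch (not part of the original exercise), dplyr's glimpse(), which is attached along with tidymodels, shows each variable and its type:

# Quick look at the variables in the College data
glimpse(College)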
I am first going to split the data set into a training and a testing set. The training set will have 389 observations and the testing set 388.
# Split the College data in half, then create 5 cross-validation folds from the training set
college_split <- initial_split(College, prop = 0.5)
college_train <- training(college_split)
college_cv <- vfold_cv(college_train, v = 5)
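If you want to sanity-check the split, a small sketch using nrow() on the objects created above confirms how many colleges landed in the training set and what the folds look like:

# Number of colleges in the training set
nrow(college_train)

# The 5 cross-validation folds
college_cv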
To predict the number of applications, I am fitting a regression tree. I will set the maximum tree depth to 30 and tune based on the cost complexity (\(\alpha\)).
# Regression tree specification: fix tree_depth at 30 and tune cost_complexity
model_spec <- decision_tree(
  mode = "regression",
  cost_complexity = tune(),
  tree_depth = 30
) %>%
  set_engine("rpart")
When choosing your model, be sure you are picking sensible predictors. For example, should you be using the number of applications accepted to predict the number of applications received? That doesn’t make much sense!
# Candidate cost complexity values from 0 to 0.1
grid <- expand_grid(cost_complexity = seq(0, 0.1, by = .01))

# Tune the cost complexity over the 5 cross-validation folds
model <- tune_grid(
  model_spec,
  Apps ~ Private + F.Undergrad + Outstate + Room.Board + Books + Personal + PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate,
  grid = grid,
  resamples = college_cv
)
# Pull out the cost complexity value with the lowest cross-validated RMSE
alpha <- model %>%
  select_best(metric = "rmse") %>%
  pull()
It looks like the best model has an \(\alpha\) value of 0, meaning the full tree does best.
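If you want to see the full picture rather than just the single best value, the tune helpers collect_metrics() and show_best() (loaded with tidymodels) summarize the cross-validated results for every cost complexity in the grid. This is just a sketch of how I would check that claim:

# Cross-validated RMSE for each candidate cost complexity
model %>%
  collect_metrics() %>%
  filter(.metric == "rmse")

# Top candidates ranked by RMSE
model %>%
  show_best(metric = "rmse")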
# Refit the tree on the full training set using the selected cost complexity
final_spec <- decision_tree(
  mode = "regression",
  cost_complexity = alpha,
  tree_depth = 30
) %>%
  set_engine("rpart")

final_model <- fit(
  final_spec,
  Apps ~ Private + F.Undergrad + Outstate + Room.Board + Books + Personal + PhD + S.F.Ratio + perc.alumni + Expend + Grad.Rate,
  data = college_train
)
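Before heading to the test set, you could also peek at which predictors the tree relies on. This sketch assumes the underlying rpart object (stored in final_model$fit) carries importance scores in its variable.importance element:

# Variable importance from the underlying rpart fit
final_model$fit$variable.importance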
# Evaluate on the held-out test set
college_test <- testing(college_split)

final_model %>%
  predict(new_data = college_test) %>%
  bind_cols(college_test) %>%
  metrics(truth = Apps, estimate = .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2373.
## 2 rsq standard 0.667
## 3 mae standard 1110.
In the held-out test sample this model has an RMSE of 2373 and an R squared of 0.667.
We can plot our final tree using rpart.plot.
rpart.plot(final_model$fit, roundint = FALSE)