class: center, middle, inverse, title-slide

# Decision trees - Regression tree building

### Dr. D’Agostino McGowan

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan <i>adapted from slides by Hastie & Tibshirani</i>
</span>
</div>

---

## The tree building process

* Divide the predictor space (the set of possible values for `\(X_1, X_2, \dots, X_p\)`) into `\(J\)` distinct, non-overlapping regions, `\(R_1, R_2, \dots, R_J\)`

--

* For every observation that falls into the region `\(R_j\)`, we make the same prediction: the **mean response value** for the training observations in `\(R_j\)`

---

## The tree building process

* The regions could have any shape, but we choose to divide the predictor space into high-dimensional **boxes** for simplicity and ease of interpretation

--

* The goal is to find boxes, `\(R_1, \dots, R_J\)`, that minimize the RSS, given by

`$$\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2$$`

where `\(\hat{y}_{R_j}\)` is the mean response for the training observations within the `\(j\)`th box (a short code sketch of this computation appears on a later slide)

---

## The tree building process

* It is often computationally infeasible to consider every possible partition of the feature space into `\(J\)` boxes

--

* Therefore, we take a **top-down, greedy** approach known as **recursive binary splitting**

--

* This is **top-down** because it begins at the top of the tree and then successively splits the predictor space, two branches at a time

--

* It is **greedy** because at each step the **best** split is made at that step (instead of looking ahead and picking a split that may result in a better tree in a future step)

---

## The tree building process

* First, select the predictor `\(X_j\)` and the cutpoint `\(s\)` such that splitting the predictor space into `\(\{X|X_j < s\}\)` and `\(\{X|X_j \geq s\}\)` leads to the _greatest possible reduction in RSS_ (sketched in code on a later slide)

--

* We repeat this process, looking for the best predictor and cutpoint to split the data within each of the resulting regions

* Now, instead of splitting the _entire_ predictor space, we split one of the two previously identified regions, leaving us with _three_ regions

--

* Then we look to split one of these three regions so as to minimize the RSS

--

* This process continues until a stopping criterion is met 🛑 e.g., we could stop once we have created a fixed number of regions, or keep going until no region contains more than 5 observations, etc. (see the `rpart` example on a later slide)

---
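## Computing the RSS of a partition

A minimal R sketch of the quantity on the previous slides, assuming a toy response `y` and an illustrative `region` vector that assigns each training observation to one of `\(J = 3\)` boxes (both names are made up for this example); the prediction in each box is that box's mean response.

```r
# Toy data: illustrative only, not from the slides
set.seed(1)
y      <- rnorm(100)                        # response
region <- sample(1:3, 100, replace = TRUE)  # assignment to J = 3 boxes

# The prediction for each observation is the mean response of the
# training observations in its box; ave() returns that group mean
y_hat <- ave(y, region)

# RSS: sum over boxes of squared deviations from the box mean
sum((y - y_hat)^2)
```

---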
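## One greedy step in code

A sketch of a single recursive binary splitting step, under these assumptions: `x` is a data frame of numeric predictors, `y` is the response, and every observed value is tried as a cutpoint (all names are illustrative). We scan every predictor `\(X_j\)` and cutpoint `\(s\)` and keep the pair with the lowest resulting RSS.

```r
best_split <- function(x, y) {
  best <- list(rss = Inf)
  for (j in names(x)) {
    for (s in sort(unique(x[[j]]))) {
      left  <- y[x[[j]] <  s]
      right <- y[x[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      # RSS if we split on X_j at cutpoint s, predicting each
      # side's mean response
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) {
        best <- list(predictor = j, cutpoint = s, rss = rss)
      }
    }
  }
  best
}

best_split(mtcars[, c("wt", "hp")], mtcars$mpg)
```

Recursive binary splitting would then call `best_split()` again within each of the two resulting regions, until a stopping criterion is met.

---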
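## Fitting a regression tree in R

In practice we rarely code the recursion by hand; for example, the `rpart` package implements recursive binary splitting for a numeric response. A sketch on the built-in `mtcars` data, with illustrative (not recommended) stopping rules:

```r
library(rpart)

# method = "anova" requests RSS-based splits for a numeric response;
# minsplit = 6 loosely mirrors the slide's rule that regions with 5 or
# fewer observations are not split further, and cp = 0 turns off the
# complexity penalty so the size rules alone stop the splitting
fit <- rpart(
  mpg ~ ., data = mtcars, method = "anova",
  control = rpart.control(minsplit = 6, cp = 0)
)
fit
```

---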
class: inverse

05:00
## <i class="fas fa-edit"></i> `Draw a partition`

Draw an example of a partition of a two-dimensional feature space that could result from recursive binary splitting with six regions. Label your figure with the regions, `\(R_1, \dots, R_6\)`, as well as the cutpoints `\(t_1, t_2, \dots\)`. Draw a decision tree corresponding to this partition.

---