class: center, middle, inverse, title-slide

# Decision trees - Intro + Regression trees
### Dr. D’Agostino McGowan

---
layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan <i>adapted from slides by Hastie & Tibshirani</i>
</span>
</div>

---

## Decision trees

* Can be applied to **regression** problems

--

* Can be applied to **classification** problems

.question[
What is the difference?
]

---
class: center, middle

## Regression trees

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

--

.question[
How would you stratify this?
]

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

## Let's walk through the figure

* This uses the `Hitters` data from the `ISLR` 📦

--

* I fit a **regression tree** predicting the salary of a baseball player from:
  * Number of years they played in the major leagues
  * Number of hits they made in the previous year

--

* At each **node**, the label (e.g., `\(X_j < t_k\)`) describes the _left_ branch coming from that split. The _right_ branch is the opposite, e.g. `\(X_j \geq t_k\)`.

--

* For example, the first **internal node** indicates that players on the left branch have fewer than 4.5 years in the major leagues, while those on the right have `\(\geq\)` 4.5 years.

--

* The number at the _top_ of each **node** is the predicted Salary. For example, before doing _any_ splitting, the average Salary for the whole dataset is 536 thousand dollars.

--

* This tree has **two internal nodes** and **three terminal nodes**

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-6-1.png)<!-- -->

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

---

## Decision tree - Baseball Salary Example

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-8-1.png)<!-- -->

---

## Terminology

🎋 The final regions, `\(R_1, R_2, R_3\)`, are called **terminal nodes**

--

🎄 You can think of the trees as _upside down_; the **leaves** are at the bottom

--

🎋 The splits are called **internal nodes**

---

## Interpretation of results

* `Years` is the most important factor in determining `Salary`; players with less experience earn lower salaries

--

* Given that a player is less experienced, the number of `Hits` seems to play little role in `Salary`

--

* Among players who have been in the major leagues for 4.5 years or more, the number of `Hits` made in the previous year **does** affect `Salary`; players with more `Hits` tend to have higher salaries

--

* This is probably an oversimplification, but see how easy it is to interpret!

---
class: inverse

## <i class="fas fa-edit"></i> `Interpreting decision trees`

* How many internal nodes does this plot have? How many terminal nodes?
* What is the average Salary for players who have more than 6.5 years in the major leagues but fewer than 118 Hits? What % of the dataset falls in this category?

![](16-decision-trees-reg_files/figure-html/unnamed-chunk-9-1.png)<!-- -->
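
---

## Sketch: where do the node numbers come from?

A minimal sketch in Python (not the R workflow that produced the figures in these slides), using made-up numbers rather than the `Hitters` data. It assumes a three-region tree shaped like the one walked through earlier: the `Years < 4.5` split comes from the slides, while `HITS_CUT`, the helper names, and every data value are hypothetical. It illustrates that the prediction in a node is the mean Salary of the observations in that node, and that a region's share of the dataset is its count divided by the total.

```python
from statistics import mean

# Hypothetical toy data, NOT the Hitters data:
# (years in the majors, hits last year, salary in thousands of dollars)
players = [
    (2, 90, 150.0),
    (3, 140, 250.0),
    (4, 60, 200.0),
    (5, 100, 400.0),
    (8, 110, 500.0),
    (6, 150, 900.0),
    (10, 170, 1100.0),
]

YEARS_CUT = 4.5   # first split, as described in the slides
HITS_CUT = 120.0  # hypothetical second split; chosen only for illustration


def region(years, hits):
    """Walk the tree: left branch when the condition X_j < t_k holds."""
    if years < YEARS_CUT:   # internal node 1
        return "R1"
    if hits < HITS_CUT:     # internal node 2
        return "R2"
    return "R3"


def fit_regions(rows):
    """Return the mean salary and the share of rows for each terminal node."""
    groups = {}
    for years, hits, salary in rows:
        groups.setdefault(region(years, hits), []).append(salary)
    means = {name: mean(vals) for name, vals in groups.items()}
    shares = {name: len(vals) / len(rows) for name, vals in groups.items()}
    return means, shares


def predict(years, hits, region_means):
    """A new player gets the mean salary of the region they fall into."""
    return region_means[region(years, hits)]


# Before any splitting, the prediction is the overall mean.
root_mean = mean([salary for _, _, salary in players])
region_means, region_share = fit_regions(players)

print(root_mean)      # 500.0
print(region_means)   # {'R1': 200.0, 'R2': 450.0, 'R3': 1000.0}
print(region_share)
print(predict(7, 130, region_means))  # 1000.0
```

Two `if` statements correspond to two internal nodes, and the three return values correspond to the three terminal nodes `\(R_1, R_2, R_3\)`.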