ISLR notes and exercises from An Introduction to Statistical Learning

8. Tree-based Methods

The Basics of Decision Trees

Regression Trees

Overview
  • There are two main steps:
    • Partition the predictor space $\mathbb{R}^p$ into regions $R_1, \dots, R_M$.
    • For all $X = (X_1, \dots, X_p) \in R_m$, predict the average of the responses in $R_m$: $\hat{f}(X) = \hat{y}_{R_m} := \frac{1}{N_m}\sum_{i:\, x_i\in R_m} y_i$, where $N_m = |\{i\ |\ x_i \in R_m\}|$.
  • In practice, we take the regions of the partition to be rectangular for simplicity and ease of interpretation.
  • We choose the partition to minimize the RSS $\sum_{m = 1}^M \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2$.
  • We search the space of partitions using a recursive binary splitting 1 strategy.
Algorithm: Recursive Binary Splitting for Regression Trees
  1. Start with the top node $\mathbb{R}^p$
  2. While a stopping criterion is unsatisfied:
    1. Let

    $$(\hat{i}, \hat{j}) = \underset{(i, j)}{\text{argmin}}\left( \sum_{i:\, x_i\in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i\in R_2} (y_i - \hat{y}_{R_2})^2\right)$$

    where

    $$R_{1} = \{X \mid X_j < x_{i,j}\}, \qquad R_{2} = \{X \mid X_j \geqslant x_{i,j}\}$$

    2. Add the nodes

      $$\hat{R}_{1} = \{X \mid X_{\hat{j}} < x_{\hat{i},\hat{j}}\}, \qquad \hat{R}_{2} = \{X \mid X_{\hat{j}} \geqslant x_{\hat{i},\hat{j}}\}$$

      to the partition, and recurse on one of the new nodes (a code sketch follows below)
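
A minimal sketch of this algorithm in Python/NumPy follows. The data, the `Node`/`grow_tree`/`predict` names, and the stopping rule (a minimum node size) are illustrative assumptions, not from the text.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    # Internal nodes store a split (feature j, threshold t); leaves store a prediction.
    j: Optional[int] = None
    t: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None


def rss(y):
    # Residual sum of squares around the region mean.
    return float(np.sum((y - y.mean()) ** 2))


def grow_tree(X, y, min_obs=5):
    # Stopping criterion: too few observations to split further.
    if y.size < min_obs:
        return Node(prediction=float(y.mean()))
    best = None  # (total RSS, feature j, threshold t)
    for j in range(X.shape[1]):              # greedy search over all (i, j) splits
        for t in np.unique(X[:, j]):
            left = X[:, j] < t
            if left.sum() == 0 or left.sum() == y.size:
                continue
            total = rss(y[left]) + rss(y[~left])
            if best is None or total < best[0]:
                best = (total, j, t)
    if best is None:
        return Node(prediction=float(y.mean()))
    _, j, t = best
    left = X[:, j] < t
    return Node(j=j, t=t,
                left=grow_tree(X[left], y[left], min_obs),
                right=grow_tree(X[~left], y[~left], min_obs))


def predict(node, x):
    # Walk down the tree until a leaf (a region R_m) is reached.
    while node.prediction is None:
        node = node.left if x[node.j] < node.t else node.right
    return node.prediction


# Toy usage: the tree recovers a step function of the first predictor.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, 200)
tree = grow_tree(X, y)
print(predict(tree, np.array([2.0, 7.0])))  # close to 1.0
```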

Tree-pruning
  • Complex trees can overfit; a simpler tree may avoid this 2
  • To get a simpler tree, we can grow a large tree $T_0$ and prune it to obtain a subtree.
  • Cost complexity or weakest link pruning is a method for finding an optimal subtree 3. For $\alpha > 0$, we obtain a subtree $T_\alpha = \underset{T\, \subset\, T_0}{\text{argmin}} \left(\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \left(y_i - \hat{y}_{R_m}\right)^2 + \alpha|T|\right)$ where $|T|$ is the number of terminal nodes of $T$, $R_m$ is the rectangle corresponding to the $m$-th terminal node, and $\hat{y}_{R_m}$ is the predicted response (the average of the $y_i$ in $R_m$) 4.
  1. For each $\alpha > 0$ 5:

    1. For $k = 1, \dots, K$:
      1. Let $\mathcal{D}_k = \mathcal{D} \setminus \{k\text{-th fold}\}$
      2. Use recursive binary splitting on $\mathcal{D}_k$ to grow a tree $T_{k}$, stopping when each node has fewer than some minimum number of observations $M$ 6
      3. Use weakest link pruning to find a subtree $T_{k, \alpha}$
    2. Let $CV_{(k)}(\alpha)$ be the $K$-fold cross-validation estimate of the mean squared test error
  2. Choose $\hat{\alpha} = \underset{\alpha}{\text{argmin}}\ CV_{(k)}(\alpha)$
  3. Return $\hat{T} = T_{\hat{\alpha}}$
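
One convenient way to carry this out in practice is scikit-learn's built-in minimal cost-complexity pruning (`ccp_alpha` is its name for $\alpha$). The sketch below, with an illustrative dataset and CV settings, computes the finite sequence of effective $\alpha$ values, estimates test MSE for each by $K$-fold cross-validation, and refits the pruned tree at the best $\alpha$.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree T_0 and get the effective alphas at which branches are
# pruned (the finite sequence alpha_1, ..., alpha_N mentioned in footnote 5).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate test MSE by K-fold CV for each alpha, then pick the best one.
cv_mse = [
    -cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]

# Refit on the full dataset and prune with the chosen alpha.
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, pruned_tree.get_n_leaves())
```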

Classification Trees

  • Classification trees are very similar to regression trees, but they predict qualitative responses. The predicted class for an observation $(x_i, y_i)$ with $x_i \in R_m$ is 7 $\hat{k}_m = \underset{k}{\text{argmax}}\ \hat{p}_{m,k}$, where $\hat{p}_{m,k}$ is the fraction of observations $(x_i, y_i)$ in the region $R_m$ such that $y_i = k$.
  • One performance measure for the region $R_m$ is the classification error rate 8 $E_m = 1 - \hat{p}_{m, \hat{k}_m}$.
  • A better performance measure is the Gini index for the region $R_m$, a measure of total variance 9 across the classes: $G_m = \sum_{k = 1}^K \hat{p}_{m,k}(1 - \hat{p}_{m,k})$.
  • Another better performance measure is the entropy for the region $R_m$ 10: $D_m = -\sum_{k = 1}^K \hat{p}_{m,k}\log(\hat{p}_{m,k})$.
  • Typically the Gini index or entropy is used to evaluate splits when growing the tree, due to their sensitivity to node purity. If prediction accuracy of the final pruned tree is the goal, however, the classification error rate is preferable when pruning. A sketch computing the three measures follows below.
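
A small sketch of these three node-level measures; the function name and toy labels are my own, not from the text.

```python
import numpy as np

def node_measures(labels):
    """Classification error rate, Gini index, and entropy for one region R_m."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()                 # the p_hat_{m,k} (empty classes drop out)
    error = 1.0 - p.max()                     # E_m
    gini = float(np.sum(p * (1.0 - p)))       # G_m
    entropy = float(-np.sum(p * np.log(p)))   # D_m
    return error, gini, entropy

print(node_measures(["A", "A", "A", "B"]))   # fairly pure node: small values
print(node_measures(["A", "B", "A", "B"]))   # impure node: all three measures larger
```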

Trees Versus Linear Models

  • A linear regression model is of the form

    $$f(X) = \beta_0 + \sum_{j = 1}^p \beta_j X_j$$

    while a regression tree model is of the form

    $$f(X) = \sum_{m = 1}^M c_m I(X \in R_m)$$

  • Linear regression will tend to perform better if the relationship between features and response is well-approximated by a linear function, whereas the regression tree will tend to perform better if the relationship is non-linear or complex, as the toy comparison below illustrates.
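
A quick illustration of that claim on assumed toy data: a linear model wins when the truth is linear, a shallow tree wins when the truth is a step function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y_linear = 2.0 * X[:, 0] + rng.normal(0, 0.5, 400)                    # linear truth
y_step = np.where(X[:, 0] < 0, -1.0, 1.0) + rng.normal(0, 0.5, 400)   # non-linear truth

for name, y in [("linear truth", y_linear), ("step truth", y_step)]:
    for model in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(f"{name:12s} {type(model).__name__:22s} CV MSE = {mse:.3f}")
```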

Advantages and Disadvantages of Trees

Advantages
  • Conceptual simplicity
  • May mirror human decision-making better than previous regression and classification methods
  • Readily visualizable and easily interpreted
  • Can handle qualitative predictors without the need for dummy variables
Disadvantages
  • Less accurate prediction than previous regression and classification methods
  • Not robust to changes in data – small changes in the data can lead to large changes in the estimated tree.

Bagging, Random Forests, Boosting

These are methods for improving the prediction accuracy of decision trees.

Bagging

  • The decision trees in § 8.1 suffer from high variance.
  • Bagging is a method of reducing the variance of a statistical learning process 11. The bagging estimate of the target function of the process with dataset $\mathcal{D}$ is

    $$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{*b}(x)$$

    where $\hat{f}^{*b}(x)$ is the estimate of the target function on the bootstrap dataset $\mathcal{D}_b$ 12.

  • Bagging can be used for any statistical learning method but it is particularly useful for decision trees 13
Out-of-bag Error Estimation
  • On average, a bagged tree uses about 2/3 of the data – the remaining 1/3 is the out-of-bag (OOB) data.
  • We can predict the response for each observation using the trees for which it was OOB, yielding about B/3 predictions per observation.
  • If we’re doing regression, we can average these predicted responses, or if we’re doing classification, take a majority vote, to get a single OOB prediction for each observation.
  • Test error can be estimated using these predictions.
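
A sketch of bagging regression trees with OOB error estimation, using scikit-learn's `BaggingRegressor`; the data and settings are illustrative, and a recent scikit-learn (≥ 1.2, where the base learner is passed as `estimator`) is assumed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # deep, unpruned trees
    n_estimators=200,                   # B bagged trees
    oob_score=True,                     # average the OOB predictions per observation
    random_state=0,
).fit(X, y)

# oob_score_ is the R^2 of the averaged OOB predictions -- an estimate of
# test performance that needs no separate validation set.
print(bag.oob_score_)
```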
Variable Importance Measures
  • Bagging typically results in improved prediction accuracy over single trees, at the expense of interpretability
  • The RSS (for bagging regression trees) and Gini index (for bagging classification trees) can provide measures of variable importance.
  • For both loss functions (RSS/Gini), we record the amount by which the loss decreases due to splits over a given predictor, averaged over the B bagged trees. The greater the decrease, the more important the predictor (see the sketch below).
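
Continuing the bagging sketch above, a simple importance measure averages each tree's per-predictor impurity decrease over the $B$ trees; here I use scikit-learn's normalized `feature_importances_`, a close relative of the raw RSS/Gini decrease described in the text.

```python
import numpy as np

# bag.estimators_ holds the B fitted trees; feature_importances_ is each
# tree's (normalized) total impurity decrease attributed to each predictor.
importances = np.mean(
    [tree.feature_importances_ for tree in bag.estimators_], axis=0
)
for j in np.argsort(importances)[::-1][:5]:
    print(f"X{j}: {importances[j]:.3f}")
```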

Random Forests

  • Random forests work as follows: at each split in the tree, the split predictor is chosen from a new random sample of $m$ of the $p$ predictors, where $1 \leqslant m \leqslant p$.
  • The random predictor sampling overcomes the tendency of bagged trees to look similar given strong predictors (e.g. the strongest predictor will be at the top of most or all of the bagged trees).
  • On average, $\frac{p-m}{p}$ of the splits will not consider a given predictor, giving other predictors a chance to be chosen. This decorrelation of the trees improves the reduction in variance achieved by bagging.
  • $m = p$ corresponds to bagging. $m \ll p$ is useful when there is a large number of correlated predictors. Typically we choose $m \approx \sqrt{p}$, as in the sketch below.
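
A sketch of a random forest with $m \approx \sqrt{p}$ using scikit-learn; the data and settings are illustrative, not from the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=16, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",   # sample m ~ sqrt(p) candidate predictors at each split
    oob_score=True,        # OOB error estimation works exactly as with bagging
    random_state=0,
).fit(X, y)

print(forest.oob_score_)             # OOB R^2
print(forest.feature_importances_)   # variable importances, averaged over trees
```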

Boosting

  • Boosting is another method of improving prediction accuracy that can be applied to many statistical learning methods.
  • For decision trees, each tree is built using information from the previously grown trees. Instead of bootstrapped datasets, the datasets are modified based on the previously grown trees.
  • The boosting approach learns slowly, improving gradually in the areas where it underperforms. It has 3 parameters:
    • The number of trees $B$. Unlike bagging and random forests, boosting can overfit if $B$ is too big, although this happens slowly.
    • The shrinkage parameter $\lambda > 0$. Typical values are $\lambda = 0.01$ or $0.001$. Very small $\lambda$ can require very large $B$ to get good performance.
    • The number of tree splits $d$, which controls the complexity of the boosted ensemble. Often $d = 1$ works well (the resulting tree is called a stump) 14
Algorithm: Boosting for Regression Trees
  1. Set $\hat{f}(x) = 0$ and $r_i = y_i$ for $1 \leqslant i \leqslant n$
  2. For $b = 1, 2, \dots, B$:
    1. Fit a tree $\hat{f}^b$ with $d$ splits to $(X, r)$
    2. Update the model $\hat{f}$ by adding a shrunk version of the new tree: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$
    3. Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$
  3. Output the boosted model

    $$\hat{f}(x) = \sum_{b = 1}^B \lambda \hat{f}^b(x)$$
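
The algorithm translates almost line for line into code. Here is a minimal sketch using scikit-learn regression trees as the base learners; the dataset and the values of $B$, $\lambda$, and $d$ are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

B, lam, d = 1000, 0.01, 1          # number of trees, shrinkage, splits per tree
trees = []
r = y.copy()                       # step 1: f_hat = 0, so residuals r_i = y_i

for b in range(B):                 # step 2: for b = 1, ..., B
    # 2a. fit a small tree with d splits (d + 1 leaves) to (X, r)
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=0).fit(X, r)
    trees.append(tree)
    # 2b/2c. add a shrunk version of the tree to the model; update the residuals
    r -= lam * tree.predict(X)

def boosted_predict(X_new):
    # step 3: f_hat(x) = sum_b lambda * f_hat^b(x)
    return lam * np.sum([t.predict(X_new) for t in trees], axis=0)

print(np.mean((boosted_predict(X) - y) ** 2))   # training MSE
```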


Footnotes

  1. This strategy results in a binary tree with the partition regions as leaves, and binary splits as nodes. It is “top-down” because it starts at the top of the partition tree (with a single region), “binary” because it splits the predictor space into two regions at each node in the tree, “recursive” because it calls itself at each node, and “greedy” because at each node it chooses the split that is best at that node, rather than looking ahead to splits that may lead to a better tree further down.

  2. That is, it may lead to lower variance and better interpretation at the cost of a higher bias 

  3. We want a subtree with minimal estimated test error but it’s infeasible to compute this for all subtrees. 

  4. This is the RSS for the partition given by the terminal nodes of the tree $T$, with a weighted penalty $\alpha|T|$ for the number of terminal nodes (hence the complexity).

  5. Even though $\alpha \in [0, \infty)$ is a continuous parameter here, in practice it will be selected from a finite set of values. In fact (cf. the comment on pg. 309 of the text), as $\alpha$ increases, “branches get pruned in a nested and predictable fashion”, resulting in a sequence of subtrees as a function of $\alpha$. One can then find a sequence $\alpha_1, \dots, \alpha_N$ such that at each $\alpha_i$ a branch is removed, and since the tree is finite, the algorithm is guaranteed to terminate.

  6. The smallest possible minimum number of observations per node is $M = 1$, which results in a partition with only one point in each region. This is clearly a maximal-complexity tree, so in practice we probably take $M \gg 1$.

  7. That is, the predicted class for observations in $R_m$ is the most frequently occurring class in $R_m$.

  8. The classification error rate isn’t sufficiently sensitive to “node purity”, that is, the degree to which a node contains observations from a single class.

  9. The Gini index is a measure of “node purity” – it is minimized when all $\hat{p}_{m, k} \in \{0, 1\}$, that is, when all nodes contain observations from a single class.

  10. The $\hat{p}_{m,k}$ are the empirical pmf estimates of the conditional probabilities $p_{m, k} = P(Y = k\ |\ X \in R_m)$, so $D$ is an estimate of the conditional entropy, i.e. the entropy of $Y\ |\ X \in R_m$. Thus $D$ is a measure of the information that the empirical pmf, and hence the corresponding tree, provides, that is, of its average surprisal. As with the Gini index, $D$ is minimized when all $\hat{p}_{m, k} \in \{0, 1\}$. An average surprisal of zero means the tree provides all the information, that is, it perfectly separates the classes.

  11. Bagging is short for bootstrap aggregation – it applies the bootstrap to a statistical learning process and aggregates the results. The bootstrap is usually discussed in the context of estimating the standard error of a statistic, while bagging uses the same resampling idea in the context of a statistical learning process; the underlying mechanism (repeated sampling with replacement) is the same.

  12. Really this is the bootstrap estimate of the average of the target function estimate over many datasets. For a given dataset $\mathcal{D}$, the function $\hat{f}(x)$ produced by the learning process is an estimate of the target function $f(x)$. Repeating the process over $1 \leqslant b \leqslant B$ datasets $\mathcal{D}_b$, we get estimates $\hat{f}^{b}(x)$. Assuming these are iid, they have common variance $\sigma^2$, but their average $\hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b = 1}^B \hat{f}^{b}(x)$ has variance $\frac{\sigma^2}{B}$. Given $B$ large enough, this variance is low. Bagging/bootstrapping gets around the lack of separate datasets $\mathcal{D}_b$ in practice by repeatedly sampling with replacement from a single dataset $\mathcal{D}$.

  13. For regression, one grows $B$ deep (unpruned) regression trees on $B$ bootstrapped datasets, each of which has low bias but high variance, then averages them to get a bootstrap estimate which has the same low bias, but much lower variance. For classification (since we can’t average over the classes of the bootstrapped trees) a simple approach is to predict the majority class over the bootstrapped trees.

  14. In the case of a stump, the boosted ensemble is fitting an additive model, since each term is a single variable. More generally, $d$ is the interaction depth – since $d$ splits can involve at most $d$ variables, this controls the interaction order of the boosted model (i.e. the model can fit interaction terms up to degree $d$).