Partition the predictor space $\mathbb{R}^p$ into regions $R_1, \dots, R_M$.
For all $X = (X_1, \dots, X_p) \in R_m$, predict the average of the responses of the training observations in $R_m$:

$$\hat{f}(X) = \hat{y}_{R_m} := \frac{1}{N_m} \sum_{i:\, x_i \in R_m} y_i$$

where $N_m = |\{i \mid x_i \in R_m\}|$.
In practice, we take the regions of the partition to be rectangular for simplicity and ease of interpretation.
We choose the partition to minimize the RSS
$$\sum_{m=1}^{M} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2$$
We search the space of partitions using a recursive binary splitting[^1] strategy.
Algorithm: Recursive Binary Splitting for Regression Trees
Start with the top node $\mathbb{R}^p$
While a stopping criterion is unsatisfied:
Let

$$(\hat{i}, \hat{j}) = \underset{(i,j)}{\operatorname{argmin}} \left( \sum_{i:\, x_i \in R_1} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2} (y_i - \hat{y}_{R_2})^2 \right)$$

where

$$R_1 = \{X \mid X_j < x_{i,j}\} \qquad R_2 = \{X \mid X_j \geqslant x_{i,j}\}$$

Add the nodes

$$\hat{R}_1 = \{X \mid X_{\hat{j}} < x_{\hat{i},\hat{j}}\} \qquad \hat{R}_2 = \{X \mid X_{\hat{j}} \geqslant x_{\hat{i},\hat{j}}\}$$
to the partition, and recurse on each of the new nodes
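As a concrete illustration, here is a minimal sketch of greedy recursive binary splitting in Python/NumPy. The helper names (`best_split`, `grow_tree`, `min_leaf`) are illustrative, not from any particular library, and the stopping criterion is simply a minimum leaf size.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (feature j, threshold s) minimizing the RSS."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s

def grow_tree(X, y, min_leaf=5):
    """Recursively split until a region has fewer than min_leaf observations."""
    if len(y) < min_leaf:
        return {"predict": y.mean()}              # leaf: predict the region mean
    j, s = best_split(X, y)
    if j is None:                                 # no valid split (e.g. constant features)
        return {"predict": y.mean()}
    mask = X[:, j] < s
    return {"feature": j, "threshold": s,
            "left": grow_tree(X[mask], y[mask], min_leaf),
            "right": grow_tree(X[~mask], y[~mask], min_leaf)}
```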
Tree-pruning
Complex trees can overfit, but simpler trees may avoid it.[^2]
To get a simpler tree, we can grow a large tree $T_0$ and prune it to obtain a subtree.
Cost complexity pruning, or weakest link pruning, is a method for finding an optimal subtree.[^3] For $\alpha > 0$, we obtain a subtree
$$T_\alpha = \underset{T \subset T_0}{\operatorname{argmin}} \left( \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \right)$$
where $|T|$ is the number of terminal nodes of $T$, $R_m$ is the rectangle corresponding to the $m$-th terminal node, and $\hat{y}_{R_m}$ is the predicted response (the average of the $y_i$ with $x_i \in R_m$).[^4]
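In scikit-learn, cost complexity pruning is exposed through the `ccp_alpha` parameter of the tree estimators; a brief sketch, where `X_train` and `y_train` are placeholder training data:

```python
from sklearn.tree import DecisionTreeRegressor

# Grow the large tree T_0.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# The pruning path gives the effective alphas at which branches are removed
# (the nested sequence of subtrees as a function of alpha).
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)

# Refitting with a positive ccp_alpha yields the pruned subtree T_alpha.
# The value 0.01 is arbitrary here; alpha is normally chosen by cross-validation.
pruned_tree = DecisionTreeRegressor(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
```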
Algorithm: Weakest link regression tree with K-fold cross-validation[^5]
Use recursive binary splitting to grow a tree $T_k$ on all but the $k$-th fold, stopping when each terminal node has fewer than some minimum number $M$ of observations[^6]
Use weakest link pruning to find the subtrees $T_{k,\alpha}$ for each $\alpha$
Let $\mathrm{CV}_{(k)}(\alpha)$ be the $K$-fold cross-validation estimate of the mean squared test error
Choose

$$\hat{\alpha} = \underset{\alpha}{\operatorname{argmin}}\ \mathrm{CV}_{(k)}(\alpha)$$

Return $\hat{T} = T_{\hat{\alpha}}$
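A sketch of the cross-validation step with scikit-learn, searching over the alphas on the pruning path (`X_train`, `y_train` are placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Candidate alphas: the effective alphas along the pruning path of the full tree.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# K-fold cross-validation estimate of the test MSE for each alpha.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,                                  # K = 5
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_         # corresponds to T_hat = T_{alpha_hat}
```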
Classification Trees
Classification trees are very similar to regression trees, but they predict qualitative responses. The predicted class for an observation $(x_i, y_i)$ with $x_i \in R_m$ is[^7]

$$\hat{k}_m = \underset{k}{\operatorname{argmax}}\ \hat{p}_{m,k}$$

where $\hat{p}_{m,k}$ is the fraction of observations $(x_i, y_i)$ in the region $R_m$ such that $y_i = k$.
One performance measure is the classification error rate[^8] for the region $R_m$:

$$E_m = 1 - \hat{p}_{m,\hat{k}_m}$$
A better performance measure is the Gini index for the region $R_m$, a measure of total variance[^9] across the classes:

$$G_m = \sum_{k=1}^{K} \hat{p}_{m,k}(1 - \hat{p}_{m,k})$$

Another good performance measure is the entropy[^10] for the region $R_m$:

$$D_m = -\sum_{k=1}^{K} \hat{p}_{m,k} \log(\hat{p}_{m,k})$$
Typically the Gini index or the entropy is used to evaluate splits when growing the tree, due to their sensitivity to node purity. Any of the three measures can be used when pruning, but the classification error rate is preferable if prediction accuracy of the final tree is the goal.
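The three measures are easy to compute from the class proportions of a single region; a small illustrative helper (not a library function):

```python
import numpy as np

def node_measures(p_hat):
    """p_hat: array of class proportions in one region R_m, summing to 1."""
    p_hat = np.asarray(p_hat, dtype=float)
    error = 1.0 - p_hat.max()                                        # E_m
    gini = np.sum(p_hat * (1.0 - p_hat))                             # G_m
    entropy = -np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0]))   # D_m
    return error, gini, entropy

# A nearly pure node scores low on all three measures:
print(node_measures([0.95, 0.05]))   # approximately (0.05, 0.095, 0.199)
```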
Trees Versus Linear Models
A linear regression model is of the form
$$f(X) = \beta_0 + \sum_{j=1}^{p} \beta_j X_j$$
while a regression tree model is of the form
$$f(X) = \sum_{m=1}^{M} c_m \, I(X \in R_m)$$
Linear regression will tend to perform better if the relationship between the features and the response is well-approximated by a linear function, whereas the regression tree will tend to perform better if the relationship is non-linear or complex.
Advantages and Disadvantages of Trees
Advantages
Conceptual simplicity
May mirror human decision-making better than previous regression and classification methods
Readily visualizable and easily interpreted
Can handle qualitative predictors without the need for dummy variables
Disadvantages
Less accurate prediction than previous regression and classification methods
Non-robust to changes in the data: small changes in the data can lead to large changes in the estimated tree.
Bagging, Random Forests, Boosting
These are methods for improving the prediction accuracy of decision trees.
Bagging
The decision trees in § 8.1 suffer from high variance.
Bagging is a method of reducing the variance of a statistical learning process.[^11] The bagging estimate of the target function of the process with dataset $D$ is

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$
where $\hat{f}^{*b}(x)$ is the estimate of the target function on the bootstrap dataset $D_b$.[^12]
Bagging can be used for any statistical learning method, but it is particularly useful for decision trees.[^13]
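A sketch with scikit-learn, whose `BaggingRegressor` bags decision trees by default (`X_train`, `y_train`, `X_test` are placeholders):

```python
from sklearn.ensemble import BaggingRegressor

# B = 500 trees, each grown on a bootstrap sample of the training data.
bagged_trees = BaggingRegressor(n_estimators=500, random_state=0)
bagged_trees.fit(X_train, y_train)

# Predictions are the average of the 500 trees' predictions.
y_pred = bagged_trees.predict(X_test)
```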
Out-of-bag Error Estimation
On average, a bagged tree uses about 2/3 of the data – the remaining 1/3 is the out-of-bag (OOB) data.
We can predict the response for each observation using the trees for which it was OOB, yielding about $B/3$ predictions per observation.
If we’re doing regression, we can average these predicted responses, or if we’re doing classification, take a majority vote, to get a single OOB prediction for each observation.
Test error can be estimated using these predictions.
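With scikit-learn this only requires `oob_score=True`; a sketch (placeholder data as above):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

bagged_trees = BaggingRegressor(n_estimators=500, oob_score=True, random_state=0)
bagged_trees.fit(X_train, y_train)

# Each training observation is predicted using only the trees for which it was OOB.
oob_mse = np.mean((y_train - bagged_trees.oob_prediction_) ** 2)  # OOB estimate of test MSE
print(oob_mse, bagged_trees.oob_score_)   # oob_score_ is the OOB R^2
```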
Variable Importance Measures
Bagging typically results in improved prediction accuracy over single trees, at the expense of interpretability.
The RSS (for bagging regression trees) and Gini index (for bagging classification trees) can provide measures of variable importance.
For both loss functions (RSS/Gini), we record the total amount by which the loss decreases due to splits over a given predictor, averaged over the $B$ bagged trees. The greater the decrease, the more important the predictor.
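A sketch using scikit-learn's impurity-based importances; with `max_features=None` a random forest considers all $p$ predictors at each split, i.e. it reduces to bagged trees (`feature_names` is a placeholder list of column names):

```python
from sklearn.ensemble import RandomForestRegressor

bagged = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)
bagged.fit(X_train, y_train)

# feature_importances_ is the decrease in RSS due to splits on each predictor,
# averaged over the trees and normalized to sum to 1.
for name, importance in sorted(zip(feature_names, bagged.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name:>20s}  {importance:.3f}")
```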
Random Forests
Random forests work as follows: at each split in the tree, the splitting predictor is chosen from a new random sample of $m$ of the $p$ predictors, where $1 \leqslant m \leqslant p$.
The random predictor sampling overcomes the tendency of bagged trees to look similar given strong predictors (e.g. the strongest predictor will be at the top of most or all of the bagged trees).
On average, $(p - m)/p$ of the splits will not consider a given predictor, giving other predictors a chance to be chosen. This decorrelation of the trees improves the reduction in variance achieved by bagging.
$m = p$ corresponds to bagging. $m \ll p$ is useful when there is a large number of correlated predictors. Typically we choose $m \approx \sqrt{p}$.
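A random forest sketch in scikit-learn; `max_features` plays the role of $m$ (`"sqrt"` gives $m \approx \sqrt{p}$, while `None` recovers bagging):

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,
    max_features="sqrt",   # m ≈ sqrt(p) predictors sampled at each split
    random_state=0,
)
forest.fit(X_train, y_train)
```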
Boosting
Boosting is another method of improving prediction accuracy that can be applied to many statistical learning methods.
When boosting decision trees, each tree is built using information from the previously grown trees. Instead of bootstrapped datasets, each tree is fit to a modified version of the dataset based on the previously grown trees.
The boosting approach learns slowly, gradually improving in the areas where the current model underperforms. It has three tuning parameters (illustrated in the sketch after this list):
The number of trees $B$. Unlike bagging and random forests, boosting can overfit if $B$ is too large, although this happens slowly.
The shrinkage parameter $\lambda > 0$. Typical values are $\lambda = 0.01$ or $\lambda = 0.001$. Very small $\lambda$ can require very large $B$ to get good performance.
The number of tree splits $d$, which controls the complexity of the boosted ensemble. Often $d = 1$ works well (the resulting tree, a single split, is called a stump).[^14]
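These three parameters map directly onto scikit-learn's gradient boosting estimator; a sketch (note that `max_depth` is the tree depth, which equals the split count only for stumps):

```python
from sklearn.ensemble import GradientBoostingRegressor

boosted = GradientBoostingRegressor(
    n_estimators=1000,    # B: number of trees
    learning_rate=0.01,   # lambda: shrinkage parameter
    max_depth=1,          # depth-one trees, i.e. stumps (a single split each)
    random_state=0,
)
boosted.fit(X_train, y_train)
```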
Algorithm: Boosting for Regression Trees
Set $\hat{f}(x) = 0$ and $r_i = y_i$ for $1 \leqslant i \leqslant n$
For $b = 1, 2, \dots, B$:
Fit a tree $\hat{f}^b$ with $d$ splits to the data $(X, r)$
Update the model $\hat{f}$ by adding a shrunken version of the new tree:

$$\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$$

Update the residuals:

$$r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$$

Output the boosted model

$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$$
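A minimal sketch of this algorithm, using scikit-learn stumps as the base trees; the function names and default values are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression_trees(X, y, B=1000, lam=0.01, d=1):
    """Grow B shallow trees, each fit to the current residuals."""
    trees, r = [], y.astype(float).copy()                     # r_i = y_i
    for _ in range(B):
        tree = DecisionTreeRegressor(max_depth=d).fit(X, r)   # fit f^b to (X, r)
        trees.append(tree)
        r -= lam * tree.predict(X)                            # r_i <- r_i - lambda * f^b(x_i)
    return trees

def boosted_predict(trees, X, lam=0.01):
    """f_hat(x) = sum_b lambda * f^b(x)."""
    return lam * np.sum([tree.predict(X) for tree in trees], axis=0)
```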
Footnotes
[^1]: This strategy results in a binary tree with the partition regions as leaves and binary splits as internal nodes. It is "top-down" because it starts at the top of the partition tree (with a single region), "binary" because it splits the predictor space into two regions at each node in the tree, "recursive" because it calls itself at each node, and "greedy" because at each node it chooses the split that is optimal at that node.
[^2]: That is, it may lead to lower variance and better interpretation at the cost of a higher bias.
[^3]: We want a subtree with minimal estimated test error, but it is infeasible to compute this for all subtrees.
[^4]: This is the RSS for the partition given by the terminal nodes of the tree $T$, with a weighted penalty $\alpha|T|$ for the number of terminal nodes (hence the complexity).
[^5]: Even though $\alpha \in [0, \infty)$ is a continuous parameter here, in practice it will be selected from a finite set of values. In fact (cf. the comment on pg. 309 of the text), as $\alpha$ increases, "branches get pruned in a nested and predictable fashion", resulting in a sequence of subtrees as a function of $\alpha$. One can then find a sequence $\alpha_1, \dots, \alpha_N$ such that at each $\alpha_i$ a branch is removed, and since the tree is finite, the algorithm is guaranteed to terminate.
[^6]: The smallest possible number of observations per node is $M = 1$, which results in a partition with only one point in each region. This is clearly a maximal-complexity tree, so in practice we take $M \gg 1$.
[^7]: That is, the predicted class for observations in $R_m$ is the most frequently occurring class in $R_m$.
[^8]: The classification error rate isn't sufficiently sensitive to "node purity", that is, the degree to which a node contains observations from a single class.
[^9]: The Gini index is a measure of "node purity": it is minimized when all $\hat{p}_{m,k} \in \{0, 1\}$, that is, when every node contains observations from a single class.
[^10]: The $\hat{p}_{m,k}$ are the empirical pmf estimates of the conditional probabilities $p_{m,k} = P(Y = k \mid X \in R_m)$, so $D_m$ is an estimate of the conditional entropy, i.e. the entropy of $Y \mid X \in R_m$. Thus $D_m$ measures the average surprisal of the empirical pmf, and hence the information that the corresponding tree provides. As with the Gini index, $D_m$ is minimized when all $\hat{p}_{m,k} \in \{0, 1\}$. An average surprisal of zero means the tree provides all of the information, that is, it perfectly separates the classes.
[^11]: Bagging is short for "bootstrap aggregation". Bootstrapping is usually discussed in the context of estimating the standard error of a statistic, while bagging applies the same resampling idea to a statistical learning process and aggregates the resulting estimates.
[^12]: Really this is the bootstrap estimate of the average of the target function estimate over many datasets. For a given dataset $D$, the function $\hat{f}(x)$ produced by the learning process is an estimate of the target function $f(x)$. Repeating the process over datasets $D_b$, $1 \leqslant b \leqslant B$, we get estimates $\hat{f}^b(x)$. Assuming these are iid, they have common variance $\sigma^2$, but their average $\hat{f}_{\mathrm{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^b(x)$ has variance $\sigma^2 / B$. Given $B$ large enough, this variance is low. Bagging/bootstrapping gets around the lack of separate datasets $D_b$ in practice by repeatedly sampling with replacement from a single dataset $D$.
[^13]: For regression, one grows $B$ deep (unpruned) regression trees on $B$ bootstrapped datasets, each of which has low bias but high variance, then averages them to get a bootstrap estimate which has the same low bias but much lower variance. For classification (since we can't average over the classes predicted by the bootstrapped trees), a simple approach is to predict the majority class over the bootstrapped trees.
[^14]: In the case of stumps ($d = 1$), the boosted ensemble fits an additive model, since each term involves a single variable. More generally, $d$ is the interaction depth: since $d$ splits can involve at most $d$ variables, this controls the interaction order of the boosted model (i.e. the model can fit interaction terms up to degree $d$).