islr notes and exercises from An Introduction to Statistical Learning

8. Tree-based Methods

Exercise 8: Predicting Sales in the Carseats dataset

Preparing the data

Information on the dataset can be found here

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set_style('whitegrid')

carseats = pd.read_csv('../../datasets/Carseats.csv', index_col=0)
carseats = carseats.reset_index(drop=True)
carseats.head()
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
1 11.22 111 48 16 260 83 Good 65 10 Yes Yes
2 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
3 7.40 117 100 4 466 97 Medium 55 14 Yes Yes
4 4.15 141 64 3 340 128 Bad 38 13 Yes No
carseats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
Sales          400 non-null float64
CompPrice      400 non-null int64
Income         400 non-null int64
Advertising    400 non-null int64
Population     400 non-null int64
Price          400 non-null int64
ShelveLoc      400 non-null object
Age            400 non-null int64
Education      400 non-null int64
Urban          400 non-null object
US             400 non-null object
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB
carseats['ShelveLoc'].unique()
array(['Bad', 'Good', 'Medium'], dtype=object)
# convert categorical to numeric values

from sklearn.preprocessing import LabelEncoder

# label encoders
cols = ['ShelveLoc', 'Urban', 'US']
encs = {col:None for col in cols}
for col in encs:
    encs[col] = LabelEncoder().fit(carseats[col].unique())
    carseats.loc[:, col] = encs[col].transform(carseats[col])

carseats.head()
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 0 42 17 1 1
1 11.22 111 48 16 260 83 1 65 10 1 1
2 10.06 113 35 10 269 80 2 59 12 1 1
3 7.40 117 100 4 466 97 2 55 14 1 1
4 4.15 141 64 3 340 128 0 38 13 1 0
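
Note that LabelEncoder assigns integer codes alphabetically, so ShelveLoc becomes Bad=0, Good=1, Medium=2, an ordering that doesn't reflect shelf quality. Trees can still split on the codes, but one-hot encoding is a common alternative. A minimal sketch with pd.get_dummies (the rest of the notebook sticks with the label-encoded frame; carseats_dummies is just an illustrative name):

# alternative: one-hot encode the categoricals instead of label encoding
carseats_raw = pd.read_csv('../../datasets/Carseats.csv', index_col=0).reset_index(drop=True)
carseats_dummies = pd.get_dummies(carseats_raw,
                                  columns=['ShelveLoc', 'Urban', 'US'],
                                  drop_first=True)  # drop one level per variable
carseats_dummies.head()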

a. Train test split

from sklearn.model_selection import train_test_split

X, y = carseats.drop(columns=['Sales']), carseats['Sales']
# default split is 75% train / 25% test; no random_state is set, so the split isn't reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y)

b. Regression tree

from sklearn.tree import DecisionTreeRegressor, export_graphviz

reg_tree = DecisionTreeRegressor(min_samples_split=5)
reg_tree.fit(X_train, y_train)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

For visualizing the tree, we found this Medium article helpful

from graphviz import Source
from sklearn import tree
from IPython.display import SVG
graph = Source(tree.export_graphviz(reg_tree, out_file=None, filled=True, 
                                    feature_names=X.columns))
display(SVG(graph.pipe(format='svg')))

[regression tree visualization rendered as SVG]
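
If graphviz isn't available, newer scikit-learn releases (0.21+) can draw the tree directly with matplotlib via sklearn.tree.plot_tree. A minimal sketch, assuming such a release; max_depth just keeps the figure readable:

from sklearn.tree import plot_tree

# render the fitted tree with matplotlib, truncated at depth 3 for readability
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(reg_tree, feature_names=list(X.columns), filled=True, max_depth=3, ax=ax)
plt.show()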

c. CV Grid search for optimal regression tree

Currently sklearn doesn’t implement cost complexity pruning (at the time of writing it is an open issue on GitHub, and a pending pull request may fix it).
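
For reference, scikit-learn 0.22 later added minimal cost complexity pruning through the ccp_alpha parameter and a cost_complexity_pruning_path method. A minimal sketch of how pruning could be tuned on a newer release (not run in this notebook):

from sklearn.model_selection import GridSearchCV

# candidate alphas from the cost complexity pruning path on the training set
path = DecisionTreeRegressor().cost_complexity_pruning_path(X_train, y_train)
ccp_search = GridSearchCV(estimator=DecisionTreeRegressor(),
                          param_grid={'ccp_alpha': path.ccp_alphas},
                          cv=6,
                          scoring='neg_mean_squared_error')
ccp_search.fit(X_train, y_train)
ccp_search.best_params_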

In the meantime, DecisionTreeRegressor has many parameters one can vary for optimization.

  • max_depth - The maximum depth of the tree.
  • min_samples_split - The minimum number of samples required to split an internal node (this is a stopping criterion)
  • min_samples_leaf - The minimum number of samples required to be at a leaf node.
  • max_leaf_nodes - Grow a tree with max_leaf_nodes in best-first fashion.
  • min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

This is a lot of parameters, and there is bound to be some conflict and redundancy in a grid search. We’ll make the parameter search a little faster with the following:

  • Both min_samples_leaf and min_samples_split are stopping criteria, but we’ll go with min_samples_split, since it’s the less restrictive of the two.
  • Both max_depth and max_leaf_nodes limit the size of the tree; we’ll choose max_depth, again since it’s the less restrictive criterion.

We would use a randomized search if the dataset were larger, but it’s small enough we can do an exhaustive (grid) search.

We don’t optimize max_features even though doing so would likely improve performance: restricting the features considered at each split makes the tree effectively a random forest with one tree, and we’ll be training a random forest model later in the exercise.

This sklearn doc was helpful

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# parameter grid of size 1e3

depths = list(range(1, 10)) + [None]
sample_splits = list(range(2, 12))
impurity_decreases = np.linspace(0, 1, 10)

reg_tree_params = {'max_depth': depths, 
              'min_samples_split': sample_splits,
              'min_impurity_decrease': impurity_decreases}

# exhaustive search over the full 10 x 10 x 10 parameter grid
reg_tree = DecisionTreeRegressor()
# use 6-fold CV since the training set has 300 observations
reg_tree_search = GridSearchCV(estimator=reg_tree,
                                param_grid=reg_tree_params,
                                cv=6,
                                scoring='neg_mean_squared_error')
%timeit -n1 -r1 reg_tree_search.fit(X_train, y_train)
28 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
reg_tree_search.best_params_
{'max_depth': 8, 'min_impurity_decrease': 0.0, 'min_samples_split': 11}
# 6-fold cv estimate of test rmse
np.sqrt(-reg_tree_search.best_score_)
2.0189730868235154
from sklearn.metrics import mean_squared_error

# test set mse
final_reg_tree = reg_tree_search.best_estimator_
reg_tree_test_mse = mean_squared_error(final_reg_tree.predict(X_test), y_test)
np.sqrt(reg_tree_test_mse)
2.3843108670794915
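
As a quick sanity check, we can look at how large the CV-selected tree actually is by inspecting the underlying tree_ object:

# depth and leaf count of the tuned tree
print('depth:', final_reg_tree.tree_.max_depth)
print('leaves:', (final_reg_tree.tree_.children_left == -1).sum())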

d. Bagged Regression Tree

from sklearn.ensemble import BaggingRegressor


bag_reg_params = {'n_estimators': np.arange(1, 100)}
bag_reg_tree = BaggingRegressor()

bag_reg_tree_search = GridSearchCV(estimator=bag_reg_tree,
                                   param_grid=bag_reg_params,
                                   cv=6,
                                   scoring='neg_mean_squared_error')

%timeit -n1 -r1 bag_reg_tree_search.fit(X_train, y_train)
44.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
bag_reg_tree_search.best_params_
{'n_estimators': 51}
# 6-fold cv estimate of test rmse
np.sqrt(-bag_reg_tree_search.best_score_)
1.6309302986679604
# test set mse
final_bag_reg_tree = bag_reg_tree_search.best_estimator_
bag_reg_tree_test_mse = mean_squared_error(final_bag_reg_tree.predict(X_test), y_test)
np.sqrt(bag_reg_tree_test_mse)
1.7220755952142313
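
Bagging also comes with a nearly free out-of-bag error estimate, since each tree is fit on a bootstrap sample and the left-out observations act as a built-in validation set. A minimal sketch using the tuned number of trees (oob_prediction_ can contain NaNs if n_estimators is very small):

# out-of-bag RMSE as an alternative to the CV estimate above
oob_bag_reg_tree = BaggingRegressor(n_estimators=51, oob_score=True)
oob_bag_reg_tree.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_train, oob_bag_reg_tree.oob_prediction_))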

e. Random Forest

sklearn’s RandomForestRegressor has a lot of parameters. We’ll use a tactic borrowed from this Medium article: a randomized search to narrow down the parameter ranges, then a grid search to fine-tune.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# start from the regression tree grid (copy so we don't mutate reg_tree_params)
random_forest_params = reg_tree_params.copy()
random_forest_params['max_features'] = np.arange(1, 10)
random_forest_params['n_estimators'] = np.arange(10, 100, 10)

random_forest = RandomForestRegressor()

# random search to narrow the parameter space
random_forest_rsearch = RandomizedSearchCV(estimator=random_forest,
                                   param_distributions=random_forest_params,
                                   n_iter=500,
                                   cv=6,
                                   scoring='neg_mean_squared_error',
                                   random_state=27)

%timeit -n1 -r1 random_forest_rsearch.fit(X_train, y_train)
2min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_forest_rsearch.best_params_
{'n_estimators': 60,
 'min_samples_split': 5,
 'min_impurity_decrease': 0.0,
 'max_features': 6,
 'max_depth': 9}
# grid search to fine tune
params = {'n_estimators': np.arange(31, 51, 2),
          'min_samples_split': [2, 3, 4, 5, 6],
          'min_impurity_decrease': [0.0, 1e-3, 1e-2],
          'max_features': [6, 7, 8, 9, 10] 
           }
random_forest_cvsearch = GridSearchCV(estimator=random_forest,
                                      param_grid=params,
                                      cv=6,
                                      scoring='neg_mean_squared_error')
%timeit -n1 -r1 random_forest_cvsearch.fit(X_train, y_train)
4min 19s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_forest_cvsearch.best_params_
{'max_features': 8,
 'min_impurity_decrease': 0.01,
 'min_samples_split': 2,
 'n_estimators': 43}
# 6-fold cv estimate of test rmse
np.sqrt(-random_forest_cvsearch.best_score_)
1.6234289198882141
# test set mse
final_random_forest = random_forest_cvsearch.best_estimator_
random_forest_test_mse = mean_squared_error(final_random_forest.predict(X_test), y_test)
np.sqrt(random_forest_test_mse)
1.712239660011634
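
To compare the three models side by side, we can collect the test RMSEs computed above:

# test-set RMSE for the tuned regression tree, bagged trees, and random forest
pd.DataFrame({'model': ['regression tree', 'bagged trees', 'random forest'],
              'test_rmse': np.sqrt([reg_tree_test_mse,
                                    bag_reg_tree_test_mse,
                                    random_forest_test_mse])})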
# feature importances
rf_feat_imp = pd.DataFrame({'feature': X_train.columns, 
                            'importance': final_random_forest.feature_importances_})
rf_feat_imp.sort_values(by='importance', ascending=False)
feature importance
4 Price 0.337647
5 ShelveLoc 0.202521
0 CompPrice 0.132621
2 Advertising 0.097423
6 Age 0.088481
1 Income 0.054456
3 Population 0.037483
7 Education 0.028029
9 US 0.015726
8 Urban 0.005614
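
The ranking is easier to see as a bar chart; a quick sketch using the seaborn import from the top of the notebook:

# horizontal bar chart of the random forest feature importances
sns.barplot(x='importance', y='feature',
            data=rf_feat_imp.sort_values(by='importance', ascending=False),
            color='steelblue')
plt.title('Random forest feature importances')
plt.show()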