islr notes and exercises from An Introduction to Statistical Learning

8. Tree-based Methods

Exercise 8: Predicting Sales in the Carseats dataset

Preparing the data

Information on the dataset can be found here

%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set_style('whitegrid')

carseats = pd.read_csv('../../datasets/Carseats.csv', index_col=0)
carseats = carseats.reset_index(drop=True)
carseats.head()
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
1 11.22 111 48 16 260 83 Good 65 10 Yes Yes
2 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
3 7.40 117 100 4 466 97 Medium 55 14 Yes Yes
4 4.15 141 64 3 340 128 Bad 38 13 Yes No
carseats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
Sales          400 non-null float64
CompPrice      400 non-null int64
Income         400 non-null int64
Advertising    400 non-null int64
Population     400 non-null int64
Price          400 non-null int64
ShelveLoc      400 non-null object
Age            400 non-null int64
Education      400 non-null int64
Urban          400 non-null object
US             400 non-null object
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB
carseats['ShelveLoc'].unique()
array(['Bad', 'Good', 'Medium'], dtype=object)
# convert categorical to numeric values

from sklearn.preprocessing import LabelEncoder

# label encoders
cols = ['ShelveLoc', 'Urban', 'US']
encs = {col:None for col in cols}
for col in encs:
    encs[col] = LabelEncoder().fit(carseats[col].unique())
    carseats.loc[:, col] = encs[col].transform(carseats[col])

carseats.head()
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 0 42 17 1 1
1 11.22 111 48 16 260 83 1 65 10 1 1
2 10.06 113 35 10 269 80 2 59 12 1 1
3 7.40 117 100 4 466 97 2 55 14 1 1
4 4.15 141 64 3 340 128 0 38 13 1 0
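
Note that LabelEncoder assigns integer codes alphabetically, so ShelveLoc becomes Bad=0, Good=1, Medium=2, an ordering that doesn't reflect shelf quality. Trees can still split on the codes, but one-hot encoding is a common alternative. A minimal sketch with pd.get_dummies (the rest of the notebook sticks with the label-encoded frame; carseats_dummies is just an illustrative name):

# alternative: one-hot encode the categoricals instead of label encoding
carseats_raw = pd.read_csv('../../datasets/Carseats.csv', index_col=0).reset_index(drop=True)
carseats_dummies = pd.get_dummies(carseats_raw,
                                  columns=['ShelveLoc', 'Urban', 'US'],
                                  drop_first=True)  # drop one level per variable
carseats_dummies.head()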

a. Train test split

from sklearn.model_selection import train_test_split

X, y = carseats.drop(columns=['Sales']), carseats['Sales']
# default split is 75% train / 25% test; no random_state is set, so the split isn't reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y)

b. Regression tree

from sklearn.tree import DecisionTreeRegressor, export_graphviz

reg_tree = DecisionTreeRegressor(min_samples_split=5)
reg_tree.fit(X_train, y_train)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

For visualizing the tree, we found this Medium article helpful

from graphviz import Source
from sklearn import tree
from IPython.display import SVG
graph = Source(tree.export_graphviz(reg_tree, out_file=None, filled=True, 
                                    feature_names=X.columns))
display(SVG(graph.pipe(format='svg')))

[regression tree visualization rendered as SVG]
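
If graphviz isn't available, newer scikit-learn releases (0.21+) can draw the tree directly with matplotlib via sklearn.tree.plot_tree. A minimal sketch, assuming such a release; max_depth just keeps the figure readable:

from sklearn.tree import plot_tree

# render the fitted tree with matplotlib, truncated at depth 3 for readability
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(reg_tree, feature_names=list(X.columns), filled=True, max_depth=3, ax=ax)
plt.show()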

c. CV Grid search for optimal regression tree

Currently sklearn doesn’t implement cost complexity pruning (at the time of writing it is an open issue on GitHub, and a pending pull request may fix it).
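
For reference, scikit-learn 0.22 later added minimal cost complexity pruning through the ccp_alpha parameter and a cost_complexity_pruning_path method. A minimal sketch of how pruning could be tuned on a newer release (not run in this notebook):

from sklearn.model_selection import GridSearchCV

# candidate alphas from the cost complexity pruning path on the training set
path = DecisionTreeRegressor().cost_complexity_pruning_path(X_train, y_train)
ccp_search = GridSearchCV(estimator=DecisionTreeRegressor(),
                          param_grid={'ccp_alpha': path.ccp_alphas},
                          cv=6,
                          scoring='neg_mean_squared_error')
ccp_search.fit(X_train, y_train)
ccp_search.best_params_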

In the meantime, DecisionTreeRegressor has many parameters one can vary for optimization.

  • max_depth - The maximum depth of the tree.
  • min_samples_split - The minimum number of samples required to split an internal node (this is a stopping criterion)
  • min_samples_leaf - The minimum number of samples required to be at a leaf node.
  • max_leaf_nodes - Grow a tree with max_leaf_nodes in best-first fashion.
  • min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

This is a lot of parameters, and there is bound to be some conflict and redundancy in a grid search. We’ll make the parameter search a little faster with the following:

  • Both min_samples_leaf and min_samples_split are stopping criteria, but we’ll go with min_samples_split, since it’s the less restrictive of the two.
  • Both max_depth and max_leaf_nodes limit the size of the tree; we’ll choose max_depth, again since it’s the less restrictive criterion.

We would use a randomized search if the dataset were larger, but it’s small enough we can do an exhaustive (grid) search.

We don’t optimize max_features even though doing so would likely improve performance: restricting the features considered at each split makes the tree effectively a random forest with one tree, and we’ll be training a random forest model later in the exercise.

This sklearn doc was helpful

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# parameter grid of size 1e3

depths = list(range(1, 10)) + [None]
sample_splits = list(range(2, 12))
impurity_decreases = np.linspace(0, 1, 10)

reg_tree_params = {'max_depth': depths, 
              'min_samples_split': sample_splits,
              'min_impurity_decrease': impurity_decreases}

# exhaustive search over the full 10 x 10 x 10 parameter grid
reg_tree = DecisionTreeRegressor()
# use 6-fold CV since the training set has 300 observations
reg_tree_search = GridSearchCV(estimator=reg_tree,
                                param_grid=reg_tree_params,
                                cv=6,
                                scoring='neg_mean_squared_error')
%timeit -n1 -r1 reg_tree_search.fit(X_train, y_train)
28 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
reg_tree_search.best_params_
{'max_depth': 8, 'min_impurity_decrease': 0.0, 'min_samples_split': 11}
# 6-fold cv estimate of test rmse
np.sqrt(-reg_tree_search.best_score_)
2.0189730868235154
from sklearn.metrics import mean_squared_error

# test set mse
final_reg_tree = reg_tree_search.best_estimator_
reg_tree_test_mse = mean_squared_error(final_reg_tree.predict(X_test), y_test)
np.sqrt(reg_tree_test_mse)
2.3843108670794915
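
As a quick sanity check, we can look at how large the CV-selected tree actually is by inspecting the underlying tree_ object:

# depth and leaf count of the tuned tree
print('depth:', final_reg_tree.tree_.max_depth)
print('leaves:', (final_reg_tree.tree_.children_left == -1).sum())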

d. Bagged Regression Tree

from sklearn.ensemble import BaggingRegressor


bag_reg_params = {'n_estimators': np.arange(1, 100)}
bag_reg_tree = BaggingRegressor()

bag_reg_tree_search = GridSearchCV(estimator=bag_reg_tree,
                                   param_grid=bag_reg_params,
                                   cv=6,
                                   scoring='neg_mean_squared_error')

%timeit -n1 -r1 bag_reg_tree_search.fit(X_train, y_train)
44.9 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
bag_reg_tree_search.best_params_
{'n_estimators': 51}
# 6-fold cv estimate of test rmse
np.sqrt(-bag_reg_tree_search.best_score_)
1.6309302986679604
# test set mse
final_bag_reg_tree = bag_reg_tree_search.best_estimator_
bag_reg_tree_test_mse = mean_squared_error(final_bag_reg_tree.predict(X_test), y_test)
np.sqrt(bag_reg_tree_test_mse)
1.7220755952142313
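
Bagging also comes with a nearly free out-of-bag error estimate, since each tree is fit on a bootstrap sample and the left-out observations act as a built-in validation set. A minimal sketch using the tuned number of trees (oob_prediction_ can contain NaNs if n_estimators is very small):

# out-of-bag RMSE as an alternative to the CV estimate above
oob_bag_reg_tree = BaggingRegressor(n_estimators=51, oob_score=True)
oob_bag_reg_tree.fit(X_train, y_train)
np.sqrt(mean_squared_error(y_train, oob_bag_reg_tree.oob_prediction_))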

e. Random Forest

sklearn’s RandomForestRegressor has a lot of parameters. We’ll use a tactic borrowed from this Medium article: a randomized search to narrow down the parameter ranges, then a grid search to fine-tune.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# start from the regression tree grid (copy so we don't mutate reg_tree_params)
random_forest_params = reg_tree_params.copy()
random_forest_params['max_features'] = np.arange(1, 10)
random_forest_params['n_estimators'] = np.arange(10, 100, 10)

random_forest = RandomForestRegressor()

# random search to narrow the parameter space
random_forest_rsearch = RandomizedSearchCV(estimator=random_forest,
                                   param_distributions=random_forest_params,
                                   n_iter=500,
                                   cv=6,
                                   scoring='neg_mean_squared_error',
                                   random_state=27)

%timeit -n1 -r1 random_forest_rsearch.fit(X_train, y_train)
2min 13s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_forest_rsearch.best_params_
{'n_estimators': 60,
 'min_samples_split': 5,
 'min_impurity_decrease': 0.0,
 'max_features': 6,
 'max_depth': 9}
# grid search to fine tune
params = {'n_estimators': np.arange(31, 51, 2),
          'min_samples_split': [2, 3, 4, 5, 6],
          'min_impurity_decrease': [0.0, 1e-3, 1e-2],
          'max_features': [6, 7, 8, 9, 10] 
           }
random_forest_cvsearch = GridSearchCV(estimator=random_forest,
                                      param_grid=params,
                                      cv=6,
                                      scoring='neg_mean_squared_error')
%timeit -n1 -r1 random_forest_cvsearch.fit(X_train, y_train)
4min 19s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
random_forest_cvsearch.best_params_
{'max_features': 8,
 'min_impurity_decrease': 0.01,
 'min_samples_split': 2,
 'n_estimators': 43}
# 6-fold cv estimate of test rmse
np.sqrt(-random_forest_cvsearch.best_score_)
1.6234289198882141
# test set mse
final_random_forest = random_forest_cvsearch.best_estimator_
random_forest_test_mse = mean_squared_error(final_random_forest.predict(X_test), y_test)
np.sqrt(random_forest_test_mse)
1.712239660011634
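
To compare the three models side by side, we can collect the test RMSEs computed above:

# test-set RMSE for the tuned regression tree, bagged trees, and random forest
pd.DataFrame({'model': ['regression tree', 'bagged trees', 'random forest'],
              'test_rmse': np.sqrt([reg_tree_test_mse,
                                    bag_reg_tree_test_mse,
                                    random_forest_test_mse])})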
# feature importances
rf_feat_imp = pd.DataFrame({'feature': X_train.columns, 
                            'importance': final_random_forest.feature_importances_})
rf_feat_imp.sort_values(by='importance', ascending=False)
feature importance
4 Price 0.337647
5 ShelveLoc 0.202521
0 CompPrice 0.132621
2 Advertising 0.097423
6 Age 0.088481
1 Income 0.054456
3 Population 0.037483
7 Education 0.028029
9 US 0.015726
8 Urban 0.005614
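
The ranking is easier to see as a bar chart; a quick sketch using the seaborn import from the top of the notebook:

# horizontal bar chart of the random forest feature importances
sns.barplot(x='importance', y='feature',
            data=rf_feat_imp.sort_values(by='importance', ascending=False),
            color='steelblue')
plt.title('Random forest feature importances')
plt.show()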