islr notes and exercises from An Introduction to Statistical Learning

5. Resampling Methods

Estimate of standard error of sample mean of medv in Boston data set

Prepare the data

import pandas as pd

boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()
      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
1  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
2  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
3  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
4  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
5  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
boston.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB

a. Sample mean as estimator of population mean

Estimate using the sample mean[1], $\overline{X} = \hat{\mu}$, where $X$ is medv

mu_hat = boston.medv.mean()
mu_hat
22.532806324110698

b. Plug-in estimate of standard error of the sample mean

For independent, identically distributed observations, the variance of the sample mean is $\sigma^2/n$, so its standard error is

$$\mathbf{se}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the population standard deviation. So we have the plug-in estimate[2]

$$\hat{\mathbf{se}}(\hat{\mu}) = \frac{s}{\sqrt{n}}$$

where $s$ is the sample standard deviation.

import numpy as np

mu_hat_se_hat = boston.medv.std()/np.sqrt(len(boston))
mu_hat_se_hat
0.4088611474975351
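
To make concrete what this quantity measures, here is a minimal simulation sketch. It does not use the Boston data: it assumes a synthetic normal population with known standard deviation, draws many independent samples, and compares the standard deviation of their means with $\sigma/\sqrt{n}$ and with the plug-in estimate from a single sample. The names `mu`, `sigma`, `n_samples` and the seed are illustrative choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: normal with known mean and standard deviation
mu, sigma, n = 20.0, 9.0, 500
n_samples = 5000

# Each row is one independent sample of size n
samples = rng.normal(mu, sigma, size=(n_samples, n))

# Standard deviation of the sample mean across repeated samples
sd_of_means = samples.mean(axis=1).std(ddof=1)

# Theoretical standard error, and the plug-in estimate from the first sample only
se_theory = sigma / np.sqrt(n)
se_plug_in = samples[0].std(ddof=1) / np.sqrt(n)

print(sd_of_means, se_theory, se_plug_in)
```

The three printed numbers should be close to one another; the plug-in value varies the most because it is computed from a single sample.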

c. Bootstrap estimate of standard error of sample mean

# Resample medv with replacement 100 times and record the mean of each resample
boot_means = np.array([np.random.choice(boston.medv, size=len(boston.medv), replace=True).mean()
                       for i in range(100)])
boot_means.std()
0.3195039288174233

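Reasonably close to the plug-in estimate (about 0.32 versus 0.41). With only 100 bootstrap replicates this estimate carries noticeable Monte Carlo noise, and it changes from run to run because no random seed is set.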
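
The resampling step can be wrapped in a small reusable helper. The following is a sketch, not the method used above: `bootstrap_se` is a hypothetical function name, the data are a synthetic stand-in for a numeric column such as medv, and the seeds and the choice of 2000 replicates are arbitrary.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=1000, seed=None):
    """Bootstrap estimate of the standard error of statistic(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Resample n observations with replacement, n_boot times
    replicates = np.array([statistic(rng.choice(data, size=n, replace=True))
                           for _ in range(n_boot)])
    # ddof=1 gives the usual sample standard deviation of the replicates
    return replicates.std(ddof=1)

# Synthetic stand-in data (assumed, not the Boston data set)
rng = np.random.default_rng(1)
x = rng.normal(22.5, 9.2, size=506)

se_boot = bootstrap_se(x, np.mean, n_boot=2000, seed=2)
se_formula = x.std(ddof=1) / np.sqrt(len(x))
print(se_boot, se_formula)
```

Passing `np.median` or a quantile function as `statistic` gives the corresponding bootstrap standard error without changing the helper.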

d. A 95% confidence interval[3] for population mean

(mu_hat - 2*mu_hat_se_hat, mu_hat + 2*mu_hat_se_hat)
(21.715084029115626, 23.35052861910577)
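
The interval above uses 2 as a rounded multiplier. As a sketch, the same interval can be computed with the exact standard normal quantile from the standard library. The helper name `normal_interval` is hypothetical, and the two numeric inputs are copied from the outputs reported earlier in these notes.

```python
from statistics import NormalDist

# Values reported earlier in these notes
mu_hat = 22.532806324110698
mu_hat_se_hat = 0.4088611474975351

def normal_interval(estimate, se, level=0.95):
    """Normal-based confidence interval: estimate +/- z * se."""
    # z is the (1 + level) / 2 quantile of the standard normal, about 1.96 for 95%
    z = NormalDist().inv_cdf((1 + level) / 2)
    return estimate - z * se, estimate + z * se

lo, hi = normal_interval(mu_hat, mu_hat_se_hat)
print(lo, hi)
```

Because the exact multiplier is slightly below 2, this interval is slightly narrower than the one computed above.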

e. Sample median as estimator[4] of population median

m_hat = boston.medv.median()
m_hat
21.2

f. Bootstrap estimate of standard error of sample median

# Same resampling scheme, recording the median of each resample
boot_meds = np.array([np.median(np.random.choice(boston.medv, size=len(boston.medv), replace=True))
                      for i in range(100)])
boot_meds.std()
0.38405045241478325

Since we don’t have another estimate to compare this to, we can’t directly assess its accuracy.
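
One way to get a feel for how far a bootstrap standard error of the median can be trusted is to try it on data where the answer can be simulated. The sketch below assumes a hypothetical exponential population (chosen only for illustration, unrelated to medv), estimates the true standard error of the median from many fresh samples, and compares it with the bootstrap estimate from one sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot, n_sim = 500, 2000, 2000

# Hypothetical skewed population, chosen only for illustration
def draw(size):
    return rng.exponential(scale=20.0, size=size)

# Reference value: standard deviation of the median over many fresh samples
true_se = np.median(draw((n_sim, n)), axis=1).std(ddof=1)

# Bootstrap standard error computed from a single observed sample
sample = draw(n)
idx = rng.integers(0, n, size=(n_boot, n))  # one row of resampling indices per replicate
boot_se = np.median(sample[idx], axis=1).std(ddof=1)

print(true_se, boot_se)
```

The two values should agree only roughly, since the bootstrap estimate depends on the single sample it was computed from. This says nothing directly about the Boston data, where the population is unknown.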

g. Sample quantile[5] as estimator of population quantile

q_hat = boston.medv.quantile(0.1)
q_hat
12.75

h. Bootstrap estimate of standard error of sample quantile

# Same resampling scheme, recording the 0.1 quantile of each resample
boot_quantiles = np.array([np.quantile(np.random.choice(boston.medv, size=len(boston.medv), replace=True), 0.1)
                           for i in range(100)])
boot_quantiles.std()
0.42275170017399105

Again, since we don’t have another estimate to compare this to, we can’t directly assess its accuracy.
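
Part of the uncertainty in a bootstrap standard error comes from the number of replicates. The sketch below, on synthetic stand-in data (assumed, not medv), repeats the whole bootstrap 20 times with 100 and with 2000 replicates and reports how much the resulting estimates spread. The helper name `boot_quantile_se` is hypothetical.

```python
import numpy as np

def boot_quantile_se(data, q, n_boot, rng):
    """Bootstrap standard error of the q-th sample quantile."""
    n = len(data)
    idx = rng.integers(0, n, size=(n_boot, n))  # one row of indices per replicate
    return np.quantile(data[idx], q, axis=1).std(ddof=1)

rng = np.random.default_rng(0)
data = rng.normal(22.5, 9.2, size=506)  # synthetic stand-in data

# Repeat the whole bootstrap 20 times for each replicate count
spread = {}
for n_boot in (100, 2000):
    estimates = [boot_quantile_se(data, 0.1, n_boot, rng) for _ in range(20)]
    spread[n_boot] = np.std(estimates, ddof=1)

print(spread)
```

A smaller spread at 2000 replicates indicates less Monte Carlo noise; it does not remove the sampling variability that comes from the data themselves.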

Footnotes

By linearity of expectation, the sample mean is an unbiased estimator of the population mean. By the Law of Large Numbers it is consistent, and by the Central Limit Theorem it is asymptotically normal.

  1. The sample mean is the plug-in estimator of the population mean (the plug-in estimate comes from estimating the population cdf with the empirical cdf; see All of Statistics, ch. 7).

  2. The sample standard deviation is the plug-in estimate of the population standard deviation.

  3. This is a normal-based interval; its accuracy relies on the fact that the sample mean is asymptotically normal. The multiplier 2 is a rounded value of the 0.975 standard normal quantile, approximately 1.96.

  4. The sample median is the plug-in estimate of the population median.

  5. The sample quantile is the plug-in estimate of the population quantile.
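
The plug-in idea referenced in these footnotes can be written out briefly. This is a sketch in standard notation: $F$ for the population cdf, $\hat{F}_n$ for the empirical cdf, and $T$ for a statistical functional. These symbols are introduced here for illustration and are not used elsewhere in these notes.

```latex
\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},
\qquad
\hat{\theta} = T(\hat{F}_n) \ \text{estimates}\ \theta = T(F).

\text{Mean: } T(F) = \int x \, dF(x)
\;\Rightarrow\;
T(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} X_i = \overline{X}.

\text{Quantile: } T(F) = F^{-1}(p) = \inf\{x : F(x) \ge p\}
\;\Rightarrow\;
T(\hat{F}_n) = \hat{F}_n^{-1}(p).
```

Library quantile functions often interpolate between order statistics by default, so their output can differ slightly from $\hat{F}_n^{-1}(p)$.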