5. Resampling Methods

Estimate of standard error of sample mean of `medv` in `Boston` data set

Prepare the data
a. Sample mean as estimator of population mean
b. Plug-in estimate of standard error of the sample mean
c. Bootstrap estimate of standard error of sample mean
d. A 95% confidence interval for population mean
e. Sample median as estimator of population median
f. Bootstrap estimate of standard error of sample median
g. Sample quantile as estimator of population quantile
h. Bootstrap estimate of standard error of sample quantile
Footnotes

Prepare the data

import pandas as pd

boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()

	crim	zn	indus	nox	rm	age	dis	rad	tax	ptratio	black	lstat	medv
1	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1	296	15.3	396.90	4.98	24.0
2	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2	242	17.8	396.90	9.14	21.6
3	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2	242	17.8	392.83	4.03	34.7
4	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3	222	18.7	394.63	2.94	33.4
5	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3	222	18.7	396.90	5.33	36.2

boston.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB

a. Sample mean as estimator of population mean

We estimate the population mean $\mu$ of $X =$ `medv` with the sample mean ¹, $\hat{\mu} = \overline{X}$.

mu_hat = boston.medv.mean()
mu_hat

22.532806324110698

b. Plug-in estimate of standard error of the sample mean

For an i.i.d. sample of size $n$, the standard error of the sample mean is

$\mathbf{se}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}$

where $\sigma$ is the population standard deviation. Replacing $\sigma$ with its sample counterpart gives the plug-in estimate ²

$\hat{\mathbf{se}}(\hat{\mu}) = \frac{s}{\sqrt{n}}$

where $s$ is the sample standard deviation.

import numpy as np

mu_hat_se_hat = boston.medv.std()/np.sqrt(len(boston))
mu_hat_se_hat

0.4088611474975351
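
The formula for $\mathbf{se}(\hat{\mu})$ follows from a short variance calculation. As a sketch, assume the observations $X_1, \dots, X_n$ are independent with common variance $\sigma^2$ (the indexed notation $X_i$ is introduced here for illustration only):

$\mathrm{Var}(\overline{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n}$

$\mathbf{se}(\hat{\mu}) = \sqrt{\mathrm{Var}(\overline{X})} = \frac{\sigma}{\sqrt{n}}$

With $n = 506$ we have $\sqrt{n} \approx 22.49$, so the reported value of about $0.409$ corresponds to a sample standard deviation of roughly $s \approx 0.409 \times 22.49 \approx 9.2$.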

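c. Bootstrap estimate of standard error of sample mean

# Draw 100 bootstrap resamples of medv (same size, with replacement) and record each mean
boot_means = np.array([np.random.choice(boston.medv, size=len(boston.medv), replace=True).mean()
                       for _ in range(100)])
# The standard deviation of the resample means is the bootstrap standard error
boot_means.std()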

0.3195039288174233

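This is in the same range as the plug-in estimate of about 0.409, though noticeably lower. With only 100 resamples and no random seed set, the bootstrap estimate is itself noisy and will change from run to run.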
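
To make the comparison with the plug-in estimate less dependent on a single noisy run, the bootstrap can be wrapped in a seeded helper with more resamples. The following is a minimal sketch, not the notebook's own method: `bootstrap_se` is a hypothetical helper name, the 2000 resamples and the seeds are arbitrary choices, and the data are a synthetic normal stand-in for `boston.medv` so the block runs without the CSV file.

```python
import numpy as np

def bootstrap_se(x, stat=np.mean, n_boot=2000, seed=0):
    """Bootstrap standard error of `stat` on 1-D data `x` (hypothetical helper)."""
    x = np.asarray(x)
    rng = np.random.default_rng(seed)  # seeded so the result is reproducible
    # Draw n_boot resamples of the same size as x, with replacement
    reps = np.array([stat(rng.choice(x, size=x.size, replace=True))
                     for _ in range(n_boot)])
    return reps.std(ddof=1)

# Synthetic stand-in for boston.medv (assumption: NOT the real data)
data_rng = np.random.default_rng(1)
x = data_rng.normal(loc=22.5, scale=9.2, size=506)

plug_in_se = x.std(ddof=1) / np.sqrt(x.size)
boot_se = bootstrap_se(x)
print(plug_in_se, boot_se)
```

On the real data one would pass `boston.medv.to_numpy()` instead of the synthetic `x`. With a few thousand resamples the two estimates should typically agree to within a few percent.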

d. A $95\%$ confidence interval ³ for population mean

(mu_hat - 2*mu_hat_se_hat, mu_hat + 2*mu_hat_se_hat)

(21.715084029115626, 23.35052861910577)
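
The interval above uses 2 as a round approximation to the 0.975 standard normal quantile. As a hedged sketch, the block below computes the same kind of normal-based interval with the exact quantile and, for comparison, a percentile bootstrap interval. The synthetic data, the seed, and the 2000 resamples are assumptions made so the example is self-contained; they are not part of the original analysis.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
# Synthetic stand-in for boston.medv (assumption: NOT the real data)
x = rng.normal(loc=22.5, scale=9.2, size=506)

mu_hat = x.mean()
se_hat = x.std(ddof=1) / np.sqrt(x.size)

# Normal-based 95% interval using the exact quantile (about 1.96) instead of 2
z = NormalDist().inv_cdf(0.975)
normal_ci = (mu_hat - z * se_hat, mu_hat + z * se_hat)

# Percentile bootstrap interval: 2.5% and 97.5% quantiles of the resample means
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(2000)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
percentile_ci = (lo, hi)

print(normal_ci, percentile_ci)
```

For a statistic as well behaved as the mean with a sample of this size, the two intervals are expected to be very similar.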

e. Sample median as estimator ⁴ of population median

m_hat = boston.medv.median()
m_hat

21.2

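f. Bootstrap estimate of standard error of sample median

# Draw 100 bootstrap resamples of medv (same size, with replacement) and record each median
boot_meds = np.array([np.median(np.random.choice(boston.medv, size=len(boston.medv), replace=True))
                      for _ in range(100)])
# The standard deviation of the resample medians is the bootstrap standard error
boot_meds.std()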

0.38405045241478325

Since we don’t have another estimate to compare this to, we can’t really say anything about the accuracy of this one.
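
One way to get a feel for how well the bootstrap estimates the standard error of a median is a small simulation on data whose distribution is known. The sketch below assumes a normal population (which `medv` is not), so it only illustrates the idea; the parameters, seed, and repetition counts are arbitrary. It compares the bootstrap estimate from a single sample with the spread of the median over many fresh samples and with the large-sample approximation $\sqrt{\pi/2}\,\sigma/\sqrt{n}$ that holds for normal data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot, n_sim = 506, 1000, 2000
mu, sigma = 22.5, 9.2  # assumed population parameters, for illustration only

# Reference value: spread of the sample median over many fresh samples
sim_medians = np.array([np.median(rng.normal(mu, sigma, n)) for _ in range(n_sim)])
true_se = sim_medians.std(ddof=1)

# Bootstrap standard error computed from one single sample
x = rng.normal(mu, sigma, n)
idx = rng.integers(0, n, size=(n_boot, n))  # each row indexes one resample
boot_se = np.median(x[idx], axis=1).std(ddof=1)

# Large-sample approximation for the median of normal data
asymptotic_se = np.sqrt(np.pi / 2) * sigma / np.sqrt(n)

print(true_se, boot_se, asymptotic_se)
```

The bootstrap value varies from sample to sample, so agreement with the reference is only approximate; the same kind of check cannot be run directly on `medv`, whose population distribution is unknown.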

g. Sample quantile ⁵ as estimator of population quantile

q_hat = boston.medv.quantile(0.1)
q_hat

12.75

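h. Bootstrap estimate of standard error of sample quantile

# Draw 100 bootstrap resamples of medv (same size, with replacement) and record each 0.1 quantile
boot_quantiles = np.array([np.quantile(np.random.choice(boston.medv, size=len(boston.medv), replace=True), 0.1)
                           for _ in range(100)])
# The standard deviation of the resample quantiles is the bootstrap standard error
boot_quantiles.std()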

0.42275170017399105

Again, since we don’t have another estimate to compare this to, we can’t really say anything about the accuracy of this one.
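
Part of the uncertainty in a bootstrap standard error comes simply from the number of resamples. As a rough sketch, the block below repeats the bootstrap for the 0.1 quantile several times with 100 and with 2000 resamples and compares how much the resulting estimates vary. `boot_quantile_se` is a hypothetical helper, and the synthetic data, seed, and repetition counts are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def boot_quantile_se(x, q, n_boot, rng):
    """Bootstrap standard error of the q-quantile of x (hypothetical helper)."""
    # Each row of idx holds the indices of one resample drawn with replacement
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    return np.quantile(x[idx], q, axis=1).std(ddof=1)

rng = np.random.default_rng(0)
# Synthetic stand-in for boston.medv (assumption: NOT the real data)
x = rng.normal(loc=22.5, scale=9.2, size=506)

# Repeat the whole bootstrap 20 times for each number of resamples
se_small = np.array([boot_quantile_se(x, 0.1, 100, rng) for _ in range(20)])
se_large = np.array([boot_quantile_se(x, 0.1, 2000, rng) for _ in range(20)])

# Spread of the standard error estimates across repetitions
print(se_small.std(ddof=1), se_large.std(ddof=1))
```

The estimates based on 2000 resamples should fluctuate much less between repetitions, which suggests using more than 100 resamples when the run time allows. This addresses only the Monte Carlo noise, not the error that comes from having a single sample.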

Footnotes

1. By linearity of expectation, the sample mean is an unbiased estimator of the population mean. By the Law of Large Numbers it is consistent, and by the Central Limit Theorem it is asymptotically normal. The sample mean is also the plug-in estimator of the population mean (the plug-in estimate comes from estimating the population cdf with the empirical cdf; see All of Statistics, ch. 7).
2. The sample standard deviation is, up to the choice of $n$ versus $n-1$ in the denominator, the plug-in estimate of the population standard deviation.
3. This is a normal-based interval; its accuracy relies on the fact that the sample mean is asymptotically normal.
4. The sample median is the plug-in estimate of the population median.
5. The sample quantile is the plug-in estimate of the population quantile.

islr notes and exercises from An Introduction to Statistical Learning