medv in Boston data set
import pandas as pd
boston = pd.read_csv('../../datasets/Boston.csv', index_col=0)
boston.head()
| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
boston.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 1 to 506
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
black 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 59.3 KB
Estimate the population mean $\mu$ using the sample mean[1], $\hat{\mu} = \bar{X}$, where $X$ is medv
mu_hat = boston.medv.mean()
mu_hat
22.532806324110698
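By the Central Limit Theorem,
$$\mathrm{se}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}$$
where $\sigma$ is the population standard deviation. So we have the plug-in estimate[2]
$$\widehat{\mathrm{se}}(\hat{\mu}) = \frac{s}{\sqrt{n}}$$
where $s$ is the sample standard deviation and $n$ is the sample size.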
import numpy as np
mu_hat_se_hat = boston.medv.std()/np.sqrt(len(boston))
mu_hat_se_hat
0.4088611474975351
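As a quick check on the formula, the plug-in standard error can be computed by hand. The sketch below uses a small made-up array (not the Boston data) and assumes the sample standard deviation uses the n - 1 denominator, which is the pandas default for `Series.std()`. NumPy's `np.std` defaults to `ddof=0`, so `ddof=1` has to be passed explicitly to match.

```python
import numpy as np
import pandas as pd

# Hypothetical sample, used only for illustration (not the Boston data)
x = pd.Series([24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1])
n = len(x)

# Sample standard deviation with the n - 1 denominator, computed by hand
s_manual = np.sqrt(((x - x.mean()) ** 2).sum() / (n - 1))

# Plug-in standard error of the sample mean: s / sqrt(n)
se_manual = s_manual / np.sqrt(n)

# The same quantity via pandas (ddof=1 by default) and NumPy (ddof must be set)
se_pandas = x.std() / np.sqrt(n)
se_numpy = np.std(x.to_numpy(), ddof=1) / np.sqrt(n)
```

All three values should agree up to floating-point error; pandas also exposes the same calculation directly as `x.sem()`.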
boot_means = np.array([np.random.choice(boston.medv, size=len(boston.medv), replace=True).mean()
for i in range(100)])
boot_means.std()
0.3195039288174233
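This is in the same range as the plug-in estimate (0.409), though about 22% smaller; with only 100 bootstrap replicates, some Monte Carlo noise is expected.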
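The bootstrap estimate depends on the random draws and on the number of replicates. A minimal sketch of a seeded, vectorized version follows, run on synthetic data rather than the Boston data; the sample `x`, the seed, and the choice of 2000 replicates are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Synthetic stand-in for medv (assumption: not the real data)
x = rng.normal(loc=22.5, scale=9.0, size=506)
n = len(x)

# Draw all bootstrap samples at once: one row of resampled indices per replicate
n_boot = 2000
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = x[idx].mean(axis=1)

# Bootstrap standard error: standard deviation of the replicate means
se_boot = boot_means.std(ddof=1)

# Plug-in standard error for comparison
se_plugin = x.std(ddof=1) / np.sqrt(n)
```

With a few thousand replicates, `se_boot` and `se_plugin` would be expected to agree to within a few percent for the sample mean.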
(mu_hat - 2*mu_hat_se_hat, mu_hat + 2*mu_hat_se_hat)
(21.715084029115626, 23.35052861910577)
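The interval above uses 2 as a round approximation to the 97.5% quantile of the standard normal distribution. A small sketch, reusing the estimates printed earlier, shows how the two multipliers compare; `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist

# Estimates copied from the outputs above
mu_hat = 22.532806324110698
mu_hat_se_hat = 0.4088611474975351

# 97.5% quantile of the standard normal (about 1.96)
z = NormalDist().inv_cdf(0.975)

# Interval with the rounded multiplier 2, as in the text
ci_rounded = (mu_hat - 2 * mu_hat_se_hat, mu_hat + 2 * mu_hat_se_hat)

# Interval with the normal quantile itself, which is slightly narrower
ci_exact = (mu_hat - z * mu_hat_se_hat, mu_hat + z * mu_hat_se_hat)
```

The difference between the two intervals is small, so the rounded multiplier is a reasonable shortcut here.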
m_hat = boston.medv.median()
m_hat
21.2
boot_meds = np.array([np.median(np.random.choice(boston.medv, size=len(boston.medv), replace=True))
for i in range(100)])
boot_meds.std()
0.38405045241478325
Since we don’t have another estimate of the standard error to compare this to, we can’t really assess the accuracy of this one.
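The same resampling loop works for any statistic, so it can be wrapped in a helper. The function below is a hypothetical sketch (the name `bootstrap_se` and its parameters are not from the original analysis) and is demonstrated on synthetic data, not the Boston data.

```python
import numpy as np

def bootstrap_se(x, statistic, n_boot=1000, seed=None):
    """Bootstrap standard error of `statistic` evaluated on the 1-D sample `x`."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    # Recompute the statistic on n_boot resamples drawn with replacement
    replicates = np.array([
        statistic(rng.choice(x, size=len(x), replace=True))
        for _ in range(n_boot)
    ])
    return replicates.std(ddof=1)

# Synthetic, right-skewed stand-in for medv (assumption: not the real data)
sample = np.random.default_rng(1).gamma(shape=6.0, scale=3.75, size=506)

se_median = bootstrap_se(sample, np.median, n_boot=1000, seed=2)
```

Passing a seed makes the result reproducible, which the unseeded `np.random.choice` calls above are not.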
q_hat = boston.medv.quantile(0.1)
q_hat
12.75
boot_quantiles = np.array([np.quantile(np.random.choice(boston.medv, size=len(boston.medv), replace=True), 0.1)
for i in range(100)])
boot_quantiles.std()
0.42275170017399105
Again, since we don’t have another estimate of the standard error to compare this to, we can’t really assess the accuracy of this one.
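One thing that can be checked without a second estimator is how much the bootstrap standard error itself moves from run to run. The sketch below, on synthetic data with assumed replicate counts, repeats the bootstrap for the 10th percentile 20 times and compares the spread for 100 versus 2000 replicates. The helper name `boot_se_quantile` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Synthetic stand-in for medv (assumption: not the real data)
x = rng.gamma(shape=6.0, scale=3.75, size=300)

def boot_se_quantile(x, q, n_boot, rng):
    """Bootstrap standard error of the q-th sample quantile."""
    # One resample per row, then the quantile of each row
    samples = rng.choice(x, size=(n_boot, len(x)), replace=True)
    return np.quantile(samples, q, axis=1).std(ddof=1)

# Repeat the whole bootstrap 20 times for each replicate count
se_small = np.array([boot_se_quantile(x, 0.1, 100, rng) for _ in range(20)])
se_large = np.array([boot_se_quantile(x, 0.1, 2000, rng) for _ in range(20)])

# Run-to-run spread of the standard error estimate
spread_small = se_small.std(ddof=1)
spread_large = se_large.std(ddof=1)
```

With more replicates the estimate should vary less between runs, which is one reason to prefer well over 100 replicates when the computation is cheap.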
[1] By linearity of expectation, the sample mean is an unbiased estimator of the population mean. By the Law of Large Numbers it is consistent, and by the Central Limit Theorem it is asymptotically normal.
The sample mean is the plug-in estimator of the population mean (the plug-in estimate comes from estimating the population cdf with the empirical cdf; see All of Statistics, ch. 7).
[2] The sample standard deviation is the plug-in estimate of the population standard deviation.
[3] This is a normal-based interval; its accuracy relies on the fact that the sample mean is asymptotically normal.
[4] The sample median is the plug-in estimate of the population median.
[5] The sample quantile is the plug-in estimate of the population quantile.