USArrests dataset
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set_style('whitegrid')
arrests = pd.read_csv('../../datasets/USAressts.csv', index_col=0)
arrests.head()
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |
| Arkansas | 8.8 | 190 | 50 | 19.5 |
| California | 9.0 | 276 | 91 | 40.6 |
arrests.info()
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
Murder 50 non-null float64
Assault 50 non-null int64
UrbanPop 50 non-null int64
Rape 50 non-null float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB
# standardize the data
arrests_std = (arrests - arrests.mean())/arrests.std()
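One detail worth checking here: pandas' `.std()` uses the sample standard deviation (`ddof=1`), while scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on hypothetical random data standing in for the arrests table (the name `toy` is made up for illustration), suggests the two standardizations differ only by a constant factor, which should not affect the explained variance ratios.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the arrests table (50 rows, 4 columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))

# pandas: sample standard deviation (ddof=1), as in the cell above
toy_std = (toy - toy.mean()) / toy.std()

# scikit-learn: population standard deviation (ddof=0)
toy_scaled = StandardScaler().fit_transform(toy)

# The two results differ only by the constant factor sqrt((n - 1) / n)
n = len(toy)
factor = np.sqrt((n - 1) / n)
max_diff = np.abs(toy_std.to_numpy() - toy_scaled * factor).max()
print(max_diff)
```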
from sklearn.decomposition import PCA
pca = PCA(n_components=arrests_std.shape[1])
pca.fit(arrests_std)
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
pca.explained_variance_ratio_
array([0.62006039, 0.24744129, 0.0891408 , 0.04335752])
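To see where these ratios come from, here is a minimal sketch on hypothetical random data (not the arrests table). It assumes the usual characterization: the explained variances are the eigenvalues of the sample covariance matrix of the standardized data, and each ratio is an eigenvalue divided by their sum.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated toy data (50 observations, 4 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
ratio_manual = eigvals / eigvals.sum()

# The same quantity as reported by scikit-learn
ratio_sklearn = PCA(n_components=4).fit(X_std).explained_variance_ratio_

print(ratio_manual)
print(ratio_sklearn)
```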
Confusingly, the “components” in sklearn are what the book calls the principal component loading vectors (see Table 10.1). The rows of `pca.components_` are the loading vectors, so its transpose is the loading matrix, with one loading vector per column:
pca.components_.transpose()
array([[ 0.53589947, 0.41818087, -0.34123273, 0.6492278 ],
[ 0.58318363, 0.1879856 , -0.26814843, -0.74340748],
[ 0.27819087, -0.87280619, -0.37801579, 0.13387773],
[ 0.54343209, -0.16731864, 0.81777791, 0.08902432]])
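As a sanity check, the columns of a loading matrix should be orthonormal. The sketch below verifies this on hypothetical random data (`pca_toy` and `Phi` are illustrative names, not part of the notebook). Note also that each loading vector is only determined up to sign, so the signs returned by sklearn may differ from those printed in a textbook table.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data (50 observations, 4 features)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
pca_toy = PCA(n_components=4).fit(X)

# Columns of Phi are the loading vectors
Phi = pca_toy.components_.T

# Orthonormal columns mean Phi^T Phi is the identity matrix
gram = Phi.T @ Phi
is_orthonormal = np.allclose(gram, np.eye(4))
print(is_orthonormal)
```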
The principal components (or really, the “scores”) are given by the transform
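Z = pca.transform(arrests_std)
Z = pd.DataFrame(Z)
# total variance of the standardized data (one unit per column)
var_total = arrests_std.var().sum()
Z.var()/var_total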
0 0.620060
1 0.247441
2 0.089141
3 0.043358
dtype: float64
Tying this all together, if $X$ is the data matrix and $\Phi$ is the loading matrix, then $Z = X\Phi$ is the score matrix (i.e. the $j$-th column of $Z$ is the $j$-th score vector).
Indeed
np.matmul(arrests_std, pca.components_.transpose()).head()
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 0.975660 | 1.122001 | -0.439804 | 0.154697 |
| Alaska | 1.930538 | 1.062427 | 2.019500 | -0.434175 |
| Arizona | 1.745443 | -0.738460 | 0.054230 | -0.826264 |
| Arkansas | -0.139999 | 1.108542 | 0.113422 | -0.180974 |
| California | 2.498613 | -1.527427 | 0.592541 | -0.338559 |
while
Z.head()
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.975660 | 1.122001 | -0.439804 | 0.154697 |
| 1 | 1.930538 | 1.062427 | 2.019500 | -0.434175 |
| 2 | 1.745443 | -0.738460 | 0.054230 | -0.826264 |
| 3 | -0.139999 | 1.108542 | 0.113422 | -0.180974 |
| 4 | 2.498613 | -1.527427 | 0.592541 | -0.338559 |
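The two tables agree because the standardized data already has zero column means. In general, `pca.transform` subtracts the fitted `pca.mean_` before projecting, so the plain matrix product only reproduces the scores for centered data. A minimal sketch on hypothetical uncentered data (all names below are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data with a nonzero mean (deliberately not centered)
rng = np.random.default_rng(2)
X_raw = rng.normal(loc=5.0, size=(50, 4))
pca_toy = PCA(n_components=4).fit(X_raw)

# Scores as computed by scikit-learn
Z_sklearn = pca_toy.transform(X_raw)

# Manual scores: center with the fitted mean, then multiply by the loading matrix
Z_manual = (X_raw - pca_toy.mean_) @ pca_toy.components_.T

# Skipping the centering step gives a different result
Z_uncentered = X_raw @ pca_toy.components_.T

matches_centered = np.allclose(Z_sklearn, Z_manual)
matches_uncentered = np.allclose(Z_sklearn, Z_uncentered)
print(matches_centered, matches_uncentered)
```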