ISLR notes and exercises from An Introduction to Statistical Learning

10. Unsupervised Learning

Exercise 8: Calculating PVE for USArrests dataset

Preparing the data

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set_style('whitegrid')
arrests = pd.read_csv('../../datasets/USArrests.csv', index_col=0)
arrests.head()
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
arrests.info()
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
Murder      50 non-null float64
Assault     50 non-null int64
UrbanPop    50 non-null int64
Rape        50 non-null float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB
# standardize the data
arrests_std = (arrests - arrests.mean())/arrests.std()
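As a quick sanity check, every column of the standardized frame should have mean 0 and sample standard deviation 1 (pandas' `std` uses the sample convention `ddof=1`, unlike numpy's default). A minimal sketch using synthetic data in place of the CSV:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the USArrests data (50 rows, 4 columns)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(loc=5, scale=3, size=(50, 4)),
                  columns=["Murder", "Assault", "UrbanPop", "Rape"])

# same standardization as above; pandas' mean/std are column-wise,
# and std defaults to the sample convention ddof=1
df_std = (df - df.mean()) / df.std()

print((df_std.mean().abs() < 1e-12).all())  # means are ~0
print(np.allclose(df_std.std(), 1.0))       # sample stds are 1
```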

a. Calculating PVE using a built-in method

from sklearn.decomposition import PCA

pca = PCA(n_components=arrests_std.shape[1])
pca.fit(arrests_std)
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
pca.explained_variance_ratio_
array([0.62006039, 0.24744129, 0.0891408 , 0.04335752])
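The first component alone explains about 62% of the variance, and the first two together about 87%. A scree plot makes this easier to see; the sketch below uses synthetic standardized data in place of USArrests (so the numbers differ), but the recipe is the same:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for the standardized USArrests data
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA(n_components=4).fit(X)
pve = pca.explained_variance_ratio_
cum_pve = np.cumsum(pve)

fig, ax = plt.subplots()
ax.plot(range(1, 5), pve, "o-", label="PVE")
ax.plot(range(1, 5), cum_pve, "s-", label="cumulative PVE")
ax.set_xlabel("principal component")
ax.set_ylabel("proportion of variance explained")
ax.legend()

print(np.isclose(cum_pve[-1], 1.0))  # all four components account for all the variance
```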

b. Calculating PVE by hand

Confusingly, the “components” in sklearn are what the book calls the principal component loading vectors, in this case $\phi_1, \dots, \phi_4$ (see Table 10.1). sklearn stores them as the rows of `pca.components_`, i.e. as $[\phi_1, \dots, \phi_4]^\top$, so transposing recovers the loading matrix $\phi = [\phi_1, \dots, \phi_4]$ with the $\phi_i$ as columns.

pca.components_.transpose()
array([[ 0.53589947,  0.41818087, -0.34123273,  0.6492278 ],
       [ 0.58318363,  0.1879856 , -0.26814843, -0.74340748],
       [ 0.27819087, -0.87280619, -0.37801579,  0.13387773],
       [ 0.54343209, -0.16731864,  0.81777791,  0.08902432]])
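The loading vectors are orthonormal (they are right singular vectors of the centered data matrix), which is easy to check directly. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data; any full-rank 50x4 matrix works for this check
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))

phi = PCA(n_components=4).fit(X).components_.T  # columns are the loading vectors

# phi^T phi = I: the loading vectors are unit length and mutually orthogonal
print(np.allclose(phi.T @ phi, np.eye(4)))
```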

The principal components (or really, the “scores”) are given by the transform

Z = pca.transform(arrests_std)
Z = pd.DataFrame(Z)
var_total = arrests_std.var().sum()  # total variance; equals 4 for standardized data
Z.var()/var_total
0    0.620060
1    0.247441
2    0.089141
3    0.043358
dtype: float64
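The same ratios drop out of the book's PVE formula (Equation 10.8), which sums squared scores rather than variances: $\mathrm{PVE}_m = \sum_i z_{im}^2 \big/ \sum_j \sum_i x_{ij}^2$. A sketch on synthetic standardized data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# synthetic stand-in for the standardized USArrests data
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(50, 4)),
                 columns=["Murder", "Assault", "UrbanPop", "Rape"])
X = (X - X.mean()) / X.std()

pca = PCA(n_components=4).fit(X)
Z = pca.transform(X)

# equation 10.8: PVE_m = sum_i z_im^2 / sum_ij x_ij^2 (X is already centered)
pve_108 = (Z ** 2).sum(axis=0) / (X.to_numpy() ** 2).sum()
print(np.allclose(pve_108, pca.explained_variance_ratio_))
```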

Tying this all together, if

$$X = \begin{pmatrix} -\ \ x_1\ \ - \\ \vdots \\ -\ \ x_n\ \ - \end{pmatrix}$$

is the data matrix and

$$\phi = \begin{pmatrix} | & & | \\ \phi_1 & \cdots & \phi_m \\ | & & | \end{pmatrix}$$

is the loading matrix then

$$\begin{aligned} X\phi &= \begin{pmatrix} x_1\cdot\phi_1 & \cdots & x_1\cdot\phi_m \\ \vdots & & \vdots \\ x_n\cdot\phi_1 & \cdots & x_n\cdot\phi_m \end{pmatrix} \\ &= \begin{pmatrix} | & & | \\ z_1 & \cdots & z_m \\ | & & | \end{pmatrix} \\ &= Z \end{aligned}$$

is the score matrix (i.e. $z_i = (z_{1i}, \dots, z_{ni})$ is the $i$-th score vector).

Indeed (note that the result inherits the DataFrame's original column labels, even though its columns are now the score vectors $z_1, \dots, z_4$)

np.matmul(arrests_std, pca.components_.transpose()).head()
Murder Assault UrbanPop Rape
Alabama 0.975660 1.122001 -0.439804 0.154697
Alaska 1.930538 1.062427 2.019500 -0.434175
Arizona 1.745443 -0.738460 0.054230 -0.826264
Arkansas -0.139999 1.108542 0.113422 -0.180974
California 2.498613 -1.527427 0.592541 -0.338559

while

Z.head()
0 1 2 3
0 0.975660 1.122001 -0.439804 0.154697
1 1.930538 1.062427 2.019500 -0.434175
2 1.745443 -0.738460 0.054230 -0.826264
3 -0.139999 1.108542 0.113422 -0.180974
4 2.498613 -1.527427 0.592541 -0.338559
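agrees with the matrix product above. The identity $X\phi = Z$ can be checked end-to-end with `np.allclose`; a minimal sketch on synthetic centered data:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic, column-centered stand-in for the standardized data
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)  # PCA centers internally, so center here to match

pca = PCA(n_components=4).fit(X)
Z = pca.transform(X)

# X times the loading matrix (columns = phi_i) reproduces the scores exactly
print(np.allclose(X @ pca.components_.T, Z))
```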