10. Unsupervised Learning

Exercise 8: Calculating PVE for `USArrests` dataset

Preparing the data
a. Calculating PVE using a builtin method
b. Calculating PVE by hand

Preparing the data

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns; sns.set_style('whitegrid')

arrests = pd.read_csv('../../datasets/USArrests.csv', index_col=0)
arrests.head()

	Murder	Assault	UrbanPop	Rape
Alabama	13.2	236	58	21.2
Alaska	10.0	263	48	44.5
Arizona	8.1	294	80	31.0
Arkansas	8.8	190	50	19.5
California	9.0	276	91	40.6

arrests.info()

<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
Murder      50 non-null float64
Assault     50 non-null int64
UrbanPop    50 non-null int64
Rape        50 non-null float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB

# standardize the data
arrests_std = (arrests - arrests.mean())/arrests.std()
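The standardization above relies on pandas' `.std()` dividing by $n - 1$ by default, whereas NumPy's `.std()` divides by $n$ unless `ddof=1` is passed. The sketch below illustrates this on the five rows printed by `arrests.head()` only (an assumption made to keep the example self-contained; the variable names `sample`, `sample_std` and `sample_std_np` are hypothetical and not part of the notebook).

```python
import numpy as np
import pandas as pd

# The five rows shown by arrests.head(); the full dataset has 50 rows.
sample = pd.DataFrame(
    {
        "Murder": [13.2, 10.0, 8.1, 8.8, 9.0],
        "Assault": [236, 263, 294, 190, 276],
        "UrbanPop": [58, 48, 80, 50, 91],
        "Rape": [21.2, 44.5, 31.0, 19.5, 40.6],
    },
    index=["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
)

# pandas: .std() uses ddof=1 (divide by n - 1) by default
sample_std = (sample - sample.mean()) / sample.std()

# NumPy: ddof=1 has to be requested explicitly to get the same result
values = sample.to_numpy(dtype=float)
sample_std_np = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

# After standardizing, every column has mean 0 and (ddof=1) variance 1
col_means = sample_std.mean()
col_vars = sample_std.var()
```

Because every standardized column has variance 1, the total variance of the standardized data equals the number of columns.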

a. Calculating PVE using a builtin method

from sklearn.decomposition import PCA

pca = PCA(n_components=arrests_std.shape[1])
pca.fit(arrests_std)

PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

pca.explained_variance_ratio_

array([0.62006039, 0.24744129, 0.0891408 , 0.04335752])

b. Calculating PVE by hand

Confusingly, the “components” in sklearn are what the book calls the principal component loading vectors, in this case $\phi_1, \dots, \phi_4$ (see Table 10.1). They are stored as the rows of `pca.components_`, so its transpose is the loading matrix $\phi = [\phi_1, \dots, \phi_4]$, whose columns are the loading vectors.

pca.components_.transpose()

array([[ 0.53589947,  0.41818087, -0.34123273,  0.6492278 ],
       [ 0.58318363,  0.1879856 , -0.26814843, -0.74340748],
       [ 0.27819087, -0.87280619, -0.37801579,  0.13387773],
       [ 0.54343209, -0.16731864,  0.81777791,  0.08902432]])

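The principal components (or really, the “scores”) are given by the transform. Dividing the variance of each score vector by the total variance of the standardized data gives the PVE.

Z = pca.transform(arrests_std)
Z = pd.DataFrame(Z)
# total variance of the standardized data (each column has variance 1)
var_total = arrests_std.var().sum()
Z.var()/var_total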

0    0.620060
1    0.247441
2    0.089141
3    0.043358
dtype: float64
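For centered data, the ratio of variances used here is the same as the book's formula $\mathrm{PVE}_m = \sum_i z_{im}^2 \big/ \sum_j \sum_i x_{ij}^2$, since the $n - 1$ factors cancel. The following is a minimal sketch of that calculation without sklearn, assuming the loading vectors are taken from the SVD of the standardized data. It uses only the five rows printed by `arrests.head()`, so the numbers will differ from the full-data values above.

```python
import numpy as np

# The five rows shown by arrests.head() (Murder, Assault, UrbanPop, Rape)
X = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [8.1, 294, 80, 31.0],
    [8.8, 190, 50, 19.5],
    [9.0, 276, 91, 40.6],
])

# standardize each column (ddof=1 matches the pandas default)
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# the right singular vectors of the centered data are the loading vectors
_, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
phi = Vt.T   # loading matrix: column m is the loading vector phi_m
Z = X @ phi  # score matrix: column m is the score vector z_m

# PVE of component m: sum of squared scores over total sum of squares
pve = (Z ** 2).sum(axis=0) / (X ** 2).sum()

# equivalent form used in the notebook: Var(z_m) / total variance
pve_from_var = Z.var(axis=0, ddof=1) / X.var(axis=0, ddof=1).sum()
```

Both forms agree, and the entries of `pve` sum to 1 because the loading matrix is orthogonal.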

Tying this all together, if

$X = \begin{pmatrix} -\ \ x_1\ \ - \\ \vdots \\ -\ \ x_n \ \ - \end{pmatrix}$

is the data matrix and

$\phi = \begin{pmatrix} | & & |\\ \phi_1 & \cdots & \phi_m\\ | & & |\\ \end{pmatrix}$

is the loading matrix then

$\begin{aligned} X\phi &= \begin{pmatrix} x_1\cdot\phi_1 & \cdots & x_1\cdot\phi_m \\ \vdots & & \vdots\\ x_n\cdot\phi_1 & \cdots & x_n\cdot \phi_m \end{pmatrix}\\ &= \begin{pmatrix} | & & |\\ z_1 & \cdots & z_m\\ | & & |\\ \end{pmatrix}\\ &= Z \end{aligned}$

is the score matrix (i.e. $z_i = (z_{1i}, \dots, z_{ni})$ is the $i$-th score vector).

Indeed

np.matmul(arrests_std, pca.components_.transpose()).head()

	Murder	Assault	UrbanPop	Rape
Alabama	0.975660	1.122001	-0.439804	0.154697
Alaska	1.930538	1.062427	2.019500	-0.434175
Arizona	1.745443	-0.738460	0.054230	-0.826264
Arkansas	-0.139999	1.108542	0.113422	-0.180974
California	2.498613	-1.527427	0.592541	-0.338559

matches the scores returned by `pca.transform`:

Z.head()

	0	1	2	3
0	0.975660	1.122001	-0.439804	0.154697
1	1.930538	1.062427	2.019500	-0.434175
2	1.745443	-0.738460	0.054230	-0.826264
3	-0.139999	1.108542	0.113422	-0.180974
4	2.498613	-1.527427	0.592541	-0.338559
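The two tables agree because `arrests_std` already has zero column means. In general, `pca.transform` subtracts the fitted column means (`pca.mean_`) before projecting, so $X\phi$ reproduces the scores only for centered data. The sketch below checks this on synthetic, deliberately uncentered data (an assumption for illustration; it is not the `USArrests` data).

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data with column means near 5, i.e. NOT centered
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4)) + 5.0

pca = PCA(n_components=4).fit(X)
phi = pca.components_.T  # loading matrix, columns are loading vectors

Z = pca.transform(X)                # scores as computed by sklearn
Z_centered = (X - pca.mean_) @ phi  # center first, then project
Z_uncentered = X @ phi              # project without centering

# only the centered projection reproduces pca.transform
matches_centered = np.allclose(Z, Z_centered)
matches_uncentered = np.allclose(Z, Z_uncentered)
```

Here `matches_centered` is true while `matches_uncentered` is not, which is why standardizing (and hence centering) the data first lets the plain matrix product above recover `Z`.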

islr notes and exercises from An Introduction to Statistical Learning
