ISLR notes and exercises, from An Introduction to Statistical Learning

10. Unsupervised Learning

Exercise 1: Part of the argument that the $K$-means clustering algorithm (Algorithm 10.1) reaches a local optimum

a. Proving identity (10.12)

We’ll prove this for the case $p = 1$; the general result then follows by applying the identity to each of the $p$ coordinates and summing.

Starting with the left-hand side of (10.12), we have

$$
\begin{aligned}
\frac{1}{|C_k|}\sum_{i, i' \in C_k} (x_i - x_{i'})^2
&= \frac{1}{|C_k|}\sum_{\substack{i, i' \in C_k \\ i \neq i'}} (x_i - x_{i'})^2\\
&= \frac{2}{|C_k|}\sum_{\substack{i, i' \in C_k \\ i < i'}} (x_i - x_{i'})^2
\end{aligned}
$$

Now, we work with the right-hand side.

$$
\begin{aligned}
2\sum_{i \in C_k} (x_i - \overline{x}_k)^2
&= 2 \sum_{i \in C_k} \left(x_i - \frac{1}{|C_k|}\sum_{i' \in C_k} x_{i'}\right)^2\\
&= \frac{2}{N^2} \sum_{i \in C_k} \left((N - 1)x_i - \sum_{i' \neq i} x_{i'}\right)^2
\end{aligned}
$$

where we have let $N = |C_k|$ for readability.

Furthermore

$$
\left((N - 1)x_i - \sum_{i' \neq i} x_{i'}\right)^2
= (N - 1)^2 x_i^2 - \sum_{i' \neq i} 2(N - 1)x_i x_{i'} + \sum_{i' \neq i} x_{i'}^2 + \sum_{\substack{i' < i'' \\ i', i'' \neq i}} 2 x_{i'} x_{i''}
$$

Summing over $i \in C_k$ and collecting, for each $i$, the coefficient of $x_i^2$ and, for each pair $i < i'$, the coefficient of $x_i x_{i'}$, the right-hand side becomes

$$
\begin{aligned}
& \frac{2}{N^2}\sum_{i \in C_k} \left((N - 1)^2 x_i^2 - \sum_{i' \neq i} 2(N - 1)x_i x_{i'} + \sum_{i' \neq i} x_{i'}^2 + \sum_{\substack{i' < i'' \\ i', i'' \neq i}} 2 x_{i'} x_{i''}\right)\\
&= \frac{2}{N^2}\left(\sum_{i} N(N - 1)x_i^2 + \sum_{i < i'} \big(-4(N - 1) + 2(N - 2)\big)\, x_i x_{i'}\right)\\
&= \frac{2}{N^2}\left(\sum_{i} N(N - 1)x_i^2 - 2N\sum_{i < i'} x_i x_{i'}\right)\\
&= \frac{2}{N}\left(\sum_{i < i'} \big(x_i^2 + x_{i'}^2\big) - 2\sum_{i < i'} x_i x_{i'}\right)\\
&= \frac{2}{|C_k|}\sum_{i < i'} \left(x_{i} - x_{i'}\right)^2
\end{aligned}
$$

which is exactly the expression we obtained for the left-hand side, proving the identity.
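As a quick numerical sanity check of (10.12) — not part of the exercise, just a sketch with made-up data — we can compare the two sides for a random cluster:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))    # a hypothetical cluster of 10 observations with p = 3

# left-hand side: average squared distance between all pairs of observations in the cluster
lhs = sum(np.sum((xi - xj) ** 2) for xi in X for xj in X) / len(X)

# right-hand side: twice the sum of squared distances to the cluster mean
rhs = 2 * np.sum((X - X.mean(axis=0)) ** 2)

print(np.isclose(lhs, rhs))     # True
```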

b. Arguing the algorithm decreases the objective

The identity (10.12) shows that the objective (10.11) is equal to

$$
\sum_{k =1}^K \left(2\sum_{i\in C_k} || x_i - \mu_{k} ||^2\right)
$$

where $\mu_k = (\overline{x}_{k1}, \dots, \overline{x}_{kp})$ is the $k$-th cluster centroid.

At each iteration the algorithm alternates two steps, neither of which can increase this quantity. In step 2(a) each centroid $\mu_k$ is recomputed as the mean of the observations currently assigned to cluster $k$, and the mean minimizes $\sum_{i \in C_k} || x_i - \mu ||^2$ over all choices of $\mu$. In step 2(b) each observation $x_i$ is reassigned to the cluster $k$ whose centroid is closest, i.e. the $k$ for which $|| x_i - \mu_k ||^2$ is minimal. That is, if $k_m$ denotes the cluster assigned to observation $x_i$ at iteration $m$, and $\mu_{k_m}$ the centroid of cluster $k_m$, then

$$
|| x_i - \mu_{k_{m + 1}} ||^2 \leqslant || x_i - \mu_{k_{m}} ||^2
$$

for all $i$. Thus the objective above never increases from one iteration to the next, so the algorithm reaches a local optimum.
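As a small illustration of this (synthetic data, not part of the exercise), we can run the two steps of Algorithm 10.1 by hand and watch the objective above: it never increases.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic data: two loose groups of 25 points each in the plane
X = np.vstack([rng.normal(0, 1, size=(25, 2)), rng.normal(4, 1, size=(25, 2))])
labels = rng.integers(0, 2, size=len(X))    # step 1: random initial assignment, K = 2

def objective(X, labels):
    # 2 * total within-cluster sum of squares, i.e. (10.11) rewritten via (10.12)
    return sum(2 * ((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

for _ in range(5):
    centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])                  # step 2(a)
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)   # step 2(b)
    print(round(objective(X, labels), 3))   # non-increasing from one iteration to the next
```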

Exercise 2: Sketching a dendrogram

a. Complete linkage dendrogram

%matplotlib inline
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import numpy as np
dist_sq = np.array([[0, 0.3, 0.4, 0.7], [0.3, 0, 0.5, 0.8],
                       [0.4, 0.5, 0, 0.45], [0.7, 0.8, 0.45, 0]])
y = squareform(dist_sq)
Z = linkage(y, method='complete')  # y is already a condensed dissimilarity matrix, so no metric is needed
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x1', 'x2', 'x3', 'x4'], leaf_rotation=45)
{'icoord': [[5.0, 5.0, 15.0, 15.0],
  [25.0, 25.0, 35.0, 35.0],
  [10.0, 10.0, 30.0, 30.0]],
 'dcoord': [[0.0, 0.3, 0.3, 0.0],
  [0.0, 0.45, 0.45, 0.0],
  [0.3, 0.8, 0.8, 0.45]],
 'ivl': ['x1', 'x2', 'x3', 'x4'],
 'leaves': [0, 1, 2, 3],
 'color_list': ['g', 'r', 'b']}

png

b. Single linkage dendrogram

Z = linkage(y, method='single')  # y is already a condensed dissimilarity matrix, so no metric is needed
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x1', 'x2', 'x3', 'x4'], leaf_rotation=90)
{'icoord': [[25.0, 25.0, 35.0, 35.0],
  [15.0, 15.0, 30.0, 30.0],
  [5.0, 5.0, 22.5, 22.5]],
 'dcoord': [[0.0, 0.3, 0.3, 0.0],
  [0.0, 0.4, 0.4, 0.3],
  [0.0, 0.45, 0.45, 0.4]],
 'ivl': ['x4', 'x3', 'x1', 'x2'],
 'leaves': [3, 2, 0, 1],
 'color_list': ['g', 'b', 'b']}

png

c. Clusters from the complete linkage dendrogram

The clusters are

$$
\{x_1, x_2\}, \{x_3, x_4\}
$$

d. Clusters from the single linkage dendrogram

The clusters are

$$
\{x_1, x_2, x_3\}, \{x_4\}
$$
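As a check on parts c. and d. — a small sketch reusing `y` and `linkage` from part a. — scipy's `fcluster` can cut each tree into two clusters:

```python
from scipy.cluster.hierarchy import fcluster

for method in ('complete', 'single'):
    Z = linkage(y, method=method)
    # cut the tree into two clusters; labels are 1-based cluster ids for x1, ..., x4
    print(method, fcluster(Z, t=2, criterion='maxclust'))
```

The complete linkage cut groups $\{x_1, x_2\}$ and $\{x_3, x_4\}$, and the single linkage cut groups $\{x_1, x_2, x_3\}$ and $\{x_4\}$, matching the answers above.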

e. An equivalent dendrogram with the leaves repositioned

Just exchange $x_1 \mapsto x_3$, $x_2 \mapsto x_4$ in the diagram in part a.

Exercise 3: Manual example of $K$-means clustering for $K = 2$, $n = 6$, $p = 2$

import pandas as pd
import seaborn as sns; sns.set_style('whitegrid')

df = pd.DataFrame({'X1': [1, 1, 0, 5, 6, 4], 'X2': [4, 3, 4, 1, 2, 0]},
                    index=range(1, 7))
df
|   | X1 | X2 |
|---|----|----|
| 1 | 1  | 4  |
| 2 | 1  | 3  |
| 3 | 0  | 4  |
| 4 | 5  | 1  |
| 5 | 6  | 2  |
| 6 | 4  | 0  |

a. Plot the observations

sns.scatterplot(x='X1', y='X2', data=df, color='r')
<matplotlib.axes._subplots.AxesSubplot at 0x8168b3d30>

png

b. Initialize with random cluster assignment

np.random.seed(33)
df['cluster'] = np.random.choice([1, 2], replace=True, size=6)
df['cluster']
1    1
2    2
3    1
4    1
5    1
6    2
Name: cluster, dtype: int64
sns.scatterplot(x='X1', y='X2', data=df, hue='cluster', legend='full', 
                palette=sns.color_palette(n_colors=2))
<matplotlib.axes._subplots.AxesSubplot at 0x8168bad68>

png

c. Compute the centroid for each cluster

def get_centroids(df):
    # compute centroids
    c1, c2 = df[df['cluster'] == 1], df[df['cluster'] == 2]
    cent_1 = [c1['X1'].mean(), c1['X2'].mean()]
    cent_2 = [c2['X1'].mean(), c2['X2'].mean()]

    return (cent_1, cent_2)
cent_1, cent_2 = get_centroids(df)

d. Assign observations to clusters by centroid

def d_cent(cent):
    def f(x):
        return np.linalg.norm(x - cent)
    return f


def assign_to_centroid(cent_1, cent_2):
    def f(x):
        d_1, d_2 = d_cent(cent_1)(x), d_cent(cent_2)(x)
        return 1 if d_1 < d_2 else 2
    return f

def assign_to_clusters(df):
    cent_1, cent_2 = get_centroids(df)
    df = df.drop(columns=['cluster'])
    return df.apply(assign_to_centroid(cent_1, cent_2), axis=1)
assign_to_clusters(df)
1    1
2    1
3    1
4    2
5    1
6    2
dtype: int64

e. Iterate c., d. until cluster assignments are stable

def get_final_clusters(df):
    # iterate reassignment until the cluster labels stop changing
    cl_before, cl_after = df['cluster'], assign_to_clusters(df)
    while not (cl_before == cl_after).all():
        df.loc[:, 'cluster'] = cl_after
        cl_before, cl_after = cl_after, assign_to_clusters(df)
    return cl_after
get_final_clusters(df)
1    1
2    1
3    1
4    2
5    2
6    2
dtype: int64
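As a cross-check — not part of the exercise, and assuming scikit-learn is available — `sklearn.cluster.KMeans` with $K = 2$ gives the same partition of the six observations:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[['X1', 'X2']])
# labels_ uses 0/1 and may be permuted relative to our 1/2 labels,
# but the grouping of the observations is the same
km.labels_
```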

f. Plot clusters

sns.scatterplot(x='X1', y='X2', data=df, hue='cluster', legend='full',
                palette=sns.color_palette(n_colors=2))
<matplotlib.axes._subplots.AxesSubplot at 0x1a191f0198>

png

Exercise 4: Comparing single and complete linkage

a. For clusters $\{1, 2, 3\}$ and $\{4, 5\}$

In both dendrograms, the clusters $A = \{1, 2, 3\}$ and $B = \{4, 5\}$ fuse when they are the most similar pair among all clusters present at that point in the algorithm.

In the single linkage dendrogram, this fusion occurs at the height $h_{sing}(A, B)$ equal to the minimum dissimilarity $d(a, b)$ over pairs $(a, b) \in A \times B$. In the complete linkage dendrogram, it occurs at the height $h_{comp}(A, B)$ equal to the maximum dissimilarity over those pairs.

Since the maximum of the inter-cluster dissimilarities is at least the minimum, $h_{comp}(A, B) \geqslant h_{sing}(A, B)$: the complete linkage fusion cannot occur lower than the single linkage fusion, and the two occur at the same height only if all the dissimilarities between $A$ and $B$ are equal. Without the actual dissimilarities there is not enough information to tell whether the fusions occur at the same height or the complete linkage fusion occurs strictly higher.

To make this argument a bit more concrete, here are two examples:

An example where $h_{comp}(A, B) > h_{sing}(A, B)$

Consider the following possible subset of observations. Suppose all other observations are far away.

array = np.array([[0, 0], [0, 1], [0, 2], 
                  [4, 1], [4, 2], 
                  [7, 2], [11, 2]])
df1 = pd.DataFrame(array, columns=['X1', 'X2'], index=range(1, 8))
df1
|   | X1 | X2 |
|---|----|----|
| 1 | 0  | 0  |
| 2 | 0  | 1  |
| 3 | 0  | 2  |
| 4 | 4  | 1  |
| 5 | 4  | 2  |
| 6 | 7  | 2  |
| 7 | 11 | 2  |
sns.scatterplot(x='X1', y='X2', data=df1)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1b1abc88>

png

fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot single linkage dendrogram
plt.subplot(1, 2, 1)
Z = linkage(df1[['X1', 'X2']], method='single', metric='Euclidean')
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x' + str(i) for i in range(1, 8)], leaf_rotation=45)

# plot complete linkage dendrogram
plt.subplot(1, 2, 2)
Z = linkage(df1[['X1', 'X2']], method='complete', metric='Euclidean')
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x' + str(i) for i in range(1, 8)], leaf_rotation=45)

fig.tight_layout()

png

In this example, $A = \{1, 2, 3\}$ and $B = \{4, 5\}$ are indeed both clusters: they are formed by height 1 in the single linkage dendrogram and by height 2 in the complete linkage dendrogram.

In this case the height at which they fuse is greater for complete linkage than for single linkage:

$$
h_{comp}(A, B) \approx 11 > h_{sing}(A, B) = 4
$$
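The fusion heights can also be read off the linkage matrices rather than the plots: the third column of scipy's linkage matrix holds the height of each successive merge.

```python
# merge heights for the observations above (df1 as defined in this example)
print(linkage(df1[['X1', 'X2']], method='single')[:, 2])
print(linkage(df1[['X1', 'X2']], method='complete')[:, 2])
```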

An example where $h_{comp}(A, B)$ and $h_{sing}(A, B)$ are nearly equal

We use the same observations as in the last example, except that the point $x_7$ changes: $[11, 2] \mapsto [9, 2]$.

df1.loc[7, 'X1'] = 9
df1
|   | X1 | X2 |
|---|----|----|
| 1 | 0  | 0  |
| 2 | 0  | 1  |
| 3 | 0  | 2  |
| 4 | 4  | 1  |
| 5 | 4  | 2  |
| 6 | 7  | 2  |
| 7 | 9  | 2  |
fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot single linkage dendrogram
plt.subplot(1, 2, 1)
Z = linkage(df1[['X1', 'X2']], method='single', metric='Euclidean')
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x' + str(i) for i in range(1, 8)], leaf_rotation=45)

# plot complete linkage dendrogram
plt.subplot(1, 2, 2)
Z = linkage(df1[['X1', 'X2']], method='complete', metric='Euclidean')
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=['x' + str(i) for i in range(1, 8)], leaf_rotation=45)

fig.tight_layout()

png

Again in this example, $A = \{1, 2, 3\}$ and $B = \{4, 5\}$ are both formed as clusters: at height 1 in the single linkage dendrogram and by height 2 in the complete linkage dendrogram. In the single linkage dendrogram, however, $\{4, 5\}$ now fuses with $\{6, 7\}$ before it joins $\{1, 2, 3\}$.

In this case the two fusion heights are much closer than in the previous example, but the complete linkage fusion still occurs (slightly) higher:

$$
h_{comp}(A, B) = \sqrt{20} \approx 4.5 > h_{sing}(A, B) = 4
$$

b. For clusters $\{5\}$ and $\{6\}$

In this case, with the singleton clusters $A = \{5\}$ and $B = \{6\}$, single and complete linkage coincide:

$$
d_{sing}(A, B) = d_{comp}(A, B) = d(5, 6)
$$

The fusions may happen at different points in the merge sequence: roughly, $A$ and $B$ are fused in the single linkage dendrogram when

$$
d(5, 6) \leqslant \underset{\text{clusters}\ C}{\min}\,\{\min_{c \in C} d(5, c)\}
$$

and in the complete linkage dendrogram when

$$
d(5, 6) \leqslant \underset{\text{clusters}\ C}{\min}\,\{\max_{c \in C} d(5, c)\}
$$

But whenever these fusions happen, the height of each one is the dissimilarity between the two singletons, so

$$
h_{comp}(A, B) = h_{sing}(A, B) = d(5, 6)
$$

and the two clusters fuse at the same height in both dendrograms.
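A minimal numerical illustration with made-up dissimilarities (not data from the exercise): when $d(5, 6)$ is the smallest dissimilarity, both linkages fuse $\{5\}$ and $\{6\}$ first, and at exactly that height.

```python
# hypothetical dissimilarities for 6 observations, with d(5, 6) = 0.1 the smallest entry
D = np.full((6, 6), 1.0)
np.fill_diagonal(D, 0.0)
D[4, 5] = D[5, 4] = 0.1

for method in ('single', 'complete'):
    Z = linkage(squareform(D), method=method)
    # first merge: observations 5 and 6 (0-indexed as 4 and 5), at height 0.1 in both cases
    print(method, Z[0])
```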

Exercise 5: $K$-means clustering in Figure 10.14

For the left-hand scaling, we would expect the orange customer to end up in a cluster by themselves and the other 7 customers in the other cluster. The orange customer has a minimum distance of 3 from any other customer, while each of the other 7 customers is a distance of 1 from some other customer in that group.

For the middle scaling, we would expect to see the customers that bought computers (yellow, dark blue, red, magenta) in one cluster and the others in the other.

For the right-hand scaling, we would expect to see the same as for the middle scaling.

We can verify these expectations with a computation

Left-hand scaling

customers = ['black', 'orange', 'lt_blue', 'green', 'yellow', 'dk_blue',
             'red', 'magenta']
purchases = [[8, 0], [11, 0], [7, 0], [6, 0], [5, 1], [6, 1], [7, 1], 
               [8, 1]]
df_left = pd.DataFrame(purchases, columns=['socks', 'computers'], index=customers)
df_left
|         | socks | computers |
|---------|-------|-----------|
| black   | 8     | 0         |
| orange  | 11    | 0         |
| lt_blue | 7     | 0         |
| green   | 6     | 0         |
| yellow  | 5     | 1         |
| dk_blue | 6     | 1         |
| red     | 7     | 1         |
| magenta | 8     | 1         |
fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot single linkage dendrogram
plt.subplot(1, 2, 1)
Z = linkage(df_left, method='single', metric='Euclidean')
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

# plot complete linkage dendrogram
plt.subplot(1, 2, 2)
Z = linkage(df_left, method='complete', metric='Euclidean')
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

fig.tight_layout()

png

For both single and complete linkages, the clusters were as expected

Middle scaling

We don’t get the same plot as the book when we scale by the standard deviation (or by the variance), so we’ll code the scaled values by hand again (with a bit of guesswork)

df_mid = df_left / df_left.std()
df_mid.loc[:, 'socks'] = [1.0, 1.4, 0.9, 0.78, 0.62, 0.78, 0.9, 1.0]
df_mid.loc[:, 'computers'] = [0, 0, 0, 0, 1.4, 1.4, 1.4, 1.4]
df_mid
|         | socks | computers |
|---------|-------|-----------|
| black   | 1.00  | 0.0       |
| orange  | 1.40  | 0.0       |
| lt_blue | 0.90  | 0.0       |
| green   | 0.78  | 0.0       |
| yellow  | 0.62  | 1.4       |
| dk_blue | 0.78  | 1.4       |
| red     | 0.90  | 1.4       |
| magenta | 1.00  | 1.4       |
fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot single linkage dendrogram
plt.subplot(1, 2, 1)
Z = linkage(df_mid, method='single', metric='Euclidean')
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

# plot complete linkage dendrogram
plt.subplot(1, 2, 2)
Z = linkage(df_mid, method='complete', metric='Euclidean')
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

fig.tight_layout()

png

For both single and complete linkages, the clusters were again as expected

Right-hand scaling

We’ll assume \$2 per pair of socks and \$2000 per computer

df_right = df_left.copy()
df_right.loc[:, 'socks'] = 2 * df_right['socks']
df_right.loc[:, 'computers'] = 2000 * df_right['computers']
df_right
|         | socks | computers |
|---------|-------|-----------|
| black   | 16    | 0         |
| orange  | 22    | 0         |
| lt_blue | 14    | 0         |
| green   | 12    | 0         |
| yellow  | 10    | 2000      |
| dk_blue | 12    | 2000      |
| red     | 14    | 2000      |
| magenta | 16    | 2000      |
fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# plot single linkage dendrogram
plt.subplot(1, 2, 1)
Z = linkage(df_right, method='single', metric='Euclidean')
plt.title('Single Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

# plot complete linkage dendrogram
plt.subplot(1, 2, 2)
Z = linkage(df_right, method='complete', metric='Euclidean')
plt.title('Complete Linkage Dendrogram')
plt.xlabel('sample')
plt.ylabel('Euclidean Distance')
dendrogram(Z, labels=customers, leaf_rotation=45)

fig.tight_layout()

png

Again for both single and complete linkages, the clusters were as expected
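The checks above use hierarchical clustering as a stand-in. Since the exercise asks about $K$-means specifically, we can also run $K = 2$ $K$-means on each scaling directly (a sketch assuming scikit-learn is available); it should recover the partitions described above.

```python
from sklearn.cluster import KMeans

for name, data in [('left', df_left), ('middle', df_mid), ('right', df_right)]:
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print(name, dict(zip(customers, km.labels_)))
```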

Exercise 6: A case of PCA

a.

This means that the proportion of variance explained by the first principal component is 0.10. That is, the ratio of the (sample) variance of the first principal component scores to the total (sample) variance of the data is 0.10.
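As a concrete illustration of what a proportion of variance explained of 0.10 means — synthetic data, not the data from the exercise, and assuming scikit-learn is available — the ratio can be computed either from `explained_variance_ratio_` or by hand from the scores:

```python
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 5))   # hypothetical data matrix, n = 100, p = 5
pca = PCA().fit(X)

# proportion of variance explained by the first principal component ...
pve_1 = pca.explained_variance_ratio_[0]

# ... equals the variance of the first score vector divided by the total sample variance
scores = pca.transform(X)
pve_1_by_hand = scores[:, 0].var(ddof=1) / X.var(axis=0, ddof=1).sum()

print(pve_1, pve_1_by_hand)   # the two values agree
```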