islr notes and exercises from An Introduction to Statistical Learning

10. Unsupervised Learning

Exercise 7: Comparison of correlation based distance and Euclidean distance on USArrests dataset.

For this exercise, we’ll just show the proportionality holds in general.

The authors mention (p397) that “this is an unusual use of correlation, which is normally computed between variables; here it is computed between observation profiles”. It appears the authors intended that for observations xi,xjRpx_i, x_j \in \mathbb{R}^p,

rij=k=1p(xikxi)(xjkxj)k=1p(xikxi)2k=1p(xjkxj)2=(xixi)(xjxj)xixi2xjxj2 \begin{aligned} r_{ij} &= \frac{\sum_{k = 1}^p (x_{ik} - \overline{x}_i)(x_{jk} - \overline{x}_j)}{\sqrt{\sum_{k = 1}^p(x_{ik} - \overline{x}_i)^2 \sum_{k = 1}^p(x_{jk} - \overline{x}_j)^2}}\\ &= \frac{(x_i - \overline{x}_i)^\top (x_j - \overline{x}_j)}{||x_i - \overline{x}_i||^2||x_j - \overline{x}_j||^2} \end{aligned}

where xi=1pk=1pxik\overline{x}_i = \frac{1}{p}\sum_{k = 1}^p x_{ik} is the mean over the features. This can be seen the correlation of the pairs (xik,xjk)(x_{ik}, x_{jk}), k=1,,pk = 1, \dots, p, (hence the use of the word “unusual” - the feature index has become a sample index).

If the data has been standardized then

xi=xj=0\overline{x}_i = \overline{x}_j = 0

and

xixi2=xjxj2=1||x_i - \overline{x}_i||^2 = ||x_j - \overline{x}_j||^2 = 1

so the correlation

rij=k=1pxikxjk=xixjr_{ij} = \sum_{k = 1}^p x_{ik}x_{jk} = x_i^\top x_j

and the squared euclidean distance is

xixj2=xixi2xixj+xjxj=2(1xixj)=2(1rij) \begin{aligned} ||x_i - x_j||^2 &= x_i^\top x_i - 2x_i^\top x_j + x_j^\top x_j \\ &= 2(1 - x_i^\top x_j)\\ &= 2(1 - r_{ij}) \end{aligned}

Hence

1rijxixj21 - r_{ij} \propto ||x_i - x_j||^2

as claimed.