islr notes and exercises from An Introduction to Statistical Learning

10. Unsupervised Learning


The Challenge of Unsupervised Learning

  • Unsupervised learning is learning in the absence of a response. It is often part of exploratory data analysis (EDA).
  • Without a response, we aren’t interested in prediction or classification; rather, we are interested in discovering interesting things about the data. This can be difficult because such a goal is somewhat subjective.
  • Establishing objective performance criteria for unsupervised learning can also be challenging.

Principal Components Analysis

  • Principal components were discussed earlier as a dimension reduction method in the context of regression. They provide a low-dimensional representation of the data that contains as much of the variation as possible.
  • Principal Components Analysis is the process of computing principal components and using them in data analysis.

What Are Principal Components?

  • The first principal component of features $X_1, \dots, X_p$ is the normalized linear combination $Z_1 = \hat{\phi}_1^\top X$, where $X = (X_1, \dots, X_p)$, $\hat{\phi}_1 \in \mathbb{R}^p$, and $||\hat{\phi}_1|| = 1$. The vector $\hat{\phi}_1$ is called the loading vector (its entries are called the loadings) and
$$\hat{\phi}_1 = \underset{\underset{||\phi|| = 1}{\phi \in \mathbb{R}^p}}{\text{argmax}}\left(\frac{1}{n}\sum_{i=1}^n \left(\phi^\top x_i\right)^2\right)$$

  • Assume we have data $x_1, \dots, x_n$ with features $X_1, \dots, X_p$, centered so that each feature has mean zero. Writing $z_{i1} = \phi^\top x_i$, the objective in the above optimization problem can be rewritten
$$\hat{\phi}_1 = \underset{\underset{||\phi|| = 1}{\phi \in \mathbb{R}^p}}{\text{argmax}}\left(\frac{1}{n}\sum_{i=1}^n z_{i1}^2\right)$$
which is just the sample variance of the $z_{i1}$ (since the data are centered). The $z_{i1}$ are called the scores of the first principal component $Z_1$ (see the sketch at the end of this section).
  • The first principal component has a nice geometric interpretation 1. The loading vector $\phi_1$ defines a direction in $\mathbb{R}^p$ along which the variation of the data is maximized. The principal component scores $z_{i1}$ are the projections of the data $x_i$ onto $\phi_1$ – that is, the components of the $x_i$ along this direction.
  • For $j = 2, \dots, p$ we can compute the loading vector $\hat{\phi}_j$ of the $j$-th principal component recursively as

$$\hat{\phi}_j = \underset{\underset{||\phi|| = 1}{\phi \in \mathbb{R}^p}}{\text{argmax}}\left(\frac{1}{n}\sum_{i=1}^n \left(\phi^\top x_i\right)^2\right)$$

subject to 2

$$\phi_j^\top \phi_l = 0 \qquad \text{for } l = 1, \dots, j - 1.$$

  • We can plot the principal components against each other for a low-dimensional visualization of the data. For example, a biplot plots both the scores and the loading vectors 3.
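To make the definitions of loadings and scores concrete, here is a minimal sketch using scikit-learn on a toy data matrix (the array `X` and all variable names are illustrative, not from the book):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy data: n = 100 observations, p = 3 features
X = X - X.mean(axis=0)                 # center each feature to have mean zero

pca = PCA()
scores = pca.fit_transform(X)          # columns are the score vectors z_1, ..., z_p

phi_1 = pca.components_[0]             # first loading vector (first row of components_)
z_1 = scores[:, 0]                     # scores of the first principal component

print(np.linalg.norm(phi_1))           # ~1.0: the loading vector is normalized
print(np.allclose(z_1, X @ phi_1))     # True: scores are projections onto phi_1
```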

Another Interpretation of Principal Components

  • Principal components can also be seen as providing low-dimensional surfaces that are “closest” to the observations.
  • The span of the first $M$ loading vectors $\phi_1, \dots, \phi_M$ can be seen as the $M$-dimensional linear subspace of $\mathbb{R}^p$ which is closest to the observations $x_i$ 4
  • Together, the principal components $Z_1, \dots, Z_M$ and loading vectors $\phi_1, \dots, \phi_M$ can be seen as an $M$-dimensional approximation 5 of each observation

$$x_{ij} \approx \sum_{m = 1}^M z_{im}\phi_{jm}$$
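A quick numerical check of this approximation, sketched below with NumPy and scikit-learn (the array names and the choice M = 2 are illustrative): reconstruct the centered data from the first M score and loading vectors and compare it to the original.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)                    # centered data

M = 2
pca = PCA(n_components=M).fit(Xc)
Z = pca.transform(Xc)                      # n x M score matrix, columns z_m
Phi = pca.components_                      # M x p loading matrix, rows phi_m

X_approx = Z @ Phi                         # x_ij ≈ sum_m z_im * phi_jm
print(np.mean((Xc - X_approx) ** 2))       # reconstruction error; 0 when M = min(n - 1, p)
```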

More on PCA

  • PCA requires that the variables are centered to have mean zero.
  • PCA is sensitive to scaling, so we usually scale each variable to have standard deviation 1.
  • Scaling to standard deviation 1 is particularly important when the variables are measured in different units; if they are measured in the same units, however, we may not wish to do this.
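A minimal illustration of this sensitivity (the two-feature matrix `X`, with columns on very different scales, is a made-up example): the leading loading vector changes noticeably depending on whether the features are standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = 100.0 * (x1 + 0.5 * rng.normal(size=200))   # correlated with x1, much larger scale
X = np.column_stack([x1, x2])

phi_raw = PCA(n_components=1).fit(X).components_[0]
phi_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print(phi_raw)   # dominated by the large-scale feature
print(phi_std)   # roughly equal weight on both features after standardizing
```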
Uniqueness of the Principal Components

The loading vectors and score vectors are unique up to sign flips.

The Proportion of Variance Explained
  • How much of the information in a given data set is lost by projecting onto the principal components? More precisely, what is the proportion of variance explained (PVE) by each principal component?
  • Assuming centered data, the total variance 6 is
$$\text{var}_{total} := \sum_{j = 1}^p \mathbb{V}(X_j) = \sum_{j = 1}^p \left(\frac{1}{n} \sum_{i = 1}^n x_{ij}^2 \right)$$

    while the variance explained by the $m$-th principal component is
$$\text{var}_{m} := \frac{1}{n} \sum_{i=1}^n z_{im}^2 = \frac{1}{n} \sum_{i = 1}^n \left(\sum_{j = 1}^p \phi_{jm}x_{ij} \right)^2.$$

  • The PVE of the $m$-th component is then

    $$\text{PVE}_m := \frac{\text{var}_{m}}{\text{var}_{total}}$$

    and the cumulative PVE of the first $M$ components 7 is

    $$\sum_{m = 1}^M \text{PVE}_m$$
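scikit-learn exposes these quantities directly; a minimal sketch (the centered toy matrix `X` is illustrative) comparing the library's values to the formulas above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

pca = PCA().fit(X)
Z = pca.transform(X)

var_total = np.sum(X ** 2) / len(X)        # total variance of the centered data
var_m = np.sum(Z ** 2, axis=0) / len(X)    # variance explained by each component
pve = var_m / var_total

print(pve)                                  # PVE_m for each component
print(np.cumsum(pve))                       # cumulative PVE of the first M components
print(pca.explained_variance_ratio_)        # same values: the 1/n factors cancel in the ratio
```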

Deciding How Many Principal Components to Use
  • In general we may not be interested in using all of the principal components, but just enough to get a “good” understanding of the data 8.
  • A scree plot, which plots $\text{PVE}_m$ vs. $m$, is one visual method for identifying a good number of principal components (see the sketch after this list). We look for an elbow - a value of $m$ such that $\text{PVE}_m$ drops off thereafter.
  • In general, the question of how many principal components are “enough” is ill-defined, and depends on the application and the dataset. We may look at the first few principal components in order to find interesting patterns. If none are evident, then we conclude further components are unlikely to be of use. If some are evident, we continue looking at components until no more interesting patterns are found.
  • In an unsupervised setting, these methods are all ad hoc, and reflect the fact that PCA is generally used in EDA 9.
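A minimal scree-plot sketch with matplotlib (the toy matrix `X` is illustrative; any fitted PCA object would do):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))

pve = PCA().fit(X).explained_variance_ratio_
m = np.arange(1, len(pve) + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(m, pve, marker="o")                 # scree plot: look for the elbow
ax1.set(xlabel="principal component m", ylabel="PVE")
ax2.plot(m, np.cumsum(pve), marker="o")      # cumulative PVE
ax2.set(xlabel="principal component m", ylabel="cumulative PVE")
plt.show()
```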

Other Uses for Principal Components

  • Many statistical techniques (regression, classification, clustering) can be adapted to use the $n \times M$ matrix whose columns are the first $M \ll p$ principal component score vectors.
  • This PCA score matrix can be seen as a “de-noising” 10 of the original data, since the signal (as opposed to the noise) is concentrated in the earlier principal components.
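For instance, a regression can be fit on the score matrix instead of the raw features; a minimal sketch (the toy response `y` and the choice M = 2 are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)   # toy response

# regress y on the first M = 2 principal component score vectors
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))     # R^2 of the regression on the scores
```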

Clustering Methods

  • This is a broad set of techniques for finding clusters (or subgroups) of the data set.
  • Observations should be “similar” within clusters and dissimilar across clusters. The definition of “similar” is context dependent.
  • Clustering is popular in many fields, so there exist a great number of methods.

$K$-Means Clustering

  • $K$-means clustering seeks to partition the data into a pre-specified number $K$ of distinct, non-overlapping clusters.
  • More precisely, we seek a partition $\hat{C}_1, \dots, \hat{C}_K$ of the set of indices $\{1, \dots, n\}$

$$\hat{C}_1, \dots, \hat{C}_K = \underset{C_1, \dots, C_K}{\text{argmin}}\left(\sum_{k = 1}^K W(C_k)\right)$$

where $W(C_k)$ is some measure of the variation within cluster $C_k$.

  • A typical choice of $W(C_k)$ is the average 11 squared Euclidean distance between points in $C_k$:

$$W(C_k) = \frac{1}{|C_k|}\sum_{i, i' \in C_k} ||x_i - x_{i'}||^2$$

  • A brute-force search for the global minimum is $O(K^n)$, but there is a much faster iterative algorithm (below) which is guaranteed to find a local minimum. Because it uses a random initialization, it should be run several times and the best solution (smallest total within-cluster variation) kept.
Algorithm: $K$-Means Clustering
  1. Initialize by randomly assigning a cluster number $1, \dots, K$ to each observation.
  2. While the cluster assignments change:
    1. For each $k = 1, \dots, K$, compute the centroid of the $k$-th cluster (the vector of feature means 12 for the observations in the cluster).
    2. Assign each observation to the cluster whose centroid is closest.
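A plain NumPy sketch of this algorithm (the helper `k_means`, the two-blob toy data, and the choice K = 2 are illustrative, and the empty-cluster edge case is ignored); in practice one would use `sklearn.cluster.KMeans`, whose `n_init` argument also handles the multiple random restarts mentioned above.

```python
import numpy as np

def k_means(X, K, rng, max_iter=100):
    """Plain NumPy sketch of the K-means algorithm described above."""
    n = len(X)
    labels = rng.integers(K, size=n)                 # step 1: random initial assignment
    for _ in range(max_iter):
        # step 2a: centroid (vector of feature means) of each cluster
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: reassign each observation to the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):       # stop when assignments no longer change
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=0, size=(50, 2)), rng.normal(loc=5, size=(50, 2))])
labels, centroids = k_means(X, K=2, rng=rng)
print(np.bincount(labels))    # typically ~50/50 for this well-separated toy data
```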
Advantages
Disadvantages

Hierarchical Clustering

  • Hierarchical clustering is an alternative clustering method which doesn’t require a specified number of clusters and results in an attractive tree-based representation of the data called a dendrogram.
  • Bottom-up or agglomerative hierarchical clustering builds a dendrogram from the leaves up to the trunk.
Interpreting a Dendrogram
  • A dendrogram is a tree (visualized as upside down) with leaves corresponding to observations.
  • As we move up the tree, similar observations fuse into branches, and similar branches again fuse.
  • The earlier fusions occur, the more similar the corresponding groups of observations. The height at which two observations (or branches) are fused is a measure of their dissimilarity – the lower the fusion, the more similar they are.
  • A horizontal cut at any height of the dendrogram splits the observations into $k$ clusters (one for each branch that is cut), where $1 \leqslant k \leqslant n$.
  • The best choice of cut (hence the number $k$ of clusters) is often obtained by inspecting the dendrogram.
The Hierarchical Clustering Algorithm
  • This algorithm uses a notion of dissimilarity defined for clusters, called a linkage.
  • Let $A, B$ be clusters, and let $d(a, b)$ be a dissimilarity measure 13 for observations $a, b$. A linkage defines a dissimilarity measure $d(A, B)$ between the clusters $A, B$. The four most common types of linkage are
    • complete: $d_{comp}(A, B) = \underset{(a, b) \in A \times B}{\max}\, d(a, b)$
    • single: $d_{sing}(A, B) = \underset{(a, b) \in A \times B}{\min}\, d(a, b)$
    • average: $d_{avg}(A, B) = \frac{1}{|A||B|} \underset{(a, b) \in A \times B}{\sum} d(a, b)$
    • centroid: $d_{cent}(A, B) = d(\bar{x}_A, \bar{x}_B)$,
      where $\bar{x}_A$ (resp. $\bar{x}_B$) is the centroid of $A$ (resp. $B$).
  • Average, complete, and single linkages are preferred by statisticians. Average and complete linkages are generally preferred as they result in more balanced dendrograms.
  • Centroid linkage is often used in genomics, but suffers from the possibility of an inversion, in which two clusters are fused at a height below the individual clusters, which makes interpretation difficult.
Choice of Dissimilarity Measure
  • The squared Euclidean distance is often used as a dissimilarity measure.
  • An alternative is the correlation-based distance, which considers two observations similar if their features are highly correlated (a sketch of both measures follows this list).
  • The choice of dissimilarity measure is very important and has a strong effect on the resulting dendrogram. The choice of measure should be determined by context.
  • One should consider scaling the data before choosing the dissimilarity measure.
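A minimal sketch of the two dissimilarity measures using SciPy (the toy matrix `X` is illustrative); `pdist` returns the pairwise dissimilarities in condensed form, and `squareform` turns them into an $n \times n$ matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 4))                              # 5 observations, 4 features

d_euclid = squareform(pdist(X, metric="sqeuclidean"))    # squared Euclidean distance
d_corr = squareform(pdist(X, metric="correlation"))      # 1 - correlation between observations

print(d_euclid.round(2))
print(d_corr.round(2))
```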
Algorithm: Hierarchical Clustering
  1. Initialize with $n$ clusters, one for each observation, and compute the pairwise dissimilarities $d(x_i, x_j)$.
  2. For $i = n, \dots, 2$:
    1. Examine all pairwise dissimilarities among the $i$ clusters, and fuse the two clusters that are least dissimilar. This dissimilarity is the height in the dendrogram at which the fusion is placed.
    2. Compute the dissimilarities among the new $i - 1$ clusters.
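A minimal sketch with SciPy (the toy data `X`, the choice of average linkage, and the cut height are illustrative): `linkage` runs the agglomerative algorithm above, `dendrogram` draws the tree, and `fcluster` cuts it at a chosen height.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=0, size=(20, 2)), rng.normal(loc=4, size=(20, 2))])

Z = linkage(X, method="average", metric="euclidean")   # agglomerative clustering
dendrogram(Z)                                          # visualize the tree
plt.show()

labels = fcluster(Z, t=3.0, criterion="distance")      # cut the dendrogram at height 3.0
print(np.unique(labels, return_counts=True))
```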
Advantages
Disadvantages

Practical Issues in Clustering

Small Decisions with Big Consequences
  • Should observations or features be standardized in some way?
  • For hierarchical clustering:
    • What dissimilarity measure should be used?
    • What type of linkage should be used?
    • Where should we cut the dendrogram to determine the number of clusters?
  • For $K$-means clustering, what is the choice of $K$?
Validating the Clusters Obtained
  • It is important to decide whether the clusters obtained reflect true subgroups in the data or are a result of “clustering the noise”.
  • There exist techniques for making this decision, such as obtaining $p$-values for each cluster.
Other Considerations in Clustering
  • Both $K$-means and hierarchical clustering assign all observations to some cluster. This can be problematic, for example in the presence of outliers that don’t clearly belong to any cluster.
  • “Mixture models” are an attractive approach to accommodating outliers (they amount to a “soft” clustering approach).
  • Clustering methods are generally not robust to perturbations of the data.
A Tempered Approach to Interpreting the Results of Clustering
  • Clustering can be a very useful and valid statistical tool if used properly.
  • To overcome the sensitivity to hyperparameters, it is recommended to try several hyperparameter choices and look for patterns that consistently emerge.
  • To overcome the sensitivity to perturbations, it is recommended to cluster on subsets of the data.
  • Finally, results of cluster analysis should be considered a part of EDA and not taken too seriously.

Footnotes

  1. See book figure 10.1 and corresponding discussion. 

  2. The linear algebra interpretation is also nice 

  3. This constraint is equivalent to requiring that $Z_j$ be uncorrelated with the preceding principal components, i.e. $\text{corr}(Z_j, Z_l) = 0$ for $l < j$

  4. See the wiki article 

  5. That is, the linear subspace of $\mathbb{R}^p$ which minimizes the sum of the squared Euclidean distances to the points $x_i$

  6. When $M = \min\{n - 1, p\}$, the approximation is exact. 

  7. More accurately, the sum on the right is an estimate of the sum on the left. 

  8. In general there are $\min\{n-1, p\}$ principal components and $\sum_{m = 1}^{\min\{n-1, p\}} \text{PVE}_m = 1$ 

  9. Indeed, if we take $M < p$ principal components, then we are truly doing dimensional reduction. 

  10. In a supervised setting, however, we can treat the number of components as a tuning parameter. 

  11. There is a nice information-theoretic interpretation of this statement. 

  12. That we are taking an average is probably the reason for the “means” in “$K$-means”. 

  13. For example, the commonly used squared Euclidean distance. See Choice of Dissimilarity Measure