10. Unsupervised Learning
The Challenge of Unsupervised Learning
- Unsupervised learning is learning in the absence of a response. It is often part of exploratory data analysis (EDA).
- Without a response, we aren't interested in prediction or classification; rather, we are interested in discovering interesting things about the data. This can be difficult because such a goal is somewhat subjective.
- Defining objective performance criteria for unsupervised learning can also be challenging.
Principal Components Analysis
- Principal components were discussed earlier as a dimension reduction method in the context of regression. They provide a low-dimensional representation of the data that contains as much of the variation as possible.
- Principal Components Analysis is the process of computing principal components and using them in data analysis.
What Are Principal Components?
- The first principal component of the features $X_1, \dots, X_p$ is the normalized linear combination $Z_1 = \phi_1^\top X$, where $X = (X_1, \dots, X_p)$, $\phi_1 \in \mathbb{R}^p$, and $\|\phi_1\| = 1$. The vector $\phi_1$ is called the loading vector (its entries are called the loadings), and it is estimated from the data by
$$\hat{\phi}_1 = \underset{\phi \in \mathbb{R}^p,\ \|\phi\| = 1}{\operatorname{argmax}} \left( \frac{1}{n} \sum_{i=1}^n (\phi^\top x_i)^2 \right).$$
A numerical sketch of this computation appears after this list.
- Assume we have data $x_1, \dots, x_n$ with features $X_1, \dots, X_p$, centered so that each feature has mean zero. Writing $z_{i1} = \phi^\top x_i$, the objective in the optimization problem above can be rewritten as
$$\hat{\phi}_1 = \underset{\phi \in \mathbb{R}^p,\ \|\phi\| = 1}{\operatorname{argmax}} \left( \frac{1}{n} \sum_{i=1}^n z_{i1}^2 \right),$$
and the objective is just the sample variance of the $z_{i1}$ (which have mean zero because the features are centered). The $z_{i1}$ are called the scores of the first principal component $Z_1$.
- The first principal component has a nice geometric interpretation. The loading vector $\phi_1$ defines a direction in $\mathbb{R}^p$ along which the variation of the data is maximized. The principal component scores $z_{i1}$ are the projections of the data $x_i$ onto $\phi_1$, that is, the components of the $x_i$ along this direction.
- For $j = 2, \dots, p$ we can compute the $j$-th loading vector $\hat{\phi}_j$ recursively:
$$\hat{\phi}_j = \underset{\phi \in \mathbb{R}^p,\ \|\phi\| = 1}{\operatorname{argmax}} \left( \frac{1}{n} \sum_{i=1}^n (\phi^\top x_i)^2 \right)$$
subject to
$$\phi^\top \hat{\phi}_l = 0 \quad \text{for } l = 1, \dots, j - 1,$$
i.e., $\hat{\phi}_j$ is orthogonal to all of the previous loading vectors (equivalently, $Z_j$ is uncorrelated with the previous components).
- We can plot the principal components against each other for a low-dimensional visualization of the data. For example, a biplot plots both the scores and the loading vectors.
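Below is a minimal numerical sketch of the definition above using NumPy; the data matrix `X` is simulated purely for illustration. The loading vectors are obtained from the singular value decomposition of the centered data matrix, which solves the variance-maximization problem stated earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # assumed n x p data matrix (illustrative)
X = X - X.mean(axis=0)               # center each feature to mean zero

# The right singular vectors of the centered data are the loading vectors,
# i.e. the eigenvectors of the sample covariance matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

phi_1 = Vt[0]                        # first loading vector, ||phi_1|| = 1
z_1 = X @ phi_1                      # scores z_{i1} of the first principal component

# The first loading vector maximizes the sample variance of the projections:
print(np.var(z_1), S[0] ** 2 / len(X))   # the two numbers agree
```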
Another Interpretation of Principal Components
- Principal components can also be seen as providing low-dimensional surfaces that are “closest” to the observations.
- The span of the first M loading vectors $\phi_1, \dots, \phi_M$ can be seen as the M-dimensional linear subspace of $\mathbb{R}^p$ that is closest to the observations $x_i$.
- Together the principal components $Z_1, \dots, Z_M$ and loading vectors $\phi_1, \dots, \phi_M$ can be seen as giving an M-dimensional approximation of each observation:
$$x_{ij} \approx \sum_{m=1}^M z_{im} \phi_{jm}.$$
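Continuing the sketch above, this rank-$M$ approximation can be checked numerically (the choice $M = 2$ is arbitrary):

```python
M = 2
Z = X @ Vt[:M].T                      # n x M matrix of scores z_{im}
X_approx = Z @ Vt[:M]                 # x_{ij} ~ sum_m z_{im} * phi_{jm}
print(np.linalg.norm(X - X_approx))   # reconstruction error shrinks as M grows
```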
More on PCA
- PCA requires that the variables are centered to have mean zero.
- PCA is sensitive to scaling, so we usually scale each variable to have standard deviation 1.
- Scaling to standard deviation 1 is particularly important when the variables are measured in different units; if they are measured in the same units, we may not wish to do this.
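A short sketch of this preprocessing, assuming scikit-learn is available (centering and scaling can of course also be done by hand with NumPy):

```python
from sklearn.preprocessing import StandardScaler

# Center each feature to mean zero and scale it to standard deviation one.
X_std = StandardScaler().fit_transform(X)
```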
Uniqueness of the Principal Components
The loading vectors and score vectors are unique up to sign flips.
The Proportion of Variance Explained
- How much of the information in a given data set is lost by projecting onto the principal components? More precisely, what is the proportion of variance explained (PVE) by each principal component?
- Assuming centered data, the total variance is
$$\mathrm{var}_{\mathrm{total}} := \sum_{j=1}^p V(X_j) = \sum_{j=1}^p \left( \frac{1}{n} \sum_{i=1}^n x_{ij}^2 \right)$$
while the variance explained by the $m$-th principal component is
$$\mathrm{var}_m := \frac{1}{n} \sum_{i=1}^n z_{im}^2 = \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{jm} x_{ij} \right)^2.$$
- The PVE of the $m$-th component is then
$$\mathrm{PVE}_m := \frac{\mathrm{var}_m}{\mathrm{var}_{\mathrm{total}}}$$
and the cumulative PVE of the first $M$ components is
$$\sum_{m=1}^M \mathrm{PVE}_m.$$
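A sketch of computing the PVE from the standardized data above; scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`:

```python
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)
pve = pca.explained_variance_ratio_   # PVE_m for each component m
cumulative_pve = np.cumsum(pve)       # cumulative PVE of the first M components
print(pve, cumulative_pve)
```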
Deciding How Many Principal Components to Use
- In general we may not be interested in using all of the principal components, but just enough to get a "good" understanding of the data.
- A scree plot, which plots $\mathrm{PVE}_m$ against $m$, is one visual method for identifying a good number of principal components to use (see the sketch after this list). We look for an elbow: a value of $m$ such that $\mathrm{PVE}_m$ drops off thereafter.
- In general, the question of how many principal components are "enough" is ill-defined, and depends on the application and the dataset. We may look at the first few principal components in order to find interesting patterns. If none are evident, then we conclude further components are unlikely to be of use. If some are evident, we continue looking at components until no more interesting patterns are found.
- In an unsupervised setting, these methods are all ad hoc, and reflect the fact that PCA is generally used in EDA.
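A scree plot can be produced directly from the PVE values computed above; matplotlib is an assumed dependency here:

```python
import matplotlib.pyplot as plt

m = np.arange(1, len(pve) + 1)
plt.plot(m, pve, "o-", label="PVE")
plt.plot(m, cumulative_pve, "s-", label="cumulative PVE")
plt.xlabel("principal component m")
plt.ylabel("proportion of variance explained")
plt.legend()
plt.show()   # look for an elbow in the PVE curve
```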
Other Uses for Principal Components
- Many statistical techniques (regression, classification, clustering) can be adapted to the $n \times M$ matrix whose columns are the first $M \ll p$ principal component score vectors.
- This matrix can be seen as a "de-noising" of the original data, since the signal (as opposed to the noise) is concentrated in the earlier principal components.
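For example, regressing a response on the first $M$ score vectors (principal components regression) might look like the following sketch; the response `y` here is simulated purely for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

M = 3
Z = PCA(n_components=M).fit_transform(X_std)   # n x M matrix of score vectors
y = rng.normal(size=len(X_std))                # assumed response, illustrative only
model = LinearRegression().fit(Z, y)           # any supervised method could be used here
```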
Clustering Methods
- This is a broad set of techniques for finding clusters (or subgroups) of the data set.
- Observations should be “similar” within clusters and dissimilar across clusters. The definition of “similar” is context dependent.
- Clustering is popular in many fields, so there exist a great number of methods.
K-Means Clustering
- K-means clustering seeks to partition the data into a pre-specified number K of distinct, non-overlapping clusters.
- More precisely, we seek a partition $\hat{C}_1, \dots, \hat{C}_K$ of the set of indices $\{1, \dots, n\}$:
$$\hat{C}_1, \dots, \hat{C}_K = \underset{C_1, \dots, C_K}{\operatorname{argmin}} \left( \sum_{k=1}^K W(C_k) \right)$$
where $W(C_k)$ is some measure of the variation within cluster $C_k$.
- A typical choice of $W(C_k)$ is the average squared Euclidean distance between points in $C_k$:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i, i' \in C_k} \|x_i - x_{i'}\|^2.$$
- A brute force search for the global minimum is $O(K^n)$, but there is a much faster algorithm which is guaranteed to find a local minimum. It uses a random initialization, so it should be performed several times.
Algorithm: K-Means Clustering
- Initialize by randomly assigning a cluster number 1,…K to each observation.
- While the cluster assignments change:
- For each k=1,…K, compute the centroid of the k-th cluster (the vector of feature means for the observations in the cluster).
- Assign to each observation the number of the cluster whose centroid is closest.
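A minimal NumPy sketch of this algorithm (empty clusters and multiple random restarts are not handled; in practice one would use `sklearn.cluster.KMeans`, which runs several initializations via `n_init`):

```python
def kmeans(X, K, rng, max_iter=100):
    labels = rng.integers(K, size=len(X))       # step 1: random initial assignment
    for _ in range(max_iter):
        # Compute the centroid (vector of feature means) of each cluster.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Reassign each observation to the cluster with the closest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # stop when assignments no longer change
            return labels
        labels = new_labels
    return labels

labels = kmeans(X, K=3, rng=np.random.default_rng(1))
```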
Advantages
Disadvantages
Hierarchical Clustering
- Hierarchical clustering is an alternative clustering method which doesn’t require a specified number of clusters and results in an attractive tree-based representation of the data called a dendrogram.
- Bottom-up or agglomerative hierarchical clustering builds a dendrogram from the leaves up to the trunk.
Interpreting a Dendrogram
- A dendrogram is a tree (visualized as upside down) with leaves corresponding to observations.
- As we move up the tree, similar observations fuse into branches, and similar branches again fuse.
- The earlier fusions occur, the more similar the corresponding groups of observations. The height at which two observations are joined by this fusing is a measure of this similarity.
- At each height in the dendrogram, a horizontal cut splits the observations into k clusters (corresponding to each of the branches cut) where 1⩽k⩽n.
- The best choice of cut (hence number k of clusters) is often obtained by inspecting the diagram.
The Hierarchical Clustering Algorithm
- This algorithm uses a notion of dissimilarity defined for clusters, called a linkage.
- Let $A, B$ be clusters, and let $d(a, b)$ be a dissimilarity measure for observations $a, b$. A linkage defines a dissimilarity measure $d(A, B)$ between the clusters $A, B$ (each is computed in the sketch after this list). The four most common types of linkage are
  - complete: $d_{\mathrm{comp}}(A, B) = \max_{(a, b) \in A \times B} d(a, b)$
  - single: $d_{\mathrm{sing}}(A, B) = \min_{(a, b) \in A \times B} d(a, b)$
  - average: $d_{\mathrm{avg}}(A, B) = \frac{1}{|A||B|} \sum_{(a, b) \in A \times B} d(a, b)$
  - centroid: $d_{\mathrm{cent}}(A, B) = d(\bar{x}_A, \bar{x}_B)$, where $\bar{x}_A$ (resp. $\bar{x}_B$) is the centroid of $A$ (resp. $B$).
- Average, complete, and single linkage are the types most often used by statisticians; average and complete linkage are generally preferred, as they result in more balanced dendrograms.
- Centroid linkage is often used in genomics, but suffers from the possibility of an inversion, in which two clusters are fused at a height below the individual clusters, which makes interpretation difficult.
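The four linkages can be computed directly from their definitions. The sketch below uses Euclidean distance as the observation-level dissimilarity $d(a, b)$ and reuses two clusters from the K-means sketch above; both choices are assumptions made purely for illustration.

```python
from scipy.spatial.distance import cdist

A, B = X[labels == 0], X[labels == 1]    # two clusters from the K-means sketch above
D = cdist(A, B)                          # pairwise Euclidean dissimilarities d(a, b)

d_complete = D.max()                     # maximal inter-cluster dissimilarity
d_single = D.min()                       # minimal inter-cluster dissimilarity
d_average = D.mean()                     # mean inter-cluster dissimilarity
d_centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between centroids
```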
Choice of Dissimilarity Measure
- The squared Euclidean distance is often used as a dissimilarity measure.
- An alternative is the correlation-based distance, which considers two observations similar if their feature values are highly correlated (see the sketch after this list).
- The choice of dissimilarity measure is very important and has a strong effect on the resulting dendrogram. The choice of measure should be determined by context.
- One should consider scaling the data before choosing the dissimilarity measure.
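One common convention (stated here as an assumption) is to define the correlation-based distance between two observations as one minus the correlation of their feature values:

```python
def correlation_distance(x_i, x_j):
    # Small when the two observations' feature profiles are highly correlated,
    # regardless of the magnitudes of the features themselves.
    return 1 - np.corrcoef(x_i, x_j)[0, 1]

print(correlation_distance(X[0], X[1]))
```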
Algorithm: Hierarchical Clustering
- Initialize with n clusters, one for each observation, and compute the dissimilarities d(xi,xj) for each pair.
- For i=n,…,2:
- Compute all pairwise dissimilarities among the i clusters, and fuse the two clusters which are the least dissimilar. This dissimilarity is the height in the dendrogram at which the fusion is placed.
- Compute the dissimilarities among the new i−1 clusters.
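In practice this agglomerative procedure is available in SciPy; a minimal sketch, where complete linkage, the Euclidean metric, and a cut yielding 4 clusters are all assumed choices:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

Z_link = linkage(X, method="complete", metric="euclidean")  # sequence of fusions
dendrogram(Z_link)                                          # plot the dendrogram
plt.show()
clusters = fcluster(Z_link, t=4, criterion="maxclust")      # cut to obtain 4 clusters
```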
Advantages
Disadvantages
Practical Issues in Clustering
Small Decisions with Big Consequences
- Should observations or features be standardized in some way?
- For hierarchical clustering:
- What dissimilarity measure should be used?
- What type of linkage should be used?
- Where should we cut the dendrogram to determine the number of clusters?
- For K-means clustering, what is the choice of K?
Validating the Clusters Obtained
- It is important to decide whether the clusters obtained reflect true subgroups in the data or are a result of “clustering the noise”.
- There exist techniques for making this decision, such as obtaining p-values for each cluster.
Other Considerations in Clustering
- Both K-means and hierarchical clustering assign all observations to some cluster. This can be problematic, for example in the presence of outliers that don’t clearly belong to any cluster.
- “Mixture models” are an attractive approach to accommodating outliers (they amount to a “soft” clustering approach).
- Clustering methods are not robust to perturbations.
A Tempered Approach to Interpreting the Results of Clustering
- Clustering can be a very useful and valid statistical tool if used properly.
- To overcome the sensitivity to hyperparameters, it is recommended to try several choices of the hyperparameters.
- To overcome the sensitivity to perturbations, it is recommended to cluster on subsets of the data.
- Finally, the results of cluster analysis should be considered a part of EDA and not taken too seriously.