Relevant Textbook Sections: 6
Cube: Unsupervised, Discrete, Nonprobabilistic
Lecture 13 Summary
Relevant Videos
Intro: Clustering
Intro to Unsupervised Learning
So far in the class, we've been doing supervised learning, where we were trying to predict y's given x's. Now, we'll look at unsupervised learning, where there are no y's, only x's. The task instead is to look at the data and "summarize" it (the x's).
Intro to Clustering
One specific type of unsupervised learning that we're covering today is clustering. Some real-world examples are the following: perhaps we're looking at a huge network, and we want to identify different types of groupings within the network, for example in social media. These groups of people might have certain shared interests or activities that we'd like to find out. Another example: perhaps we have a bunch of genes and we know the interactions between them, but we want to find genes that are similar so that we can understand the science behind what's going on. Overall, putting things into groups is a way we can understand things better. This may seem a little fuzzy and undefined for now, but as we go through the lecture and formalize things in the math, things will make more sense.
General Look at Unsupervised Learning
One way to look at unsupervised learning at the broadest, most general level is to go back to that idea of "summarization". Essentially, our unsupervised learning is a "summarization task" that we're trying to complete. As we think about this, it's important to consider three questions.
- Why might we want to do this?
- Hypothesis Generation (we might want to look at summaries or patterns in science so that we can start asking the right questions about why certain things look the way they do)
- Compression: we may want to compress images, music, or videos in order to save storage space
- Organization: perhaps we are organizing news articles, and we want to show only a few headlines that roughly represent all of the headlines. To phrase this another way: we might be trying to summarize the news into a few headlines
- How do we evaluate our summarization?
We have a goal in mind: maybe we want to group a bunch of movies into different genres. But we need a formal way of writing this in the math, and we need an objective to optimize. We can use reconstruction error.
$$\sum_n || x_n - \textrm{decode}(\textrm{encode}(x_n))||^2$$
We can think of encode(x) as the summary step (in the movie example: which cluster does this movie belong to) and decode as the conversion of the summary back into our new version (best guess) of x based off just the summary. We can call this $\hat{x}$. The $||\cdot||$ notation here represents the distance metric. This could really be anything: edit distance, Hamming distance, cosine similarity, etc. Remember that there are lots of types of metrics out there: geometry, density, stability, etc.
Note: Remember that this is a general framework for the summarization task. We've set it up in a way that it also works for examples like, say, compressing an image (where we can encode by compressing, and decode by decompressing), as well as the clustering-of-movies example. In the movie example, it may feel like overkill, but it still works out. The encoding is picking the genre. The decoding is trying to convert the genre back to the movie. Obviously, that is impossible, so we will incur some loss for sure, but the logic here is that even though we've lost some info in the process, categorizing a movie by the correct genre results in a smaller error than categorizing it by a wrong genre. (A small code sketch of this encode/decode framing appears right after this list of questions.)
- How do we do it? There are a couple different ways, but today we're going to focus on nonprobabilistic clustering.
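To make the encode/decode framing a little more concrete, here is a minimal sketch (the function names and toy data are my own, not from lecture) where the "summary" of each point is the index of its nearest prototype, and the reconstruction error is the squared Euclidean distance between each point and its decoded summary.

```python
import numpy as np

def encode(x, prototypes):
    """Summarize a point by the index of its nearest prototype."""
    return np.argmin(np.linalg.norm(prototypes - x, axis=1))

def decode(code, prototypes):
    """Convert the summary (an index) back into our best guess for x."""
    return prototypes[code]

def reconstruction_error(X, prototypes):
    """sum_n ||x_n - decode(encode(x_n))||^2"""
    return sum(np.sum((x - decode(encode(x, prototypes), prototypes)) ** 2) for x in X)

# Toy data: four 2-D points summarized by two prototypes.
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])
print(reconstruction_error(X, prototypes))  # small, since the prototypes summarize the data well
```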
Nonprobabilistic Clustering: K-Means
Scenario and Notation
Let us dive into the K-Means clustering algorithm. First, to set up our scenario:
- We have our data $x_1, x_2, ..., x_n$
- We have a measure of similarity/distance, $d(x, x') = ||x - x'||$
- Start off by supposing we are given k (as in we know that there are k clusters)
- Let $z_{nk}$ indicate group assignment. As in if $x_n$ is in cluster $k$, $z_{nk}$ is 1. And if $x_n$ is not in cluster $k$, $z_{nk}$ is 0.
- For today, let's assume the distance metric is Euclidean.
- $\mu_k$ is a "prototype" that represents the cluster. The idea is that it either represents or defines the middle point of a given cluster (we will formalize this below). Note: it should also be a d-dimensional item, the same dimension as the x's. It's also possible that you might use the medoids instead of the average, for example, to select your "prototypes". For now, we're keeping things, again, as general as possible.
The Objective
With the setup out of the way, we should now start with our objective function, which is what will get us going in the right direction. We had an objective from before:
$$\sum_n || x_n - \textrm{decode}(\textrm{encode}(x_n))||^2$$
We're going to rewrite this into our specific K-Means situation.
$$\min_{z, \mu} \sum_n \sum_k z_{nk} ||x_n - \mu_k||$$
Note on how we rewrote it: the $\sum_k z_{nk}$ part makes sure to pick out the correct cluster for each point (as in, the cluster that point $x_n$ is assigned to). This is similar to the way we've used indicator variables in the past. The $||x_n - \mu_k||$ term represents the encode/decode part. Because we are doing clustering, the summary or "encoding" is the prototype $\mu_k$, and the loss we get is described exactly by $||x_n - \mu_k||$, since $\mu_k$ is our best option for describing a given point; it also represents the decoding part, because we have to use $\mu_k$ as our best guess for all the points in cluster k.
It turns out this optimization is NP-hard and very non-convex, so it's going to be difficult to solve this objective exactly. Luckily, we can use an algorithm to find a pretty good local optimum.
Note: at this point in time, we note that we will square the objective, as it won't affect the minimization ultimately, but will make the math more intuitive down the road. Details to come later. Now the objective is:
$$\min_{z, \mu} \sum_n \sum_k z_{nk} ||x_n - \mu_k||^2$$
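As a quick sanity check on the notation, here is a small sketch (the function name and array layout are my own, not from lecture) that evaluates this squared objective given assignments $z$ and prototypes $\mu$; the indicator $z_{nk}$ simply picks out, for each point, the distance to the prototype of the cluster it's assigned to.

```python
import numpy as np

def kmeans_objective(X, z, mu):
    """Compute sum_n sum_k z_nk * ||x_n - mu_k||^2.

    X:  (N, D) data points
    z:  (N, K) one-hot cluster assignments
    mu: (K, D) cluster prototypes
    """
    # Squared distances from every point to every prototype: shape (N, K).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    # z zeroes out everything except the assigned cluster's distance for each point.
    return (z * sq_dists).sum()
```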
The Algorithm (K-Means, Lloyd's algorithm)
In the above section, we just have the objective. Remember that an algorithm is also necessary, and that these are two very different things. The K-Means objective (what we wrote above) is ultimately the loss we're trying to minimize. Now we'll look at the actual K-Means algorithm, which is used to (approximately) solve the above objective and is different from the objective itself.
Note: this K-Means algorithm is also known as Lloyd's algorithm, and is a variation of block coordinate descent. Block coordinate descent (aka alternating minimization) is an important concept overall, so definitely note it down.
Block Coordinate Descent: The core idea of block coordinate descent is: given $\mu$, it is easy to solve for z, and given z, it is easy to solve for $\mu$. Intuitively this makes sense in the K-Means situation. First, given $\mu$, solving for z: if you had the prototypes, you could just assign each point to the cluster with the closest prototype, which minimizes the objective. Then, given z, picking $\mu$ is easy: just take the average of the points in each cluster, which also minimizes the loss.
Back to Algorithm
- Randomly assign points to clusters (as in randomly assigning $z_{nk}$)
- Loop until converged:
- for k = 1, ... K:
- $\mu_k = \textrm{mean}(x_n\textrm{ such that }z_{nk} = 1) = \frac{1}{n_k} \sum_n z_{nk} x_n$
- where $n_k$ is obtained by $\sum_n z_{nk}$
- for n = 1, ... N:
- Assign $x_n$ to $\textrm{argmin}_k ||x_n - \mu_k ||$
Note: usually you will do several restarts and then pick the best, because the problem is non-convex. A short code sketch of the full loop follows.
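Here is a minimal sketch of Lloyd's algorithm following the steps above: random initial assignments, then alternating between the $\mu$ update and the $z$ update until the assignments stop changing. The function and variable names are my own, and a real implementation would be more careful (e.g., about empty clusters) and would do the several restarts mentioned in the note above.

```python
import numpy as np

def kmeans(X, K, max_iters=100, rng=None):
    """Lloyd's algorithm: alternate prototype updates and cluster assignments."""
    rng = np.random.default_rng(rng)
    N = len(X)
    # Randomly assign points to clusters (the z_nk's, stored as labels 0..K-1).
    labels = rng.integers(0, K, size=N)
    for _ in range(max_iters):
        # mu step: each prototype is the mean of the points currently assigned to it.
        mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                       else X[rng.integers(N)]   # re-seed an empty cluster at a random point
                       for k in range(K)])
        # z step: assign each point to its closest prototype.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # converged: assignments stopped changing
            break
        labels = new_labels
    return labels, mu
```

In practice you would run this several times with different random seeds and keep the run with the lowest objective, since the problem is non-convex.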
Why this Algorithm Works
At a high level, we know this algorithm works because in each step (the calculating-z step and the calculating-$\mu$ step), we are reducing (or at least not increasing) the loss. Since the loss keeps decreasing and can't go below zero, we know we will reach some local minimum.
For calculating z:
At every step in the loop across each data point n, the loss is getting smaller because we are assigning points to the closest prototype, so by definition we are minimizing the loss, since we are shortening (or at least not changing) the distance between points and their prototypes.
For calculating $\mu$:
We will do a bit of math to show this. Let us go back to the objective function and take the derivative with respect to just $\mu_k$ to show that the step we took was minimizing the loss.
$$L = \sum_n z_{nk} (x_n - \mu_k)^T(x_n - \mu_k)$$
We can take this derivative with respect to just $\mu_k$.
$$\frac{\partial L}{\partial \mu_k} = -2 \sum_n z_{nk} (x_n - \mu_k) = 0$$
Rearranging and pulling $\mu_k$ out of the sum (since it doesn't depend on $n$) gives
$$\left(\sum_n z_{nk}\right)\mu_k = \sum_n z_{nk} x_n$$
Dividing both sides by $\sum_n z_{nk}$ then gives
$$\mu_k = \frac{\sum_n z_{nk} x_n}{\sum_n z_{nk}}$$
As you can see, this is exactly what the algorithm was doing, so it indeed was minimizing.
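If you'd like a numerical sanity check as well, here is a tiny sketch (my own, not from lecture) comparing the within-cluster squared loss at the mean against a few slightly perturbed candidates; the mean never does worse.

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))            # points assigned to one cluster

def loss(mu):
    """sum_n ||x_n - mu||^2 over this cluster's points."""
    return ((cluster - mu) ** 2).sum()

mu_mean = cluster.mean(axis=0)
for _ in range(5):
    candidate = mu_mean + rng.normal(scale=0.1, size=3)
    assert loss(mu_mean) <= loss(candidate)   # the mean always does at least as well
```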
Conclusion:
Again, to recap: because the algorithm is minimizing at both the z and $\mu$ steps, and we can keep running it until convergence, we know that this will indeed work and get us to at least a local optimum. Before, we hand-waved that taking the means made sense, using just intuition; now we've gone through the math more formally to show exactly why this algorithm works.
Student Question: Do the starting values affect the result? Yes, they do, great question. Some algorithms address this. One quick example is the k-means++ algorithm.
Additional Considerations of K-Means
- How many clusters should we pick? Look for the kink in the curve, also known as the elbow method (a sketch of this appears after this list). You can read more about this method and also another method here. Remember that this does need to be a careful choice, because we can't just try to minimize the loss: if we set k really big, as big as n, then the loss is 0, as every point ends up in its own cluster. So you can't just look for the lowest loss. Instead, we plot the number of clusters vs. the loss and look for a kink. That kink represents the point at which adding clusters starts having diminishing returns. Overall this is still tricky; there's no exact definition of the perfect number of clusters.
- Remember that other distance metrics can be used instead of Euclidean distance. Depending on the distance metric, the math could work out to be easier or harder!
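Here is a rough sketch of the elbow method using the `kmeans` sketch from above (so the helper names are my own, and the toy data is made up): run K-Means for a range of k, plot the loss, and look for the kink.

```python
import numpy as np
import matplotlib.pyplot as plt

def total_loss(X, labels, mu):
    """K-Means loss: sum of squared distances from points to their assigned prototypes."""
    return ((X - mu[labels]) ** 2).sum()

# Toy data with 4 well-separated blobs, so the elbow should appear around k = 4.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=3 * c, size=(50, 2)) for c in range(4)])

ks = range(1, 10)
losses = [total_loss(X, *kmeans(X, k, rng=0)) for k in ks]   # kmeans() from the sketch above

plt.plot(list(ks), losses, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("K-Means loss")
plt.show()   # look for the kink/elbow in this curve
```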
Nonprobabilistic Clustering: Hierarchical Agglomerative Clustering
Hierarchical clustering constructs a tree over the data, where the leaves are individual data items, while the root is a single cluster that contains all of the data. The algorithm is as follows:
- Start with n clusters, one for each data point
- Measure the distance between clusters. This will require an inter-cluster distance measurement that we will define below
- Merge the two ‘closest’ clusters together, reducing the number of clusters by 1, and also record the distance between these two merged clusters for the dendrogram (more on this later)
- Repeat steps 2 and 3 until we're left with only a single cluster (a code sketch of this loop follows below)
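As a concrete illustration, here is a minimal from-scratch sketch of the merge loop using minimum-distance (single) linkage, one of the inter-cluster distances defined below; the function name and structure are my own. It records each merge and its distance, which is exactly the information a dendrogram is drawn from.

```python
import numpy as np
from itertools import combinations

def hac_single_linkage(X):
    """Agglomerative clustering with minimum-distance (single) linkage.

    Returns the sequence of merges as (members_a, members_b, merge_distance);
    a dendrogram is just a drawing of this sequence.
    """
    clusters = {i: [i] for i in range(len(X))}   # start with one cluster per data point
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest points are nearest (single linkage).
        (a, b), dist = min(
            (((a, b), min(np.linalg.norm(X[i] - X[j])
                          for i in clusters[a] for j in clusters[b]))
             for a, b in combinations(clusters, 2)),
            key=lambda pair: pair[1],
        )
        merges.append((list(clusters[a]), list(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return merges
```

This brute-force version recomputes all pairwise cluster distances at every step, which is fine for a sketch but too slow for large datasets.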
Some features of HAC include: HAC is nonparametric (as seen in the above algorithm, there isn't a finite set of parameters that describes what's going on), and it is also deterministic because there is no random initialization (whereas in K-Means we actually have to do some random initialization).
Dendrogram
It is easier to understand the dendrogram once you have a high-level understanding of the HAC algorithm and the concept of the inter-cluster distance. I include this section here first just in case people are curious, but I would recommend reading what's below first, then coming back to this paragraph. At a high level, the dendrogram is the tree that represents all the merges that you made. Since a static dendrogram isn't very effective here (the fashion in which it is drawn is completely lost), after understanding the above and below, check out this video that does a good job of explaining and drawing out the dendrogram.
Distance Between Points and Distance Between Clusters
As usual, we always need a definition of the distance between points as a base for anything to work. Commonly, we may choose to use the Euclidean distance. But as mentioned above, for the algorithm to work, we also need the inter-cluster distance. This is actually very important, and is really the main decision we make when we customize and use HAC; this choice of inter-cluster distance criterion will affect the results. The inter-cluster distance is also known as the linkage criterion. Some examples are below:
Example Linkages
- Minimum distance between two groups: given two groups, find the closest pair of points across the two groups, and use that distance to represent the distance between the two groups
- Maximum distance between two groups: given two groups, find the furthest pair of points across the two groups, and use that distance to represent the distance between the two groups
It turns out that min distance will result in more stringy, snakey clusters, as merging is biased towards points that are close together, so groups of points that are almost lined up will start to form. On the other hand, max distance will result in more compact, rounded, complete clusters, as groups swallow in points from all around. This is because max distance considers the whole extent of each group, so many more groups look comparably close before a merge happens, rather than one nearby point immediately dominating the decision.
There are also compromises between the two above; for example, you can use the average pairwise distance or the distance between centroids as the linkage.
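If you want to see how the choice of linkage changes the clusters in practice, scipy's hierarchical clustering routines take the linkage criterion as a parameter. A quick sketch (the toy data here is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Toy data: three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (0, 4, 8)])

# "single" = minimum distance, "complete" = maximum distance,
# "average" and "centroid" are the compromises mentioned above.
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # build the merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes

dendrogram(linkage(X, method="complete"))              # the tree of merges (dendrogram)
plt.show()
```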
You will need to pick a linkage depending on what you want your output to look like. We discuss these because it's important to be able to think about the different linkages, in case you need specific types of outputs for specific situations in the real world.