Date: March 24, 2022
Relevant Textbook Sections: 7
Cube: Unsupervised, Continuous, Nonprobabilistic
Relevant Videos
Summary
Embeddings and PCA
Our past two lectures covered settings with a discrete hidden variable z. In those scenarios, we wanted to cluster our data into different groups, where z told us what group a data point $x$ should go in. Today, we switch to a situation where we have a continuous z, which we call our embedding (because it’s an embedding of all the info in the x's into a z with lower dimension). Instead of grouping things into k discrete clusters, what if we tried to summarize the data into k continuous dimensions?
Recall: Continuous and discrete z are the analogs of continuous and discrete y from the supervised learning topics we covered. Previously, we were given y, but now we do not know what the categories or embeddings z are and need to infer them ourselves.
One example of a nonprobabilistic embeddings algorithm is Principal Component Analysis (PCA), which is the focus of today’s lecture.
For an initial example, imagine if we had two-dimensional data on people’s heights and leg lengths. The heights and leg lengths are highly correlated. In a 2D plot, the points all lie very close to some line with positive slope. In this case, that line is a low dimensional axis (it’s just one dimension) that explains our data very well. It seems like we may not need two dimensions to represent our data, because we can approximate each piece of data pretty well using just points along this one line.
The problem we’ll explore today is, how can we identify this hidden axis?
Getting Started: Let's Make This Linear
Say we have $N$ datapoints $x_n$, each with dimension $D$. Suppose we try to find a set of K D-dimensional vectors {$u_k$} such that we can approximate some $x_n$ as a linear combination of them:
$$x_n \approx z_{n1}u_1 + z_{n2}u_2 + z_{n3}u_3 + ... + z_{nK}u_K = Uz_n$$
Here, $z_{n1}$, $z_{n2}$, etc. are scalars, and $u_1$, $u_2$, etc. are vectors. $z_n = (z_{n1}$, $z_{n2},...)$ is a K-dimensional vector that describes the position of $x_n$ with respect to the basis U. If we think of $z_n$ as a vector and $U$ as a matrix whose columns are the $u_k$'s, we can rewrite this approximation as the matrix-vector product $Uz_n$.
We are specifically interested in cases where K < D, which means we are compressing the data from D dimensions to K dimensions. Another view is that we are embedding D-dimensional data into K dimensions. By having our reconstruction of x be a linear combination of the u’s weighted by the z’s, we are expressing x in terms of our K basis vectors (the u’s).
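To make the shapes concrete, here is a minimal numpy sketch (not from the lecture; the sizes and variable names are illustrative assumptions) of reconstructing one data point from its low-dimensional embedding:

```python
import numpy as np

# Hypothetical sizes: D-dimensional data compressed to K dimensions (K < D).
D, K = 5, 2
rng = np.random.default_rng(0)

U = rng.standard_normal((D, K))   # columns are the basis vectors u_1, ..., u_K
z_n = rng.standard_normal(K)      # K-dimensional embedding of one data point

# Reconstruction of x_n as a linear combination of the basis vectors:
# x_n ≈ z_{n1} u_1 + ... + z_{nK} u_K = U z_n
x_n_approx = U @ z_n
print(x_n_approx.shape)           # (5,) -- back in the original D-dimensional space
```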
Minimizing the Reconstruction Error
We want our approximation of $x_n$ to be accurate. That is, the loss we are looking to minimize is a measure of the difference between $x_n$ and its approximation $Uz_n$:
$$L(\{z_n\}_{n=1}^N, U) = \frac{1}{N} \sum^N_n ||x_n - U z_n||^2_2$$
Where our $x_n$ is a D by 1 data point, our U is a D by K matrix (its columns are the basis vectors $u_k$), and our $z_n$ is a K by 1 vector (the embedding). Like before, we are trying to minimize the reconstruction error.
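A hedged numpy sketch of this loss, assuming the data, basis, and embeddings are stacked into matrices X (N by D), U (D by K), and Z (N by K); the function name is made up for illustration:

```python
import numpy as np

def reconstruction_loss(X, U, Z):
    """Mean squared reconstruction error: (1/N) * sum_n ||x_n - U z_n||^2_2.

    X : (N, D) data matrix, one data point per row
    U : (D, K) matrix whose columns are the basis vectors u_k
    Z : (N, K) matrix of embeddings, one z_n per row
    """
    residuals = X - Z @ U.T                      # row n is x_n - U z_n
    return np.mean(np.sum(residuals ** 2, axis=1))
```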
Adding Some Constraints
The solution in the above minimization is not unique. Specifically, $Uz$ can be broken down into $UQQ^{-1}z$ where $Q$ is any $K$ by $K$ invertible matrix. Then we could set a new $U$ to be $UQ$ and a new $z$ to be $Q^{-1}z$ and attain the same loss as before.
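A small numerical illustration of this symmetry (a sketch; the random matrices are just for demonstration, and a random Q is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 4, 2
U = rng.standard_normal((D, K))
z = rng.standard_normal(K)

Q = rng.standard_normal((K, K))           # some invertible K x K matrix
U_new = U @ Q                             # new basis
z_new = np.linalg.solve(Q, z)             # Q^{-1} z, the matching new embedding

print(np.allclose(U @ z, U_new @ z_new))  # True: same reconstruction, different (U, z)
```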
Adding constraints on U can help to reduce the symmetries in the loss in addition to providing U's with nice properties. Depending on the application, we might want U to be sparse, or for U to be non-negative. Different projects will use different constraints on U depending on what properties they are looking for. Today, we will just add the constraint that U is orthonormal.
We will have:
$\langle u_k, u_k \rangle = 1$ (constrains any scaling)
$\langle u_k, u_{k'} \rangle = 0$ when $k \neq k'$ (makes the basis vectors orthogonal)
We get some nice linear algebra properties from setting the above constraints. Specifically, we can easily recover z given x. We have:
$$x_n \approx z_{n1}u_1 + z_{n2}u_2 + z_{n3}u_3 + ... + z_{nK}u_K$$
We can multiply both sides on the left by $u_k^T$:
$$u_k^Tx_n \approx z_{n1}u_k^Tu_1 + z_{n2}u_k^Tu_2 + ... + z_{nK}u_k^Tu_K$$
Using the constraints above, the terms $u_k^Tu_1$, $u_k^Tu_2$, etc. vanish whenever the indices differ, leaving us with:
$$u_k^Tx_n = z_{nk}u_k^Tu_k$$
$u_k^Tu_k$ evaluates to 1, so by stacking this result for every $k$ (i.e., multiplying by the entire $U^T$ instead of just $u_k^T$), we conclude that:
$$u_k^Tx_n = z_{nk}$$
$$\implies U^Tx_n = z_n$$
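A quick numerical check of this property (a sketch, not from the lecture): we build an orthonormal U via a QR factorization, which is just one convenient way to get orthonormal columns, and verify that $z_n = U^Tx_n$ recovers the coefficients when $x_n$ lies in the span of the basis.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 5, 2

# One convenient way to get a D x K matrix with orthonormal columns.
U, _ = np.linalg.qr(rng.standard_normal((D, K)))
assert np.allclose(U.T @ U, np.eye(K))   # <u_k, u_k> = 1 and <u_k, u_k'> = 0

# If x_n lies in the span of the basis, z_n = U^T x_n recovers its coefficients.
z_true = rng.standard_normal(K)
x_n = U @ z_true
print(np.allclose(U.T @ x_n, z_true))    # True
```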
Let's Make Things More Interpretable
Let's improve upon our first attempt at the problem. Specifically, instead of reconstructing the whole thing, let's try to reconstruct the difference of the data from the mean. Below, we break down the data in this way, where x is approximated as a perturbation from the mean datapoint $\bar{x}$:
$$x_n \approx \bar{x} + z_{n1}u_1 + z_{n2}u_2 + ... + z_{nK}u_K$$
Now, the bases are helping us capture variation from the mean instead of being used to approximate all of $x$ like before. Our new optimization is as follows:
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n ||(x_n - \bar{x}) - Uz_n||^2_2$$
Where $(x_n - \bar{x})$ is the new quantity we are trying to reconstruct. We also keep our constraint that U is orthonormal. From linear algebra, we know that if K = D, the orthonormal basis $\{u_1, ..., u_D\}$ spans the whole space, so we would have perfect reconstruction and incur no loss, because we are keeping x in the same dimension as before. Now we will use that fact: imagine we have extra basis vectors $u_{K+1}, ..., u_D$ available and rewrite $x_n$ as follows:
$$x_n = \bar{x} + z_{n1}u_1 + ... + z_{nK}u_K + z_{n,K+1}u_{K+1} + ... + z_{nD}u_D$$
Then, we can take our loss and substitute in this expression for $x_n - \bar{x}$.
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n ||\sum^D_{d=1} z_{nd} u_d - \sum^K_{d=1} z_{nd} u_d||^2_2$$
Because the two sums share the terms for $d = 1, ..., K$, those terms cancel and only the indices $d = K+1, ..., D$ remain:
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n ||\sum^D_{d=K+1} z_{nd} u_d||^2_2$$
This can be expanded as an inner product:
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n \left(\sum^D_{d=K+1} z_{nd} u^T_d \right ) \left (\sum^D_{d'=K+1} z_{nd'} u_{d'} \right )$$
And invoking orthonormality ($u_d^Tu_{d'} = 0$ when $d \neq d'$, and $u_d^Tu_d = 1$) we can simplify into:
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n \sum^D_{d=K+1} z^2_{nd}$$
Now we substitute in $z_{nd} = u^T_d (x_n - \bar{x})$:
$$L(\{z_n\}, U) = \frac{1}{N} \sum_n \sum^D_{d=K+1} u^T_d (x_n - \bar{x})(x_n - \bar{x})^Tu_d$$
We can rearrange:
$$L(\{z_n\}, U) = \sum^D_{d=K+1} u^T_d \left[\frac{1}{N} \sum_n (x_n - \bar{x})(x_n - \bar{x})^T\right]u_d$$
The middle term, $\frac{1}{N} \sum_n (x_n - \bar{x})(x_n - \bar{x})^T$, is the empirical covariance of x, $\Sigma$, which is pretty cool! Our solution is something related to the covariance of the data we’ve seen.
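A one-function numpy sketch of this quantity (the function name is an assumption for illustration); note that np.cov divides by N-1 by default, so bias=True is needed to match the 1/N convention used here:

```python
import numpy as np

def empirical_covariance(X):
    """Sigma = (1/N) * sum_n (x_n - x_bar)(x_n - x_bar)^T for an (N, D) data matrix X."""
    X_centered = X - X.mean(axis=0)
    return (X_centered.T @ X_centered) / X.shape[0]

# Sanity check against numpy's built-in (bias=True uses 1/N rather than 1/(N-1)).
X = np.random.default_rng(0).standard_normal((100, 3))
print(np.allclose(empirical_covariance(X), np.cov(X, rowvar=False, bias=True)))  # True
```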
Solving to Minimize Reconstruction Error
How do we solve the above minimization? Suppose there were only one direction $u$ to solve for. Then the problem becomes:
$$\textrm{min}_u \; u^T\Sigma u \textrm{ such that } u^Tu = 1$$
Next, we use the Lagrangian below:
$$\textrm{min}_u \; u^T\Sigma u + \lambda (1 - u^Tu)$$
$$\textrm{Setting } \frac{\partial \textrm{ obj}}{\partial u} = 0 \textrm{ leads to } \Sigma u = \lambda u$$
So it turns out that the solution will be an eigenvector of the sample covariance!
We can now go back to the main problem given this information.
$$\textrm{min}_U \sum^D_{d=K+1} u^T_d \Sigma u_d = \sum^D_{d=K+1} u^T_d \lambda_d u_d = \sum^D_{d=K+1} \lambda_d$$
Remember that our ultimate goal was to minimize the reconstruction error. It turns out that we can do this by “leaving out” the directions (eigenvectors of $\Sigma$) with the smallest eigenvalues, which keeps the discarded error as small as possible. In an alternative view, we minimize the error when we KEEP the eigenvectors with the top K eigenvalues of $\Sigma$. Luckily, this computation is very easy using an eigendecomposition (or SVD)! Since $\Sigma$ is symmetric, it is guaranteed to have D real eigenvalues by the spectral theorem.
Note: the principal components are the top eigenvectors, which get preserved for use in reconstruction. This is what the name PCA refers to.
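Putting the pieces together, here is a hedged end-to-end sketch (function and variable names are assumptions, not a reference implementation) that computes the principal components from the eigendecomposition of $\Sigma$, keeps the top K, and checks that the reconstruction loss equals the sum of the discarded eigenvalues:

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the empirical covariance (a sketch).

    X : (N, D) data matrix.  Returns the mean, the (D, K) matrix of top-K
    eigenvectors, the (N, K) embeddings, and all eigenvalues in descending order.
    """
    x_bar = X.mean(axis=0)
    X_c = X - x_bar
    Sigma = (X_c.T @ X_c) / X.shape[0]

    # Sigma is symmetric, so eigh returns real eigenvalues (in ascending order).
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending
    U = eigvecs[:, :K]                                   # top-K principal components

    Z = X_c @ U                                          # z_n = U^T (x_n - x_bar)
    return x_bar, U, Z, eigvals

# Tiny usage example on synthetic, highly correlated 2-D data.
rng = np.random.default_rng(0)
t = rng.standard_normal(200)
X = np.column_stack([t, 0.5 * t + 0.05 * rng.standard_normal(200)])

x_bar, U, Z, eigvals = pca(X, K=1)
X_recon = x_bar + Z @ U.T                                # reconstruct from the 1-D embedding
loss = np.mean(np.sum((X - X_recon) ** 2, axis=1))
print(np.isclose(loss, eigvals[1:].sum()))               # True: loss = sum of discarded eigenvalues
```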
Alternative View: Preserving Variance
We can also take a variance view of the exact same problem. Before, we were minimizing reconstruction error, i.e., limiting each point’s distance to our newly chosen basis. Alternatively, we can think of the problem as variance preservation: we want the vectors that capture the most variance in the data x, i.e., the vectors that maximize the variance captured. If our basis vectors pointed in directions where the data didn’t vary, then scaling along them would not be very useful for approximating the data.
Suppose we wanted a single vector u that captured most of the variance in x.
$$z = u^Tx$$
Where z is a scalar.
$$\textrm{var}(z) = u^T\textrm{var}(x)u = u^T \Sigma u$$
Constraint: $u^Tu = 1$
We want to maximize var(z):
$$\max_u u^T \Sigma u \textrm{ such that } u^Tu = 1$$
Apply the Lagrangian and we'll see that $\Sigma u = \lambda u$, where we pick the biggest $\lambda$. See Section 7.3.4, "Identifying Directions of Maximal Variance in our Data", in the textbook for a more thorough walkthrough of this derivation.
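A quick numerical sanity check of the variance view (a sketch with made-up synthetic data): the variance of the data projected onto the top eigenvector equals the largest eigenvalue, and exceeds the variance captured by a random direction.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.standard_normal(500)
X = np.column_stack([t, 0.5 * t + 0.05 * rng.standard_normal(500)])
X_c = X - X.mean(axis=0)
Sigma = (X_c.T @ X_c) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)
u_top = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue

u_rand = rng.standard_normal(2)
u_rand /= np.linalg.norm(u_rand)         # a random unit vector, for comparison

print(np.isclose(np.var(X_c @ u_top), eigvals[-1]))   # True: captured variance = top eigenvalue
print(np.var(X_c @ u_top) >= np.var(X_c @ u_rand))    # True: no direction captures more
```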
Final Notes
The subspace found by PCA is unique, but other orthonormal bases of that subspace achieve the same reconstruction error. They just aren’t the bases found by PCA.
Practical advice: when doing PCA, remember to subtract the mean! Forgetting to subtract the mean changes your results, because you end up decomposing the uncentered second-moment matrix instead of the covariance $\Sigma$.
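As a hedged illustration of why centering matters (synthetic data and variable names made up for this sketch): if the data sit far from the origin and you skip the mean subtraction, the top eigenvector tends to align with the direction of the mean rather than the direction of greatest variation.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.standard_normal(500)
# Correlated 2-D data shifted far from the origin.
X = np.column_stack([t, -0.8 * t + 0.1 * rng.standard_normal(500)]) + np.array([50.0, 50.0])

second_moment = (X.T @ X) / X.shape[0]            # forgot to subtract the mean
X_c = X - X.mean(axis=0)
covariance = (X_c.T @ X_c) / X.shape[0]           # centered, as PCA requires

u_no_center = np.linalg.eigh(second_moment)[1][:, -1]   # aligned with the mean direction (up to sign)
u_centered = np.linalg.eigh(covariance)[1][:, -1]       # aligned with the true axis of variation (up to sign)

print(np.round(u_no_center, 2), np.round(u_centered, 2))
```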