Date: February 1, 2022
Relevant Textbook Sections: 2.6.2, 2.6.3
Cube: Supervised, Continuous, Probabilistic
Announcements
- HW1 has been extended one week, it is now due Feb 11
- Dining halls are at limited capacity, check the course calendar to see if your office hours will be in person or on Zoom
Lecture 3 Summary
Relevant Videos
Non-Probabilistic Regression
Last time, we set up linear regression: we chose our model class to be all linear functions of the features $x \in \mathbb R^D$. This model class is parametrized by the coefficients and the bias, which we stack together in the vector $w \in \mathbb R^D$. We say that a model $w$ makes the predictions $$\hat y = f(w, x) = w^\top x.$$ Note that we assume the bias trick is already implemented (namely, $x_1 = 1$ for all $x$).
We now choose a loss function by which we measure how good of a job a model $w$ does. Due to its simplicity, a common loss function is the sum of squared losses (SSL) over our training dataset $\mathcal D = \{ (x_n, y_n) \}_{n=1}^N$: $$\mathcal L(w) = \sum_{n=1}^N (y_n - \hat y_n)^2.$$ We choose the SSL because it is clean: there is an analytical, closed-form solution for $w^*$, the model parameters that globally minimize the SSL. For a long time, the SSL was popular simply because it is so easy. Nowadays, we are able to handle more complex losses and choose different loss functions depending on the application.
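As a minimal sketch (the function names here are ours, not from lecture), the closed-form minimizer is the least-squares solution of the normal equations:

```python
import numpy as np

def fit_w_ssl(X, y):
    """Closed-form SSL minimizer w* (the least-squares solution).

    X: (N, D) design matrix with the bias trick applied (first column all ones).
    y: (N,) vector of targets.
    """
    # Solves the normal equations X^T X w = X^T y in a numerically stable way.
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_star

def ssl(X, y, w):
    """Sum of squared losses of model w on the dataset."""
    r = y - X @ w
    return r @ r
```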
Magically, there is a story about how the SSL came to be, beyond just "wow, this loss is so easy, let's use it!". This story comes from the probabilistic perspective on linear regression.
Generative Stories
A generative model is a story about how the data came to be.
One story is that a label $y$ for datapoint $x$ was generated by some model $w$, but with some random noise added: $$y = \hat y + \epsilon = w^\top x + \epsilon, \quad \epsilon \sim \mathcal N(0, \sigma^2).$$ This story says that each $y$ was at one point $\hat y = w^\top x$, calculated from an underlying model $w$, but was perturbed by some random noise. The distribution of $\epsilon$ encodes how this noise came to be. The Gaussian distribution says that these residuals are likely to be nonzero values centered around 0, but are unlikely to be extremely large. Note that $w$ is a fixed constant and is not random. This means that $$y \sim \mathcal N(w^\top x, \sigma^2).$$
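Here is a minimal sketch of this generative story in numpy (the values of $N$, $D$, $w$, and $\sigma$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3
sigma = 0.5                              # noise standard deviation, so the variance is sigma**2
w_true = np.array([1.0, -2.0, 0.5])      # the fixed, non-random underlying model

X = rng.normal(size=(N, D))
X[:, 0] = 1.0                            # bias trick: the first feature is always 1

y_hat = X @ w_true                       # noiseless predictions w^T x_n
y = y_hat + rng.normal(0.0, sigma, size=N)   # perturb each prediction with Gaussian noise
```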
We can now calculate the "likelihood" of the model $w$, which is exactly the probability of the training labels given the inputs and the model. Assuming the datapoints are independent given the model, $$\ell(w) := p(\{y_n\}_{n=1}^N \mid \{x_n\}_{n=1}^N, w) = \prod_{n=1}^N p(y_n \mid x_n, w).$$
The goal, for now, is to maximize the likelihood to find the most likely model. This is a natural approach to picking a value for $w$: if we can write down an expression for the likelihood of the model, we can solve for the model $w$ with maximum likelihood.
Maximizing Likelihood
How do we maximize $\ell(w)$? Products are difficult to work with, so we instead optimize $\log \ell(w)$; because $\log$ is monotonic, $$\operatorname*{argmax}_w \log \ell(w) = \operatorname*{argmax}_w \ell(w).$$
Now we can rearrange the log-likelihood in order to reveal a connection to the SSL:
$$
\begin{aligned}
\log \ell(w) &= \log \prod_{n=1}^N p(y_n | x_n, w) \\
&= \sum_{n=1}^N \log p(y_n | x_n, w) \\
&= \sum_{n=1}^N \log \mathcal N (y_n; w^\top x_n, \sigma^2) \\
&= \sum_{n=1}^N \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{ -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\} \right) \\
&= \sum_{n=1}^N \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} \right) + \sum_{n=1}^N \log \left( \exp\left\{ -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\} \right) \\
&= -N \log \left(\sqrt{2\pi \sigma^2} \right) - \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - w^\top x_n)^2.
\end{aligned}
$$
Notice that the value of $w$ that maximizes $\log \ell(w)$ is also the value of $w$ that minimizes the SSL: the first term is constant in $w$ and the second term is proportional to the negative SSL. Hence, assuming Gaussian noise, the log-likelihood is maximized exactly when the sum of squared losses is minimized. This means that when we solve for $w$ in non-probabilistic regression by minimizing the SSL, we are implicitly assuming the residuals are Gaussian distributed. Also notice that this holds no matter the variance of the residuals (namely, we made no assumptions about $\sigma^2$ to arrive at this realization).
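As a quick numerical check (a sketch on synthetic data, not part of the lecture), the SSL minimizer attains at least as high a log-likelihood as nearby models, for any choice of $\sigma^2$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D)); X[:, 0] = 1.0             # bias trick
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.5, size=N)           # generative story from above

def log_likelihood(w, sigma2):
    # sum_n log N(y_n; w^T x_n, sigma^2)
    return norm.logpdf(y, loc=X @ w, scale=np.sqrt(sigma2)).sum()

w_ssl, *_ = np.linalg.lstsq(X, y, rcond=None)           # SSL minimizer

# For any sigma^2, the SSL minimizer also maximizes the log-likelihood.
for sigma2 in (0.1, 1.0, 10.0):
    w_other = w_ssl + rng.normal(scale=0.1, size=D)
    assert log_likelihood(w_ssl, sigma2) >= log_likelihood(w_other, sigma2)
```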
Maximizing Likelihood with Matrices
For both practice and as a check for our understanding, let us repeat this process with matrices:
$$
\begin{aligned}
\log \ell(w) &= \log p(\mathbf y | \mathbf X, \mathbf w) \\
&= \log \mathcal N (\mathbf y; \mathbf X \mathbf w, \sigma^2 \mathbf I_N) \\
&= \log \left( \frac{1}{\sqrt{(2\pi)^N |\sigma^2 \mathbf I_N|}} \exp\left\{ -\frac{1}{2} (\mathbf y - \mathbf X \mathbf w)^\top(\sigma^2 \mathbf I_N)^{-1}(\mathbf y - \mathbf X \mathbf w) \right\} \right) \\
&= \log \left( \frac{1}{\sqrt{(2\pi)^N |\sigma^2 \mathbf I_N|}} \right ) + \log \left( \exp\left\{ -\frac{1}{2} (\mathbf y - \mathbf X \mathbf w)^\top(\sigma^2 \mathbf I_N)^{-1}(\mathbf y - \mathbf X \mathbf w) \right\} \right) \\
&= -N \log \sqrt{2 \pi \sigma^{2}} - \frac{1}{2 \sigma^{2}}(\mathbf{y}-\mathbf{X w})^\top(\mathbf{y}-\mathbf{X w}).
\end{aligned}
$$
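As a sanity check (a sketch with made-up synthetic values), the matrix form of the log-likelihood agrees with the per-datapoint sum from before:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
N, D = 50, 3
X = rng.normal(size=(N, D)); X[:, 0] = 1.0     # bias trick
w = rng.normal(size=D)
sigma2 = 0.25
y = X @ w + rng.normal(0.0, np.sqrt(sigma2), size=N)

# Matrix form: log N(y; Xw, sigma^2 I_N)
ll_matrix = multivariate_normal.logpdf(y, mean=X @ w, cov=sigma2 * np.eye(N))

# Per-datapoint form: sum_n log N(y_n; w^T x_n, sigma^2)
ll_sum = norm.logpdf(y, loc=X @ w, scale=np.sqrt(sigma2)).sum()

assert np.isclose(ll_matrix, ll_sum)
```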
Estimating $\sigma^2$ instead of $w$
We can also take the derivative with respect to $\sigma^2$ to find the maximum likelihood estimate of $\sigma^2$. Now, we assume that $\mathbf{w}$ is given. Important note: we are looking at $\sigma^2$ specifically, the variance, not $\sigma$.
First, we start with the log-likelihood from before and expand the log:
$$-\frac{N}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} (\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w})$$
$$-\frac{N}{2} \log 2\pi - \frac{N}{2}\log\sigma^2 -\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w})$$
Then take the derivative with respect to $\sigma^2$ (note: not with respect to $\sigma$) and set it to zero:
$$\begin{align*}
0 &= -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w}) \\
0 &= -N\sigma^2 + (\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w}) \\
\sigma^2_{ML} &= \frac{1}{N}(\mathbf{y}-\mathbf{X}\mathbf{w})^T(\mathbf{y}-\mathbf{X}\mathbf{w})
\end{align*}
$$
This is the empirical variance of the residuals, which makes sense!
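In code, this estimate is just the mean squared residual (a sketch; the function name is ours):

```python
import numpy as np

def sigma2_mle(X, y, w):
    """ML estimate of the noise variance with w given:
    sigma^2_ML = (1/N) (y - Xw)^T (y - Xw)."""
    residuals = y - X @ w
    return residuals @ residuals / len(y)
```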
Making $w$ Random Too
What we did: the probabilistic approach to regression is (1) write down a generative model and (2) maximize the likelihood with respect to the parameters.
What if we were not totally clueless about $\mathbf{w}$? What if we knew it was
drawn from some particular $p(\mathbf{w})$? This leads us into the Bayesian view,
which we will get to in a few weeks!