Relevant Textbook Sections: 2.3-2.5, 2.6.1, 2.7.1
Cube: Supervised, Continuous, Nonprobabilistic
Lecture 2 Summary
Relevant Videos
Introduction
In the previous lecture, we covered nonparametric regression. As a recap, a nonparametric model doesn't have a fixed number of parameters used to make predictions, which means it doesn't need to make strong assumptions about the data. Instead, as the amount of training data increases, nonparametric models tend to become more and more complex as they incorporate the new data.
In this lecture, we cover parametric regression. These models have a fixed set of parameters used to make predictions, which often means stronger assumptions are made about the data. Specifically, no matter how much more training data we have, these models have the same level of complexity, as the number of parameters stays fixed. Here, more training data simply helps us improve our choice of parameters. The simplest example of parametric regression is linear regression, which will be the focus of today's lecture. Having a solid understanding of linear regression will make learning other parametric models much easier.
General Form of All Regressions
- Choose a function class (examples: linear, KNN)
- Choose a loss function (examples: L2 loss, L1 loss)
- Define $\textbf{w}^*$ that minimizes the loss: $\textbf{w}^* = \text{argmin}_{\textbf{w}} L(\textbf{w})$
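To make the recipe concrete, here is a minimal numpy/scipy sketch (the toy dataset and the function names `predict` and `l2_loss` are made up for illustration): pick a linear function class, pick the L2 loss, and hand the loss to a generic numerical optimizer to approximate $\textbf{w}^* = \text{argmin}_{\textbf{w}} L(\textbf{w})$. The rest of the lecture works this particular combination out by hand.

```python
import numpy as np
from scipy.optimize import minimize

# 1. Function class: linear, y_hat = w^T x (a bias column of ones is assumed
#    to already be appended to X; see the bias trick later in these notes).
def predict(w, X):
    return X @ w

# 2. Loss function: L2 loss, half the sum of squared residuals.
def l2_loss(w, X, y):
    return 0.5 * np.sum((y - predict(w, X)) ** 2)

# 3. w* = argmin_w L(w), here approximated with a generic numerical optimizer.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # toy data: N = 20, d = 2
y = rng.normal(size=20)
w_star = minimize(l2_loss, x0=np.zeros(X.shape[1]), args=(X, y)).x
```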
Linear Regression with L2 Loss
In lecture, we picked our function class to be linear, and our loss function to be L2, the sum of the squared residuals.
1. Choose A Function Class: Linear
Because we've picked a linear class, we start with the following:
$$\hat{y} = w_0 + w_1x_1 + w_2x_2 + ... + w_dx_d$$
$w_0$, $w_1$ ... $w_d$ are the weight parameters.
$x_1$, $x_2$ ... $x_d$ together form a single datapoint $\textbf{x}$ with $d$ dimensions.
Sidenote: Merging Bias Trick
Before we move on, let us use the bias merging trick to rewrite the equation in a more condensed manner. To do this, we first set $x_0 = 1$ for each datapoint; essentially, we add an imaginary dimension 0 that is set to 1 for every single datapoint. Now, if we organize all the weight parameters into one column vector $\textbf{w}$ and organize each datapoint (including the 1 we just appended) into a column vector $\textbf{x}$, we can rewrite the equation as:
$$\hat{y} = \textbf{w}^T\textbf{x}$$
This works because when you multiply it out, you get:
$$\hat{y} = w_0 \cdot 1 + w_1x_1 + w_2x_2 + ... + w_dx_d = w_0 + w_1x_1 + w_2x_2 + ... + w_dx_d$$
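As a quick sketch with a made-up toy matrix, the bias trick in numpy amounts to stacking a column of ones onto the data, after which every prediction is just the dot product $\textbf{w}^T\textbf{x}$:

```python
import numpy as np

# Toy dataset: N = 3 datapoints, d = 2 features each (values are arbitrary).
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

# Bias trick: prepend x_0 = 1 to every datapoint so that w_0 acts as the intercept.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape (N, d + 1)

w = np.array([0.5, -1.0, 2.0])  # w_0, w_1, w_2

# y_hat = w^T x for each datapoint; for the whole dataset this is X @ w.
y_hat = X @ w
print(y_hat)  # [3.5  5.5  7.5]
```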
2. Choose A Loss Function: L2 Loss
Because we have selected an L2 Loss, we start with the following:
$$L(\textbf{w}) = \frac{1}{2} \sum_n (y_n - \hat{y}_n)^2$$
Now, we can substitute in our $\hat{y}$ expression from before.
$$L(\textbf{w}) = \frac{1}{2}\sum_n (y_n - \textbf{w}^T\textbf{x}_n)^2$$
Student Question: Why did we select the L2 loss?
It turns out that picking a linear function class with an L2 loss leads to a computationally convenient problem when we solve for the optimal $\textbf{w}^*$. This is discussed further in Section 2.5.1 of the textbook.
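Restated in code, the loss is a one-liner; this sketch assumes $\textbf{X}$ already contains the bias column from the trick above:

```python
import numpy as np

def l2_loss(w, X, y):
    """Half the sum of squared residuals, L(w) = 1/2 * sum_n (y_n - w^T x_n)^2.
    X is N x (d+1) with the bias column already appended; y has length N."""
    residuals = y - X @ w
    return 0.5 * np.sum(residuals ** 2)
```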
3. Define $\textbf{w}^*$ that Minimizes the Loss
There are several different approaches we can take to solve for $\textbf{w}^*$.
Note: the following assumes the 1's from the bias trick have already been appended to the data.
How to Solve for $\textbf{w}^*$: Method 1
One method is to use gradient descent. This will be covered in more detail in future lectures.
First, initialize $\textbf{w}$ to some random values. Then, take the gradient of the loss with respect to $\textbf{w}$.
$$\frac{\partial L(\textbf{w})}{\partial \textbf{w}} = \sum_n (y_n - \textbf{w}^T\textbf{x}_n)(-\textbf{x}_n)$$
Next, update $\textbf{w}$ in the following way:
$$\textbf{w}^{(t+1)} = \textbf{w}^{(t)} - \eta\frac{\partial L(\textbf{w}^{(t)})}{\partial \textbf{w}} = \textbf{w}^{(t)} - \eta\sum_n (y_n - (\textbf{w}^{(t)})^T\textbf{x}_n)(-\textbf{x}_n)$$
Repeat this until convergence. $(y_n - \textbf{w}^T\textbf{x}_n)$ is a scalar and $(-\textbf{x}_n)$ is a vector, so the result will be a vector of weights.
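A minimal numpy sketch of this procedure (the step size, iteration count, and toy data are made up; a real implementation would check for convergence instead of running a fixed number of steps):

```python
import numpy as np

def fit_gradient_descent(X, y, eta=1e-3, n_iters=5000):
    """Minimize the L2 loss by gradient descent.
    X is N x (D+1) with the bias column already appended; y has length N."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])      # initialize w to random values
    for _ in range(n_iters):
        residuals = y - X @ w            # (y_n - w^T x_n) for every n
        grad = -X.T @ residuals          # sum_n (y_n - w^T x_n)(-x_n)
        w = w - eta * grad               # w^(t+1) = w^(t) - eta * gradient
    return w

# Toy usage: data generated from y = 1 + 2x plus a little noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)
print(fit_gradient_descent(X, y))        # approximately [1, 2]
```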
How to Solve for $\textbf{w}^*$: Method 2
Another method is to solve for $\textbf{w}^*$ directly. In the case of a linear function class with L2 loss, there is a clean closed-form solution: we take the gradient of the loss, set it to 0, and solve for $\textbf{w}^*$.
In this version of the derivation, there are $D$ dimensions in the data and $N$ data points. $\textbf{y}$ is an $N$-dimensional column vector, $\textbf{w}$ is a $(D+1)$-dimensional column vector of weights, and $\textbf{X}$ is an $N \times (D+1)$ matrix of data such that the $n^{\text{th}}$ row is given by $\textbf{x}_n^T$. We start with the sum of squared residuals, written in matrix form.
$$L(\textbf{w}) = \frac{1}{2}(\textbf{y} - \textbf{X}\textbf{w})^T(\textbf{y} - \textbf{X}\textbf{w})$$
Let us take the gradient:
$$\frac{\partial L(\textbf{w})}{\partial \textbf{w}} = -\textbf{X}^T(\textbf{y}-\textbf{X}\textbf{w})$$
We use the formula $\frac{ \partial (\textbf{a} -\textbf{C} \textbf{b})^T \textbf{D} (\textbf{a} - \textbf{C} \textbf{b})}{\partial \textbf{b}} = -2 \textbf{C}^T \textbf{D} (\textbf{a} - \textbf{C}\textbf{b})$ (equation 64 in the cookbook), where $\textbf{D}$ must be symmetric. In our case $\textbf{D}$ is the identity matrix, and the factor of 2 cancels the $\frac{1}{2}$ in the loss.
Note: The derivative can also be taken by first expanding the loss function and computing the derivatives term by term.
Now we set it to 0 and find $\textbf{w}^*$:
$$\begin{align*}
0 &= -\textbf{X}^T\textbf{y} + \textbf{X}^T\textbf{X}\textbf{w}^* \\
\textbf{X}^T\textbf{X}\textbf{w}^* &= \textbf{X}^T\textbf{y} \\
\textbf{w}^* &= (\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}
\end{align*}
$$
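In numpy, a sketch of this closed-form solution solves the normal equations $\textbf{X}^T\textbf{X}\textbf{w}^* = \textbf{X}^T\textbf{y}$ directly rather than forming the inverse explicitly, again assuming the bias column is already in $\textbf{X}$:

```python
import numpy as np

def fit_closed_form(X, y):
    """Return w* satisfying the normal equations X^T X w* = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem and is more robust
# when X^T X is ill-conditioned or singular:
# w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```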
For a given datapoint $\textbf{x}$ (with the bias 1 appended), the corresponding prediction is
$$\begin{align*}
\hat{y} &= (\textbf{w}^*)^T\textbf{x} = \left((\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}\right)^T\textbf{x} \\
\hat{y} &= (\textbf{X}^T\textbf{y})^T(\textbf{X}^T\textbf{X})^{-T}\textbf{x} \\
\hat{y} &= (\textbf{X}^T\textbf{y})^T(\textbf{X}^T\textbf{X})^{-1}\textbf{x}
\end{align*}
$$
where the last step uses the fact that $\textbf{X}^T\textbf{X}$, and hence its inverse, is symmetric.
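A small end-to-end sketch of fitting and then predicting for a new datapoint; the data here is synthetic and lies exactly on the line $y = 1 + 2x$, so the prediction recovers that line:

```python
import numpy as np

# Synthetic training data on the line y = 1 + 2x, with the bias column appended.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

w_star = np.linalg.solve(X.T @ X, X.T @ y)   # w* = (X^T X)^{-1} X^T y = [1, 2]

# Prediction for a new datapoint: append the bias 1, then take w*^T x.
x_new = np.array([1.0, 1.5])
print(w_star @ x_new)                        # 4.0, i.e. 1 + 2 * 1.5
```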
Statistics View:
Roughly speaking, $\textbf{X}^T\textbf{y}$ captures the correlation between each dimension of $\textbf{x}$ and $y$: if a dimension is highly correlated with $y$, it should receive a larger weight, since weighing it more helps predict $y$ better. $\textbf{X}^T\textbf{X}$ captures the variation in $\textbf{x}$, and its inverse normalizes the weights by that variation.
Linear Algebra View:
The $D+1$ columns of $\textbf{X}$ are vectors living in $N$ dimensions. By taking a linear combination of these vectors (with coefficients given by $\textbf{w}$), we are trying to produce a length-$N$ vector $\hat{\textbf{y}}$ that is as close to $\textbf{y}$ as possible. Our $\hat{\textbf{y}} = \textbf{X}\textbf{w}^*$ is the projection of $\textbf{y}$ onto the column space of $\textbf{X}$ (the row space in the lecture's transposed notation below).
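One way to see the projection interpretation numerically: at $\textbf{w}^*$ the residual $\textbf{y} - \textbf{X}\textbf{w}^*$ is orthogonal to every column of $\textbf{X}$, which is exactly the normal equations restated. A quick check on random data (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])  # N = 20, D = 3 plus bias
y = rng.normal(size=20)

w_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ w_star

# X^T (y - X w*) = 0: the residual is perpendicular to the column space of X,
# so X w* is the orthogonal projection of y onto that space.
print(np.allclose(X.T @ residual, 0.0))  # True
```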
Student Question: Why is $\textbf{w}$ a column vector? Simply convention. You can manipulate the dimensions and the various derivations will be slightly different, using different transposes in different places, but the meaning is all the same.
Important Note on Notation:
In lecture, we had $\textbf{y}$ as an $N$-dimensional row vector and $\textbf{X}$ as a $(D+1) \times N$ matrix. In this notation the loss is given by
$$L(\textbf{w}) = \frac{1}{2}(\textbf{y} - \textbf{w}^T\textbf{X})(\textbf{y} - \textbf{w}^T\textbf{X})^T$$
and the corresponding expressions for $\textbf{w}^*$ and $\hat{y}$ are
$$\textbf{w}^* = (\textbf{X}\textbf{X}^T)^{-1}\textbf{X}\textbf{y}^T$$
$$\hat{y} = (\textbf{X}\textbf{y}^T)^T(\textbf{X}\textbf{X}^T)^{-1} \textbf{x}$$
Conventions for notations vary between different sources so it is a useful skill to be able to understand and derive expressions in different notational formats.
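As a sanity check on the two conventions, this sketch builds the same random dataset both ways and confirms the two formulas give the same $\textbf{w}^*$ (with 1-D numpy arrays, the transpose on $\textbf{y}$ is implicit):

```python
import numpy as np

rng = np.random.default_rng(0)
X_notes = np.hstack([np.ones((10, 1)), rng.normal(size=(10, 2))])  # N x (D+1), as above
y = rng.normal(size=10)

X_lec = X_notes.T   # lecture convention: X is (D+1) x N, y is a length-N row vector

w_notes = np.linalg.solve(X_notes.T @ X_notes, X_notes.T @ y)   # (X^T X)^{-1} X^T y
w_lec = np.linalg.solve(X_lec @ X_lec.T, X_lec @ y)             # (X X^T)^{-1} X y^T

print(np.allclose(w_notes, w_lec))  # True: both conventions give the same weights
```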
Basis Regression
Sometimes, linear regression is not enough to model the input data, since all it does is scale our input variables and add them up. In order to create more complex models, we can use a basis to transform our data into more complicated forms, using polynomial functions, sine or cosine functions, and more. Essentially, we "move our data into a new basis": we transform the training data before performing the actual linear regression. Then, when we get test data and need to make predictions, we transform the test data in the same way and use our model to make predictions.
We use $\phi(\cdot)$ to denote a basis function, which is the transformation we'll apply to each $\textbf{x}$ data input.
For example, suppose we have an original point $\textbf{x} = (x_1, x_2)^T$; we might pick a basis function $\phi$ that transforms the datapoint into $\phi(\textbf{x}) = (x_1, x_2, \cos(x_1), {x_1}^4, \sin(x_2))^T$.
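A sketch of this workflow using the basis from the example above, on random data that is purely illustrative: transform every training point with $\phi$, append the bias column, fit exactly as before, and apply the same $\phi$ to any test point before predicting.

```python
import numpy as np

def phi(x):
    """Basis expansion from the example: (x1, x2) -> (x1, x2, cos(x1), x1^4, sin(x2))."""
    x1, x2 = x
    return np.array([x1, x2, np.cos(x1), x1 ** 4, np.sin(x2)])

# Random raw training data, purely for illustration: N = 50 points, 2 features each.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))
y = rng.normal(size=50)

# Transform every datapoint, append the bias column, and fit as usual.
Phi = np.array([phi(x) for x in X_raw])
Phi = np.hstack([np.ones((Phi.shape[0], 1)), Phi])
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# At test time, apply the same transformation before predicting.
x_test = np.array([0.3, -1.2])
y_hat = w_star @ np.concatenate([[1.0], phi(x_test)])
```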