
Lecture 4 Recap - Linear Classification

Date: February 3, 2022

Relevant Textbook Sections: 3.1 - 3.5

Cube: Supervised, Discrete, Nonprobabilistic



Announcements

  • Again, HW is extended and is now due Feb 11
  • Midterm 1 has been moved to Mar 1

Lecture 4 Summary



Classification vs Regression

We are still in the setting where, given features $x \in \mathbb R^D$, we hope to predict some label $y \in \mathcal Y$ where $\mathcal Y$ is our output space. Until now, we have been predicting continuous scalar labels $y \in \mathbb R$ using non-probabilistic and probabilistic regression. In classification, however, we instead predict discrete labels $y \in \{ C_k \}_{k=1}^K$ where $C_1, \ldots, C_K$ are our discrete possible classes.

For example, based on an image, we can classify what species a bug is. In this example, our class labels may look like $$\mathcal Y = \{\text{Fly}, \text{Ant}, \text{Bee}\}.$$

Obviously, we need a way to encode these discrete classes mathematically, and there are multiple ways to do this. You may see $$\mathcal Y = \{C_1, \ldots, C_K\}$$ as a general form. If there are only two classes, then you may see either $$\mathcal Y = \{-1, +1\} \quad \text{or} \quad \mathcal Y = \{0, 1\}.$$ An example $x_n$ is "positive" if it belongs to the positive class $y_n = +1$ or "negative" if $y_n = -1$. In many settings, we use "one-hot encoding," which maps each class label $C_k$ to a vector $y \in \mathbb R^K$ that is 0 everywhere except for a 1 in the $k$-th element: $$\mathcal Y = \left\{ \begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix},\, \begin{bmatrix}0\\1\\\vdots\\0\end{bmatrix},\, \ldots,\, \begin{bmatrix}0\\0\\\vdots\\1\end{bmatrix} \right\}$$

We have these different representations because each is convenient for a different setting. Conceptually, all of these are the same! They all say that labels $y$ are discrete.
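As a quick illustration, here is a minimal sketch of one-hot encoding in Python (numpy and the `one_hot` helper are our own choices, reusing the bug classes from above):

```python
import numpy as np

classes = ["Fly", "Ant", "Bee"]  # our K = 3 discrete classes
K = len(classes)

def one_hot(label):
    """Map class label C_k to a length-K vector: 0 everywhere except a 1 in slot k."""
    y = np.zeros(K)
    y[classes.index(label)] = 1.0
    return y

print(one_hot("Ant"))  # [0. 1. 0.]
```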

Non-Parametric Classification

We still have to ask "what is our model?" and "what is our objective?". The second question asks what it means for one model to be better than another at the task of classification.

Why don't we reuse regression? Well, KNN actually still works. We can find our $K$ nearest neighbors and, instead of averaging their labels, take the majority vote.

The advantage of nonparametric methods like KNN (and kernel methods more generally) is that they are super flexible. However, they can be very slow on large datasets, since the cost is paid at prediction time, and they can be difficult to interpret.
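Here is a minimal sketch of KNN classification with a majority vote (Euclidean distance, numpy arrays, and the `knn_classify` name are all our assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """Predict the majority label among the k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Note that all the work happens inside `knn_classify`: nothing is "fit" ahead of time, which is exactly why prediction is the slow part.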

Linear Classification

What about parametric regression? We can fit a linear regression model and use the line as a decision boundary, but this works poorly: squared error penalizes points far from the boundary even when they are on the correct side, so outliers drag the decision boundary toward them.

Assuming we are in a binary classification setting and are using the representation $\mathcal Y = \{-1, +1\}$, we can use the model $\hat y = \operatorname{sign}(f(x, w))$ where, today, our model class is linear: $$\hat y = \operatorname{sign}(w^\top x).$$

If $x \in \mathbb R^2$, then the equation for the decision boundary is $$0 = w_0 + w_1 x_1 + w_2 x_2,$$ where the bias $w_0$ is folded into $w$ by giving $x$ a constant first feature $x_0 = 1$.
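As a sketch, the decision rule is one line of code (the bias-folding convention and the `predict` name are ours):

```python
import numpy as np

def predict(w, x):
    """Classify x into {-1, +1} via sign(w^T x), with the bias w_0 folded into w."""
    x_aug = np.concatenate(([1.0], x))  # prepend the constant feature x_0 = 1
    return np.sign(w @ x_aug)
```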

0/1 Loss

Now that we have our model, we need to decide on a loss function. Here we run into a little bit of trouble, because one very natural loss function would be the 0/1 loss: simply count the number of times we are wrong: $$\ell_{0/1}(z) = \begin{cases}1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}.$$

This gives us an objective function $$\mathcal L_{\mathcal D}(w) = \sum_{n \in \mathcal D} \ell_{0/1}(-y_n(w^\top x_n)).$$

Here, $\ell_{0/1}$ gives us the loss for a single instance and $\mathcal L_{\mathcal D}$ gives us the total loss over the training dataset $\mathcal D$.
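In code, the 0/1 objective is just a count of mistakes (a sketch; we assume X already includes the constant bias feature and y is in {-1, +1}):

```python
import numpy as np

def zero_one_objective(w, X, y):
    """Sum of ell_{0/1}(-y_n w^T x_n): the number of misclassified points."""
    z = -y * (X @ w)       # z_n > 0 exactly when x_n is on the wrong side
    return np.sum(z > 0)
```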

The issue with this loss is that it doesn't consider uncertainty / distance from the boundary. This might be bad because we penalize the classifier the same amount for being wrong on a very ambiguous example as for being wrong on a super obvious one. The classifier will be confused; it doesn't know which direction to move in order to improve.

Mathematically, the gradient is zero (almost) everywhere: it does not tell us which direction we can move in to lessen the loss.

Hinge Loss

Let's choose a loss that discerns between slightly and very wrong classifications: $$\ell_{\text{hinge}}(z) = \begin{cases}z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}.$$

This gives us an objective function $$\mathcal L_{\mathcal D}(w) = \sum_{n \in \mathcal D} \ell_{\text{hinge}}(-y_n(w^\top x_n)).$$

If the signs do not match ($\operatorname{sign}(y_n) \neq \operatorname{sign}(w^\top x_n)$), then the classifier has put $x_n$ on the wrong side and is penalized by how far $x_n$ was from the boundary. Remember that if $w^\top x_n = 0$, we are on the decision boundary, so $|w^\top x_n| = -y_n w^\top x_n$ is a measure of distance for incorrectly classified points. You can imagine this loss as the sum of these distances over all incorrectly classified examples.
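A sketch of the hinge objective under the same conventions as before:

```python
import numpy as np

def hinge_objective(w, X, y):
    """Sum of ell_hinge(-y_n w^T x_n): total margin of the misclassified points."""
    z = -y * (X @ w)
    return np.sum(np.maximum(z, 0.0))
```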

We can also take the derivative of the hinge loss!

$$\nabla_w\, \mathcal L_{\mathcal D}(w) = \nabla_w \left(-\sum_{\text{bad $y$s}} y_n w^\top x_n \right) = -\sum_{\text{bad $y$s}} y_n x_n.$$

Remember we can optimize $w$ by gradient descent: $$w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla_w\, \mathcal L(w^{(t)}).$$
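Putting the gradient and the update rule together (a sketch; the step size eta and the step count are arbitrary choices for illustration):

```python
import numpy as np

def hinge_grad(w, X, y):
    """Gradient of the hinge objective: -sum of y_n x_n over misclassified points."""
    bad = (-y * (X @ w)) > 0                        # mask of wrong-side examples
    return -(y[bad][:, None] * X[bad]).sum(axis=0)

def gradient_descent(X, y, eta=0.1, steps=100):
    """Minimize the hinge objective by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - eta * hinge_grad(w, X, y)
    return w
```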

However, this can be hard to do if our dataset is massive. Instead, we can use stochastic gradient descent (SGD). The idea is that, at each step, we compute the gradient over only a random smaller subset of our data.

$$w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla_w\, \mathcal L_{\mathcal D^{(t)}}(w^{(t)})$$ where $\mathcal D^{(t)} \subset \mathcal D$ and $|\mathcal D^{(t)}| = M$.

We have to take more (noisy) steps in this case because each gradient is only an estimate, but because we calculate that gradient over less data, the process is overall more efficient.
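A sketch of the SGD loop, reusing `hinge_grad` from above (M, eta, and the step count are illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.1, M=32, steps=1000, seed=0):
    """Each step uses the hinge gradient of a random size-M subset D^(t) of D."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(X), size=min(M, len(X)), replace=False)  # sample D^(t)
        w = w - eta * hinge_grad(w, X[idx], y[idx])
    return w
```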

Suppose we choose SGD with $M = 1$ and $\eta = 1$, using the update rule that if $\hat y_n$ is incorrect, then $$w^{(t+1)} \leftarrow w^{(t)} + y_n x_n.$$

This happens to be the 1958 Perceptron Algorithm, and it converges to a perfect solution (with respect to the hinge loss) if the data is linearly separable. At the time, people found this by just searching for update procedures; only now can we understand it as a form of gradient descent.
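A sketch of the Perceptron as described (the `max_epochs` cap is a safeguard we add for data that is not separable):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """The 1958 Perceptron: nudge w by +y_n x_n whenever x_n is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:   # wrong side of (or exactly on) the boundary
                w = w + y_n * x_n
                mistakes += 1
        if mistakes == 0:              # every point classified correctly: done
            break
    return w
```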

Metrics

There are four error metrics:

| Metric | $y$ | $\hat y$ |
|---|---|---|
| True Positive | 1 | 1 |
| False Positive | 0 | 1 |
| True Negative | 0 | 0 |
| False Negative | 1 | 0 |

These metrics can be combined to determine different kinds of rates such as $$\begin{align*} \text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \\ \text{accuracy} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\\ \\ \text{true positive rate} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\ \text{false positive rate} &= \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \\ \text{recall} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\ \text{F1} &= \frac{2\cdot \text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} \end{align*} $$
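All of these follow directly from the four confusion counts (a sketch assuming labels in {0, 1}; note that recall and the true positive rate are the same quantity):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute the confusion counts and the derived rates listed above."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also the true positive rate
    return {
        "precision": precision,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall / TPR": recall,
        "FPR": fp / (fp + tn),
        "F1": 2 * precision * recall / (precision + recall),
    }
```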