Relevant Textbook Sections: 3.1 - 3.5
Cube: Supervised, Discrete, Nonprobabilistic
Lecture 4 Summary
Introduction
In the previous lecture, we covered probabilistic regression. As a recap, in probabilistic regression a generative model allows us to perform different types of inference, such as posterior inference over the weight parameters as well as posterior predictive inference for new data.
In this lecture, we cover linear classification. The goal of classification is to identify a category $y$ given $\mathbf{x}$, rather than to predict a continuous $y$.
Classification vs Regression
Conceptually, classification is not too different from regression. We follow the same general steps:
- Choose a model (linear vs non-linear boundary)
- Choose a loss function
We write $y \in \{C_1,\cdots,C_K\}$.
Depending on the problem, we encode $y$ as $0/1$, $+/-$, or as a one-hot vector such as $\begin{pmatrix} 0 & 0 & 1 & 0 \end{pmatrix}$.
We can still use KNN for classification by returning the majority vote of the neighbors of $\mathbf{x}$
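As a minimal sketch of this idea (assuming NumPy and Euclidean distance; `knn_predict` is a hypothetical helper, not something from the lecture), a majority-vote KNN classifier might look like:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Return the most common class label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```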
Linear Classification
Choose a Model: Linear Boundary
We introduce a new parametric model. It is simple, but we can use a basis $\phi$ to obtain more complex boundaries of separation.
$$\hat{y} = \text{sign}(\mathbf{w}^T\mathbf{x} + w_0)$$
Before deciding on a loss, let's just understand this model and what it does:
Consider the decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
In the 2D case:
$$\begin{align*}
w_1x_1 + w_2x_2 + w_0 &= 0 \\
x_2 &= -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}
\end{align*}$$
This is the equation of a line, so we have a linear boundary!
Generalizing: consider a vector $\mathbf{s}$ connecting two points $\mathbf{x_1}$ and $\mathbf{x_2}$ on the boundary ($\mathbf{s} = \mathbf{x_2} - \mathbf{x_1}$):
$$\begin{align*}
\mathbf{s}\cdot \mathbf{w} &= \mathbf{x_2}\cdot \mathbf{w} - \mathbf{x_1} \cdot \mathbf{w} \\
&= \mathbf{x_2}\cdot \mathbf{w} + w_0 - \mathbf{x_1} \cdot \mathbf{w} - w_0\\
&= 0 - 0 = 0
\end{align*}
$$
Hence $\mathbf{s}$ is orthogonal to $\mathbf{w}$.
This implies that $\mathbf{w}$ is orthogonal to the boundary, while $w_0$ gives the offset.
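To make this concrete, here is a small sketch (assuming NumPy; the weights and points below are made up for illustration) that predicts with $\text{sign}(\mathbf{w}^T\mathbf{x} + w_0)$ and checks numerically that a vector $\mathbf{s}$ lying in the boundary satisfies $\mathbf{s}\cdot\mathbf{w} = 0$:

```python
import numpy as np

def predict(w, w0, x):
    """Linear classifier: return the sign of w^T x + w_0 (+1 or -1)."""
    return np.sign(w @ x + w0)

# Illustrative 2D example (weights chosen arbitrarily)
w, w0 = np.array([2.0, 1.0]), -1.0

# Two points on the boundary w^T x + w_0 = 0, i.e. x2 = 1 - 2*x1
x_a = np.array([0.0, 1.0])
x_b = np.array([1.0, -1.0])
s = x_b - x_a                                # vector lying in the boundary
print(s @ w)                                 # 0.0: w is orthogonal to the boundary
print(predict(w, w0, np.array([3.0, 3.0])))  # 1.0: this point is on the positive side
```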
Choose a Loss Function: Hinge Loss
Let's consider the $0/1$ function:
$$
\ell_{0/1}(z) =
\left\{ \begin{array}{cc}
1 \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right.
$$
and the loss function
$$\mathcal{L}(\textbf{w}) = \sum_{n=1}^N \ell_{0/1}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right)$$
that penalizes if the signs of $y_n$ and $\mathbf{w}^T\mathbf{x}_n + w_0$ do not match.
There is, however, an issue with this loss: its gradient is uninformative. We are either right or wrong, with no measure of how far off an incorrect prediction is.
Let us now consider the hinge loss, or linear rectifier function,
$$
\ell_{\text{hinge}}(z) =
\left\{ \begin{array}{cc}
z \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right. = \max(0,z)
$$
and the loss function
$$\begin{align*}
\mathcal{L}(\textbf{w}) &= \sum_{n=1}^N \ell_{\text{hinge}}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right) \\
&= -\sum_{m \in S}y_m(\mathbf{w}^T\mathbf{x}_m + w_0)
\end{align*}
$$
where the set $S$ consists of all $n$ such that $\text{sign}(y_n) \neq \text{sign}(\mathbf{w}^T\mathbf{x}_n + w_0)$
Now, we can take gradients!
$$\frac{\partial}{\partial \mathbf{w}}\mathcal{L}(\textbf{w}) = -\sum_{m \in S}y_m\mathbf{x}_m$$
Note: we have absorbed the bias term into $\mathbf{w}$ here.
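As a minimal sketch of this loss and its gradient (assuming NumPy, labels in $\{-1,+1\}$, and a design matrix `X` with a column of ones appended so that the bias is indeed absorbed into $\mathbf{w}$):

```python
import numpy as np

def hinge_loss_and_grad(w, X, y):
    """Loss = sum over misclassified points of -y_n * (w^T x_n), labels y in {-1, +1}."""
    margins = y * (X @ w)      # y_n * w^T x_n for each data point
    S = margins < 0            # misclassified points (sign mismatch)
    loss = -np.sum(margins[S])
    grad = -X[S].T @ y[S]      # gradient: -sum over S of y_m * x_m
    return loss, grad
```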
How to solve for $\mathbf{w}^*$
We can use stochastic gradient descent to optimize $\mathbf{w}$: use a mini-batch of our data (good if the dataset is large, though the gradient is noisier!).
What if we took just one (incorrectly classified) datum:
$$\mathcal{L}^{(i)}(\mathbf{w}) = -y_i\mathbf{w}^T\mathbf{x}_i$$ and
$$\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$$
This is the 1958 Perceptron algorithm: if $\hat{y}_n = y_n$, do nothing; otherwise apply the update above, and repeat until there are no errors.
This converges if the data are linearly separable in the feature space.
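A possible implementation sketch of the algorithm under these assumptions (NumPy, $\pm 1$ labels, the bias absorbed into $\mathbf{w}$, and a cap on the number of passes in case the data are not linearly separable):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron: for each misclassified point, update w <- w + eta * y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or exactly on the boundary)
                w += eta * y_i * x_i
                errors += 1
        if errors == 0:                # a full pass with no mistakes: converged
            break
    return w
```

Each pass revisits every datum and stops once a full pass produces no mistakes, matching the "until no error" rule above.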
Metrics
There are four possible outcomes when comparing a prediction $\hat{y}$ with the true label $y$:
| Outcome | $y$ | $\hat{y}$ |
| --- | --- | --- |
| True Positive (TP) | 1 | 1 |
| False Positive (FP) | 0 | 1 |
| True Negative (TN) | 0 | 0 |
| False Negative (FN) | 1 | 0 |
These counts can be combined to compute different kinds of rates, such as
$$\begin{align*}
\text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \\
\text{accuracy} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\\ \\
\text{true positive rate} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{false positive rate} &= \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \\
\text{recall} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{F1} &= \frac{2\cdot \text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\end{align*}
$$
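As a sketch of how these quantities might be computed (assuming NumPy arrays of $0/1$ labels; the function name is ours, and divisions by zero are not handled):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Count TP/FP/TN/FN for 0/1 labels and derive the rates above."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # identical to the true positive rate
    return {
        "precision": precision,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true positive rate": recall,
        "false positive rate": fp / (fp + tn),
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }
```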