Relevant Textbook Sections: 3.1 - 3.5
Cube: Supervised, Discrete, Nonprobabilistic
Lecture 4 Summary
Introduction
In the previous lecture, we covered probabilistic regression. As a recap, in probabilistic regression a generative model allows us to perform different types of inference, such as posterior inference over the weight parameters as well as posterior predictive inference for new data.
In this lecture, we cover linear classification. The goal of classification is to identify a category $y$ given $\mathbf{x}$, rather than to predict a continuous $y$.
Classification vs Regression
Conceptually, classification is not too different from regression. We follow the same general steps:
- Choose a model (linear vs non-linear boundary)
- Choose a loss function
We write $y \in \{C_1,\cdots,C_K\}$.
Depending on the problem, we encode $y$ as $0/1$, $+/-$, or as a one-hot vector such as $\begin{pmatrix} 0 & 0 & 1 & 0 \end{pmatrix}$.
We can still use KNN for classification by returning the majority vote of the neighbors of $\mathbf{x}$
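As a minimal sketch of this idea (assuming NumPy and Euclidean distance; `knn_predict` is a hypothetical helper, not something from the lecture), a majority-vote KNN classifier might look like:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Return the most common class label among those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```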
Linear Classification
Choose a Model: Linear Boundary
We introduce a new parametric model. It is simple, but we can use a basis $\phi$ to obtain more complex boundaries of separation.
$$\hat{y} = \text{sign}(\mathbf{w}^T\mathbf{x} + w_0)$$
Before deciding on a loss, let's just understand this model and what it does:
Consider the decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
In the 2D case:
$$\begin{align*}
w_1x_1 + w_2x_2 + w_0 &= 0 \\
x_2 &= -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}
\end{align*}$$
This is the equation of a line, so we have a linear boundary!
Generalizing: consider a vector $\mathbf{s}$ connecting two points $\mathbf{x_1}$ and $\mathbf{x_2}$ on the boundary ($\mathbf{s} = \mathbf{x_2} - \mathbf{x_1}$):
$$\begin{align*}
\mathbf{s}\cdot \mathbf{w} &= \mathbf{x_2}\cdot \mathbf{w} - \mathbf{x_1} \cdot \mathbf{w} \\
&= \mathbf{x_2}\cdot \mathbf{w} + w_0 - \mathbf{x_1} \cdot \mathbf{w} - w_0\\
&= 0 - 0 = 0
\end{align*}
$$
Hence $\mathbf{s}$ is orthogonal to $\mathbf{w}$.
This implies that $\mathbf{w}$ is orthogonal to the boundary, while $w_0$ gives the offset.
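To make this concrete, here is a small sketch (assuming NumPy; the weights and points below are made up for illustration) that predicts with $\text{sign}(\mathbf{w}^T\mathbf{x} + w_0)$ and checks numerically that a vector $\mathbf{s}$ lying in the boundary satisfies $\mathbf{s}\cdot\mathbf{w} = 0$:

```python
import numpy as np

def predict(w, w0, x):
    """Linear classifier: return the sign of w^T x + w_0 (+1 or -1)."""
    return np.sign(w @ x + w0)

# Illustrative 2D example (weights chosen arbitrarily)
w, w0 = np.array([2.0, 1.0]), -1.0

# Two points on the boundary w^T x + w_0 = 0, i.e. x2 = 1 - 2*x1
x_a = np.array([0.0, 1.0])
x_b = np.array([1.0, -1.0])
s = x_b - x_a                                # vector lying in the boundary
print(s @ w)                                 # 0.0: w is orthogonal to the boundary
print(predict(w, w0, np.array([3.0, 3.0])))  # 1.0: this point is on the positive side
```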
Choose a Loss Function: Hinge Loss
Let's consider the $0/1$ function:
$$
\ell_{0/1}(z) =
\left\{ \begin{array}{cc}
1 \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right.
$$
and the loss function
$$\mathcal{L}(\textbf{w}) = \sum_{n=1}^N \ell_{0/1}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right)$$
that penalizes if the signs of $y_n$ and $\mathbf{w}^T\mathbf{x}_n + w_0$ do not match.
There is, however, an issue with this loss: its gradient is uninformative. We are either right or wrong, with no measure of how far off an incorrect prediction is.
Let us now consider the hinge loss, or linear rectifier function,
$$
\ell_{\text{hinge}}(z) =
\left\{ \begin{array}{cc}
z \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right. = \max(0,z)
$$
and the loss function
$$\begin{align*}
\mathcal{L}(\textbf{w}) &= \sum_{n=1}^N \ell_{\text{hinge}}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right) \\
&= -\sum_{m \in S}y_m(\mathbf{w}^T\mathbf{x}_m + w_0)
\end{align*}
$$
where the set $S$ consists of all $n$ such that $\text{sign}(y_n) \neq \text{sign}(\mathbf{w}^T\mathbf{x}_n + w_0)$
Now, we can take gradients!
$$\frac{\partial}{\partial \mathbf{w}}\mathcal{L}(\textbf{w}) = -\sum_{m \in S}y_m\mathbf{x}_m$$
Note: we have absorbed the bias term into $\mathbf{w}$ here.
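As a minimal sketch of this loss and its gradient (assuming NumPy, labels in $\{-1,+1\}$, and a design matrix `X` with a column of ones appended so that the bias is indeed absorbed into $\mathbf{w}$):

```python
import numpy as np

def hinge_loss_and_grad(w, X, y):
    """Loss = sum over misclassified points of -y_n * (w^T x_n), labels y in {-1, +1}."""
    margins = y * (X @ w)      # y_n * w^T x_n for each data point
    S = margins < 0            # misclassified points (sign mismatch)
    loss = -np.sum(margins[S])
    grad = -X[S].T @ y[S]      # gradient: -sum over S of y_m * x_m
    return loss, grad
```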
How to solve for $\mathbf{w}^*$
We can use stochastic gradient descent to optimize $\mathbf{w}$: use a mini-batch of our data (good if the dataset is large, though the gradient is noisier!).
What if we took just one (incorrectly classified) datum:
$$\mathcal{L}^{(i)}(\mathbf{w}) = -y_i\mathbf{w}^T\mathbf{x}_i$$ and
$$\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$$
This is the 1958 Perceptron algorithm: if $\hat{y}_n = y_n$, do nothing; otherwise apply the update above, and repeat until there are no errors.
This converges if the data are linearly separable in the feature space.
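A possible implementation sketch of the algorithm under these assumptions (NumPy, $\pm 1$ labels, the bias absorbed into $\mathbf{w}$, and a cap on the number of passes in case the data are not linearly separable):

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron: for each misclassified point, update w <- w + eta * y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (or exactly on the boundary)
                w += eta * y_i * x_i
                errors += 1
        if errors == 0:                # a full pass with no mistakes: converged
            break
    return w
```

Each pass revisits every datum and stops once a full pass produces no mistakes, matching the "until no error" rule above.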
Metrics
There are four possible outcomes when comparing a prediction $\hat{y}$ with the true label $y$:
| Outcome | $y$ | $\hat{y}$ |
| --- | --- | --- |
| True Positive (TP) | 1 | 1 |
| False Positive (FP) | 0 | 1 |
| True Negative (TN) | 0 | 0 |
| False Negative (FN) | 1 | 0 |
These counts can be combined to compute different kinds of rates, such as
$$\begin{align*}
\text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \\
\text{accuracy} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\\ \\
\text{true positive rate} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{false positive rate} &= \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \\
\text{recall} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{F1} &= \frac{2\cdot \text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\end{align*}
$$
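As a sketch of how these quantities might be computed (assuming NumPy arrays of $0/1$ labels; the function name is ours, and divisions by zero are not handled):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Count TP/FP/TN/FN for 0/1 labels and derive the rates above."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # identical to the true positive rate
    return {
        "precision": precision,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true positive rate": recall,
        "false positive rate": fp / (fp + tn),
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
    }
```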