
Lecture 7 Recap - Model Selection (Bayesian)

Date: February 16, 2021 (Concept Check, Class Responses, Solutions)

Relevant Textbook Sections: 2.8, 2.9


Lecture 7 Summary

Diversion: More Real World Context on Model Selection

We start the lecture by discussing a few more real-world examples. If a model's assumptions are inaccurate, the model can be wrong even when all of the math is correct. For instance, we might assume that data features are independent of each other when the real-world features are actually correlated.

Another example is a medical model that assumes antidepressants and patient features produce a recovery outcome, and that the correct drug could therefore be selected for each patient to help them recover. Machine learning was used to predict antidepressant recommendations, but the model ended up overfitting. In the data, some patients' illnesses were simply less severe, and these patients were likely to get better no matter what medication they were prescribed. The model would recommend a medication and such a patient would probably get well, but not because it was the best drug; the patient was just easy to treat.

Intro to Bayesian Model Selection

When we talked about classification, we had discriminative and generative models. Each one makes sense in different scenarios.
  • Discriminative: Use $x$ to predict $y$ by maximizing $p(y | x)$. As an example, we might use data about a customer’s features ($x$) to predict what they will buy ($y$).
  • Generative: Consider how the data is being produced by modeling $p(x,y)$. As an example, we might have data about patient symptoms ($y$) and want to know what disease produced them ($x$).
Earlier in Lecture 3, we discussed a generative setting where data was produced via the following story:

  1. We have an input $x$
  2. We have a noise sample $\epsilon \sim \mathcal{N}(0,\sigma^2)$
  3. The data $y$ is produced as $y = w^Tx + \epsilon$
In the Lecture 3 story, $w$ was fixed. Today, we will add a step 0 where we consider $w$ to be a random variable as well. (The $w$’s are the different models we are considering. For example, they might represent all possible decision boundary lines.) Through the Bayesian view, we can look at three useful concepts: the posterior distribution, the posterior predictive, and the marginal likelihood.
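To make the story concrete, here is a minimal sketch of sampling from it, assuming a standard Gaussian prior on $w$ (the lecture doesn't fix a particular prior, and the dimension, dataset size, and noise level below are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 3, 100, 0.5               # hypothetical dimension, dataset size, noise scale

w = rng.normal(0.0, 1.0, size=d)        # step 0: draw the model w from an assumed N(0, I) prior
X = rng.normal(size=(n, d))             # step 1: the inputs x
eps = rng.normal(0.0, sigma, size=n)    # step 2: noise eps ~ N(0, sigma^2)
y = X @ w + eps                         # step 3: y = w^T x + eps
```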

Posterior Distribution

After obtaining some data $X, y$, we can compute the posterior distribution $p(w | X,y)$. This tells us, “Given all the data you’ve seen in $X$ and $y$, these are the possibilities (and probabilities) for what $w$ will look like.”

How can we find this distribution? First observe that in the data generation story we're currently considering, we only need to know $w$, $x'$, and $\epsilon$ to get $y'$. So we can make simplifications such as $p(y' | w, x', X, y) = p(y' | w, x')$: once $w$ and $x'$ are known, $y'$ is conditionally independent of the previously observed data. Also, seeing more $x$'s alone does not tell us more about what $w$ is, but seeing more labeled pairs of $x$'s and $y$'s together does tell us more about $w$'s distribution. Thus, $p(w|X) = p(w)$, but $p(w|X, y) \neq p(w)$.

Using this along with the fact that $p(y | X)$ does not depend on $w$, we can apply Bayes’ Theorem and then do some simplifying: \begin{align*} p(w | X,y) &= \frac{p(y | X, w)p(w | X)}{p(y | X)} \\ &= \frac{p(y | X, w)p(w)}{p(y | X)} \\ &\propto p(y | X,w)p(w) \end{align*} Depending on the setting and prior, this could be an easy calculation (if there's conjugacy) or a hard calculation (if there's not).

Note: Conjugacy is when the prior and posterior are from the same family of distributions, like if they're both Gaussian or both Beta. It can provide nice mathematical properties and cleaner forms that make calculations much easier.
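To illustrate the proportionality above, here is a sketch that approximates a posterior numerically on a grid for a scalar $w$, assuming a hypothetical Gaussian prior and Gaussian noise; a grid approximation like this is one fallback when there is no conjugacy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(0.0, 0.5, size=20)      # simulate data with true w = 2

grid = np.linspace(-5, 5, 1001)                  # candidate values of w
log_prior = stats.norm.logpdf(grid, 0.0, 1.0)    # assumed N(0, 1) prior on w
log_lik = np.array([stats.norm.logpdf(y, w * x, 0.5).sum() for w in grid])

log_post = log_prior + log_lik                   # log of p(y|X,w) p(w), up to a constant
post = np.exp(log_post - log_post.max())         # exponentiate stably
post /= np.trapz(post, grid)                     # normalize so it integrates to 1
```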

Posterior Predictive

To make predictions, we use the posterior predictive $p(y' | x', X, y)$. For a new piece of test data $x'$, the posterior predictive tells us how likely the label for $x'$ is to be $y'$, given the data $X, y$ that we have already observed. To find the posterior predictive, we integrate over the possible $w$'s in our posterior distribution for $w$: $$p(y' | x', X, y) = \int_w p(y' | w, x')p(w | X, y)dw$$ Note: In the Bayesian framework, we avoid working directly with $w$. It's a random variable we don't know the value of, so instead of picking a single value for $w$, we integrate over its whole distribution to consider the likelihoods of all the different possible values.
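When this integral has no closed form, a common workaround is Monte Carlo: draw samples of $w$ from the posterior and average $p(y' | w, x')$ over them. A minimal sketch, assuming a hypothetical posterior $\mathcal{N}(1.8, 0.1^2)$ over a scalar $w$ and the Gaussian noise model from the data generation story:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 0.5                                      # noise scale from the generative story
w_samples = rng.normal(1.8, 0.1, size=10_000)    # draws from an assumed posterior p(w | X, y)
x_new, y_new = 1.5, 2.7                          # a hypothetical test point (x', y')

# p(y'|x',X,y) ~= (1/S) * sum_s p(y'|w_s, x'), where p(y'|w,x') = N(y'; w*x', sigma^2)
pred_density = stats.norm.pdf(y_new, w_samples * x_new, sigma).mean()
```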

Marginal Likelihood

Finally, we can compute the marginal likelihood, the probability of the observed labels given the inputs: $$p(y | X) = \int_w p(y | X,w)p(w)dw$$ The posterior predictive is what we use for inference (making predictions), but maximizing the marginal likelihood is what helps us select a model.

Setting Up a Beta-Bernoulli Model Example

Defining the Bernoulli Part

Let's consider a coin that comes up "1" (heads) with probability $\theta$. Let $x$ be the result of the toss. The probability of a flip landing a certain way comes from the Bernoulli PMF: $$p(x | \theta) = \theta^x (1 - \theta)^{1 - x}$$ If we want to write out the probability of a dataset of multiple coin flips, then we have: $$p(x_1, \ldots, x_n | \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1 - x_i}$$ $$p(x_1, \ldots, x_n | \theta) = \theta^{n_1}(1-\theta)^{n_0}$$ where $n_1$ is the number of heads and $n_0$ is the number of tails.
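As a quick numerical sanity check, the product form and the counts form agree on a small hypothetical set of flips:

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0])   # hypothetical coin flips
theta = 0.6
n1, n0 = flips.sum(), len(flips) - flips.sum()

product_form = np.prod(theta**flips * (1 - theta)**(1 - flips))  # prod_i p(x_i | theta)
counts_form = theta**n1 * (1 - theta)**n0                        # theta^n1 (1-theta)^n0
assert np.isclose(product_form, counts_form)
```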

Defining the Beta Part

Today we'll be considering a Bayesian perspective where $\theta \sim Beta(\alpha, \beta)$.

Now in our data generation story, Step 0 is drawing $\theta$ from a $Beta(\alpha, \beta)$ distribution. Step 1 is drawing $n$ coin flips $x_i \sim Bern(\theta)$ once $\theta$ is determined.

The Beta distribution is parameterized by $\alpha$ and $\beta$, which describe our prior beliefs about what the distribution for $\theta$ should look like. We can think of $\alpha$ and $\beta$ as pseudo-counts for how many heads and tails we've seen so far: if we set $\alpha > \beta$, it would be like saying we had already seen more heads than tails, and our prior density would be higher on the right. When $\alpha = \beta = 1$, the distribution is uniform, which best represents a case where we are clueless about what $\theta$ might be. If we set $\alpha = \beta > 1$, it reflects a belief that we've seen an equal number of heads and tails, and that $\theta$ is likely to be around $0.5$.
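Here is a quick sketch of these shapes using scipy's Beta density; the particular $(\alpha, \beta)$ pairs are just illustrative:

```python
from scipy import stats

# uniform, "more heads so far", and "fair coin" priors
for a, b in [(1, 1), (5, 2), (2, 2)]:
    densities = {t: round(stats.beta.pdf(t, a, b), 3) for t in (0.25, 0.5, 0.75)}
    print(f"Beta({a},{b}):", densities)
# Beta(1,1) is flat, Beta(5,2) puts more mass near 1, Beta(2,2) peaks at 0.5
```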

Calculating the Posterior

Starting from the proportionality we found earlier and plugging in the Beta PDF, we have: \begin{align*} p(\theta | X) &\propto p(X|\theta)p(\theta) \\ &= \theta^{n_1}(1-\theta)^{n_0}z_{\alpha, \beta}\theta^{\alpha - 1}(1-\theta)^{\beta - 1} \\ &= z_{\alpha,\beta} \theta^{n_1 + \alpha - 1}(1 - \theta)^{n_0 + \beta - 1} \end{align*} (Here, $z_{\alpha, \beta}$ is just the normalizing constant that makes the Beta PDF integrate to 1. We don't need to worry about it right now.)

We just rewrote our posterior into the form of another Beta distribution, specifically $Beta(n_1 + \alpha, n_0 +\beta)$. Recall that our prior was $Beta(\alpha, \beta)$. Since they come from the same distribution family, we have conjugacy.
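Conjugacy makes the posterior update pure bookkeeping; a minimal sketch (the function name is ours, not from lecture):

```python
def beta_posterior(alpha: float, beta: float, n1: int, n0: int) -> tuple[float, float]:
    """Update a Beta(alpha, beta) prior after observing n1 heads and n0 tails."""
    return alpha + n1, beta + n0

# e.g. a Beta(2, 2) prior plus two heads gives the Beta(4, 2) posterior
# used in the example below
print(beta_posterior(2, 2, n1=2, n0=0))   # (4, 2)
```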

Different Choices for Predictors

Let's say we're interested in inferring what $\theta$ is.

If we take the maximum likelihood frequentist approach, we would ignore the prior and get $\theta_{MLE} = \arg\max_{\theta}\ \theta^{n_1}(1-\theta)^{n_0}$, which leads us to an intuitive answer, the fraction of flips that came up heads: $$\theta_{MLE} = \frac{n_1}{n_1 + n_0}$$ The Bayesian approach gives us additional methods for inferring $\theta$.

Finding the Maximum a Posteriori (MAP)

Imagine that we used $\alpha = \beta = 2$ for our prior and that we've done two coin flips, both of which were heads. Then we'd have $n_1 = 2, n_0 = 0$. Using these values of $\alpha, \beta$ for our prior is like believing that we've already "seen" 2 heads and 2 tails that don't count as part of our data, so we are beginning with a belief that the coin is likely to be fair.

After seeing two heads, our posterior distribution would be shifted to be higher on the right side, towards 1. The point at the peak of the posterior distribution is the maximum a posteriori (MAP) estimate, which can be solved for as $$\theta_{MAP} = \frac{n_1 + \alpha - 1}{n_0 + n_1 + \alpha + \beta - 2} = \frac34.$$ The form is the same as $\theta_{MLE}$'s, except that the MAP estimate also includes the $\alpha$ and $\beta$ of the prior. The intuition here is that as the number of trials gets large, the data will overwhelm whatever prior you have.

Using a frequentist MLE approach where we only consider the data we've seen, we would have gotten $\theta_{MLE} = 1$ as our estimator because we have seen only heads. $\theta_{MAP}$ incorporates our prior beliefs about how $\theta$ is distributed in addition to the data we've observed.
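Reproducing both estimates for this example ($\alpha = \beta = 2$, two heads, zero tails):

```python
alpha, beta, n1, n0 = 2, 2, 2, 0

theta_mle = n1 / (n1 + n0)                                   # 1.0: trusts only the data
theta_map = (n1 + alpha - 1) / (n0 + n1 + alpha + beta - 2)  # 0.75: pulled toward the prior
print(theta_mle, theta_map)
```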

Using the Posterior Predictive

The most Bayesian person would actually integrate over the posterior distribution instead of choosing a value for $\theta$ — after all, $\theta$ is supposed to be a random variable we don't know the value of. Let's consider the probability that our next coin flip will be heads. Conveniently, this is equal to the expected value of $\theta$. Even more conveniently, since we know $\theta$ follows a Beta distribution, we can just look up what the expected value of a Beta random variable will be. \begin{align*} p(x = 1|X) &= \int p(x=1|\theta) p(\theta | x_1 ... x_n )d\theta \\ &= \mathbb{E}_{p(\theta | X)}(\theta) \\ &= \frac{\alpha + n_1}{\alpha + \beta + n_1 + n_0} = \frac46 = \frac23 \end{align*}
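The same number via the closed-form posterior mean of a Beta, with scipy as a check:

```python
from scipy import stats

alpha, beta, n1, n0 = 2, 2, 2, 0

# posterior is Beta(alpha + n1, beta + n0); its mean is the predictive p(x = 1 | X)
p_heads = (alpha + n1) / (alpha + beta + n1 + n0)
print(p_heads, stats.beta.mean(alpha + n1, beta + n0))   # both 0.666... = 2/3
```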

Quick Summary

We've now seen three ways to infer $\theta$:

  1. MLE: Let's predict using just the data we have seen.
  2. MAP: Let's compute the posterior distribution and select the value for $\theta$ with the highest posterior probability.
  3. Posterior Predictive: If we're feeling super Bayesian, we can avoid picking a single $\theta$ value. Instead, we'll integrate over the whole distribution to get an expected $\theta$.

Model Selection With the Marginal Likelihood

What if we didn't know whether to use $\alpha_1, \beta_1$ or $\alpha_2, \beta_2$ as our prior? We can make the decision by comparing $p(X| \alpha_1, \beta_1)$ vs. $p(X| \alpha_2, \beta_2)$ to see which option makes the data we observed more likely. \begin{align*} p(X) &= \int_{\theta}p(X | \theta)p(\theta)d\theta \\ &= \int_{\theta} \theta^{n_1}(1-\theta)^{n_0}z_{\alpha, \beta}\theta^{\alpha - 1}(1-\theta)^{\beta - 1}d\theta \\ &= z_{\alpha, \beta}\int_{\theta} \theta^{n_1 + \alpha - 1}(1 - \theta)^{n_0 + \beta - 1}\frac{z_{n_1 + \alpha, n_0 + \beta}}{z_{n_1 + \alpha, n_0 + \beta}}d\theta \\ &= \frac{z_{\alpha, \beta}}{z_{n_1 + \alpha, n_0 + \beta}} \int_{\theta} \theta^{n_1 + \alpha - 1}(1 - \theta)^{n_0 + \beta - 1}z_{n_1 + \alpha, n_0 + \beta}d\theta \\ &= \frac{z_{\alpha, \beta}}{z_{n_1 + \alpha, n_0 + \beta}} \cdot 1 \\ &= \frac{z_{\alpha, \beta}}{z_{n_1 + \alpha, n_0 + \beta}} \end{align*} The final integral is equal to $1$ because we manipulated the integrand into the PDF of a $Beta(n_1 + \alpha, n_0 + \beta)$ distribution, which integrates to $1$.
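Since $z_{\alpha, \beta} = 1/B(\alpha, \beta)$, where $B$ is the Beta function, this result is $p(X) = B(n_1 + \alpha, n_0 + \beta)/B(\alpha, \beta)$, which is easy to compute in log space. Here is a sketch comparing two hypothetical priors on hypothetical data:

```python
from scipy.special import betaln   # log of the Beta function B(a, b)

def log_marginal_likelihood(alpha, beta, n1, n0):
    # log p(X) = log B(n1 + alpha, n0 + beta) - log B(alpha, beta)
    return betaln(n1 + alpha, n0 + beta) - betaln(alpha, beta)

n1, n0 = 8, 2   # hypothetical data: 8 heads, 2 tails
for a, b in [(1, 1), (10, 10)]:
    print(f"Beta({a},{b}) prior:", log_marginal_likelihood(a, b, n1, n0))
# whichever prior gives the larger value makes the observed flips more probable
```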