Model Selection

Finding the true model and finding the model that predicts future data well lead to different criteria.

AIC, BIC

Bayesian vs frequentist ideas for model selection

AIC – an information criterion

Consider the maximized log likelihood $\log L(\hat\theta)$. We can't use this alone for model selection, because it always favors adding more parameters.

Therefore, we penalize for the number of parameters, $p$. So we prefer models where $\log L(\hat\theta) - p$ is larger.

Usually, we see this multiplied by $-2$, so that we prefer models with small $-2\log L(\hat\theta) + 2p$.
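As a concrete sketch, AIC can be computed directly from the maximized log likelihood. The data and the known-variance Gaussian setup below are illustrative assumptions, chosen to match the example later in these notes:

```python
import numpy as np

def gaussian_loglik(y, theta, sigma2=1.0):
    """Log likelihood of i.i.d. N(theta, sigma2) data at a given theta."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - theta) ** 2) / sigma2

def aic(maximized_loglik, p):
    """AIC = -2 * (maximized log likelihood) + 2 * (number of free parameters)."""
    return -2.0 * maximized_loglik + 2.0 * p

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=50)  # illustrative data

aic0 = aic(gaussian_loglik(y, 0.0), p=0)       # Model 0: theta fixed at 0, no free parameters
aic1 = aic(gaussian_loglik(y, y.mean()), p=1)  # Model 1: theta estimated by its MLE, the sample mean
print(aic0, aic1)  # prefer the model with the smaller AIC
```

The sign convention means smaller AIC is better, matching the "prefer large $\log L(\hat\theta) - p$" rule above.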

BIC

BIC has a heavier penalty for complexity. Let $\ell$ be the log likelihood and $p$ the number of parameters in the model.

BIC: $\ell(\hat\theta) - \left(\tfrac{1}{2}\log n\right) p$.

In comparing Model 0 and Model 1, with penalties $\text{pen}_i$ ($p_i$ for AIC, $\tfrac{1}{2} p_i \log n$ for BIC), prefer Model 1 if

$$\ell(\hat\theta_1) - \text{pen}_1 \geq \ell(\hat\theta_0) - \text{pen}_0.$$

AIC vs BIC in a simple model selection problem

Suppose there is only an intercept and we know $\sigma^2 = 1$:

$$Y_1, \dots, Y_n \sim N(\theta, 1).$$

Model 0: $\theta = 0$, and Model 1: $\theta \in \mathbb{R}$.

Then $\hat\theta_0 = 0$ and $\hat\theta_1 = \bar Y$, the sample mean.

$$\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum (Y_i - \theta)^2.$$

$$\ell(\hat\theta_1) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum (Y_i - \bar Y)^2.$$

$$\ell(\hat\theta_0) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum Y_i^2.$$

Then

$$\ell(\hat\theta_1) - \ell(\hat\theta_0) = \frac{1}{2}\left[\sum Y_i^2 - \sum (Y_i - \bar Y)^2\right] = \frac{1}{2} n \bar Y^2.$$

Then for AIC, $\text{pen}_1 - \text{pen}_0 = 1$, vs. $\frac{1}{2}\log n$ if looking at BIC.

So we end up choosing Model 1 under AIC if $\frac{1}{2} n \bar Y^2 \geq 1$, i.e. $|\sqrt{n}\,\bar Y| \geq \sqrt{2}$. This is like hypothesis testing with significance level $P\{|N(0,1)| \geq \sqrt{2}\} \approx 0.157$.

Under BIC, choose Model 1 if $\frac{1}{2} n \bar Y^2 \geq \frac{1}{2}\log n$, i.e. $|\sqrt{n}\,\bar Y| \geq \sqrt{\log n}$. This is like hypothesis testing with significance level $P\{|N(0,1)| \geq \sqrt{\log n}\} \to 0$ as $n \to \infty$.
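A quick simulation (with illustrative choices of $n$ and the number of replications) checks these thresholds when Model 0 is true: AIC keeps picking the bigger model a fixed fraction of the time, while BIC's rate shrinks with $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000

# Under Model 0 the truth is theta = 0, so Ybar ~ N(0, 1/n).
ybar = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)

gain = 0.5 * n * ybar ** 2  # log-likelihood gain of Model 1 over Model 0

aic_rate = np.mean(gain >= 1.0)              # AIC: penalty difference is 1
bic_rate = np.mean(gain >= 0.5 * np.log(n))  # BIC: penalty difference is (1/2) log n

print("AIC picks Model 1:", aic_rate)  # about P(|N(0,1)| >= sqrt(2)), roughly 0.157
print("BIC picks Model 1:", bic_rate)  # much smaller, and shrinks as n grows
```

This is the hypothesis-testing analogy above in simulation form: AIC's false-selection rate is the fixed significance level, BIC's vanishes.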

AIC seems to care less about finding the “true model.” What is it looking for? It targets prediction: AIC estimates (up to constants) the expected Kullback–Leibler discrepancy between the truth and the fitted model, i.e. how well the fitted model will predict future data.

Bayesian model comparison, Bayes factors, BIC

Take $k$ models $M_1, \dots, M_k$.

Want to find posterior probabilities of these models.

Given data $y$, we can write the posterior probabilities of the models using Bayes’ rule:

$$p(M_i \mid y) = \frac{p(M_i)\, p(y \mid M_i)}{p(y)}.$$

Ratios of posterior probabilities therefore look like

$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(M_1)}{p(M_2)} \cdot \frac{p(y \mid M_1)}{p(y \mid M_2)}.$$

This is (prior ratio) $\times$ (Bayes factor), i.e., the Bayes factor is the factor by which the data change our prior odds into posterior odds.

A probability or density of the form $p(y \mid M_i)$ is called a marginal likelihood.

So the marginal likelihood is

$$p(y \mid M_i) = \int_{\theta \in M_i} p(y \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta,$$

or equivalently, writing $L_i$ for the likelihood and $f_i$ for the prior density under model $M_i$,

$$\int_{\theta \in M_i} L_i(\theta)\, f_i(\theta)\, d\theta.$$
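For the intercept-only example, this integral can be computed numerically. The sketch below assumes a $N(0, \tau^2)$ prior on $\theta$ under Model 1 (a choice of ours, not fixed by the notes) and works on the log scale so the integrand doesn't underflow:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def log_marginal_likelihood(y, tau):
    """log p(y | M1) = log of ∫ L(θ) f(θ) dθ, with an assumed N(0, tau^2) prior on θ."""
    # Shift by the maximized log likelihood so the integrand is O(1), not astronomically small.
    c = np.sum(norm.logpdf(y, loc=y.mean(), scale=1.0))
    def integrand(theta):
        loglik = np.sum(norm.logpdf(y, loc=theta, scale=1.0))
        return np.exp(loglik - c) * norm.pdf(theta, loc=0.0, scale=tau)
    val, _err = quad(integrand, -10.0 * tau, 10.0 * tau)
    return c + np.log(val)

rng = np.random.default_rng(2)
y = rng.normal(0.2, 1.0, size=30)  # illustrative data

log_m1 = log_marginal_likelihood(y, tau=1.0)
log_m0 = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))  # Model 0 has no free parameter to integrate
print("log Bayes factor:", log_m1 - log_m0)
```

Note the marginal likelihood under Model 0 is just the likelihood at $\theta = 0$, since that model has no free parameters.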

Approximating each log marginal likelihood by $\ell_i(\hat\theta_i) - \frac{p_i}{2}\log n$ (the BIC approximation), and taking equal prior model probabilities, BIC selects Model 0 if

$$\ell_0(\hat\theta_0) - \frac{p_0}{2}\log n \geq \ell_1(\hat\theta_1) - \frac{p_1}{2}\log n.$$

A problem: if the prior on $\theta$ under the bigger model is very spread out, then $P(M_0 \mid y)$ gets close to $1$, so using a very wide prior means the Bayes factor will just prefer the smaller model (the Lindley paradox).
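The paradox is easy to see numerically. With an assumed $N(0, \tau^2)$ prior on $\theta$ under Model 1 (a choice made for illustration), the Bayes factor has the closed form $(1 + n\tau^2)^{-1/2} \exp\{n^2 \bar Y^2 \tau^2 / (2(1 + n\tau^2))\}$. Fix data that a frequentist test would call significant, then widen the prior:

```python
import numpy as np

def log_bayes_factor(n, ybar, tau):
    """log of p(y|M1)/p(y|M0) under an assumed N(0, tau^2) prior on theta."""
    a = 1.0 + n * tau ** 2
    return -0.5 * np.log(a) + (n * ybar) ** 2 * tau ** 2 / (2.0 * a)

n = 100
ybar = 2.0 / np.sqrt(n)  # z = sqrt(n) * ybar = 2: "significant at the 5% level"

for tau in [1.0, 10.0, 100.0, 1000.0]:
    print(tau, log_bayes_factor(n, ybar, tau))
# As tau grows, the log Bayes factor decreases without bound: the wide prior
# makes the data look ever less likely under Model 1, so the posterior
# piles onto the smaller Model 0 even though the z-statistic is fixed at 2.
```

The data are held fixed here; only the prior width changes, which is exactly the sensitivity the Lindley paradox describes.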

Methods to fix this: calculate the Bayes factor for a range of priors, and see whether, in order to change your conclusion, you would need priors that are extremely unrealistic.

Other proposals to fix this: “partial Bayes factors,” where the key idea is to “sacrifice” a small amount of data to estimate the priors in the various models, then compute the Bayes factor on the rest.