Model Selection

Finding the true model and finding the model that predicts future data well lead to different criteria.

AIC, BIC

Bayesian vs frequentist ideas for model selection

AIC – an information criterion

Consider the maximized log likelihood $\log L(\hat\theta)$. We can't use this alone for model selection, because it always favors adding more parameters.

Therefore, we penalize for the number of parameters, $p$. So we prefer models where $\log L(\hat\theta) - p$ is larger.

Usually, we see this multiplied by $-2$, so that we prefer models with small $-2\log L(\hat\theta) + 2p$.
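As a concrete sketch, AIC can be computed directly from the maximized log likelihood. The data and the known-variance Gaussian setup below are illustrative assumptions, chosen to match the example later in these notes:

```python
import numpy as np

def gaussian_loglik(y, theta, sigma2=1.0):
    """Log likelihood of i.i.d. N(theta, sigma2) data at a given theta."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - theta) ** 2) / sigma2

def aic(maximized_loglik, p):
    """AIC = -2 * (maximized log likelihood) + 2 * (number of free parameters)."""
    return -2.0 * maximized_loglik + 2.0 * p

rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=50)  # illustrative data

aic0 = aic(gaussian_loglik(y, 0.0), p=0)       # Model 0: theta fixed at 0, no free parameters
aic1 = aic(gaussian_loglik(y, y.mean()), p=1)  # Model 1: theta estimated by its MLE, the sample mean
print(aic0, aic1)  # prefer the model with the smaller AIC
```

The sign convention means smaller AIC is better, matching the "prefer large $\log L(\hat\theta) - p$" rule above.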

BIC

BIC has a heavier penalty for complexity. Let $\ell$ be the log likelihood and $p$ the number of parameters in the model.

BIC: $\ell(\hat\theta) - \left(\tfrac{1}{2}\log n\right) p$.

In comparing Model 0 and Model 1, with penalties $\text{pen}_i$ ($p_i$ for AIC, $\tfrac{1}{2} p_i \log n$ for BIC), prefer Model 1 if

$$\ell(\hat\theta_1) - \text{pen}_1 \geq \ell(\hat\theta_0) - \text{pen}_0.$$

AIC vs BIC in a simple model selection problem

Suppose there is only an intercept and we know $\sigma^2 = 1$:

$$Y_1, \dots, Y_n \sim N(\theta, 1).$$

Model 0: $\theta = 0$, and Model 1: $\theta \in \mathbb{R}$.

Then $\hat\theta_0 = 0$ and $\hat\theta_1 = \bar Y$, the sample mean.

$$\ell(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum (Y_i - \theta)^2.$$

$$\ell(\hat\theta_1) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum (Y_i - \bar Y)^2.$$

$$\ell(\hat\theta_0) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum Y_i^2.$$

Then

$$\ell(\hat\theta_1) - \ell(\hat\theta_0) = \frac{1}{2}\left[\sum Y_i^2 - \sum (Y_i - \bar Y)^2\right] = \frac{1}{2} n \bar Y^2.$$

Then for AIC, $\text{pen}_1 - \text{pen}_0 = 1$, vs. $\frac{1}{2}\log n$ if looking at BIC.

So we end up choosing Model 1 under AIC if $\frac{1}{2} n \bar Y^2 \geq 1$, i.e. $|\sqrt{n}\,\bar Y| \geq \sqrt{2}$. This is like hypothesis testing with significance level $P\{|N(0,1)| \geq \sqrt{2}\} \approx 0.157$.

Under BIC, choose Model 1 if $\frac{1}{2} n \bar Y^2 \geq \frac{1}{2}\log n$, i.e. $|\sqrt{n}\,\bar Y| \geq \sqrt{\log n}$. This is like hypothesis testing with significance level $P\{|N(0,1)| \geq \sqrt{\log n}\} \to 0$ as $n \to \infty$.
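A quick simulation (with illustrative choices of $n$ and the number of replications) checks these thresholds when Model 0 is true: AIC keeps picking the bigger model a fixed fraction of the time, while BIC's rate shrinks with $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000

# Under Model 0 the truth is theta = 0, so Ybar ~ N(0, 1/n).
ybar = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)

gain = 0.5 * n * ybar ** 2  # log-likelihood gain of Model 1 over Model 0

aic_rate = np.mean(gain >= 1.0)              # AIC: penalty difference is 1
bic_rate = np.mean(gain >= 0.5 * np.log(n))  # BIC: penalty difference is (1/2) log n

print("AIC picks Model 1:", aic_rate)  # about P(|N(0,1)| >= sqrt(2)), roughly 0.157
print("BIC picks Model 1:", bic_rate)  # much smaller, and shrinks as n grows
```

This is the hypothesis-testing analogy above in simulation form: AIC's false-selection rate is the fixed significance level, BIC's vanishes.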

AIC seems to care less about finding the “true model.” What is it looking for? It targets prediction: AIC estimates (up to constants) the expected Kullback–Leibler discrepancy between the truth and the fitted model, i.e. how well the fitted model will predict future data.

Bayesian model comparison, Bayes factors, BIC

Take $k$ models $M_1, \dots, M_k$.

Want to find posterior probabilities of these models.

Given data $y$, we can write the posterior probabilities of the models using Bayes’ rule:

$$p(M_i \mid y) = \frac{p(M_i)\, p(y \mid M_i)}{p(y)}.$$

Ratios of posterior probabilities therefore look like

$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{p(M_1)}{p(M_2)} \cdot \frac{p(y \mid M_1)}{p(y \mid M_2)}.$$

This is (prior ratio) $\times$ (Bayes factor), i.e., the Bayes factor is the factor by which the data change our prior odds into posterior odds.

A probability or density of the form $p(y \mid M_i)$ is called a marginal likelihood.

So the marginal likelihood is

$$p(y \mid M_i) = \int_{\theta \in M_i} p(y \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta,$$

or equivalently, writing $L_i$ for the likelihood and $f_i$ for the prior density under model $M_i$,

$$\int_{\theta \in M_i} L_i(\theta)\, f_i(\theta)\, d\theta.$$
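For the intercept-only example, this integral can be computed numerically. The sketch below assumes a $N(0, \tau^2)$ prior on $\theta$ under Model 1 (a choice of ours, not fixed by the notes) and works on the log scale so the integrand doesn't underflow:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def log_marginal_likelihood(y, tau):
    """log p(y | M1) = log of ∫ L(θ) f(θ) dθ, with an assumed N(0, tau^2) prior on θ."""
    # Shift by the maximized log likelihood so the integrand is O(1), not astronomically small.
    c = np.sum(norm.logpdf(y, loc=y.mean(), scale=1.0))
    def integrand(theta):
        loglik = np.sum(norm.logpdf(y, loc=theta, scale=1.0))
        return np.exp(loglik - c) * norm.pdf(theta, loc=0.0, scale=tau)
    val, _err = quad(integrand, -10.0 * tau, 10.0 * tau)
    return c + np.log(val)

rng = np.random.default_rng(2)
y = rng.normal(0.2, 1.0, size=30)  # illustrative data

log_m1 = log_marginal_likelihood(y, tau=1.0)
log_m0 = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))  # Model 0 has no free parameter to integrate
print("log Bayes factor:", log_m1 - log_m0)
```

Note the marginal likelihood under Model 0 is just the likelihood at $\theta = 0$, since that model has no free parameters.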

Approximating each log marginal likelihood by $\ell_i(\hat\theta_i) - \frac{p_i}{2}\log n$ (the BIC approximation), and taking equal prior model probabilities, BIC selects Model 0 if

$$\ell_0(\hat\theta_0) - \frac{p_0}{2}\log n \geq \ell_1(\hat\theta_1) - \frac{p_1}{2}\log n.$$

A problem: if the prior on $\theta$ under the bigger model is very spread out, then $P(M_0 \mid y)$ gets close to $1$, so using a very wide prior means the Bayes factor will just prefer the smaller model (the Lindley paradox).
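The paradox is easy to see numerically. With an assumed $N(0, \tau^2)$ prior on $\theta$ under Model 1 (a choice made for illustration), the Bayes factor has the closed form $(1 + n\tau^2)^{-1/2} \exp\{n^2 \bar Y^2 \tau^2 / (2(1 + n\tau^2))\}$. Fix data that a frequentist test would call significant, then widen the prior:

```python
import numpy as np

def log_bayes_factor(n, ybar, tau):
    """log of p(y|M1)/p(y|M0) under an assumed N(0, tau^2) prior on theta."""
    a = 1.0 + n * tau ** 2
    return -0.5 * np.log(a) + (n * ybar) ** 2 * tau ** 2 / (2.0 * a)

n = 100
ybar = 2.0 / np.sqrt(n)  # z = sqrt(n) * ybar = 2: "significant at the 5% level"

for tau in [1.0, 10.0, 100.0, 1000.0]:
    print(tau, log_bayes_factor(n, ybar, tau))
# As tau grows, the log Bayes factor decreases without bound: the wide prior
# makes the data look ever less likely under Model 1, so the posterior
# piles onto the smaller Model 0 even though the z-statistic is fixed at 2.
```

The data are held fixed here; only the prior width changes, which is exactly the sensitivity the Lindley paradox describes.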

Methods to fix this: calculate the Bayes factor for a range of priors, and see whether, in order to change your conclusion, you would need priors that are extremely unrealistic.

Other proposals to fix this: “partial Bayes factors,” where the key idea is to “sacrifice” a small amount of data to estimate the priors in the various models, then compute the Bayes factor on the rest.