Finding the true model and finding the model that predicts future data well lead to different criteria.
AIC, BIC
Bayesian vs frequentist ideas for model selection
AIC – an information criterion
Consider the maximized log likelihood logL(θ^). We can't use this alone for model selection, because it always favors adding more parameters. Therefore we penalize for the number of parameters, p: we prefer models where logL(θ^)−p is larger.
Usually we see this multiplied by −2, so that we prefer models with small −2logL(θ^)+2p.
BIC
BIC has a heavier penalty for complexity. Let ℓ be log likelihood and p the number of parameters in the model.
BIC: ℓ(θ^)−(1/2)(log n)p.
In comparing model 0 and model 1, prefer model 1 if
ℓ(θ^1)−pen1 ≥ ℓ(θ^0)−pen0,
where peni is the penalty for model i: pi for AIC, and (1/2)(log n)pi for BIC.
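Both criteria are easy to compute from a fitted model. A minimal sketch in Python (the function names and the fitted log likelihoods below are hypothetical; both criteria are on the conventional −2 scale, where smaller is better):

```python
import numpy as np

def aic(loglik, p):
    """AIC on the -2*loglik + 2*p scale; smaller is better."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """BIC on the -2*loglik + p*log(n) scale; smaller is better."""
    return -2.0 * loglik + p * np.log(n)

# Hypothetical fits: model 1 has a higher log likelihood but more parameters.
aic0, aic1 = aic(-102.0, 1), aic(-99.0, 3)
bic0, bic1 = bic(-102.0, 1, 200), bic(-99.0, 3, 200)
```

On these (made-up) numbers the two criteria disagree: AIC prefers the bigger model (204 vs 206), while BIC's heavier penalty prefers the smaller one.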
AIC vs BIC in a simple model selection problem
Suppose the model has only an intercept, and we know σ² = 1.
Y1,...,Yn∼N(θ,1).
Model 0: θ=0, and Model 1: θ∈R.
Then θ^0=0,θ^1=Yˉ, the sample mean.
ℓ(θ)=(−n/2)log(2π)−(1/2)∑(Yi−θ)2.
ℓ(θ^1)=(−n/2)log(2π)−(1/2)∑(Yi−Yˉ)2.
ℓ(θ^0)=(−n/2)log(2π)−(1/2)∑(Yi)2.
Then
ℓ(θ^1)−ℓ(θ^0) = (1/2)[∑Yi² − ∑(Yi−Yˉ)²] = (n/2)Yˉ².
Then pen1 − pen0 = 1 for AIC, versus (1/2)log n for BIC.
So we end up choosing model 1 under AIC if (n/2)Yˉ² ≥ 1, i.e. |√n Yˉ| ≥ √2. Since √n Yˉ ∼ N(0,1) under model 0, this is like hypothesis testing with significance level P{|N(0,1)| ≥ √2} ≈ 0.16.
Under BIC, choose model 1 if (n/2)Yˉ² ≥ (1/2)log n, i.e. |√n Yˉ| ≥ √(log n). This is like hypothesis testing with significance level P{|N(0,1)| ≥ √(log n)} → 0 as n → ∞.
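A quick simulation under model 0 (so θ = 0, and picking model 1 is a false positive) illustrates the two significance levels. This is a sketch; the sample size and replication count are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def picks_model_1(y):
    """Return (AIC picks model 1, BIC picks model 1) for the intercept example."""
    n = len(y)
    stat = np.sqrt(n) * abs(y.mean())  # |sqrt(n)*Ybar| ~ |N(0,1)| under model 0
    return stat >= np.sqrt(2.0), stat >= np.sqrt(np.log(n))

# Simulate data from model 0 and record how often each criterion picks model 1.
n, reps = 1000, 2000
picks = np.array([picks_model_1(rng.normal(0.0, 1.0, n)) for _ in range(reps)])
aic_rate, bic_rate = picks.mean(axis=0)
```

The AIC rate should hover near P{|N(0,1)| ≥ √2} ≈ 0.16 no matter how large n is, while the BIC rate shrinks toward 0 as n grows.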
AIC seems to care less about finding the "true model." What is it looking for? Good prediction: AIC is aimed at selecting the model that will predict future data best (equivalently, whose fitted density is closest to the truth in Kullback–Leibler divergence), not the model most likely to be true.
Bayesian model comparison, Bayes factors, BIC
Take k models M1,...,Mk.
Want to find posterior probabilities of these models.
Given data y, we can write the posterior probabilities of the models using Bayes' rule: p(Mi∣y) ∝ p(Mi)p(y∣Mi). Ratios of posterior probabilities therefore look like
p(M1∣y)/p(M2∣y) = (p(M1)/p(M2))(p(y∣M1)/p(y∣M2)).
This is (prior ratio) × (Bayes factor); the Bayes factor is the factor by which the data change our relative beliefs in the two models.
A probability or density of the form p(y∣Mi) is called a marginal likelihood:
p(y∣Mi) = ∫θ∈Mi p(y∣θ,Mi)p(θ∣Mi)dθ,
or equivalently, writing Li for the likelihood and fi for the prior density under model i,
∫θ∈Mi Li(θ)fi(θ)dθ.
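This integral can be estimated by plain Monte Carlo: draw θ from the prior and average the likelihood. A sketch for the intercept example (the N(0, 4) prior and the sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_marginal_mc(y, prior_draws):
    """Monte Carlo estimate of log p(y|M) = log E_prior[L(theta)]."""
    # Log likelihood of the whole sample at each prior draw (Y_i ~ N(theta, 1)).
    ll = np.sum(-0.5 * np.log(2 * np.pi)
                - 0.5 * (y[None, :] - prior_draws[:, None]) ** 2, axis=1)
    m = ll.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(ll - m)))

y = rng.normal(0.5, 1.0, size=20)
theta = rng.normal(0.0, 2.0, size=100_000)  # prior: theta ~ N(0, 4)
estimate = log_marginal_mc(y, theta)
```

With a conjugate normal prior the integral is also available in closed form, which makes this estimate easy to check; sampling from the prior is wasteful when the likelihood is much more concentrated than the prior, so this direct approach only works in low dimensions.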
For large n, a Laplace approximation gives log p(y∣Mi) ≈ ℓi(θ^i) − (pi/2)log n (up to terms that stay bounded in n), which is exactly the BIC. So BIC (approximately) selects model 0 when ℓ0(θ^0) − (p0/2)log n ≥ ℓ1(θ^1) − (p1/2)log n, i.e. when model 0 has the larger approximate marginal likelihood.
A problem: if the prior on θ under the bigger model is very spread out, the marginal likelihood of that model becomes very small, so P(M0∣y) gets close to 1. Using a very wide prior thus means the Bayes factor will simply prefer the smaller model, almost regardless of the data (the Lindley paradox).
Methods to fix this: compute the Bayes factor for a range of priors, and see whether, in order to change your conclusion, you would need priors that are extremely unrealistic (a sensitivity analysis).
Other proposed fixes: "partial Bayes factors," where the key idea is to "sacrifice" a small amount of data to convert the priors in the various models into data-informed distributions, and then compare the models on the remaining data.
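A sketch of this idea in the running normal example, using closed-form marginal likelihoods (the training-sample size m, the prior scales, and the function names are my own choices): split off the first m observations as a training sample, then compare the models on the rest, conditionally on that sample.

```python
import numpy as np

def log_marg_m1(y, tau2):
    """Closed-form log p(y|M1) for Y_i ~ N(theta, 1) with prior theta ~ N(0, tau2)."""
    n, S = len(y), y.sum()
    a = n + 1.0 / tau2
    return (-(n / 2) * np.log(2 * np.pi) - 0.5 * np.sum(y**2)
            - 0.5 * np.log(tau2 * a) + S**2 / (2 * a))

def log_marg_m0(y):
    """log p(y|M0) for Y_i ~ N(0, 1)."""
    return -(len(y) / 2) * np.log(2 * np.pi) - 0.5 * np.sum(y**2)

def log_partial_bf(y, m, tau2):
    """Partial Bayes factor: models compared on y[m:] given the training sample y[:m]."""
    lm1 = log_marg_m1(y, tau2) - log_marg_m1(y[:m], tau2)
    lm0 = log_marg_m0(y) - log_marg_m0(y[:m])
    return lm1 - lm0

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=60)              # data actually from model 0
log_bf = log_marg_m1(y, 1e6) - log_marg_m0(y)  # full Bayes factor, very diffuse prior
log_pbf = log_partial_bf(y, 10, 1e6)
```

The −(1/2)log τ² term that drags the full Bayes factor toward M0 under a diffuse prior cancels between the full data and the training sample, so the partial Bayes factor is far less sensitive to the prior scale.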