Akaike Information Criterion
Shuhua Hu
Center for Research in Scientific Computation
North Carolina State University
Raleigh, NC
March 15, 2007
Background
• Model
statistical model: $X = h(t; q) + \epsilon$
- $h$: mathematical model such as an ODE model, PDE model, algebraic model, etc.
- $\epsilon$: random variable with some probability distribution, such as the normal distribution.
- $X$ is a random variable.
Under the assumption that $\epsilon$ is i.i.d. $N(0, \sigma^2)$, we have the
probability distribution model: $g(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x - h(t; q))^2}{2\sigma^2}\right]$, where $\theta = (q, \sigma)$.
- $g$: probability density function of $x$, depending on the parameter $\theta$.
- $\theta$ includes the mathematical model parameter $q$ and the statistical model parameter $\sigma$.
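As a concrete illustration, the sketch below (Python, not part of the original slides) evaluates the log of this density summed over the observations, i.e. the log-likelihood that the criteria later in these slides are built on; the exponential-decay model used for $h$ is a hypothetical stand-in for the mathematical model.

```python
import numpy as np

def log_likelihood(theta, t, x, h):
    """Log-likelihood of data x under X = h(t; q) + eps, eps ~ i.i.d. N(0, sigma^2),
    with theta = (q, sigma)."""
    *q, sigma = theta
    resid = x - h(t, np.asarray(q))
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

# Hypothetical algebraic model standing in for h(t; q)
h = lambda t, q: q[0] * np.exp(-q[1] * t)
```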
• Risk
- "Modeling" error (in terms of the uncertainty assumption)
  An inappropriate parametric probability distribution is specified for the data at hand.
- Estimation error
  $\|\vartheta - \hat{\theta}\|^2 = \underbrace{\|\vartheta - \theta\|^2}_{\text{bias}} + \underbrace{\|\theta - \hat{\theta}\|^2}_{\text{variance}}$
  $\vartheta$: parameter vector for the full reality model.
  $\theta$: the projection of $\vartheta$ onto the parameter space $\Theta_k$ of the approximating model.
  $\hat{\theta}$: the maximum likelihood estimate of $\theta$ in $\Theta_k$.
  * Variance
    For sufficiently large sample size $n$, we have $n\|\theta - \hat{\theta}\|^2 \overset{\text{asympt.}}{\sim} \chi^2_k$, where $E(\chi^2_k) = k$.
• Principle of Parsimony (with the same data set)
• Akaike Information Criterion
Kullback-Leibler Information
Information lost when an approximating model is used to approximate the full reality.
• Continuous Case
$I(f, g(\cdot|\theta)) = \int_{\Omega} f(x)\,\log\!\left(\frac{f(x)}{g(x|\theta)}\right) dx
= \int_{\Omega} f(x)\log(f(x))\,dx \;-\; \underbrace{\int_{\Omega} f(x)\log(g(x|\theta))\,dx}_{\text{relative K-L information}}$
- $f$: full reality or truth, in terms of a probability distribution.
- $g$: approximating model, in terms of a probability distribution.
- $\theta$: parameter vector in the approximating model $g$.
• Remark
- $I(f, g) \geq 0$, with $I(f, g) = 0$ if and only if $f = g$ almost everywhere.
- $I(f, g) \neq I(g, f)$, which implies that K-L information is not a true "distance".
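For intuition, the following sketch (assuming Python with NumPy/SciPy; the two normal densities are chosen only for illustration) evaluates $I(f, g)$ by numerical integration and exhibits both remarks: the value is nonnegative and not symmetric.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_information(f, g):
    """I(f, g) = integral over Omega of f(x) * log(f(x) / g(x)) dx."""
    integrand = lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x))
    return quad(integrand, -np.inf, np.inf)[0]

f = norm(0.0, 1.0)   # "truth" (illustrative choice)
g = norm(0.5, 1.5)   # approximating model
print(kl_information(f, g), kl_information(g, f))   # both >= 0, generally unequal
```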
Akaike Information Criterion (1973)
• Motivation
- The truth $f$ is unknown.
- The parameter $\theta$ in $g$ must be estimated from the empirical data $y$.
  * Data $y$ are generated from $f(x)$, i.e., they are a realization of the random variable $X$.
  * $\hat{\theta}(y)$: estimator of $\theta$. It is a random variable.
  * $I(f, g(\cdot|\hat{\theta}(y)))$ is a random variable.
- Remark
  * We need to use the expected K-L information $E_y[I(f, g(\cdot|\hat{\theta}(y)))]$ to measure the "distance" between $g$ and $f$.
• Selection Target
$\min_{g \in G} \; E_y[I(f, g(\cdot|\hat{\theta}(y)))]$
- $E_y[I(f, g(\cdot|\hat{\theta}(y)))] = \int_{\Omega} f(x)\log(f(x))\,dx \;-\; \underbrace{\int_{\Omega} f(y)\left[\int_{\Omega} f(x)\log(g(x|\hat{\theta}(y)))\,dx\right] dy}_{E_y E_x[\log(g(x|\hat{\theta}(y)))]}$.
- $G$: collection of "admissible" models (in terms of probability density functions).
- $\hat{\theta}$: maximum likelihood estimate based on model $g$ and data $y$.
- $y$: random sample from the density function $f(x)$.
• Model Selection Criterion
$\max_{g \in G} \; E_y E_x[\log(g(x|\hat{\theta}(y)))]$
• Key Result
An approximately unbiased estimator of $E_y E_x[\log(g(x|\hat{\theta}(y)))]$ for a large sample and a "good" model is
$\log(\mathcal{L}(\hat{\theta}|y)) - k$
- $\mathcal{L}$: likelihood function.
- $\hat{\theta}$: maximum likelihood estimate of $\theta$.
- $k$: number of estimated parameters (including the variance).
• Remark
- "Good" model: a model that is close to $f$ in the sense of having a small K-L value.
• Maximum Likelihood Case
$\mathrm{AIC} = \underbrace{-2\log(\mathcal{L}(\hat{\theta}|y))}_{\text{bias}} + \underbrace{2k}_{\text{variance}}$
- Calculate the AIC value for each model with the same data set; the "best" model is the one with the minimum AIC value.
- The value of AIC depends on the data $y$, which leads to model selection uncertainty.
• Least-Squares Case
Assumption: i.i.d. normally distributed errors
$\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k$
- RSS: residual sum of squares of the fitted model.
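A minimal sketch of both formulas (Python; the fitted-model numbers below are hypothetical). AIC values are only comparable across models fitted to the same data set.

```python
import numpy as np

def aic_ml(max_log_likelihood, k):
    """AIC = -2 log L(theta_hat | y) + 2k for the maximum likelihood case."""
    return -2.0 * max_log_likelihood + 2 * k

def aic_ls(rss, n, k):
    """AIC = n log(RSS / n) + 2k for least squares with i.i.d. normal errors;
    k counts all estimated parameters, including the error variance."""
    return n * np.log(rss / n) + 2 * k

# Hypothetical fits of two candidate models to the same n = 50 observations
fits = {"g1": {"rss": 12.3, "k": 3}, "g2": {"rss": 11.8, "k": 5}}
aics = {name: aic_ls(f["rss"], 50, f["k"]) for name, f in fits.items()}
best = min(aics, key=aics.get)   # "best" model = minimum AIC
```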
Takeuchi’s Information Criterion (1976)
Useful in cases where the model is not particularly close to the truth.
• Model Selection Criterion
$\max_{g \in G} \; E_y E_x[\log(g(x|\hat{\theta}(y)))]$
• Key Result
An approximately unbiased estimator of $E_y E_x[\log(g(x|\hat{\theta}(y)))]$ for a large sample is
$\log(\mathcal{L}(\hat{\theta}|y)) - \mathrm{tr}\!\left(J(\theta_0)\, I(\theta_0)^{-1}\right)$
- $J(\theta_0) = E_f\!\left[\left(\frac{\partial}{\partial\theta}\log(g(x|\theta))\right)\left(\frac{\partial}{\partial\theta}\log(g(x|\theta))\right)^{T}\right]\Big|_{\theta=\theta_0}$
- $I(\theta_0) = E_f\!\left[-\frac{\partial^2 \log(g(x|\theta))}{\partial\theta_i\,\partial\theta_j}\right]\Big|_{\theta=\theta_0}$
• Remark
- If $g \equiv f$, then $I(\theta_0) = J(\theta_0)$; hence $\mathrm{tr}(J(\theta_0) I(\theta_0)^{-1}) = k$.
- If $g$ is close to $f$, then $\mathrm{tr}(J(\theta_0) I(\theta_0)^{-1}) \approx k$.
• TIC
$\mathrm{TIC} = -2\log(\mathcal{L}(\hat{\theta}|y)) + 2\,\mathrm{tr}\!\left(\hat{J}(\hat{\theta})\,[\hat{I}(\hat{\theta})]^{-1}\right),$
where $\hat{I}(\hat{\theta})$ and $\hat{J}(\hat{\theta})$ are both $k \times k$ matrices, and
$\hat{I}(\hat{\theta}) = -\sum_{i=1}^{n} \frac{\partial^2 \log(g(x_i|\hat{\theta}))}{\partial\theta^2} \quad\rightarrow\quad \text{estimate of } I(\theta_0)$
$\hat{J}(\hat{\theta}) = \sum_{i=1}^{n} \left[\frac{\partial}{\partial\theta}\log(g(x_i|\hat{\theta}))\right]\left[\frac{\partial}{\partial\theta}\log(g(x_i|\hat{\theta}))\right]^{T} \quad\rightarrow\quad \text{estimate of } J(\theta_0)$
• Remark
- Attractive in theory.
- Rarely used in practice, because a very large sample size is needed to obtain good estimates of both $I(\theta_0)$ and $J(\theta_0)$.
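To make the trace term concrete, here is a sketch (Python, not from the slides) that computes TIC for a normal approximating model $g(x|\mu,\sigma)$ using its analytic scores and second derivatives; with data actually drawn from a normal distribution, $\mathrm{tr}(\hat{J}\hat{I}^{-1})$ comes out close to $k = 2$, as the remark above predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # truth is normal, so g is "close to f"

# MLE of the normal approximating model g(x | mu, sigma)
mu, sigma = x.mean(), x.std()

# Per-observation scores d/dtheta log g(x_i | theta), theta = (mu, sigma)
scores = np.column_stack([(x - mu) / sigma**2,
                          -1.0 / sigma + (x - mu) ** 2 / sigma**3])
J_hat = scores.T @ scores                                  # sum of score outer products

# Summed negative second derivatives of log g (observed information)
I_hat = np.array([
    [len(x) / sigma**2,                np.sum(2 * (x - mu) / sigma**3)],
    [np.sum(2 * (x - mu) / sigma**3),  np.sum(3 * (x - mu) ** 2 / sigma**4 - 1 / sigma**2)],
])

trace_term = np.trace(J_hat @ np.linalg.inv(I_hat))        # approximately k = 2 here
log_lik = np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / (2 * sigma**2))
TIC = -2 * log_lik + 2 * trace_term
```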
A Small Sample AIC
Used when the sample size is small relative to the number of parameters (rule of thumb: $n/k < 40$).
• Univariate Case
Assumption: i.i.d. normal error distribution, with the truth contained in the model set.
$\mathrm{AIC}_c = \mathrm{AIC} + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{bias correction}}$
• Remark
- The bias-correction term varies by type of model (e.g., normal, exponential, Poisson).
- In practice, AICc is generally suitable unless the underlying probability distribution is extremely nonnormal, especially in terms of being strongly skewed.
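A one-line sketch of the correction (Python; the inputs are hypothetical). With $n/k$ well under 40 the correction is non-negligible, and as $n$ grows AICc converges to AIC.

```python
def aicc(aic, n, k):
    """Small-sample corrected AIC for the univariate i.i.d. normal-error case."""
    return aic + 2 * k * (k + 1) / (n - k - 1)

print(aicc(aic=57.2, n=30, k=4))   # hypothetical AIC value; n/k = 7.5 < 40
```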
• Multivariate Case
Assumption: each row of $\epsilon$ is i.i.d. $N(0, \Sigma)$.
$\mathrm{AIC}_c = \mathrm{AIC} + 2\,\frac{k(\tilde{k} + 1 + p)}{n - \tilde{k} - 1 - p}$
- Applies to the multivariate linear model $Y = TB + \epsilon$, where $Y \in \mathbb{R}^{n \times p}$, $T \in \mathbb{R}^{n \times \tilde{k}}$, $B \in \mathbb{R}^{\tilde{k} \times p}$.
- $p$: total number of components.
- $n$: number of independent multivariate observations, each with $p$ nonindependent components.
- $k$: total number of unknown parameters, with $k = \tilde{k} p + p(p+1)/2$.
• Remark
- Bedrick and Tsai [1] claim that this result can be extended to the multivariate nonlinear regression model.
AIC Differences, Likelihood of a Model, Akaike Weights
• AIC Differences
Information loss when the fitted model is used rather than the best approximating model:
$\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$
- $\mathrm{AIC}_{\min}$: AIC value of the best model in the set.
• Likelihood of a Model
Useful for making inferences about the relative strength of evidence for each of the models in the set:
$\mathcal{L}(g_i|y) \propto \exp\!\left(-\tfrac{1}{2}\Delta_i\right)$, where $\propto$ means "is proportional to".
• Akaike Weights
"Weight of evidence" in favor of model $i$ being the best approximating model in the set:
$w_i = \frac{\exp(-\tfrac{1}{2}\Delta_i)}{\sum_{r=1}^{R} \exp(-\tfrac{1}{2}\Delta_r)}$
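The three quantities above can be computed together; a minimal sketch in Python (the AIC values are hypothetical):

```python
import numpy as np

def akaike_weights(aic_values):
    """Return (Delta_i, relative model likelihoods, Akaike weights)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()                # AIC differences
    rel_lik = np.exp(-0.5 * delta)         # likelihood of each model, up to a constant
    return delta, rel_lik, rel_lik / rel_lik.sum()

# Hypothetical AIC values for R = 4 candidate models fitted to the same data
delta, rel_lik, w = akaike_weights([102.1, 103.4, 108.9, 115.0])
```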
Confidence Set for K-L Best Model
• Three Heuristic Approaches (see [4])
- Based on the Akaike weights $w_i$ (see the sketch following this list)
  To obtain a 95% confidence set for the actual K-L best model, sum the Akaike weights from largest to smallest until the sum is just $\geq 0.95$; the corresponding subset of models is the confidence set for the K-L best model.
- Based on the AIC differences $\Delta_i$
  * $0 \leq \Delta_i \leq 2$: substantial support,
  * $4 \leq \Delta_i \leq 7$: considerably less support,
  * $\Delta_i > 10$: essentially no support.
  Remark
  * Particularly useful for nested models; may break down when the model set is large.
  * The guideline values may be somewhat larger for nonnested models.
- Motivated by likelihood-based inference
  The confidence set of models consists of all models for which the ratio $\mathcal{L}(g_i|y)/\mathcal{L}(g_{\min}|y) > \alpha$, where $\alpha$ might be chosen as $1/8$.
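The first and third approaches translate directly into code; a sketch in Python (not part of the original slides), with hypothetical weights and AIC differences chosen to be consistent with each other:

```python
import numpy as np

def weight_confidence_set(weights, level=0.95):
    """Smallest set of models whose Akaike weights, summed largest first, reach `level`."""
    order = np.argsort(weights)[::-1]
    cum = np.cumsum(np.asarray(weights)[order])
    return order[: int(np.searchsorted(cum, level) + 1)]

def likelihood_ratio_set(delta, alpha=1.0 / 8.0):
    """Models with L(g_i|y) / L(g_min|y) = exp(-Delta_i / 2) > alpha."""
    return np.where(np.exp(-0.5 * np.asarray(delta)) > alpha)[0]

w = np.array([0.62, 0.30, 0.07, 0.01])    # hypothetical Akaike weights
delta = np.array([0.0, 1.5, 4.4, 8.3])    # hypothetical AIC differences
print(weight_confidence_set(w), likelihood_ratio_set(delta))
```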
Multimodel Inference
• Unconditional Variance Estimator
$\widehat{\mathrm{var}}(\hat{\bar{\theta}}) = \left[\sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i|g_i) + (\hat{\theta}_i - \hat{\bar{\theta}})^2}\right]^2$
- $\theta$ is a parameter in common to all $R$ models.
- $\hat{\theta}_i$ is the estimate of $\theta$ based on model $g_i$.
- $\hat{\bar{\theta}}$ is the model-averaged estimate $\hat{\bar{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i$.
• Remark
- "Unconditional" means not conditional on any particular model, but still conditional on the full set of models considered.
- If $\theta$ is a parameter in common to only a subset of the $R$ models, then the $w_i$ must be recalculated based on just these models (so that the new weights satisfy $\sum w_i = 1$).
- Use the unconditional variance unless the selected model is strongly supported (for example, $w_{\min} > 0.9$).
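A sketch of the model-averaged estimate and its unconditional variance (Python; the per-model estimates, conditional variances, and weights are hypothetical):

```python
import numpy as np

def model_averaged(theta_hats, cond_vars, weights):
    """Model-averaged estimate of theta and its unconditional variance."""
    th, v, w = map(np.asarray, (theta_hats, cond_vars, weights))
    theta_bar = np.sum(w * th)                              # sum_i w_i * theta_hat_i
    se = np.sum(w * np.sqrt(v + (th - theta_bar) ** 2))     # unconditional std. error
    return theta_bar, se ** 2

theta_bar, var_uncond = model_averaged(theta_hats=[1.10, 1.25, 0.95],
                                       cond_vars=[0.04, 0.06, 0.05],
                                       weights=[0.62, 0.30, 0.08])
```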
Summary of Akaike Information Criteria
• Advantages
- Valid for both nested and nonnested models.
- Can compare models with different error distributions.
- Avoids multiple testing issues.
• Selected Model
- The model with the minimum AIC value.
- Specific to the given data set.
• Pitfalls in Using Akaike Information Criteria
- Cannot be used to compare models fitted to different data sets.
  For example, if nonlinear regression model $g_1$ is fitted to a data set with $n = 140$ observations, one cannot validly compare it with model $g_2$ fitted after 7 outliers have been deleted, leaving only $n = 133$ observations.
- The same response variable should be used for all candidate models.
  For example, if there is interest in the normal and log-normal model forms, the models have to be expressed, respectively, as
  $g_1(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad g_2(x|\mu, \sigma) = \frac{1}{x\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right],$
  instead of
  $g_1(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad g_2(\log(x)|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right].$
- Do not mix null hypothesis testing with information criteria.
  * An information criterion is not a "test", so avoid using "significant" and "not significant", or "rejected" and "not rejected", in reporting results.
  * Do not use AIC to rank models in the set and then test whether the best model is "significantly better" than the second-best model.
- All components of each likelihood should be retained when comparing different probability distributions.
References
[1] E.J. Bedrick and C.L. Tsai, Model Selection for Multivariate Regression in Small Samples, Biometrics, 50 (1994), 226–231.
[2] H. Bozdogan, Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions, Psychometrika, 52 (1987), 345–370.
[3] H. Bozdogan, Akaike's Information Criterion and Recent Developments in Information Complexity, Journal of Mathematical Psychology, 44 (2000), 62–91.
[4] K.P. Burnham and D.R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach, New York: Springer-Verlag, 1998.
[5] K.P. Burnham and D.R. Anderson, Multimodel Inference: Understanding AIC and BIC in Model Selection, Sociological Methods and Research, 33 (2004), 261–304.
[6] C.M. Hurvich and C.L. Tsai, Regression and Time Series Model Selection in Small Samples, Biometrika, 76 (1989).