Akaike Information Criterion
Shuhua Hu
Center for Research in Scientific Computation
North Carolina State University
Raleigh, NC
March 15, 2007
Background
• Model
statistical model: $X = h(t; q) + \epsilon$
- $h$: mathematical model such as an ODE model, PDE model, algebraic model, etc.
- $\epsilon$: random variable with some probability distribution, such as the normal distribution.
- $X$ is a random variable.
Under the assumption that $\epsilon$ is i.i.d. $N(0, \sigma^2)$, we have the
probability distribution model: $g(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x - h(t; q))^2}{2\sigma^2}\right]$, where $\theta = (q, \sigma)$.
- $g$: probability density function of $x$, depending on the parameter $\theta$.
- $\theta$ includes the mathematical model parameter $q$ and the statistical model parameter $\sigma$.
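As a concrete illustration, the sketch below (Python, not part of the original slides) evaluates the log of this density summed over the observations, i.e. the log-likelihood that the criteria later in these slides are built on; the exponential-decay model used for $h$ is a hypothetical stand-in for the mathematical model.

```python
import numpy as np

def log_likelihood(theta, t, x, h):
    """Log-likelihood of data x under X = h(t; q) + eps, eps ~ i.i.d. N(0, sigma^2),
    with theta = (q, sigma)."""
    *q, sigma = theta
    resid = x - h(t, np.asarray(q))
    n = len(x)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)

# Hypothetical algebraic model standing in for h(t; q)
h = lambda t, q: q[0] * np.exp(-q[1] * t)
```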
• Risk
- "Modeling" error (in terms of the uncertainty assumption)
  An inappropriate parametric probability distribution is specified for the data at hand.
- Estimation error
  $\|\vartheta - \hat{\theta}\|^2 = \underbrace{\|\vartheta - \theta\|^2}_{\text{bias}} + \underbrace{\|\theta - \hat{\theta}\|^2}_{\text{variance}}$
  $\vartheta$: parameter vector for the full reality model.
  $\theta$: the projection of $\vartheta$ onto the parameter space $\Theta_k$ of the approximating model.
  $\hat{\theta}$: the maximum likelihood estimate of $\theta$ in $\Theta_k$.
  * Variance
    For sufficiently large sample size $n$, we have $n\|\theta - \hat{\theta}\|^2 \overset{\text{asympt.}}{\sim} \chi^2_k$, where $E(\chi^2_k) = k$.
• Principle of Parsimony (with the same data set)
• Akaike Information Criterion
Kullback-Leibler Information
Information lost when an approximating model is used to approximate the full reality.
• Continuous Case
$I(f, g(\cdot|\theta)) = \int_{\Omega} f(x)\,\log\!\left(\frac{f(x)}{g(x|\theta)}\right) dx
= \int_{\Omega} f(x)\log(f(x))\,dx \;-\; \underbrace{\int_{\Omega} f(x)\log(g(x|\theta))\,dx}_{\text{relative K-L information}}$
- $f$: full reality or truth, in terms of a probability distribution.
- $g$: approximating model, in terms of a probability distribution.
- $\theta$: parameter vector in the approximating model $g$.
• Remark
- $I(f, g) \geq 0$, with $I(f, g) = 0$ if and only if $f = g$ almost everywhere.
- $I(f, g) \neq I(g, f)$, which implies that K-L information is not a true "distance".
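For intuition, the following sketch (assuming Python with NumPy/SciPy; the two normal densities are chosen only for illustration) evaluates $I(f, g)$ by numerical integration and exhibits both remarks: the value is nonnegative and not symmetric.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_information(f, g):
    """I(f, g) = integral over Omega of f(x) * log(f(x) / g(x)) dx."""
    integrand = lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x))
    return quad(integrand, -np.inf, np.inf)[0]

f = norm(0.0, 1.0)   # "truth" (illustrative choice)
g = norm(0.5, 1.5)   # approximating model
print(kl_information(f, g), kl_information(g, f))   # both >= 0, generally unequal
```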
Akaike Information Criterion (1973)
• Motivation
- The truth $f$ is unknown.
- The parameter $\theta$ in $g$ must be estimated from the empirical data $y$.
  * Data $y$ are generated from $f(x)$, i.e., they are a realization of the random variable $X$.
  * $\hat{\theta}(y)$: estimator of $\theta$. It is a random variable.
  * $I(f, g(\cdot|\hat{\theta}(y)))$ is a random variable.
- Remark
  * We need to use the expected K-L information $E_y[I(f, g(\cdot|\hat{\theta}(y)))]$ to measure the "distance" between $g$ and $f$.
• Selection Target
$\min_{g \in G} \; E_y[I(f, g(\cdot|\hat{\theta}(y)))]$
- $E_y[I(f, g(\cdot|\hat{\theta}(y)))] = \int_{\Omega} f(x)\log(f(x))\,dx \;-\; \underbrace{\int_{\Omega} f(y)\left[\int_{\Omega} f(x)\log(g(x|\hat{\theta}(y)))\,dx\right] dy}_{E_y E_x[\log(g(x|\hat{\theta}(y)))]}$.
- $G$: collection of "admissible" models (in terms of probability density functions).
- $\hat{\theta}$: maximum likelihood estimate based on model $g$ and data $y$.
- $y$: random sample from the density function $f(x)$.
• Model Selection Criterion
$\max_{g \in G} \; E_y E_x[\log(g(x|\hat{\theta}(y)))]$
• Key Result
An approximately unbiased estimator of $E_y E_x[\log(g(x|\hat{\theta}(y)))]$ for a large sample and a "good" model is
$\log(\mathcal{L}(\hat{\theta}|y)) - k$
- $\mathcal{L}$: likelihood function.
- $\hat{\theta}$: maximum likelihood estimate of $\theta$.
- $k$: number of estimated parameters (including the variance).
• Remark
- "Good" model: a model that is close to $f$ in the sense of having a small K-L value.
• Maximum Likelihood Case
$\mathrm{AIC} = \underbrace{-2\log(\mathcal{L}(\hat{\theta}|y))}_{\text{bias}} + \underbrace{2k}_{\text{variance}}$
- Calculate the AIC value for each model with the same data set; the "best" model is the one with the minimum AIC value.
- The value of AIC depends on the data $y$, which leads to model selection uncertainty.
• Least-Squares Case
Assumption: i.i.d. normally distributed errors
$\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k$
- RSS: residual sum of squares of the fitted model.
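A minimal sketch of both formulas (Python; the fitted-model numbers below are hypothetical). AIC values are only comparable across models fitted to the same data set.

```python
import numpy as np

def aic_ml(max_log_likelihood, k):
    """AIC = -2 log L(theta_hat | y) + 2k for the maximum likelihood case."""
    return -2.0 * max_log_likelihood + 2 * k

def aic_ls(rss, n, k):
    """AIC = n log(RSS / n) + 2k for least squares with i.i.d. normal errors;
    k counts all estimated parameters, including the error variance."""
    return n * np.log(rss / n) + 2 * k

# Hypothetical fits of two candidate models to the same n = 50 observations
fits = {"g1": {"rss": 12.3, "k": 3}, "g2": {"rss": 11.8, "k": 5}}
aics = {name: aic_ls(f["rss"], 50, f["k"]) for name, f in fits.items()}
best = min(aics, key=aics.get)   # "best" model = minimum AIC
```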
Takeuchi’s Information Criterion (1976)
Useful in cases where the model is not particularly close to the truth.
• Model Selection Criterion
$\max_{g \in G} \; E_y E_x[\log(g(x|\hat{\theta}(y)))]$
• Key Result
An approximately unbiased estimator of $E_y E_x[\log(g(x|\hat{\theta}(y)))]$ for a large sample is
$\log(\mathcal{L}(\hat{\theta}|y)) - \mathrm{tr}\!\left(J(\theta_0)\, I(\theta_0)^{-1}\right)$
- $J(\theta_0) = E_f\!\left[\left(\frac{\partial}{\partial\theta}\log(g(x|\theta))\right)\left(\frac{\partial}{\partial\theta}\log(g(x|\theta))\right)^{T}\right]\Big|_{\theta=\theta_0}$
- $I(\theta_0) = E_f\!\left[-\frac{\partial^2 \log(g(x|\theta))}{\partial\theta_i\,\partial\theta_j}\right]\Big|_{\theta=\theta_0}$
• Remark
- If $g \equiv f$, then $I(\theta_0) = J(\theta_0)$; hence $\mathrm{tr}(J(\theta_0) I(\theta_0)^{-1}) = k$.
- If $g$ is close to $f$, then $\mathrm{tr}(J(\theta_0) I(\theta_0)^{-1}) \approx k$.
• TIC
$\mathrm{TIC} = -2\log(\mathcal{L}(\hat{\theta}|y)) + 2\,\mathrm{tr}\!\left(\hat{J}(\hat{\theta})\,[\hat{I}(\hat{\theta})]^{-1}\right),$
where $\hat{I}(\hat{\theta})$ and $\hat{J}(\hat{\theta})$ are both $k \times k$ matrices, and
$\hat{I}(\hat{\theta}) = -\sum_{i=1}^{n} \frac{\partial^2 \log(g(x_i|\hat{\theta}))}{\partial\theta^2} \quad\rightarrow\quad \text{estimate of } I(\theta_0)$
$\hat{J}(\hat{\theta}) = \sum_{i=1}^{n} \left[\frac{\partial}{\partial\theta}\log(g(x_i|\hat{\theta}))\right]\left[\frac{\partial}{\partial\theta}\log(g(x_i|\hat{\theta}))\right]^{T} \quad\rightarrow\quad \text{estimate of } J(\theta_0)$
• Remark
- Attractive in theory.
- Rarely used in practice, because a very large sample size is needed to obtain good estimates of both $I(\theta_0)$ and $J(\theta_0)$.
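To make the trace term concrete, here is a sketch (Python, not from the slides) that computes TIC for a normal approximating model $g(x|\mu,\sigma)$ using its analytic scores and second derivatives; with data actually drawn from a normal distribution, $\mathrm{tr}(\hat{J}\hat{I}^{-1})$ comes out close to $k = 2$, as the remark above predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # truth is normal, so g is "close to f"

# MLE of the normal approximating model g(x | mu, sigma)
mu, sigma = x.mean(), x.std()

# Per-observation scores d/dtheta log g(x_i | theta), theta = (mu, sigma)
scores = np.column_stack([(x - mu) / sigma**2,
                          -1.0 / sigma + (x - mu) ** 2 / sigma**3])
J_hat = scores.T @ scores                                  # sum of score outer products

# Summed negative second derivatives of log g (observed information)
I_hat = np.array([
    [len(x) / sigma**2,                np.sum(2 * (x - mu) / sigma**3)],
    [np.sum(2 * (x - mu) / sigma**3),  np.sum(3 * (x - mu) ** 2 / sigma**4 - 1 / sigma**2)],
])

trace_term = np.trace(J_hat @ np.linalg.inv(I_hat))        # approximately k = 2 here
log_lik = np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / (2 * sigma**2))
TIC = -2 * log_lik + 2 * trace_term
```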
A Small Sample AIC
Used when the sample size is small relative to the number of parameters (rule of thumb: $n/k < 40$).
• Univariate Case
Assumption: i.i.d. normal error distribution, with the truth contained in the model set.
$\mathrm{AIC}_c = \mathrm{AIC} + \underbrace{\frac{2k(k+1)}{n-k-1}}_{\text{bias correction}}$
• Remark
- The bias-correction term varies by type of model (e.g., normal, exponential, Poisson).
- In practice, AICc is generally suitable unless the underlying probability distribution is extremely nonnormal, especially in terms of being strongly skewed.
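A one-line sketch of the correction (Python; the inputs are hypothetical). With $n/k$ well under 40 the correction is non-negligible, and as $n$ grows AICc converges to AIC.

```python
def aicc(aic, n, k):
    """Small-sample corrected AIC for the univariate i.i.d. normal-error case."""
    return aic + 2 * k * (k + 1) / (n - k - 1)

print(aicc(aic=57.2, n=30, k=4))   # hypothetical AIC value; n/k = 7.5 < 40
```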
• Multivariate Case
Assumption: each row of $\epsilon$ is i.i.d. $N(0, \Sigma)$.
$\mathrm{AIC}_c = \mathrm{AIC} + 2\,\frac{k(\tilde{k} + 1 + p)}{n - \tilde{k} - 1 - p}$
- Applies to the multivariate linear model $Y = TB + \epsilon$, where $Y \in \mathbb{R}^{n \times p}$, $T \in \mathbb{R}^{n \times \tilde{k}}$, $B \in \mathbb{R}^{\tilde{k} \times p}$.
- $p$: total number of components.
- $n$: number of independent multivariate observations, each with $p$ nonindependent components.
- $k$: total number of unknown parameters, with $k = \tilde{k} p + p(p+1)/2$.
• Remark
- Bedrick and Tsai [1] claim that this result can be extended to the multivariate nonlinear regression model.
AIC Differences, Likelihood of a Model, Akaike Weights
• AIC Differences
Information loss when the fitted model is used rather than the best approximating model:
$\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$
- $\mathrm{AIC}_{\min}$: AIC value of the best model in the set.
• Likelihood of a Model
Useful for making inferences about the relative strength of evidence for each of the models in the set:
$\mathcal{L}(g_i|y) \propto \exp\!\left(-\tfrac{1}{2}\Delta_i\right)$, where $\propto$ means "is proportional to".
• Akaike Weights
"Weight of evidence" in favor of model $i$ being the best approximating model in the set:
$w_i = \frac{\exp(-\tfrac{1}{2}\Delta_i)}{\sum_{r=1}^{R} \exp(-\tfrac{1}{2}\Delta_r)}$
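The three quantities above can be computed together; a minimal sketch in Python (the AIC values are hypothetical):

```python
import numpy as np

def akaike_weights(aic_values):
    """Return (Delta_i, relative model likelihoods, Akaike weights)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()                # AIC differences
    rel_lik = np.exp(-0.5 * delta)         # likelihood of each model, up to a constant
    return delta, rel_lik, rel_lik / rel_lik.sum()

# Hypothetical AIC values for R = 4 candidate models fitted to the same data
delta, rel_lik, w = akaike_weights([102.1, 103.4, 108.9, 115.0])
```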
Confidence Set for K-L Best Model
• Three Heuristic Approaches (see [4])
- Based on the Akaike weights $w_i$ (see the sketch following this list)
  To obtain a 95% confidence set for the actual K-L best model, sum the Akaike weights from largest to smallest until the sum is just $\geq 0.95$; the corresponding subset of models is the confidence set for the K-L best model.
- Based on the AIC differences $\Delta_i$
  * $0 \leq \Delta_i \leq 2$: substantial support,
  * $4 \leq \Delta_i \leq 7$: considerably less support,
  * $\Delta_i > 10$: essentially no support.
  Remark
  * Particularly useful for nested models; may break down when the model set is large.
  * The guideline values may be somewhat larger for nonnested models.
- Motivated by likelihood-based inference
  The confidence set of models consists of all models for which the ratio $\mathcal{L}(g_i|y)/\mathcal{L}(g_{\min}|y) > \alpha$, where $\alpha$ might be chosen as $1/8$.
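The first and third approaches translate directly into code; a sketch in Python (not part of the original slides), with hypothetical weights and AIC differences chosen to be consistent with each other:

```python
import numpy as np

def weight_confidence_set(weights, level=0.95):
    """Smallest set of models whose Akaike weights, summed largest first, reach `level`."""
    order = np.argsort(weights)[::-1]
    cum = np.cumsum(np.asarray(weights)[order])
    return order[: int(np.searchsorted(cum, level) + 1)]

def likelihood_ratio_set(delta, alpha=1.0 / 8.0):
    """Models with L(g_i|y) / L(g_min|y) = exp(-Delta_i / 2) > alpha."""
    return np.where(np.exp(-0.5 * np.asarray(delta)) > alpha)[0]

w = np.array([0.62, 0.30, 0.07, 0.01])    # hypothetical Akaike weights
delta = np.array([0.0, 1.5, 4.4, 8.3])    # hypothetical AIC differences
print(weight_confidence_set(w), likelihood_ratio_set(delta))
```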
Multimodel Inference
• Unconditional Variance Estimator
$\widehat{\mathrm{var}}(\hat{\bar{\theta}}) = \left[\sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i|g_i) + (\hat{\theta}_i - \hat{\bar{\theta}})^2}\right]^2$
- $\theta$ is a parameter in common to all $R$ models.
- $\hat{\theta}_i$ is the estimate of $\theta$ based on model $g_i$.
- $\hat{\bar{\theta}}$ is the model-averaged estimate $\hat{\bar{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i$.
• Remark
- "Unconditional" means not conditional on any particular model, but still conditional on the full set of models considered.
- If $\theta$ is a parameter in common to only a subset of the $R$ models, then the $w_i$ must be recalculated based on just these models (so that the new weights satisfy $\sum w_i = 1$).
- Use the unconditional variance unless the selected model is strongly supported (for example, $w_{\min} > 0.9$).
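A sketch of the model-averaged estimate and its unconditional variance (Python; the per-model estimates, conditional variances, and weights are hypothetical):

```python
import numpy as np

def model_averaged(theta_hats, cond_vars, weights):
    """Model-averaged estimate of theta and its unconditional variance."""
    th, v, w = map(np.asarray, (theta_hats, cond_vars, weights))
    theta_bar = np.sum(w * th)                              # sum_i w_i * theta_hat_i
    se = np.sum(w * np.sqrt(v + (th - theta_bar) ** 2))     # unconditional std. error
    return theta_bar, se ** 2

theta_bar, var_uncond = model_averaged(theta_hats=[1.10, 1.25, 0.95],
                                       cond_vars=[0.04, 0.06, 0.05],
                                       weights=[0.62, 0.30, 0.08])
```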
Summary of Akaike Information Criteria
• Advantages
- Valid for both nested and nonnested models.
- Can compare models with different error distributions.
- Avoids multiple testing issues.
• Selected Model
- The model with the minimum AIC value.
- Specific to the given data set.
• Pitfalls in Using Akaike Information Criteria
- Cannot be used to compare models fitted to different data sets.
  For example, if nonlinear regression model $g_1$ is fitted to a data set with $n = 140$ observations, one cannot validly compare it with model $g_2$ fitted after 7 outliers have been deleted, leaving only $n = 133$ observations.
- The same response variable should be used for all candidate models.
  For example, if there is interest in the normal and log-normal model forms, the models have to be expressed, respectively, as
  $g_1(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad g_2(x|\mu, \sigma) = \frac{1}{x\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right],$
  instead of
  $g_1(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad g_2(\log(x)|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{(\log(x)-\mu)^2}{2\sigma^2}\right].$
- Do not mix null hypothesis testing with information criteria.
  * An information criterion is not a "test", so avoid using "significant" and "not significant", or "rejected" and "not rejected", in reporting results.
  * Do not use AIC to rank models in the set and then test whether the best model is "significantly better" than the second-best model.
- All components of each likelihood should be retained when comparing different probability distributions.
References
[1] E.J. Bedrick and C.L. Tsai, Model Selection for Multivariate Regression in Small Samples, Biometrics, 50 (1994), 226–231.
[2] H. Bozdogan, Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions, Psychometrika, 52 (1987), 345–370.
[3] H. Bozdogan, Akaike's Information Criterion and Recent Developments in Information Complexity, Journal of Mathematical Psychology, 44 (2000), 62–91.
[4] K.P. Burnham and D.R. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach, New York: Springer-Verlag, 1998.
[5] K.P. Burnham and D.R. Anderson, Multimodel Inference: Understanding AIC and BIC in Model Selection, Sociological Methods and Research, 33 (2004), 261–304.
[6] C.M. Hurvich and C.L. Tsai, Regression and Time Series Model Selection in Small Samples, Biometrika, 76 (1989).