Statistical Theory 2
Basic Statistical Inference
Tassos Magdalinos
University of Southampton
November 2010
1 Point Estimation
1.1 MSE and Unbiased Estimation
Consider an estimation problem: we know that a random phenomenon has a certain behaviour given by a density $f_X(x;\theta)$ which depends on an unknown parameter $\theta$; e.g., we know that it follows an $N(0,\sigma^2)$ distribution, but we do not know $\sigma^2$. The idea of estimation is to use data (whose mathematical representation is a collection of random variables) in order to estimate $\theta$, i.e., to choose a random variable $\hat\theta = g(X_1,\ldots,X_n)$ based on a sample of $n$ observations from $f(x;\theta)$ in order to make a good guess for $\theta$.
Here is some jargon:
Definition 1 Given an unknown parameter $\theta$, the parameter space $\Theta$ is the set containing all possible values of $\theta$. Typically, $\Theta \subseteq \mathbb{R}^k$.
For example, if the data is generated by a $N(\mu,1)$ distribution with parameter $\mu$, then $\theta = \mu$ and $\Theta = \mathbb{R}$. If the data is generated by a $N(\mu,\sigma^2)$ distribution, then $\theta = (\mu,\sigma^2)'$ and $\Theta = \mathbb{R}\times(0,\infty)$.
Definition 2 A statistic based on a sample of observations $(X_1,\ldots,X_n)$ is any function $g(X_1,\ldots,X_n)$ of the random variables $X_1,\ldots,X_n$. If the observations come from a family of distributions $f(X_1,\ldots,X_n;\theta)$ indexed by an unknown parameter $\theta$, an estimator $\hat\theta$ of $\theta$ is a statistic used to approximate $\theta$. If $\theta$ is a vector of unknown parameters, $\hat\theta$ is a vector of estimators of the same order as $\theta$. We sometimes write an estimator as $\hat\theta_n$ in order to emphasize its dependence on the sample size $n$.
Note that, by definition, an estimator $\hat\theta_n$ is a random variable (or vector, according to the dimension of $\theta$) that depends on the random variables $X_1,\ldots,X_n$ only: $\hat\theta_n$ does not depend on $\theta$.
How do we know that a certain choice of estimator $\hat\theta_n$ is good for estimating $\theta$? One way to look at it is to require that $\hat\theta_n$ be close to $\theta$ with respect to some distance (norm). Provided that the estimator $\hat\theta_n$ has finite variance, an appropriate distance to choose is the mean squared error.
Definition 3 The mean squared error (MSE) of a statistic $\hat\theta_n$ is defined as
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) := E\big[(\hat\theta_n-\theta)^2\big].$$
We prefer an estimator $\hat\theta_n$ to a competing estimator $\tilde\theta_n$ iff
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) \le \mathrm{MSE}\big(\tilde\theta_n,\theta\big) \quad \text{for all } \theta\in\Theta.$$
Exercise 1 Show that
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) = \big[E(\hat\theta_n)-\theta\big]^2 + \mathrm{var}\big(\hat\theta_n\big). \tag{1}$$
Ideally, we would like to minimise $\mathrm{MSE}(\hat\theta_n,\theta)$ w.r.t. $\hat\theta_n$ in order to find an optimal estimator. However, joint minimisation of the RHS of (1) is not possible in general. Things get a lot simpler if $\hat\theta_n$ has the additional property of unbiasedness.
Definition 4 An estimator $\hat\theta_n$ of $\theta$ is called unbiased if $E(\hat\theta_n) = \theta$.
If $\hat\theta_n$ is unbiased, (1) implies that $\mathrm{MSE}(\hat\theta_n,\theta) = \mathrm{var}(\hat\theta_n)$. The MSE minimisation criterion then implies that, between two unbiased estimators, we choose the one with the smallest variance.
However, one needs to bear in mind that unbiasedness is a compromise: there exist examples of a biased estimator that has smaller MSE than an unbiased estimator. See Exercise 5, Question 6.
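A rough numerical illustration of this point (a minimal Monte Carlo sketch, assuming Python with numpy is available; the setup is ours, not from the notes) compares the variance estimator with divisor $n-1$ against the biased one with divisor $n$ for normal data, and checks the decomposition (1):

```python
import numpy as np

# Monte Carlo sketch: compare the unbiased variance estimator (divisor n-1)
# with the biased one (divisor n) for i.i.d. N(mu, sigma^2) data, and verify
# the decomposition MSE = bias^2 + variance from equation (1).
rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
estimators = {
    "unbiased (n-1)": X.var(axis=1, ddof=1),
    "biased MLE (n)": X.var(axis=1, ddof=0),
}
for name, est in estimators.items():
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name}: bias={bias:+.4f}  MSE={mse:.4f}  "
          f"bias^2+var={bias ** 2 + est.var():.4f}")
```

For small $n$ the biased estimator typically shows the smaller MSE, consistently with the remark above.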
Finally, here is another desirable property of estimators in large samples.
Definition 5 An estimator $\hat\theta_n$ of $\theta$ is called consistent if $\hat\theta_n \to_p \theta$ as the sample size $n$ tends to $\infty$.
1.2 The Method of Maximum Likelihood
Suppose that $X_1, X_2,\ldots,X_n$ are random variables from a discrete family of distributions $\{f(X_1,\ldots,X_n;\theta) : \theta\in\Theta\}$. The probability that $X_1=x_1,\ldots,X_n=x_n$ is the joint mass function:
$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta) = P_\theta(X_1=x_1,\ldots,X_n=x_n).$$
We might ask what value of $\theta$ maximises the probability of obtaining this particular observed sample $x_1,\ldots,x_n$, i.e., what value of $\theta$ maximises
$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta).$$
This maximising value of $\theta$ would provide a good estimator of $\theta$, because it would assign the largest probability to this particular sample occurring. Put differently, the maximising value of $\theta$ picks the mass function most likely to have generated $x_1,\ldots,x_n$.
In general, $X = (X_1,\ldots,X_n)'$ is a random vector from a family of densities (or mass functions) $\{f(X;\theta) : \theta\in\Theta\}$.
Definition 6 The likelihood function $L(\theta)$ is defined as the density (or mass) of the data $X=(X_1,\ldots,X_n)'$, but regarded as a function of $\theta$ rather than $X$:
$$L(\theta) = f(X;\theta).$$
The maximum likelihood estimator (MLE) $\hat\theta_n$ of $\theta$ is defined as the maximiser of the likelihood function $L(\theta)$ over the parameter space $\Theta$, i.e.,
$$\hat\theta_n = \operatorname*{arg\,max}_{\theta\in\Theta} L(\theta) \quad\text{or}\quad L(\hat\theta_n) = \max_{\theta\in\Theta} L(\theta). \tag{2}$$
Exercise 2 Let $h(\cdot)$ be a positive and differentiable function. Show that $h(x)$ is maximised at the same point as $\log h(x)$.
In view of the above result we define the log-likelihood function
$$l(\theta) := \log L(\theta),$$
and the MLE of $\theta$ can be found as the maximiser of $l(\theta)$:
$$\hat\theta_n = \operatorname*{arg\,max}_{\theta\in\Theta} l(\theta) \quad\text{or}\quad l(\hat\theta_n) = \max_{\theta\in\Theta} l(\theta). \tag{3}$$
There are no statistics involved in passing from (2) to (3): it is just a matter of algebraic convenience to use (3).
Example 1 (Independent sampling) When $X=(X_1,\ldots,X_n)'$ consists of independent random variables $X_1,\ldots,X_n$,
$$f(X_1,\ldots,X_n) = f(X_1)\cdots f(X_n). \tag{4}$$
Therefore,
$$L(\theta) = \prod_{i=1}^n f(X_i;\theta) \quad\text{and}\quad l(\theta) = \sum_{i=1}^n \log f(X_i;\theta). \tag{5}$$
Example 2 Let $X_1,\ldots,X_n$ be i.i.d. Poisson($\lambda$) r.v.s with mass $f(X_i;\lambda) = e^{-\lambda}\lambda^{X_i}/X_i!$ for $X_i\in\{0,1,2,\ldots\}$. By (5) the likelihood function is given by
$$L(\lambda) = \prod_{i=1}^n f(X_i;\lambda) = \prod_{i=1}^n e^{-\lambda}\,\frac{\lambda^{X_i}}{X_i!} = e^{-n\lambda}\,\frac{\lambda^{X_1}\cdots\lambda^{X_n}}{\prod_{i=1}^n X_i!} = e^{-n\lambda}\,\frac{\lambda^{X_1+\cdots+X_n}}{\prod_{i=1}^n X_i!}.$$
Taking logs:
$$l(\lambda) = \log L(\lambda) = -n\lambda + \sum_{i=1}^n X_i\log\lambda - \log\prod_{i=1}^n X_i!,$$
$$\frac{\partial l}{\partial\lambda} = -n + \frac{\sum_{i=1}^n X_i}{\lambda} = 0 \implies \lambda = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n.$$
So the MLE of $\lambda$ is $\hat\lambda_n = \bar X_n$. This is not surprising since $\lambda = E(X_1)$.
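A quick numerical check of this derivation (a sketch assuming Python with numpy and scipy; the simulated data and variable names are ours): maximising the Poisson log-likelihood numerically should reproduce the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: maximise the Poisson log-likelihood l(lam) = -n*lam + sum(X)*log(lam)
# numerically (the -log(prod X_i!) term is constant in lam, so it is dropped)
# and compare the maximiser with the sample mean derived above.
rng = np.random.default_rng(1)
lam_true, n = 3.5, 200
X = rng.poisson(lam_true, size=n)

neg_loglik = lambda lam: -(-n * lam + X.sum() * np.log(lam))
res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, X.mean())  # the two numbers agree up to optimiser tolerance
```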
Example 3 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables. Here, we have a vector of unknown parameters: $\theta = (\mu,\sigma^2)'$. By (5),
$$L(\mu,\sigma^2) = \prod_{i=1}^n f(X_i;\mu,\sigma^2) = \prod_{i=1}^n \big(2\pi\sigma^2\big)^{-1/2}\exp\Big\{-\frac{1}{2\sigma^2}(X_i-\mu)^2\Big\} = \big(2\pi\sigma^2\big)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2\Big\},$$
so the log-likelihood is given by
$$l(\mu,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$
The MLE for $(\mu,\sigma^2)'$ will be found by jointly maximising $l(\mu,\sigma^2)$ over $\mu$ and $\sigma^2$, i.e. it will be the solution of the two-equation system $\partial l/\partial\mu = 0$ and $\partial l/\partial\sigma^2 = 0$. Since $\sigma^2 > 0$,
$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu) = \frac{1}{\sigma^2}\big(n\bar X_n - n\mu\big) = 0 \implies \mu = \bar X_n. \tag{6}$$
Differentiating the log-likelihood w.r.t. $\sigma^2$:
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (X_i-\mu)^2 = 0 \implies -n + \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)^2 = 0 \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2. \tag{7}$$
Requiring (6) and (7) to hold simultaneously, we obtain the MLE $\big(\hat\mu_n,\hat\sigma^2_n\big)$ of $(\mu,\sigma^2)$ to be
$$\hat\mu_n = \bar X_n, \qquad \hat\sigma^2_n = \frac{1}{n}\sum_{i=1}^n \big(X_i-\bar X_n\big)^2.$$
Remark 1 By Exercise 6, Question 5, $\hat\sigma^2_n$ is biased. Thus, MLEs are not unbiased in general. On the other hand, Exercise 6, Question 5 shows that, despite being biased, the MLE $\hat\sigma^2_n$ has smaller MSE than the unbiased estimator $s^2_n$.
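The same numerical check works in the two-parameter case (again a sketch assuming numpy/scipy; the data-generating values are ours): a numerical maximiser of $l(\mu,\sigma^2)$ should land on the closed-form MLE just derived.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: minimise the negative N(mu, sigma^2) log-likelihood jointly over
# (mu, sigma^2) and compare with the closed form (X_bar, (1/n) sum (X_i - X_bar)^2).
rng = np.random.default_rng(2)
mu_true, sigma2_true, n = 1.0, 2.0, 500
X = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)

def neg_loglik(theta):
    mu, sigma2 = theta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((X - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-8, None)])
print(res.x)                      # numerical MLE of (mu, sigma^2)
print(X.mean(), X.var(ddof=0))    # closed-form MLE
```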
Example 4 (Dependent sampling and conditional likelihood function) When $X_1,\ldots,X_n$ are not independent, (4) no longer holds, so (5) is not valid. A way to proceed with setting up the likelihood function is to use the definition of conditional density:
$$f(X_2|X_1) = \frac{f(X_2,X_1)}{f(X_1)} \implies f(X_2,X_1) = f(X_2|X_1)\,f(X_1). \tag{8}$$
Then
$$f(X_3,X_2,X_1) = f(X_3|X_2,X_1)\,f(X_2,X_1) = f(X_3|X_2,X_1)\,f(X_2|X_1)\,f(X_1).$$
Proceeding like this we obtain the joint density of $(X_n,\ldots,X_1)$:
$$f(X_n,X_{n-1},\ldots,X_1) = f(X_n|X_{n-1},\ldots,X_1)\,f(X_{n-1}|X_{n-2},\ldots,X_1)\cdots f(X_2|X_1)\,f(X_1).$$
The above factorisation yields the following expressions for the likelihood and log-likelihood functions:
$$L(\theta) = f(X_1;\theta)\prod_{i=2}^n f(X_i|X_{i-1},\ldots,X_1;\theta), \tag{9}$$
$$l(\theta) = \log f(X_1;\theta) + \sum_{i=2}^n \log f(X_i|X_{i-1},\ldots,X_1;\theta). \tag{10}$$
Example 5 (Sampling from a Gaussian AR(1)) A Gaussian AR(1) process is defined recursively as
$$X_t = \rho X_{t-1} + u_t, \qquad t\in\{2,\ldots,n\},$$
where $X_1$ is a constant, $\rho\in(-1,1)$ and $u_2,\ldots,u_n$ are i.i.d. $N(0,\sigma^2)$ random variables. Here there are two unknown parameters: $\theta = (\rho,\sigma^2)'$. Since $X_2 = \rho X_1 + u_2$, $X_1$ is a constant and $u_2 \sim N(0,\sigma^2)$, we have $X_2|X_1 \sim N(\rho X_1,\sigma^2)$. Similarly,
$$X_t = \rho X_{t-1} + u_t, \quad u_t \sim N(0,\sigma^2) \implies X_t|X_{t-1} \sim N(\rho X_{t-1},\sigma^2),$$
so
$$f(X_t|X_{t-1},\ldots,X_1;\rho,\sigma^2) = \big(2\pi\sigma^2\big)^{-1/2}\exp\Big\{-\frac{1}{2\sigma^2}(X_t-\rho X_{t-1})^2\Big\}. \tag{11}$$
Noting that $\log f(X_1) = 0$ on the support of $f(X_1,\ldots,X_n)$ (why?), we can substitute (11) into (10) to obtain
$$l(\rho,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^n (X_t-\rho X_{t-1})^2.$$
Differentiating $l(\rho,\sigma^2)$ w.r.t. $\rho$:
$$\frac{\partial l}{\partial\rho} = \frac{1}{\sigma^2}\sum_{t=2}^n (X_t-\rho X_{t-1})X_{t-1} = 0 \implies \sum_{t=2}^n X_t X_{t-1} - \rho\sum_{t=2}^n X_{t-1}^2 = 0 \implies \hat\rho_n = \frac{\sum_{t=2}^n X_t X_{t-1}}{\sum_{t=2}^n X_{t-1}^2}. \tag{12}$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{t=2}^n (X_t-\rho X_{t-1})^2 = 0 \implies \sigma^2 = \frac{1}{n}\sum_{t=2}^n (X_t-\rho X_{t-1})^2.$$
Combining this with (12) we obtain
$$\hat\sigma^2_n = \frac{1}{n}\sum_{t=2}^n \big(X_t - \hat\rho_n X_{t-1}\big)^2.$$
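These closed forms are easy to compute directly. The following sketch (assuming numpy; the simulation setup and parameter values are ours) simulates a Gaussian AR(1) path and evaluates (12) and the expression for $\hat\sigma^2_n$:

```python
import numpy as np

# Sketch: simulate a Gaussian AR(1) with constant X_1 and compute the MLEs.
rng = np.random.default_rng(3)
rho_true, sigma2_true, n = 0.6, 1.0, 1_000

X = np.empty(n)
X[0] = 0.5                                   # X_1 is a constant
for t in range(1, n):
    X[t] = rho_true * X[t - 1] + rng.normal(0.0, np.sqrt(sigma2_true))

rho_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)     # equation (12)
sigma2_hat = np.sum((X[1:] - rho_hat * X[:-1]) ** 2) / n   # (1/n) sum_{t=2}^n (...)^2
print(rho_hat, sigma2_hat)
```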
1.3 Score Vector and Information Matrix
Consider a random vector $X=(X_1,\ldots,X_n)'$ with density $f(X;\theta) = L(\theta)$ depending on a parameter $\theta\in\mathbb{R}^k$, and let $l(\theta) = \log L(\theta) = \log f(X;\theta)$. The $k\times 1$ random vector
$$s(X;\theta) = \frac{\partial l}{\partial\theta}$$
is called the score vector. When $l(\theta)$ is differentiable, the MLE $\hat\theta_n$ satisfies
$$s\big(X;\hat\theta_n\big) = 0. \tag{13}$$
In this section, we will determine the mean and covariance matrix of the score vector $s = s(X;\theta)$. In what follows, we assume that $X_1,\ldots,X_n$ are independent and that the following two regularity conditions are satisfied:

(RC1) The support of $f$, $S = \{X\in\mathbb{R}^n : f(X;\theta) > 0\}$, is independent of $\theta$.

(RC2) $f(X_1;\theta)$ is twice differentiable with $E\big[\partial^2\log f(X_1;\theta)/\partial\theta\,\partial\theta'\big] < 0$.

When $\theta$ is an $\mathbb{R}^k$ vector, (RC2) means that the $k\times k$ matrix $E\big[\partial^2\log f(X_1;\theta)/\partial\theta\,\partial\theta'\big]$ is negative definite.
Condition (RC1) is satisfied for most densities $f(\cdot;\theta)$; e.g., if $X\sim\mathrm{Exp}(\theta)$, then
$$f(X;\theta) = \begin{cases} \theta e^{-\theta X}, & X > 0, \\ 0, & X \le 0, \end{cases}$$
and $S = (0,\infty)$ does not depend on $\theta$. The same holds for all distributions that we have encountered apart from the uniform distribution on $[0,\theta]$:
$$f(X;\theta) = \begin{cases} 1/\theta, & X \in [0,\theta], \\ 0, & X \notin [0,\theta], \end{cases}$$
so in this case $S = [0,\theta]$ depends on $\theta$. This is why the uniform distribution on $[0,\theta]$ is a source of counterexamples in likelihood theory.
The importance of (RC1) lies in that it permits differentiation under the integral sign, i.e., it ensures that
$$\frac{\partial}{\partial\theta}\int_S f(X;\theta)\,dX = \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,dX, \tag{14}$$
where $S$ and $f$ are defined in (RC1). This follows from the so-called Leibniz rule [see Spiegel, p. 163, eq. (17)].
Exercise 3 Show that (14) is satisfied when $X\sim\mathrm{Exp}(\theta)$ but not when $X\sim U[0,\theta]$.
Theorem 1 Under the regularity condition (RC1),
$$E(s) = 0, \qquad E(ss') = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big]. \tag{15}$$
Proof. Since $L(\theta) = f(X;\theta)$ is a density,
$$\int_S f(X;\theta)\,dX = 1. \tag{16}$$
Note that $\int_S f(X;\theta)\,dX$ means
$$\int\cdots\int_{\{X\in\mathbb{R}^n:\,f(X_1,\ldots,X_n;\theta)>0\}} f(X_1,\ldots,X_n;\theta)\,dX_1\cdots dX_n.$$
Differentiating (16) w.r.t. $\theta$ and using (14), we obtain
$$0 = \frac{\partial}{\partial\theta}\int_S f(X;\theta)\,dX = \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,dX. \tag{17}$$
The relationship between the derivative of the likelihood function and the score vector is given by the following identity:
$$s(X;\theta) = \frac{\partial\log f(X;\theta)}{\partial\theta} = \frac{1}{f(X;\theta)}\,\frac{\partial f(X;\theta)}{\partial\theta} \implies \frac{\partial f(X;\theta)}{\partial\theta} = s(X;\theta)\,f(X;\theta). \tag{18}$$
Combining (17) and (18) we obtain
$$0 = \int_S s(X;\theta)\,f(X;\theta)\,dX = E[s(X;\theta)],$$
showing the first part of (15). For the second part, differentiating both sides of
$$0 = E(s') = \int_S \frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\,dX$$
and using (14) and the rule for differentiating a product, we obtain
$$\begin{aligned}
0 &= \frac{\partial}{\partial\theta}\int_S \frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\,dX \\
&= \int_S \frac{\partial}{\partial\theta}\Big[\frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\Big]\,dX \quad\text{[by (14)]} \\
&= \int_S \frac{\partial^2\log f(X;\theta)}{\partial\theta\,\partial\theta'}\,f(X;\theta)\,dX + \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,\frac{\partial\log f(X;\theta)}{\partial\theta'}\,dX \\
&= E\Big[\frac{\partial^2\log f(X;\theta)}{\partial\theta\,\partial\theta'}\Big] + \int_S s(X;\theta)\,f(X;\theta)\,\frac{\partial\log f(X;\theta)}{\partial\theta'}\,dX \quad\text{[by (18)]} \\
&= E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] + \int_S s(X;\theta)\,s(X;\theta)'\,f(X;\theta)\,dX \\
&= E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] + E\big[s(X;\theta)\,s(X;\theta)'\big],
\end{aligned}$$
and the proof follows since this last sum equals zero, i.e. $E(ss') = -E\big[\partial^2 l/\partial\theta\,\partial\theta'\big]$.
Note that Theorem 1 is not restricted to independent or identically distributed samples. The above proof depends only on the validity of the Leibniz rule for differentiation under the integral sign, which is ensured by (RC1).
Definition 7 The $k\times k$ matrix
$$I(\theta) = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] \tag{19}$$
is called (Fisher's) information matrix.
Remarks.
1. Under (RC1), the information matrix $I(\theta)$ coincides with the covariance matrix of the score vector $s(X;\theta)$.
2. If $X_1,\ldots,X_n$ are i.i.d. random variables, then (5) implies that
$$I(\theta) = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] = -E\Big[\frac{\partial^2}{\partial\theta\,\partial\theta'}\sum_{i=1}^n \log f(X_i;\theta)\Big] = -\sum_{i=1}^n E\Big[\frac{\partial^2\log f(X_i;\theta)}{\partial\theta\,\partial\theta'}\Big] = -nE\Big[\frac{\partial^2\log f(X_1;\theta)}{\partial\theta\,\partial\theta'}\Big] \tag{20}$$
since the random variables $Z_1,\ldots,Z_n$ defined by
$$Z_i = \frac{\partial^2\log f(X_i;\theta)}{\partial\theta\,\partial\theta'}, \quad i\in\{1,\ldots,n\},$$
are identically distributed, so they all have the same expectation. Under assumption (RC2), (20) has two important consequences and one not-so-important one:
(i) $I(\theta)$ is a positive definite matrix.
(ii) $I(\theta)$ increases to $\infty$ at a rate equal to the sample size $n$ as $n\to\infty$:
$$\lim_{n\to\infty}\frac{1}{n}\,I(\theta) = -E\Big[\frac{\partial^2\log f(X_1;\theta)}{\partial\theta\,\partial\theta'}\Big], \tag{21}$$
the limit matrix being finite and positive definite. While our derivation of (20) is confined to the case where $X_1,\ldots,X_n$ are i.i.d. random variables, the limit behaviour of the information matrix is given by (21) in much more general situations: it applies even under dependent sampling schemes, the main requirement being ergodicity.
(iii) The expression on the RHS of (20) is equal to $n$ times the information contained in the first observation $X_1$. Thus, the information contained in $n$ sample points of an i.i.d. sample is equal to $n$ times the information contained in the first observation.
3. In the next section we will see that, for any unbiased estimator $\hat\theta$, $\mathrm{var}(\hat\theta) \ge I(\theta)^{-1}$. If $\hat\theta$ is unbiased, it is centered around $\theta$, so the more information about $\theta$ an observation provides, the smaller the variance of $\hat\theta$ should be: hence the name "information matrix" for (19).
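Both conclusions of Theorem 1 can be checked by simulation. For the Poisson model of Example 2 the score is $s(X;\lambda) = -n + \sum_{i=1}^n X_i/\lambda$, so $E(s)=0$ and $\mathrm{var}(s) = I(\lambda) = n/\lambda$. A minimal sketch (assuming numpy; the parameter values are ours):

```python
import numpy as np

# Sketch: for i.i.d. Poisson(lam) samples, the score is s = -n + sum(X)/lam.
# Check the information identity (15)/(19): E(s) = 0 and var(s) = I(lam) = n/lam.
rng = np.random.default_rng(4)
lam, n, reps = 2.0, 50, 200_000

X = rng.poisson(lam, size=(reps, n))
score = -n + X.sum(axis=1) / lam

print(score.mean())          # approximately 0
print(score.var(), n / lam)  # both approximately I(lam) = n / lam
```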
1.4 The Cramér-Rao Lower Bound
Theorem 2 (Cramér-Rao) If (RC1) and (RC2) hold, then the covariance matrix of any unbiased estimator $\hat\theta$ of $\theta$ satisfies
$$V\big(\hat\theta\big) \ge I(\theta)^{-1}$$
in the sense that $V\big(\hat\theta\big) - I(\theta)^{-1}$ is a positive semi-definite matrix.
Proof. Recall that any estimator $\hat\theta = \hat\theta(X)$ does not depend on $\theta$. Since $\hat\theta$ is unbiased, $E\big(\hat\theta\big) = \theta$, or equivalently
$$\theta = \int_S \hat\theta(X)\,f(X;\theta)\,dX.$$
Differentiating w.r.t. $\theta'$ we obtain
$$\begin{aligned}
I_k &= \frac{\partial}{\partial\theta'}\int_S \hat\theta(X)\,f(X;\theta)\,dX \\
&= \int_S \hat\theta(X)\,\frac{\partial f(X;\theta)}{\partial\theta'}\,dX \quad\text{[by (RC1)]} \\
&= \int_S \hat\theta(X)\,s(X;\theta)'\,f(X;\theta)\,dX \quad\text{[by (18)]} \\
&= E\big(\hat\theta s'\big) = E\big[\big(\hat\theta-\theta\big)s'\big]
\end{aligned} \tag{22}$$
since $E(\theta s') = \theta\,E(s') = 0$ by (15).
We now assume that $k = 1$ ($\theta$ is a scalar) for simplicity. Then both $s$ and $I(\theta)$ are scalars, and (22) reduces to
$$1 = E\big[\big(\hat\theta-\theta\big)s\big] = \mathrm{cov}\big(\hat\theta-\theta,\,s\big) \le \big[\mathrm{var}\big(\hat\theta-\theta\big)\big]^{1/2}\big[\mathrm{var}(s)\big]^{1/2} = \big[\mathrm{var}\big(\hat\theta\big)\big]^{1/2}\big[I(\theta)\big]^{1/2}$$
by the Cauchy-Schwarz inequality (see Lemma B in the Probability lecture notes). This shows the result for scalar $\theta$. The key property that delivers the lower bound is the CS inequality. Note that in this case the ordering is complete, i.e. $\mathrm{var}\big(\hat\theta\big) \ge I(\theta)^{-1}$ holds as a true inequality between numbers.
Theorem 2 implies that the MSE of any unbiased estimator $\hat\theta$ of $\theta$ cannot be smaller than $I(\theta)^{-1}$, i.e. the best possible unbiased estimator is one whose covariance matrix achieves the Cramér-Rao lower bound. Such estimators are called efficient.
Definition 8 An unbiased estimator $\hat\theta$ of $\theta$ is called efficient if
$$\mathrm{var}\big(\hat\theta\big) = I(\theta)^{-1}.$$
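As a quick illustration of Definition 8 (a worked computation for the Poisson model of Example 2; the algebra is ours, though standard): there,
$$\frac{\partial^2 l}{\partial\lambda^2} = -\frac{\sum_{i=1}^n X_i}{\lambda^2}, \qquad I(\lambda) = -E\Big[\frac{\partial^2 l}{\partial\lambda^2}\Big] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}, \qquad \mathrm{var}\big(\bar X_n\big) = \frac{\lambda}{n} = I(\lambda)^{-1},$$
so the unbiased MLE $\hat\lambda_n = \bar X_n$ attains the Cramér-Rao lower bound and is therefore efficient.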
2 Confidence Intervals
There are situations where, rather than requiring an approximation of the value of an unknown parameter, we are more interested in constructing (random) bounds that contain the unknown parameter with a given probability of containment. For example, a light bulb manufacturer advertises that their light bulbs last longer than a year. They are interested in producing light bulbs with (unknown) true mean life between one year (so that they don't lie in the advertisement) and a year and one month (so that they sell more light bulbs). How can they be reasonably certain that their production line meets the above requirement? One approach is to construct a random interval that contains the mean with a satisfactory (say 95%) probability.
We concentrate on the situation of Example 3 where the data $X_1,\ldots,X_n$ are i.i.d. $N(\mu,\sigma^2)$ random variables. When the variance $\sigma^2$ is known, this is an easy problem: we know from (47) in Example 34 that
$$\bar X_n \sim N\Big(\mu,\frac{\sigma^2}{n}\Big), \quad\text{i.e.}\quad Z := \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{\sigma} \sim N(0,1).$$
Therefore, denoting by $\Phi(x)$ the $N(0,1)$ distribution function,
$$P(|Z| \le x) = P\Big(-x \le \frac{\sqrt n\,\big(\mu-\bar X_n\big)}{\sigma} \le x\Big) = \Phi(x) - \Phi(-x) = \Phi(x) - [1-\Phi(x)] = 2\Phi(x) - 1.$$
Solving the above inequality for $\mu$ we obtain, for any $x > 0$,
$$2\Phi(x) - 1 = P\Big(\bar X_n - \frac{\sigma x}{\sqrt n} \le \mu \le \bar X_n + \frac{\sigma x}{\sqrt n}\Big). \tag{23}$$
Recall that our problem is to address questions such as "find an interval that contains the mean with probability 95%". In order to complete the interval estimation procedure we need to associate the number $x$ with the required probability of containment. Note that the number $\Phi^{-1}(1-y)$ is well defined if and only if $y\in(0,1)$ (since the range of $\Phi$ is $(0,1)$, $\Phi^{-1}(x)$ is only defined for $x\in(0,1)$), and positive if and only if $y\in(0,1/2)$. Since (23) is valid for any $x > 0$, we may set $x = \Phi^{-1}\big(1-\frac{\alpha}{2}\big)$ for any $\alpha\in(0,1)$ in (23): since
$$2\Phi\Big(\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big)\Big) - 1 = 2\Big(1-\frac{\alpha}{2}\Big) - 1 = 1-\alpha,$$
(23) becomes
$$1-\alpha = P\Big(\bar X_n - \frac{\sigma}{\sqrt n}\,\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big) \le \mu \le \bar X_n + \frac{\sigma}{\sqrt n}\,\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big)\Big) = P\Big(\bar X_n - \frac{\sigma}{\sqrt n}\,z_{\alpha/2} \le \mu \le \bar X_n + \frac{\sigma}{\sqrt n}\,z_{\alpha/2}\Big) \tag{24}$$
by Question 7, Exercise 5, where $z_\alpha$ denotes the upper $\alpha$-quantile of the $N(0,1)$ distribution, satisfying
$$P(Z > z_\alpha) = \alpha. \tag{25}$$
This yields the required confidence interval. Since the probability of inclusion is $1-\alpha$, (24) is said to define a $100(1-\alpha)\%$ confidence interval. For example, if we wish to construct a 95% confidence interval, we choose $\alpha = 0.05$ (so that $1-\alpha = 0.95$), in which case the value of $z_{\alpha/2}$ can be found from statistical tables: $z_{0.025} = 1.96$, and (24) yields
$$P\Big(\bar X_n - \frac{1.96\,\sigma}{\sqrt n} \le \mu \le \bar X_n + \frac{1.96\,\sigma}{\sqrt n}\Big) = 0.95.$$
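In practice this interval is computed directly from the formula. A minimal sketch (assuming Python with numpy/scipy; the data-generating values are ours):

```python
import numpy as np
from scipy.stats import norm

# Sketch: the known-sigma 95% confidence interval for mu from (24),
# using z_{alpha/2} = Phi^{-1}(1 - alpha/2).
rng = np.random.default_rng(5)
mu_true, sigma, n, alpha = 10.0, 2.0, 25, 0.05
X = rng.normal(mu_true, sigma, size=n)

z = norm.ppf(1 - alpha / 2)          # 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(n)
print(X.mean() - half_width, X.mean() + half_width)
```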
This confidence interval is useful only when $\sigma$ is known. The situation is not so easy when $\sigma$ is unknown. We need the following fundamental theorem.
Theorem 3 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables and let
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{and}\quad s^2_n = \frac{1}{n-1}\sum_{i=1}^n \big(X_i-\bar X_n\big)^2.$$
Then $\bar X_n$ and $s^2_n$ are independent random variables and
$$\frac{(n-1)\,s^2_n}{\sigma^2} \sim \chi^2(n-1).$$
Corollary 1 Under the assumptions of Theorem 3, the statistic
$$T_n := \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{s_n} \sim t(n-1),$$
where $t(n-1)$ denotes a $t$-distribution with parameter $n-1$.
The easiest and most intuitive way to prove Theorem 3 is by applying an orthogonal transformation. See the Appendix at the end if interested.
Proof of Corollary 1. We know by Example 34 that $\bar X_n \sim N(\mu,\sigma^2/n)$, i.e. $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma \sim N(0,1)$. By Theorem 3,
$$\frac{(n-1)\,s^2_n}{\sigma^2} \sim \chi^2(n-1),$$
independently of $\bar X_n$ (and hence of $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma$). Therefore, Example 29 implies that the ratio of $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma$ over $\sqrt{\frac{(n-1)s^2_n}{\sigma^2}\big/(n-1)}$ follows a $t(n-1)$ distribution:
$$t(n-1) \sim \frac{\sqrt n\,\big(\bar X_n-\mu\big)/\sigma}{\sqrt{\frac{(n-1)s^2_n}{\sigma^2}\big/(n-1)}} = \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{s_n} = T_n.$$
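The distributional claim of Corollary 1 is easy to check by simulation (a sketch assuming numpy/scipy; the values of $\mu$, $\sigma$ and $n$ are arbitrary choices of ours):

```python
import numpy as np
from scipy.stats import t

# Sketch: the studentised statistic T_n from Corollary 1 should follow a
# t(n-1) distribution regardless of mu and sigma; compare empirical quantiles
# of simulated T_n with the exact t(n-1) quantiles.
rng = np.random.default_rng(6)
mu, sigma, n, reps = 3.0, 5.0, 8, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
Tn = np.sqrt(n) * (X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)

for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(Tn, q), t.ppf(q, df=n - 1))
```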
Corollary 2 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables with unknown mean $\mu$ and variance $\sigma^2$. Given any $\alpha\in(0,1)$, a confidence interval for $\mu$ can be constructed as follows:
$$1-\alpha = P\Big(\bar X_n - \frac{s_n}{\sqrt n}\,t_{\alpha/2} \le \mu \le \bar X_n + \frac{s_n}{\sqrt n}\,t_{\alpha/2}\Big), \tag{26}$$
where $t_{\alpha/2} = F^{-1}_{t(n-1)}\big(1-\frac{\alpha}{2}\big)$ and $F_{t(n-1)}(\cdot)$ is the distribution function of a $t(n-1)$ random variable. Equivalently, $t_\alpha$ satisfies
$$P(T > t_\alpha) = \alpha, \qquad T \sim t(n-1).$$
A final reminder of how confidence intervals should be understood, based on the beautiful exposition of Silvey (1975, Statistical Inference). The confidence interval in (26) should be read as: whatever the true value of $\mu$ may be, the probability that the random interval
$$\Big(\bar X_n - \frac{s_n}{\sqrt n}\,t_{\alpha/2},\ \bar X_n + \frac{s_n}{\sqrt n}\,t_{\alpha/2}\Big)$$
contains this true value is $1-\alpha$.
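As a final sketch (assuming numpy/scipy; the data are simulated for illustration), the interval (26) is computed exactly as in the known-$\sigma$ case, with $s_n$ replacing $\sigma$ and the $t(n-1)$ critical value replacing the normal one; over many repeated samples it covers the true $\mu$ a fraction $1-\alpha$ of the time, in line with Silvey's reading above.

```python
import numpy as np
from scipy.stats import t

# Sketch: the unknown-sigma confidence interval (26) and its coverage
# frequency over repeated samples.
rng = np.random.default_rng(7)
mu_true, sigma, n, alpha, reps = 10.0, 2.0, 25, 0.05, 20_000

t_crit = t.ppf(1 - alpha / 2, df=n - 1)
X = rng.normal(mu_true, sigma, size=(reps, n))
half = t_crit * X.std(axis=1, ddof=1) / np.sqrt(n)
lower, upper = X.mean(axis=1) - half, X.mean(axis=1) + half

coverage = np.mean((lower <= mu_true) & (mu_true <= upper))
print(coverage)  # approximately 1 - alpha = 0.95
```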
3 Hypothesis Testing
3.1 Introduction and Basic Definitions
We now want to consider a problem somewhat different to that of point estimation of parameters: that of testing hypotheses about those parameters. We will continue to assume that we have a statistical model specified as a joint density, or, equivalently