Statistical Theory 2
Basic Statistical Inference
Tassos Magdalinos
University of Southampton
November 2010
1 Point Estimation
1.1 MSE and Unbiased Estimation
Consider an estimation problem: we know that a random phenomenon has a certain behaviour given by a density $f_X(x;\theta)$ which depends on an unknown parameter $\theta$; e.g., we know that it follows an $N(0,\sigma^2)$ distribution, but we do not know $\sigma^2$. The idea of estimation is to use data (whose mathematical representation is a collection of random variables) in order to estimate $\theta$, i.e., to choose a random variable $\hat\theta = g(X_1,\ldots,X_n)$ based on a sample of $n$ observations from $f(x;\theta)$ in order to make a good guess for $\theta$.
Here is some jargon:
Definition 1 Given an unknown parameter $\theta$, the parameter space $\Theta$ is the set containing all possible values of $\theta$. Typically, $\Theta \subseteq \mathbb{R}^k$.
For example, if the data is generated by a $N(\mu,1)$ distribution with parameter $\mu$, then $\theta = \mu$ and $\Theta = \mathbb{R}$. If the data is generated by a $N(\mu,\sigma^2)$ distribution, then $\theta = (\mu,\sigma^2)'$ and $\Theta = \mathbb{R}\times(0,\infty)$.
Definition 2 A statistic based on a sample of observations $(X_1,\ldots,X_n)$ is any function $g(X_1,\ldots,X_n)$ of the random variables $X_1,\ldots,X_n$. If the observations come from a family of distributions $f(X_1,\ldots,X_n;\theta)$ indexed by an unknown parameter $\theta$, an estimator $\hat\theta$ of $\theta$ is a statistic used to approximate $\theta$. If $\theta$ is a vector of unknown parameters, $\hat\theta$ is a vector of estimators of the same order as $\theta$. We sometimes write an estimator as $\hat\theta_n$ in order to emphasize its dependence on the sample size $n$.
Note that, by definition, an estimator $\hat\theta_n$ is a random variable (or vector, according to the dimension of $\theta$) that depends on the random variables $X_1,\ldots,X_n$ only: $\hat\theta_n$ does not depend on $\theta$.
How do we know that a certain choice of estimator $\hat\theta_n$ is good for estimating $\theta$? One way to look at it is to require that $\hat\theta_n$ be close to $\theta$ with respect to some distance (norm). Provided that the estimator $\hat\theta_n$ has finite variance, an appropriate distance to choose is the mean squared error.
Definition 3 The mean squared error (MSE) of a statistic $\hat\theta_n$ is defined as
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) := E\big[(\hat\theta_n-\theta)^2\big].$$
We prefer an estimator $\hat\theta_n$ to a competing estimator $\tilde\theta_n$ iff
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) \le \mathrm{MSE}\big(\tilde\theta_n,\theta\big) \quad \text{for all } \theta\in\Theta.$$
Exercise 1 Show that
$$\mathrm{MSE}\big(\hat\theta_n,\theta\big) = \big[E(\hat\theta_n)-\theta\big]^2 + \mathrm{var}\big(\hat\theta_n\big). \tag{1}$$
Ideally, we would like to minimise $\mathrm{MSE}(\hat\theta_n,\theta)$ w.r.t. $\hat\theta_n$ in order to find an optimal estimator. However, joint minimisation of the RHS of (1) is not possible in general. Things get a lot simpler if $\hat\theta_n$ has the additional property of unbiasedness.
Definition 4 An estimator $\hat\theta_n$ of $\theta$ is called unbiased if $E(\hat\theta_n) = \theta$.
If $\hat\theta_n$ is unbiased, (1) implies that $\mathrm{MSE}(\hat\theta_n,\theta) = \mathrm{var}(\hat\theta_n)$. The MSE minimisation criterion then implies that, between two unbiased estimators, we choose the one with the smallest variance.
However, one needs to bear in mind that unbiasedness is a compromise: there exist examples of a biased estimator that has smaller MSE than an unbiased estimator. See Exercise 5, Question 6.
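A rough numerical illustration of this point (a minimal Monte Carlo sketch, assuming Python with numpy is available; the setup is ours, not from the notes) compares the variance estimator with divisor $n-1$ against the biased one with divisor $n$ for normal data, and checks the decomposition (1):

```python
import numpy as np

# Monte Carlo sketch: compare the unbiased variance estimator (divisor n-1)
# with the biased one (divisor n) for i.i.d. N(mu, sigma^2) data, and verify
# the decomposition MSE = bias^2 + variance from equation (1).
rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
estimators = {
    "unbiased (n-1)": X.var(axis=1, ddof=1),
    "biased MLE (n)": X.var(axis=1, ddof=0),
}
for name, est in estimators.items():
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"{name}: bias={bias:+.4f}  MSE={mse:.4f}  "
          f"bias^2+var={bias ** 2 + est.var():.4f}")
```

For small $n$ the biased estimator typically shows the smaller MSE, consistently with the remark above.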
Finally, here is another desirable property of estimators in large samples.
Definition 5 An estimator $\hat\theta_n$ of $\theta$ is called consistent if $\hat\theta_n \to_p \theta$ as the sample size $n$ tends to $\infty$.
1.2 The Method of Maximum Likelihood
Suppose that $X_1, X_2,\ldots,X_n$ are random variables from a discrete family of distributions $\{f(X_1,\ldots,X_n;\theta) : \theta\in\Theta\}$. The probability that $X_1=x_1,\ldots,X_n=x_n$ is the joint mass function:
$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta) = P_\theta(X_1=x_1,\ldots,X_n=x_n).$$
We might ask what value of $\theta$ maximises the probability of obtaining this particular observed sample $x_1,\ldots,x_n$, i.e., what value of $\theta$ maximises
$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n;\theta).$$
This maximising value of $\theta$ would provide a good estimator of $\theta$, because it would assign the largest probability to this particular sample occurring. Put differently, the maximising value of $\theta$ picks the mass function most likely to have generated $x_1,\ldots,x_n$.
In general, $X = (X_1,\ldots,X_n)'$ is a random vector from a family of densities (or mass functions) $\{f(X;\theta) : \theta\in\Theta\}$.
Definition 6 The likelihood function $L(\theta)$ is defined as the density (or mass) of the data $X=(X_1,\ldots,X_n)'$, but regarded as a function of $\theta$ rather than $X$:
$$L(\theta) = f(X;\theta).$$
The maximum likelihood estimator (MLE) $\hat\theta_n$ of $\theta$ is defined as the maximiser of the likelihood function $L(\theta)$ over the parameter space $\Theta$, i.e.,
$$\hat\theta_n = \operatorname*{arg\,max}_{\theta\in\Theta} L(\theta) \quad\text{or}\quad L(\hat\theta_n) = \max_{\theta\in\Theta} L(\theta). \tag{2}$$
Exercise 2 Let $h(\cdot)$ be a positive and differentiable function. Show that $h(x)$ is maximised at the same point as $\log h(x)$.
In view of the above result we define the log-likelihood function
$$l(\theta) := \log L(\theta),$$
and the MLE of $\theta$ can be found as the maximiser of $l(\theta)$:
$$\hat\theta_n = \operatorname*{arg\,max}_{\theta\in\Theta} l(\theta) \quad\text{or}\quad l(\hat\theta_n) = \max_{\theta\in\Theta} l(\theta). \tag{3}$$
There are no statistics involved in passing from (2) to (3): it is just a matter of algebraic convenience to use (3).
Example 1 (Independent sampling) When $X=(X_1,\ldots,X_n)'$ consists of independent random variables $X_1,\ldots,X_n$,
$$f(X_1,\ldots,X_n) = f(X_1)\cdots f(X_n). \tag{4}$$
Therefore,
$$L(\theta) = \prod_{i=1}^n f(X_i;\theta) \quad\text{and}\quad l(\theta) = \sum_{i=1}^n \log f(X_i;\theta). \tag{5}$$
Example 2 Let $X_1,\ldots,X_n$ be i.i.d. Poisson($\lambda$) r.v.s with mass $f(X_i;\lambda) = e^{-\lambda}\lambda^{X_i}/X_i!$ for $X_i\in\{0,1,2,\ldots\}$. By (5) the likelihood function is given by
$$L(\lambda) = \prod_{i=1}^n f(X_i;\lambda) = \prod_{i=1}^n e^{-\lambda}\,\frac{\lambda^{X_i}}{X_i!} = e^{-n\lambda}\,\frac{\lambda^{X_1}\cdots\lambda^{X_n}}{\prod_{i=1}^n X_i!} = e^{-n\lambda}\,\frac{\lambda^{X_1+\cdots+X_n}}{\prod_{i=1}^n X_i!}.$$
Taking logs:
$$l(\lambda) = \log L(\lambda) = -n\lambda + \sum_{i=1}^n X_i\log\lambda - \log\prod_{i=1}^n X_i!,$$
$$\frac{\partial l}{\partial\lambda} = -n + \frac{\sum_{i=1}^n X_i}{\lambda} = 0 \implies \lambda = \frac{1}{n}\sum_{i=1}^n X_i = \bar X_n.$$
So the MLE of $\lambda$ is $\hat\lambda_n = \bar X_n$. This is not surprising since $\lambda = E(X_1)$.
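A quick numerical check of this derivation (a sketch assuming Python with numpy and scipy; the simulated data and variable names are ours): maximising the Poisson log-likelihood numerically should reproduce the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: maximise the Poisson log-likelihood l(lam) = -n*lam + sum(X)*log(lam)
# numerically (the -log(prod X_i!) term is constant in lam, so it is dropped)
# and compare the maximiser with the sample mean derived above.
rng = np.random.default_rng(1)
lam_true, n = 3.5, 200
X = rng.poisson(lam_true, size=n)

neg_loglik = lambda lam: -(-n * lam + X.sum() * np.log(lam))
res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, X.mean())  # the two numbers agree up to optimiser tolerance
```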
Example 3 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables. Here, we have a vector of unknown parameters: $\theta = (\mu,\sigma^2)'$. By (5),
$$L(\mu,\sigma^2) = \prod_{i=1}^n f(X_i;\mu,\sigma^2) = \prod_{i=1}^n \big(2\pi\sigma^2\big)^{-1/2}\exp\Big\{-\frac{1}{2\sigma^2}(X_i-\mu)^2\Big\} = \big(2\pi\sigma^2\big)^{-n/2}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2\Big\},$$
so the log-likelihood is given by
$$l(\mu,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i-\mu)^2.$$
The MLE for $(\mu,\sigma^2)'$ will be found by jointly maximising $l(\mu,\sigma^2)$ over $\mu$ and $\sigma^2$, i.e. it will be the solution of the two-equation system $\partial l/\partial\mu = 0$ and $\partial l/\partial\sigma^2 = 0$. Since $\sigma^2 > 0$,
$$\frac{\partial l}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu) = \frac{1}{\sigma^2}\big(n\bar X_n - n\mu\big) = 0 \implies \mu = \bar X_n. \tag{6}$$
Differentiating the log-likelihood w.r.t. $\sigma^2$:
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (X_i-\mu)^2 = 0 \implies -n + \frac{1}{\sigma^2}\sum_{i=1}^n (X_i-\mu)^2 = 0 \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2. \tag{7}$$
Requiring (6) and (7) to hold simultaneously, we obtain the MLE $\big(\hat\mu_n,\hat\sigma^2_n\big)$ of $(\mu,\sigma^2)$ to be
$$\hat\mu_n = \bar X_n, \qquad \hat\sigma^2_n = \frac{1}{n}\sum_{i=1}^n \big(X_i-\bar X_n\big)^2.$$
Remark 1 By Exercise 6, Question 5, $\hat\sigma^2_n$ is biased. Thus, MLEs are not unbiased in general. On the other hand, Exercise 6, Question 5 shows that, despite being biased, the MLE $\hat\sigma^2_n$ has smaller MSE than the unbiased estimator $s^2_n$.
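The same numerical check works in the two-parameter case (again a sketch assuming numpy/scipy; the data-generating values are ours): a numerical maximiser of $l(\mu,\sigma^2)$ should land on the closed-form MLE just derived.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: minimise the negative N(mu, sigma^2) log-likelihood jointly over
# (mu, sigma^2) and compare with the closed form (X_bar, (1/n) sum (X_i - X_bar)^2).
rng = np.random.default_rng(2)
mu_true, sigma2_true, n = 1.0, 2.0, 500
X = rng.normal(mu_true, np.sqrt(sigma2_true), size=n)

def neg_loglik(theta):
    mu, sigma2 = theta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((X - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-8, None)])
print(res.x)                      # numerical MLE of (mu, sigma^2)
print(X.mean(), X.var(ddof=0))    # closed-form MLE
```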
Example 4 (Dependent sampling and conditional likelihood function) When $X_1,\ldots,X_n$ are not independent, (4) no longer holds, so (5) is not valid. A way to proceed with setting up the likelihood function is to use the definition of conditional density:
$$f(X_2|X_1) = \frac{f(X_2,X_1)}{f(X_1)} \implies f(X_2,X_1) = f(X_2|X_1)\,f(X_1). \tag{8}$$
Then
$$f(X_3,X_2,X_1) = f(X_3|X_2,X_1)\,f(X_2,X_1) = f(X_3|X_2,X_1)\,f(X_2|X_1)\,f(X_1).$$
Proceeding like this we obtain the joint density of $(X_n,\ldots,X_1)$:
$$f(X_n,X_{n-1},\ldots,X_1) = f(X_n|X_{n-1},\ldots,X_1)\,f(X_{n-1}|X_{n-2},\ldots,X_1)\cdots f(X_2|X_1)\,f(X_1).$$
The above factorisation yields the following expressions for the likelihood and log-likelihood functions:
$$L(\theta) = f(X_1;\theta)\prod_{i=2}^n f(X_i|X_{i-1},\ldots,X_1;\theta), \tag{9}$$
$$l(\theta) = \log f(X_1;\theta) + \sum_{i=2}^n \log f(X_i|X_{i-1},\ldots,X_1;\theta). \tag{10}$$
Example 5 (Sampling from a Gaussian AR(1)) A Gaussian AR(1) process is defined recursively as
$$X_t = \rho X_{t-1} + u_t, \qquad t\in\{2,\ldots,n\},$$
where $X_1$ is a constant, $\rho\in(-1,1)$ and $u_2,\ldots,u_n$ are i.i.d. $N(0,\sigma^2)$ random variables. Here there are two unknown parameters: $\theta = (\rho,\sigma^2)'$. Since $X_2 = \rho X_1 + u_2$, $X_1$ is a constant and $u_2 \sim N(0,\sigma^2)$, we have $X_2|X_1 \sim N(\rho X_1,\sigma^2)$. Similarly,
$$X_t = \rho X_{t-1} + u_t, \quad u_t \sim N(0,\sigma^2) \implies X_t|X_{t-1} \sim N(\rho X_{t-1},\sigma^2),$$
so
$$f(X_t|X_{t-1},\ldots,X_1;\rho,\sigma^2) = \big(2\pi\sigma^2\big)^{-1/2}\exp\Big\{-\frac{1}{2\sigma^2}(X_t-\rho X_{t-1})^2\Big\}. \tag{11}$$
Noting that $\log f(X_1) = 0$ on the support of $f(X_1,\ldots,X_n)$ (why?), we can substitute (11) into (10) to obtain
$$l(\rho,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^n (X_t-\rho X_{t-1})^2.$$
Differentiating $l(\rho,\sigma^2)$ w.r.t. $\rho$:
$$\frac{\partial l}{\partial\rho} = \frac{1}{\sigma^2}\sum_{t=2}^n (X_t-\rho X_{t-1})X_{t-1} = 0 \implies \sum_{t=2}^n X_t X_{t-1} - \rho\sum_{t=2}^n X_{t-1}^2 = 0 \implies \hat\rho_n = \frac{\sum_{t=2}^n X_t X_{t-1}}{\sum_{t=2}^n X_{t-1}^2}. \tag{12}$$
$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{t=2}^n (X_t-\rho X_{t-1})^2 = 0 \implies \sigma^2 = \frac{1}{n}\sum_{t=2}^n (X_t-\rho X_{t-1})^2.$$
Combining this with (12) we obtain
$$\hat\sigma^2_n = \frac{1}{n}\sum_{t=2}^n \big(X_t - \hat\rho_n X_{t-1}\big)^2.$$
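These closed forms are easy to compute directly. The following sketch (assuming numpy; the simulation setup and parameter values are ours) simulates a Gaussian AR(1) path and evaluates (12) and the expression for $\hat\sigma^2_n$:

```python
import numpy as np

# Sketch: simulate a Gaussian AR(1) with constant X_1 and compute the MLEs.
rng = np.random.default_rng(3)
rho_true, sigma2_true, n = 0.6, 1.0, 1_000

X = np.empty(n)
X[0] = 0.5                                   # X_1 is a constant
for t in range(1, n):
    X[t] = rho_true * X[t - 1] + rng.normal(0.0, np.sqrt(sigma2_true))

rho_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)     # equation (12)
sigma2_hat = np.sum((X[1:] - rho_hat * X[:-1]) ** 2) / n   # (1/n) sum_{t=2}^n (...)^2
print(rho_hat, sigma2_hat)
```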
1.3 Score Vector and Information Matrix
Consider a random vector $X=(X_1,\ldots,X_n)'$ with density $f(X;\theta) = L(\theta)$ depending on a parameter $\theta\in\mathbb{R}^k$, and let $l(\theta) = \log L(\theta) = \log f(X;\theta)$. The $k\times 1$ random vector
$$s(X;\theta) = \frac{\partial l}{\partial\theta}$$
is called the score vector. When $l(\theta)$ is differentiable, the MLE $\hat\theta_n$ satisfies
$$s\big(X;\hat\theta_n\big) = 0. \tag{13}$$
In this section, we will determine the mean and covariance matrix of the score vector $s = s(X;\theta)$. In what follows, we assume that $X_1,\ldots,X_n$ are independent and that the following two regularity conditions are satisfied:

(RC1) The support of $f$, $S = \{X\in\mathbb{R}^n : f(X;\theta) > 0\}$, is independent of $\theta$.

(RC2) $f(X_1;\theta)$ is twice differentiable with $E\big[\partial^2\log f(X_1;\theta)/\partial\theta\,\partial\theta'\big] < 0$.

When $\theta$ is an $\mathbb{R}^k$ vector, (RC2) means that the $k\times k$ matrix $E\big[\partial^2\log f(X_1;\theta)/\partial\theta\,\partial\theta'\big]$ is negative definite.
Condition (RC1) is satisfied for most densities $f(\cdot;\theta)$; e.g., if $X\sim\mathrm{Exp}(\theta)$, then
$$f(X;\theta) = \begin{cases} \theta e^{-\theta X}, & X > 0, \\ 0, & X \le 0, \end{cases}$$
and $S = (0,\infty)$ does not depend on $\theta$. The same holds for all distributions that we have encountered apart from the uniform distribution on $[0,\theta]$:
$$f(X;\theta) = \begin{cases} 1/\theta, & X \in [0,\theta], \\ 0, & X \notin [0,\theta], \end{cases}$$
so in this case $S = [0,\theta]$ depends on $\theta$. This is why the uniform distribution on $[0,\theta]$ is a source of counterexamples in likelihood theory.
The importance of (RC1) lies in that it permits differentiation under the integral sign, i.e., it ensures that
$$\frac{\partial}{\partial\theta}\int_S f(X;\theta)\,dX = \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,dX, \tag{14}$$
where $S$ and $f$ are defined in (RC1). This follows from the so-called Leibniz rule [see Spiegel, p. 163, eq. (17)].
Exercise 3 Show that (14) is satisfied when $X\sim\mathrm{Exp}(\theta)$ but not when $X\sim U[0,\theta]$.
Theorem 1 Under the regularity condition (RC1),
$$E(s) = 0, \qquad E(ss') = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big]. \tag{15}$$
Proof. Since $L(\theta) = f(X;\theta)$ is a density,
$$\int_S f(X;\theta)\,dX = 1. \tag{16}$$
Note that $\int_S f(X;\theta)\,dX$ means
$$\int\cdots\int_{\{X\in\mathbb{R}^n:\,f(X_1,\ldots,X_n;\theta)>0\}} f(X_1,\ldots,X_n;\theta)\,dX_1\cdots dX_n.$$
Differentiating (16) w.r.t. $\theta$ and using (14), we obtain
$$0 = \frac{\partial}{\partial\theta}\int_S f(X;\theta)\,dX = \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,dX. \tag{17}$$
The relationship between the derivative of the likelihood function and the score vector is given by the following identity:
$$s(X;\theta) = \frac{\partial\log f(X;\theta)}{\partial\theta} = \frac{1}{f(X;\theta)}\,\frac{\partial f(X;\theta)}{\partial\theta} \implies \frac{\partial f(X;\theta)}{\partial\theta} = s(X;\theta)\,f(X;\theta). \tag{18}$$
Combining (17) and (18) we obtain
$$0 = \int_S s(X;\theta)\,f(X;\theta)\,dX = E[s(X;\theta)],$$
showing the first part of (15). For the second part, differentiating both sides of
$$0 = E(s') = \int_S \frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\,dX$$
and using (14) and the rule for differentiating a product, we obtain
$$\begin{aligned}
0 &= \frac{\partial}{\partial\theta}\int_S \frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\,dX \\
&= \int_S \frac{\partial}{\partial\theta}\Big[\frac{\partial\log f(X;\theta)}{\partial\theta'}\,f(X;\theta)\Big]\,dX \quad\text{[by (14)]} \\
&= \int_S \frac{\partial^2\log f(X;\theta)}{\partial\theta\,\partial\theta'}\,f(X;\theta)\,dX + \int_S \frac{\partial f(X;\theta)}{\partial\theta}\,\frac{\partial\log f(X;\theta)}{\partial\theta'}\,dX \\
&= E\Big[\frac{\partial^2\log f(X;\theta)}{\partial\theta\,\partial\theta'}\Big] + \int_S s(X;\theta)\,f(X;\theta)\,\frac{\partial\log f(X;\theta)}{\partial\theta'}\,dX \quad\text{[by (18)]} \\
&= E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] + \int_S s(X;\theta)\,s(X;\theta)'\,f(X;\theta)\,dX \\
&= E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] + E\big[s(X;\theta)\,s(X;\theta)'\big],
\end{aligned}$$
and the proof follows since this last sum equals zero, i.e. $E(ss') = -E\big[\partial^2 l/\partial\theta\,\partial\theta'\big]$.
Note that Theorem 1 is not restricted to independent or identically distributed samples. The above proof depends only on the validity of the Leibniz rule for differentiation under the integral sign, which is ensured by (RC1).
Definition 7 The $k\times k$ matrix
$$I(\theta) = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] \tag{19}$$
is called (Fisher's) information matrix.
Remarks.
1. Under (RC1), the information matrix $I(\theta)$ coincides with the covariance matrix of the score vector $s(X;\theta)$.
2. If $X_1,\ldots,X_n$ are i.i.d. random variables, then (5) implies that
$$I(\theta) = -E\Big[\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\Big] = -E\Big[\frac{\partial^2}{\partial\theta\,\partial\theta'}\sum_{i=1}^n \log f(X_i;\theta)\Big] = -\sum_{i=1}^n E\Big[\frac{\partial^2\log f(X_i;\theta)}{\partial\theta\,\partial\theta'}\Big] = -nE\Big[\frac{\partial^2\log f(X_1;\theta)}{\partial\theta\,\partial\theta'}\Big] \tag{20}$$
since the random variables $Z_1,\ldots,Z_n$ defined by
$$Z_i = \frac{\partial^2\log f(X_i;\theta)}{\partial\theta\,\partial\theta'}, \quad i\in\{1,\ldots,n\},$$
are identically distributed, so they all have the same expectation. Under assumption (RC2), (20) has two important consequences and one not-so-important one:
(i) $I(\theta)$ is a positive definite matrix.
(ii) $I(\theta)$ increases to $\infty$ at a rate equal to the sample size $n$ as $n\to\infty$:
$$\lim_{n\to\infty}\frac{1}{n}\,I(\theta) = -E\Big[\frac{\partial^2\log f(X_1;\theta)}{\partial\theta\,\partial\theta'}\Big], \tag{21}$$
the limit matrix being finite and positive definite. While our derivation of (20) is confined to the case where $X_1,\ldots,X_n$ are i.i.d. random variables, the limit behaviour of the information matrix is given by (21) in much more general situations: it applies even under dependent sampling schemes, the main requirement being ergodicity.
(iii) The expression on the RHS of (20) is equal to $n$ times the information contained in the first observation $X_1$. Thus, the information contained in $n$ sample points of an i.i.d. sample is equal to $n$ times the information contained in the first observation.
3. In the next section we will see that, for any unbiased estimator $\hat\theta$, $\mathrm{var}(\hat\theta) \ge I(\theta)^{-1}$. If $\hat\theta$ is unbiased, it is centered around $\theta$, so the more information about $\theta$ an observation provides, the smaller the variance of $\hat\theta$ should be: hence the name "information matrix" for (19).
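Both conclusions of Theorem 1 can be checked by simulation. For the Poisson model of Example 2 the score is $s(X;\lambda) = -n + \sum_{i=1}^n X_i/\lambda$, so $E(s)=0$ and $\mathrm{var}(s) = I(\lambda) = n/\lambda$. A minimal sketch (assuming numpy; the parameter values are ours):

```python
import numpy as np

# Sketch: for i.i.d. Poisson(lam) samples, the score is s = -n + sum(X)/lam.
# Check the information identity (15)/(19): E(s) = 0 and var(s) = I(lam) = n/lam.
rng = np.random.default_rng(4)
lam, n, reps = 2.0, 50, 200_000

X = rng.poisson(lam, size=(reps, n))
score = -n + X.sum(axis=1) / lam

print(score.mean())          # approximately 0
print(score.var(), n / lam)  # both approximately I(lam) = n / lam
```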
1.4 The Cramér-Rao Lower Bound
Theorem 2 (Cramér-Rao) If (RC1) and (RC2) hold, then the covariance matrix of any unbiased estimator $\hat\theta$ of $\theta$ satisfies
$$V\big(\hat\theta\big) \ge I(\theta)^{-1}$$
in the sense that $V\big(\hat\theta\big) - I(\theta)^{-1}$ is a positive semi-definite matrix.
Proof. Recall that any estimator $\hat\theta = \hat\theta(X)$ does not depend on $\theta$. Since $\hat\theta$ is unbiased, $E\big(\hat\theta\big) = \theta$, or equivalently
$$\theta = \int_S \hat\theta(X)\,f(X;\theta)\,dX.$$
Differentiating w.r.t. $\theta'$ we obtain
$$\begin{aligned}
I_k &= \frac{\partial}{\partial\theta'}\int_S \hat\theta(X)\,f(X;\theta)\,dX \\
&= \int_S \hat\theta(X)\,\frac{\partial f(X;\theta)}{\partial\theta'}\,dX \quad\text{[by (RC1)]} \\
&= \int_S \hat\theta(X)\,s(X;\theta)'\,f(X;\theta)\,dX \quad\text{[by (18)]} \\
&= E\big(\hat\theta s'\big) = E\big[\big(\hat\theta-\theta\big)s'\big]
\end{aligned} \tag{22}$$
since $E(\theta s') = \theta\,E(s') = 0$ by (15).
We now assume that $k = 1$ ($\theta$ is a scalar) for simplicity. Then both $s$ and $I(\theta)$ are scalars, and (22) reduces to
$$1 = E\big[\big(\hat\theta-\theta\big)s\big] = \mathrm{cov}\big(\hat\theta-\theta,\,s\big) \le \big[\mathrm{var}\big(\hat\theta-\theta\big)\big]^{1/2}\big[\mathrm{var}(s)\big]^{1/2} = \big[\mathrm{var}\big(\hat\theta\big)\big]^{1/2}\big[I(\theta)\big]^{1/2}$$
by the Cauchy-Schwarz inequality (see Lemma B in the Probability lecture notes). This shows the result for scalar $\theta$. The key property that delivers the lower bound is the CS inequality. Note that in this case the ordering is complete, i.e. $\mathrm{var}\big(\hat\theta\big) \ge I(\theta)^{-1}$ holds as a true inequality between numbers.
Theorem 2 implies that the MSE of any unbiased estimator $\hat\theta$ of $\theta$ cannot be smaller than $I(\theta)^{-1}$, i.e. the best possible unbiased estimator is one whose covariance matrix achieves the Cramér-Rao lower bound. Such estimators are called efficient.
Definition 8 An unbiased estimator $\hat\theta$ of $\theta$ is called efficient if
$$\mathrm{var}\big(\hat\theta\big) = I(\theta)^{-1}.$$
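As a quick illustration of Definition 8 (a worked computation for the Poisson model of Example 2; the algebra is ours, though standard): there,
$$\frac{\partial^2 l}{\partial\lambda^2} = -\frac{\sum_{i=1}^n X_i}{\lambda^2}, \qquad I(\lambda) = -E\Big[\frac{\partial^2 l}{\partial\lambda^2}\Big] = \frac{n\lambda}{\lambda^2} = \frac{n}{\lambda}, \qquad \mathrm{var}\big(\bar X_n\big) = \frac{\lambda}{n} = I(\lambda)^{-1},$$
so the unbiased MLE $\hat\lambda_n = \bar X_n$ attains the Cramér-Rao lower bound and is therefore efficient.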
2 Confidence Intervals
There are situations where, rather than requiring an approximation of the value of an unknown parameter, we are more interested in constructing (random) bounds that contain the unknown parameter with a given probability of containment. For example, a light bulb manufacturer advertises that their light bulbs last longer than a year. They are interested in producing light bulbs with (unknown) true mean life between one year (so that they don't lie in the advertisement) and a year and one month (so that they sell more light bulbs). How can they be reasonably certain that their production line meets the above requirement? One approach is to construct a random interval that contains the mean with a satisfactory (say 95%) probability.
We concentrate on the situation of Example 3 where the data $X_1,\ldots,X_n$ are i.i.d. $N(\mu,\sigma^2)$ random variables. When the variance $\sigma^2$ is known, this is an easy problem: we know from (47) in Example 34 that
$$\bar X_n \sim N\Big(\mu,\frac{\sigma^2}{n}\Big), \quad\text{i.e.}\quad Z := \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{\sigma} \sim N(0,1).$$
Therefore, denoting by $\Phi(x)$ the $N(0,1)$ distribution function,
$$P(|Z| \le x) = P\Big(-x \le \frac{\sqrt n\,\big(\mu-\bar X_n\big)}{\sigma} \le x\Big) = \Phi(x) - \Phi(-x) = \Phi(x) - [1-\Phi(x)] = 2\Phi(x) - 1.$$
Solving the above inequality for $\mu$ we obtain, for any $x > 0$,
$$2\Phi(x) - 1 = P\Big(\bar X_n - \frac{\sigma x}{\sqrt n} \le \mu \le \bar X_n + \frac{\sigma x}{\sqrt n}\Big). \tag{23}$$
Recall that our problem is to address questions such as "find an interval that contains the mean with probability 95%". In order to complete the interval estimation procedure we need to associate the number $x$ with the required probability of containment. Note that the number $\Phi^{-1}(1-y)$ is well defined if and only if $y\in(0,1)$ (since the range of $\Phi$ is $(0,1)$, $\Phi^{-1}(x)$ is only defined for $x\in(0,1)$), and positive if and only if $y\in(0,1/2)$. Since (23) is valid for any $x > 0$, we may set $x = \Phi^{-1}\big(1-\frac{\alpha}{2}\big)$ for any $\alpha\in(0,1)$ in (23): since
$$2\Phi\Big(\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big)\Big) - 1 = 2\Big(1-\frac{\alpha}{2}\Big) - 1 = 1-\alpha,$$
(23) becomes
$$1-\alpha = P\Big(\bar X_n - \frac{\sigma}{\sqrt n}\,\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big) \le \mu \le \bar X_n + \frac{\sigma}{\sqrt n}\,\Phi^{-1}\Big(1-\frac{\alpha}{2}\Big)\Big) = P\Big(\bar X_n - \frac{\sigma}{\sqrt n}\,z_{\alpha/2} \le \mu \le \bar X_n + \frac{\sigma}{\sqrt n}\,z_{\alpha/2}\Big) \tag{24}$$
by Question 7, Exercise 5, where $z_\alpha$ denotes the upper $\alpha$-quantile of the $N(0,1)$ distribution, satisfying
$$P(Z > z_\alpha) = \alpha. \tag{25}$$
This yields the required confidence interval. Since the probability of inclusion is $1-\alpha$, (24) is said to define a $100(1-\alpha)\%$ confidence interval. For example, if we wish to construct a 95% confidence interval, we choose $\alpha = 0.05$ (so that $1-\alpha = 0.95$), in which case the value of $z_{\alpha/2}$ can be found from statistical tables: $z_{0.025} = 1.96$, and (24) yields
$$P\Big(\bar X_n - \frac{1.96\,\sigma}{\sqrt n} \le \mu \le \bar X_n + \frac{1.96\,\sigma}{\sqrt n}\Big) = 0.95.$$
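In practice this interval is computed directly from the formula. A minimal sketch (assuming Python with numpy/scipy; the data-generating values are ours):

```python
import numpy as np
from scipy.stats import norm

# Sketch: the known-sigma 95% confidence interval for mu from (24),
# using z_{alpha/2} = Phi^{-1}(1 - alpha/2).
rng = np.random.default_rng(5)
mu_true, sigma, n, alpha = 10.0, 2.0, 25, 0.05
X = rng.normal(mu_true, sigma, size=n)

z = norm.ppf(1 - alpha / 2)          # 1.96 for alpha = 0.05
half_width = z * sigma / np.sqrt(n)
print(X.mean() - half_width, X.mean() + half_width)
```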
This confidence interval is useful only when $\sigma$ is known. The situation is not so easy when $\sigma$ is unknown. We need the following fundamental theorem.
Theorem 3 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables and let
$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i \quad\text{and}\quad s^2_n = \frac{1}{n-1}\sum_{i=1}^n \big(X_i-\bar X_n\big)^2.$$
Then $\bar X_n$ and $s^2_n$ are independent random variables and
$$\frac{(n-1)\,s^2_n}{\sigma^2} \sim \chi^2(n-1).$$
Corollary 1 Under the assumptions of Theorem 3, the statistic
$$T_n := \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{s_n} \sim t(n-1),$$
where $t(n-1)$ denotes a $t$-distribution with parameter $n-1$.
The easiest and most intuitive way to prove Theorem 3 is by applying an orthogonal transformation. See the Appendix at the end if interested.
Proof of Corollary 1. We know by Example 34 that $\bar X_n \sim N(\mu,\sigma^2/n)$, i.e. $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma \sim N(0,1)$. By Theorem 3,
$$\frac{(n-1)\,s^2_n}{\sigma^2} \sim \chi^2(n-1),$$
independently of $\bar X_n$ (and hence of $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma$). Therefore, Example 29 implies that the ratio of $\sqrt n\,\big(\bar X_n-\mu\big)/\sigma$ over $\sqrt{\frac{(n-1)s^2_n}{\sigma^2}\big/(n-1)}$ follows a $t(n-1)$ distribution:
$$t(n-1) \sim \frac{\sqrt n\,\big(\bar X_n-\mu\big)/\sigma}{\sqrt{\frac{(n-1)s^2_n}{\sigma^2}\big/(n-1)}} = \frac{\sqrt n\,\big(\bar X_n-\mu\big)}{s_n} = T_n.$$
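The distributional claim of Corollary 1 is easy to check by simulation (a sketch assuming numpy/scipy; the values of $\mu$, $\sigma$ and $n$ are arbitrary choices of ours):

```python
import numpy as np
from scipy.stats import t

# Sketch: the studentised statistic T_n from Corollary 1 should follow a
# t(n-1) distribution regardless of mu and sigma; compare empirical quantiles
# of simulated T_n with the exact t(n-1) quantiles.
rng = np.random.default_rng(6)
mu, sigma, n, reps = 3.0, 5.0, 8, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
Tn = np.sqrt(n) * (X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)

for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(Tn, q), t.ppf(q, df=n - 1))
```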
Corollary 2 Let $X_1,\ldots,X_n$ be i.i.d. $N(\mu,\sigma^2)$ random variables with unknown mean $\mu$ and variance $\sigma^2$. Given any $\alpha\in(0,1)$, a confidence interval for $\mu$ can be constructed as follows:
$$1-\alpha = P\Big(\bar X_n - \frac{s_n}{\sqrt n}\,t_{\alpha/2} \le \mu \le \bar X_n + \frac{s_n}{\sqrt n}\,t_{\alpha/2}\Big), \tag{26}$$
where $t_{\alpha/2} = F^{-1}_{t(n-1)}\big(1-\frac{\alpha}{2}\big)$ and $F_{t(n-1)}(\cdot)$ is the distribution function of a $t(n-1)$ random variable. Equivalently, $t_\alpha$ satisfies
$$P(T > t_\alpha) = \alpha, \qquad T \sim t(n-1).$$
A final reminder of how confidence intervals should be understood, based on the beautiful exposition of Silvey (1975, Statistical Inference). The confidence interval in (26) should be read as: whatever the true value of $\mu$ may be, the probability that the random interval
$$\Big(\bar X_n - \frac{s_n}{\sqrt n}\,t_{\alpha/2},\ \bar X_n + \frac{s_n}{\sqrt n}\,t_{\alpha/2}\Big)$$
contains this true value is $1-\alpha$.
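As a final sketch (assuming numpy/scipy; the data are simulated for illustration), the interval (26) is computed exactly as in the known-$\sigma$ case, with $s_n$ replacing $\sigma$ and the $t(n-1)$ critical value replacing the normal one; over many repeated samples it covers the true $\mu$ a fraction $1-\alpha$ of the time, in line with Silvey's reading above.

```python
import numpy as np
from scipy.stats import t

# Sketch: the unknown-sigma confidence interval (26) and its coverage
# frequency over repeated samples.
rng = np.random.default_rng(7)
mu_true, sigma, n, alpha, reps = 10.0, 2.0, 25, 0.05, 20_000

t_crit = t.ppf(1 - alpha / 2, df=n - 1)
X = rng.normal(mu_true, sigma, size=(reps, n))
half = t_crit * X.std(axis=1, ddof=1) / np.sqrt(n)
lower, upper = X.mean(axis=1) - half, X.mean(axis=1) + half

coverage = np.mean((lower <= mu_true) & (mu_true <= upper))
print(coverage)  # approximately 1 - alpha = 0.95
```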
3 Hypothesis Testing
3.1 Introduction and Basic Definitions
We now want to consider a problem somewhat different to that of point estimation of parameters: that of testing hypotheses about those parameters. We will continue to assume that we have a statistical model specified as a joint density, or, equivalently