A Maximum Entropy Approach
to Natural Language Processing
Adam L. Berger†
Columbia University
Vincent J. Della Pietra‡
Renaissance Technologies
Stephen A. Della Pietra‡
Renaissance Technologies
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only
recently, however, have computers become powerful enough to permit the widescale application
of this concept to real world problems in statistical estimation and pattern recognition. In this
paper, we describe a method for statistical modeling based on maximum entropy. We present
a maximum-likelihood approach for automatically constructing maximum entropy models and
describe how to implement this approach efficiently, using as examples several problems in natural
language processing.
1. Introduction
Statistical modeling addresses the problem of constructing a stochastic model to predict
the behavior of a random process. In constructing this model, we typically have at our
disposal a sample of output from the process. Given this sample, which represents an
incomplete state of knowledge about the process, the modeling problem is to parlay
this knowledge into a representation of the process. We can then use this representation
to make predictions about the future behavior of the process.
Baseball managers (who rank among the better paid statistical modelers) employ
batting averages, compiled from a history of at-bats, to gauge the likelihood that a
player will succeed in his next appearance at the plate. Thus informed, they manipu-
late their lineups accordingly. Wall Street speculators (who rank among the best paid
statistical modelers) build models based on past stock price movements to predict to-
morrow's fluctuations and alter their portfolios to capitalize on the predicted future.
At the other end of the pay scale reside natural language researchers, who design
language and acoustic models for use in speech recognition systems and related ap-
plications.
The past few decades have witnessed significant progress toward increasing the
predictive capacity of statistical models of natural language. In language modeling, for
instance, Bahl et al. (1989) have used decision tree models and Della Pietra et al. (1994)
have used automatically inferred link grammars to model long range correlations in
language. In parsing, Black et al. (1992) have described how to extract grammatical
* This research, supported in part by ARPA under grant ONR N00014-91-C-0135, was conducted while
the authors were at the IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.
† Now at Computer Science Department, Columbia University.
‡ Now at Renaissance Technologies, Stony Brook, NY.
© 1996 Association for Computational Linguistics
Computational Linguistics Volume 22, Number 1
rules from annotated text automatically and incorporate these rules into statistical
models of grammar. In speech recognition, Lucassen and Mercer (1984) have intro-
duced a technique for automatically discovering relevant features for the translation
of word spelling to word pronunciation.
These efforts, while varied in specifics, all confront two essential tasks of statistical
modeling. The first task is to determine a set of statistics that captures the behavior of
a random process. Given a set of statistics, the second task is to corral these facts into
an accurate model of the process--a model capable of predicting the future output
of the process. The first task is one of feature selection; the second is one of model
selection. In the following pages we present a unified approach to these two tasks
based on the maximum entropy philosophy.
In Section 2 we give an overview of the maximum entropy philosophy and work
through a motivating example. In Section 3 we describe the mathematical structure of
maximum entropy models and give an efficient algorithm for estimating the parame-
ters of such models. In Section 4 we discuss feature selection, and present an automatic
method for discovering facts about a process from a sample of output from the process.
We then present a series of refinements to the method to make it practical to imple-
ment. Finally, in Section 5 we describe the application of maximum entropy ideas to
several tasks in stochastic language processing: bilingual sense disambiguation, word
reordering, and sentence segmentation.
2. A Maximum Entropy Overview
We introduce the concept of maximum entropy through a simple example. Suppose we
wish to model an expert translator's decisions concerning the proper French rendering
of the English word in. Our model p of the expert's decisions assigns to each French
word or phrase f an estimate, p(f), of the probability that the expert would choose f as
a translation of in. To guide us in developing p, we collect a large sample of instances
of the expert's decisions. Our goal is to extract a set of facts about the decision-making
process from the sample (the first task of modeling) that will aid us in constructing a
model of this process (the second task).
One obvious clue we might glean from the sample is the list of allowed trans-
lations. For example, we might discover that the expert translator always chooses
among the following five French phrases: {dans, en, à, au cours de, pendant}. With this
information in hand, we can impose our first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the process; we can now proceed to
search for a suitable model that obeys this equation. Of course, there are an infinite
number of models p for which this identity holds. One model satisfying the above
equation is p(dans) = 1; in other words, the model always predicts dans. Another
model obeying this constraint predicts pendant with a probability of 1/2, and à with a
probability of 1/2. But both of these models offend our sensibilities: knowing only that
the expert always chose from among these five French phrases, how can we justify
either of these probability distributions? Each seems to be making rather bold assump-
tions, with no empirical justification. Put another way, these two models assume more
than we actually know about the expert's decision-making process. All we know is
that the expert chose exclusively from among these five French phrases; given this,
these questions, how do we go about finding the most uniform model subject to a set
of constraints like those we have described?
The maximum entropy method answers both of these questions, as we will demon-
strate in the next few pages. Intuitively, the principle is simple: model all that is known
and assume nothing about that which is unknown. In other words, given a collection
of facts, choose a model consistent with all the facts, but otherwise as uniform as
possible. This is precisely the approach we took in selecting our model p at each step
in the above example.
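The uniformity argument can be checked numerically: among the models that satisfy only the normalization constraint, the uniform distribution over the five phrases has the greatest entropy (the measure of uniformity made precise in Section 3). A minimal Python sketch; the candidate models and helper names are ours, for illustration only:

```python
import math

# The five observed translations of "in" from the example.
phrases = ["dans", "en", "à", "au cours de", "pendant"]

def entropy(p):
    """Entropy -sum_f p(f) log p(f) of a distribution stored as a dict."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# Three models, each satisfying p(dans)+p(en)+p(à)+p(au cours de)+p(pendant) = 1.
always_dans = {f: (1.0 if f == "dans" else 0.0) for f in phrases}
half_half = {f: (0.5 if f in ("pendant", "à") else 0.0) for f in phrases}
uniform = {f: 1.0 / len(phrases) for f in phrases}

# The uniform model is the least committal: it has the highest entropy.
assert entropy(always_dans) < entropy(half_half) < entropy(uniform)
```

The two "bold" models concentrate probability without empirical justification, and their low entropy reflects exactly that.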
The maximum entropy concept has a long history. Adopting the least complex
hypothesis possible is embodied in Occam's razor ("Nunquam ponenda est pluralitas
sine necessitate.") and even appears earlier, in the Bible and the writings of Herodotus
(Jaynes 1990). Laplace might justly be considered the father of maximum entropy,
having enunciated the underlying theme 200 years ago in his "Principle of Insufficient
Reason:" when one has no information to distinguish between the probability of two
events, the best strategy is to consider them equally likely (Guiasu and Shenitzer
1985). As E. T. Jaynes, a more recent pioneer of maximum entropy, put it (Jaynes
1990):
... the fact that a certain probability distribution maximizes entropy
subject to certain constraints representing our incomplete information,
is the fundamental property which justifies use of that distribution
for inference; it agrees with everything that is known, but carefully
avoids assuming anything that is not known. It is a transcription into
mathematics of an ancient principle of wisdom ...
3. Maximum Entropy Modeling
We consider a random process that produces an output value y, a member of a finite set
𝒴. For the translation example just considered, the process generates a translation of the
word in, and the output y can be any word in the set {dans, en, à, au cours de, pendant}.
In generating y, the process may be influenced by some contextual information x, a
member of a finite set X. In the present example, this information could include the
words in the English sentence surrounding in.
Our task is to construct a stochastic model that accurately represents the behavior
of the random process. Such a model is a method of estimating the conditional prob-
ability that, given a context x, the process will output y. We will denote by p(y|x) the
probability that the model assigns to y in context x. With a slight abuse of notation, we
will also use p(y|x) to denote the entire conditional probability distribution provided
by the model, with the interpretation that y and x are placeholders rather than specific
instantiations. The proper interpretation should be clear from the context. We will denote
by 𝒫 the set of all conditional probability distributions. Thus a model p(y|x) is,
by definition, just an element of 𝒫.
3.1 Training Data
To study the process, we observe the behavior of the random process for some time,
collecting a large number of samples (x₁, y₁), (x₂, y₂), …, (x_N, y_N). In the example we
have been considering, each sample would consist of a phrase x containing the words
surrounding in, together with the translation y of in that the process produced. For
now, we can imagine that these training samples have been generated by a human
expert who was presented with a number of random phrases containing in and asked
to choose a good translation for each. When we discuss real-world applications in
Combining (1), (2), and (3) yields the more explicit equation

∑_{x,y} p̃(x) p(y|x) f(x, y) = ∑_{x,y} p̃(x, y) f(x, y)
We call the requirement (3) a constraint equation or simply a constraint. By
restricting attention to those models p(y|x) for which (3) holds, we are eliminating from
consideration those models that do not agree with the training sample on how often
the output of the process should exhibit the feature f.
To sum up so far, we now have a means of representing statistical phenomena
inherent in a sample of data (namely, p̃(f)), and also a means of requiring that our
model of the process exhibit these phenomena (namely, p(f) = p̃(f)).
One final note about features and constraints bears repeating: although the words
"feature" and "constraint" are often used interchangeably in discussions of maximum
entropy, we will be vigilant in distinguishing the two and urge the reader to do
likewise. A feature is a binary-valued function of (x,y); a constraint is an equation
between the expected value of the feature function in the model and its expected
value in the training data.
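To keep the distinction concrete, here is a small Python sketch; the sample pairs and the feature itself are invented for illustration. The feature is a binary function of (x, y); the constraint is the equation p(f) = p̃(f), where p̃(f) is the feature's average value over the training pairs:

```python
# Hypothetical training sample of (context, translation) pairs.
samples = [("in April", "en"), ("in April", "en"),
           ("in the park", "dans"), ("in time", "à")]

def f(x, y):
    """Binary feature: fires when the translation is 'en' and the context mentions 'April'."""
    return 1 if y == "en" and "April" in x else 0

# Empirical expectation p~(f): the average of f over the training sample.
p_tilde_f = sum(f(x, y) for x, y in samples) / len(samples)

# The corresponding constraint requires any candidate model p to satisfy
# p(f) = p_tilde_f; here p~(f) = 0.5.
```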
3.3 The Maximum Entropy Principle
Suppose that we are given n feature functions f_i, which determine statistics we feel
are important in modeling the process. We would like our model to accord with these
statistics. That is, we would like p to lie in the subset C of 𝒫 defined by

C ≡ { p ∈ 𝒫 | p(f_i) = p̃(f_i) for i ∈ {1, 2, …, n} }   (4)
Figure 1 provides a geometric interpretation of this setup. Here 𝒫 is the space of all
(unconditional) probability distributions on three points, sometimes called a simplex.
If we impose no constraints (depicted in (a)), then all probability models are allowable.
Imposing one linear constraint C₁ restricts us to those p ∈ 𝒫 that lie on the region
defined by C₁, as shown in (b). A second linear constraint could determine p exactly, if
the two constraints are satisfiable; this is the case in (c), where the intersection of C₁ and
C₂ is non-empty. Alternatively, a second linear constraint could be inconsistent with the
first--for instance, the first might require that the probability of the first point is 1/3
and the second that the probability of the third point is 3/4--this is shown in (d). In the
present setting, however, the linear constraints are extracted from the training sample
and cannot, by construction, be inconsistent. Furthermore, the linear constraints in our
applications will not even come close to determining p ∈ 𝒫 uniquely as they do in (c);
instead, the set C = C₁ ∩ C₂ ∩ … ∩ C_n of allowable models will be infinite.
Among the models p E C, the maximum entropy philosophy dictates that we select
the most uniform distribution. But now we face a question left open in Section 2: what
does "uniform" mean?
A mathematical measure of the uniformity of a conditional distribution p(y|x) is
provided by the conditional entropy¹

H(p) ≡ − ∑_{x,y} p̃(x) p(y|x) log p(y|x)   (5)
1 A more common notation for the conditional entropy is H(Y|X), where Y and X are random variables
with joint distribution p̃(x)p(y|x). To emphasize the dependence of the entropy on the probability
distribution p, we have adopted the alternate notation H(p).
Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach
Figure 1
Four different scenarios in constrained optimization. 𝒫 represents the space of all probability
distributions. In (a), no constraints are applied, and all p ∈ 𝒫 are allowable. In (b), the
constraint C₁ narrows the set of allowable models to those that lie on the line defined by the
linear constraint. In (c), two consistent constraints C₁ and C₂ define a single model p ∈ C₁ ∩ C₂.
In (d), the two constraints are inconsistent (i.e., C₁ ∩ C₃ = ∅); no p ∈ 𝒫 can satisfy them both.
The entropy is bounded from below by zero, the entropy of a model with no uncertainty
at all, and from above by log|𝒴|, the entropy of the uniform distribution over
all possible |𝒴| values of y. With this definition in hand, we are ready to present the
principle of maximum entropy.

Maximum Entropy Principle
To select a model from a set C of allowed probability distributions,
choose the model p* ∈ C with maximum entropy H(p):

p* = argmax_{p ∈ C} H(p)   (6)

It can be shown that p* is always well-defined; that is, there is always a unique model
p* with maximum entropy in any constrained set C.
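The conditional entropy (5) is straightforward to compute from p̃(x) and a candidate model p(y|x). A small sketch with invented toy numbers (not from the paper):

```python
import math

# Toy empirical context distribution p~(x).
p_tilde_x = {"x1": 0.5, "x2": 0.5}

# A candidate conditional model p(y|x).
model = {"x1": {"en": 0.5, "dans": 0.5},
         "x2": {"en": 1.0, "dans": 0.0}}

def conditional_entropy(p_tilde_x, model):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), skipping zero-probability terms."""
    return -sum(px * py * math.log(py)
                for x, px in p_tilde_x.items()
                for py in model[x].values() if py > 0)

H = conditional_entropy(p_tilde_x, model)
# The deterministic context x2 contributes nothing; here H = 0.5 * log 2.
```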
3.4 Parametric Form
The maximum entropy principle presents us with a problem in constrained optimization:
find the p* ∈ C that maximizes H(p). In simple cases, we can find the solution to
this problem analytically. This was true for the example presented in Section 2 when
we imposed the first two constraints on p. Unfortunately, the solution to the general
problem of maximum entropy cannot be written explicitly, and we need a more in-
direct approach. (The reader is invited to try to calculate the solution for the same
example when the third constraint is imposed.)
To address the general problem, we apply the method of Lagrange multipliers
from the theory of constrained optimization. The relevant steps are outlined here;
the reader is referred to Della Pietra et al. (1995) for a more thorough discussion of
constrained optimization as applied to maximum entropy.
• We will refer to the original constrained optimization problem,

  find p* = argmax_{p ∈ C} H(p)

  as the primal problem.

• For each feature f_i we introduce a parameter λ_i (a Lagrange multiplier).
  We define the Lagrangian Λ(p, λ) by

  Λ(p, λ) ≡ H(p) + ∑_i λ_i ( p(f_i) − p̃(f_i) )   (7)

• Holding λ fixed, we compute the unconstrained maximum of the
  Lagrangian Λ(p, λ) over all p ∈ 𝒫. We denote by p_λ the p where Λ(p, λ)
  achieves its maximum and by Ψ(λ) the value at this maximum:

  p_λ = argmax_{p ∈ 𝒫} Λ(p, λ)   (8)

  Ψ(λ) ≡ Λ(p_λ, λ)   (9)

  We call Ψ(λ) the dual function. The functions p_λ and Ψ(λ) may be
  calculated explicitly using simple calculus. We find

  p_λ(y|x) = (1 / Z_λ(x)) exp( ∑_i λ_i f_i(x, y) )   (10)

  Ψ(λ) = − ∑_x p̃(x) log Z_λ(x) + ∑_i λ_i p̃(f_i)   (11)

  where Z_λ(x) is a normalizing constant determined by the requirement
  that ∑_y p_λ(y|x) = 1 for all x:

  Z_λ(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )   (12)

• Finally, we pose the unconstrained dual optimization problem:

  find λ* = argmax_λ Ψ(λ)
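In other words, (10) and (12) describe what is now usually called a log-linear model: each output's score is the exponential of a weighted feature sum, normalized over all outputs for the given context. A sketch with invented features and weights:

```python
import math

def p_lambda(y, x, features, lambdas, outputs):
    """p_lambda(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x), per (10) and (12)."""
    def score(cand):
        return math.exp(sum(l * f(x, cand) for l, f in zip(lambdas, features)))
    z = sum(score(cand) for cand in outputs)   # the normalizer Z_lambda(x)
    return score(y) / z

# Invented example: two binary features over the five translations of "in".
outputs = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if y == "en" and "April" in x else 0,
            lambda x, y: 1 if y == "dans" else 0]
lambdas = [1.5, 0.3]

probs = {y: p_lambda(y, "in April", features, lambdas, outputs) for y in outputs}
assert abs(sum(probs.values()) - 1.0) < 1e-9   # Z_lambda makes the model normalize
```

With these weights, the firing feature boosts en above the other candidates in the context "in April".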
At first glance it is not clear what these machinations achieve. However, a fundamental
principle in the theory of Lagrange multipliers, called generically the Kuhn-Tucker
theorem, asserts that under suitable assumptions, the primal and dual problems
are, in fact, closely related. This is the case in the present situation. Although a detailed
account of this relationship is beyond the scope of this paper, it is easy to state
the final result: Suppose that λ* is the solution of the dual problem. Then p_λ* is the
solution of the primal problem; that is, p_λ* = p*. In other words,

The maximum entropy model subject to the constraints C has the parametric
form² p_λ* of (10), where the parameter values λ* can be determined
by maximizing the dual function Ψ(λ).

The most important practical consequence of this result is that any algorithm for
finding the maximum λ* of Ψ(λ) can be used to find the maximum p* of H(p) for
p ∈ C.
3.5 Relation to Maximum Likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is
defined by³

L_p̃(p) ≡ log ∏_{x,y} p(y|x)^p̃(x,y) = ∑_{x,y} p̃(x, y) log p(y|x)   (13)
It is easy to check that the dual function Ψ(λ) of the previous section is, in fact, just
the log-likelihood for the exponential model p_λ; that is

Ψ(λ) = L_p̃(p_λ)   (14)

With this interpretation, the result of the previous section can be rephrased as:

The model p* ∈ C with maximum entropy is the model in the parametric
family p_λ(y|x) that maximizes the likelihood of the training
sample p̃.
This result provides an added justification for the maximum entropy principle: if
the notion of selecting a model p* on the basis of maximum entropy isn't compelling
enough, it so happens that this same p* is also the model that can best account for the
training sample, from among all models of the same parametric form (10).
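Identity (14) is easy to verify numerically: for any fixed λ, evaluating the dual function (11) and the log-likelihood (13) of p_λ gives the same number. A self-contained sketch with invented data and features:

```python
import math
from collections import Counter

# Invented training sample of (context, output) pairs.
samples = [("a", "en"), ("a", "dans"), ("b", "en"), ("b", "en")]
outputs = ["en", "dans"]
features = [lambda x, y: 1 if y == "en" else 0,
            lambda x, y: 1 if (x, y) == ("b", "en") else 0]
lambdas = [0.7, -0.2]

n = len(samples)
p_xy = {k: c / n for k, c in Counter(samples).items()}                # p~(x, y)
p_x = {k: c / n for k, c in Counter(x for x, _ in samples).items()}   # p~(x)

def z(x):
    """Normalizer Z_lambda(x) of (12)."""
    return sum(math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
               for y in outputs)

def p_model(y, x):
    """Exponential model p_lambda(y|x) of (10)."""
    return math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features))) / z(x)

# Dual function Psi(lambda), eq. (11).
psi = (-sum(px * math.log(z(x)) for x, px in p_x.items())
       + sum(l * sum(pxy * f(x, y) for (x, y), pxy in p_xy.items())
             for l, f in zip(lambdas, features)))

# Log-likelihood L(p_lambda), eq. (13).
loglik = sum(pxy * math.log(p_model(y, x)) for (x, y), pxy in p_xy.items())

assert abs(psi - loglik) < 1e-9   # Psi(lambda) = L(p_lambda), eq. (14)
```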
Table 1 summarizes the primal-dual framework we have established.
2 It might be that the dual function Ψ(λ) does not achieve its maximum at any finite λ*. In this case, the
maximum entropy model will not have the form p_λ for any λ. However, it will be the limit of models
of this form, as indicated by the following result, whose proof we omit:

Suppose λ_n is any sequence such that Ψ(λ_n) converges to the maximum of Ψ(λ). Then p_λ_n
converges to p*.

3 We will henceforth abbreviate L_p̃(p) by L(p) when the empirical distribution p̃ is clear from context.
Table 1
The duality of maximum entropy and maximum likelihood is an example
of the more general phenomenon of duality in constrained optimization.

                 Primal                      Dual
problem          argmax_{p ∈ C} H(p)        argmax_λ Ψ(λ)
description      maximum entropy            maximum likelihood
type of search   constrained optimization   unconstrained optimization
search domain    p ∈ C                      real-valued vectors {λ₁, λ₂, …}
solution         p*                         λ*

Kuhn-Tucker theorem: p* = p_λ*
3.6 Computing the Parameters
For all but the most simple problems, the λ* that maximize Ψ(λ) cannot be found
analytically. Instead, we must resort to numerical methods. From the perspective of
numerical optimization, the function Ψ(λ) is well behaved, since it is smooth and
convex-∩ (i.e., concave) in λ. Consequently, a variety of numerical methods can be used to
calculate λ*. One simple method is coordinate-wise ascent, in which λ* is computed by
iteratively maximizing Ψ(λ) one coordinate at a time.
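A minimal version of coordinate-wise ascent can be sketched as follows: repeatedly nudge one λ_i at a time, keep a move only when it increases Ψ(λ), and shrink the step size over time. The data, feature, and step schedule below are invented, and this naive search stands in for the closed-form iterative scaling updates used in practice. At the maximum, the model's expectation of each feature matches its empirical expectation:

```python
import math
from collections import Counter

# Toy training sample and a single binary feature (invented for illustration).
samples = [("a", "en"), ("a", "dans"), ("b", "en"), ("b", "en")]
outputs = ["en", "dans"]
features = [lambda x, y: 1 if y == "en" else 0]

n = len(samples)
p_x = {k: c / n for k, c in Counter(x for x, _ in samples).items()}   # p~(x)
p_f = [sum(f(x, y) for x, y in samples) / n for f in features]        # p~(f_i)

def psi(lam):
    """Dual function Psi(lambda) of (11)."""
    def z(x):
        return sum(math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                   for y in outputs)
    return (-sum(px * math.log(z(x)) for x, px in p_x.items())
            + sum(l * pf for l, pf in zip(lam, p_f)))

# Coordinate-wise ascent: improve one coordinate at a time with a shrinking step.
lam = [0.0]
step = 1.0
for _ in range(200):
    for i in range(len(lam)):
        for delta in (step, -step):
            trial = list(lam)
            trial[i] += delta
            if psi(trial) > psi(lam):
                lam = trial
    step *= 0.9

# With this single feature, p~(f) = 3/4, and the constraint p(f) = p~(f)
# forces p(en|x) = e^lambda / (e^lambda + 1) = 3/4, i.e. lambda* = log 3.
assert abs(lam[0] - math.log(3)) < 0.05
```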