A Maximum Entropy Approach
to Natural Language Processing
Adam L. Berger†
Columbia University
Vincent J. Della Pietra‡
Renaissance Technologies
Stephen A. Della Pietra‡
Renaissance Technologies
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only
recently, however, have computers become powerful enough to permit the widescale application
of this concept to real world problems in statistical estimation and pattern recognition. In this
paper, we describe a method for statistical modeling based on maximum entropy. We present
a maximum-likelihood approach for automatically constructing maximum entropy models and
describe how to implement this approach efficiently, using as examples several problems in natural
language processing.
1. Introduction
Statistical modeling addresses the problem of constructing a stochastic model to predict
the behavior of a random process. In constructing this model, we typically have at our
disposal a sample of output from the process. Given this sample, which represents an
incomplete state of knowledge about the process, the modeling problem is to parlay
this knowledge into a representation of the process. We can then use this representation
to make predictions about the future behavior of the process.
Baseball managers (who rank among the better paid statistical modelers) employ
batting averages, compiled from a history of at-bats, to gauge the likelihood that a
player will succeed in his next appearance at the plate. Thus informed, they manipu-
late their lineups accordingly. Wall Street speculators (who rank among the best paid
statistical modelers) build models based on past stock price movements to predict to-
morrow's fluctuations and alter their portfolios to capitalize on the predicted future.
At the other end of the pay scale reside natural language researchers, who design
language and acoustic models for use in speech recognition systems and related ap-
plications.
The past few decades have witnessed significant progress toward increasing the
predictive capacity of statistical models of natural language. In language modeling, for
instance, Bahl et al. (1989) have used decision tree models and Della Pietra et al. (1994)
have used automatically inferred link grammars to model long range correlations in
language. In parsing, Black et al. (1992) have described how to extract grammatical
* This research, supported in part by ARPA under grant ONR N00014-91-C-0135, was conducted while
the authors were at the IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598.
† Now at Computer Science Department, Columbia University.
‡ Now at Renaissance Technologies, Stony Brook, NY.
© 1996 Association for Computational Linguistics
Computational Linguistics Volume 22, Number 1
rules from annotated text automatically and incorporate these rules into statistical
models of grammar. In speech recognition, Lucassen and Mercer (1984) have intro-
duced a technique for automatically discovering relevant features for the translation
of word spelling to word pronunciation.
These efforts, while varied in specifics, all confront two essential tasks of statistical
modeling. The first task is to determine a set of statistics that captures the behavior of
a random process. Given a set of statistics, the second task is to corral these facts into
an accurate model of the process--a model capable of predicting the future output
of the process. The first task is one of feature selection; the second is one of model
selection. In the following pages we present a unified approach to these two tasks
based on the maximum entropy philosophy.
In Section 2 we give an overview of the maximum entropy philosophy and work
through a motivating example. In Section 3 we describe the mathematical structure of
maximum entropy models and give an efficient algorithm for estimating the parame-
ters of such models. In Section 4 we discuss feature selection, and present an automatic
method for discovering facts about a process from a sample of output from the process.
We then present a series of refinements to the method to make it practical to imple-
ment. Finally, in Section 5 we describe the application of maximum entropy ideas to
several tasks in stochastic language processing: bilingual sense disambiguation, word
reordering, and sentence segmentation.
2. A Maximum Entropy Overview
We introduce the concept of maximum entropy through a simple example. Suppose we
wish to model an expert translator's decisions concerning the proper French rendering
of the English word in. Our model p of the expert's decisions assigns to each French
word or phrase f an estimate, p(f), of the probability that the expert would choose f as
a translation of in. To guide us in developing p, we collect a large sample of instances
of the expert's decisions. Our goal is to extract a set of facts about the decision-making
process from the sample (the first task of modeling) that will aid us in constructing a
model of this process (the second task).
One obvious clue we might glean from the sample is the list of allowed trans-
lations. For example, we might discover that the expert translator always chooses
among the following five French phrases: {dans, en, à, au cours de, pendant}. With this
information in hand, we can impose our first constraint on our model p:
p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
This equation represents our first statistic of the process; we can now proceed to
search for a suitable model that obeys this equation. Of course, there are an infinite
number of models p for which this identity holds. One model satisfying the above
equation is p(dans) = 1; in other words, the model always predicts dans. Another
model obeying this constraint predicts pendant with a probability of 1/2, and à with a
probability of 1/2. But both of these models offend our sensibilities: knowing only that
the expert always chose from among these five French phrases, how can we justify
either of these probability distributions? Each seems to be making rather bold assump-
tions, with no empirical justification. Put another way, these two models assume more
than we actually know about the expert's decision-making process. All we know is
that the expert chose exclusively from among these five French phrases; given this,
these questions, how do we go about finding the most uniform model subject to a set
of constraints like those we have described?
The maximum entropy method answers both of these questions, as we will demon-
strate in the next few pages. Intuitively, the principle is simple: model all that is known
and assume nothing about that which is unknown. In other words, given a collection
of facts, choose a model consistent with all the facts, but otherwise as uniform as
possible. This is precisely the approach we took in selecting our model p at each step
in the above example.
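The uniformity argument can be checked numerically: among the models that satisfy only the normalization constraint, the uniform distribution over the five phrases has the greatest entropy (the measure of uniformity made precise in Section 3). A minimal Python sketch; the candidate models and helper names are ours, for illustration only:

```python
import math

# The five observed translations of "in" from the example.
phrases = ["dans", "en", "à", "au cours de", "pendant"]

def entropy(p):
    """Entropy -sum_f p(f) log p(f) of a distribution stored as a dict."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

# Three models, each satisfying p(dans)+p(en)+p(à)+p(au cours de)+p(pendant) = 1.
always_dans = {f: (1.0 if f == "dans" else 0.0) for f in phrases}
half_half = {f: (0.5 if f in ("pendant", "à") else 0.0) for f in phrases}
uniform = {f: 1.0 / len(phrases) for f in phrases}

# The uniform model is the least committal: it has the highest entropy.
assert entropy(always_dans) < entropy(half_half) < entropy(uniform)
```

The two "bold" models concentrate probability without empirical justification, and their low entropy reflects exactly that.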
The maximum entropy concept has a long history. Adopting the least complex
hypothesis possible is embodied in Occam's razor ("Nunquam ponenda est pluralitas
sine necessitate.") and even appears earlier, in the Bible and the writings of Herodotus
(Jaynes 1990). Laplace might justly be considered the father of maximum entropy,
having enunciated the underlying theme 200 years ago in his "Principle of Insufficient
Reason:" when one has no information to distinguish between the probability of two
events, the best strategy is to consider them equally likely (Guiasu and Shenitzer
1985). As E. T. Jaynes, a more recent pioneer of maximum entropy, put it (Jaynes
1990):
... the fact that a certain probability distribution maximizes entropy
subject to certain constraints representing our incomplete information,
is the fundamental property which justifies use of that distribution
for inference; it agrees with everything that is known, but carefully
avoids assuming anything that is not known. It is a transcription into
mathematics of an ancient principle of wisdom ...
3. Maximum Entropy Modeling
We consider a random process that produces an output value y, a member of a finite set
𝒴. For the translation example just considered, the process generates a translation of the
word in, and the output y can be any word in the set {dans, en, à, au cours de, pendant}.
In generating y, the process may be influenced by some contextual information x, a
member of a finite set X. In the present example, this information could include the
words in the English sentence surrounding in.
Our task is to construct a stochastic model that accurately represents the behavior
of the random process. Such a model is a method of estimating the conditional prob-
ability that, given a context x, the process will output y. We will denote by p(y|x) the
probability that the model assigns to y in context x. With a slight abuse of notation, we
will also use p(y|x) to denote the entire conditional probability distribution provided
by the model, with the interpretation that y and x are placeholders rather than specific
instantiations. The proper interpretation should be clear from the context. We will denote
by 𝒫 the set of all conditional probability distributions. Thus a model p(y|x) is,
by definition, just an element of 𝒫.
3.1 Training Data
To study the process, we observe the behavior of the random process for some time,
collecting a large number of samples (x₁, y₁), (x₂, y₂), …, (x_N, y_N). In the example we
have been considering, each sample would consist of a phrase x containing the words
surrounding in, together with the translation y of in that the process produced. For
now, we can imagine that these training samples have been generated by a human
expert who was presented with a number of random phrases containing in and asked
to choose a good translation for each. When we discuss real-world applications in
Combining (1), (2), and (3) yields the more explicit equation

∑_{x,y} p̃(x) p(y|x) f(x, y) = ∑_{x,y} p̃(x, y) f(x, y)
We call the requirement (3) a constraint equation or simply a constraint. By
restricting attention to those models p(y|x) for which (3) holds, we are eliminating from
consideration those models that do not agree with the training sample on how often
the output of the process should exhibit the feature f.
To sum up so far, we now have a means of representing statistical phenomena
inherent in a sample of data (namely, p̃(f)), and also a means of requiring that our
model of the process exhibit these phenomena (namely, p(f) = p̃(f)).
One final note about features and constraints bears repeating: although the words
"feature" and "constraint" are often used interchangeably in discussions of maximum
entropy, we will be vigilant in distinguishing the two and urge the reader to do
likewise. A feature is a binary-valued function of (x,y); a constraint is an equation
between the expected value of the feature function in the model and its expected
value in the training data.
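To keep the distinction concrete, here is a small Python sketch; the sample pairs and the feature itself are invented for illustration. The feature is a binary function of (x, y); the constraint is the equation p(f) = p̃(f), where p̃(f) is the feature's average value over the training pairs:

```python
# Hypothetical training sample of (context, translation) pairs.
samples = [("in April", "en"), ("in April", "en"),
           ("in the park", "dans"), ("in time", "à")]

def f(x, y):
    """Binary feature: fires when the translation is 'en' and the context mentions 'April'."""
    return 1 if y == "en" and "April" in x else 0

# Empirical expectation p~(f): the average of f over the training sample.
p_tilde_f = sum(f(x, y) for x, y in samples) / len(samples)

# The corresponding constraint requires any candidate model p to satisfy
# p(f) = p_tilde_f; here p~(f) = 0.5.
```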
3.3 The Maximum Entropy Principle
Suppose that we are given n feature functions f_i, which determine statistics we feel
are important in modeling the process. We would like our model to accord with these
statistics. That is, we would like p to lie in the subset C of 𝒫 defined by

C ≡ { p ∈ 𝒫 | p(f_i) = p̃(f_i) for i ∈ {1, 2, …, n} }   (4)
Figure 1 provides a geometric interpretation of this setup. Here 𝒫 is the space of all
(unconditional) probability distributions on three points, sometimes called a simplex.
If we impose no constraints (depicted in (a)), then all probability models are allowable.
Imposing one linear constraint C₁ restricts us to those p ∈ 𝒫 that lie on the region
defined by C₁, as shown in (b). A second linear constraint could determine p exactly, if
the two constraints are satisfiable; this is the case in (c), where the intersection of C₁ and
C₂ is non-empty. Alternatively, a second linear constraint could be inconsistent with the
first--for instance, the first might require that the probability of the first point is 1/3
and the second that the probability of the third point is 3/4--this is shown in (d). In the
present setting, however, the linear constraints are extracted from the training sample
and cannot, by construction, be inconsistent. Furthermore, the linear constraints in our
applications will not even come close to determining p ∈ 𝒫 uniquely as they do in (c);
instead, the set C = C₁ ∩ C₂ ∩ … ∩ C_n of allowable models will be infinite.
Among the models p E C, the maximum entropy philosophy dictates that we select
the most uniform distribution. But now we face a question left open in Section 2: what
does "uniform" mean?
A mathematical measure of the uniformity of a conditional distribution p(y|x) is
provided by the conditional entropy¹

H(p) ≡ − ∑_{x,y} p̃(x) p(y|x) log p(y|x)   (5)
1 A more common notation for the conditional entropy is H(Y|X), where Y and X are random variables
with joint distribution p̃(x)p(y|x). To emphasize the dependence of the entropy on the probability
distribution p, we have adopted the alternate notation H(p).
Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach
Figure 1
Four different scenarios in constrained optimization. 𝒫 represents the space of all probability
distributions. In (a), no constraints are applied, and all p ∈ 𝒫 are allowable. In (b), the
constraint C₁ narrows the set of allowable models to those that lie on the line defined by the
linear constraint. In (c), two consistent constraints C₁ and C₂ define a single model p ∈ C₁ ∩ C₂.
In (d), the two constraints are inconsistent (i.e., C₁ ∩ C₃ = ∅); no p ∈ 𝒫 can satisfy them both.
The entropy is bounded from below by zero, the entropy of a model with no uncertainty
at all, and from above by log|𝒴|, the entropy of the uniform distribution over
all possible |𝒴| values of y. With this definition in hand, we are ready to present the
principle of maximum entropy.

Maximum Entropy Principle
To select a model from a set C of allowed probability distributions,
choose the model p* ∈ C with maximum entropy H(p):

p* = argmax_{p ∈ C} H(p)   (6)

It can be shown that p* is always well-defined; that is, there is always a unique model
p* with maximum entropy in any constrained set C.
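The conditional entropy (5) is straightforward to compute from p̃(x) and a candidate model p(y|x). A small sketch with invented toy numbers (not from the paper):

```python
import math

# Toy empirical context distribution p~(x).
p_tilde_x = {"x1": 0.5, "x2": 0.5}

# A candidate conditional model p(y|x).
model = {"x1": {"en": 0.5, "dans": 0.5},
         "x2": {"en": 1.0, "dans": 0.0}}

def conditional_entropy(p_tilde_x, model):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), skipping zero-probability terms."""
    return -sum(px * py * math.log(py)
                for x, px in p_tilde_x.items()
                for py in model[x].values() if py > 0)

H = conditional_entropy(p_tilde_x, model)
# The deterministic context x2 contributes nothing; here H = 0.5 * log 2.
```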
3.4 Parametric Form
The maximum entropy principle presents us with a problem in constrained optimization:
find the p* ∈ C that maximizes H(p). In simple cases, we can find the solution to
this problem analytically. This was true for the example presented in Section 2 when
we imposed the first two constraints on p. Unfortunately, the solution to the general
problem of maximum entropy cannot be written explicitly, and we need a more in-
direct approach. (The reader is invited to try to calculate the solution for the same
example when the third constraint is imposed.)
To address the general problem, we apply the method of Lagrange multipliers
from the theory of constrained optimization. The relevant steps are outlined here;
the reader is referred to Della Pietra et al. (1995) for a more thorough discussion of
constrained optimization as applied to maximum entropy.
• We will refer to the original constrained optimization problem,

  find p* = argmax_{p ∈ C} H(p)

  as the primal problem.

• For each feature f_i we introduce a parameter λ_i (a Lagrange multiplier).
  We define the Lagrangian Λ(p, λ) by

  Λ(p, λ) ≡ H(p) + ∑_i λ_i ( p(f_i) − p̃(f_i) )   (7)

• Holding λ fixed, we compute the unconstrained maximum of the
  Lagrangian Λ(p, λ) over all p ∈ 𝒫. We denote by p_λ the p where Λ(p, λ)
  achieves its maximum and by Ψ(λ) the value at this maximum:

  p_λ = argmax_{p ∈ 𝒫} Λ(p, λ)   (8)

  Ψ(λ) ≡ Λ(p_λ, λ)   (9)

  We call Ψ(λ) the dual function. The functions p_λ and Ψ(λ) may be
  calculated explicitly using simple calculus. We find

  p_λ(y|x) = (1 / Z_λ(x)) exp( ∑_i λ_i f_i(x, y) )   (10)

  Ψ(λ) = − ∑_x p̃(x) log Z_λ(x) + ∑_i λ_i p̃(f_i)   (11)

  where Z_λ(x) is a normalizing constant determined by the requirement
  that ∑_y p_λ(y|x) = 1 for all x:

  Z_λ(x) = ∑_y exp( ∑_i λ_i f_i(x, y) )   (12)

• Finally, we pose the unconstrained dual optimization problem:

  find λ* = argmax_λ Ψ(λ)
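In other words, (10) and (12) describe what is now usually called a log-linear model: each output's score is the exponential of a weighted feature sum, normalized over all outputs for the given context. A sketch with invented features and weights:

```python
import math

def p_lambda(y, x, features, lambdas, outputs):
    """p_lambda(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x), per (10) and (12)."""
    def score(cand):
        return math.exp(sum(l * f(x, cand) for l, f in zip(lambdas, features)))
    z = sum(score(cand) for cand in outputs)   # the normalizer Z_lambda(x)
    return score(y) / z

# Invented example: two binary features over the five translations of "in".
outputs = ["dans", "en", "à", "au cours de", "pendant"]
features = [lambda x, y: 1 if y == "en" and "April" in x else 0,
            lambda x, y: 1 if y == "dans" else 0]
lambdas = [1.5, 0.3]

probs = {y: p_lambda(y, "in April", features, lambdas, outputs) for y in outputs}
assert abs(sum(probs.values()) - 1.0) < 1e-9   # Z_lambda makes the model normalize
```

With these weights, the firing feature boosts en above the other candidates in the context "in April".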
At first glance it is not clear what these machinations achieve. However, a fundamental
principle in the theory of Lagrange multipliers, called generically the Kuhn-Tucker
theorem, asserts that under suitable assumptions, the primal and dual problems
are, in fact, closely related. This is the case in the present situation. Although a detailed
account of this relationship is beyond the scope of this paper, it is easy to state
the final result: Suppose that λ* is the solution of the dual problem. Then p_λ* is the
solution of the primal problem; that is, p_λ* = p*. In other words,

The maximum entropy model subject to the constraints C has the parametric
form² p_λ* of (10), where the parameter values λ* can be determined
by maximizing the dual function Ψ(λ).

The most important practical consequence of this result is that any algorithm for
finding the maximum λ* of Ψ(λ) can be used to find the maximum p* of H(p) for
p ∈ C.
3.5 Relation to Maximum Likelihood
The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is
defined by³

L_p̃(p) ≡ log ∏_{x,y} p(y|x)^p̃(x,y) = ∑_{x,y} p̃(x, y) log p(y|x)   (13)
It is easy to check that the dual function Ψ(λ) of the previous section is, in fact, just
the log-likelihood for the exponential model p_λ; that is

Ψ(λ) = L_p̃(p_λ)   (14)

With this interpretation, the result of the previous section can be rephrased as:

The model p* ∈ C with maximum entropy is the model in the parametric
family p_λ(y|x) that maximizes the likelihood of the training
sample p̃.
This result provides an added justification for the maximum entropy principle: if
the notion of selecting a model p* on the basis of maximum entropy isn't compelling
enough, it so happens that this same p* is also the model that can best account for the
training sample, from among all models of the same parametric form (10).
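Identity (14) is easy to verify numerically: for any fixed λ, evaluating the dual function (11) and the log-likelihood (13) of p_λ gives the same number. A self-contained sketch with invented data and features:

```python
import math
from collections import Counter

# Invented training sample of (context, output) pairs.
samples = [("a", "en"), ("a", "dans"), ("b", "en"), ("b", "en")]
outputs = ["en", "dans"]
features = [lambda x, y: 1 if y == "en" else 0,
            lambda x, y: 1 if (x, y) == ("b", "en") else 0]
lambdas = [0.7, -0.2]

n = len(samples)
p_xy = {k: c / n for k, c in Counter(samples).items()}                # p~(x, y)
p_x = {k: c / n for k, c in Counter(x for x, _ in samples).items()}   # p~(x)

def z(x):
    """Normalizer Z_lambda(x) of (12)."""
    return sum(math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
               for y in outputs)

def p_model(y, x):
    """Exponential model p_lambda(y|x) of (10)."""
    return math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features))) / z(x)

# Dual function Psi(lambda), eq. (11).
psi = (-sum(px * math.log(z(x)) for x, px in p_x.items())
       + sum(l * sum(pxy * f(x, y) for (x, y), pxy in p_xy.items())
             for l, f in zip(lambdas, features)))

# Log-likelihood L(p_lambda), eq. (13).
loglik = sum(pxy * math.log(p_model(y, x)) for (x, y), pxy in p_xy.items())

assert abs(psi - loglik) < 1e-9   # Psi(lambda) = L(p_lambda), eq. (14)
```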
Table 1 summarizes the primal-dual framework we have established.
2 It might be that the dual function Ψ(λ) does not achieve its maximum at any finite λ*. In this case, the
maximum entropy model will not have the form p_λ for any λ. However, it will be the limit of models
of this form, as indicated by the following result, whose proof we omit:

Suppose λ_n is any sequence such that Ψ(λ_n) converges to the maximum of Ψ(λ). Then p_λ_n
converges to p*.

3 We will henceforth abbreviate L_p̃(p) by L(p) when the empirical distribution p̃ is clear from context.
Table 1
The duality of maximum entropy and maximum likelihood is an example
of the more general phenomenon of duality in constrained optimization.

                 Primal                      Dual
problem          argmax_{p ∈ C} H(p)        argmax_λ Ψ(λ)
description      maximum entropy            maximum likelihood
type of search   constrained optimization   unconstrained optimization
search domain    p ∈ C                      real-valued vectors {λ₁, λ₂, …}
solution         p*                         λ*

Kuhn-Tucker theorem: p* = p_λ*
3.6 Computing the Parameters
For all but the most simple problems, the λ* that maximize Ψ(λ) cannot be found
analytically. Instead, we must resort to numerical methods. From the perspective of
numerical optimization, the function Ψ(λ) is well behaved, since it is smooth and
convex-∩ (i.e., concave) in λ. Consequently, a variety of numerical methods can be used to
calculate λ*. One simple method is coordinate-wise ascent, in which λ* is computed by
iteratively maximizing Ψ(λ) one coordinate at a time.
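A minimal version of coordinate-wise ascent can be sketched as follows: repeatedly nudge one λ_i at a time, keep a move only when it increases Ψ(λ), and shrink the step size over time. The data, feature, and step schedule below are invented, and this naive search stands in for the closed-form iterative scaling updates used in practice. At the maximum, the model's expectation of each feature matches its empirical expectation:

```python
import math
from collections import Counter

# Toy training sample and a single binary feature (invented for illustration).
samples = [("a", "en"), ("a", "dans"), ("b", "en"), ("b", "en")]
outputs = ["en", "dans"]
features = [lambda x, y: 1 if y == "en" else 0]

n = len(samples)
p_x = {k: c / n for k, c in Counter(x for x, _ in samples).items()}   # p~(x)
p_f = [sum(f(x, y) for x, y in samples) / n for f in features]        # p~(f_i)

def psi(lam):
    """Dual function Psi(lambda) of (11)."""
    def z(x):
        return sum(math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                   for y in outputs)
    return (-sum(px * math.log(z(x)) for x, px in p_x.items())
            + sum(l * pf for l, pf in zip(lam, p_f)))

# Coordinate-wise ascent: improve one coordinate at a time with a shrinking step.
lam = [0.0]
step = 1.0
for _ in range(200):
    for i in range(len(lam)):
        for delta in (step, -step):
            trial = list(lam)
            trial[i] += delta
            if psi(trial) > psi(lam):
                lam = trial
    step *= 0.9

# With this single feature, p~(f) = 3/4, and the constraint p(f) = p~(f)
# forces p(en|x) = e^lambda / (e^lambda + 1) = 3/4, i.e. lambda* = log 3.
assert abs(lam[0] - math.log(3)) < 0.05
```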