首页 Extended Boolean IR

Extended Boolean IR

举报
开通vip

Extended Boolean IR RESEARCH CONTRIBUTIONS Extended Boolean Information Retrieval GE~J~ID S~JL~ Cornell University ~]DW,~ A, ~X International Institute for Tropical Agriculture, Ibadan, Nigeria WRY ~ ITF Programming Technology Center Gerard Salton has worked on a nu...

Extended Boolean IR
RESEARCH CONTRIBUTIONS Extended Boolean Information Retrieval GE~J~ID S~JL~ Cornell University ~]DW,~ A, ~X International Institute for Tropical Agriculture, Ibadan, Nigeria WRY ~ ITF Programming Technology Center Gerard Salton has worked on a number of non-numeric computer applications including automatic information retrieval. He has published a large number of articles and several books on information retrieval, most recently, "Introduction to Modern Information Retrieval," 1983. Edward Fox is currently assistant professor of computer science at VPI. Harry Wu is currently involved in the comparative study of relational database systems. His main interest is in information storage and retrieval. Authors' Present Addresses: G. Salton, Dept. of Computer Science, Cornell Univ., Ithaca, NY 14853; E. A. Fox, Virginia Polytechnic Institute and State Univ., Blacksburg, VA 24061; H. Wu, ITT--Programming Technology Center, lO00 Oronogue Lane, Stratford, CT 06497. This study was supported in part by the National Science Foundation under Grant IST-8108696. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1983 ACM 0001-0782/83/1100-1022 75¢ 1. CONVENTIONAL RETRIEVAL STRATEGIES In conventional information retrieval, the stored records are normally identified by sets of key words or index terms, and requests for information are expressed by using Boolean com- binations of index terms. The retrieval strategy is normally based on an auxiliary inverted-term index that lists the corre- sponding set of document references for each allowable index term. The Boolean retrieval system is designed to retrieve all stored records exhibiting the precise combination of key words included in the query: when two query terms are related by an and connective, both terms must be present in order to retrieve a particular stored record; when an or con- nective is used, at least one of the query terms must be present to retrieve a particular item. In some systems where the natural language text of the documents or the document excerpts is stored, the user queries may be formulated as combinations of text words. In that case, the queries may include location restrictions for the query terms--for exam- ple, a requirement that the query terms occur in the same sentence of any retrieved document or within some specified number of words of each other. Boolean retrieval systems have become popular in opera- tional situations because high standards of performance are achievable. Furthermore. the retrieval technology which is based on list intersections and list unions to implement Boo- lean conjunction ("A and B") and Boolean disjunction ("A or B"), respectively, is now well understood [1,2]. The conventional Boolean retrieval technology is however also saddled with various disadvantages: 1. The size of the output obtained in response to a given query is difficult to control; depending on the assignment frequency of the query terms and the actual term combi- nations used in a query formulation, a great deal of output can be obtained or, alternatively, no output might be re- trieved at all. 2. The output obtained in response to a query is not ranked in any order of presumed importance to the user; thus, each retrieved item is assumed to be as important as any other retrieved item. 3. No provisions are made for assigning importance factors or weights to the terms attached either to the documents or ABSTRACT: A new, extended Boolean information-retrieval system is introduced that is intermediate between the Boolean system of query processing and the vector-processing model. The query structure inherent in the Boolean system is preserved, while at the same time weighted terms may be incorporated into both queries and stored documents; the retrieved output can also be ranked in strict similarity order with the user queries. A conventional retrieval system can be modified to make use of the extended system. Laboratory tests indicate that the extended system produces better retrieval output than either the Boolean or the vector-processing system. 1022 Communications of the ACM December 1983 Volume 26 Number 12 Fine 附注 辅助的 Fine 附注 摘录 Fine 附注 背负 Fine 附注 假定的,推定的 Fine 附注 规定 RESEARCH COKrRUmONS to the queries; thus, all terms included in the documents and queries are assumed to carry equal importance. 4. Boolean query formulations may produce counterintuitive results: for example, in response to an or query ("A or B or • .. or Z"), an item containing only one query term is deemed just as important as an item containing all query terms; similarly, given an and query ("A and B and . . , and Z"), an item containing all but one of the query terms is assumed to be just as useless as an item containing none of the query terms. Some of the disadvantages of the conventional Boolean re- trieval system are eliminated in the vector-processing retrieval model [3, 4]. In that case, both the stored records and the information requests are unstructured and expressed simply by sets of key words of varying lengths. In the vector-process- ing system, both query and document terms can be weighted, and a similarity computation between queries and stored rec- ords makes it possible to obtain ranked output in decreasing order of the query-document similarity. By choosing an ap- propriate retrieval threshold, the user can then obtain as much or as little output as desired. The vector-processing system suffers from one major disad- vantage: the structure inherent in the standard Boolean query formulation is absent. This implies that it is no longer possible to incorporate phrase-like constructs or sets of synonymous terms by using and and or connectives, respectively. {In a Boolean query formulation, an and.connective may be used to identify query phrases as in "information and retrieval"; similarly, or connectives may relate synonymous terms as in "hand-held calculators or pocket computers or microcompu- ters.") Various intermediate retrieval systems have been designed that include features of both the Boolean and the vector- processing models. For example, in the SInE system (Syracuse Information Retrieval Experiment}, a standard Boolean query- processing system may be used first to retrieve items that respond precisely to the Boolean query formulations. An out- put ranking is then obtained by using the weights attached to the document terms to display the retrieved items in decreas- ing order according to the sum of the weights of the matching terms in queries and documents [5]. Document weights and output ranking are also incorpo- rated in various extensions of the classical Boolean systems based on fuzzy-set theory [6--9]. In the standard fuzzy-set model of retrieval, the document terms may be weighted and the queries are processed as standard Boolean formulations. Given queries "A or B," "A and B," and "not A," a document X with weights dA(X) and de{X} for terms A and B, respec- tively, receives the following retrieval values in a fuzzy-set retrieval system: max[dA(X}, de{X}] for query {A or B) min[dA(X), de(X)] for query (A and B) 1 - dA(X) for query (not A) For binary document vectors where the document term weights are restricted to 0 and 1, the fuzzy-set model is com- patible with the Boolean system of retrieval. Unfortunately, the fuzzy-set system suffers from a lack of discrimination among the retrieved output nearly to the same extent as the conventional Boolean retrieval system, since the rank of a retrieved item depends only on the lowest or highest weighted document term for and and or queries, respectively. In addition, it is difficult to extend the fuzzy-set model to situations where term weights are also attached to query terms instead of only to the document terms [6--9]. In the remainder of this study, an extended Boolean re- trieval model is introduced that accommodates both weighted query and weighted document terms• The extended model represents a compromise between the strictness of the con- ventional Boolean system and the lack of structure inherent in the vector-processing system. The extended model thus preserves the query structure inherent in a Boolean system with its distinction between term phrases ("anded" terms) and term synonyms ("ored" terms); at the same time, the model makes it possible to retrieve items that are not retrievable by a conventional Boolean system, and all retrieved items are ranked in decreasing order of query-document similarity. The basic model is introduced in the remainder of this study and evaluation results are used to illustrate the effec- tiveness of the system. 2. THE EXTENDED RETRIEVAL MODEL 2.1. Motivation Consider first the operations of a conventional retrieval sys- tem based on Boolean query formulations. Three document classes may be distinguished with respect to two-term queries such as (A or B) and (A and B): those exhibiting both terms, those containing only one of the terms, and those containing neither term. The or-type query assigns a value of 0 to the items containing neither term and values of I to the remain- ing items; the and-query assigns a value of I to the items containing both terms and values of 0 to the remainder• This situation is represented in Table I(a). When only two terms are under consideration, the term assigment can be repesented by a two-dimensional map, as shown in Figure 1, where each term is assigned a different coordinate axis. It is clear that for and-queries, the (1, 1) point on the map, representing the situation when both terms are present in an item, is the desirable location; for or-queries, on the other hand, the (0, 0) point identifying the situation when both terms are absent from an item is to be avoided. This suggests that a discriminating retrieval system might rank the stored items in order of increasing distance from the (1, 1) point for and-queries, and in order of decreasing distance from (0, 0) for the or-queries. For a document with term weights dA and de for terms A and B, respectively, the normal Euclidian distance from the (0, 0) point is ff(dA -- 0) 2 + (de - 0) 2, whereas the distance from the (1, 1) point is .J(1 - dA) 2 + (1 -- de) 2. It is convenient to operate with normalized distance measures by dividing out the maxi- mum distance of .J2 between the (0, 0) and (1, 1) points. For TABLE I. Query-Document Similarity Values for Conventional and Extended Systems. (a) Conventional Boolean Retrieval Terms Similarity with Query A B (A or B) (A and B) Document D~ 1 1 1 1 Document D2 1 0 1 0 Document D3 0 1 1 0 Document D4 0 0 0 0 (b) Extended Retrieval Document Ol 1 1 1 1 Document D2 1 0 1/v~ 1-1/~/_2 Document 03 0 1 1/~2 1-1/~2 Document D4 0 0 0 0 November 1983 Volume 26 Number 11 Communications of the ACM 1023 Fine 附注 有悖常理的 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 Fine 高亮 RESEARCH CONTRIBUTIONS (a) (0, 1) Similarity value equ_als" (1 - 1/~/2) / / X / / / / 7 \ ,>,.. (1, 1) Ibl (0, 1) Similarity value_equals (1/V2) B ~ X / / ' (1, 1) ~ A >A 0 (1, O) 0 (1, O) (A and B) (A or B) FIGURE 1. Equidistance lines from (1, 1) point for "and" and from (0, O) point for "or." (a) Lines of equal similarity equidistant from (1, 1). (b) Lines of equal similarity equidistant from (0, 0). 0 _< da, ds -< 1, the following similarity measures between queries Q and documents D will then rank the documents in decreasing distance from (0, 0) and increasing distance from (1, 1), respectively. sim(D, QiA.,~)) = "~/d~ + d~ v 2 (1) ~ / (1 - da) 2 + {1 - dB} 2 sim(O, Q{A.dB~} 1 V 2 Table l(b) shows the similarity values obtained by using formulas (1) for the sample documents previously used in Table l(a). It may be seen that when only one query term is present in a given document, the document receives values of 1/V2 and I - 1/x/2 for or and and queries, respectively, rather than I and 0 as in the Boolean system, That is, the presence of a single term in a document is not worth as much as the presence of both terms for or-queries, but is worth more than the absence of both terms for the and case. The formulas of expression (1) are directly applicable to the case of weighted-document terms when 0 _< dA, ds -< 1. In that case, each document is represented by a point on the unit square as shown in Figure 1. The locus of equidistant points from (1, 1) and (0, 0) for and and or, respectively, is shown in Figure I by appropriate lines on the graph. Three documents labeled X, Y, and Z are used as examples in Figure 1. It may be seen that the similarity computation of expression (1) pro- duces a ranking where sim(Q, X) > sim(Q, Y) > sim(Q, Z) with respect to both (A and B) and (A or B). The basic similarity measurements of expression (1) can be extended to the case of weighted-query terms based on term weights 0 _< a, b _< 1 reflecting the importance of the individ- ual terms A, B in the query. In that case "~ /a2d~ + b2d~ sim(D, QIIA~I o, fB~I)) = V a '-T + b 2 sim(D, QtIA~i ..d is.biD) (2) ~ /a2(1 - dA) 2 + b2(1 - dB) 2 1 V a 2 -.I- b 2 When the query terms are fully weighted, that is, a = b = 1, the similarity measures of expression (2) reduce to the basic formulas of expression (1). In the foregoing development, it was assumed that the query-document similarity could be measured by using a nor- malized Euclidian distance between corresponding points in the document space. The notion of distance between points in a document space can be generalized by introducing the well- known Lp vector norm defined for an n-dimensional vector (dl, d2 . . . . . d.) as lid 11~, = lid,, d2 . . . . . d.ll, = (d~' + d~ + . . . + dL') ' / " (3) Since normalized distances are to be used to measure query- document similarities, expression (3) can be rewritten in the form lid lip(normalized} (~) I /P = Ildll,, [ . (d~+d~+. - -+d~)] '/p n (4) Expression (4) exhibits the form previously introduced in (1). The theory of vector norms will now be used as a model for query-document comparisons in a Boolean query environ- ment [10, 11]. 2.2 The p Norm Model Consider a set of terms A1, A2 . . . . . A. and let d^, represent the weight of term A~ in some document D = (dA,, dA=, . . . . dA.) where 0 <_ dA, --< 1. A generalized Boolean or-query, Q~, can now be written as Qorlp) = [(A1, al) or p (A2, a2) or p . . . . or p (A., an)] where aj indicates the weight of query term A. 0 _< a~ _< 1 and 1 _< p _< oo. Similarly, a generalized and-query, Q~ is written as Qand(p} = [(A1, al) and p (A2, a2) and p . . . . and p (A., a.}]. 1024 Communications of the ACM November 1983 Volume 26 Number 11 Fine 附注 等距离点的轨迹 Fine 高亮 Fine 高亮 Fine 高亮 The similarities between a given document D = (da,dA ...... da.) and Q~wJ, Q~I may now be defined as 1"2 sim(D, Qo.~l)=[ a~ +a~+ • .+a . p ] (5) sim(D, Q..d~v~) =1- [ a'p(1 --da,)"+ag(la( + ag--dA~}P++ .-- +"'ag+a~(1-da.)P] '/p (6) It can now be shown that when expressions (5) and (6) are used to compute the similarities between a set of (possible weighted) document terms and a (possibly weighted) or-query and and-query, respectively, the effect of a standard vector- processing retrieval model is obtained when p is set equal to 1. When p = oo and the query and document term weights are limited to 0 or 1, a conventional Boolean retrieval model is produced. Finally, for intermediate values of p between 1 and oo, a retrieval system intermediate between the vector- processing and the Boolean models is obtained [12]. Consider first the situation where p = 1. In that case, it is simple to show that sim(D, Q,~) = sim(D, Q~), because sim(D, Q..~m) =1 _ [0 , (1 -da , ) +a2(la, + a2--daa)+'' '+ "-" + an+a"(1 -da , ) ] = l - [ (a '+a2+'"+a")a ,+- (a~da 'a2+. . .+a2da~+'"+a, +a,dao)] = alda I + a2da2 + . . . + anda, o, .-k 02 -.k . . . .+ an = sim(D, Qo,lu) (7) Expression (7) represents the inner product between a docu- ment D = (de,, dA . . . . . . de.) and query Q ( al a, ) a, .a_ a2 + """ -4- a2 a,, "[- . . . -[-, an In other words, when p = 1, the distinction between the and and or connectives in a query disappears and a simple vector- processing retrieval model is obtained where the similarities between queries and documents are measured by the inner product between the document term weights dA, and the normalized query weights w. represented as a ) 1 + a2 "b . - - "b a . The query weights al are relative rather than absolute weights used to repre- sent the presumed importance of one query term relative to the other query terms. Because of the normalization inherent in the denominator of expres- sions (5) and (6}. the restriction O _< al -< 1 is actually unnecessary. Query weights larger than 1 are acceptable, provided the relative importance of the terms is properly reflected by the term weights. 2 The notations or p, and p as well as or(p), and(p) are used inlerchangeably. RESEARCH CONTRIB{mONS When p = oo, one obtains sim(D, Q~.dt®l) = l iml~ - [ -a'p(1 -- da')P + a~ ~" " -'+ ag(l+ a~- da")P ] 1/p = 1 - max[el(1 - dat) . . . . . a,(1 - da,)] max[a,, a2 . . . . . a.] Assuming that the query terms are all equally weighted, such as when a, = a2 = • • • = a, = 1, one has sim(D, Q..d(o¢)) = 1 - max[(1 -- de,), (1 - dA2) . . . . . (1 - de°)] = min[dA,, dA . . . . . . dAo] Similarly, sim(D, Qor(®l) [a~'d,~, + agd~ + ... + a~'d~,] '/p lp_~im L a( + ag + .-- + a~ J = max[a,da,, a2da~ . . . . . a.da~] max[a,, a2 . . . . . a.] When all query term weights are equal to 1, that is, a, = 02 = . . . . a, = 1, one obtains sim(D, Qo,I~I) = max[dA,, dA2, . . . . dao] This leads to the conclusion that when p = oo and the query is unweighted, the query-document matching function is de- pendent only on the document term of highest weight for Q~ and the document term of lowest weight for Q,~. This is precisely the situation previously mentioned for the fuzzy-set model of retrieval, and by extension, for the conventional Boolean retrieval system when both query and document terms are unweighted. By varying the value of p between I and ~ and using the query-document similarity function of expressions (5) and (6), it is then possible to obtain a system intermediate between a pure vector-processing model (p = 1) and a conventional Boolean retrieval system (p = ~). The larger the value of p, the more importance is given to the query structure as re- flected by the and and or connections. As the p-value de- creases, the distinction between an and connection and an or connection becomes weaker, until that distinction disappears completely as p reaches a lower bound of 1. The effect of the p-value variations is represented schematically in Figure 2. The theory of vector norms can be used to show that the query-document similarity values obtained for I < p < oo are indeed intermediate between those obtainable for the ex- treme cases when p = 1 anc~ p = oo. Indeed sim(D, Qondl~l) --< sim(D, Q.ndCp)) --< sim(D, Q..am = sim(D, Qorl,I) -< sim(D, Qorcpl) -< sim(D, Qo,l®l) (8) The var
本文档为【Extended Boolean IR】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑, 图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。
下载需要: 免费 已有0 人下载
最新资料
资料动态
专题动态
is_920442
暂无简介~
格式:pdf
大小:1MB
软件:PDF阅读器
页数:15
分类:工学
上传时间:2011-09-22
浏览量:15