RESEARCH
CONTRIBUTIONS
Extended Boolean
Information Retrieval
GE~J~ID S~JL~ Cornell University
~]DW,~ A, ~X International Institute for Tropical Agriculture, Ibadan, Nigeria
WRY ~ ITF Programming Technology Center
Gerard Salton has worked on
a number of non-numeric
computer applications
including automatic
information retrieval. He has
published a large number of
articles and several books on
information retrieval, most
recently, "Introduction to
Modern Information
Retrieval," 1983.
Edward Fox is currently
assistant professor of
computer science at VPI.
Harry Wu is currently
involved in the comparative
study of relational database
systems. His main interest
is in information storage
and retrieval.
Authors' Present Addresses:
G. Salton, Dept. of Computer
Science, Cornell Univ.,
Ithaca, NY 14853;
E. A. Fox, Virginia
Polytechnic Institute and
State Univ., Blacksburg, VA
24061;
H. Wu, ITT--Programming
Technology Center, lO00
Oronogue Lane, Stratford, CT
06497.
This study was supported in
part by the National Science
Foundation under Grant
IST-8108696.
Permission to copy without
fee all or part of this material
is granted provided that the
copies are not made or
distributed for direct
commercial advantage, the
ACM copyright notice and
the title of the publication
and its date appear, and
notice is given that copying
is by permission of the
Association for Computing
Machinery. To copy
otherwise, or to republish,
requires a fee and/or specific
permission. © 1983 ACM
0001-0782/83/1100-1022 75¢
1. CONVENTIONAL RETRIEVAL STRATEGIES
In conventional information retrieval, the stored records are
normally identified by sets of key words or index terms, and
requests for information are expressed by using Boolean com-
binations of index terms. The retrieval strategy is normally
based on an auxiliary inverted-term index that lists the corre-
sponding set of document references for each allowable index
term. The Boolean retrieval system is designed to retrieve all
stored records exhibiting the precise combination of key
words included in the query: when two query terms are
related by an and connective, both terms must be present in
order to retrieve a particular stored record; when an or con-
nective is used, at least one of the query terms must be
present to retrieve a particular item. In some systems where
the natural language text of the documents or the document
excerpts is stored, the user queries may be formulated as
combinations of text words. In that case, the queries may
include location restrictions for the query terms--for exam-
ple, a requirement that the query terms occur in the same
sentence of any retrieved document or within some specified
number of words of each other.
Boolean retrieval systems have become popular in opera-
tional situations because high standards of performance are
achievable. Furthermore. the retrieval technology which is
based on list intersections and list unions to implement Boo-
lean conjunction ("A and B") and Boolean disjunction ("A or
B"), respectively, is now well understood [1,2].
The conventional Boolean retrieval technology is however
also saddled with various disadvantages:
1. The size of the output obtained in response to a given
query is difficult to control; depending on the assignment
frequency of the query terms and the actual term combi-
nations used in a query formulation, a great deal of output
can be obtained or, alternatively, no output might be re-
trieved at all.
2. The output obtained in response to a query is not ranked
in any order of presumed importance to the user; thus,
each retrieved item is assumed to be as important as any
other retrieved item.
3. No provisions are made for assigning importance factors or
weights to the terms attached either to the documents or
ABSTRACT: A new, extended
Boolean information-retrieval
system is introduced that is
intermediate between the Boolean
system of query processing and the
vector-processing model. The query
structure inherent in the Boolean
system is preserved, while at the
same time weighted terms may be
incorporated into both queries and
stored documents; the retrieved
output can also be ranked in strict
similarity order with the user
queries. A conventional retrieval
system can be modified to make use
of the extended system. Laboratory
tests indicate that the extended
system produces better retrieval
output than either the Boolean or
the vector-processing system.
1022 Communications of the ACM December 1983 Volume 26 Number 12
Fine
附注
辅助的
Fine
附注
摘录
Fine
附注
背负
Fine
附注
假定的,推定的
Fine
附注
规定
RESEARCH COKrRUmONS
to the queries; thus, all terms included in the documents
and queries are assumed to carry equal importance.
4. Boolean query formulations may produce counterintuitive
results: for example, in response to an or query ("A or B or
• .. or Z"), an item containing only one query term is
deemed just as important as an item containing all query
terms; similarly, given an and query ("A and B and . . , and
Z"), an item containing all but one of the query terms is
assumed to be just as useless as an item containing none of
the query terms.
Some of the disadvantages of the conventional Boolean re-
trieval system are eliminated in the vector-processing retrieval
model [3, 4]. In that case, both the stored records and the
information requests are unstructured and expressed simply
by sets of key words of varying lengths. In the vector-process-
ing system, both query and document terms can be weighted,
and a similarity computation between queries and stored rec-
ords makes it possible to obtain ranked output in decreasing
order of the query-document similarity. By choosing an ap-
propriate retrieval threshold, the user can then obtain as
much or as little output as desired.
The vector-processing system suffers from one major disad-
vantage: the structure inherent in the standard Boolean query
formulation is absent. This implies that it is no longer possible
to incorporate phrase-like constructs or sets of synonymous
terms by using and and or connectives, respectively. {In a
Boolean query formulation, an and.connective may be used
to identify query phrases as in "information and retrieval";
similarly, or connectives may relate synonymous terms as in
"hand-held calculators or pocket computers or microcompu-
ters.")
Various intermediate retrieval systems have been designed
that include features of both the Boolean and the vector-
processing models. For example, in the SInE system (Syracuse
Information Retrieval Experiment}, a standard Boolean query-
processing system may be used first to retrieve items that
respond precisely to the Boolean query formulations. An out-
put ranking is then obtained by using the weights attached to
the document terms to display the retrieved items in decreas-
ing order according to the sum of the weights of the matching
terms in queries and documents [5].
Document weights and output ranking are also incorpo-
rated in various extensions of the classical Boolean systems
based on fuzzy-set theory [6--9]. In the standard fuzzy-set
model of retrieval, the document terms may be weighted and
the queries are processed as standard Boolean formulations.
Given queries "A or B," "A and B," and "not A," a document
X with weights dA(X) and de{X} for terms A and B, respec-
tively, receives the following retrieval values in a fuzzy-set
retrieval system:
max[dA(X}, de{X}] for query {A or B)
min[dA(X), de(X)] for query (A and B)
1 - dA(X) for query (not A)
For binary document vectors where the document term
weights are restricted to 0 and 1, the fuzzy-set model is com-
patible with the Boolean system of retrieval. Unfortunately,
the fuzzy-set system suffers from a lack of discrimination
among the retrieved output nearly to the same extent as the
conventional Boolean retrieval system, since the rank of a
retrieved item depends only on the lowest or highest
weighted document term for and and or queries, respectively.
In addition, it is difficult to extend the fuzzy-set model to
situations where term weights are also attached to query
terms instead of only to the document terms [6--9].
In the remainder of this study, an extended Boolean re-
trieval model is introduced that accommodates both weighted
query and weighted document terms• The extended model
represents a compromise between the strictness of the con-
ventional Boolean system and the lack of structure inherent
in the vector-processing system. The extended model thus
preserves the query structure inherent in a Boolean system
with its distinction between term phrases ("anded" terms) and
term synonyms ("ored" terms); at the same time, the model
makes it possible to retrieve items that are not retrievable by
a conventional Boolean system, and all retrieved items are
ranked in decreasing order of query-document similarity.
The basic model is introduced in the remainder of this
study and evaluation results are used to illustrate the effec-
tiveness of the system.
2. THE EXTENDED RETRIEVAL MODEL
2.1. Motivation
Consider first the operations of a conventional retrieval sys-
tem based on Boolean query formulations. Three document
classes may be distinguished with respect to two-term queries
such as (A or B) and (A and B): those exhibiting both terms,
those containing only one of the terms, and those containing
neither term. The or-type query assigns a value of 0 to the
items containing neither term and values of I to the remain-
ing items; the and-query assigns a value of I to the items
containing both terms and values of 0 to the remainder• This
situation is represented in Table I(a).
When only two terms are under consideration, the term
assigment can be repesented by a two-dimensional map, as
shown in Figure 1, where each term is assigned a different
coordinate axis. It is clear that for and-queries, the (1, 1) point
on the map, representing the situation when both terms are
present in an item, is the desirable location; for or-queries, on
the other hand, the (0, 0) point identifying the situation when
both terms are absent from an item is to be avoided. This
suggests that a discriminating retrieval system might rank the
stored items in order of increasing distance from the (1, 1)
point for and-queries, and in order of decreasing distance
from (0, 0) for the or-queries. For a document with term
weights dA and de for terms A and B, respectively, the normal
Euclidian distance from the (0, 0) point is
ff(dA -- 0) 2 + (de - 0) 2, whereas the distance from the (1, 1)
point is .J(1 - dA) 2 + (1 -- de) 2. It is convenient to operate
with normalized distance measures by dividing out the maxi-
mum distance of .J2 between the (0, 0) and (1, 1) points. For
TABLE I. Query-Document Similarity Values for Conventional and
Extended Systems.
(a) Conventional Boolean Retrieval
Terms Similarity with Query
A B (A or B) (A and B)
Document D~ 1 1 1 1
Document D2 1 0 1 0
Document D3 0 1 1 0
Document D4 0 0 0 0
(b) Extended Retrieval
Document Ol 1 1 1 1
Document D2 1 0 1/v~ 1-1/~/_2
Document 03 0 1 1/~2 1-1/~2
Document D4 0 0 0 0
November 1983 Volume 26 Number 11 Communications of the ACM 1023
Fine
附注
有悖常理的
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
Fine
高亮
RESEARCH CONTRIBUTIONS
(a)
(0, 1)
Similarity
value equ_als"
(1 - 1/~/2)
/ /
X / / / /
7 \ ,>,..
(1, 1)
Ibl
(0, 1)
Similarity
value_equals
(1/V2)
B
~ X / / '
(1, 1)
~ A >A
0 (1, O) 0 (1, O)
(A and B) (A or B)
FIGURE 1. Equidistance lines from (1, 1) point for "and" and from (0, O) point for "or." (a) Lines of equal similarity equidistant from (1, 1).
(b) Lines of equal similarity equidistant from (0, 0).
0 _< da, ds -< 1, the following similarity measures between
queries Q and documents D will then rank the documents in
decreasing distance from (0, 0) and increasing distance from
(1, 1), respectively.
sim(D, QiA.,~)) = "~/d~
+ d~
v 2 (1)
~ / (1 - da) 2 + {1 - dB} 2
sim(O, Q{A.dB~} 1 V 2
Table l(b) shows the similarity values obtained by using
formulas (1) for the sample documents previously used in
Table l(a). It may be seen that when only one query term is
present in a given document, the document receives values of
1/V2 and I - 1/x/2 for or and and queries, respectively,
rather than I and 0 as in the Boolean system, That is, the
presence of a single term in a document is not worth as much
as the presence of both terms for or-queries, but is worth
more than the absence of both terms for the and case.
The formulas of expression (1) are directly applicable to the
case of weighted-document terms when 0 _< dA, ds -< 1. In that
case, each document is represented by a point on the unit
square as shown in Figure 1. The locus of equidistant points
from (1, 1) and (0, 0) for and and or, respectively, is shown in
Figure I by appropriate lines on the graph. Three documents
labeled X, Y, and Z are used as examples in Figure 1. It may
be seen that the similarity computation of expression (1) pro-
duces a ranking where sim(Q, X) > sim(Q, Y) > sim(Q, Z) with
respect to both (A and B) and (A or B).
The basic similarity measurements of expression (1) can be
extended to the case of weighted-query terms based on term
weights 0 _< a, b _< 1 reflecting the importance of the individ-
ual terms A, B in the query. In that case
"~ /a2d~ + b2d~
sim(D, QIIA~I o, fB~I)) = V a '-T + b 2
sim(D, QtIA~i ..d is.biD) (2)
~ /a2(1 - dA) 2 + b2(1 - dB) 2
1 V a 2 -.I- b 2
When the query terms are fully weighted, that is, a = b = 1,
the similarity measures of expression (2) reduce to the basic
formulas of expression (1).
In the foregoing development, it was assumed that the
query-document similarity could be measured by using a nor-
malized Euclidian distance between corresponding points in
the document space. The notion of distance between points in
a document space can be generalized by introducing the well-
known Lp vector norm defined for an n-dimensional vector
(dl, d2 . . . . . d.) as
lid 11~, = lid,, d2 . . . . . d.ll, = (d~' + d~ + . . . + dL') ' / " (3)
Since normalized distances are to be used to measure query-
document similarities, expression (3) can be rewritten in the
form
lid lip(normalized} (~) I /P = Ildll,,
[ . (d~+d~+. - -+d~)] '/p
n
(4)
Expression (4) exhibits the form previously introduced in (1).
The theory of vector norms will now be used as a model for
query-document comparisons in a Boolean query environ-
ment [10, 11].
2.2 The p Norm Model
Consider a set of terms A1, A2 . . . . . A. and let d^, represent
the weight of term A~ in some document D = (dA,, dA=, . . . .
dA.) where 0 <_ dA, --< 1. A generalized Boolean or-query, Q~,
can now be written as
Qorlp) = [(A1, al) or p (A2, a2) or p . . . . or p (A., an)]
where aj indicates the weight of query term A. 0 _< a~ _< 1
and 1 _< p _< oo. Similarly, a generalized and-query, Q~ is
written as
Qand(p} = [(A1, al) and p (A2, a2) and p . . . . and p (A., a.}].
1024 Communications of the ACM November 1983 Volume 26 Number 11
Fine
附注
等距离点的轨迹
Fine
高亮
Fine
高亮
Fine
高亮
The similarities between a given document D = (da,dA ......
da.) and Q~wJ, Q~I may now be defined as 1"2
sim(D, Qo.~l)=[ a~ +a~+ • .+a . p
]
(5)
sim(D, Q..d~v~)
=1- [ a'p(1 --da,)"+ag(la( + ag--dA~}P++ .-- +"'ag+a~(1-da.)P] '/p
(6)
It can now be shown that when expressions (5) and (6) are
used to compute the similarities between a set of (possible
weighted) document terms and a (possibly weighted) or-query
and and-query, respectively, the effect of a standard vector-
processing retrieval model is obtained when p is set equal to
1. When p = oo and the query and document term weights
are limited to 0 or 1, a conventional Boolean retrieval model
is produced. Finally, for intermediate values of p between 1
and oo, a retrieval system intermediate between the vector-
processing and the Boolean models is obtained [12].
Consider first the situation where p = 1. In that case, it is
simple to show that sim(D, Q,~) = sim(D, Q~), because
sim(D, Q..~m)
=1 _ [0 , (1 -da , ) +a2(la, + a2--daa)+'' '+ "-" + an+a"(1 -da , ) ]
= l - [ (a '+a2+'"+a")a ,+- (a~da 'a2+. . .+a2da~+'"+a, +a,dao)]
= alda I + a2da2 + . . . + anda,
o, .-k 02 -.k . . . .+ an
= sim(D, Qo,lu) (7)
Expression (7) represents the inner product between a docu-
ment
D = (de,, dA . . . . . . de.) and query Q
( al a, )
a, .a_ a2 + """ -4- a2 a,, "[- . . . -[-, an
In other words, when p = 1, the distinction between the and
and or connectives in a query disappears and a simple vector-
processing retrieval model is obtained where the similarities
between queries and documents are measured by the inner
product between the document term weights dA, and the
normalized query weights w. represented as
a )
1 + a2 "b . - - "b a .
The query weights al are relative rather than absolute weights used to repre-
sent the presumed importance of one query term relative to the other query
terms. Because of the normalization inherent in the denominator of expres-
sions (5) and (6}. the restriction O _< al -< 1 is actually unnecessary. Query
weights larger than 1 are acceptable, provided the relative importance of the
terms is properly reflected by the term weights.
2 The notations or p, and p as well as or(p), and(p) are used inlerchangeably.
RESEARCH CONTRIB{mONS
When p = oo, one obtains
sim(D, Q~.dt®l)
= l iml~ - [ -a'p(1 -- da')P + a~ ~" " -'+ ag(l+ a~- da")P ] 1/p
= 1 - max[el(1 - dat) . . . . . a,(1 - da,)]
max[a,, a2 . . . . . a.]
Assuming that the query terms are all equally weighted, such
as when a, = a2 = • • • = a, = 1, one has
sim(D, Q..d(o¢))
= 1 - max[(1 -- de,), (1 - dA2) . . . . . (1 - de°)]
= min[dA,, dA . . . . . . dAo]
Similarly,
sim(D, Qor(®l)
[a~'d,~, + agd~ + ... + a~'d~,] '/p
lp_~im L a( + ag + .-- + a~ J
= max[a,da,, a2da~ . . . . . a.da~]
max[a,, a2 . . . . . a.]
When all query term weights are equal to 1, that is, a, = 02 =
. . . . a, = 1, one obtains
sim(D, Qo,I~I) = max[dA,, dA2, . . . . dao]
This leads to the conclusion that when p = oo and the query
is unweighted, the query-document matching function is de-
pendent only on the document term of highest weight for Q~
and the document term of lowest weight for Q,~. This is
precisely the situation previously mentioned for the fuzzy-set
model of retrieval, and by extension, for the conventional
Boolean retrieval system when both query and document
terms are unweighted.
By varying the value of p between I and ~ and using the
query-document similarity function of expressions (5) and (6),
it is then possible to obtain a system intermediate between a
pure vector-processing model (p = 1) and a conventional
Boolean retrieval system (p = ~). The larger the value of p,
the more importance is given to the query structure as re-
flected by the and and or connections. As the p-value de-
creases, the distinction between an and connection and an or
connection becomes weaker, until that distinction disappears
completely as p reaches a lower bound of 1. The effect of the
p-value variations is represented schematically in Figure 2.
The theory of vector norms can be used to show that the
query-document similarity values obtained for I < p < oo are
indeed intermediate between those obtainable for the ex-
treme cases when p = 1 anc~ p = oo. Indeed
sim(D, Qondl~l) --< sim(D, Q.ndCp)) --< sim(D, Q..am
= sim(D, Qorl,I) -< sim(D, Qorcpl) -< sim(D, Qo,l®l) (8)
The var
本文档为【Extended Boolean IR】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。