The Principle of Information Retrieval (Bilingual Courseware), Chapter 5: Retrieval Evaluation

The Principle of Information Retrieval
Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011

II Course Contents
5 Evaluation in information retrieval

How to measure user happiness
- The key utility measure is user happiness
- Relevance of results (effectiveness)
- Speed of response
- Size of the index
- User interface design (independent of the quality of the results)
- Utility, success, completeness, satisfaction, worth, value, time, cost, ...

5.1 Introduction

The role of evaluation
- Evaluation plays an important role in information retrieval research
- Evaluation has always been at the core of IR system development, to the point that an algorithm and the way its effectiveness is evaluated are inseparable (Saracevic, SIGIR 1995)

The origins of IR system evaluation
- Kent et al. were the first to propose the concepts of Precision and Recall (the latter initially called "relevance") (Kent, 1955)
- Cranfield-like evaluation methodology: from the late 1950s to the early 1960s, Cranfield developed an evaluation scheme based on a set of sample queries, an answer set, and a document corpus; it is known as the "grand-daddy" of IR evaluation and established the central position of evaluation in IR research. Cranfield is a place name as well as the name of a research institute
- Gerard Salton and the SMART system: Salton was the principal developer of SMART, which for the first time provided a research platform on which one could concentrate on algorithms without worrying about indexing and the like, and which also computed the common evaluation metrics once an answer set was supplied
- Sparck Jones's book "Information Retrieval Experiment" is mainly devoted to IR experiments and evaluation

How to measure information retrieval effectiveness (1/2)
- We need a test collection consisting of three things:
- A document collection
- A test suite of information needs, expressible as queries
- A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair

How to measure information retrieval effectiveness (2/2)
- In this test collection, a document is given a binary classification as either relevant or not relevant
- The collection and the suite of information needs have to be of a reasonable size
- Results are highly variable over different documents and information needs
- At least 50 information needs are needed

Difficulties
- The difference between the stated information need and the query: relevance is assessed relative to an information need, not a query
- The subjectivity of relevance decisions
- Many systems contain various parameters that can be adjusted to tune system performance; the correct procedure is to have one or more development test collections
- Voorhees estimated that judging relevance against one query topic over a collection of 8 million documents would take a single assessor about 9 months
- TREC introduced the pooling method, which greatly reduces the judging workload while keeping the evaluation results reliable
- Drawback: only a small number of queries can be handled; even for a small query set it still takes more than ten assessors one to two months
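The pooling method mentioned above can be made concrete with a short sketch. This is only an illustration, not TREC's actual tooling: it assumes each participating system submits a ranked list per topic, and the union of the top-k documents from every run forms the pool that human assessors judge.

```python
from typing import Dict, List, Set

def build_pool(runs: Dict[str, List[str]], k: int = 100) -> Set[str]:
    """Form the judgment pool for one topic: the union of the top-k
    documents returned by each participating system (run)."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

# Hypothetical runs for a single topic: document IDs in rank order.
runs = {
    "system_A": ["d3", "d7", "d1", "d9"],
    "system_B": ["d7", "d2", "d3", "d8"],
}
print(sorted(build_pool(runs, k=3)))
# Only pooled documents get judged; unjudged documents are treated as
# not relevant when scoring.
```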
5.2 Standard test collections
- The Cranfield collection
- Text Retrieval Conference (TREC)
- GOV2
- NII Test Collections for IR Systems (NTCIR)
- Reuters-21578 and Reuters-RCV1

The Cranfield collection
- Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries
- Allows precise quantitative measures, but is too small

Text Retrieval Conference (TREC) (1/2)
- The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992, over a range of different test collections
- The best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations, from 1992 to 1999
- They comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles)

Text Retrieval Conference (TREC) (2/2)
- TRECs 6-8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles
- Relevance judgments are available only for the documents that were among the top k returned by some system entered in the TREC evaluation

GOV2
- Contains 25 million web pages
- One of the largest web test collections, but still more than 3 orders of magnitude smaller than the document collections indexed by the large web search companies

NII Test Collections for IR Systems (NTCIR)
- Similar in size to the TREC collections
- Focuses on East Asian languages and cross-language information retrieval

Reuters-21578 and Reuters-RCV1
- Mostly used for text classification
- The Reuters-21578 collection contains 21578 newswire articles
- Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents

5.3 Evaluation of unranked retrieval sets
- Precision
- Recall
- Accuracy
- F measure

Precision
- Precision (P) is the fraction of retrieved documents that are relevant

Recall
- Recall (R) is the fraction of relevant documents that are retrieved

The comparison of P and R
- Typical web surfers prefer P to R
- Various professional searchers, such as paralegals and intelligence analysts, prefer R to P
- Individuals searching their own hard disks prefer R to P
- The two quantities clearly trade off against one another
- Recall is a non-decreasing function of the number of documents retrieved: one can always reach a recall of 1 (with very low precision) by retrieving all documents for all queries
- Precision usually decreases as the number of documents retrieved increases

Which is more difficult to measure: precision or recall?

Questions about recall
- Its denominator cannot be determined exactly, so a recall figure built on it is not practical
- Why were the relevant documents not retrieved: the quality of the database design, or the user's searching skill?
- Is it a matter of indexing or of searching?
- Does every database have an inherent relevance coefficient?

Questions about precision
- When a database contains many documents that should be found but cannot be, is it meaningful that the retrieved documents are 100% relevant?
- How do the irrelevant documents in the denominator of precision arise: are they caused by the system, by the user's poorly expressed query, or by the user's final selection?

Relative precision

Questions about the relationship between recall and precision (1/2)
- a: relevant and retrieved
- b: irrelevant and retrieved
- c: relevant and not retrieved
- d: irrelevant and not retrieved

Questions about the relationship between recall and precision (2/2)
- The two are generally assumed to be inversely related: if a/(c+a) increases, c must decrease; a decrease in c pushes b up, so a/(b+a) decreases, and vice versa
- But this argument assumes that a is constant; what is actually fixed is (c+a), so if c decreases, a increases, and a is itself a variable
- Therefore there is no necessary connection between b and c; one can imagine both shrinking at the same time. b = c = 0 is unlikely, but not impossible
- In practice, a retrieval system can improve both measures at the same time

Contingency table: the other way to define P and R
- P = tp/(tp + fp)
- R = tp/(tp + fn)

F measure (1/4)
- A single measure that trades off precision versus recall
- α ∈ [0, 1] and thus β² ∈ [0, ∞]
- The default balanced F measure uses β = 1

F measure (2/4)
- Values of β < 1 emphasize precision, while values of β > 1 emphasize recall
- It is a harmonic mean rather than the simpler arithmetic average

F measure (3/4)
- The harmonic mean is always less than either the arithmetic or the geometric mean, and is often quite close to the minimum of the two numbers
- This strongly suggests that the arithmetic mean is unsuitable here, because it stays closer to the maximum of the two values than the harmonic mean does

F measure (4/4)
- F = 1 / (α(1/P) + (1-α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1-α)/α
- The balanced F1 = 2PR / (P + R)

Accuracy (1/2)
- Accuracy is the fraction of classifications that are correct
- An information retrieval system can be thought of as a two-class classifier
- Accuracy = (tp + tn)/(tp + fp + fn + tn)

Accuracy (2/2)
- Often used for evaluating machine learning classification problems
- Not an appropriate measure for information retrieval problems: normally over 99.9% of the documents are in the not-relevant category
- Accuracy can be maximized by simply deeming all documents irrelevant to all queries; that is to say, tn can be far too large
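All of the unranked measures above can be computed from the four contingency-table counts. Below is a minimal sketch (function and variable names are illustrative, not from the courseware); the second call shows how a huge tn count inflates accuracy even when nothing relevant is returned.

```python
def unranked_metrics(tp: int, fp: int, fn: int, tn: int, beta: float = 1.0):
    """Precision, recall, F_beta and accuracy from contingency-table counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f_beta = ((b2 + 1) * precision * recall / (b2 * precision + recall)
              if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_beta, accuracy

# A more useful system: non-zero P, R and F1.
print(unranked_metrics(tp=60, fp=40, fn=30, tn=99870))
# A system that returns nothing relevant still scores 99.9% accuracy
# when almost every document is non-relevant (tn dominates).
print(unranked_metrics(tp=0, fp=10, fn=90, tn=99900))
```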
5.4 Evaluation of ranked retrieval results
- Precision, recall, and the F measure are set-based measures, computed over unordered sets of documents
- In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents

The types
- Precision-recall curve
- Mean Average Precision (MAP)
- Precision at k
- R-precision
- ROC curve

Precision-recall curve
- If the (k+1)th document retrieved is irrelevant, recall is the same as for the top k documents, but precision drops
- If it is relevant, both precision and recall increase, and the curve jags up and to the right
- How to read the curve: the more sharply the curve drops away (the more cliff-like it is), the less effective the system

Precision-recall curve with interpolated precision
- Interpolation removes the jiggles in the curve
- The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r
- The justification is that almost anyone would be prepared to look at more documents if doing so increased the percentage of the viewed set that is relevant

Precision-recall curve with 11-point interpolated average precision
- Boils the curve's information down to a few numbers
- The 11 points are the recall levels 0.0, 0.1, 0.2, ..., 1.0
- For each recall level, we calculate the arithmetic mean of the interpolated precision at that level over all information needs in the test collection

Mean Average Precision (MAP) (1/3)
- The most standard measure in the TREC community
- A single-figure measure of quality across recall levels
- Has especially good discrimination and stability

Mean Average Precision (MAP) (2/3)
- For a single information need, Average Precision is the average of the precision values obtained for the set of top k documents after each relevant document is retrieved

Mean Average Precision (MAP) (3/3)
- The set of relevant documents for an information need qj ∈ Q is {d1, . . . , dmj}, and Rjk is the set of ranked retrieval results from the top result down to document dk
- MAP(Q) = (1/|Q|) Σj (1/mj) Σk Precision(Rjk)

The disadvantage of MAP
- No fixed recall levels are chosen; precision is averaged over all recall levels
- Calculated scores normally vary widely across information needs (roughly 0.1 to 0.7)
- So a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries

Precision at k
- What matters is how many good results there are on the first page or the first k pages, especially for a search engine
- This method measures precision at fixed low levels of retrieved results, such as 10 or 30 documents
- It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantage of being the least stable of the measures

R-precision (1/2)
- It requires a set of known relevant documents of size Rel, from which we calculate the precision of the top Rel documents returned
- Like precision at k, R-precision describes only one point on the precision-recall curve
- At that point precision and recall are equal (r/Rel), which is identical to the break-even point of the PR curve

R-precision (2/2)
- Even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 relevant documents
- Averaging this measure across queries thus makes more sense

ROC curve (1/3)
- Receiver Operating Characteristic (ROC)
- The ROC curve plots the true positive rate (sensitivity, i.e. recall) against the false positive rate (1 - specificity)
- The false positive rate is given by fp/(fp + tn); specificity is given by tn/(fp + tn)

ROC curve (2/3)
- A ROC curve always goes from the bottom left to the top right of the graph
- For a good system, the graph climbs steeply on the left side
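The ranked measures above can be sketched in a few lines of Python, assuming each result list is given as relevance flags (1/0) in rank order; the function names and the example query are illustrative, not from the courseware.

```python
from typing import List

def precision_at_k(rels: List[int], k: int) -> float:
    """Precision over the top k results; rels[i] is 1 if the document at
    rank i+1 is relevant, else 0."""
    return sum(rels[:k]) / k

def average_precision(rels: List[int], num_relevant: int) -> float:
    """Average of precision@k taken at each rank k where a relevant document
    appears; relevant documents never retrieved contribute 0."""
    hits, total = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(runs: List[List[int]], rel_counts: List[int]) -> float:
    """MAP(Q): the mean of average precision over all information needs."""
    return sum(average_precision(r, m) for r, m in zip(runs, rel_counts)) / len(runs)

def interpolated_precision(rels: List[int], num_relevant: int, r: float) -> float:
    """Highest precision at any cutoff whose recall is >= r
    (the quantity used for the 11-point interpolated curve)."""
    best, hits = 0.0, 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        if num_relevant and hits / num_relevant >= r:
            best = max(best, precision_at_k(rels, k))
    return best

# One query: relevant documents sit at ranks 1, 3 and 6; 4 relevant exist in total.
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_k(ranked, 5))               # 0.4
print(average_precision(ranked, 4))            # (1/1 + 2/3 + 3/6) / 4
print(interpolated_precision(ranked, 4, 0.5))  # best precision at recall >= 0.5
```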
5.5 Assessing relevance

About information needs
- Information needs must be germane to the documents
- Information needs are best designed by domain experts
- Using random combinations of query terms as an information need is generally not a good idea, because such combinations will not resemble the actual distribution of information needs

About assessment (1/2)
- A time-consuming and expensive process involving human beings
- For tiny collections like Cranfield, exhaustive judgments of relevance for each query-document pair were obtained
- For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query
- The most standard approach is pooling, where relevance is assessed over a subset of the collection formed from the top k documents returned by the participating systems

About assessment (2/2)
- Humans and their relevance judgments are quite variable
- But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these variable humans
- We do, however, need to measure the agreement of these assessments

Kappa statistic (1/5)
- It measures how much agreement there is between relevance judgments
- P(A) is the proportion of the times the judges agreed
- P(E) is the proportion of the times they would be expected to agree by chance, which needs to be estimated

Kappa statistic (2/5)
- kappa = (P(A) - P(E)) / (1 - P(E))

Kappa statistic (3/5)
- P(A) = (300 + 70)/400 = 370/400 = 0.925
- P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
- P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7878
- P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7878² = 0.665
- kappa = (P(A) - P(E))/(1 - P(E)) = (0.925 - 0.665)/(1 - 0.665) = 0.776

Kappa statistic (4/5)
- A kappa value above 0.8 is taken as good agreement
- A kappa value between 0.67 and 0.8 is taken as fair agreement
- A kappa value below 0.67 is seen as a dubious basis for evaluation
- If there are more than two judges, it is normal to calculate an average pairwise kappa value

Kappa statistic (5/5)
- The level of agreement in TREC normally falls in the range of "fair" (0.67-0.8), so more fine-grained relevance labeling is not called for
- These deviations have in general been found to have little impact on the relative effectiveness ranking of systems, despite the variation in individual assessors' judgments
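The worked example above can be reproduced with a short sketch, assuming two judges and their binary judgments summarized in a 2x2 agreement table (the parameter layout and function name are illustrative):

```python
def kappa(both_rel: int, both_nonrel: int, only_a_rel: int, only_b_rel: int) -> float:
    """Kappa for two judges making binary relevance judgments.

    both_rel / both_nonrel: documents the judges agree on;
    only_a_rel / only_b_rel: documents where exactly one judge says relevant.
    """
    n = both_rel + both_nonrel + only_a_rel + only_b_rel
    p_agree = (both_rel + both_nonrel) / n
    # Pooled marginals: how often "relevant" / "nonrelevant" is said overall.
    p_rel = (2 * both_rel + only_a_rel + only_b_rel) / (2 * n)
    p_nonrel = (2 * both_nonrel + only_a_rel + only_b_rel) / (2 * n)
    p_chance = p_rel ** 2 + p_nonrel ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# The numbers behind the slide: 300 documents judged relevant by both judges,
# 70 nonrelevant by both, 20 relevant only by judge A, 10 only by judge B.
print(round(kappa(300, 70, 20, 10), 3))  # ~0.776
```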
About relevance
- Some dubious hypotheses behind standard evaluation:
- The relevance of one document is treated as independent of the relevance of other documents in the collection
- Assessments are binary rather than nuanced
- The information need is treated as an absolute, objective decision
- So any results are heavily skewed by the choice of collection, queries, and relevance judgment set

Marginal relevance
- Whether a document still has distinctive usefulness after the user has looked at certain other documents
- The most extreme case is duplicate documents, a phenomenon that is actually very common on the Web
- Maximizing marginal relevance requires returning documents that exhibit diversity and novelty

5.6 Evaluation of search engines
- An evaluation of nine popular search engines (Huajun Software Park (华军软件园) evaluation center, 2008-07-21), covering: main types of search services; average search speed; display of search results; and search performance tests for web page, image, audio, video, and file search
- An evaluation of the nine major U.S. search engines

Automatic evaluation based on user behavior analysis
- A combination of the standard test collection approach and query log analysis
- Objectively reflects changes in search engine retrieval quality from the perspective of aggregate user behavior
- Provides a fast, semi-real-time platform for evaluating the retrieval effectiveness of the major search engines

Click concentration
- A feature computed from query history information
- Hypothesis: when different users share the same retrieval need, their clicks concentrate on one page or a few pages
- For navigational needs the target page is unique; for informational needs the target pages are diverse
- Feature: the distribution of user clicks, which differs across search engines (example query: "movie" (informational))
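The click-concentration feature described above can be illustrated as follows. This is only a sketch under assumed data: a click log of (query, clicked URL) pairs, with concentration measured as the share of clicks going to the single most-clicked result; the courseware does not specify the exact statistic, so this choice is illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def click_concentration(log: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """For each query, the fraction of its clicks that go to the single
    most-clicked page. Values near 1.0 suggest a navigational need with a
    unique target; lower values suggest an informational need."""
    clicks_per_query: Dict[str, Counter] = defaultdict(Counter)
    for query, url in log:
        clicks_per_query[query][url] += 1
    return {q: max(c.values()) / sum(c.values()) for q, c in clicks_per_query.items()}

# Hypothetical log entries for one navigational and one informational query.
log = [
    ("nanjing university of finance and economics", "nufe.edu.cn"),
    ("nanjing university of finance and economics", "nufe.edu.cn"),
    ("nanjing university of finance and economics", "nufe.edu.cn"),
    ("movie", "site-a.example/film1"),
    ("movie", "site-b.example/reviews"),
    ("movie", "site-c.example/tickets"),
]
print(click_concentration(log))
# {'nanjing university of finance and economics': 1.0, 'movie': 0.333...}
```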