视频场景检测算法的性能评价

视频场景检测算法的性能评价 Performance Evaluation Protocol for Video Scene Detection Algorithms Xian-Sheng Hua, Dong Zhang, Mingjing Li, Hong-Jiang Zhang Microsoft Research Asia 49 Zhichun Road, Beijing 100080, China {i-xshua; i-dzhang; mjli; hjzhang}@microsoft.com ABSTRACT ...

Performance Evaluation Protocol for Video Scene Detection Algorithms Xian-Sheng Hua, Dong Zhang, Mingjing Li, Hong-Jiang Zhang Microsoft Research Asia 49 Zhichun Road, Beijing 100080, China {i-xshua; i-dzhang; mjli; hjzhang}@microsoft.com ABSTRACT In this paper, we address the problem of performance evaluation (PE) of video scene detection algorithms. An objective and comprehensive protocol for performance evaluation and comparison of different scene detection algorithms is proposed. Unlike traditional precision-recall measures, the proposed PE measures treat false alarms unequally, and take the extent of incorrectness of the wrongly detected scene boundaries into consideration, which can generate more accurate and more reasonable PE results. The protocol has been applied not only on comparing different scene detection algorithms but also on determining the most appropriate thresholds for a given scene detection algorithm. 1. INTRODUCTION Efficient content-based management of digital video is very important for successful retrieval of video libraries and various other video applications. Video temporal structure parsing, which extracts temporal construction unit of video programs, is essential to automatic content-based organization and retrieval of video data. There are usually two basic layers of temporal construction units in videos, namely, shot and scene. A video shot may be defined as a sequence of frames captured by a single camera in a single continuous action in time and space [1], while a video scene is defined as a group of content correlated (in time, space, etc.) shots, which is a higher level semantic unit. Numerous video retrieval and management tasks rely on accurate shot detection results and reasonable scene detection results. Lots of shot detection approaches have been proposed in literatures [1][2][3]. Automatic shot detection will become very easy when using digital video camera or using digital editing tools, since shot boundary information can be accessed directly from the metadata. Dissimilar with shot detection, scene detection is semantic related and needs to do more work on content analysis. There are also a number of scene detection algorithms proposed in recent literatures [5][6][10]. However, all the proposed methods are evaluated on different data sets using different methods by the corresponding authors. To compare different methods, it is necessary to evaluate their performance under a general performance evaluation (PE) protocol and a common data set. Furthermore, scene detection is not a totally objective task. That is to say, different people may have different views on where the scene boundaries should be located. This makes the evaluation of scene detection algorithms quite difficult. Our motivation of the work presenting in this paper is try to solve these problems by proposing a general PE protocol for scene detection algorithms. Previous work on shot detection performance evaluation (PE) has been reported in Video TREC [7], which provides an evaluation standard based on precision and recall [8]. This PE protocol for shot detection has been widely accepted. As for shot detection, we can get exact and unbiased ground-truth since start and end locations of shots are well defined. For scene detection, this cannot be expected since different users have different cognitions about scene. In previous PE work, three major problems are addressed [9]. The first problem is that there is no common data set. The second one is that the definition of the ground truth depends on individual users. The last one is feature dependence of various algorithms. In [9], although the first and the last problems are taken into account, the evaluation algorithm is not accurate enough, as to be explained later. In this paper, we have further studied the problems that must be considered on PE for scene detection, including the second problem above that is not solved in paper [9]. The rest of the paper is organized as follows. In Section 2, we introduce the basic concept of scene detection evaluation, together with two traditional PE measures. Proposed PE protocol is then presented in Section 3. The experimental results of the protocol on comparing different video scene detection algorithms and on determining of the best threshold for a given scene detection algorithm are described in Section 4. Conclusion remarks are given in Section 5. 2. EVALUATION OF SCENE DETECTION The goal of scene detection evaluation consists of two parts. One part is to identify whether an automatic scene detection result is correct compared with manually labeled ground-truth. If not, another part is to measure how far the scene detection result is away from the correct one. That is, we formulate the scene detection evaluation problem as how well the automatic segmentation result fits the manual labeled ground-truth. Fig. 1 is employed to illustrate this problem. 1 2 3 4 5 6 7 8 9 10 Scene Boundary Shot Ground-truth 2 1 3 4 5 6 7 8 9 10 Detected result A B C D E F Fig. 1. Problem formulation of scene detection PE. In Fig. 1, scene boundary A, D and F are correct, while C is not correct. There is one shot between the correct position B and the detected boundary C. And E is a missing scene boundary. Traditionally, different measures have been proposed to compute the error or correct rate over the results of different segmentation methods. Precision and recall are the widely used performance measures for information retrieval [8], which can be defined by the following symbols, :CN the number of scene boundaries which are correctly found by the automatic scene detection algorithm. :IN the number of scene boundaries which are inserted by the automatic scene detection algorithm, i.e., the number of false alarms (false positives). :DN the number of scene boundaries which are deleted by automatic scene detection algorithm, i.e., the number of false negatives. Precision and Recall are defined as IC C NN Nprecision + = (1) DC C NN N recall + = (2) The performance for an “ideal” scene detection algorithm has high values of both precision and recall. However, traditional precision-recall measures are not comprehensive enough for the evaluation task. These measures treat all incorrect detected scene boundaries equally. For example, three false alarms in one scene should not have the same PE value as three false alarms in three different scenes, because there are two more scenes are correctly segmented in the first result. In addition, some of the missing (not recalled) scene boundaries are not really “missed” but located in different positions. Sometimes the wrong positions are very close to the correct ones. Previous PE measurers can only reveal that they are incorrect but cannot indicate how incorrect they are. The proposed PE measures in this paper can distinguish the above cases, thus can generate more accurate and more reasonable PE results. 3. PROPOSED EVALUATION PROTOCOL In our PE protocol, three PE measures are defined to evaluate different scene detection algorithms, namely, False Positive Index (FPI), False Negative Index (FNI) and Overall Performance Index (OPI). For convenience, we firstly define a set of symbols which will be used in the protocol representation, as listed below. 1. Shot and shot boundaries { }110 , , −= NShotShotShotShot �L : all shots in the video ( )iShotRB / ( )iShotLB : left/right shot boundary (represented by a frame number or time) where N is the number of shots in the video. 2. Scene and scene boundaries { }110 , , −= MSceneSceneSceneScene �L : all scenes in the video { }kNkkk kShotShotShotScene 110 , , −= �L : one scene )( kSceneLBS / )( kSceneRBS : left/right scene boundary (represented by a shot) ( )( )kk SceneLSBLBSceneLB )( = : left scene boundary ( )( )kk SceneRSBRBSceneRB )( = : right scene boundary { }MSceneBSceneBSceneBSceneB , , , 20 L= : scene boundary set    = < = MKSceneRB MkSceneLB SceneB k k k ),( ),( : scene boundaries where NN M k k =∑ − = 1 0 , M is the number of scenes in the video. 3. Scene length (duration) ( )kSceneLen : length of a scene (represented by the number of frames or duration) 4. Superscript In the above symbols, we distinguish the items of ground truth from the ones of automatically detected results by adding superscript “g” (for ground-truths) and “d” (for automatically detected results), respectively. 3.1 False Positive Index (FPI) False Positive Index is defined to measure to what extent that false scene boundaries are wrongly detected by the automatic algorithms. For a scene g kScene in ground-truth, all Detected Scene Boundaries within this scene are represented by ( ) ( ){ }gkdjgkdjk SceneRBSceneBSceneLBSceneBDSB <<= : (3) As aforementioned, false alarms in one scene can not be treated the same as the same number of false alarms in different scenes. According to this idea, we define the false positive performance index for one scene as .10 , , 1 11 12 <<= − − =++++= − α α α ααα kk D D k DSBDwhere FPI k kL (4) where ⋅ denotes the number of elements in a set. In the above definition, we decrease the “penalty” of false alarms in one scene when the number of false alarms increases. We may use other decreasing scheme instead of the one in equation (4), but the value generated by this method can be easily normalized. The parameterα here indicates how we “penalize” the second and later false alarms in one scene. The bigger the parameter is, the more “penalty” we set. In fact, when 1=α , FPI is degraded into traditional measure. While if 0=α , FPI has the same value as the measure in [9]. Indeed, α can be determined by user’s preference. In our experiments, α is set to 2/3. For all scenes in the video, the overall FPI is defined as ( ) ∑ − = − = 1 0 1 gM k kg FPIM FPI α (5) Obviously, FPI is normalized to [0, 1]. From the above equation, we can see that the “penalty” of different false scene boundaries are not equal, which is one of the main distinctions between our scheme and previous/traditional PE works. The advantage of our scheme can be illustrated by the examples below. Fig. 2 shows an example of ground-truth scene boundaries and three automatically detected results. On the one hand, if we use traditional evaluation scheme, the PE measures of Result 1 and 2 will be the same. However, although there are four false alarms, two of the three scenes are correctly segmented in Result 2, while none of the three scenes is correctly segmented in Result 1. So Result 2 is obviously better than Result 1. On the other hand, if we evaluate the three results by the method proposed in [9], the PE measures of Results 2 and 3 will be the same. However, Result 3 is obviously better than Result 2. According to our PE measures, the values of FPI of the three results are 0.407, 0.267 and 0.111, respectively, which clearly indicate that Result 3 is much better than Result 2, and Result 2 is much better than Result 1. Therefore, this PE measure (FPI) can generate more reasonable evaluation results. Ground-truth Detected Result 2 Detected Result 1 Detected Result 3 Fig. 2. Examples that illustrate the advantages of FPI. 3.2 False Negative Index (FNI) False Negative Index is defined not only to measure how many scene boundaries are missed by the automatic detection algorithms but also measure to what extent the scene boundaries are missed. For a scene boundary g kSceneB in ground-truth, the Closest Scene Boundary in the detected results is denoted by d jk SceneBCSB = (6) where { }dgkdigk d i g k i MiSceneBSceneBSceneB SceneBSceneBj ≤≤<< −= +− 0, ,minarg 11 (7) Then the performance index of false negative for this scene boundary is defined by (as illustrated by Fig. 3) ( ) ( )     = − g k k g k k k SceneLenSceneLen FNI δδ � 1 ,1min (8) where kδ is the distance of gkSceneB and kCSB , i.e., k g kk CSBSceneB −�δ (9) Ground-truth Detect Result g kScene 1− gkScene kδ g kSceneB kCSB Fig. 3. FNI calculation. For all scenes in the video, the overall FNI is ∑ − = − = 1 11 1 gM k kg FNIM FNI (10) From Equation (6)-(10) and Fig. 3, we can see, this measure can indicate how well the detected scene boundaries fit with the closest ground-truth scene boundaries. If we use traditional method mentioned in Section 2 and the scheme proposed in [9], all scene boundaries that are not exactly detected will be counted as misses without any further discrimination, and the distance kδ will not affect the PE result. Unlike these two methods, our scheme can produce more accurate and more reasonable PE results by taking kδ into account. 3.3 Overall Performance Index (OPI) The definition of Overall Performance Index is to present a single and concise measure of overall performance for a given scene detection algorithm. For a ground-truth hg , the overall evaluation index is (now add superscript h to distinguish the different ground-truths labeled by different persons or at different times by the same person) ( ) 10 ,1 ≤≤−+= βββ hhh FNIFPIEr (11) where β is the relative importance of false positive and false negative. In our experiments, we set 5.0=β , which means we treat false positive and false negative the same. If we care false positive more than false negative, we may set 5.0>β , and vice versa. Note that this index measures how “bad” the detected result is, so to measure how “good”, we use hh ErOPI −=1 (12) With appropriate values of β , hOPI gives a quantitative performance evaluation of the tested algorithm. The higher the hOPI , the better the algorithm. Generally, PE requires an exact and unbiased ground-truth. However, it is difficult for evaluation of scene detection, since the definition of scene is ambiguous and different users have different cognitions about what is scene. So, unlike shot detection evaluation, unbiased and exact ground-truth is not expected to be obtained for scene detection evaluation. Although it is hard to completely solve this problem, we average the PE results over a group of ground-truths labeled by different people or by the same people but at different times. That is, for a ground-truth set { }1210 ,, , −= HggggG L� , the final Overall Performance Index is defined as ∑ − = = 1 0 1 H h hOPI H OPI (13) For FPI and FNI, we can also get average values over all ground- truths in the same way. The performance of an “ideal” algorithm is expected to have lower values for the first two PE measures (FPI and FNI) and higher value for the third PE measure (OPI). 4. EXPERIMENT RESULTS In previous scene evaluation scheme [9], only one set ground- truth is collected, while in our experiments, fifteen students majored in arts are instructed to label ground-truths using a tool developed by us. Our video database consists of 10 video clips. The total duration is about two hours. Some clips are from MPEG-7 test sequences and others are from raw home videos, which have covered various complexities. Shot detection and key frame extraction are performed on all video data. Then, the fifteen users are instructed to manually label ground-truth for the scene boundaries. Fifteen ground-truths for the same video set are obtained. Thus, for every automatic detection result, we get fifteen PE results. The final PE result is generated by equation (13) in Section 3. This method is applied in two experiments, as described below. Experiment One The goal of the Experiment One is to compare the performance of different scene detection algorithms using the proposed PE measures. Three scene detection algorithms are implemented [5][6][10]. The evaluation results of the three algorithms, including the two traditional measures (precision and recall) are listed in Table 2. Table 1. PE results of three scene detection algorithms. Algorithm FPI FNI OPI Precision Recall Alg. 1 in [5] 0.220 0.229 0.775 0.490 0.635 Alg. 2 in [6] 0.106 0.411 0.742 0.600 0.469 Alg. 3 in [10] 0.221 0.470 0.654 0.435 0.471 From the table, we can see that Alg. 1 is the best one. For these three algorithms, this result matches quite well with precision- recall measures, although some values are not completely accordant. For instance, the recall rate of Alg. 2 is lower than that of Alg. 3, while FNI of Alg. 2 is lower than that of Alg. 3 too. Experiment Two The goal of the Experiment Two is to determinate the best threshold for a given scene detection algorithm [6] by our proposed PE measures. The algorithm is based on a hierarchical clustering process, which greedily merges the two temporally adjacent shots with minimum distance (or maximum similarity) among all pairs of temporally adjacent shots. The merging process is repeated until the minimum distance is larger than a pre-defined threshold D. The smaller the threshold D, the larger the number of the detected scenes. One extreme is that no merging is performed. Thus every shot is identified as one scene, so the number of the scenes is equal to the number of the shots. This is the extreme over-segmentation and the maximum scene number is got. Another extreme is that all shots are merged into one scene. This is the extreme under-segmentation, and the minimum number of scenes is got. With our PE measures, we could determine the best threshold for the algorithm that could avoid over- segmentation and under-segmentation at the same time to the utmost. The PE curves of different thresholds are shown in Fig.5. Fig. 4. PE curves: performance versus threshold. From Fig.5, we can see that the FPI is a non-decreasing function of the threshold, while the FNI is a non-increasing function of the threshold. This is because less and less scene boundaries are detected when the threshold increases. The maximum value of OPI is got near the intersection point of FPI and FNI. 5. CONCLUSION AND FUTURE WORK In this paper, we propose an objective and comprehensive PE protocol for video scene detection algorithms, which is consisted of three measures, including False Positive Index, False Negative Index and Overall Performance Index. Unlike traditional precision-recall measures and previous works, the proposed PE measures treat false alarms unequally, and take the extent of incorrectness of the wrongly detected scene boundaries into consideration, which can generate more accurate and more reasonable PE results. Our PE protocol has been applied not only on comparing different video scene detection algorithms under common data set, but also on determining the best threshold that can yield the best detection results for a given scene detection approach. The proposed protocol can also be helpful to improve, select and even design new scene detection algorithms, e.g., to check whether a threshold is sensitive to noises and other environment variables or not. 6. REFERENCE [1] X. U. Cabedo and S. K. Bhattacharjee, “Shot Detection Tools in Digital Video,” Proc. of Non-linear Model Based Image Analysis 1998, Springer Verlag, pp 121-126, Glasgow, July 1998. [2] H.J. Zhang, et al, “Automatic Partitioning of Full-Motion Vide

                    本文档为【视频场景检测算法的性能评价】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

视频场景检测算法的性能评价

你可能还喜欢