Performance Evaluation Protocol for Video Scene Detection Algorithms
Xian-Sheng Hua, Dong Zhang, Mingjing Li, Hong-Jiang Zhang
Microsoft Research Asia
49 Zhichun Road, Beijing 100080, China
{i-xshua; i-dzhang; mjli; hjzhang}@microsoft.com
ABSTRACT
In this paper, we address the problem of performance evaluation
(PE) of video scene detection algorithms. An objective and
comprehensive protocol for performance evaluation and
comparison of different scene detection algorithms is proposed.
Unlike traditional precision-recall measures, the proposed PE
measures treat false alarms unequally, and take the extent of
incorrectness of the wrongly detected scene boundaries into
consideration, which can generate more accurate and more
reasonable PE results. The protocol has been applied not only on
comparing different scene detection algorithms but also on
determining the most appropriate thresholds for a given scene
detection algorithm.
1. INTRODUCTION
Efficient content-based management of digital video is very
important for successful retrieval of video libraries and various
other video applications. Video temporal structure parsing,
which extracts temporal construction unit of video programs, is
essential to automatic content-based organization and retrieval of
video data. There are usually two basic layers of temporal
construction units in videos, namely, shot and scene. A video
shot may be defined as a sequence of frames captured by a single
camera in a single continuous action in time and space [1], while
a video scene is defined as a group of content correlated (in time,
space, etc.) shots, which is a higher level semantic unit.
Numerous video retrieval and management tasks rely on accurate
shot detection results and reasonable scene detection results.
Lots of shot detection approaches have been proposed in
literatures [1][2][3]. Automatic shot detection will become very
easy when using digital video camera or using digital editing
tools, since shot boundary information can be accessed directly
from the metadata. Dissimilar with shot detection, scene
detection is semantic related and needs to do more work on
content analysis.
There are also a number of scene detection algorithms proposed
in recent literatures [5][6][10]. However, all the proposed
methods are evaluated on different data sets using different
methods by the corresponding authors. To compare different
methods, it is necessary to evaluate their performance under a
general performance evaluation (PE) protocol and a common
data set. Furthermore, scene detection is not a totally objective
task. That is to say, different people may have different views on
where the scene boundaries should be located. This makes the
evaluation of scene detection algorithms quite difficult. Our
motivation of the work presenting in this paper is try to solve
these problems by proposing a general PE protocol for scene
detection algorithms.
Previous work on shot detection performance evaluation (PE) has
been reported in Video TREC [7], which provides an evaluation
standard based on precision and recall [8]. This PE protocol for
shot detection has been widely accepted. As for shot detection,
we can get exact and unbiased ground-truth since start and end
locations of shots are well defined. For scene detection, this
cannot be expected since different users have different cognitions
about scene. In previous PE work, three major problems are
addressed [9]. The first problem is that there is no common data
set. The second one is that the definition of the ground truth
depends on individual users. The last one is feature dependence
of various algorithms. In [9], although the first and the last
problems are taken into account, the evaluation algorithm is not
accurate enough, as to be explained later.
In this paper, we have further studied the problems that must be
considered on PE for scene detection, including the second
problem above that is not solved in paper [9].
The rest of the paper is organized as follows. In Section 2, we
introduce the basic concept of scene detection evaluation,
together with two traditional PE measures. Proposed PE protocol
is then presented in Section 3. The experimental results of the
protocol on comparing different video scene detection algorithms
and on determining of the best threshold for a given scene
detection algorithm are described in Section 4. Conclusion
remarks are given in Section 5.
2. EVALUATION OF SCENE DETECTION
The goal of scene detection evaluation consists of two parts. One
part is to identify whether an automatic scene detection result is
correct compared with manually labeled ground-truth. If not,
another part is to measure how far the scene detection result is
away from the correct one. That is, we formulate the scene
detection evaluation problem as how well the automatic
segmentation result fits the manual labeled ground-truth. Fig. 1
is employed to illustrate this problem.
1 2 3 4 5 6 7 8 9 10
Scene Boundary Shot
Ground-truth
2 1 3 4 5 6 7 8 9 10 Detected result
A B C D E F
Fig. 1. Problem formulation of scene detection PE.
In Fig. 1, scene boundary A, D and F are correct, while C is not
correct. There is one shot between the correct position B and the
detected boundary C. And E is a missing scene boundary.
Traditionally, different measures have been proposed to compute
the error or correct rate over the results of different segmentation
methods. Precision and recall are the widely used performance
measures for information retrieval [8], which can be defined by
the following symbols,
:CN the number of scene boundaries which are correctly
found by the automatic scene detection algorithm.
:IN the number of scene boundaries which are inserted by the
automatic scene detection algorithm, i.e., the number of
false alarms (false positives).
:DN the number of scene boundaries which are deleted by
automatic scene detection algorithm, i.e., the number of
false negatives.
Precision and Recall are defined as
IC
C
NN
Nprecision
+
=
(1)
DC
C
NN
N
recall
+
=
(2)
The performance for an “ideal” scene detection algorithm has
high values of both precision and recall. However, traditional
precision-recall measures are not comprehensive enough for the
evaluation task. These measures treat all incorrect detected scene
boundaries equally. For example, three false alarms in one scene
should not have the same PE value as three false alarms in three
different scenes, because there are two more scenes are correctly
segmented in the first result. In addition, some of the missing
(not recalled) scene boundaries are not really “missed” but
located in different positions. Sometimes the wrong positions are
very close to the correct ones. Previous PE measurers can only
reveal that they are incorrect but cannot indicate how incorrect
they are. The proposed PE measures in this paper can distinguish
the above cases, thus can generate more accurate and more
reasonable PE results.
3. PROPOSED EVALUATION PROTOCOL
In our PE protocol, three PE measures are defined to evaluate
different scene detection algorithms, namely, False Positive
Index (FPI), False Negative Index (FNI) and Overall
Performance Index (OPI).
For convenience, we firstly define a set of symbols which will be
used in the protocol representation, as listed below.
1. Shot and shot boundaries
{ }110 , , −= NShotShotShotShot �L : all shots in the video
( )iShotRB / ( )iShotLB : left/right shot boundary (represented
by a frame number or time)
where N is the number of shots in the video.
2. Scene and scene boundaries
{ }110 , , −= MSceneSceneSceneScene �L : all scenes in the video
{ }kNkkk kShotShotShotScene 110 , , −= �L : one scene
)( kSceneLBS / )( kSceneRBS : left/right scene boundary
(represented by a shot)
( )( )kk SceneLSBLBSceneLB )( = : left scene boundary
( )( )kk SceneRSBRBSceneRB )( = : right scene boundary
{ }MSceneBSceneBSceneBSceneB , , , 20 L= : scene boundary set
=
<
=
MKSceneRB
MkSceneLB
SceneB
k
k
k
),(
),( : scene boundaries
where NN
M
k
k =∑
−
=
1
0
, M is the number of scenes in the video.
3. Scene length (duration)
( )kSceneLen : length of a scene (represented by the number of
frames or duration)
4. Superscript
In the above symbols, we distinguish the items of ground truth
from the ones of automatically detected results by adding
superscript “g” (for ground-truths) and “d” (for automatically
detected results), respectively.
3.1 False Positive Index (FPI)
False Positive Index is defined to measure to what extent that
false scene boundaries are wrongly detected by the automatic
algorithms. For a scene g
kScene in ground-truth, all Detected
Scene Boundaries within this scene are represented by
( ) ( ){ }gkdjgkdjk SceneRBSceneBSceneLBSceneBDSB <<= : (3)
As aforementioned, false alarms in one scene can not be treated
the same as the same number of false alarms in different scenes.
According to this idea, we define the false positive performance
index for one scene as
.10 ,
,
1
11 12
<<=
−
−
=++++= −
α
α
α
ααα
kk
D
D
k
DSBDwhere
FPI
k
kL
(4)
where
⋅
denotes the number of elements in a set.
In the above definition, we decrease the “penalty” of false alarms
in one scene when the number of false alarms increases. We may
use other decreasing scheme instead of the one in equation (4),
but the value generated by this method can be easily normalized.
The parameterα here indicates how we “penalize” the second
and later false alarms in one scene. The bigger the parameter is,
the more “penalty” we set. In fact, when 1=α , FPI is degraded
into traditional measure. While if 0=α , FPI has the same value
as the measure in [9]. Indeed, α can be determined by user’s
preference. In our experiments, α is set to 2/3.
For all scenes in the video, the overall FPI is defined as
( ) ∑
−
=
−
=
1
0
1 gM
k
kg FPIM
FPI α (5)
Obviously, FPI is normalized to [0, 1].
From the above equation, we can see that the “penalty” of
different false scene boundaries are not equal, which is one of the
main distinctions between our scheme and previous/traditional
PE works. The advantage of our scheme can be illustrated by the
examples below.
Fig. 2 shows an example of ground-truth scene boundaries and
three automatically detected results. On the one hand, if we use
traditional evaluation scheme, the PE measures of Result 1 and 2
will be the same. However, although there are four false alarms,
two of the three scenes are correctly segmented in Result 2, while
none of the three scenes is correctly segmented in Result 1. So
Result 2 is obviously better than Result 1. On the other hand, if
we evaluate the three results by the method proposed in [9], the
PE measures of Results 2 and 3 will be the same. However,
Result 3 is obviously better than Result 2. According to our PE
measures, the values of FPI of the three results are 0.407, 0.267
and 0.111, respectively, which clearly indicate that Result 3 is
much better than Result 2, and Result 2 is much better than
Result 1. Therefore, this PE measure (FPI) can generate more
reasonable evaluation results.
Ground-truth
Detected Result 2
Detected Result 1
Detected Result 3
Fig. 2. Examples that illustrate the advantages of FPI.
3.2 False Negative Index (FNI)
False Negative Index is defined not only to measure how many
scene boundaries are missed by the automatic detection
algorithms but also measure to what extent the scene boundaries
are missed. For a scene boundary g
kSceneB in ground-truth, the
Closest Scene Boundary in the detected results is denoted by
d
jk SceneBCSB = (6)
where
{
}dgkdigk
d
i
g
k
i
MiSceneBSceneBSceneB
SceneBSceneBj
≤≤<<
−=
+− 0,
,minarg
11
(7)
Then the performance index of false negative for this scene
boundary is defined by (as illustrated by Fig. 3)
( ) ( )
=
−
g
k
k
g
k
k
k SceneLenSceneLen
FNI
δδ
�
1
,1min (8)
where
kδ is the distance of gkSceneB and kCSB , i.e.,
k
g
kk CSBSceneB −�δ (9)
Ground-truth
Detect Result
g
kScene 1− gkScene
kδ
g
kSceneB
kCSB
Fig. 3. FNI calculation.
For all scenes in the video, the overall FNI is
∑
−
=
−
=
1
11
1 gM
k
kg FNIM
FNI (10)
From Equation (6)-(10) and Fig. 3, we can see, this measure can
indicate how well the detected scene boundaries fit with the
closest ground-truth scene boundaries. If we use traditional
method mentioned in Section 2 and the scheme proposed in [9],
all scene boundaries that are not exactly detected will be counted
as misses without any further discrimination, and the distance
kδ will not affect the PE result. Unlike these two methods, our
scheme can produce more accurate and more reasonable PE
results by taking
kδ into account.
3.3 Overall Performance Index (OPI)
The definition of Overall Performance Index is to present a
single and concise measure of overall performance for a given
scene detection algorithm.
For a ground-truth hg , the overall evaluation index is (now add
superscript h to distinguish the different ground-truths labeled by
different persons or at different times by the same person)
( ) 10 ,1 ≤≤−+= βββ hhh FNIFPIEr (11)
where β is the relative importance of false positive and false
negative. In our experiments, we set 5.0=β , which means we
treat false positive and false negative the same. If we care false
positive more than false negative, we may set 5.0>β , and vice
versa.
Note that this index measures how “bad” the detected result is, so
to measure how “good”, we use
hh ErOPI −=1 (12)
With appropriate values of β , hOPI gives a quantitative
performance evaluation of the tested algorithm. The higher
the hOPI , the better the algorithm.
Generally, PE requires an exact and unbiased ground-truth.
However, it is difficult for evaluation of scene detection, since
the definition of scene is ambiguous and different users have
different cognitions about what is scene. So, unlike shot
detection evaluation, unbiased and exact ground-truth is not
expected to be obtained for scene detection evaluation. Although
it is hard to completely solve this problem, we average the PE
results over a group of ground-truths labeled by different people
or by the same people but at different times. That is, for a
ground-truth set { }1210 ,, , −= HggggG L� , the final Overall
Performance Index is defined as
∑
−
=
=
1
0
1 H
h
hOPI
H
OPI (13)
For FPI and FNI, we can also get average values over all ground-
truths in the same way. The performance of an “ideal” algorithm
is expected to have lower values for the first two PE measures
(FPI and FNI) and higher value for the third PE measure (OPI).
4. EXPERIMENT RESULTS
In previous scene evaluation scheme [9], only one set ground-
truth is collected, while in our experiments, fifteen students
majored in arts are instructed to label ground-truths using a tool
developed by us. Our video database consists of 10 video clips.
The total duration is about two hours. Some clips are from
MPEG-7 test sequences and others are from raw home videos,
which have covered various complexities. Shot detection and
key frame extraction are performed on all video data. Then, the
fifteen users are instructed to manually label ground-truth for the
scene boundaries. Fifteen ground-truths for the same video set
are obtained. Thus, for every automatic detection result, we get
fifteen PE results. The final PE result is generated by equation
(13) in Section 3. This method is applied in two experiments, as
described below.
Experiment One
The goal of the Experiment One is to compare the performance
of different scene detection algorithms using the proposed PE
measures. Three scene detection algorithms are implemented
[5][6][10]. The evaluation results of the three algorithms,
including the two traditional measures (precision and recall) are
listed in Table 2.
Table 1. PE results of three scene detection algorithms.
Algorithm FPI FNI OPI Precision Recall
Alg. 1 in [5] 0.220 0.229 0.775 0.490 0.635
Alg. 2 in [6] 0.106 0.411 0.742 0.600 0.469
Alg. 3 in [10] 0.221 0.470 0.654 0.435 0.471
From the table, we can see that Alg. 1 is the best one. For these
three algorithms, this result matches quite well with precision-
recall measures, although some values are not completely
accordant. For instance, the recall rate of Alg. 2 is lower than
that of Alg. 3, while FNI of Alg. 2 is lower than that of Alg. 3 too.
Experiment Two
The goal of the Experiment Two is to determinate the best
threshold for a given scene detection algorithm [6] by our
proposed PE measures.
The algorithm is based on a hierarchical clustering process,
which greedily merges the two temporally adjacent shots with
minimum distance (or maximum similarity) among all pairs of
temporally adjacent shots. The merging process is repeated until
the minimum distance is larger than a pre-defined threshold D.
The smaller the threshold D, the larger the number of the
detected scenes. One extreme is that no merging is performed.
Thus every shot is identified as one scene, so the number of the
scenes is equal to the number of the shots. This is the extreme
over-segmentation and the maximum scene number is got.
Another extreme is that all shots are merged into one scene. This
is the extreme under-segmentation, and the minimum number of
scenes is got. With our PE measures, we could determine the
best threshold for the algorithm that could avoid over-
segmentation and under-segmentation at the same time to the
utmost. The PE curves of different thresholds are shown in Fig.5.
Fig. 4. PE curves: performance versus threshold.
From Fig.5, we can see that the FPI is a non-decreasing function
of the threshold, while the FNI is a non-increasing function of the
threshold. This is because less and less scene boundaries are
detected when the threshold increases. The maximum value of
OPI is got near the intersection point of FPI and FNI.
5. CONCLUSION AND FUTURE WORK
In this paper, we propose an objective and comprehensive PE
protocol for video scene detection algorithms, which is consisted
of three measures, including False Positive Index, False Negative
Index and Overall Performance Index. Unlike traditional
precision-recall measures and previous works, the proposed PE
measures treat false alarms unequally, and take the extent of
incorrectness of the wrongly detected scene boundaries into
consideration, which can generate more accurate and more
reasonable PE results. Our PE protocol has been applied not
only on comparing different video scene detection algorithms
under common data set, but also on determining the best
threshold that can yield the best detection results for a given
scene detection approach. The proposed protocol can also be
helpful to improve, select and even design new scene detection
algorithms, e.g., to check whether a threshold is sensitive to
noises and other environment variables or not.
6. REFERENCE
[1] X. U. Cabedo and S. K. Bhattacharjee, “Shot Detection
Tools in Digital Video,” Proc. of Non-linear Model Based
Image Analysis 1998, Springer Verlag, pp 121-126,
Glasgow, July 1998.
[2] H.J. Zhang, et al, “Automatic Partitioning of Full-Motion
Vide
本文档为【视频场景检测算法的性能评价】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。