Summarisation of Short-Term and Long-Term Videos using Texture and Colour

Johanna Carvajal, Chris McCool, Conrad Sanderson
NICTA, GPO Box 2434, Brisbane, QLD 4001, Australia
University of Queensland, School of ITEE, St Lucia, QLD 4072, Australia
Queensland University of Technology (QUT), Brisbane, QLD 4000, Australia

Abstract

We present a novel approach to video summarisation that makes use of a Bag-of-visual-Textures (BoT) approach. Two systems are proposed: one based solely on the BoT approach, and another which exploits both colour information and BoT features. On 50 short-term videos from the Open Video Project we show that our BoT and fusion systems both achieve state-of-the-art performance, obtaining an average F-measure of 0.83 and 0.86 respectively, a relative improvement of 9% and 13% when compared to the previous state-of-the-art. When applied to a new underwater surveillance dataset containing 33 long-term videos, the proposed system reduces the amount of footage by a factor of 27, with only minor degradation in the information content. This order of magnitude reduction in video data represents significant savings in terms of time and potential labour cost when manually reviewing such footage.

1. Introduction

Video abstraction aims at providing concise representations of long videos. It has applications in browsing and retrieval of large volumes of videos [1] and also in improving the effectiveness and efficiency of video storage [21]. Video abstraction can be categorised into two general groups: video summarisation and video skimming [10, 21]. Video summarisation, also known as still image abstraction, static storyboard or static video abstract, is a compilation of representative frames selected from the original video [6].
Video skimming, also known as moving image abstraction or moving/dynamic storyboard, is a collection of short video clips [2, 10]. Both approaches should preserve the most important content from the video in order to present a comprehensible and understandable description for the end user. In general, video skimming provides a more coherent and visually attractive result. It often retains a high level of linguistic meaning due to its capacity to combine audio and moving elements [14, 21]. However, video summarisation is easier to generate and is not constrained in terms of timing and synchronisation [2, 21].

(Acknowledgements: NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.)

Video summarisation is an active area of research within the computer vision community and it has been applied to various video categories such as wildlife videos [23], sports videos [15] and TV documentaries [2], among others. In [1] the various approaches to video summarisation are divided into six techniques: feature selection, clustering algorithms, event detection methods, shot selection, trajectory analysis and the use of mosaics. Often a combination of techniques is used; for example, one of the most common approaches is to combine feature selection with a form of clustering [2, 6, 13].

In [24] a video summary is obtained by extracting a feature vector from each frame and then clustering the resulting set of feature vectors. The smallest clusters are then removed. A keyframe – a frame that forms part of the video summary – is selected for each cluster centroid by taking the frame whose feature vector is closest to the centroid. Similar approaches are adopted in [2, 6, 7, 10], where the major difference is in the choice of feature vector used to represent each frame.
Colour histograms are used in [2, 6], motion-based features are used in [7], and saliency maps are used in [10]. Each of the previously proposed feature vectors has its drawbacks. For instance, the colour histogram approach used in [2, 6] retains only coarse information about the frame. The motion-based features of [7] fail when the motion in the videos is too large. Finally, the saliency maps used in [10] perform poorly for cluttered and textured backgrounds. To date, limited work has been done on incorporating texture information to perform video summarisation.

Contributions. In this paper we first propose the use of texture information to improve video summarisation. We propose the use of the computationally efficient and effective bag-of-textures approach; we conjecture that this will improve video summarisation as it has been successfully applied to a range of image processing tasks, such as matching and classification of natural scenes and faces [12, 19, 22]. The bag-of-textures model divides an image into small patches, extracts appearance descriptors from each patch, quantises each descriptor into a discrete "visual word", and then computes a compact histogram representation [8], providing considerably different information than colour histograms. In addition, we propose a fusion based system for video summarisation, where both colour and texture information is exploited. This allows us to overcome the shortcomings of either approach. Similar approaches have been shown to be advantageous in object classification tasks [11]. We show that our system may be applied not only to short-term videos but also to long-term videos, helping in the detection of the existence of a rare species of fish.

The layout of this paper is as follows.
In Section 2 we describe in detail our proposed video summarisation method that exploits the benefits of using texture histograms based on the bag-of-textures model. In Section 3 we present our improved video summarisation method that fuses the visual information provided by both the colour and texture histograms. In Section 4 we describe how we evaluate the video summaries of short-term and long-term videos. In Section 5 we present experiments which show that the proposed methods obtain higher performance than existing methods based on colour histograms. Section 6 summarises the main findings.

2. Bag-of-Textures for Video Summarisation

This section describes our proposed bag-of-textures (BoT) approach. There are four main stages:

1. Pre-processing: The input video is sub-sampled, after which each frame is filtered and rescaled.

2. BoT representation:
   (i) Local Texture Features. Each frame is divided into small patches (blocks) and from each block we extract 2D-DCT features, which is an effective and compact representation [16].
   (ii) Dictionary Training. A generic visual dictionary is trained to describe the most commonly occurring textures in an independent training set.
   (iii) Generation of BoT Histogram. Each frame is represented by a histogram which is obtained by matching the feature vectors from each block to the dictionary.

3. Keyframe selection: Similar frames are grouped into an automatically determined number of clusters. One keyframe is selected per cluster.

4. Post-processing: In this final stage, we eliminate possible repetitive frames and create the static video summary.

Each of these stages is elucidated in the following sections.

2.1. Pre-processing

2.1.1 Sampling and Rescaling

The original input video is re-sampled to one frame per second in order to reduce the number of video frames to be examined.
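As a concrete illustration, sub-sampling to one frame per second amounts to keeping every `fps`-th decoded frame. The helper below is a hypothetical sketch (the paper's implementation uses the OpenCV C++ library, which is not assumed here):

```python
# Sub-sample a video to one frame per second (Section 2.1.1).
# `subsample_indices` is an illustrative helper: given the total number of
# decoded frames and the source frame rate, it returns the indices of the
# frames that are kept.

def subsample_indices(n_frames, fps):
    """Indices of frames kept at a rate of one frame per second."""
    step = max(1, int(round(fps)))
    return list(range(0, n_frames, step))

# A 10-second clip at 25 fps (250 frames) is reduced to 10 frames.
kept = subsample_indices(250, 25)
```

In a real pipeline these indices would be applied while decoding, so frames that are never used are not fully processed.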
Each frame is then converted into grayscale and re-scaled to a quarter of its original size, in order to reduce the computational cost of the following stages.

2.1.2 Noise Filtering

There are often uninformative frames at the beginning and/or the end of a segment that may affect the appearance of a video summary [6]. These frames are usually colour-homogeneous due to fade-in and fade-out effects, and have a small standard deviation of their pixel values. Frames with a standard deviation below a threshold are eliminated.

2.2. BoT Representation

2.2.1 Local Texture Features

Each frame is divided into N overlapping blocks. To each block we apply the 2D discrete cosine transform (2D-DCT) to obtain a D-dimensional feature vector that represents the local texture information [16]. Thus, the local texture feature for the n-th block of the i-th frame is x_{i,n}.

2.2.2 Dictionary Training

The dictionary is trained using the k-means algorithm [3] by pooling the local texture features from a set of training frames. The resulting G cluster centers \{\mu_1, \cdots, \mu_G\} represent the local textures (codewords) of the dictionary.

2.2.3 Generation of BoT Histogram

In the BoT approach the i-th frame is represented by a histogram, h_i^{BoT}. This G-dimensional histogram represents the relative frequency of the local texture features within the frame. The g-th dimension of h_i^{BoT} is the relative frequency of the g-th local texture feature from the dictionary, similar to [5]. The histogram is normalised to sum to one. Thus, each local texture feature can be converted to a local histogram, h_{i,n}^{BoT}, of dimension G, where each dimension g is given by

    h_{g,i,n}^{BoT} = \begin{cases} 1 & \text{if } g = \arg\min_{k \in \{1, \cdots, G\}} \| x_{i,n} - \mu_k \|_2 \\ 0 & \text{otherwise} \end{cases}    (1)

These N local histograms can then be summed and normalised to produce the final BoT histogram:

    h_i^{BoT} = \frac{1}{N} \sum_{n=1}^{N} h_{i,n}^{BoT}    (2)
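The quantisation of Eq. (1) and the averaging of Eq. (2) can be sketched as follows. The 2-D toy features and codewords below are invented for illustration (real descriptors are D = 15 dimensional 2D-DCT vectors), and the function names are our own:

```python
# Sketch of the BoT representation: each block's feature vector is
# assigned to its nearest dictionary codeword (Eq. 1), and the per-block
# one-hot histograms are averaged into the frame histogram (Eq. 2).

def nearest_codeword(x, codewords):
    """Index g minimising the Euclidean distance ||x - mu_g||_2."""
    return min(range(len(codewords)),
               key=lambda g: sum((a - b) ** 2 for a, b in zip(x, codewords[g])))

def bot_histogram(features, codewords):
    """G-dimensional frame histogram, normalised to sum to one."""
    h = [0.0] * len(codewords)
    for x in features:
        h[nearest_codeword(x, codewords)] += 1.0
    n = len(features)
    return [v / n for v in h]

codewords = [(0.0, 0.0), (1.0, 1.0)]                  # G = 2 toy codewords
features = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.2, -0.1)]
h = bot_histogram(features, codewords)                # -> [0.5, 0.5]
```

Two of the four toy blocks fall closest to each codeword, so the frame histogram is uniform; in practice the relative frequencies reflect which textures dominate the frame.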
2.3. Keyframe Selection

To obtain a set of keyframes we adopt an approach similar to that of [6]. A keyframe is a frame that forms part of the video summary. The k-means algorithm is used to cluster similar frames into K segments, and the resultant centroids are then used to select the keyframes.

Initially, the frames are grouped consecutively, assuming that sequential frames share similar content. To automatically determine the number of clusters, K, we calculate the Euclidean distance between two consecutive frames. If the distance is greater than a threshold \tau then K is incremented. For each cluster centroid, the frame whose BoT histogram is closest is selected as a keyframe. A total of K keyframes is then reached.

2.4. Post-processing

Having obtained the initial set of K keyframes, we then attempt to discard those keyframes which are too similar. This is achieved by comparing all keyframes against each other. If the Euclidean distance between the BoT histograms of two keyframes is smaller than a threshold \tau, then one of the two keyframes under consideration is discarded. This gives the final static video summary that consists of N_{as} keyframes, where N_{as} \leq K, with "as" standing for automatic summary. Lastly, the static video summary is obtained after organising the resulting keyframes in temporal order.

3. Fusion of Colour and BoT

In this section, we present a hybrid system that fuses colour histograms [6] and BoT texture information, termed CaT (for Colour and Texture). The proposed CaT approach to video summarisation has the same four stages as our proposed BoT video summarisation approach, but with additions in order to obtain colour histograms. We describe these additions below.

1. Pre-processing: The input video is processed in two independent ways. First, we obtain the BoT histograms as described in Section 2.1.
Second, to obtain the colour histograms we extract the Hue component, from the HSV colour space, of the unscaled input frame, similar to [6]. In both cases we remove uninformative frames by employing the noise filtering process described in Section 2.1.

2. Texture and Colour Histogram: The BoT histogram is the same as explained in Section 2.2. The colour histogram, h_i^{hue}, of the i-th frame is computed using only the Hue component, as in [6].

3. Keyframe Selection: The BoT and colour histograms are clustered using k-means. This stage is similar to Section 2.3. The difference lies in the distance measure used to compare all frames against each other.
   (i) To select the number of keyframes K we combine the information from the BoT and colour histograms. When calculating the distance between frames a and b we use the weighted summation of Euclidean distances:

    \alpha \| h_a^{BoT} - h_b^{BoT} \|_2 + \beta \| h_a^{hue} - h_b^{hue} \|_2    (3)

   under the constraints \alpha + \beta = 1, \alpha \geq 0, \beta \geq 0.
   (ii) Each keyframe is selected by finding the frame which is closest to each cluster centroid. For the CaT approach the distance between a frame and a centroid is calculated as a weighted summation of the Euclidean distances, as per (3).

4. Post-processing: To eliminate similar frames we use the procedure described in Section 2.4, but replace the Euclidean distance with the weighted summation of the Euclidean distances, as per (3).

4. Datasets and Evaluation Metrics

To evaluate the performance of video summarisation we use two datasets consisting of short-term and long-term video data. The short-term data is obtained from the Open Video Project¹. The long-term data is a new dataset that consists of 14 hours of underwater video surveillance which monitors the behaviour of marine wildlife.

4.1. Short-Term Videos

We use the 50 videos from the Open Video Project which contain ground truth [6].
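The fused distance of Eq. (3) is a simple convex combination. As a minimal sketch (the histograms below are toy values, and in practice \alpha is chosen by the grid search described in Section 5):

```python
# Weighted CaT distance (Eq. 3): a convex combination of the Euclidean
# distance between BoT histograms and between Hue histograms.
import math

def cat_distance(h_bot_a, h_bot_b, h_hue_a, h_hue_b, alpha):
    """alpha * ||h_a^BoT - h_b^BoT||_2 + (1 - alpha) * ||h_a^hue - h_b^hue||_2"""
    beta = 1.0 - alpha
    d_bot = math.dist(h_bot_a, h_bot_b)   # Euclidean norm, Python >= 3.8
    d_hue = math.dist(h_hue_a, h_hue_b)
    return alpha * d_bot + beta * d_hue

# Toy example: the BoT histograms differ maximally, the Hue histograms agree.
d = cat_distance([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], alpha=0.5)
# BoT part contributes sqrt(2), Hue part contributes 0, so d = 0.5 * sqrt(2).
```

Setting alpha = 1 recovers the pure BoT distance, and alpha = 0 the pure colour distance, which is why the grid search over alpha subsumes both single-feature systems.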
Each ground truth consists of the summaries provided by P = 5 users. The users provided the summaries under no restrictions on the length or appearance of the summaries.

To evaluate the performance on the short-term video data we use the "Comparison of User Summaries" (CUS) method [6]. This method compares the automatic video summary and the ground truth by exhaustively calculating the distance between the frames from the automatic summary and the ground truth. Two frames are similar if the distance between their respective feature vectors (histograms) is less than an evaluation threshold \delta. If the frames match, they are removed from the next iteration of the comparison process. For performance evaluation, the distance measure used for the BoT approach is the Euclidean distance; however, to be consistent with prior work [6], the distance measure for the colour histograms is the L1-norm. Therefore, the distance measure used for CaT is the weighted summation of the Euclidean distance for the BoT histograms and the L1-norm for the colour histograms:

    \alpha \| h_a^{BoT} - h_b^{BoT} \|_2 + \beta \| h_a^{hue} - h_b^{hue} \|_1    (4)

¹ Open Video Project: http://www.open-video.org

Various evaluation metrics exist to measure the quality of an automatic video summary. We use three evaluation metrics so that we can compare our proposed approaches with two state-of-the-art methods [6, 2]. To compare with [6] we use accuracy (acc) and error (err), and to compare with [2] we use the F-measure. To calculate acc and err, each frame in the automatic video summary is compared with all frames in the user summary, and then the number of matching frames (N_m) and non-matching frames (N_{nm}) are calculated:

    acc = \frac{N_m}{N_u}, \qquad err = \frac{N_{nm}}{N_u}    (5)

where N_{as} and N_u are the total numbers of frames in the automatic and user summaries, respectively.
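A simplified sketch of this matching and of Eq. (5) follows. The 1-D "histograms" and the distance function are stand-ins for the real feature vectors, and the greedy nearest-first matching order is our assumption about how the exhaustive comparison is iterated:

```python
# Sketch of the CUS-style matching behind Eq. (5): each automatic-summary
# frame is matched to at most one user-summary frame whose feature distance
# falls below the evaluation threshold delta; matched user frames are
# removed from later iterations.

def match_summaries(auto_frames, user_frames, dist, delta):
    """Return (N_m, N_nm) for one automatic summary vs one user summary."""
    remaining = list(user_frames)
    n_match = 0
    for a in auto_frames:
        best = min(remaining, key=lambda u: dist(a, u), default=None)
        if best is not None and dist(a, best) < delta:
            remaining.remove(best)  # a user frame can only be matched once
            n_match += 1
    return n_match, len(auto_frames) - n_match

dist = lambda a, b: abs(a - b)
n_m, n_nm = match_summaries([0.1, 0.5, 0.9], [0.12, 0.52], dist, delta=0.05)
n_u = 2
acc, err = n_m / n_u, n_nm / n_u   # acc = 1.0, err = 0.5
```

In this toy case both user frames are matched (acc = 1.0), while the third automatic keyframe has no counterpart and counts towards err.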
The F-measure, defined as

    F = \frac{2 \times precision \times recall}{precision + recall}    (6)

is used to provide a single number that balances precision = N_m / N_{as} and recall = N_m / N_u.

The evaluation metrics are presented as averages. First, we take the average over the P users to obtain acc_P, err_P and F_P; for each video there are P = 5 users. Then we take the average across all of the videos to obtain acc, err and F. In terms of acc it is desirable to have a high value, as it measures the number of matching frames. In terms of err it is desirable to have a small value, as it measures the number of non-matching frames. With regards to F it is desirable to obtain a high value, which occurs when the precision and recall are both large.

4.2. Long-Term Videos

The long-term videos consist of 14 hours of underwater footage from 33 videos which are on average 25 minutes in duration. This data was obtained from the NSW-DPI², courtesy of David Harasti.

² New South Wales Department of Primary Industries, Australia.

Example images are shown in Figure 1.

Figure 1. Example images from the long-term underwater surveillance videos; the added red ellipsoids highlight the rare species of interest.

In each video there is always at least one segment where a rare species of fish, the black cod, is within view. Normally these videos would be inspected by a human expert to determine if there is an instance of the rare fish within. We propose that video summarisation can be used to reduce the amount of footage to be viewed in order to detect the existence of this rare species of fish. Using ground truth which provides time-stamps when this rare species is within view, we examine the effectiveness of video summarisation in providing at least one keyframe in each static video summary with the rare species of interest within view.
This is useful as it presents a way to reduce the time and cost of manually viewing a large amount of video data.

To calculate the performance on long-term videos we present results in terms of detection accuracy and the average compression ratio (R_c). Detection accuracy refers to whether an instance of the rare species is among any of the chosen keyframes for a static video summary; 75% would mean that there is at least 1 keyframe of the rare species in 75% of the static video summaries. To calculate the average compression ratio we first note that, because we have long-term videos, for each video there might be many hundreds of keyframes. To present all of these keyframes effectively to the user we re-encode them into a static video summary by presenting each keyframe for 0.25 seconds. This gives the user time to effectively view each keyframe. Thus the t-th long-term video V_t is converted to a static video summary S_t with a compression ratio given by:

    R_{c,t} = \frac{4 \times Duration(V_t)}{Duration(S_t)}    (7)

where Duration is the duration of a video and the factor of 4 is introduced as there are 4 keyframes per second of the shortened video.

5. Experiments

An important part of both the BoT and CaT approaches is the training of the dictionary to obtain the texture histograms. To train this dictionary we use 10 frames randomly selected from videos taken from the Open Video Project that have no user summaries, ensuring they are independent of the evaluation dataset. In addition, the frames selected to train the dictionary look significantly different to the ground truth provided by the users.

To obtain the proposed local texture features we divide each frame into a set of overlapping blocks. Similar to [19] we use a block size of 8 × 8 with an overlap margin of 6 pixels, and represent each block as a D = 15 dimensional feature vector containing 2D-DCT coefficients.
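This block geometry is easy to sanity-check: an 8 × 8 block with a 6-pixel overlap advances by 2 pixels per step. The helper below is hypothetical (our own naming), and the 16 × 16 frame is a toy size:

```python
# Block layout for the local texture features: 8x8 blocks with a 6-pixel
# overlap, i.e. the block origin advances by 8 - 6 = 2 pixels. Each block
# would then be described by low-frequency 2D-DCT coefficients, with the
# illumination-sensitive DC term discarded (D = 15 dimensions).

def block_origins(height, width, block=8, overlap=6):
    """Top-left corners of all blocks fully contained in the frame."""
    step = block - overlap
    rows = range(0, height - block + 1, step)
    cols = range(0, width - block + 1, step)
    return [(r, c) for r in rows for c in cols]

# A toy 16x16 frame yields a 5x5 grid of block origins: 0, 2, 4, 6, 8.
origins = block_origins(16, 16)
```

The dense 2-pixel stride is what makes the per-frame histogram stable: neighbouring blocks overlap heavily, so small frame shifts barely change the codeword counts.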
We extract the first 16 2D-DCT coefficients, which represent low-frequency information [16], and omit the first coefficient as it is the most sensitive to illumination changes. With regards to the colour histogram, we quantise the Hue component into 16 bins as per [6]. These parameters are the same for all experiments.

The values for the threshold \tau, fusion weight \alpha and evaluation threshold \delta were determined experimentally. For all of the experiments we search for the optimal fusion parameter over \alpha = \{0.0, 0.1, \cdots, 1.0\}. Our proposed methods were implemented using the OpenCV [4] and Armadillo [18] C++ libraries.

5.1. Short-Term Videos

We compare performance against two baseline systems from the literature, VSUMM [6] and VISON [2]. The two baseline systems both use colour information as their primary feature. VSUMM uses colour information by retaining only the Hue component of HSV and generating a histogram of 16 bins. VISON is a state-of-the-art approach and consists of a histogram of the HSV representation of each frame. It combines the HSV information in a compressed form such that the Hue component is treated with greater importance, resulting in a histogram of 256 bins.

An initial set of experiments was performed to find the optimal number of components for the dictionary of our proposed texture features. Using a fixed set of component counts G = \{8, 16, 32\} and a fixed set of thresholds \tau = \{0.05, 0.10, \cdots, 0.5\}, we found that using just G = 8 components provided optimal performance. We kept the number of components constant for the remainder of our experiments.

In Figure 2 we present a summary of the average performance on the 50 short-term videos of our proposed systems, BoT and CaT, and the two baselines. Two interesting results can be seen from this figure.
First, it can be seen that the texture-only BoT system performs better than either the VSUMM or VISON approaches, which primarily use colour information.

Figure 2. Comparative evaluation of our proposed methods with VSUMM [6] and VISON [2]. Lower values of err as well as higher values of acc and F are desired.

The BoT system obtains an average F-measure of F = 0.83, which is a relative improvement of 9% when compared to VISON (F = 0.76). Furthermore, the acc and err of the BoT system show that it produces a more accurate summarisation than VSUMM and also has the lowest err of any system (no results in terms of acc and err were supplied for VISON in [2]). This suggests that texture information is at least as important as colour information for the task of video summarisation.

Second, the proposed CaT system (fusing colour histograms and the proposed texture histograms) performs better than the two baseline systems and the proposed texture-only BoT system. The CaT system has an average F-measure of F = 0.86, which is a relative improvement of 13% when compared to VISON (F = 0.76), the previous state-of-the-art approach.

Figure 3 shows qualitative results for the automatic summaries provided by VSUMM and VISON as well as our proposed BoT and CaT systems. It can be seen that VSUMM (Figure 3a) with F_P = 0.83, VISON (Figure 3b) with F_P = 0.78, and our proposed BoT (Figure 3c) with F_P = 0.74 contain some keyframes that may not be of interest and/or are repetitive. In contrast, the proposed CaT system (Figure 3d) provides the most consistent video summary with F_P = 0.86.

5.2. Long-Term Videos

In this section we present results on 33 long-term videos which last on average for 25 minutes.
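The headline reduction reported below is easy to verify from the paper's own figures: dividing 14 hours of footage by the average compression ratio of 27 leaves roughly half an hour of summary.

```python
# Sanity check of the reported reduction (values taken from the paper):
# 14 hours of long-term footage divided by the average compression
# ratio of 27 leaves about 31 minutes of summary to review.

total_minutes = 14 * 60                  # 840 minutes of footage
r_c = 27                                 # average compression ratio
summary_minutes = total_minutes / r_c    # roughly 31 minutes
```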
We examine the applicability of video summarisation to long-term videos to efficiently detect a rare species of fish, and measure performance in terms of detection accuracy and compression ratio (see Section 4.2).

Figure 3. Static video summary for "the future of energy gases – segment 09", using (a) VSUMM, (b) VISON, (c) proposed BoT, and (d) proposed CaT.

The detection accuracy and average compression ratio of the algorithms for various thresholds, \tau = \{0.025, 0.05, \ldots, 0.1\}, are presented in Figure 4. It can be seen in Figure 4a that the CaT algorithm consistently outperforms the BoT and VSUMM algorithms. We attribute this to the fact that the background in these videos is relatively stable, and so the colour histograms used in VSUMM do not change as often compared to the short-term videos used in [6]. In Figure 4b it can be seen that while the VSUMM algorithm provides a better average compression ratio than either the BoT or CaT approaches, it comes at the cost of accuracy. In general the proposed fusion approach provides the most consistent trade-off between accuracy and average compression ratio.

We take the optimal system at the threshold \tau = 0.05, as this provides a high degree of detection accuracy (85%) and a good average compression ratio of 27. This system allows a user to see the fish of interest in 85% of the summarised videos while reducing the amount of video data to view by a factor of 27, more than an order of magnitude. Such an approach would reduce the 14 hours of video data to just 31 minutes, thus enabling significantly more efficient reviewing of the data.

6. Summary and Future Work

In this paper, we have proposed the novel use of textures to perform video summarisation. We proposed to use a visual bag-of-textures (BoT) in two ways.
First, a BoT system which uses only texture features is proposed and shown to outperform two state-of-the-art systems which use colour only, VSUMM and VISON. Second, a fused system that combines Colour and Texture (CaT) is proposed and shown to provide further improvements.

Figure 4. Demonstration of the trade-off between (a) the detection accuracy and (b) the average compression ratio R_c for the 33 long-term videos using the CaT, BoT and VSUMM approaches.

Both of our proposed systems outperform two state-of-the-art approaches, VSUMM and VISON, which use colour features. Experiments on 50 short-term videos, obtained from the Open Video Project, show that our proposed texture-only system (BoT) obtains an F-measure of 0.83, which is better than either VSUMM or VISON, which obtain average F-measures of 0.73 and 0.76, respectively. Furthermore, our fused system (CaT) demonstrates that combining colour and texture features yields state-of-the-art performance, with an average F-measure of 0.86.

We have also shown that video summarisation can be applied effectively to long-term videos. Using 33 long-term surveillance videos, in our case underwater surveillance footage, we have shown that video summarisation can be used to significantly reduce the amount of footage to view, by up to a factor of 27, with only a minor degradation in the information content.

Future work should examine alternative features and application settings, with a particular emphasis on long-term videos. For instance, emphasising the importance of foreground objects [17] should be explored, as well as explicit modelling of the movement (or actions) of such objects [9, 20].
Moreover, the applicability of video summarisation to CCTV surveillance footage should also be considered.

References

[1] M. Ajmal, M. Ashraf, M. Shakir, Y. Abbas, and F. Shah. Video summarization: techniques and classification. In Lecture Notes in Computer Science, Vol. 7594, pages 1–13, 2012.
[2] J. Almeida, N. J. Leite, and R. da S. Torres. VISON: VIdeo Summarization for ONline applications. Pattern Recognition Letters, 33(4):397–409, 2012.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[5] N. Dardas, Q. Chen, N. D. Georganas, and E. Petriu. Hand gesture recognition using bag-of-features and multi-class support vector machine. In IEEE Int. Symp. Haptic Audio-Visual Environments and Games (HAVE), pages 1–5, 2010.
[6] S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., and A. de Albuquerque Araújo. VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56–68, 2011.
[7] A. Divakaran, K. A. Peker, and H. Sun. Video summarization using motion descriptors. In Proc. SPIE Conf. on Storage and Retrieval from Multimedia Databases, 2001.
[8] K. Grauman and B. Leibe. Visual object recognition. Synthesis Lectures on Artificial Intelligence and Machine Learning, 5(2):1–181, 2011.
[9] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Kernel analysis on Grassmann manifolds for action recognition. Pattern Recognition Letters, 34(15):1906–1915, 2013.
[10] Q.-G. Ji, Z.-D. Fang, Z.-H. Xie, and Z.-M. Lu. Video abstraction based on the visual attention model and online clustering. Signal Processing: Image Communication, 28(3):241–253, 2013.
[11] Z. Li, Y. Liu, R. Hayward, and R. Walker.
Color and texture feature fusion using kernel PCA with application to object-based vegetation species classification. In IEEE International Conference on Image Processing (ICIP), pages 2701–2704, 2010.
[12] Z. Lin and J. Brandt. A local bag-of-features model for large-scale object retrieval. In European Conference on Computer Vision (ECCV), volume 6316 of Lecture Notes in Computer Science, pages 294–308. Springer, 2010.
[13] P. Mundur, Y. Rao, and Y. Yesha. Keyframe-based video summarization using Delaunay clustering. International Journal on Digital Libraries, 6(2):219–232, 2006.
[14] J. Oh, Q. Wen, J. Lee, and S. Hwang. Video abstraction. In Video Data Management and Information Retrieval, pages 321–346. Idea Group Inc. and IRM Press, 2004.
[15] J.-Q. Ouyang and R. Liu. Ontology reasoning scheme for constructing meaningful sports video summarisation. IET Image Processing, 7(4):324–334, 2013.
[16] W. B. Pennebaker and J. L. Mitchell. JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, 1993.
[17] V. Reddy, C. Sanderson, and B. C. Lovell. Improved foreground detection via block-based classifier cascade with probabilistic decision integration. IEEE Transactions on Circuits and Systems for Video Technology, 23(1):83–93, 2013.
[18] C. Sanderson. Armadillo: an open source C++ linear algebra library for fast prototyping and computationally intensive experiments. Technical report, NICTA, 2010.
[19] C. Sanderson and B. C. Lovell. Multi-region probabilistic histograms for robust and scalable identity inference. In Lecture Notes in Computer Science (LNCS), Vol. 5558, pages 199–208, 2009.
[20] A. Sanin, C. Sanderson, M. Harandi, and B. C. Lovell. Spatio-temporal covariance descriptors for action and gesture recognition. In Workshop on the Applications of Computer Vision (WACV), pages 103–110, 2013.
[21] B. T. Truong and S. Venkatesh. Video abstraction: a systematic review and classification. ACM Trans. Multimedia Comput. Commun. Appl., 3(1), Feb. 2007.
[22] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1794–1801, 2009.
[23] S.-P. Yong, J. Deng, and M. Purvis. Key-frame extraction of wildlife video based on semantic context modeling. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2012.
[24] Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In IEEE International Conference on Image Processing, volume 1, pages 866–870, 1998.