On Evaluating Perceptual Quality of Online User-Generated Videos

This paper deals with the issue of the perceptual quality evaluation of user-generated videos shared online, which is an important step toward designing video-sharing services that maximize users' satisfaction in terms of quality. We first analyze viewers' quality perception patterns by applying graph analysis techniques to subjective rating data. We then examine the performance of existing state-of-the-art objective metrics for the quality estimation of user-generated videos. In addition, we investigate the feasibility of using the metadata that accompanies videos in online video-sharing services for quality estimation. Finally, various issues in the quality assessment of online user-generated videos are discussed, including difficulties and opportunities.

Authors: Soobeom Jang, Jong-Seok Lee

Index Terms: User-generated video, paired comparison, quality assessment, metadata.

Manuscript received April 19, 2015; revised February 17, 2016. This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the "IT Consilience Creative Program" (IITP-2015-R0346-15-1008) supervised by the IITP (Institute for Information & Communications Technology Promotion). The authors are with the School of Integrated Technology and Yonsei Convergence Institute of Technology, Yonsei University, Incheon, 21983, Korea (e-mail: soobeom.jang@yonsei.ac.kr; jong-seok.lee@yonsei.ac.kr). This is the author's version of an article published in this journal; the final version of record is available at https://doi.org/10.1109/TMM.2016.2581582.

I. INTRODUCTION

With the advances of imaging, communications, and internet technologies, public online video-sharing services (e.g., YouTube, Vimeo) have become popular. In such services, a wide range of video content, from user-generated amateur videos to professionally produced videos such as movie trailers and music videos, is uploaded and shared. Today, online video sharing is one of the most significant media for producing and consuming multimedia content for various purposes, such as fun, information exchange, and promotion.

For the success of video-sharing services, it is important to consider users' quality of experience (QoE) regarding shared content, as in many other multimedia services. As the first step toward maximizing QoE, it is necessary to measure the perceptual quality of the online videos. The quality information of videos can be used for valuable service components such as automatic quality adjustment, streaming quality enhancement, and quality-based video recommendation. The most accurate way to measure perceptual video quality is to conduct subjective quality assessment by employing multiple human subjects.
However, subjective quality assessment is not feasible for online videos because of the tremendous number of videos in online video-sharing services. An alternative is objective quality assessment, which uses a model that mimics the human perceptual mechanism.

Traditional objective quality metrics for estimating video quality have two limitations. First, they typically depend on the existence of a reference video, i.e., the pristine version of the given video. In general, objective quality assessment frameworks are classified into three groups: full-reference (FR), reduced-reference (RR), and no-reference (NR). In the FR and RR frameworks, full or partial information about the reference is provided. In contrast, NR quality assessment does not use any prior information about the reference video, which makes the problem more complicated. In fact, the accuracy of NR objective metrics is usually lower than that of FR and RR metrics [1]. Second, the types of degradation that are dealt with are rather limited. Video quality is affected by a large number of factors, across which the human perceptual mechanism varies significantly. Because of this variability, it is too complicated to consider all the different video quality factors in a single objective quality metric. Hence, existing objective quality metrics have considered only one or a few major quality factors involved in production and distribution, such as compression artifacts, packet loss artifacts, and random noise, assuming that the original video has perfect quality. This approach has been successful for professionally produced videos.

However, it is doubtful whether the current state-of-the-art approaches for estimating video quality are also suitable for online videos. First, for a given online video, the corresponding reference video is not available in most cases, so NR assessment is the only option for objective quality evaluation. The performance of existing NR metrics is still unsatisfactory, which makes the quality assessment of online videos very challenging. Second, online video-sharing services cover an extremely wide range of videos. There are two types of videos in online video-sharing services: professional and amateur. Professional videos, which are typically created by professional video makers, and amateur videos, which are created by general users, differ significantly in various aspects such as content and production and editing styles. In particular, user-generated videos have large variations in these characteristics, so they show wide ranges of popularity, user preference, and quality. Moreover, diverse quality factors are involved in online user-generated videos (see Section II for further details). However, existing NR metrics have been developed to work only for certain types of distortion due to compression, transmission errors, random noise, etc. Therefore, it is not guaranteed that the existing NR metrics will perform well on those videos.

Online videos are usually accompanied by additional information, called metadata, including the title, description, viewcount, rating (e.g., like and dislike), and comments.
Some of the metadata of an online video (e.g., the spatial resolution and title) reflect the characteristics of the video signal itself, while other metadata, including the viewcount or comments, provide information about the popularity of and users' preference for the video. These types of information have the potential to be used as hints about the quality of the video, because quality is one of the factors that affect viewers' perceptual preference. Therefore, they can be useful for the quality assessment of online user-generated videos, either replacing or complementing objective quality metrics.

This paper deals with the issue of evaluating the perceptual quality of online user-generated videos. The research questions considered are:

• Are there any noteworthy patterns regarding viewers' judgment of the relative quality of user-generated videos?
• How well do existing state-of-the-art NR objective quality metrics perform for user-generated videos?
• To what extent are metadata-based metrics useful for the perceptual quality estimation of user-generated videos?
• What makes the signal-based or metadata-based quality estimation of user-generated videos difficult?

To the best of our knowledge, our work is the first attempt to investigate the issue of the perceptual quality assessment of online user-generated videos comprehensively in various aspects. Our contributions can be summarized as follows. First, by examining subjective ratings gathered by crowdsourcing for online user-generated videos, we investigate viewers' patterns of perceptual quality evaluation. Second, we analyze the performance of state-of-the-art NR quality assessment algorithms, metadata-driven features, and their combination in perceptual quality estimation. This study aims to reveal the efficacy and limitations of the signal-based and metadata-based methods. Finally, based on the experimental results, various issues in the quality assessment of online user-generated videos are discussed in detail. We comment on the difficulties and limitations of the quality assessment of user-generated videos in general and provide particular examples demonstrating them, which helps us better understand the nature of the quality assessment of online videos.

The rest of the paper is organized as follows. Section II describes the background of this study, i.e., visual quality assessment, characteristics of online videos, and previous approaches to the quality assessment of online videos. Section III introduces the dataset used in our study, including the video data and subjective data. In Section IV, patterns of user perception of online videos are examined via graph analysis. Section V presents the results of quality estimation using NR quality assessment algorithms and metadata. Section VI discusses issues of the quality assessment of online user-generated videos. Finally, Section VII concludes the paper.

II. BACKGROUND

A. Visual Quality Assessment

The overall QoE of a video service highly depends on the perceptual visual quality of the videos provided by the service. One way to score the quality of videos is to have them evaluated by human subjects, which is called subjective quality assessment. For many practical multimedia applications, quality assessment with human subjects is not applicable due to cost and real-time operation constraints.
To deal with this, research has been conducted to develop automatic algorithms that mimic the human perceptual mechanism, which is called objective quality assessment.

Objective quality assessment metrics are classified into three categories: FR, RR, and NR metrics. FR quality assessment uses the entire reference video, which is the original signal without any distortion or quality degradation. Structural similarity (SSIM) [2], multi-scale SSIM (MS-SSIM) [3], most apparent distortion (MAD) [4], and visual information fidelity (VIF) [5] are well-known FR quality metrics for images, and motion-based video integrity evaluation (MOVIE) [6] and spatiotemporal MAD (ST-MAD) [7] are FR metrics for videos. RR metrics do not need the whole reference signal but use partial information about it. Reduced-reference entropic differencing (RRED) [8] for images and the video quality metric (VQM) [9] for videos are examples of RR quality metrics. A challenging situation for objective quality assessment is when there is no reference for the signal being assessed. Estimating quality from only the given image or video itself is hard, since no prior knowledge of the reference can be utilized. Currently available NR metrics include the blind image integrity notator using discrete cosine transform statistics (BLIINDS-II) [10], the blind/referenceless image spatial quality evaluator (BRISQUE) [11], and Video BLIINDS (V-BLIINDS) [1]. These metrics typically use natural scene statistics (NSS) as prior knowledge of images and videos, and the main difference among them lies in how information about NSS is obtained. BLIINDS-II constructs NSS models from the probability distribution of discrete cosine transform (DCT) coefficients extracted from macroblocks. BRISQUE uses mean-subtracted contrast-normalized (MSCN) coefficients rather than transform-domain coefficients to speed up the quality assessment process. V-BLIINDS extracts NSS features based on a DCT-based NSS model as in BLIINDS-II, but uses frame differences to obtain the spatiotemporal information of the video. Additionally, V-BLIINDS uses motion estimation techniques to examine motion consistency in the video.
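As an illustration of the NSS features underlying BRISQUE, the following sketch computes MSCN coefficients for a grayscale frame. This is a minimal illustration rather than the authors' implementation; the Gaussian window parameter and the stabilizing constant are typical choices, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(frame, sigma=7/6, eps=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients of a
    grayscale frame, the spatial-domain statistic used by BRISQUE."""
    frame = frame.astype(np.float64)
    mu = gaussian_filter(frame, sigma)                 # local mean
    var = gaussian_filter(frame * frame, sigma) - mu * mu
    sigma_map = np.sqrt(np.maximum(var, 0.0))          # local contrast
    return (frame - mu) / (sigma_map + eps)            # normalize

# For pristine natural images the MSCN histogram is close to Gaussian;
# distortions change its shape, which NR metrics summarize by fitting
# generalized Gaussian distribution parameters as quality features.
```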
B. Characteristics of Online Videos

In online video-sharing services, both user-generated videos and professional videos are shared. In terms of quality, they have significantly different characteristics, especially in filming and editing. Many makers of user-generated videos do not have professional knowledge of photography and editing, so quality degradation factors can easily be involved in every step, from the acquisition to the distribution of videos. Table I presents a list of observable quality degradation factors in online user-generated videos. They are grouped with respect to the channels affected by degradation (i.e., video, audio, and audio-video) and the steps involving degradation.

TABLE I: Types of observable quality degradation factors in online user-generated videos.

| Channel | Step | Types of degradation factors |
|---|---|---|
| Video | Acquisition | Limited spatial/temporal resolution, misfocusing, blur, jerkiness, camera shaking, noise, occlusion, insufficient/excessive lighting, poor composition, poor color reproduction |
| Video | Processing/editing | Bad transition effects (e.g., fade in/out, overlap), harming captions (e.g., title screen, subtitle), frame-in-frame, inappropriate image processing |
| Video | Uploading | Video compression artifacts, temporal resolution reduction, spatial resolution loss |
| Audio | Acquisition | Device noise, environmental noise, too-loud or too-low volume, incomprehensible language |
| Audio | Processing/editing | Inappropriate background music, audio-video desynchronization, unsuitable sound effects, audio track loss |
| Audio | Uploading | Audio compression artifacts |
| Video & Audio | Content | Boredom, violence, sexuality, harmful content |
| Video & Audio | Delivery | Buffering, packet loss, quality fluctuation |

Visual factors in the acquisition step consist of problems with equipment, camera skills, and environments. In particular, according to the work in [12], typical artifacts in user-generated videos include camera shake, harmful occlusions, and camera misalignment. Visual quality degradation also occurs during video processing and editing, due to the editor's lack of knowledge of photography and video making or by intent. For example, scene transition effects, captions, and frame-in-frame effects, where a small image frame is inserted into the main video frame, may degrade visual quality. Image processing (e.g., color modification, sharpening) can be applied during editing, which may not be pleasant to viewers. In the uploading step, the system or the uploader may compress the video or modify its spatial and temporal resolutions, which may introduce compression artifacts, visual information loss, or motion jerkiness.

Audio quality degradation can also occur at each step of acquisition, processing and editing, and uploading. Some of the audio quality factors involved in the acquisition and uploading steps are similar to the visual quality factors (equipment noise, compression, etc.). Moreover, the language used in recorded speech may have a negative effect on perception when a viewer does not understand the language. In the processing and editing step, inserting inappropriate sound sources, background music, or sound effects may decrease user satisfaction. Loss of the audio track may be a critical issue when the content significantly depends on the sound.

Some quality factors related to the characteristics of the content or the communication environment apply to both the audio and video channels. First, the content of a video can itself be a problem: boring, violent, sexual, or harmful content can spoil the overall experience of watching the video. Second, the communication environment from the server to a viewer is not always guaranteed, so buffering, packet loss, and quality fluctuation may occur, which are critical in streaming multimedia content [13].

Content in online video-sharing services is usually accompanied by information reflecting uploading and consumption patterns, provided by uploaders and viewers, which is called metadata.
The metadata of a video clip that are either assigned by the uploader or automatically extracted from the video include the title, information about the uploader, upload date, duration, video format, and category. Metadata determined by viewers include the viewcount, comments, and ratings. One can analyze metadata to discover the production and consumption patterns of online videos and to improve the quality of service. Moreover, information from metadata (e.g., video links and subscribers) can be used to construct a social network of online videos. Analysis of this social network can be used for content recommendation [14] and for investigating the network topology of online video sharing, especially the evolution of online video communities [15], [16].

C. Quality Assessment of Online Content

There are few studies that consider the particular characteristics of online images and videos. The method proposed in [17] estimates the quality of online videos using motion estimation, temporal factors to evaluate jerkiness and unstableness, spatial factors (including blockiness, blurriness, dynamic range, and intensity contrast), and video editing styles (including shot length distribution, width, height, and black side ratio). Since these features depend on the genre of the video content, robustness is not guaranteed, as pointed out in [17]. The work in [18] predicted the quality of user-generated images using their social link distribution, the tone of viewer comments, and access logs from other websites. It was discovered that, in online environments, social functionality is more important in determining the quality of user-generated images than the distortion of the images themselves. Our work deals with videos, which are more challenging for quality assessment than images. In comparison to the aforementioned prior work, we conduct a more comprehensive and thorough analysis of the issue of the quality assessment of online user-generated videos based on state-of-the-art video quality metrics and metadata-driven metrics.
III. VIDEO AND SUBJECTIVE DATASET

In this section, we introduce the video and subjective dataset used in this work. We use the dataset presented in [19], in which 50 user-generated amateur videos and their metadata were collected from YouTube via keyword search, and subjective quality ratings for the videos were obtained from a crowdsourcing-based subjective quality assessment experiment. The descriptions and observed quality degradation factors of the videos are presented in Table II.

TABLE II: Description and quality degradation factors of the videos in the dataset. The video ID is defined based on the ranking in terms of the mean opinion score (MOS) (see Section IV-A).

| ID | Description | Quality degradation factors |
|---|---|---|
| 1 | A hand drawing portraits on paper with pencil and charcoal from scratch | Fast motion |
| 2 | A man drawing on the floor with chalk | Time lapse |
| 3 | A series of animal couples showing friendship | Blur, compression artifacts |
| 4 | Procedure to cook a cheese sandwich (close-up of the food) | Misfocusing |
| 5 | Two men imitating animals eating food | Compression artifacts, jerkiness |
| 6 | A baby swimming in a pool | Camera shaking |
| 7 | Escaping from a chase (first-person perspective) | Fisheye lens effect |
| 8 | Nature scenes including mountain, cloud, and sea | Blur, compression artifacts |
| 9 | Cheering university students (shot by a camera moving around a campus) | Camera shaking, captions |
| 10 | A red fox in a cage | Blur, camera shaking |
| 11 | Cats and kittens in a house | Blur, misfocusing |
| 12 | A crowd dancing at a station | Poor color reproduction |
| 13 | Seven people creating rhythmic sounds with a car | Camera shaking, misfocusing |
| 14 | People dancing at a square | Camera shaking |
| 15 | A group of children learning to cook | Jerkiness, misfocusing |
| 16 | A slide show of nature landscape pictures | Blur, compression artifacts |
| 17 | Soldiers patrolling streets | Camera shaking, compression artifacts |
| 18 | A sleeping baby and a cat (close-up shot) | Camera shaking, compression artifacts |
| 19 | A baby smiling at the camera | Poor color reproduction, compression artifacts |
| 20 | A man playing a video game | Frame-in-frame, jerkiness |
| 21 | Cats having fun with water | Blur, camera shaking, captions |
| 22 | A man playing a video game | Frame-in-frame, jerkiness |
| 23 | Twin babies talking to each other with gestures | Camera shaking, compression artifacts |
| 24 | Walking motion from the walker's viewpoint | Compression artifacts, camera noise |
| 25 | A man sitting in a car and singing a song | Compression artifacts, misfocusing |
| 26 | A man playing with a pet bear and a dog on the grass | Misfocusing, shaking image frame |
| 27 | Pillow fight on a street | Insufficient lighting, varying illumination |
| 28 | Kittens playing with each other | Blur, weak compression artifacts |
| 29 | People dancing and cheering outside | Compression artifacts, packet loss, excessive lighting |
| 30 | A baby laughing loudly | Camera noise, compression artifacts, captions |
| 31 | A man breakdancing in a fitness center | Packet loss, compression artifacts, camera noise |
| 32 | Cheerleading in a basketball court | Compression artifacts, misfocusing |
| 33 | Microlight flying (first-person perspective) | Blur, compression artifacts, misfocusing, packet loss |
| 34 | A man participating in parkour and freerunning | Compression artifacts, misfocusing, camera shaking, varying illumination |
| 35 | Three people singing in a car | Camera shaking, compression artifacts, blur |
| 36 | Street posing performance | Camera shaking, occlusion, blur |
| 37 | A puppy and a kitten playing roughly | Low frame rate, compression artifacts, blur, jerkiness |
| 38 | Exploring a dormitory building (first-person perspective) | Camera shaking, compression artifacts, misfocusing |
| 39 | Shopping people at a supermarket | Compression artifacts, captions, blur |
| 40 | A man working out in a park | Low frame rate, insufficient lighting, blur, varying illumination, poor color reproduction |
| 41 | People in a hotel lobby | Camera shaking, misfocusing, occlusion |
| 42 | Bike trick performance | Compression artifacts, blur, misfocusing, jerkiness |
| 43 | A man cooking and eating chicken | Camera shaking |
| 44 | A baby walking around with his toy cart at home | Vertical black sides, camera shaking, varying illumination |
| 45 | Men eating food | Camera shaking, misfocusing, compression artifacts |
| 46 | An old singing performance clip | Camera noise, poor color reproduction, compression artifacts, blur |
| 47 | A man doing sandsack training | Compression artifacts, jerkiness, camera noise |
| 48 | A series of short clips | Poor color reproduction, varying illumination, compression artifacts, camera noise |
| 49 | Two men practicing martial arts | Black line running, poor color reproduction, jerkiness, blur, packet loss |

Since the metadata collected in [19] were rather dated and limited, we collected more detailed and recent metadata for the videos using the YouTube Data API. For each video, the following metadata were gathered: the maximum spatial resolution, upload date, video length, video viewcount, video likes, video dislikes, video favorites, video comments, video description length, channel viewcount, channel dislikes, channel favorites, channel comments, channel description length, and the number of videos uploaded by the uploader. The metadata collection was conducted in March 2014. While collecting the recent metadata, we found that one video in the original dataset had been deleted; it was excluded from our experiment.
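For illustration, the snippet below sketches how such metadata can be retrieved with the YouTube Data API v3 via the google-api-python-client library. The API key and video ID are placeholders, and the exact set of fields gathered in the paper may differ from this sketch.

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"      # placeholder
VIDEO_ID = "SOME_VIDEO_ID"    # placeholder

youtube = build("youtube", "v3", developerKey=API_KEY)

# Video-level metadata: upload date, duration, and viewer statistics.
video = youtube.videos().list(
    part="snippet,contentDetails,statistics", id=VIDEO_ID
).execute()["items"][0]

meta = {
    "upload_date": video["snippet"]["publishedAt"],
    "duration": video["contentDetails"]["duration"],          # ISO 8601
    "viewcount": int(video["statistics"].get("viewCount", 0)),
    "likes": int(video["statistics"].get("likeCount", 0)),
    "comments": int(video["statistics"].get("commentCount", 0)),
    "description_length": len(video["snippet"]["description"]),
}

# Channel-level metadata of the uploader.
channel = youtube.channels().list(
    part="snippet,statistics", id=video["snippet"]["channelId"]
).execute()["items"][0]
meta.update({
    "subscribers": int(channel["statistics"].get("subscriberCount", 0)),
    "channel_viewcount": int(channel["statistics"].get("viewCount", 0)),
    "channel_videos": int(channel["statistics"].get("videoCount", 0)),
    "channel_description_length": len(channel["snippet"]["description"]),
})
```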
Table III presents the metadata statistics of the videos. It can be seen that the dataset covers a wide range of online user-generated videos from various viewpoints of production and consumption characteristics.

TABLE III: Range and position information of the metadata of the videos in the dataset.

| Metadata | Min | Max | Average | Median |
|---|---|---|---|---|
| Max resolution (height) | 144 | 1080 | - | 480 |
| Date (a) | 168 | 3086 | 1970 | 2304 |
| Duration (b) | 27 | 559 | 184.3 | 163 |
| #viewcount | 29 | 96372651 | 6949894 | 117220 |
| #like | 0 | 364131 | 28750 | 1458 |
| #dislike | 0 | 22426 | 1231 | 17 |
| #comment | 0 | 49015 | 5102 | 260 |
| Description length | 0 | 3541 | 279.9 | 100 |
| #subscribe | 0 | 1552053 | 105682 | 567 |
| #channel viewcount | 486 | 289704644 | 31341621 | 892815 |
| #channel comment | 0 | 19499 | 1691 | 10 |
| #channel video | 1 | 5390 | 225.6 | 27 |
| Channel description length | 0 | 911 | 91 | 0 |

(a) Days from the activation of YouTube (14 Feb. 2005) until the upload date. (b) In seconds.

The subjective ratings were collected in [19] based on the paired comparison methodology [20]. Subjects were recruited from Amazon Mechanical Turk. A web page showing two randomly selected videos from the dataset side by side was used for quality comparison, and subjects were asked to choose the video with better visual quality. Subjects had to play both videos before entering their ratings, to prevent cheating and false ratings. In total, 8,471 paired-comparison results were obtained. Each video was shown 332 times, and each pair was matched 6.78 times on average.

IV. GRAPH-BASED SUBJECTIVE DATA ANALYSIS

The subjective paired-comparison data form an adjacency matrix representing the set of edges of a graph G = (V, E). Here, V is the set of nodes corresponding to the videos, and E is the set of weighted directed edges, where each weight is the winning count, i.e., the number of times one video is preferred to another. Therefore, it is possible to apply graph theory to analyze the subjective data, with the aim of obtaining further insight into viewers' patterns of quality perception. In this section, two techniques are adopted: HodgeRank analysis and graph clustering.
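As a minimal sketch (assuming the raw judgments are available as a list of (winner, loser) index pairs, which is our own illustrative format), the paired-comparison graph can be accumulated into a winning-count matrix as follows.

```python
import numpy as np

def winning_count_matrix(pairs, n_videos):
    """M[i, j] = number of comparisons in which video i was
    preferred to video j (weighted directed edge from i to j)."""
    M = np.zeros((n_videos, n_videos))
    for winner, loser in pairs:
        M[winner, loser] += 1
    return M

# Toy example with three videos and five judgments.
pairs = [(0, 1), (0, 1), (1, 2), (2, 0), (0, 2)]
M = winning_count_matrix(pairs, 3)
w = M + M.T   # w[i, j]: total number of comparisons of the pair (i, j)
```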
A. HodgeRank Analysis

The HodgeRank framework introduced in [21] decomposes imbalanced and incomplete paired-comparison data into quality scores of the video stimuli and inconsistency of the subjects' judgments. In HodgeRank analysis, a statistical rank aggregation problem is posed, which finds the global score vector $s = [s_1, s_2, \ldots, s_n]$, where $n$ is the number of video stimuli, such that

$$\min_{s \in \mathbb{R}^n} \sum_{i,j} w_{ij} (s_i - s_j - \hat{Y}_{ij})^2 \qquad (1)$$

where $w_{ij}$ is the number of comparisons between stimuli $i$ and $j$, and $s_i$ and $s_j$ are the quality scores of stimuli $i$ and $j$, respectively, which are considered as mean opinion scores (MOS). $\hat{Y}_{ij}$ is the $(i,j)$th element of $\hat{Y}$, the subjective data matrix derived from the original paired-comparison graph $G$ by

$$\hat{Y}_{ij} = 2\hat{\pi}_{ij} - 1 \qquad (2)$$

Here, $\hat{\pi}_{ij}$ is the observed winning rate of stimulus $i$ against stimulus $j$, defined as

$$\hat{\pi}_{ij} = \frac{M_{ij}}{M_{ij} + M_{ji}} \qquad (3)$$

where $M_{ij}$ is the number of counts where stimulus $i$ is preferred to stimulus $j$. Note that $\hat{Y}$ is skew-symmetric.

The converted subjective data matrix $\hat{Y}$ can be uniquely decomposed into three components, which is called the HodgeRank decomposition:

$$\hat{Y} = \hat{Y}^g + \hat{Y}^c + \hat{Y}^h \qquad (4)$$

where $\hat{Y}^g$, $\hat{Y}^c$, and $\hat{Y}^h$ satisfy the following conditions:

$$\hat{Y}^g_{ij} = \hat{s}_i - \hat{s}_j \qquad (5)$$

$$\hat{Y}^h_{ij} + \hat{Y}^h_{jk} + \hat{Y}^h_{ki} = 0, \quad \text{for } (i,j), (j,k), (k,i) \in E \qquad (6)$$

$$\sum_{j \neq i} w_{ij} \hat{Y}^h_{ij} = 0, \quad \text{for each } i \in V \qquad (7)$$

where $\hat{s}_i$ and $\hat{s}_j$ are the estimated scores for stimuli $i$ and $j$, respectively. The global part $\hat{Y}^g$ determines the overall flow of the graph, which is formed by score differences. The curl part $\hat{Y}^c$ indicates the local (triangle) inconsistency, i.e., the situation where stimulus $i$ is preferred to stimulus $j$, stimulus $j$ is preferred to stimulus $k$, and stimulus $k$ is preferred to stimulus $i$, for different $i$, $j$, and $k$. The harmonic part $\hat{Y}^h$ represents the inconsistency caused by cyclic ties involving more than three nodes, which corresponds to global inconsistency.

[Fig. 1: Results of HodgeRank analysis. (a) Subjective data matrix $\hat{Y}$, (b) decomposed global part $\hat{Y}^g$, (c) decomposed curl part $\hat{Y}^c$, and (d) decomposed harmonic part $\hat{Y}^h$. Each axis represents the videos sorted by the global score of HodgeRank. In each panel, only the lower triangular part below the diagonal is shown because the matrices are skew-symmetric.]

Fig. 1 shows the results of the HodgeRank decomposition applied to the subjective data. The overall trend shown in Fig. 1(a) is that the total scores decrease (darker color) for elements closer to the lower left corner, showing that the perceived superiority of one video over another becomes clearer as their quality difference increases. This is reflected in the global part in Fig. 1(b), i.e., the absolute values of the matrix elements increase (darker color). However, Fig. 1(a) is also noisy, which corresponds to the inconsistency parts in Figs. 1(c) and 1(d). To quantify this, we define the ratio of total inconsistency as

$$\text{Total Inconsistency} = \frac{\|\hat{Y}^h\|_F^2}{\|\hat{Y}\|_F^2} + \frac{\|\hat{Y}^c\|_F^2}{\|\hat{Y}\|_F^2} \qquad (8)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix. The ratio of total inconsistency obtained for the subjective data is 67%. That is, the amount of inconsistency, including local and global inconsistency, is larger than that of the global flow of the graph. Between the two sources of inconsistency, the amount of the harmonic component is far smaller than that of the curl component, as can be seen from the scale of the color bar in Fig. 1(d). This implies that it is easy for human subjects to determine quality superiority between videos with significantly different ranks (i.e., quality scores), while determining preference between videos of similar rank is relatively difficult. This will be discussed further in Section VI.
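A minimal sketch of the rank-aggregation step in Eq. (1): the global scores can be obtained as a weighted least-squares solution over the comparison graph, since setting the gradient of Eq. (1) to zero yields a graph-Laplacian linear system. This is our own illustrative implementation (reusing the matrix M from the earlier sketch), not the authors' code.

```python
import numpy as np

def hodgerank_scores(M):
    """Global HodgeRank scores from a winning-count matrix M by solving
    the weighted least-squares problem of Eq. (1)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        pi = M / (M + M.T)             # observed winning rates, Eq. (3)
    Y = np.nan_to_num(2 * pi - 1)      # skew-symmetric data matrix, Eq. (2)
    w = M + M.T                        # comparison counts per pair
    L = np.diag(w.sum(axis=1)) - w     # graph Laplacian of the weights
    b = (w * Y).sum(axis=1)            # weighted net flow into each node
    s = np.linalg.pinv(L) @ b          # scores, defined up to a constant
    return s

s = hodgerank_scores(M)
print(np.argsort(-s) + 1)              # ranking of the videos by score
```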
B. Graph Clustering

The HodgeRank analysis showed that videos with similar MOS ranks are subjectively ambiguous in terms of quality. Therefore, one may hypothesize that the videos can be grouped in such a way that different groups have distinguishable quality differences, while videos within each group have similar quality. We attempt to examine whether this is the case and, if so, how many groups can be found via graph clustering.

We use the algorithm presented in [22], whose objective is to divide the whole graph, represented by the adjacency matrix $M$, into $l$ groups $U_1, U_2, \ldots, U_l$ by maximizing the modularity measure $Q$:

$$Q = \frac{1}{A} \sum_{p=1}^{l} \sum_{i,j \in U_p} \left( M_{ij} - \frac{d^{in}_i d^{out}_j}{A} \right) \qquad (9)$$

where $A = \sum_{i,j} M_{ij}$ is the sum of all edge weights in the graph, and $d^{in}_i = \sum_j M_{ij}$ and $d^{out}_i = \sum_j M_{ji}$ represent the in-degree and out-degree of the $i$th node, respectively.

This algorithm is based on the random walk with restart (RWR) model. It first computes the relevance matrix $R$, the estimated result matrix of the RWR model, from the transition matrix, which equals $M$:

$$R = (1 - \delta)(I - \delta \tilde{M})^{-1} \qquad (10)$$

where $\delta$ is the restart probability (an adjustable parameter of the RWR model), $I$ is the identity matrix, and $\tilde{M}$ is the column-normalized version of $M$.

The algorithm then consists of two steps. First, starting from an arbitrary node, it repeatedly adds the node with the largest single-compactness measure until the single compactness no longer increases. The single compactness of node $i$ with respect to a local cluster $G'$ is

$$C(i, G') = \frac{1}{B} \left( R_{ii} + \sum_{j \in G'} (R_{ij} + R_{ji}) - \frac{R_{i*} R_{*i}}{B} \right) \qquad (11)$$

where $B = \sum_{i,j} R_{ij}$ is the total sum of the elements of $R$, and $R_{i*} = \sum_k R_{ik}$ and $R_{*i} = \sum_k R_{ki}$ are the row and column sums of $R$, respectively. When the construction of a local cluster from a node is finished, the algorithm starts from another node that has not yet been assigned to any local cluster to build another one. This process is repeated until all nodes are assigned to local clusters. After constructing the local clusters, the algorithm greedily merges compact clusters so as to maximize the increase of the total modularity $Q$, until no further increase is possible, which yields the final clusters.
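As a small illustration of Eq. (9) (our own sketch, not the implementation of [22]), the modularity of a given assignment of nodes to clusters can be computed directly from the weighted directed adjacency matrix.

```python
import numpy as np

def modularity(M, labels):
    """Directed weighted modularity Q of a cluster assignment per
    Eq. (9); labels[i] is the cluster index of node i."""
    A = M.sum()
    d_in = M.sum(axis=1)     # row sums, following the paper's convention
    d_out = M.sum(axis=0)    # column sums
    Q = 0.0
    for p in np.unique(labels):
        idx = np.flatnonzero(labels == p)
        Q += M[np.ix_(idx, idx)].sum() \
             - np.outer(d_in[idx], d_out[idx]).sum() / A
    return Q / A

# Example split of the three-video toy graph from the earlier sketch.
labels = np.array([0, 0, 1])
print(modularity(M, labels))
```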
[Fig. 2: Graph clustering results (final modularity Q) with respect to the restart probability δ, with the number of final clusters (one or two) indicated.]

We apply the aforementioned algorithm to the subjective data graph for various restart probability values. Fig. 2 shows the final modularity with respect to the restart probability, ranging from 0.01 to 0.99. The number of final clusters differs as the restart probability varies. It can be seen that the graph is clustered into one or two groups in all cases. In particular, the results with Q = 1 correspond to the case in which all nodes in the graph are assigned to one cluster. That is, it is difficult to divide the nodes into groups with clearly distinguished subjective preference. Fig. 3 shows examples of final clustering results with high modularity. Two clusters are formed in these examples, and the cluster containing high-quality videos (marked in blue) is much bigger than that containing low-quality videos (marked in red). It appears that, in the video dataset used, discriminating quality between high- and medium-quality videos was difficult, whereas medium- and low-quality videos were more easily distinguished.

[Fig. 3: Examples of graph clustering results. Each video is represented by its ID, and the high (low)-quality group is marked in blue (red). (a) δ = 0.81 (Q = 0.9275), (b) δ = 0.15 (Q = 0.8503), (c) δ = 0.99 (Q = 0.8225).]

V. QUALITY ESTIMATION

In this section, we investigate the problem of the objective quality assessment of online videos. First, the performance of state-of-the-art objective quality metrics is evaluated. Second, quality estimation using metadata-driven metrics is investigated.

A. No-Reference Objective Metrics

The performance of three state-of-the-art NR quality metrics, namely V-BLIINDS, BRISQUE, and BLIINDS-II, is examined using the video and subjective data described in the previous section. Some videos have title scenes at the beginning (e.g., title text shown on a black background for a few seconds), which are excluded from quality evaluation because they are usually irrelevant for judging video quality.

The performance of the metrics is shown in Table IV. We adopt the Spearman rank-order correlation coefficient (SROCC) as the performance index because the relationship between the metrics' outputs and the MOS is nonlinear.

TABLE IV: SROCC between the MOS and V-BLIINDS, BRISQUE, and BLIINDS-II, respectively.

| NR Metric | SROCC |
|---|---|
| V-BLIINDS | 0.1406 |
| BRISQUE | 0.3364 |
| BLIINDS-II | 0.2383 |
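Computing the SROCC is straightforward with SciPy. The sketch below uses stand-in arrays for the per-video MOS values and a metric's outputs (illustrative names and data, not the paper's).

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
mos = rng.random(49)                        # stand-in for the 49 MOS values
metric_scores = mos + 0.5 * rng.random(49)  # stand-in metric outputs

rho, pvalue = spearmanr(mos, metric_scores)  # rank-order correlation
print(f"SROCC = {rho:.4f} (p = {pvalue:.3g})")
```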
Statistical test results reveal that only the SROCC of BRISQUE is statistically significant at a significance level of 0.05 (F = 6.00). Interestingly, V-BLIINDS, which is a video quality metric, appears to be inferior to BRISQUE and BLIINDS-II, which are image quality metrics, meaning that the way V-BLIINDS incorporates temporal information is not very effective for the user-generated videos. Overall, the performance of the metrics is far inferior to that reported in existing studies using other databases. The SROCC values of BLIINDS-II and BRISQUE on the LIVE IQA database, which contains images corrupted by blurring, compression, random noise, etc. [23], were as high as 0.9250 and 0.9395, respectively [24]. The SROCC of V-BLIINDS on the LIVE VQA database, which contains videos degraded by compression, packet loss, etc. [25], was 0.759 in [1]. This implies that the problems of online video quality assessment are very different from those of traditional quality assessment. The reasons why the NR metrics fail are discussed further in Section VI with examples.

B. Metadata-Driven Metrics

Metadata-driven metrics are defined as either the original values of the metadata listed in Table III or values obtained by combining them (e.g., #like divided by #view for normalization). Table V shows the performance of the metadata-driven metrics for quality prediction. The performance of the metrics varies significantly, from fairly high to almost no correlation with the MOS. It is worth noting that several metadata-driven metrics show better performance than the NR quality metrics. Generally, the video-specific metrics (e.g., the video viewcount and the number of video comments) perform better than the channel-specific metrics (e.g., the channel viewcount and the number of channel comments).

TABLE V: SROCC between the MOS and metadata-driven metrics, sorted in descending order of SROCC.

| Metric | SROCC |
|---|---|
| Description length | 0.5402 |
| #like / #view | 0.5347 |
| Max resolution | 0.5331 |
| #subscribe | 0.4694 |
| #subscribe / #channel video | 0.4408 |
| #like | 0.4307 |
| #dislike | 0.3910 |
| #channel viewcount | 0.3855 |
| #viewcount | 0.3728 |
| #comment | 0.3699 |
| #like / date | 0.3558 |
| Channel description length | 0.3482 |
| #channel viewcount / #channel video | 0.3061 |
| Date | 0.3042 |
| #view / date | 0.2727 |
| #channel comment | 0.2099 |
| #comment / #view | 0.1861 |
| #channel video | 0.1465 |
| #channel comment / #channel video | 0.1414 |
| Duration | -0.0533 |

The metadata-driven metric showing the highest SROCC is the description length. A possible explanation is that a video with good quality tends to contain significant visual and semantic information, and the uploader may want to provide a faithful, detailed description of it. The second-best metadata-driven metric is the ratio of the like count to the viewcount. This metric inherently contains information about the satisfaction of the viewers, which is closely related to the video quality. The third-ranked metadata-driven metric is the maximum spatial resolution. The availability of a high-resolution version of a video means that it was captured with a high-performance camera or that it did not undergo video processing that would reduce the spatial resolution and possibly degrade video quality. The number of subscribers of an uploader, which indicates the popularity of the uploader, is ranked fourth. A popular uploader's videos are also popular, and their visual quality plays an important role. Other metrics related to video popularity, including the numbers of likes and dislikes, the viewcount, and the number of comments, show significant but moderate correlations with the MOS. This indicates that popularity is a meaningful but not perfect predictor of quality.
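The derived metrics in Table V are simple combinations of raw metadata fields. The sketch below (reusing the illustrative meta dictionary from the earlier API example; the field names are our own) shows how a few of them could be computed.

```python
# Derived metadata-driven metrics in the spirit of Table V. `meta` is the
# illustrative dictionary from the earlier API sketch; `date` is the video's
# age anchor used in the paper (days from 14 Feb. 2005 to the upload date).
date = 2304  # example value

derived = {
    "like_per_view": meta["likes"] / max(meta["viewcount"], 1),
    "subs_per_channel_video": meta["subscribers"] / max(meta["channel_videos"], 1),
    "like_per_date": meta["likes"] / max(date, 1),
    "view_per_date": meta["viewcount"] / max(date, 1),
    "comment_per_view": meta["comments"] / max(meta["viewcount"], 1),
}
```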
The age of a video ("Date" in the table) has a moderate correlation with perceptual quality, which means that newer videos tend to have better quality. This can be understood by considering that recent recording devices usually produce better-quality videos than old devices, and that users are more experienced in video production than before.

Apparently, each of the metadata-driven metrics represents a distinct partial characteristic of the videos. Therefore, it can be expected that combining multiple metrics will yield improved results due to their complementarity. For the integration of the metrics, we employ two techniques: linear regression and nonlinear support vector regression (SVR) with Gaussian kernels [26]. The former is written as

$$f_{Linear}(x) = a^T x + b_{Linear} \qquad (12)$$

where $x$ is a vector composed of metadata-driven metrics, and $a$ and $b_{Linear}$ are tunable parameters of the linear regression model. The latter is given as

$$f_{SVR}(x) = \sum_{i=1}^{v} (\alpha_i - \alpha^*_i) \phi(x_i, x) + b_{SVR} \qquad (13)$$

where $\alpha_i$, $\alpha^*_i$, $x_i$ ($i = 1, 2, \ldots, v$), and $b_{SVR}$ are parameters of the SVR model, and $\phi(x_i, x)$ is a Gaussian kernel expressed by

$$\phi(x_i, x) = \exp\left( -\frac{\|x_i - x\|^2}{2\sigma^2} \right) \qquad (14)$$

where $\sigma^2$ is the variance of the Gaussian kernel [27].

We train the models by changing the number of metadata-driven metrics, i.e., the length of the input vector $x$ in (12) and (13), adding metrics one by one in descending order of SROCC in Table V. The models are trained and evaluated using all the data, without separating training and test data, to obtain insight into the performance potential of the integrated models for quality estimation. In addition, to further incorporate the information of the visual signal within the regression models, we also test models combining the metadata-driven metrics with V-BLIINDS, BRISQUE, and BLIINDS-II, respectively, to examine the synergy between the two modalities (i.e., the visual signal and metadata).
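A minimal sketch of this integration using scikit-learn (our own illustration; the paper does not specify its implementation): features is assumed to be a (videos x metrics) array with columns ordered by the single-metric SROCC of Table V, and stand-in data are used here.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
features = rng.random((49, 20))   # stand-in metadata-driven metrics
mos = rng.random(49)              # stand-in MOS values

for k in range(1, features.shape[1] + 1):
    X = StandardScaler().fit_transform(features[:, :k])  # top-k metrics
    lin = LinearRegression().fit(X, mos).predict(X)
    svr = SVR(kernel="rbf").fit(X, mos).predict(X)       # Gaussian kernel
    rho_lin, _ = spearmanr(mos, lin)
    rho_svr, _ = spearmanr(mos, svr)
    print(f"k={k:2d}  linear SROCC={rho_lin:.3f}  SVR SROCC={rho_svr:.3f}")
```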
[Fig. 4: SROCC of (a) linear regression and (b) SVR models using metadata and objective quality assessment algorithms, as a function of the number of metadata-driven metrics.]

Fig. 4 shows the results of the linear regression and SVR model evaluation. In both cases, the performance in terms of SROCC improves as the number of metadata-driven metrics increases. When 20 metadata-driven metrics are used, the SROCC reaches nearly 0.8 for linear regression and 0.85 for SVR, a significant improvement over the highest SROCC achieved by a single metric (0.54 by the description length). The effect of incorporating an NR objective quality metric is negligible (for linear regression) or marginal (for SVR). Such a limited contribution of the video signal-based metrics seems to be due to their limited quality predictability, as shown in Table IV.

To compare the quality prediction performance of metadata and NR quality metrics in more depth, we analyze the differences between the predicted ranks and the MOS ranks. Fig. 5 shows histograms of the rank differences between the MOS and the quality scores estimated by the NR objective quality metrics or by the SVR model combining 20 metadata-driven metrics. The SVR model of metadata shows smaller rank differences than the objective quality metrics. An ANOVA test confirms that the mean locations of the rank differences for the four cases are significantly different at a significance level of 0.01 (F = 10.84).

[Fig. 5: Rank differences between estimated quality scores and the MOS. (a) V-BLIINDS (mean: 15.02), (b) BRISQUE (mean: 12.98), (c) BLIINDS-II (mean: 14.00), and (d) the SVR model with 20 metadata-driven metrics (mean: 5.10).]

Additionally, Duncan's multiple range tests reveal that the mean location of the rank differences of the SVR-based metadata model is significantly different from those of the signal-based metrics at a significance level of 0.01. These results demonstrate that metadata-based quality prediction is a powerful way of dealing with the online video quality assessment problem, overcoming the limitations of conventional objective metrics based on the visual signal.

VI. DISCUSSION

In the previous sections, it was shown that the quality evaluation of online user-generated videos is not easy, both subjectively and objectively. In this section, we discuss these issues in more detail with representative examples.

In Section IV, it was observed that viewers have difficulty determining quality superiority among certain videos. As shown in Table I, there are many different factors of quality degradation in user-generated videos. In many cases, therefore, users are required to compare quality across different factors, which is not only difficult but also highly subjective, depending on personal taste. For example, video #15 has jerkiness and misfocusing, video #16 has blur, and video #17 has camera shaking. In the subjective test results, video #16 is preferred to video #15, video #17 is less preferred than video #15, and video #17 is preferred to video #16. As a result, the match results among these three videos form a triangular triad, which contributes to the local inconsistency shown in Fig. 1(c).

In Section V, we showed that the state-of-the-art NR objective metrics fail to predict the perceptual quality of user-generated videos. This is largely because the quality degradation conditions targeted during the development of the metrics are significantly different from those involved in user-generated videos. When the NR metrics were developed, it was normally assumed that the original video sequences have perfect quality. However, as shown in Table I, user-generated videos are already subject to various types of quality degradation in the production stage (e.g., insufficient lighting, hand shaking). Furthermore, NR metrics are usually optimized only for a limited number of typical quality factors, such as compression artifacts, packet loss, blurring, and random noise, while many more quality factors can be involved during editing, processing, and distribution, some of which are even related to aesthetic aspects. Editing effects are particularly difficult for the NR metrics to assess, not only because many of them are not considered by the metrics, but also because some of them may be wrongly treated as artifacts.
Videos #1 and #2 are examples containing distinctive editing effects in the temporal domain. They are a fast-playing video and a time-lapse video, respectively, which are interesting to viewers and are thus ranked 1st and 2nd in the MOS. However, V-BLIINDS regards them as having poor motion consistency and gives them undesirably low ranks (40th and 44th, respectively).

In Section V-B, metadata were shown to be useful for extracting quality information, achieving better quality evaluation performance than the NR objective metrics. A limitation of metadata is that some are sensitive to popularity, users' preference, or other video-unrelated factors, which may not perfectly coincide with the perceived quality of videos. Video #25 is such an example. It is a music video made by a fan of a musician. It has moderate visual quality and thus is ranked 25th in the MOS. However, since the main content of this video is music, its popularity (the viewcount, the numbers of likes and comments, etc.) is mainly determined by the audio content rather than the visual quality. Moreover, it has the longest video description in the dataset (listing the former works of the musician), according to which it would be ranked 14th (note that the description length is the best-performing metric in Table V).

A way to alleviate the limitation of each metadata-driven metric is to combine several metadata-driven metrics so that they compensate for one another's limited information, which was shown to be effective in our results. For instance, the available maximum resolution was shown to be highly correlated with the MOS in Table V, so video #29 would be ranked highly, since a high-resolution (1080p) version of this video is available. However, it is ranked only 29th in the MOS due to compression artifacts and packet loss artifacts. When multiple metadata-driven metrics are combined using SVR, the rank of the video becomes 24th, which is much closer to its MOS rank.

VII. CONCLUSION

We have presented our work investigating the issue of the subjective and objective visual quality assessment of online user-generated video content. First, we examined users' patterns of quality evaluation of online user-generated videos via the HodgeRank decomposition and graph clustering techniques. A large amount of local inconsistency in the paired-comparison results was found by the HodgeRank analysis, which implies that it is difficult for human viewers to determine quality superiority between videos ranked similarly in the MOS, mainly due to the difficulty of comparing quality across different factors. Consequently, subjective distinction between different quality levels is clear only at a large cluster level, as shown by the graph clustering results. We then benchmarked the performance of existing state-of-the-art NR objective metrics and explored the potential of metadata-driven metrics for the quality estimation of user-generated video content. It was shown that the existing NR metrics do not yield satisfactory performance, whereas metadata-driven metrics perform significantly better. In particular, as each metadata-driven metric covers only limited information on visual quality, combining them significantly improved the performance. Finally, based on the results and examples, we provided a detailed discussion of why the subjective and objective quality assessment of user-generated videos is difficult.
Our results demonstrated that the problem of the quality assessment of user-generated videos is very different from the conventional video quality assessment problem dealt with in prior work. At the same time, our results have significant implications for future research. The existence of diverse quality factors in user-generated videos and the failure of existing metrics suggest that the problem of user-generated video quality assessment is too large to be conquered by a single metric; thus, metrics specialized to different factors should be applied separately (and then combined later, if needed). Since many factors, such as editing effects, are not covered by existing metrics, developing reliable metrics for them would be necessary. Moreover, the problem is highly subjective and depends on personal taste, so personalized quality assessment may be an effective approach in the future; proper ways to collect personalized data as ground truth would then be required, where big data analysis techniques may be helpful. While our results are for a dataset with a limited number of video sequences, it is still reasonable to expect that most of them, particularly those related to how user-generated videos differ from professional ones, apply generally, although to varying degrees. Nevertheless, larger-scale experiments with more videos having more diverse characteristics will be desirable in the future.

REFERENCES

[1] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind prediction of natural video quality," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352-1365, 2014.
[2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
[3] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, vol. 2, 2003, pp. 1398-1402.
[4] E. C. Larson and D. M. Chandler, "Most apparent distortion: Full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006-1-011006-21, 2010.
[5] H. R. Sheikh and A. C. Bovik, "A visual information fidelity approach to video quality assessment," in Proceedings of the 1st International Workshop on Video Processing and Quality Metrics for Consumer Electronics, 2005, pp. 23-25.
[6] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335-350, 2010.
[7] P. V. Vu, C. T. Vu, and D. M. Chandler, "A spatiotemporal most-apparent-distortion model for video quality assessment," in Proceedings of the 18th IEEE International Conference on Image Processing, 2011, pp. 2505-2508.
[8] R. Soundararajan and A. C. Bovik, "Video quality assessment by reduced reference spatio-temporal entropic differencing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 4, pp. 684-694, 2013.
[9] M. H. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312-322, 2004.
[10] M. A. Saad, A. C. Bovik, and C. Charrier, "Blind image quality assessment: A natural scene statistics approach in the DCT domain," IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339-3352, 2012.
[11] A. Mittal, A. K. Moorthy, and A. C. Bovik, "No-reference image quality assessment in the spatial domain," IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, 2012.
[12] S. Wilk and W. Effelsberg, "The influence of camera shakes, harmful occlusions and camera misalignment on the perceived quality in user generated video," in Proceedings of the IEEE International Conference on Multimedia and Expo, 2014, pp. 1-6.
[13] T. Hoßfeld, R. Schatz, and U. R. Krieger, "QoE of YouTube video streaming for current internet transport protocols," in Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Springer, 2014, pp. 136-150.
[14] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston et al., "The YouTube video recommendation system," in Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 293-296.
[15] M. Cha, H. Kwak, P. Rodriguez, Y.-Y. Ahn, and S. Moon, "Analyzing the video popularity characteristics of large-scale user generated content systems," IEEE/ACM Transactions on Networking, vol. 17, no. 5, pp. 1357-1370, 2009.
[16] F. Figueiredo, J. M. Almeida, M. A. Gonçalves, and F. Benevenuto, "On the dynamics of social media popularity: A YouTube case study," ACM Transactions on Internet Technology, vol. 14, no. 4, pp. 24:1-24:23, 2014.
[17] T. Xia, T. Mei, G. Hua, Y.-D. Zhang, and X.-S. Hua, "Visual quality assessment for web videos," Journal of Visual Communication and Image Representation, vol. 21, no. 8, pp. 826-837, 2010.
[18] Y. Yang, X. Wang, T. Guan, J. Shen, and L. Yu, "A multi-dimensional image quality prediction model for user-generated images in social networks," Information Sciences, vol. 281, pp. 601-610, 2014.
[19] C.-H. Han and J.-S. Lee, "Quality assessment of on-line videos using metadata," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 1385-1388.
[20] J.-S. Lee, "On designing paired comparison experiments for subjective multimedia quality assessment," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 564-571, 2014.
[21] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao, "HodgeRank on random graphs for subjective video quality assessment," IEEE Transactions on Multimedia, vol. 14, no. 3, pp. 844-857, 2012.
[22] D. Duan, Y. Li, Y. Jin, and Z. Lu, "Community mining on dynamic weighted directed graphs," in Proceedings of the 1st ACM International Workshop on Complex Networks Meet Information & Knowledge Management, 2009, pp. 11-18.
[23] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440-3451, 2006.
[24] K. Gu, G. Zhai, X. Yang, and W. Zhang, "Using free energy principle for blind image quality assessment," IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50-63, 2015.
[25] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427-1441, 2010.
[26] A. Smola and V. Vapnik, "Support vector regression machines," Advances in Neural Information Processing Systems, vol. 9, pp. 155-161, 1997.
[27] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.

Soobeom Jang received the B.S. degree from the School of Integrated Technology, Yonsei University, Seoul, Korea, in 2014. He is currently pursuing the Ph.D. degree at the School of Integrated Technology, Yonsei University. His research interests include social multimedia analysis and multimedia applications.

Jong-Seok Lee (M'06-SM'14) received his Ph.D. degree in electrical engineering and computer science in 2006 from KAIST, Korea, where he also worked as a postdoctoral researcher and an adjunct professor. From 2008 to 2011, he worked as a research scientist at the Swiss Federal Institute of Technology in Lausanne (EPFL), Switzerland. Currently, he is an associate professor in the School of Integrated Technology at Yonsei University, Korea. His research interests include multimedia signal processing and machine learning. He is an author or co-author of about 100 publications. He serves as an Area Editor for Signal Processing: Image Communication.
