Data Analysis in Multimedia Quality Assessment: Revisiting the Statistical Tests
Authors: Manish Narwaria, Lukas Krasula, Patrick Le Callet
Abstract—Assessment of multimedia quality relies heavily on subjective assessment, and is typically done by human subjects in the form of preferences or continuous ratings. Such data is crucial for analysis of different multimedia processing algorithms as well as validation of objective (computational) methods for the said purpose. To that end, statistical testing provides a theoretical framework towards drawing meaningful inferences, and making well-grounded conclusions and recommendations. While parametric tests (such as the t-test, ANOVA, and error estimates like confidence intervals) are popular and widely used in the community, there appears to be a certain degree of confusion in the application of such tests. Specifically, the assumptions of normality and homogeneity of variance are often not well understood. Therefore, the main goal of this paper is to revisit them from a theoretical perspective and in the process provide useful insights into their practical implications. Experimental results on both simulated and real data are presented to support the arguments made. A software implementing the said recommendations is also made publicly available, in order to achieve the goal of reproducible research.

I. INTRODUCTION

The growth of low-cost devices has virtually made multimedia signals an integral part of our daily lives. Today's end users are constantly interacting with multimedia and are more demanding in terms of their multimedia experience, and perceptual quality is one of the intrinsic factors affecting such interaction. As a result, assessment of perceptual quality is an important aspect in today's multimedia communication systems [1].
The most reliable way of quality estimation typically involves the use of a panel of human subjects who provide ratings/preferences for the targeted multimedia content [1], [2]. This is referred to as subjective assessment. In contrast, objective estimation of quality relies on the use of computational (mathematical) models [3] that are expected to mimic subjective perception.

Parametric statistical tests find extensive application in multimedia quality estimation mainly for two purposes. First, they are used to compare and analyze subjective data collected from human participants. For instance, a t-test can be used to compare Mean Opinion Scores (MOS) from two different conditions in a variety of applications (e.g., analyzing codec performance [4], investigating the effect of upscalers on video quality [5], studying optimization criteria in HDR tone mapping [6], and so on). Analysis of Variance (ANOVA) is also a commonly used technique for analyzing the effect of two or more factors/treatment levels and their interactions. These include identifying audiovisual interactions [7], examining the impact of reflections in HDR video tone mapping [8], investigating the effect of resolution, bit rate and color space on underwater videos [9], studying the possible impact of compression level and type of content on perceptual quality towards finding optimal presentation duration in subjective quality assessment [10], etc. Second, these tests are used to validate objective (computational) methods against subjective data.

[Affiliations: Manish Narwaria is with Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, Gujarat, 382007, India. Lukáš Krasula and Patrick Le Callet are with the LS2N/IPI group, University of Nantes, 44306, France. E-mail: manish_narwaria@daiict.ac.in, lukas.krasula@univ-nantes.fr, patrick.lecallet@univ-nantes.fr.]
This can in turn be used to statistically compare several objective methods in terms of their prediction accuracies with respect to the subjective data. Such validation studies are obviously central to benchmarking objective methods before they can be deployed in practice.

The need for statistical testing arises from the fact that subjective studies use a finite sample of human subjects. Therefore, these tests can help in generalizing and making inferences about the population. For that purpose, parametric tests such as the t-test, F-test, ANOVA, and error estimation (e.g., using confidence intervals) are widely used in the community. While the application of parametric tests is generally straightforward (aided by the availability of numerous software packages), the interpretation of the results requires some care. In particular, statistical tests are in many cases simply treated as black boxes, and are applied without considering the practical implications of the assumptions in these tests.

As the name implies, such tests are based on a priori knowledge of parameterizable probability distribution functions (e.g., the t-distribution and the F-distribution, which are respectively characterized by one and two degrees of freedom). While it is true that parametric tests are distribution dependent (as opposed to non-parametric tests, which are sometimes referred to as being distribution-free), there appears to be some confusion regarding the assumptions made in these tests. In particular, the assumptions of normality and homogeneity of variance in many cases appear to be not well understood, for both subjective and objective data analysis. In practice, these assumptions are sometimes considered as bottlenecks in applying parametric tests. As a result, non-parametric tests are recommended if the data violates one or both of the assumptions.
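To make the first use case concrete, the following sketch compares opinion scores for two conditions with both the pooled-variance t-test and its unpooled (Welch) variant via SciPy's `ttest_ind`. The scores and condition labels are invented for illustration only; they are not data from this paper.

```python
# Hypothetical example: individual opinion scores (1-5 scale) for two
# conditions of a video codec. The scores below are invented for
# illustration and are not data from any actual subjective test.
import numpy as np
from scipy import stats

scores_a = np.array([4, 5, 4, 4, 5, 4, 5, 4])  # condition A
scores_b = np.array([2, 3, 2, 3, 2, 2, 3, 3])  # condition B

# Pooled-variance t-test (assumes equal population variances)
pooled = stats.ttest_ind(scores_a, scores_b, equal_var=True)

# Welch (unpooled) t-test (drops the equal-variance assumption)
welch = stats.ttest_ind(scores_a, scores_b, equal_var=False)

print(f"MOS A = {scores_a.mean():.2f}, MOS B = {scores_b.mean():.2f}")
print(f"pooled : t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"welch  : t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

With equal sample sizes the two statistics coincide numerically and only the degrees of freedom differ, a point that becomes relevant in the discussion of balanced designs in Section III.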
A typical approach to applying parametric statistical tests is depicted in the left flow diagram in Figure 1, and consists of arriving at one of three decisions D1, D2 or D3:

• D1: Normality checks (e.g., the JB test or the K-S test) are applied to examine whether the given subjective/objective data is normal. If such normality checks determine the data to be non-normal, then non-parametric tests are carried out.

• D2: If the normality test determines the data to be normal, then homogeneity of variance is tested by applying a test of variance (e.g., Levene's test, the F-test, etc.). If the groups/samples do not satisfy the said assumption, then modified tests (e.g., the unpooled t-test), which do not use the pooled variance in computing the test statistic, are applied.

• D3: If the data satisfies both assumptions of normality and homogeneity of variance, then the usual t-test or ANOVA (which employ the pooled variance) is applied.

[Fig. 1: Typical procedure of applying parametric tests (the left flow chart) and the recommended approach (right flow diagram). In the left chart, the data is first checked for normality (JB test, K-S test, ...) and then for homogeneity of variance (F-test, ...), leading to non-parametric tests, the unpooled t-test, or the pooled t-test/ANOVA. In the recommended chart (balanced design recommended), one instead asks whether the mean (MOS) is a useful summary statistic (otherwise using other measures such as SOS or PDU), whether the group variances are similar (empirical rule), and whether an unequal variance condition is practically reasonable (expected). The drawbacks associated with making decisions D1, D2 or D3 are discussed in Sections II and III. Figure best viewed in color.]

In this paper, we seek to draw attention to a few drawbacks associated with such decisions.
Specifically, we revisit theoretical formulations and the resultant practical implications to highlight shortcomings, and recommend an alternative approach (right flow diagram in Figure 1) in the light of the said assumptions. We emphasize that these assumptions should not be viewed as constraints or bottlenecks in the application of parametric tests. Instead, they should be carefully considered and understood in the context of their practical implications. Subsequently, we provide a set of recommendations to ameliorate some of the drawbacks that may stem from either wrong interpretation or wrong application of the said assumptions in parametric testing. A software implementing the said recommendations is also made publicly available*, in order to achieve the goal of reproducible research.

The remainder of the paper is organized as follows. In Section II we analyze the distributional assumptions in parametric tests. Section III provides an analysis of the assumption of homogeneity of variance. Section IV points out the practical implications in the context of multimedia quality assessment. In Section V we present the experimental results and analysis, while Section VI lists a set of recommendations towards proper use of parametric testing in the context of the said assumptions. We provide concluding thoughts in Section VII.

* https://sites.google.com/site/narwariam/home/research

II. REVISITING DISTRIBUTIONAL ASSUMPTIONS IN PARAMETRIC TESTS

Parametric tests require certain assumptions, including the assumption of normality, homogeneity of variance and data independence. As highlighted in the left flow diagram in Figure 1, normality checks have usually been applied on subjective or objective data [2], [3], [4], [11]. Such use of normality checks indicates that the assumption of normality is, in many cases, misunderstood to be applicable to the data on which statistical tests are to be carried out.
This is, however, incorrect in the light of the fact that all parametric tests essentially work by locating the observed test statistic on a known probability distribution function. Then, depending on the desired significance level and the location of the test statistic, one typically accepts or rejects the null hypothesis. For example, in the t-test, the t-statistic is first computed from the observed sample. This t-statistic is then compared with values from a t-distribution (corresponding to the particular degrees of freedom). In other words, the computed test statistic (t-statistic, F-statistic, etc.) is assumed to follow the corresponding distribution (the t-distribution in the t-test, the F-distribution in the F-test and ANOVA, etc.).

Thus, the more appropriate question to be asked in parametric testing is whether the test statistic follows the assumed distribution (rather than whether the data is normally distributed). The answer to such a question requires that the subjective (or objective) test be repeated a large number of times, each time using a different sample (both in terms of human subjects and content). Then, in each instance, the test statistic can be computed to obtain its sampling distribution. This process is, however, neither practical for obvious reasons nor desirable. Instead, one can rely on the fundamental central limit theorem (CLT). Informally, the CLT states that the sampling distribution of the arithmetic mean (and sum) will approach a normal distribution as the sample size increases, regardless of the underlying population distribution [12]. It is due to this result that the test statistics in parametric tests are guaranteed to follow the assumed distributions, provided that the sample size is large enough (approaching infinity in theory).

We begin by considering two populations p1 and p2 with means µ1 and µ2 and variances σ1² and σ2², respectively.
In the context of multimedia quality assessment, these populations will typically represent the collection of subjective (or objective) opinion scores for two conditions (e.g., subjective or objective quality scores for two profiles of a video codec, individual quality scores for audiovisual content corresponding to two parameter settings, quality scores for content rendered by two depth-image-based rendering methods, individual quality scores for two tone-mapped HDR videos, and so on) for which we need to compare the mean quality scores, i.e., µ1 and µ2. Assume that p1 and p2 are sampled, i.e., subjective or objective assessment is actually performed on a set of content using a sample of human subjects or using objective methods. Let the corresponding samples be denoted by x1 = [x11, ..., x1n1] and x2 = [x21, ..., x2n2], where n1 and n2 are the sample sizes, and the sample observations are assumed to be independent and identically distributed (iid) random variables. Note that there are no assumptions regarding the distribution of either the populations (p1 and p2) or the corresponding samples (x1 and x2).

A. Sampling distribution of the test statistic in the t-test

Let x̄1, x̄2 and s1², s2² denote the sample means and variances, respectively. The goal of the analysis is to infer whether µ1 = µ2 (the null hypothesis) or not. To that end, one can employ the t-test. To define the t-statistic, we use the result from the CLT, i.e.,

\bar{x}_1 \sim \mathcal{N}\!\left(\mu_1, \frac{\sigma_1}{\sqrt{n_1}}\right), \qquad \bar{x}_2 \sim \mathcal{N}\!\left(\mu_2, \frac{\sigma_2}{\sqrt{n_2}}\right)    (1)

Then, the difference between the sample means will also be normally distributed, i.e.,

\bar{x}_1 - \bar{x}_2 \sim \mathcal{N}\!\left(\mu_1 - \mu_2, \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right)    (2)

By standardization, we have

\frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim \mathcal{N}(0, 1)    (3)

Note that in eq. (3) only the numerator is a random variable, while the denominator is constant. However, in practice, the population variance is generally not known.
We therefore need to use the sample variance as an unbiased estimator of the population variance. To proceed further, we consider two cases for defining the null hypothesis.

1) Case 1: Samples drawn from the same population: We can define the null hypothesis as H0: the two samples are taken from the same population. This implies that not only are we assuming the population means to be equal, but other population parameters, including the variances, are also equal. Thus, we have µ1 = µ2 and σ1² = σ2² = σ² (say). In order to obtain a more accurate estimate of the (common) population variance, we can employ the pooled variance s_p², which is defined as

s_p^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{(n_1 - 1) + (n_2 - 1)}    (4)

Thus, under H0, the denominator in eq. (3) can be modified accordingly and the t-statistic defined as

t_{pooled} = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad df_{pooled} = n_1 + n_2 - 2    (5)

With the said modification, the reader will note that the denominator in eq. (5) is also a random variable, unlike in eq. (3) where it was a constant. Thus, t_pooled is a ratio of two random variables. The numerator is the difference of two independent normally distributed random variables (x̄1 and x̄2), and will therefore be normally distributed [13]. Further, the squared denominator equals s_p²/n1 + s_p²/n2, which estimates the variance of the said normal distribution in the numerator and, when suitably scaled, is chi-squared distributed [13]. Accordingly, the test statistic t_pooled is characterized by the ratio of a normally distributed variable and the square root of a (scaled) chi-squared distributed variable. It will therefore be approximately† distributed according to the t-distribution [13] with df_pooled = n1 + n2 − 2 degrees of freedom, and this holds irrespective of the distribution of either the populations (p1 and p2) or the corresponding samples (x1 and x2).
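The distribution-free nature of t_pooled can be checked with a small simulation under the null hypothesis: draw both samples from the same markedly non-normal population, compute t_pooled as in eq. (5), and compare the empirical rejection rate against the nominal level implied by the t-distribution. The exponential population, sample sizes and number of repetitions below are arbitrary illustrative choices.

```python
# Simulation sketch: the sampling distribution of t_pooled is well
# approximated by the t-distribution even when the populations are heavily
# skewed (here exponential). Population and sizes are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n1 = n2 = 25          # balanced design
reps = 20000          # number of simulated experiments

t_vals = np.empty(reps)
for i in range(reps):
    # Two samples from the SAME skewed population (H0 is true)
    x1 = rng.exponential(scale=2.0, size=n1)
    x2 = rng.exponential(scale=2.0, size=n2)
    sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
    t_vals[i] = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Empirical two-sided rejection rate at alpha = 0.05 should be close to 0.05
crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
alpha_hat = np.mean(np.abs(t_vals) > crit)
print(f"empirical Type I error rate: {alpha_hat:.3f}")
```

If the t-distribution were a poor approximation here, the empirical rejection rate would drift noticeably away from the nominal 0.05 level.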
2) Case 2: Samples drawn from two different populations with the same population mean: In the second case, we assume that the two samples have been drawn from two different populations with the same population mean, i.e., µ1 = µ2 (but σ1² ≠ σ2²). Hence, other population parameters, such as the variance or any other statistic, need not be equal. Then, we can use the sample variances as estimates of the two population variances, and under the null hypothesis, eq. (3) can be modified to obtain the following test statistic:

t_{unpooled} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad df_{unpooled} = \frac{\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)^2}{\frac{\sigma_1^4}{n_1^2 (n_1 - 1)} + \frac{\sigma_2^4}{n_2^2 (n_2 - 1)}}    (6)

In practice, we use s1² and s2² to compute df_unpooled in eq. (6) because σ1² and σ2² are not known. We will discuss the two cases in Section III.

† In theory, the sample size should tend to infinity for the sample means to be normally distributed according to the CLT. However, in practice, smaller sample sizes allow us to approximate the assumption of normality, regardless of the population or sample distribution.

B. The case of ANOVA and the F-test

The sampling distribution of the test statistic (F) in the F-test (ANOVA also relies on the F-test) is assumed to follow the F-distribution [13]. It can be shown that this assumption is valid irrespective of the data distribution, with the same caveat concerning the CLT mentioned in the previous sub-section. Before doing that, we assume that there are k groups, each with n_i observations (let the total number of observations be denoted by M = \sum_{i=1}^{k} n_i), and define the mean x̄_i and variance s_i² of the i-th group, and the grand mean X̄, as

\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \qquad s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, \qquad \bar{X} = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}}{\sum_{i=1}^{k} n_i}    (7)

The F-statistic in ANOVA is defined as the ratio of the inter-group (i.e., between groups) and intra-group (i.e., within each group) variations.
We denote these quantities by SS_B and SS_W, respectively, with the corresponding degrees of freedom being df_B and df_W. Then, the F-statistic is computed as

F = \frac{SS_B / df_B}{SS_W / df_W} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{X})^2 / (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (M - k)}    (8)

By noting that the denominator in eq. (8) is essentially a weighted sum of the individual group variances, we can view the F-statistic as

F = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{X})^2 / (k - 1)}{\frac{n_1 s_1^2 + n_2 s_2^2 + \dots + n_k s_k^2}{(n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1)}}    (9)

One can see that the numerator in eq. (9) is a weighted sum of squared differences of normally distributed variables (x̄_i and X̄), and will thus be chi-squared distributed. The denominator can be seen to be very similar to the pooled variance used in eq. (4), and will be chi-squared distributed following similar arguments. It follows that F is a ratio of two chi-squared distributed random variables, which in turn implies that it will be approximately distributed according to the F-distribution (with k − 1 and M − k degrees of freedom). Once again, this is independent of the distribution of the population or the groups, and relies only on the approximations related to sample size as required in the CLT.

C. Data normality checks: are they required?

As discussed in the previous sub-sections, the CLT, being a theoretical result, only provides an asymptotic approximation: as the sample size tends to infinity, the sampling distribution of the mean tends to be normally distributed, and this holds irrespective of the sample or population distribution [12]. Note that the CLT does not specify any sample size above which the said sampling distribution will be normal. In practice, smaller sample sizes are generally sufficient to allow reasonable approximations. For instance, in the context of subjective quality assessment, Ref. [14] recommends a minimum of 15 subjects, while the authors in [15] suggested using at least 24 subjects for audiovisual quality measurement. Because the sampling distribution of the mean is directly or indirectly used in computing test statistics such as t, F, etc., there is no requirement of normality (or any other distribution) on the data to be analyzed. It is, therefore, not surprising that previous works [2], [16], [17] have noted that parametric tests such as ANOVA are robust to non-normal data distributions, and that the focus on distributional assumptions in these tests is not required [18].

The second theoretical argument against the application of normality checks before conducting parametric tests is the inflation of the Type I error probability. A commonly adopted strategy is to first check whether the given sample/data is normally distributed or not. To that end, normality tests such as the Kolmogorov-Smirnov (K-S) test, the Jarque-Bera test, the Shapiro-Wilk test, etc. are popular. If the tests determine that the given data is normally distributed, then a parametric test is used. Otherwise, a non-parametric test is performed. As a result of this two-step process, there will be an increase in the Type I error probability. Assume that H0*: the given data is normally distributed (the null hypothesis in a normality test), and let H0 be the null hypothesis of the test that will follow. Then, the probability of rejecting H0 can be written as the sum of mutually exclusive events, i.e.,

P(\text{reject } H_0) = P(\text{reject } H_0 \text{ and not reject } H_0^*) + P(\text{reject } H_0 \text{ and reject } H_0^*)    (10)

In the above equation, the first term on the right-hand side corresponds to the case of using a parametric test, while the second term corresponds to the use of a suitable non-parametric test.
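The two-step procedure behind eq. (10) can be emulated numerically: pre-test each sample for normality (here with the Shapiro-Wilk test), then branch to a t-test or a Mann-Whitney U test. The skewed population, sample size and significance level are illustrative assumptions; the sketch merely estimates the overall probability of rejecting H0 when it is true.

```python
# Numerical sketch of the two-step procedure in eq. (10): pre-test each
# sample for normality, then branch to a parametric or non-parametric test.
# Population, sample size and alpha are illustrative assumptions; the goal
# is to estimate the overall probability of rejecting H0 when it is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, alpha = 15, 5000, 0.05

rejections = 0
for _ in range(reps):
    # H0 is true: both samples come from the same (non-normal) population
    x1 = rng.exponential(scale=1.0, size=n)
    x2 = rng.exponential(scale=1.0, size=n)
    _, p_norm1 = stats.shapiro(x1)
    _, p_norm2 = stats.shapiro(x2)
    if p_norm1 > alpha and p_norm2 > alpha:
        p = stats.ttest_ind(x1, x2).pvalue                          # parametric branch
    else:
        _, p = stats.mannwhitneyu(x1, x2, alternative="two-sided")  # non-parametric branch
    rejections += p < alpha

rate = rejections / reps
print(f"overall rejection rate of the two-step procedure: {rate:.3f}")
```

The size of the resulting inflation depends on the population, the sample size and the tests chosen; the argument in the text is about the enlarged union of critical regions rather than about any particular numerical value.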
Because the critical regions corresponding to the parametric and non-parametric tests will in general be different, the resultant critical region, which is a union of the critical regions of the individual tests, is increased. Consequently, the probability‡ of rejecting H0 (when it is true) is increased, thereby increasing the probability of a Type I error.

The third argument against the use of normality tests is the theoretical contradiction concerning the sample size. It is known that most normality tests, by construction, tend to reject the null hypothesis H0* (the given data is normally distributed) as the sample size increases. For instance, in the JB test for normality, the test statistic value is directly proportional to the sample size. In other words, the larger the sample size, the more likely the data is to be determined as non-normal. However, according to the CLT, the approximation of normality of the sampling distribution of the mean improves as the sample size increases. This leads to a contradiction between the requirement of data normality and the asymptotic behavior in the CLT.

‡ This probability value is not related to the p-value of the significance test. Instead, it refers to the probability (over repeated trials) of making a Type I error, i.e., rejecting H0 when it is true.

While other methods, such as visual inspection (e.g., histogram visualization, normal probability plots) or those based on empirical rules (e.g., if the sample kurtosis is between 2 and 4, then the sample is deemed to be normally distributed), can overcome the limitations associated with the more formal normality tests, these are not required either, because it is the normality of the sampling distribution of the mean that is needed rather than the data being normal.

III. TO POOL OR NOT TO POOL?

In this section, we analyze the assumption of homogeneity of variance and point out the theoretical aspects that need to be considered in the context of this assumption.
The relevant practical considerations will be discussed in the next section.

A. Should homogeneity of variance be checked?

As discussed in the previous section, the null hypothesis can be defined for two cases. Case 1 requires the assumption of homogeneity of variance (i.e., σ1² = σ2²) and is applicable in the context of ANOVA (for more than two groups) and t_pooled (for two groups). Note that both tests use an estimate of the pooled variance in order to compute the corresponding test statistic. On the other hand, Case 2 does not require homogeneity of variance and is applicable in defining the test statistic t_unpooled. Therefore, t_unpooled is widely used in statistical data analysis and has been included in many statistical packages such as SPSS. However, it can be noted that in general df_unpooled < df_pooled (except when σ1² = σ2² and n1 = n2, in which case both are equal), and hence the use of t_unpooled will increase the probability of a Type II error (i.e., the test will be more conservative).

In light of this, a popular and seemingly logical strategy is to first conduct a preliminary test of variance, based on which a decision is made to use either t_pooled (or ANOVA) or t_unpooled (if the test of variance leads to the conclusion that σ1² ≠ σ2²). Notice, however, that this strategy involves cascaded use of the given data in rejecting or accepting two hypotheses (one from the test of variance and the other from the t-test). In other words, two significance tests are performed on the same data. As a consequence, the Type I error probability will be increased [13]. Suppose H0**: σ1² = σ2² (the null hypothesis in a preliminary test for equality of population variances) and H0: µ1 = µ2 (the null hypothesis for the t-test that will follow). Then, the probability of rejecting H0 in this case can be written (similarly to eq. (10)) as the sum of the probability of rejecting H0 when H0** is not rejected and the probability of rejecting H0 when H0** is also rejected. Following the same arguments as in Section II-C, the resultant critical region, which is a union of the critical regions of the individual tests, is increased, thereby inflating the probability of a Type I error.

Further, note from eq. (6) that the degrees of freedom for t_unpooled depends on the population variances σ1² and σ2², and will therefore be a random variable when these are estimated from the sample variances (which is practically the more likely case). As a result, its analysis, both theoretical and experimental, is more complicated, due to the fact that its distribution is not independent of the sample variances [19]. Thus, the interest in t_unpooled is more from a theoretical perspective, in that it allows for a correction in the degrees of freedom, which in turn renders it valid in cases when the population variances are not equal. In practice, however, it is more relevant to consider the implications of comparing the means of two populations whose spreads (variances) are different. Hence, applying statistical tests for checking homogeneity of variance prior to using the t-test, ANOVA, etc. is not recommended for theoretical reasons (the increased probability of a Type I error), and is of less interest in practice.

B. The case of balanced design

It can be shown that the test statistic t_pooled is valid even if σ1² ≠ σ2², provided that the sample sizes are equal (balanced design). To prove this, we compare the distributions of t_unpooled and t_pooled by writing them in terms of the theoretical t-distribution [19] in the following form:

t_{pooled} = c_{pooled} \cdot t_{df_{pooled}}, \qquad t_{unpooled} = c_{unpooled} \cdot t_{df_{unpooled}}    (11)

where t_{df_{pooled}} and t_{df_{unpooled}} are the t-distributions with the respective degrees of freedom.
Thus, for t_pooled and t_unpooled to follow the respective theoretical t-distributions, the corresponding multiplicative factors c_pooled and c_unpooled should be equal to 1. It can, however, be shown [19] that while c_unpooled is always equal to 1, the value of c_pooled depends on the sample sizes and population variances, i.e.,

c_{pooled} = \sqrt{\frac{(n_1 + n_2 - 2)\left(\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)}{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\left\{(n_1 - 1)\sigma_1^2 + (n_2 - 1)\sigma_2^2\right\}}}    (12)

From the above equation, it is easy to see that c_pooled = 1 if the population variances are equal (σ1² = σ2²). However, c_pooled is also equal to 1 if the sample sizes are equal (n1 = n2). In other words, t_pooled will follow the expected theoretical distribution if a balanced design is used, despite the violation of the assumption of homogeneity of variance. Because several practical applications tend to target a balanced design, i.e., equal sample sizes, the use of t_pooled is valid in such cases even if the sample variances differ by a large amount. In particular, in the case of multimedia quality assessment, the use of balanced design is common. For instance, typical subjective quality assessment tests use the same number of human subjects to evaluate the quality of different conditions (although the subject panel may or may not comprise the same subjects in evaluating the quality of each condition).

IV. PRACTICAL CONSIDERATIONS IN THE DOMAIN OF MULTIMEDIA QUALITY ASSESSMENT

In this section, we discuss the assumption of homogeneity of variance from the practical viewpoint, and take an illustrative example from the domain of video quality assessment. Let us consider that an original (i.e., undistorted) video sequence is viewed and rated for its visual quality by all the concerned observers on a scale of 1 (worst) to 5 (excellent).
[Fig. 2: Illustration of the treatment effects QP1 and QP2: a shift in means (MOS) due to the treatments, with µ_org = 4.7, µ_QP1 = 3.5 and µ_QP2 = 1.6. The shift in location does not alter the variance of the groups. The values of the means are assumed for illustration only. Figure best viewed in color.]

Hence, this set of individual ratings forms the population of interest P_org for this condition (i.e., the undistorted video). We can express each element of P_org as P_org^(i) = µ_org + ε_i, where µ_org is the mean of P_org and ε_i denotes the random error (with zero mean and finite variance) that will be introduced in each individual rating. This error term can be used to take into account the fact that some observers may be more critical (so their corresponding ratings will be less than µ_org) while others may be less critical (i.e., their ratings are expected to be higher than µ_org) of the video quality. Suppose the said video is now compressed using two quantization parameter (QP) values QP1 and QP2, with QP2 > QP1 (QP is employed in video compression as a measure to quantify quantization levels; a higher QP implies higher quantization and, in general, lower video quality).

A. The case of a systematic treatment effect

In the considered example, quantization can be considered as a treatment that is applied to the original video. Assuming all other conditions to be identical (i.e., the same display, ambient light, viewing distance, etc.), the treatments QP1 and QP2 will decrease the video quality and essentially cause a shift in means (MOS). In other words, the intervention in the original video will result in a shifted (in location) version of the population P_org, as shown in Figure 2. Let µ_QP1 and µ_QP2 denote the means of the populations P_QP1 and P_QP2, respectively.
Then, if these treatments have a systematic effect on video quality, we can express the elements of the corresponding populations as P_QP1^(i) = µ_org + E_QP1 + ε_i and P_QP2^(i) = µ_org + E_QP2 + ε_i. Here, E_QP1 and E_QP2 are the effects of the treatments QP1 and QP2, respectively. Hence, the quality scores for the new conditions are shifted from µ_org by an amount triggered by the visible impact of the treatments on the video quality, which can be quantified by E_QP1 and E_QP2. In the example shown in Figure 2, E_QP1 = −1.2 and E_QP2 = −3.1 (negative values are indicative of a decrease in video quality). Notice that the resulting populations P_QP1 and P_QP2 will have the same variance as P_org, because the treatments (QP1 and QP2) will cause systematic changes in the individual ratings (i.e., observers who were more critical in the case of the original video will remain so for the new conditions also). In the alternate case, if the treatments do not cause any changes in the opinion scores, i.e., the effect is not visible to the observers (i.e., E_QP1 = 0 and E_QP2 = 0), then the three populations will be the same, and one can conclude that the treatments do not lead to statistically significant differences in means (MOS).

B. The case of heterogeneous variances

In the third case, if the treatments QP1 and QP2 do not introduce a systematic effect on video quality, then the individual opinion scores may randomly increase (video quality improves visibly according to some observers), decrease (video quality degrades visibly according to some observers) or remain the same (the video quality level remains the same as without any treatment). In such a case, we can say that the treatments caused the ratings to become heterogeneous, because apart from the inherent random error (ε_i), the varying values of E_QP1 and E_QP2 will introduce additional and possibly different variations in P_QP1 and P_QP2.
Consequently, the variances of the three populations P_org, P_QP1 and P_QP2 will be different. Hence, testing whether µ_org = µ_QP1 = µ_QP2 may not be useful since the populations will be different in any case. Practically, such cases are of less interest because one generally knows the effect of a given treatment a priori (in the given example of video compression, it is known that QP_1 and QP_2 will lower video quality levels as compared to the original video), and statistical tests help to establish whether the observed differences due to the treatment are merely due to chance (i.e. due to sampling error) or not. If the population variances are unequal, it points to two possibilities: (1) additional factors may have crept in, or (2) the observers have not been consistent in their ratings. The first possibility is generally minimized by careful experimental design, including training sessions at the beginning of the test to ensure that the participants have understood the task well. The effect of the second possibility is mitigated by rejecting outliers, i.e. inconsistent observers that can cause the variance to change are removed from further studies or analysis. Such outlier rejection is well accepted and recommended in multimedia quality analysis, and well documented outlier rejection strategies exist [14], [20]. Therefore, outlier rejection provides indirect support for the assumption of homogeneity of variance, even though its explicit goal is to remove data points which might be dissimilar rather than to make the variances of the groups similar. In other words, experimental design in subjective quality tests will help to ensure that the variances of the groups to be analyzed are similar. In general, the issue of heterogeneous group variances can be avoided [21] if proper experimental guidelines have been followed. In other words, Case 2 (i.e.
samples/groups drawn from different populations with the same population mean) may be practically less useful, although it is perfectly valid for theoretical analysis. In summary, careful experimental design is more crucial for reliable statistical analysis and comparisons than focusing on homogeneity of variance and/or distributional assumptions (data normality). It may also be noted that while the use of t_pooled or ANOVA requires that the population variances are equal, this does not imply that the sample/group variances must be exactly equal. Rather, the said variances should be similar. This can be quantified by computing the ratio of the maximum to the minimum group variance. Empirically, if this ratio is greater than 4 (equivalently, if the ratio of the minimum to the maximum group variance is less than 1/4 = 0.25), then the population variances can be deemed to be unequal. In such a case, it may not be meaningful to conduct a t-test or ANOVA because the samples are likely to be drawn from different populations.

C. Comparing groups with different variances

The homogeneity of variance condition should be viewed in the light of practical considerations and not as a constraint. Therefore, it can be assessed via the empirical rule in order to detect groups/samples whose variances are very different from those of the remaining ones; such groups might suggest the possibility that the samples are taken from different populations (in which case comparing the means via t_unpooled or another test which does not use pooled variance may be less meaningful). Once again, the practical context should be used to ascertain whether the unequal variance condition is reasonable in view of the goals of the analysis. For instance, it is possible that only a fraction of the groups violate this condition, in which case the possible reasons can be examined. In other cases, such groups could possibly be removed from the analysis.
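The empirical rule above translates directly into code (a sketch: the threshold of 4 follows the text, while the function name and the example groups are ours for illustration):

```python
import numpy as np

def variance_ratio_check(groups, threshold=4.0):
    """Flag possible heterogeneity of variance via the max/min variance ratio.

    Returns the ratio of the largest to the smallest group variance and True
    when it exceeds the empirical threshold, i.e. when the group variances
    should not be deemed similar.
    """
    variances = [np.var(g, ddof=1) for g in groups]  # unbiased sample variances
    ratio = float(max(variances) / min(variances))
    return ratio, ratio > threshold

# Example with three hypothetical groups of opinion scores.
rng = np.random.default_rng(1)
g1 = rng.normal(5.5, 1.8, 26)
g2 = rng.normal(7.4, 1.8, 26)
g3 = rng.normal(7.3, 4.5, 26)   # deliberately much larger spread

ratio, heterogeneous = variance_ratio_check([g1, g2, g3])
print(f"max/min variance ratio = {ratio:.2f}, heterogeneous = {heterogeneous}")
```

If the check flags a group, the corresponding condition can be revisited (or the group removed) as discussed above, rather than feeding it blindly into t_pooled or ANOVA.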
As discussed in section III-B, in theory t_pooled and ANOVA are in any case not affected by unequal variance if a balanced design (equal sample sizes) is employed. Therefore, experimental design should target a balanced design as far as possible (in multimedia quality estimation, balanced designs are common). Nevertheless, practically it may be more insightful to analyze the possible reasons for and consequences of unequal variance rather than merely applying the statistical tests.

As discussed, Case 2 is valid from a theoretical perspective but is of less interest in practice. In other words, the implications of comparing k samples whose corresponding populations have different variances but equal means, i.e. µ_1 = µ_2 = ... = µ_k, should also be noted. In this context, it is useful to point out that MOS is sometimes not the most accurate measure of multimedia quality, and other measures may be required to supplement it. For instance, the authors in [22] proposed the use of SOS (standard deviation of opinion scores) while Ref. [23] suggested using PDU (percentage of dissatisfied users) in addition to MOS. Note that measures such as SOS and PDU can be different even if the corresponding population MOS are equal. Such cases will arise if groups (samples) from different populations (with the same population means) are compared, and may not lead to meaningful analysis of perceptual quality and/or user satisfaction levels.

TABLE I: Description of distribution types and their characteristics.

  Type         Parameters          Shape                                        Kurtosis
  Beta         a = 0.5, b = 0.5    symmetric, bimodal (two peaks)               1.5
  Exponential  λ = 0.5             decaying curve, non-symmetric                9
  Normal       µ = 0, σ = 1        bell-shaped, symmetric, unimodal (one peak)  3
  Uniform      a = 0, b = 1        flat (no peaks), symmetric                   1.8

V. EXPERIMENTAL RESULTS AND DISCUSSION

In the first set of experiments, we investigate the effect of the type of distribution that the sample follows.
We considered four different types of distributions (from which random numbers were generated to simulate sample observations); these are summarized in Table I. Note that the parameters for these distributions were chosen in order to produce diverse shapes (in terms of symmetry, number of peaks, etc.); the kurtosis values reported in Table I reflect this. As an example, we use ANOVA and study the sampling distribution of F when the samples follow the distributions mentioned in Table I. We consider 5 groups (k = 5) with an equal number of observations in each group (n_i = n = 25), and ensured that the groups have similar variances. Thus, we represent the sample for the exponential distribution as S_exp = [d_exp^1 d_exp^2 d_exp^3 d_exp^4 d_exp^5], where d_exp^1 to d_exp^5 are 25-dimensional column vectors representing the groups. Similarly, we can define the samples for the other distributions, i.e. S_beta, S_normal and S_uniform. Since our goal was to study the sampling distribution of F in ANOVA, S_exp, S_beta, S_normal and S_uniform were generated randomly in each iteration (N_iter = 10^5), making sure that the observations followed the respective distributions. The sampling distributions of F for each case are shown in Figure 3. We have also plotted (as a continuous line) the theoretical F-distribution with the corresponding degrees of freedom, i.e. F(k − 1, M − k) = F(4, 120), for comparison. We can make the following two observations from this figure:

• The sampling distribution of F follows the theoretical F-distribution curve irrespective of the type of sample distribution. Thus, sample normality is not a prerequisite for F to be distributed according to the F-distribution.
Fig. 3: Sampling distribution of F values when the samples follow the indicated distributions: (a) Beta, (b) Exponential, (c) Normal, (d) Uniform. In each plot, the continuous curve indicates the theoretical F-distribution with 4 and 120 degrees of freedom. Figure best viewed in color.

• Despite a small sample size (n = 25), the sampling distribution of F approximates the theoretical curve well. Hence, as argued, in practice ANOVA (and other parametric tests) can be applied to approximate the theoretical distribution. Obviously, the approximations will improve with increasing sample size.

We can carry out a similar analysis regarding the sampling distribution of the test statistic on real data. However, in practice we typically have only one sample, since the subjective or objective experiment is not repeated, for obvious reasons. Therefore, to generate the sampling distributions in such a scenario, we employ the idea of resampling. Specifically, given two or more samples which are to be compared, we can create randomized versions of these under the assumption that the given samples are similar (i.e. assuming the null hypothesis to be true). To demonstrate this, we use raw opinion scores from the dataset described in [5], where a comparison of upscalers was performed at varying compression rates. Since we want to study the sampling distribution of F in ANOVA, we first selected three groups from the said data. These groups represent the quality scores of three conditions evaluated by 26 observers. Thus, the group size was 26 (n_i = n = 26).
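The randomization procedure just described can be sketched as follows (a sketch only: the scores below are synthetic stand-ins for the raw opinion scores of [5], which are not reproduced here, and 10^4 randomizations are used instead of the paper's 10^5 to keep the run short):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic stand-in opinion scores for three conditions, 26 observers each
# (the actual analysis uses the raw scores from the dataset in [5]).
g1 = np.clip(rng.normal(5.6, 1.8, 26), 0, 10)
g2 = np.clip(rng.normal(7.4, 1.8, 26), 0, 10)
g3 = np.clip(rng.normal(7.3, 1.6, 26), 0, 10)

F_obs = f_oneway(g1, g2, g3).statistic       # observed F for the original grouping
pooled = np.concatenate([g1, g2, g3])        # under H0 the group labels are exchangeable
n = 26
n_iter = 10_000

F_null = np.empty(n_iter)
for i in range(n_iter):
    rng.shuffle(pooled)                      # randomize under the null hypothesis
    F_null[i] = f_oneway(pooled[:n], pooled[n:2*n], pooled[2*n:]).statistic

# Fraction of randomized F values at least as extreme as the observed one.
p_value = np.mean(F_null >= F_obs)
print(f"F_obs = {F_obs:.2f}, randomization p-value = {p_value:.4f}")
```

The histogram of F_null is the empirical counterpart of the sampling distribution in Figure 4a; comparing F_obs against it requires no distributional assumption on the scores themselves.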
Other descriptive properties of the selected groups are summarized in Table II, from which we note that none of the groups is normally distributed, as indicated by the very high or very low kurtosis values and their shapes. In addition, the group variances are similar.

TABLE II: Description of groups taken from [5].

               group 1                  group 2                  group 3
  Mean (MOS)   5.5769                   7.3846                   7.3077
  Variance     3.1338                   3.2862                   2.4615
  Kurtosis     1.7971                   6.9602                   6.4978
  Shape        unimodal, non-symmetric  bi-modal, non-symmetric  unimodal, non-symmetric

First, we applied ANOVA to compare the resampled versions of the three groups (we employed 10^5 randomizations under the null hypothesis), and the resulting sampling distribution of F values is shown in Figure 4a. As expected, it approximates the theoretical F-distribution well. To give another example, we show the sampling distribution of t_pooled when comparing group 1 and group 2 using the pooled t-test in Figure 4b. In this case also, the experimental distribution reasonably follows the theoretical t-distribution.

Fig. 4: Sampling distributions of (a) F and (b) t_pooled values for the groups of data taken from [5] and summarized in Table II. In each plot, the continuous curve indicates the corresponding theoretical distribution. Figure best viewed in color.

VI. PRACTICAL RECOMMENDATIONS

Based on the theoretical and experimental analysis in the previous sections, it is clear that the application of parametric tests should focus on the consequences of the assumptions in these tests. The practical recommendations towards using the tests are highlighted in the right flow diagram in Figure 1, and are summarized in the following.
Applying normality checks on the given data is neither required nor recommended, as the CLT provides information about the shape and parameters of the sampling distribution of the mean. Instead, the more important consideration is whether the mean (MOS) adequately represents the desired information from the sample(s). For instance, the mean is a useful measure of central tendency for many symmetric distributions (not necessarily normal). Moreover, the mean is still a practically useful statistic even if there are a few outliers (skewness) in the data. In all such cases, parametric tests are practically meaningful for statistical analysis.

Homogeneity of variance should be exploited to obtain further insights into the data, and therefore not be viewed as a bottleneck for the purpose of statistical testing. To that end, the empirical rule (refer to section IV-B) should be applied to detect the presence of groups/samples that may have very different variances as compared to the remaining ones. If such groups exist, then the corresponding conditions should be revisited to find possible reasons for the unequal variance. Consequently, if the unequal variance condition is practically reasonable (or such groups can be removed), t_pooled or ANOVA can be used. A balanced experimental design (equal sample sizes) would therefore be preferable in such cases (recall from section III-B that both tests are not affected by unequal variance if the group/sample sizes are the same).

The use of nonparametric tests is recommended if the mean is not a suitable summary statistic for the data to be analyzed. Note that nonparametric tests should not be used merely because the given data is non-normal. Rather, they should be used to generate the sampling distribution of the desired test statistic.
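The role the CLT plays in the first recommendation can be checked numerically: even for strongly skewed per-observer scores, the sampling distribution of the mean is close to symmetric at typical panel sizes (a sketch with an assumed exponential score model and a panel size of 25):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Strongly skewed "opinion scores" (exponential, theoretical skewness 2)
# as a deliberately non-normal worst case.
raw = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for hypothetical panels of 25 observers:
# by the CLT its skewness shrinks by a factor of sqrt(25).
n_obs = 25
means = rng.exponential(scale=2.0, size=(20_000, n_obs)).mean(axis=1)

print(f"skewness of raw scores:  {skew(raw):.2f}")
print(f"skewness of panel means: {skew(means):.2f}")
```

This is why the recommendation above targets the sampling distribution of the statistic rather than the distribution of the raw data: normality of the individual scores is not what the parametric test relies on.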
In summary, the analysis of data pertaining to multimedia quality using the mean (average) as a test statistic should focus on experimental design (this includes the selection of challenging content, recruiting an adequate number of human subjects with possible emphasis on a balanced design, the conditions to be evaluated, and the final goal of the analysis) rather than emphasizing distributional assumptions, the equal variance condition, or resorting to multiple hypothesis tests. However, if the mean is not a suitable test statistic, then nonparametric tests can be used by leveraging the power of computers to construct the empirical sampling distribution of the desired test statistic.

VII. CONCLUDING REMARKS

Parametric tests provide a theoretical framework for drawing statistical inferences from the data and thus help in formulating well grounded recommendations. However, the application of these tests and the interpretation of the results require some care in the light of the assumptions required in these tests. To that end, we revisited the theoretical formulations and clarified the role of the assumptions of normality and homogeneity of variance. By analyzing the sampling distributions of the test statistics, we argued that the more appropriate question to ask before deploying parametric tests is whether the test statistic follows the corresponding distribution or not (instead of whether the data follows any specific distribution). We also emphasized that the said assumptions should not be viewed as constraints on the data. Instead, it is more important to focus on their practical implications. The presented analysis is particularly relevant in the context of multimedia quality assessment because the said issues have not been emphasized enough in the corresponding literature. We also made practical recommendations in order to avoid the theoretical issues related to multiple hypothesis testing.
Even though the targeted application was multimedia quality estimation, the theoretical arguments and the recommendations are expected to be useful in several other areas (such as medical data analysis, information retrieval, natural language processing, etc.) where parametric tests are widely used. In order to provide a tool for practical use, a software implementing the said recommendations is also made publicly available§.

REFERENCES

[1] P. Coverdale, S. Moller, A. Raake, and A. Takahashi, "Multimedia quality assessment standards in ITU-T SG12," IEEE Signal Processing Magazine, vol. 28, no. 6, pp. 91–97, Nov 2011.
[2] ITU-R Recommendation BS.1534-3, "Method for the subjective assessment of intermediate quality levels of coding systems," International Telecommunication Union, Geneva, Switzerland, Tech. Rep., Oct. 2015.
[3] ITU-T Tutorial, "Objective perceptual assessment of video quality: Full reference television," International Telecommunication Union, Geneva, Switzerland, Tech. Rep., May 2005.
[4] T. K. Tan, R. Weerakkody, M. Mrak, N. Ramzan, V. Baroncini, J. R. Ohm, and G. J. Sullivan, "Video quality evaluation methodology and verification testing of HEVC compression performance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 1, pp. 76–90, Jan 2016.
[5] Y. Pitrey, M. Barkowsky, P. Le Callet, and R. Pepion, "Subjective quality evaluation of H.264 high-definition video coding versus spatial up-scaling and interlacing," ACM EuroITV Conference, Workshop on Quality of Experience for Multimedia Content Sharing (QoEMCS), 2010.
[6] M. Narwaria, M. P. Da Silva, P. Le Callet, and R. Pepion, "Tone mapping-based high-dynamic-range image compression: study of optimization criterion and perceptual quality," Optical Engineering, vol. 52, no. 10, pp. 102008–102008, 2013.
[7] B. Belmudez, Audiovisual Quality Assessment and Prediction for Videotelephony, ser.
T-Labs Series in Telecommunication Services. Springer International Publishing, 2016.
[8] M. Melo, M. Bessa, L. Barbosa, K. Debattista, and A. Chalmers, "Screen reflections impact on HDR video tone mapping for mobile devices: an evaluation study," EURASIP Journal on Image and Video Processing, vol. 2015, no. 1, p. 44, 2015.
[9] J. M. Moreno-Roldán, M. Luque-Nieto, J. Poncela, V. Díaz-del-Río, and P. Otero, "Subjective quality assessment of underwater video for scientific applications," Sensors, vol. 15, no. 12, pp. 31723–31737, 2015. [Online]. Available: http://dx.doi.org/10.3390/s151229882
[10] F. M. Moss, K. Wang, F. Zhang, R. Baddeley, and D. R. Bull, "On the optimal presentation duration for subjective video quality assessment," IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 11, pp. 1977–1987, Nov 2016.
[11] ITU-T Recommendation P.1401, "Methods, metrics and procedures for statistical evaluation, qualification and comparison of objective quality prediction models," International Telecommunication Union, Geneva, Switzerland, Tech. Rep., Jul. 2012.
[12] G. Pflug, "On Kersting's proof of the central limit theorem," Statistics & Probability Letters, vol. 1, no. 6, pp. 323–326, 1983.
[13] G. Roussas, An Introduction to Probability and Statistical Inference, 2nd ed. Academic Press, 2015.
[14] ITU-R Recommendation BT.500-12, "Methodology for the subjective assessment of the quality of television pictures." Geneva, Switzerland: International Telecommunication Union, 2009.
[15] M. H. Pinson, L. Janowski, R. Pepion, Q. Huynh-Thu, C. Schmidmer, P. Corriveau, A. Younkin, P. L. Callet, M. Barkowsky, and W. Ingram, "The influence of subjects and environment on audiovisual subjective tests: An international study," IEEE Journal of Selected Topics in Signal Processing, vol. 6, no. 6, pp. 640–651, Oct 2012.
[16] E. Schmider, M. Ziegler, E. Danay, L. Beyer, and M. Bühner, "Is it really robust? Reinvestigating the robustness of ANOVA against violations of the normal distribution assumption," Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, vol. 6, no. 4, p. 147, 2010.
[17] W. K. Lim and A. W. Lim, "A comparison of usual t-test statistic and modified t-test statistics on skewed distribution functions," Journal of Modern Applied Statistical Methods, vol. 15, no. 2, pp. 67–89, 2016.
[18] T. Lumley, P. Diehr, S. Emerson, and L. Chen, "The importance of the normality assumption in large public health data sets," Annual Review of Public Health, vol. 23, pp. 151–169, 2002.
[19] B. L. Welch, "The significance of the difference between two means when the population variances are unequal," Biometrika, vol. 29, no. 3-4, p. 350, 1938.
[20] "Final report from the Video Quality Experts Group on the validation of objective quality metrics for video quality assessment." Video Quality Experts Group (VQEG), March 2003.
[21] S. S. Sawilowsky, "Fermat, Schubert, Einstein, and Behrens-Fisher: The probable difference between two means when σ₁² ≠ σ₂²," Journal of Modern Applied Statistical Methods, vol. 1, no. 2, pp. 461–472, 2002.
[22] T. Hoßfeld, R. Schatz, and S. Egger, "SOS: The MOS is not enough!" in 2011 Third International Workshop on Quality of Multimedia Experience, Sept 2011, pp. 131–136.
[23] D. C. Mocanu, J. Pokhrel, J. P. Garella, J. Seppänen, E. Liotou, and M. Narwaria, "No-reference video quality measurement: added value of machine learning," Journal of Electronic Imaging, vol. 24, no. 6, p. 061208, 2015. [Online]. Available: http://dx.doi.org/10.1117/1.JEI.24.6.061208

§ https://sites.google.com/site/narwariam/home/research