From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality
Zhenqiang Ying1*, Haoran Niu1*, Praful Gupta1, Dhruv Mahajan2, Deepti Ghadiyaram2†, Alan Bovik1†
1University of Texas at Austin, 2Facebook AI
{zqying, haoranniu, praful_gupta}@utexas.edu, {dhruvm, deeptigp}@fb.com, bovik@ece.utexas.edu
* † Equal contribution.

Abstract

Blind or no-reference (NR) perceptual picture quality prediction is a difficult, unsolved problem of great consequence to the social and streaming media industries that impacts billions of viewers daily. Unfortunately, popular NR prediction models perform poorly on real-world distorted pictures. To advance progress on this problem, we introduce the largest (by far) subjective picture quality database, containing about 40,000 real-world distorted pictures and 120,000 patches, on which we collected about 4M human judgments of picture quality. Using these picture and patch quality labels, we built deep region-based architectures that learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps. Our innovations include picture quality prediction architectures that produce global-to-local inferences as well as local-to-global inferences (via feedback).

1. Introduction

Digital pictures, often of questionable quality, have become ubiquitous. Several hundred billion photos are uploaded and shared annually on social media sites like Facebook, Instagram, and Tumblr. Streaming services like Netflix, Amazon Prime Video, and YouTube account for 60% of all downstream internet traffic [1]. Being able to understand and predict the perceptual quality of digital pictures, given resource constraints and increasing display sizes, is a high-stakes problem.

It is a common misconception that if two pictures are impaired by the same amount of a distortion (e.g., blur), they will have similar perceived qualities. However, this is far from true because of the way the vision system processes picture impairments. For example, Figs. 1(a) and (b) have identical amounts of JPEG compression applied, but Fig. 1(a) appears relatively unimpaired perceptually, while Fig. 1(b) is unacceptable. On the other hand, Fig. 1(c) has had spatially uniform white noise applied to it, but its perceived distortion severity varies across the picture. The complex interplay between picture content and distortions (largely determined by masking phenomena [2]), and the way distortion artifacts are visually processed, play an important role in how visible or annoying visual distortions may present themselves. Moreover, perceived quality correlates poorly with simple quantities like resolution and bit rate [3]. Generally, predicting perceptual picture quality is a hard, long-standing research problem [4, 2, 3, 5, 6], despite its deceptive simplicity (we sense distortion easily with little, if any, thought).

Fig. 1: Challenges in distortion perception: The quality of a (distorted) image as perceived by human observers is its perceptual quality. Distortion perception is highly content-dependent. Pictures (a) and (b) were JPEG compressed using identical encode parameters, but present very different degrees of perceptual distortion. The spatially uniform noise in (c) varies in visibility over the picture content, because of contrast masking [2].
It is important to distinguish between the concepts of picture quality [2] and picture aesthetics [7]. Picture quality is specific to perceptual distortion, while aesthetics also relates to aspects like subject placement, mood, artistic value, and so on. For instance, Fig. 2(a) is noticeably blurred and of lower perceptual quality than Fig. 2(b), which is less distorted. Yet, Fig. 2(a) is more aesthetically pleasing than the unsettling Fig. 2(b). While distortion can detract from aesthetics, it can also contribute to it, as when intentionally adding film grain [8] or blur (bokeh) [9] to achieve photographic effects. While both concepts are important, picture quality prediction is a critical, high-impact problem affecting several high-volume industries, and is the focus of this work. Robust picture quality predictors can significantly improve the visual experiences of social media, streaming TV and home cinema, video surveillance, medical visualization, scientific imaging, and more.

Fig. 2: Aesthetics vs. perceptual quality: (a) is blurrier than (b), but likely more aesthetically pleasing to most viewers.

In many such applications, it is greatly desired to be able to assess picture quality at the point of ingestion, to better guide decisions regarding retention, inspection, culling, and all further processing and display steps. Unfortunately, measuring picture quality without a pristine reference picture is very hard. This is the case at the output of any camera, and at the point of content ingestion by any social media platform that accepts user-generated content (UGC). No-reference (NR) or blind picture quality prediction is largely unsolved, though popular models exist [10, 11, 12, 13, 14, 15, 16]. While these are often predicated on solid principles of visual neuroscience, they are also simple and computationally shallow, and fall short when tested on recent databases containing difficult, complex mixtures of real-world picture distortions [17, 18]. Solving this problem could affect the way billions of pictures uploaded daily are culled, processed, compressed, and displayed.

Towards advancing progress on this high-impact unsolved problem, we make several new contributions.

• We built the largest picture quality database in existence. We sampled hundreds of thousands of open-source digital pictures to match the feature distributions of the largest use-case: pictures shared on social media. The final collection includes about 40,000 real-world, unprocessed (by us) pictures of diverse sizes, contents, and distortions, and about 120,000 cropped image patches of various scales and aspect ratios (Secs. 3.1, 3.2).

• We conducted the largest subjective picture quality study to date. We used Amazon Mechanical Turk to collect about 4M human perceptual quality judgments from almost 8,000 subjects on the collected content, about four times more than any prior image quality study (Sec. 3.3).

• We collected both picture and patch quality labels to relate local and global picture quality. The new database includes about 1M human picture quality judgments and 3M human quality labels on patches drawn from the same pictures. Local picture quality is deeply related to global quality, although this relationship is not well understood [19], [20]. This data will help us to learn these relationships and to better model global picture quality.
• We created a series of state-of-the-art deep blind picture quality predictors that builds on existing deep neural network architectures. Using a modified ResNet [21] as a baseline, we (a) use patch and picture quality labels to train a region proposal network [22], [23] to predict both global picture quality and local patch quality. This model is able to produce better global picture quality predictions by learning relationships between global and local picture quality (Sec. 4.2). We then further modify this model to (b) predict spatial maps of picture quality, useful for localizing picture distortions (Sec. 4.3). Finally, we (c) innovate a local-to-global feedback architecture that produces further improved whole-picture quality predictions using local patch predictions (Sec. 4.4). This series of models obtains state-of-the-art picture quality performance on the new database, and transfers well, without fine-tuning, to smaller "in-the-wild" databases such as LIVE Challenge (CLIVE) [17] and KonIQ-10K [18] (Sec. 4.5).

2. Background

Image Quality Datasets: Most picture quality models have been designed and evaluated on three "legacy" databases: LIVE IQA [24], TID-2008 [25], and TID-2013 [26]. These datasets contain small numbers of unique, pristine images (~30) synthetically distorted by diverse types and amounts of single distortions (JPEG, Gaussian blur, etc.). They contain limited content and distortion diversity, and do not capture the complex mixtures of distortions that often occur in real-world images. Recently, "in-the-wild" datasets such as CLIVE [17] and KonIQ-10K [18] have been introduced to attempt to address these shortcomings (Table 1).

Table 1: Summary of popular IQA datasets. In the legacy datasets, pictures were synthetically distorted with different types of single distortions. "In-the-wild" databases contain pictures impaired by complex mixtures of highly diverse distortions, each as unique as the pictures they afflict.

Database | # Unique contents | # Distortions | # Picture contents | # Patch contents | Distortion type | Subjective study framework | # Annotators | # Annotations
LIVE IQA (2003) [24] | 29 | 5 | 780 | 0 | single, synthetic | in-lab | - | -
TID-2008 [25] | 25 | 17 | 1700 | 0 | single, synthetic | in-lab | - | -
TID-2013 [26] | 25 | 24 | 3000 | 0 | single, synthetic | in-lab | - | -
CLIVE (2016) [17] | 1200 | - | 1200 | 0 | in-the-wild | crowdsourced | 8000 | 350K
KonIQ (2018) [18] | 10K | - | 10K | 0 | in-the-wild | crowdsourced | 1400 | 1.2M
Proposed database | 39,810 | - | 39,810 | 119,430 | in-the-wild | crowdsourced | 7865 | 3,931,710

Full-Reference models: Many full-reference (FR) perceptual picture quality predictors, which make comparisons against high-quality reference pictures, are available [5, 6], [27, 28, 29, 30, 31, 32, 33]. Although some FR algorithms (e.g., SSIM [5], [34], VIF [6], [35, 36]) have achieved remarkable commercial success (e.g., for monitoring streaming content), they are limited by their requirement of pristine reference pictures.

Current NR models aren't general enough: No-reference or blind algorithms predict picture quality without the benefit of a reference signal. Popular blind picture quality algorithms usually measure distortion-induced deviations from perceptually relevant, highly regular bandpass models of picture statistics [2], [37, 38, 39, 40]. Examples include BRISQUE [10], NIQE [11], CORNIA [13], and FRIQUEE [12], which use "handcrafted" statistical features to drive
shallow learners (SVM, etc.). These models produce accurate quality predictions on legacy datasets having single, synthetic distortions [24, 25, 26, 41], but struggle on the recent in-the-wild databases [17, 18]. Several deep NR models [42, 43, 44, 45, 46] have also been created that yield state-of-the-art performance on legacy synthetic distortion databases [24, 25, 26, 41], e.g., by pretraining deep nets [47, 48, 49] on ImageNet [50], then fine-tuning, or by training on proxy labels generated by an FR model [45]. However, most deep models also struggle on CLIVE [17], because it is too difficult, yet too small to sufficiently span the perceptual space of picture quality to allow very deep models to map it. The authors of [51], whose code is not made available, reported high results, but we have been unable to reproduce their numbers, even with more efficient networks. The authors of [52] use a pre-trained ResNet-101 and report high performance on [17, 18], but later disclosed [53] that they are unable to reproduce their own results in [52].

3. Large-Scale Dataset and Human Study

Next we explain the details of the new picture quality dataset we constructed, and the crowd-sourced subjective quality study we conducted on it. The database has about 40,000 pictures and 120,000 patches, on which we collected 4M human judgments from nearly 8,000 unique subjects (after subject rejection). It is significantly larger than commonly used "legacy" databases [24, 25, 26, 41] and more recent "in-the-wild" crowd-sourced datasets [17, 18].

3.1. UGC-like picture sampling

Data collection began by sampling about 40K highly diverse contents of diverse sizes and aspect ratios from hundreds of thousands of pictures drawn from public databases, including AVA [7], VOC [54], EMOTIC [55], and CERTH Blur [56]. Because we were interested in the role of local quality perception as it relates to global quality, we also cropped three patches from each picture, yielding about 120K patches. While internally debating the concept of "representative", we settled on a method of sampling a large image collection so that it would be substantially "UGC-like". We did this because billions of pictures are uploaded, shared, displayed, and viewed on social media, far more than anywhere else.

Fig. 3: Exemplar pictures from the new database, each resized to fit. Actual pictures are of highly diverse sizes and shapes.

Fig. 4: Scatter plot of picture width versus picture height, with marker size indicating the number of pictures for a given dimension in the new database.

We sampled picture contents using a mixed integer programming method [57] similar to [18], to match a specific set of UGC feature histograms. Our sampling strategy was different in several ways. Firstly, unlike KonIQ [18], no pictures were downsampled, since this intervention can substantially modify picture quality. Moreover, including pictures of diverse sizes better reflects actual practice. Second, instead of uniformly sampling feature values, we designed a picture collection whose feature histograms match those of 15M randomly selected pictures from a social media website. This in turn resulted in a much more realistic and difficult database on which to make quality predictions, as we will describe later.
Lastly, we did not use a pre-trained IQA algorithm to aid the picture sampling, as that could introduce algorithmic bias into the data collection process. To sample and match feature histograms, we computed the following diverse, objective features on both our picture collection and the 15M UGC pictures:

• absolute brightness L = R + G + B.
• colorfulness, using the popular model in [58].
• RMS brightness contrast [59].
• Spatial Information (SI), the global standard deviation of Sobel gradients [60], a measure of complexity.
• pixel count, a measure of picture size.
• number of detected faces, using [61].

In the end, we arrived at about 40K pictures. Fig. 3 shows 16 randomly selected pictures and Fig. 4 highlights the diverse sizes and aspect ratios of pictures in the new database.

3.2. Patch cropping

We applied the following criteria when randomly cropping out patches: (a) aspect ratio: patches have the same aspect ratios as the pictures they were drawn from; (b) dimension: the linear dimensions of the patches are 40%, 30%, and 20% of the picture dimensions; (c) location: every patch is entirely contained within the picture, but no patch overlaps the area of another patch cropped from the same image by more than 25%. Fig. 5 shows two exemplar pictures, and three patches obtained from each.

Fig. 5: Sample pictures and 3 randomly positioned crops (20%, 30%, 40%).

3.3. Crowdsourcing pipeline for subjective study

Subjective picture quality ratings are true psychometric measurements on human subjects, requiring 10-20 times as much time for scrutiny (per photo) as, for example, object labelling [50]. We used the Amazon Mechanical Turk (AMT) crowdsourcing system, well-documented for this purpose [17, 18, 62, 63], to gather human picture quality labels.

We divided the study into two separate tasks: picture quality evaluation and patch quality evaluation. Most subjects (7141 out of 7865 workers) only participated in one of these, to avoid biases incurred by viewing both, even on different dates. Either way, the crowdsourcing workflow was the same, as depicted in Fig. 6. Each worker was given instructions, followed by a training phase, where they were shown several contents to learn the rating task. They then viewed and quality-rated N contents to complete their human intelligence task (HIT), concluding with a survey regarding their experience. At first, we set N = 60, but as the study accelerated and we found the workers to be delivering consistent scores, we set N = 210. We found that the workers performed as well when viewing the increased number of pictures.

Fig. 6: AMT task: Workflow experienced by crowd-sourced workers when rating either pictures or patches.

3.4. Processing subjective scores

Subject rejection: We took the recommended steps [17, 63] to ensure the quality of the collected human data.

• We only accepted workers with acceptance rates > 75%.
• Repeated images: 5 of the N contents were repeated randomly per session to determine whether the subjects were giving consistent ratings.
• "Gold" images: 5 out of N contents were "gold" ones sampled from a collection of 15 pictures and 76 patches that were separately rated in a controlled lab study by 18 reliable subjects. The "gold" images are not part of the new database.
We accepted or rejected each rater's scores within a HIT based on two factors: the difference of the repeated content scores compared with the overall standard deviation, and whether more than 50% of their scores were identical. Since we desired to capture many ratings, workers could participate in multiple HITs. Each content received at least 35 quality ratings, with some receiving as many as 50. The labels supplied by each subject were converted into normalized Z-scores [24], [17], averaged (by content), then scaled to [0, 100], yielding Mean Opinion Scores (MOS). The total number of human subjective labels collected after subject rejection was 3,931,710 (950,574 on images, and 2,981,136 on patches).

Inter-subject consistency: A standard way to test the consistency of subjective data [24], [17] is to randomly divide the subjects into two disjoint equal sets, compute two MOS on each picture (one from each group), then compute the Pearson linear correlation (LCC) between the MOS values of the two groups. When repeated over 25 random splits, the average LCC between the two groups' MOS was 0.48, indicating the difficulty of the quality prediction problem on this realistic picture dataset. Fig. 7 (left) shows a scatter plot of the two halves of human labels for one split, showing a linear relationship and fairly broad spread. We applied the same process to the patch scores, obtaining a higher LCC of 0.65. This is understandable: smaller patches contain less spatial diversity; hence they receive more consistent scores. We also found that nearly all the non-rejected subjects had a positive Spearman rank ordered correlation (SRCC) with the golden pictures, validating the data collection process.

Fig. 7: Scatter plots descriptive of the new subjective quality database. Left: Inter-subject scatter plot of a random 50% division of the human labels of all 40K+ pictures into disjoint subject sets. Right: Scatter plot of picture MOS vs. MOS of the largest patch (40% of linear dimension) cropped from each same picture.

Relationships between picture and patch quality: Fig. 7 (right) is a scatter plot of the entire database of picture MOS against the MOS of the largest patches cropped from them. The linear correlation coefficient (LCC) between them is 0.43, which is strong, given that each patch represents only 16% of the picture area. The scatter plots of the picture MOS against that of the smaller (30% and 20%) patches are quite similar, with somewhat reduced LCC of 0.36 and 0.28, respectively (supplementary material).

An outcome of creating highly realistic "in-the-wild" data is that it is much more difficult to train successful models on. Most pictures uploaded to social media are of reasonably good quality, largely owing to improved mobile cameras. Hence, the distribution of MOS in the new database is narrower and peakier as compared to those of the two previous "in-the-wild" picture quality databases [17], [18]. This is important, since it is desirable to be able to predict small changes in MOS, which can be significant regarding, for example, compression parameter selection [64]. As we show in Sec. 4, the new database is very challenging, even for deep models.

Fig. 8: MOS (Z-score) histograms of three "in-the-wild" databases. Left: CLIVE [17]. Middle: KonIQ-10K [18]. Right: The new database introduced here.
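The score-processing steps used in this section (per-subject Z-scoring, per-content averaging into MOS on a [0, 100] scale, and the 25-split inter-subject consistency check) can be summarized in a short sketch. The code below is illustrative only: the nested-dictionary data layout, the function names, and the min-max rescaling to [0, 100] are our own assumptions, not the exact processing used to build the database.

```python
import numpy as np
from scipy.stats import pearsonr

def compute_mos(ratings_by_subject):
    """ratings_by_subject: {subject_id: {content_id: raw_score}}.
    Z-score each subject's ratings, average per content, then rescale to [0, 100]."""
    z_by_content = {}
    for ratings in ratings_by_subject.values():
        scores = np.array(list(ratings.values()), dtype=float)
        mu, sigma = scores.mean(), scores.std() + 1e-8
        for content, s in ratings.items():
            z_by_content.setdefault(content, []).append((s - mu) / sigma)
    contents = sorted(z_by_content)
    mos = np.array([np.mean(z_by_content[c]) for c in contents])
    mos = 100.0 * (mos - mos.min()) / (mos.max() - mos.min())  # assumed rescaling to [0, 100]
    return dict(zip(contents, mos))

def split_half_consistency(ratings_by_subject, n_splits=25, seed=0):
    """Average Pearson LCC between MOS computed from two random halves of the subjects."""
    rng = np.random.default_rng(seed)
    subjects = list(ratings_by_subject)
    lccs = []
    for _ in range(n_splits):
        order = rng.permutation(len(subjects))
        half = len(subjects) // 2
        g1 = compute_mos({subjects[i]: ratings_by_subject[subjects[i]] for i in order[:half]})
        g2 = compute_mos({subjects[i]: ratings_by_subject[subjects[i]] for i in order[half:]})
        common = sorted(set(g1) & set(g2))
        r, _ = pearsonr([g1[c] for c in common], [g2[c] for c in common])
        lccs.append(r)
    return float(np.mean(lccs))
```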
4. Learning Blind Picture Quality Predictors

With the availability of the new dataset comprising pictures and patches associated with human labels (Sec. 3), we created a series of deep quality prediction models that exploit its unique characteristics. We conducted four picture quality learning experiments, evolving from a simple network into models of increasing sophistication and perceptual relevance, which we describe next.

4.1. A baseline picture-only model

To start with, we created a simple model that only processes pictures and the associated human quality labels. We will refer to this hereafter as the Baseline Model. The basic network that we used is the well-documented pre-trained ResNet-18 [21], which we modified (described next) and fine-tuned to conduct the quality prediction task.

Input image pre-processing: Because picture quality prediction (whether by human or machine) is a psychometric prediction, it is crucial to not modify the pictures being fed into the network. While most visual recognition learners augment input images by cropping, resizing, flipping, etc., doing the same when training a perceptual quality predictor would be a psychometric error. Such input pre-processing would result in perceptual quality scores being associated with different pictures than the ones they were recorded on.

The new dataset contains thousands of unique combinations of picture sizes and aspect ratios (see Fig. 4). While this is a core strength of the dataset and reflects its realism, it also poses additional challenges when training deep networks. We attempted several ways of training the ResNet on raw multi-sized pictures, but the training and validation losses were not stable, because of the fixed-size pooling and fully connected layers.

In order to tackle this aspect, we white-padded each training picture to size 640 × 640, centering the content in each instance. Pictures having one or both dimensions larger than 640 were moved to the test set. This approach has the following advantages: (a) it allows supplying constant-sized pictures to the network, causing it to stably converge well, (b) it allows large batch sizes, which improves training, (c) it agrees with the experiences of the picture raters, since AMT renders white borders around pictures that do not occupy the full webpage's width.

Training setup: We divided the picture dataset (and associated patches and scores) into training, validation, and testing sets. Of the collected 39,810 pictures (and 119,430 patches), we used about 75% for training (30K pictures, along with their 90K patches), 19% for validation (7.7K pictures, 23.1K patches), and the remainder for testing (1.8K pictures, 5.4K patches). When testing on the validation set, the pictures fed to the trained networks were also white-bordered to size 640 × 640. As mentioned earlier, the test set is entirely composed of pictures having at least one linear dimension exceeding 640. Being able to perform well on larger pictures of diverse aspect ratios was deemed an additional challenge to the models.

Implementation Details: We used the PyTorch implementation of ResNet-18 [65] pre-trained on ImageNet and retained only the CNN backbone during fine-tuning. To this, we added two pooling layers (adaptive average pooling and adaptive max pooling), followed by two fully-connected (FC) layers, such that the final FC layer outputs a single score.
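A minimal PyTorch sketch of this modified network is shown below. It assumes torchvision's pretrained ResNet-18 and mirrors the description above (CNN backbone, adaptive average plus adaptive max pooling, two FC layers ending in a single score); the class name and the hidden width of 512 are our own illustrative choices, not details confirmed by the released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class BaselineQualityNet(nn.Module):
    """ResNet-18 backbone with adaptive average + max pooling and a two-layer FC head
    that regresses a single quality score."""
    def __init__(self, hidden=512):
        super().__init__()
        backbone = models.resnet18(pretrained=True)    # use weights=... on newer torchvision
        # Keep only the convolutional backbone (drop ResNet's own avgpool and fc layers).
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.head = nn.Sequential(
            nn.Linear(512 * 2, hidden),                # concatenated avg + max pooled features
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),                      # final FC layer outputs a single score
        )

    def forward(self, x):                              # x: (N, 3, 640, 640) white-padded pictures
        feats = self.backbone(x)                       # (N, 512, 20, 20) for 640 x 640 inputs
        pooled = torch.cat([self.avg_pool(feats), self.max_pool(feats)], dim=1)
        return self.head(pooled.flatten(1)).squeeze(1)
```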
We used a batch size of 120 and employed the MSE loss when regressing the single output quality score. We employed the Adam optimizer with β1 = 0.9 and β2 = 0.99, a weight decay of 0.01, and performed full fine-tuning for 10 epochs. We followed a discriminative learning approach [66], using a lower learning rate of 3e-4 for the backbone and a higher learning rate of 3e-3 for the head layers. These settings apply to all the models we describe in the following.

Table 2: Picture quality predictions: Performance of picture quality models on the full-size validation and test pictures in the new database. A higher value indicates superior performance. NIQE is not trained.

Model | Validation SRCC | Validation LCC | Testing SRCC | Testing LCC
NIQE [11] | 0.094 | 0.131 | 0.211 | 0.288
BRISQUE [10] | 0.303 | 0.341 | 0.288 | 0.373
CNNIQA [68] | 0.259 | 0.242 | 0.266 | 0.223
NIMA [46] | 0.521 | 0.609 | 0.583 | 0.639
Baseline Model (Sec. 4.1) | 0.525 | 0.599 | 0.571 | 0.623
RoIPool Model (Sec. 4.2) | 0.541 | 0.618 | 0.576 | 0.655
Feedback Model (Sec. 4.4) | 0.562 | 0.649 | 0.601 | 0.685

Evaluation setup: Although the baseline model was trained on whole pictures, we tested it on both pictures and patches. For comparison with popular shallow methods, we also trained and tested BRISQUE [10] and the "completely blind" NIQE [11], which does not involve any training. We reimplemented two deep picture quality methods: NIMA [46], which uses a MobileNet-v2 [67] (except that we replaced the output layer to regress a single quality score), and CNNIQA [68], following the details provided by the authors. As is common practice in the field of picture quality assessment, we report two metrics: (a) Spearman Rank Correlation Coefficient (SRCC) and (b) Linear Correlation Coefficient (LCC).

Results: From Table 2, the first thing to notice is the level of performance attained by popular shallow models (NIQE [11] and BRISQUE [10]), which have the same feature sets. The unsupervised NIQE algorithm performed poorly, while BRISQUE did better, yet the reported correlations are far below desired levels. Despite being CNN-based, CNNIQA [68] performed worse than BRISQUE [10]. Our Baseline Model outperformed most methods and competed very well with NIMA [46]. The other entries in the table (the RoIPool and Feedback Models) are described later.

Table 3 shows the performances of the same trained, unmodified models on the associated picture patches of three reduced sizes (40%, 30%, and 20% of linear image dimensions). The Baseline Model maintained or slightly improved performance across patch sizes, while NIQE continued to lag, despite the greater subject agreement on reduced-size patches (Sec. 3.4). The performance of NIMA suffered as the patch sizes decreased. Conversely, BRISQUE and CNNIQA improved as the patch sizes decreased, although they were trained on whole pictures.

4.2. RoIPool: a picture + patches model

Next, we developed a new type of picture quality model that leverages both picture and patch quality information. Our "RoIPool Model" is designed in the same spirit as Fast/Faster R-CNN [22, 23], which was originally designed for object detection. As in Fast R-CNN, our model has an RoIPool layer which allows the flexibility to aggregate at both patch and picture-sized scales. However, it differs from Fast R-CNN [22] in three important ways. First, instead of regressing for detecting bounding boxes, we predict full-picture and patch quality.
Second, Fast R-CNN performs multi-task learning with two separate heads, one for image classification and another for detection. Our model instead shares a single head between patches and images. This was done to allow sharing of the "quality-aware" weights between pictures and patches. Third, while both heads of Fast R-CNN operate solely on features from RoI-pooled region proposals, our model pools over the entire picture to conduct global picture quality prediction.

Implementation details: As in Sec. 4.1, we added an RoIPool layer followed by two fully-connected layers to the pre-trained CNN backbone of ResNet-18. The output size of the RoIPool unit was fixed at 2 × 2. All of the hyperparameters are the same as detailed in Sec. 4.1.

Train and test setup: Recall that we sampled 3 patches per image and obtained picture and patch subjective scores (Sec. 3). During training, the model receives the following input: (a) the image, (b) the location coordinates (left, top, right, bottom) of all 3 patches, and (c) ground-truth quality scores of the image and patches. At test time, the RoIPool Model can process both pictures and patches of any size. Thus, it offers the advantage of predicting the qualities of any number of patches at specified locations, in parallel with the picture predictions.

Results: As shown in Table 2, the RoIPool Model yields better results than the Baseline Model and NIMA on whole pictures on both the validation and test sets. When the same trained RoIPool Model was evaluated on patches, the performance improvement was more significant. Unlike the Baseline Model, the performance of the RoIPool Model increased as the patch sizes were reduced. This suggests that: (i) the RoIPool Model is more scalable than the Baseline Model, hence better able to predict the qualities of pictures of varying sizes, (ii) accurate patch predictions can help guide global picture prediction, as we show in Sec. 4.4, and (iii) this novel picture quality prediction architecture allows computing local quality maps, which we explore next.

Table 3: Patch quality predictions: Results on (a) the largest patches (40% of linear dimensions), (b) middle-size patches (30% of linear dimensions), and (c) smallest patches (20% of linear dimensions) in the validation and test sets. Same protocol as used in Table 2.

Model | (a) Val SRCC | (a) Val LCC | (a) Test SRCC | (a) Test LCC | (b) Val SRCC | (b) Val LCC | (b) Test SRCC | (b) Test LCC | (c) Val SRCC | (c) Val LCC | (c) Test SRCC | (c) Test LCC
NIQE [11] | 0.109 | 0.106 | 0.251 | 0.271 | 0.029 | 0.011 | 0.217 | 0.109 | 0.052 | 0.027 | 0.154 | 0.031
BRISQUE [10] | 0.384 | 0.467 | 0.433 | 0.498 | 0.442 | 0.503 | 0.524 | 0.556 | 0.495 | 0.494 | 0.532 | 0.526
CNNIQA [68] | 0.438 | 0.400 | 0.445 | 0.373 | 0.522 | 0.449 | 0.562 | 0.440 | 0.580 | 0.481 | 0.592 | 0.475
NIMA [46] | 0.587 | 0.637 | 0.688 | 0.691 | 0.547 | 0.560 | 0.681 | 0.670 | 0.395 | 0.411 | 0.526 | 0.524
Baseline Model (Sec. 4.1) | 0.561 | 0.617 | 0.662 | 0.701 | 0.577 | 0.603 | 0.685 | 0.704 | 0.563 | 0.541 | 0.633 | 0.630
RoIPool Model (Sec. 4.2) | 0.641 | 0.731 | 0.724 | 0.782 | 0.686 | 0.752 | 0.759 | 0.808 | 0.733 | 0.760 | 0.769 | 0.792
Feedback Model (Sec. 4.4) | 0.658 | 0.744 | 0.726 | 0.783 | 0.698 | 0.762 | 0.770 | 0.819 | 0.756 | 0.783 | 0.786 | 0.808

Fig. 9: Illustrating the different deep quality prediction models we studied. (a) Baseline Model: ResNet-18 with a modified head trained on pictures (Sec. 4.1). (b) RoIPool Model: trained on both picture and patch qualities (Sec. 4.2). (c) Feedback Model: where the local quality predictions are fed back to improve global quality predictions (Sec. 4.4).

4.3. Predicting perceptual quality maps

Next, we used the RoIPool Model to produce patch-wise quality maps on each image, since it is flexible enough to make predictions on any specified number of patches. This unique picture quality map predictor is the first deep model that is learned from true human-generated picture and patch labels, rather than from proxy labels delivered by an algorithm, as in [45]. We generated picture quality maps in the following manner: (a) we partitioned each picture into a grid of 32 × 32 non-overlapping blocks, thus preserving aspect ratio (this step can be easily extended to process denser, overlapping, or smaller blocks); (b) each block's boundary coordinates (left, top, right, bottom) were provided as input to the RoIPool to guide learning of patch quality scores; (c) for visualization, we applied bilinear interpolation to the block predictions, and represented the results as magma color maps. We then α-blended the quality maps with the original pictures (α = 0.8). From Fig. 10, we may observe that the RoIPool Model is able to accurately distinguish regions that are blurred, washed-out, or poorly exposed from high-quality regions. Such spatially localized quality maps have great potential to support applications like image compression, image retargeting, and so on.

Fig. 10: Spatial quality maps generated using the RoIPool Model (Sec. 4.2). Left: Original images. Right: Quality maps blended with the originals using magma color.

4.4. A local-to-global feedback model

As noted in Sec. 4.3, local patch quality has a significant influence on global picture quality. Given this, how do we effectively leverage local quality predictions to further improve global picture quality? To address this question, we developed a novel architecture referred to as the Feedback Model (Fig. 9(c)). In this framework, the pre-trained backbone has two branches: (i) an RoIPool layer followed by an FC layer for local patch and image quality prediction (Head0), and (ii) a global image pooling layer. The predictions from Head0 are concatenated with the pooled image features from the second branch and fed to a new FC layer (Head1), which makes whole-picture predictions.

Fig. 11: Failure cases: Examples where the Feedback Model's predictions differed the most from the ground truth. (a) Predicted = 56.9, ground-truth MOS = 17.9. (b) Predicted = 68.1, ground-truth MOS = 82.1.

From Tables 2 and 3, we observe that the performance of the Feedback Model on both pictures and patches is improved even further by the unique local-to-global feedback architecture. This model consistently outperformed all shallow and deep quality models. The largest improvement is made on the whole-picture predictions, which was the main goal.
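A rough PyTorch sketch of this local-to-global feedback idea is given below, under our own assumptions: a ResNet-18 backbone, a 2 × 2 RoIPool as stated in Sec. 4.2, one whole-image RoI plus the three labeled patch boxes, an avg+max global pooling as suggested by Fig. 9(c), and illustrative hidden widths. The exact head sizes and fusion details of the authors' model may differ.

```python
import torch
import torch.nn as nn
from torchvision import models
from torchvision.ops import roi_pool

class FeedbackQualityNet(nn.Module):
    """Sketch: RoI-pooled local quality scores (Head0) are concatenated with globally
    pooled image features and passed to Head1 for the final whole-picture score."""
    def __init__(self, roi_out=2, hidden=512):
        super().__init__()
        backbone = models.resnet18(pretrained=True)           # use weights=... on newer torchvision
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.stride = 32                                       # ResNet-18 total downsampling factor
        self.roi_out = roi_out
        # Head0: shared between the whole-image RoI and the three patch RoIs.
        self.head0 = nn.Sequential(
            nn.Linear(512 * roi_out * roi_out, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        # Head1: fuses global image features with the four local predictions.
        self.head1 = nn.Sequential(
            nn.Linear(512 * 2 + 4, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))

    def forward(self, images, patch_boxes):
        """images: (N, 3, H, W); patch_boxes: list of N tensors of shape (3, 4), (x1, y1, x2, y2)."""
        feats = self.backbone(images)                          # (N, 512, H/32, W/32)
        n = images.shape[0]
        whole = torch.tensor([[0.0, 0.0, images.shape[3], images.shape[2]]], device=images.device)
        rois = [torch.cat([whole, pb.to(images.device)], dim=0) for pb in patch_boxes]
        pooled = roi_pool(feats, rois, output_size=self.roi_out, spatial_scale=1.0 / self.stride)
        local_scores = self.head0(pooled.flatten(1)).view(n, 4)    # image score + 3 patch scores
        global_feats = torch.cat([self.avg_pool(feats), self.max_pool(feats)], dim=1).flatten(1)
        picture_score = self.head1(torch.cat([global_feats, local_scores], dim=1)).squeeze(1)
        return picture_score, local_scores
```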
The improvement afforded by the Feedback Model is understandable from a perceptual perspective, since, while quality perception by a human is a low-level task involving low-level processes, it also involves a viewer casting their foveal gaze at discrete localized patches of the picture being viewed. The overall picture quality is likely an integrated combination of quality information gathered around each fixation point, similar to the Feedback Model.

Failure cases: While our model attains good performance on the new database, it does make errors in prediction. Fig. 11(a) shows a picture that was considered of very poor quality by the human raters (MOS = 18), while the Feedback Model predicted an overrated score of 57, which is moderate. This may have been because the subjects were less forgiving of the blurred moving object, which may have drawn their attention. Conversely, Fig. 11(b) is a picture that was underrated by our model, receiving a predicted score of 68 against the subject rating of 82. It may have been that the subjects discounted the haze in the background in favor of the clearly visible waterplane. These cases further reinforce the difficulty of perceptual picture quality prediction and highlight the strength of our new dataset.

4.5. Cross-database comparisons

Finally, we evaluated the Baseline (Sec. 4.1), RoIPool (Sec. 4.2), and Feedback (Sec. 4.4) models, along with the other baselines, all trained on the proposed dataset, on two other smaller "in-the-wild" databases, CLIVE [17] and KonIQ-10K [18], without any fine-tuning. From Table 4, we may observe that all three of our models, trained on the proposed dataset, transfer well to other databases. The Baseline, RoIPool, and Feedback Models all outperformed the shallow and other deep models [46, 68] on both datasets. This is a powerful result that highlights the representativeness of our new dataset and the efficacy of our models.

Table 4: Cross-database comparisons: Results when models trained on the new database are applied on CLIVE [17] and KonIQ [18] without fine-tuning.

Model | CLIVE SRCC | CLIVE LCC | KonIQ SRCC | KonIQ LCC
NIQE [11] | 0.503 | 0.528 | 0.534 | 0.509
BRISQUE [10] | 0.660 | 0.621 | 0.641 | 0.596
CNNIQA [68] | 0.559 | 0.459 | 0.596 | 0.403
NIMA [46] | 0.712 | 0.705 | 0.666 | 0.721
Baseline Model (Sec. 4.1) | 0.740 | 0.725 | 0.753 | 0.764
RoIPool Model (Sec. 4.2) | 0.762 | 0.775 | 0.776 | 0.794
Feedback Model (Sec. 4.4) | 0.784 | 0.754 | 0.788 | 0.808

The best reported numbers on both databases [69] use a Siamese ResNet-34 backbone, training and testing on the same datasets (along with 5 other datasets). While that model reportedly attains 0.851 SRCC on CLIVE and 0.894 on KonIQ-10K, we achieved the above results by directly applying pre-trained models, thereby not allowing them to adapt to the distortions of the test data. When we also trained and tested on these datasets, our picture-based Baseline Model also performed at a similar level, obtaining an SRCC of 0.844 on CLIVE and 0.890 on KonIQ-10K.

5. Concluding Remarks

Problems involving perceptual picture quality prediction are long-standing and fundamental to perception, optics, image processing, and computational vision. Once viewed as a basic vision science modelling problem to improve on weak Mean Squared Error (MSE) based ways of assessing television systems and cameras, the picture quality problem has evolved into one that demands the large-scale tools of data science and computational vision.
Towards this end, we have created a database that is not only substantially larger and harder than previous ones, but contains data that enables global-to-local and local-to-global quality inferences. We also developed a model that produces local quality inferences, uses them to compute picture quality maps, and predicts global image quality. We believe that the proposed new dataset and models have the potential to enable quality-based monitoring, ingestion, and control of billions of social-media pictures and videos. Finally, the examples in Fig. 11 of competing local vs. global quality percepts highlight the fundamental difficulties of no-reference perceptual picture quality assessment: its subjective nature, the complicated interactions between content and myriad possible combinations of distortions, and the effects of perceptual phenomena like masking. More complex architectures might mitigate some of these issues. Additionally, mid-level semantic side information about objects in a picture (e.g., faces, animals, babies) or scenes (e.g., outdoor vs. indoor) may also help capture the role of higher-level processes in picture quality assessment.

References

[1] Sandvine. The Global Internet Phenomena Report, September 2019. [Online] Available: https://www.sandvine.com/global-internet-phenomena-report-2019.
[2] A. C. Bovik. Automatic prediction of perceptual image and video quality. Proceedings of the IEEE, vol. 101, no. 9, pp. 2008-2024, Sep. 2013.
[3] Z. Wang and A. C. Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Process. Mag., vol. 26, no. 1, pp. 98-117, Jan. 2009.
[4] J. Mannos and D. Sakrison. The effects of a visual fidelity criterion of the encoding of images. IEEE Trans. Inf. Theor., vol. 20, no. 4, pp. 525-536, July 1974.
[5] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.
[6] H. R. Sheikh and A. C. Bovik. Image information and visual quality. IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430-444, Feb. 2006.
[7] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A large-scale database for aesthetic visual analysis. In IEEE Int'l Conf. on Comput. Vision and Pattern Recogn. (CVPR), June 2012.
[8] A. Norkin and N. Birkbeck. Film grain synthesis for AV1 video codec. In Data Compression Conf. (DCC), Mar. 2018.
[9] Y. Yang, H. Bian, Y. Peng, X. Shen, and H. Song. Simulating bokeh effect with kinect. In Pacific Rim Conf. Multimedia, Sept. 2018.
[10] A. Mittal, A. K. Moorthy, and A. C. Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, 2012.
[11] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, vol. 20, pp. 209-212, 2013.
[12] D. Ghadiyaram and A. C. Bovik. Perceptual quality prediction on authentically distorted images using a bag of features approach. Journal of Vision, vol. 17, no. 1, art. 32, pp. 1-25, January 2017.
[13] P. Ye, J. Kumar, L. Kang, and D. Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In IEEE Int'l Conf. on Comput. Vision and Pattern Recogn. (CVPR), pages 1098-1105, June 2012.
[14] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4444-4457, Sep. 2016.
[15] K. Gu, G. Zhai, X. Yang, and W. Zhang. Using free energy principle for blind image quality assessment. IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 50-63, Jan. 2015.
[16] W. Xue, L. Zhang, and X. Mou. Learning without human scores for blind image quality assessment. In IEEE Int'l Conf. on Comput. Vision and Pattern Recogn. (CVPR), pages 995-1002, June 2013.
[17] D. Ghadiyaram and A. C. Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372-387, Jan. 2016.
[18] H. Lin, V. Hosu, and D. Saupe. KonIQ-10K: Towards an ecologically valid and large-scale IQA database. arXiv preprint arXiv:1803.08489, March 2018.
[19] A. K. Moorthy and A. C. Bovik. Visual importance pooling for image quality assessment. IEEE J. of Selected Topics in Signal Process., vol. 3, no. 2, pp. 193-201, April 2009.
[20] J. Park, S. Lee, and A. C. Bovik. VQpooling: Video quality pooling adaptive to perceptual distortion severity. IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 610-620, Feb. 2013.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conf. Comput. Vision and Pattern Recogn., pages 770-778, 2016.
[22] R. Girshick. Fast R-CNN. In IEEE Int'l Conf. on Comput. Vision (ICCV), 2015.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Adv. Neural Info. Process. Syst. (NIPS), 2015.
[24] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440-3451, Nov. 2006.
[25] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics, vol. 10, no. 4, pp. 30-45, 2009.
[26] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. J. Kuo. Color image database TID2013: Peculiarities and preliminary results. In European Workshop on Visual Information Processing, pages 106-111, June 2013.
[27] Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In Asilomar Conf. Signals, Systems, Comput., Pacific Grove, CA, Nov. 2003.
[28] E. C. Larson and D. M. Chandler. Most apparent distortion: Full-reference image quality assessment and the role of strategy. J. Electron. Imag., vol. 19, no. 4, pp. 011006:1-011006:21, Jan.-Mar. 2010.
[29] L. Zhang, L. Zhang, X. Mou, and D. Zhang. FSIM: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, vol. 20, no. 8, pp. 2378-2386, Aug. 2011.
[30] D. M. Chandler and S. S. Hemami. VSNR: A wavelet-based visual signal-to-noise ratio for natural images. IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2284-2298, Sep. 2007.
[31] W. Xue, L. Zhang, X. Mou, and A. C. Bovik. Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 684-695, Feb. 2014.
[32] A Haar wavelet-based perceptual similarity index for image quality assessment. Signal Process.: Image Comm., vol. 61, pp. 33-43, 2018.
[33] L. Zhang, Y. Shen, and H. Li. VSI: A visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image Processing, vol. 23, no. 10, pp. 4270-4281, Oct. 2014.
[34] [Online] Available: https://www.emmys.com/news/press-releases/honorees-announced-67th-engineering-emmy-awards.
[35] Z. Li, C. Bampis, J. Novak, A. Aaron, K. Swanson, A. Moorthy, and J. D. Cock. VMAF: The journey continues. The Netflix Tech Blog. [Online] Available: https://medium.com/netflix-techblog/vmaf-the-journey-continues-44b51ee9ed12.
[36] M. Manohara, A. Moorthy, J. D. Cock, I. Katsavounidis, and A. Aaron. Optimized shot-based encodes: Now streaming! The Netflix Tech Blog. [Online] Available: https://medium.com/netflix-techblog/optimized-shot-based-encodes-now-streaming-4b9464204830.
[37] D. J. Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, vol. 4, no. 12, pp. 2379-2394, Dec. 1987.
[38] D. L. Ruderman. The statistics of natural images. Network: Computation in Neural Systems, vol. 5, no. 4, pp. 517-548, 1994.
[39] E. P. Simoncelli and B. A. Olshausen. Natural image statistics and neural representation. Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193-1216, 2001.
[40] A. C. Bovik, M. Clark, and W. S. Geisler. Multichannel texture analysis using localized spatial filters. IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 1, pp. 55-73, 1990.
[41] E. C. Larson and D. M. Chandler. Categorical image quality (CSIQ) database, 2010. [Online] Available: http://vision.eng.shizuoka.ac.jp/mod/page/view.php?id=23.
[42] D. Ghadiyaram and A. C. Bovik. Blind image quality assessment on real distorted images using deep belief nets. In IEEE Global Conference on Signal and Information Processing, pages 946-950.
[43] J. Kim, H. Zeng, D. Ghadiyaram, S. Lee, L. Zhang, and A. C. Bovik. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment. IEEE Signal Process. Mag., vol. 34, no. 6, pp. 130-141, Nov. 2017.
[44] S. Bosse, D. Maniry, T. Wiegand, and W. Samek. A deep neural network for image quality assessment. In IEEE Int'l Conf. Image Process. (ICIP), pages 3773-3777, Sep. 2016.
[45] J. Kim and S. Lee. Fully deep blind image quality predictor. IEEE J. of Selected Topics in Signal Process., vol. 11, no. 1, pp. 206-220, Feb. 2017.
[46] H. Talebi and P. Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3998-4011, Aug. 2018.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, Sept. 2014.
[48] X. Liu, J. van de Weijer, and A. D. Bagdanov. RankIQA: Learning from rankings for no-reference image quality assessment. In IEEE Int'l Conf. on Comput. Vision (ICCV), 2017.
[49] K. Ma, W. Liu, K. Zhang, Z. Duanmu, Z. Wang, and W. Zuo. End-to-end blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1202-1213, March 2018.
[50] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Comput. Vision and Pattern Recogn., pages 248-255, June 2009.
[51] S. Bianco, L. Celona, P. Napoletano, and R. Schettini. On the use of deep learning for blind image quality assessment. Signal, Image and Video Processing, vol. 12, no. 2, pp. 355-362, Feb. 2018.
[52] D. Varga, D. Saupe, and T. Sziranyi. DeepRN: A content preserving deep architecture for blind image quality assessment. IEEE Int'l Conf. on Multimedia and Expo (ICME), pages 1-6, 2018.
[53] D. Saupe. http://www.inf.uni-konstanz.de/~saupe.
[54] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. Int'l J. of Comput. Vision, pp. 303-338, June 2010.
[55] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza. EMOTIC: Emotions in context dataset. In IEEE Conf. Comput. Vision and Pattern Recogn. Workshops (CVPRW), July 2017.
[56] E. Mavridaki and V. Mezaris. No-reference blur assessment in natural images using Fourier transform and spatial pyramids. In IEEE Int'l Conf. Image Process. (ICIP), Oct. 2014.
[57] V. Vonikakis, R. Subramanian, J. Arnfred, and S. Winkler. A probabilistic approach to people-centric photo selection and sequencing. IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2609-2624, Nov. 2017.
[58] D. Hasler and S. E. Suesstrunk. Measuring colorfulness in natural images. In SPIE Conf. on Human Vision and Electronic Imaging VIII, 2003.
[59] E. Peli. Contrast in complex images. J. Opt. Soc. Am. A, vol. 7, no. 10, pp. 2032-2040, Oct. 1990.
[60] H. Yu and S. Winkler. Image complexity and spatial information. In Int'l Workshop on Quality of Multimedia Experience (QoMEX), pages 12-17, IEEE, 2013.
[61] Face detection using Haar cascades. OpenCV-Python Tutorials. [Online] Available: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_objdetect/py_face_detection/py_face_detection.html.
[62] M. J. C. Crump, J. V. McDonnell, and T. M. Gureckis. Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLOS ONE, vol. 8, pp. 1-18, March 2013.
[63] Z. Sinno and A. C. Bovik. Large-scale study of perceptual video quality. IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 612-627, Feb. 2019.
[64] X. Yu, C. G. Bampis, P. Gupta, and A. C. Bovik. Predicting the quality of images compressed after distortion in two steps. IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5757-5770, Dec. 2019.
[65] Torchvision.models. PyTorch. [Online] Available: https://pytorch.org/docs/stable/torchvision/models.html.
[66] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
[67] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE Int'l Conf. on Comput. Vision and Pattern Recogn. (CVPR), pages 4510-4520, June 2018.
[68] L. Kang, P. Ye, Y. Li, and D. Doermann. Convolutional neural networks for no-reference image quality assessment. In IEEE Int'l Conf. on Comput. Vision and Pattern Recogn. (CVPR), pages 1733-1740, June 2014.
[69] W. Zhang, K. Ma, G. Zhai, and X. Yang. Learning to blindly assess image quality in the laboratory and wild. arXiv preprint arXiv:1907.00516, 2019.

Supplementary Material - From Patches to Pictures (PaQ-2-PiQ): Mapping the Perceptual Space of Picture Quality

A. Performance Summary

The performance of NIMA [46] reported in the paper used the default MobileNet [67] backbone. For a fair comparison against the proposed family of models, which used a ResNet-18 backbone, we report the performance of NIMA (ResNet-18) on images (Table 5) and patches (Table 6) of the new database, and also its cross-database performance on CLIVE [17] and KonIQ-10K [18] (Table 7). That the proposed models either compete well with or outperform the other models in all categories further demonstrates their quality prediction strength across multiple databases containing diverse image distortions.

Table 5: Picture quality predictions: Performance of picture quality models on the full-size validation and test pictures in the new database. A higher value indicates superior performance.

Model | Validation SRCC | Validation LCC | Testing SRCC | Testing LCC
NIMA (MobileNet-v2) [46] | 0.521 | 0.609 | 0.583 | 0.639
NIMA (ResNet-18) [46] | 0.503 | 0.577 | 0.580 | 0.611
Baseline Model (Sec. 4.1) | 0.525 | 0.599 | 0.571 | 0.623
RoIPool Model (Sec. 4.2) | 0.541 | 0.618 | 0.576 | 0.655
Feedback Model (Sec. 4.4) | 0.562 | 0.649 | 0.601 | 0.685

Table 6: Patch quality predictions: Results on (a) the largest patches (40% of linear dimensions), (b) middle-size patches (30% of linear dimensions), and (c) smallest patches (20% of linear dimensions) in the validation and test sets. Same protocol as used in Table 5.

Model | (a) Val SRCC | (a) Val LCC | (a) Test SRCC | (a) Test LCC | (b) Val SRCC | (b) Val LCC | (b) Test SRCC | (b) Test LCC | (c) Val SRCC | (c) Val LCC | (c) Test SRCC | (c) Test LCC
NIMA (MobileNet-v2) [46] | 0.587 | 0.637 | 0.688 | 0.691 | 0.547 | 0.560 | 0.681 | 0.670 | 0.395 | 0.411 | 0.526 | 0.524
NIMA (ResNet-18) [46] | 0.578 | 0.600 | 0.676 | 0.696 | 0.516 | 0.505 | 0.672 | 0.657 | 0.324 | 0.316 | 0.504 | 0.483
Baseline Model (Sec. 4.1) | 0.561 | 0.617 | 0.662 | 0.701 | 0.577 | 0.603 | 0.685 | 0.704 | 0.563 | 0.541 | 0.633 | 0.630
RoIPool Model (Sec. 4.2) | 0.641 | 0.731 | 0.724 | 0.782 | 0.686 | 0.752 | 0.759 | 0.808 | 0.733 | 0.760 | 0.769 | 0.792
Feedback Model (Sec. 4.4) | 0.658 | 0.744 | 0.726 | 0.783 | 0.698 | 0.762 | 0.770 | 0.819 | 0.756 | 0.783 | 0.786 | 0.808

Table 7: Cross-database comparisons: Results when models trained on the new database are applied on CLIVE [17] and KonIQ [18] without fine-tuning.

Model | CLIVE SRCC | CLIVE LCC | KonIQ SRCC | KonIQ LCC
NIMA (MobileNet-v2) [46] | 0.712 | 0.705 | 0.666 | 0.721
NIMA (ResNet-18) [46] | 0.707 | 0.645 | 0.707 | 0.679
Baseline Model (Sec. 4.1) | 0.740 | 0.725 | 0.753 | 0.764
RoIPool Model (Sec. 4.2) | 0.762 | 0.775 | 0.776 | 0.794
Feedback Model (Sec. 4.4) | 0.784 | 0.754 | 0.788 | 0.808

B. Information on Model Parameters

Table 8 summarizes the number of learnable parameters used by each of the compared models.

• CNNIQA's [68] poor performance can be attributed to its shallow CNN-based architecture with fewer than 1M parameters, indicating its inability to model this complex problem.

• It is interesting to note that NIMA (MobileNet-v2) performed consistently at par with NIMA (ResNet-18), even though it used only 20% of the total parameters.

• Although the RoIPool Model used the same number of parameters as the Baseline Model, it achieved significantly better performance, suggesting the importance of accurate local quality predictions for global quality.

Table 8: Number of model parameters.
Model | Backbone params | Head params | Total params
CNNIQA [68] | - | - | 724.90K
NIMA (MobileNet-v2) [46] | 2.22M | 10.11K | 2.23M
NIMA (ResNet-18) [46] | 11.17M | 10.11K | 11.18M
Baseline (Sec. 4.1) | 11.17M | 537.99K | 11.70M
RoIPool Model (Sec. 4.2) | 11.17M | 537.99K | 11.70M
Feedback Model (Sec. 4.4) | 11.17M | 1.07M | 12.24M

C. Picture MOS vs. Patch MOS scatter plots

Fig. 12: Scatter plots of picture MOS vs. patch MOS. Left: Scatter plot of picture MOS vs. MOS of the second-largest patch (30% of linear dimension) cropped from each same picture. Right: Scatter plot of picture MOS vs. MOS of the smallest patch (20% of linear dimension) cropped from each same picture.

D. Amazon Mechanical Turk Interface

We allowed the workers on Amazon Mechanical Turk (AMT) to preview the "Instructions" page (as shown in Fig. 13) before they accepted to participate in the study. Once accepted, they were tasked with rating the quality of images on a Likert scale marked with "Bad", "Poor", "Fair", "Good", and "Excellent", as demonstrated in Figs. 14 and 15. A similar user interface was used for the patch quality rating task.

Fig. 13: AMT task: The "Instructions" page shown to workers at the beginning of each HIT.

Fig. 14: AMT task: Training session interface of the AMT task experienced by crowd-sourced workers when rating pictures.

Fig. 15: AMT task: Testing session interface of the AMT task experienced by crowd-sourced workers when rating pictures.