A Comparative Study of Computational Aesthetics
Objective metrics model image quality by quantifying image degradations or estimating perceived image quality. However, image quality metrics do not model what makes an image more appealing or beautiful. In order to quantify the aesthetics of an image, we need to take it one step further and model the perception of aesthetics. In this paper, we examine computational aesthetics models that use hand-crafted, generic and hybrid descriptors. We show that generic descriptors can perform as well as state-of-the-art hand-crafted aesthetics models that use global features. However, neither generic nor hand-crafted features are sufficient to model aesthetics when we only use global features without considering spatial composition or distribution. We also follow a visual dictionary approach similar to state-of-the-art methods and show that it performs poorly without the spatial pyramid step.
Authors: Dogancan Temel, Ghassan AlRegib
Citation: D. Temel and G. AlRegib, "A comparative study of computational aesthetics," 2014 IEEE International Conference on Image Processing (ICIP), Paris, 2014, pp. 590-594. DOI: https://doi.org/10.1109/ICIP.2014.7025118

Date added to IEEE Xplore: 29 January 2015

Code/Poster: https://ghassanalregib.com/publications/

Bib:
@INPROCEEDINGS{Temel2014_ICIP,
  author={D. Temel and G. AlRegib},
  booktitle={2014 IEEE International Conference on Image Processing (ICIP)},
  title={A comparative study of computational aesthetics},
  year={2014},
  pages={590-594},
  doi={10.1109/ICIP.2014.7025118},
  ISSN={1522-4880},
  month={Oct},
}

Copyright © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Contact: alregib@gatech.edu (https://ghassanalregib.com/), dcantemel@gmail.com (http://cantemel.com/)

A COMPARATIVE STUDY OF COMPUTATIONAL AESTHETICS

Dogancan Temel and Ghassan AlRegib
Center for Signal and Information Processing (CSIP)
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA, 30332-0250 USA
{cantemel,alregib}@gatech.edu

ABSTRACT

Objective metrics model image quality by quantifying image degradations or estimating perceived image quality. However, image quality metrics do not model what makes an image more appealing or beautiful. In order to quantify the aesthetics of an image, we need to take it one step further and model the perception of aesthetics. In this paper, we examine computational aesthetics models that use hand-crafted, generic and hybrid descriptors. We show that generic descriptors can perform as well as state-of-the-art hand-crafted aesthetics models that use global features. However, neither generic nor hand-crafted features are sufficient to model aesthetics when we only use global features without considering spatial composition or distribution. We also follow a visual dictionary approach similar to state-of-the-art methods and show that it performs poorly without the spatial pyramid step.

Index Terms — Computational Aesthetics, Image Quality, Photography, Color

1. INTRODUCTION

Every day, we are exposed to various images and videos thanks to Facebook, Flickr, YouTube, Instagram and others. The content on these websites and applications is provided by the users. While uploading multimedia content, files need to satisfy basic constraints such as format, size and resolution. However, these social media platforms do not assess the quality of the multimedia content. Subjective quality evaluation is by far the best way to assess the quality of multimedia, but it requires an extensive amount of time and labor. Therefore, objective quality metrics are used to estimate subjective quality. Most quality assessment methods estimate quality by calculating the degradations in images and videos. Fidelity-based metrics calculate the accuracy of the processed content with respect to the original content, whereas structural metrics model the perceived quality of visual data by considering the Human Visual System (HVS).
However, fidelity- and structure-based metrics are not sufficient to estimate the quality of experience for the end user. We claim in our work that we also need to consider the aesthetics within images and videos. Hence, in our work, we develop image quality measures that incorporate aesthetics as well as other structure-based and statistical models.

The authors in [1] proposed a computational approach to study aesthetics in photographic images. They studied aesthetics as a machine learning problem by extracting low-level features based on rules of thumb in photography, common intuition and rating patterns. In [2], the authors designed a colorfulness index to assess quality rather than fidelity; it quantifies the colorfulness of natural images to estimate perceived quality by using the color distribution in the CIELab color space. In addition to the colorfulness feature, brightness, contrast, saturation and saliency were also used to assess the beauty rating of videos in [3]. In [4], the authors used composition, content and sky-illumination as high-level attributes to predict aesthetics and interestingness.

Instead of focusing on the entire image, the authors in [5] extracted subject regions using blind motion deblurring [6] to detect the areas that draw the most attention of the HVS. The authors in [7] used hue and scene composition features as global features, and dark channel, face region and complexity as regional features. Sharpness, colorfulness, luminance, color harmony and blockiness were used in [8] as low-level features to model visual aesthetic appeal. Composition-specific features such as relative foreground position and visual weight ratio were used in [9] to assess photo quality and perform semi-automatic enhancement based on visual aesthetics. Instead of hand-crafting features that highly correlate with photographic practices and techniques, the authors in [10] used generic image descriptors such as GIST [11] and BOV ([12], [13]) to assess the aesthetics of images.

It is not straightforward to describe aesthetics. Therefore, we need to focus on the judgement of subjects. The authors in [5] generated a video database by selecting videos from YouTube. The database contains 4,000 high-quality professional movie clips and 4,000 low-quality amateurish clips. The quality of the videos was assessed on a per-frame basis and the average was computed to assess the quality of each video. The authors also used MSN Live Search to collect images, and volunteers ranked the images on a scale between 1 and 5. A photography database with peer-rated aesthetics scores ranging from 1 to 7 was provided in [14]. Similarly, [15] provided a large photo database with peer quality ratings ranging from 1 to 10. The authors in [16] randomly collected large samples from [17], [15], [14] and [18] that were annotated with aesthetics, quality, liking and emotion scores. Around 1 million images crawled from Flickr with textual tags, aesthetics annotations and EXIF metadata were provided in [19]. A large set of standardized, emotionally-evocative color photographs covering a wide range of semantic categories was provided in [20]. The authors in [21] generated a large-scale database with score distributions, semantic and style labels and rich annotations including aesthetics.
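As a concrete illustration of one of the hand-crafted features discussed above, the sketch below computes a colorfulness score in the spirit of [2]. It uses the commonly cited RGB opponent-space approximation rather than the CIELab formulation described in the paper, so the exact channels and the 0.3 weighting should be read as assumptions of this sketch.

```python
import numpy as np

def colorfulness(rgb):
    """Colorfulness score in the spirit of Hasler and Susstrunk [2].

    Uses the widely cited RGB opponent-space approximation instead of
    the CIELab formulation; the 0.3 weight is the value reported for
    this approximation and is assumed here.
    """
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    rg = r - g                    # red-green opponent channel
    yb = 0.5 * (r + g) - b        # yellow-blue opponent channel
    sigma = np.hypot(rg.std(), yb.std())   # joint spread of the colors
    mu = np.hypot(rg.mean(), yb.mean())    # distance from the neutral axis
    return sigma + 0.3 * mu

# Usage: higher scores indicate more colorful images, e.g.
# score = colorfulness(imageio.v3.imread("photo.jpg"))
```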
In this work, we focus on the binary classification of images based on their aesthetics quality. We examine the descriptors used in the aesthetics image quality literature as well as other commonly used image descriptors. In Section 2.1, we introduce state-of-the-art computational aesthetics descriptors. Geometric descriptors are discussed in Section 2.2 and color descriptors are explained in Section 2.3. We examine hybrid descriptors in Section 2.4 and the visual dictionary approach in Section 2.5. We compare the descriptors in Section 2.6 and conclude our discussion in Section 3.

2. AESTHETICS DESCRIPTORS

We can measure the success of computational models by how well they estimate the average subjective scores. From the image sets referred to in Section 1, we use the CUHK database collected by the authors in [22]. The images in the database were obtained from the photo contest website DPChallenge [15] along with the user ratings. From the obtained 60,000 images, the top 10% are labeled as good images and the bottom 10% as bad images. Half of the image set is randomly selected for training with labels and the other half is used for the classification tests. In the following sections, we briefly introduce the descriptors and examine their classification performance.

2.1. State-of-the-art descriptors

Ke et al. [22] designed aesthetics features using the spatial distribution of edges, color distribution, hue count, blur, contrast and brightness. Datta et al. [1] used exposure of light, colorfulness, saturation, hue, rule of thirds, a familiarity measure, wavelet-based texture features, size and aspect ratio, region composition, depth of field and shape convexity to assess image aesthetics. Tong et al. [23] implemented a black-box approach by generating a set of low-level features and fusing these features using learning algorithms. In addition to extracting global features, Luo and Tang [5] extracted subject regions to obtain local features of the foreground and background. They calculated clarity contrast, lighting, simplicity, composition geometry and color harmony to model aesthetics. Marchesotti et al. [10] used generic features instead of hand-crafted features to perform aesthetics-based classification. SIFT and color features were extracted and a global model was generated using a Gaussian Mixture Model (GMM). Bag of Words (BoW) and Fisher Vector (FV) representations were used to obtain the statistics of feature distributions and train a Support Vector Machine (SVM) classifier.

Fig. 1. Classification accuracy of state-of-the-art methods

We summarize the classification accuracy of state-of-the-art methods in Figure 1. Hand-crafted global features reach a classification accuracy of up to 76% with Ke et al. [22]. Local features proposed by Luo and Tang [5] lead to an accuracy of up to 93%. Generic features used by Marchesotti et al. [10] result in an accuracy between 81.4% and 89.9%. In the rest of the simulations, we use an L1 soft-margin SVM classifier. We experimented with other classifiers and observed that the classification accuracy did not change significantly when we had a large set of training and test images.
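To make the experimental protocol concrete, the following sketch reproduces the setup described above with scikit-learn: descriptor vectors for the top-10% ("good") and bottom-10% ("bad") images, a random half/half train/test split, and a linear SVM with hinge loss, i.e., an L1 soft-margin SVM. The random feature vectors are placeholders so the script runs; in practice they would be the descriptors of Sections 2.2-2.4.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for real descriptors: in the paper these would be the
# GIST/SIFT/color vectors of the top-10% and bottom-10% CUHK images;
# random data is used here only so the sketch is runnable.
n_per_class, d = 3000, 128
X_good = rng.normal(0.2, 1.0, size=(n_per_class, d))
X_bad = rng.normal(-0.2, 1.0, size=(n_per_class, d))
X = np.vstack([X_good, X_bad])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

# Random half/half split into labeled training and held-out test images.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Linear SVM with hinge loss, i.e., an L1 soft-margin SVM.
clf = LinearSVC(loss="hinge", C=1.0)
clf.fit(X_tr, y_tr)
print("classification accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```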
2.2. Geometric Descriptors

GIST is a holistic representation of an image that models the shape of the scene [11]. The scene representation is characterized with respect to naturalness, openness, roughness, expansion and ruggedness, and GIST uses the local and global energy spectrum to quantify these attributes. We use three different configurations of the GIST features by varying the number of orientations per scale and the number of blocks. GIST(16) and GIST(32) correspond to configurations with an unlocalized energy spectrum, obtained by setting the number of blocks to 1. GIST(512) corresponds to the case where the number of blocks is set to 4. The number of orientations per scale is set to 8 for GIST(32) and GIST(512), and to 4 for GIST(16).

SIFT divides the image into a 4x4 grid of cells and calculates histograms of image gradient directions as explained in [24]. A GMM approximates feature distributions by a weighted sum of Gaussian models; in addition to being used as a feature, it is also used to generate the visual dictionary in Section 2.5. Maximally stable extremal regions (MSER) thresholds the image in the intensity channel; the threshold is swept from black to white to detect connected areas that remain unchanged over a large set of thresholds [25]. Difference of Gaussians (DoG) convolves the original image with Gaussian kernels and subtracts the blurred images from each other; the differences of blurred images contain band-pass details that are used as image descriptors. We also detect corners and blobs to represent images using Hessian, Harris and Laplace operators. Implementations of most of the geometric descriptors are provided by the VLFeat package, an open and portable library of computer vision algorithms [26].

Fig. 2. Classification accuracy of Geometric Descriptors

Classification results for the geometric descriptors are given in Figure 2. Here, we calculate the geometric descriptors by computing the average of each feature dimension over the whole image. We obtain the most accurate classification with 67.7% for DoG, 67.2% for SIFT and 67.1% for HessianLaplace. Generic geometric descriptors inherently contain information related to spatial complexity and composition. Some of the geometric descriptors perform better than others, but none of them is sufficient for aesthetics classification since they do not focus on the basic dimensions of aesthetics.
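As an illustration of the simplest descriptor in Figure 2, the sketch below builds a DoG-style band-pass representation and pools it by averaging each feature dimension over the whole image, as done for the results above. The scale ladder is an assumed configuration, not the paper's exact settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_descriptor(gray, sigmas=(1, 2, 4, 8)):
    """Difference-of-Gaussians descriptor pooled over the whole image.

    Blurs the image at successive scales and subtracts neighboring
    blurred versions; each band-pass layer is summarized by the mean of
    its absolute response. The sigma ladder is an assumed configuration.
    """
    blurred = [gaussian_filter(gray.astype(np.float64), s) for s in sigmas]
    bands = [b1 - b2 for b1, b2 in zip(blurred[:-1], blurred[1:])]
    # Average each feature dimension (band) over the whole image,
    # matching the global pooling used for the Figure 2 results.
    return np.array([np.abs(band).mean() for band in bands])

# Usage:
# desc = dog_descriptor(np.mean(img, axis=2))  # img: HxWx3 RGB array
```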
2.3. Color Descriptors

Color descriptors are designed according to four main constraints: photometric robustness, geometric robustness, photometric stability and generality [27]. In order to satisfy these constraints, color descriptors should be invariant to shadow, shading, light source configuration, viewpoint, orientation and image quality. However, it is not possible to design descriptors that satisfy all of these constraints. Since image aesthetics is also influenced by these factors, we can use color descriptors as aesthetics metrics.

Color naming is introduced in [28] as an image descriptor that calculates color distributions similar to the bag-of-words approach. The relative locations of pixels do not affect the distribution, since this approach only considers the color distribution of the pixels in the region of interest. It was originally used to assign linguistic color labels to image pixels, with the main objective of predicting the color category that humans would perceive given a color measurement. In practice, the color naming descriptor is an 11-D vector where each dimension corresponds to the distribution of one of the main colors: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. We also use the color naming method described in [29], whose authors use a fuzzy k-means algorithm to obtain color labels from the Munsell book of color. This descriptor consists of membership functions that map color elements to the [0, 1] interval; the abbreviation JOSA is used for it in the results.

In addition to the distribution of colors, we can also consider the relative locations of the color pixels. We use the discriminative color descriptor introduced in [27] to cluster the color pixels into compact representations. Color pixels are clustered by maximizing the discriminative power using an information-theoretic approach. We also use the color descriptors in the opponent color space as described in [30]. These color descriptors are designed in a similar way to SIFT, by calculating histograms of gradients in hue. The hue color descriptor performs poorly when the saturation is low; in that case, we can use the opponent color angle [30].

We use the default version of the color descriptors defined in [31]. The color descriptor matrix is composed of three rows, where each row corresponds to the metric calculated over one of three constant regions. In addition, we take the average of the metric over the three regions and use it as an additional descriptor, which is shown with the suffix (1x11). In the case of the discriminative color descriptor, we also calculate the descriptors for color dictionary sizes of 25 and 50.

Fig. 3. Classification accuracy of Color Descriptors

Color naming (1x11) results in the highest classification accuracy with 73.6%, followed by JOSA (1x11) with 72.1%. Compared with the generic geometric descriptors, the color descriptors perform better. In the literature, variations of color distribution and harmony are commonly used as hand-crafted features to model aesthetics, as mentioned in Section 1.
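A minimal sketch of an 11-D color-naming-style histogram follows. Nearest-prototype assignment in RGB with hand-picked anchor colors is an illustrative stand-in for the learned color name mappings of [28] and the fuzzy membership functions of [29]; only the bag-of-words character of the descriptor carries over.

```python
import numpy as np

# Illustrative RGB anchors for the 11 basic color names of [28]; the
# real descriptor uses mappings learned from image data, so these
# prototype values are assumptions for demonstration only.
COLOR_NAMES = {
    "black": (0, 0, 0), "blue": (0, 0, 255), "brown": (139, 69, 19),
    "grey": (128, 128, 128), "green": (0, 128, 0), "orange": (255, 165, 0),
    "pink": (255, 192, 203), "purple": (128, 0, 128), "red": (255, 0, 0),
    "white": (255, 255, 255), "yellow": (255, 255, 0),
}

def color_naming_histogram(rgb):
    """11-D color naming descriptor via nearest-prototype assignment.

    Each pixel votes for its closest color name in RGB space; the
    normalized vote counts form the descriptor, so pixel locations have
    no effect, matching the bag-of-words character described above.
    (For very large images, assign pixels in chunks to limit memory.)
    """
    protos = np.array(list(COLOR_NAMES.values()), dtype=np.float64)  # (11, 3)
    pixels = rgb.reshape(-1, 3).astype(np.float64)                   # (N, 3)
    dists = np.linalg.norm(pixels[:, None, :] - protos[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(protos))
    return counts / counts.sum()
```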
2.4. Hybrid Descriptors

Color, geometric and aesthetics descriptors analyze images from different aspects and can be complementary to each other for classification. However, they can also contradict each other in terms of classification decisions. We combine some of the descriptors defined in the previous sections to obtain a hybrid descriptor. Classification accuracies of the hybrid descriptors are shown in Figure 4.

Fig. 4. Classification accuracy of Hybrid Descriptors

We use Color naming (1x11) as the color descriptor, abbreviated as CN, and SIFT as the geometric descriptor. When CN and SIFT are combined, the resulting accuracy of 70.1% is lower than the accuracy of CN but higher than the accuracy of SIFT. Optimal scale (OS) is a feature introduced to quantify the amount of low-frequency content in the image; the classification accuracy increases by 1% when OS is added to the descriptor. We then add the exposure of light, hue, rule of thirds, wavelet-based texture, and size and aspect ratio features introduced by Datta et al. [1], which increases the classification accuracy up to 75.7%. Finally, we add the brightness, saturation, average color and colorfulness features proposed by Susstrunk et al. in [2] and [3]. Our hybrid descriptor has a classification accuracy of 75.9%. Instead of training with the whole set, we also train with only the first 100 good and 100 bad images to examine the effect of the training set size; the classification accuracy drops from 75.9% to 70.2%.

2.5. Visual Dictionary Approach

In the previous sections, we extracted features and combined them into a vector to form a descriptor, without using the distribution of the features. In contrast, in this section, we generate a visual dictionary to consider the distributional statistics of the features. The training pipeline, including visual dictionary generation, is shown in Figure 5.

Fig. 5. Training pipeline: feature extraction from class I and class II images, a universal feature model producing the model parameters, descriptor extraction yielding class I and class II descriptors, and classifier training producing the classifier model.

We extract features from two different classes of images and feed these features to a universal feature model to obtain a visual dictionary. The training features and the universal model parameters are fed to a descriptor extraction module to obtain class I and class II descriptors. We train the classifier with the labeled descriptors to obtain a classifier model. We use SIFT, DoG, Color naming (1x11) and JOSA because they have the highest classification accuracy in our experiments from the previous sections. In addition to the default SIFT, we perform singular value decomposition, keeping the first N eigenvalues and removing the rest to reconstruct the feature vector. We experimented with different N values, and N = 30 produced the highest classification accuracy. A Gaussian Mixture Model (GMM) is used as the universal feature model and the Fisher Vector is preferred as the descriptor. In our simulations, we vary the number of Gaussians in the GMM and the training set size. For each feature, we perform simulations for at least 14 and up to 22 different configurations. The configurations that lead to the highest classification accuracy are given in Table 1.

Table 1. Classification accuracy for the dictionary approach

Feature Type | Number of Gaussians | Training Set Size | Classification Accuracy (%)
SIFT(SVD)    | 200                 | 100               | 75.5
SIFT         | 200                 | 100               | 74.0
CN           | 5                   | 6000              | 72.6
DoG          | 2                   | 6000              | 69.7
JOSA         | 5                   | 6000              | 67.6

2.6. Descriptor Comparison

We observe that state-of-the-art descriptors using local information lead to classification accuracies of up to 90%. Luo and Tang [5] used hand-crafted features and Marchesotti et al. [10] used generic features to obtain high classification accuracy. As claimed in [10], generic features can perform as well as hand-crafted features. However, examining local features in addition to global features has a more significant effect on the classification results than feature selection. The maximum classification accuracy we obtain is 67.7% using geometric descriptors and 73.6% using color descriptors. Hybrid descriptors lead to 75.9% and the dictionary approach leads to 75.5% at most. Basic generic features such as color and geometric descriptors can perform as well as the state-of-the-art global computational approaches. However, they do not perform as well as the methods that take spatial characteristics into account by using subject region extraction or a spatial pyramid. The main disadvantage of the regional methods is their time and memory complexity: subject region extraction as used in [5] is an exhaustive approach that requires a significant amount of computational time, and the spatial pyramid originally introduced in [32] requires a significant amount of memory (approximately 250 GB to store the extracted features of one dataset out of four in the CUHK dataset).
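The pipeline of Figure 5 can be sketched compactly: a GMM serves as the universal feature model and a simplified Fisher Vector (first-order terms only) serves as the descriptor; classification then proceeds with the SVM of Section 2.1. The dimensions and the simplified FV form are assumptions of this sketch rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_universal_model(features, n_gaussians=5):
    """Universal feature model: a GMM over local features pooled from
    both classes (the 'visual dictionary' of Figure 5)."""
    gmm = GaussianMixture(n_components=n_gaussians,
                          covariance_type="diag", random_state=0)
    gmm.fit(features)  # features: (n_samples, d) local descriptors
    return gmm

def fisher_vector(gmm, feats):
    """Simplified Fisher Vector: soft-assignment-weighted deviations of
    an image's local features from the GMM means (first-order terms
    only; the full FV also includes second-order terms)."""
    q = gmm.predict_proba(feats)                       # (n, K) responsibilities
    diff = feats[:, None, :] - gmm.means_[None, :, :]  # (n, K, d)
    norm = diff / np.sqrt(gmm.covariances_)[None]      # whiten by variances
    fv = (q[:, :, None] * norm).sum(axis=0) / len(feats)
    fv /= np.sqrt(gmm.weights_)[:, None]               # per-component scaling
    return fv.ravel()                                  # (K * d,) descriptor

# Usage sketch: pool local features (e.g., SIFT) from the training
# images, fit the GMM, encode each image as a Fisher Vector, then train
# the L1 soft-margin SVM of Section 2.1 on the encoded descriptors.
```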
3. CONCLUSION

In this paper, we compare generic, hand-crafted and aesthetics descriptors to examine the classification performance of image aesthetics. We have shown that basic generic descriptors can perform as well as the state-of-the-art hand-crafted global descriptors. However, both generic and hand-crafted features are limited in terms of classification when we do not consider the spatial distribution of features. Spatial pyramids and subject region extraction are the main factors that lead to high classification accuracies in the literature, but they require significant amounts of computational time and memory. In our future work, we will examine the correlation between the extracted features and overall image aesthetics. Instead of directly feeding the features to classifiers, we will focus on the individual relationships between the features and aesthetics to model a no-reference image aesthetics metric based on deep learning. We plan to use the AVA dataset introduced in [21] to evaluate the aesthetics metrics because of its image variety, aesthetics scores and rich annotations.

4. REFERENCES

[1] R. Datta and J. Z. Wang, "Studying aesthetics in photographic images," 2006.
[2] D. Hasler and S. Süsstrunk, "Measuring colourfulness in natural images," 2003.
[3] G. Yildirim, A. Shaji, and S. Süsstrunk, "Estimating beauty ratings of videos using supervoxels," Proceedings of the 21st ACM International Conference on Multimedia - MM '13, pp. 385-388, 2013.
[4] S. Dhar, T. L. Berg, and S. Brook, "High level describable attributes for predicting aesthetics and interestingness," 2011.
[5] Y. Luo and X. Tang, "Photo and video quality evaluation: Focusing on the subject," pp. 386-399, 2008.
[6] A. Levin, "Blind motion deblurring using image statistics," 2006.
[7] X. Tang, W. Luo, and X. Wang, "Content-based photo quality assessment," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1930-1943, Dec. 2013.
[8] A. K. Moorthy, P. Obrador, and N. Oliver, "Towards computational models of visual aesthetic appeal of consumer videos," 2010.
[9] S. Bhattacharya, R. Sukthankar, and M. Shah, "A framework for photo-quality assessment and enhancement based on visual aesthetics," Proceedings of the International Conference on Multimedia - MM '10, p. 271, 2010.
[10] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka, "Assessing the aesthetic quality of photographs using generic image descriptors," pp. 1784-1791, 2011.
[11] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," vol. 42, no. 3, pp. 145-175, 2001.
[12] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray, and D. Maupertuis, "Visual categorization with bags of keypoints," ECCV SLCV Workshop, 2004.
[13] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," Proceedings Ninth IEEE International Conference on Computer Vision, pp. 1470-1477 vol. 2, 2003.
[14] Photo.net, http://photo.net.
[15] DPChallenge, http://www.dpchallenge.com.
[16] R. Datta, J. Li, and J. Z. Wang, "Algorithmic inferencing of aesthetics and emotion in natural images: An exposition," in Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, 2008, pp. 105-108.
[17] Alipr, http://alipr.com.
[18] Terragalleria, http://www.terragalleria.com.
[19] H. Müller, P. Clough, Th. Deselaers, and B. Caputo, ImageCLEF: Experimental Evaluation in Visual Information Retrieval, The Information Retrieval Series, Springer, 2010.
[20] University of Florida, "International affective picture system."
[21] N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A large-scale database for aesthetic visual analysis," 2012.
[22] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1, pp. 419-426.
[23] H. Tong, M. Li, H. Zhang, J. He, and C. Zhang, "Classification of digital photos taken by photographers or home users," Proceedings of Pacific Rim Conference on Multimedia, 2004.
[24] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide baseline stereo from maximally stable extremal regions," in Proc. BMVC, 2002, pp. 36.1-36.10, doi:10.5244/C.16.36.
[26] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," 2008.
[27] R. Khan, J. Van de Weijer, F. S. Khan, D. Muselet, C. Ducottet, and C. Barat, "Discriminative color descriptors," 2013.
[28] J. Van de Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512-1523, July 2009.
[29] R. Benavente, M. Vanrell, and R. Baldrich, "Parametric fuzzy sets for automatic color naming," Journal of the Optical Society of America A, vol. 25, no. 10, pp. 2582-2593, Oct. 2008.
[30] J. Van de Weijer and C. Schmid, "Coloring local feature extraction," in ECCV, 2006.
[31] J. Van de Weijer, http://cat.uab.es/~joost/software.html.
[32] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), Washington, DC, USA, 2006, pp. 2169-2178.