HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty



Ishwarya Ananthabhotla*, David B. Ramsay*, and Joseph A. Paradiso
MIT Media Laboratory, Cambridge, MA.
*Equal contribution. The authors would like to thank the AI Grant for their financial support of this work.

ABSTRACT

The way we perceive a sound depends on many aspects: its ecological frequency, acoustic features, typicality, and, most notably, its identified source. In this paper, we present the HCU400: a dataset of 402 sounds ranging from easily identifiable everyday sounds to intentionally obscured artificial ones. It aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings. We extend existing calculations of causal uncertainty, automating and generalizing them with word embeddings. Upon analysis we find that individuals provide less polarized emotion ratings as a sound's source becomes increasingly ambiguous; individual ratings of familiarity and imageability, on the other hand, diverge as uncertainty increases despite a clear negative trend on average.

Index Terms: auditory perception, causal uncertainty, affect, audio embeddings

1. MOTIVATION

Despite a substantial body of literature, human auditory processing remains poorly understood. In 1993, Gaver introduced an ecological model of auditory perception based on the physics of an object in combination with the class of its sound-producing interaction [1]. He suggests that everyday listening focuses on sound sources, while musical listening focuses on acoustic properties of a sound, and that the difference is experiential. Current research has corroborated this distinction: studies show that listeners primarily group sounds by category of sound source, sometimes group sounds by location/context, and only in certain conditions favor groupings by acoustic properties [2, 3]. Recent work with open-ended sound labeling demonstrates that limited categorization tasks may encourage more detailed descriptions along valence/arousal axes (e.g., for animal sounds) or using acoustic properties (e.g., for mechanical sounds) if sound-source distinctions are too limited for the categorization task [4].

It has been suggested that non-verbal sounds from a living source are processed differently in the brain than other physical events [5]. Symbolic information tends to underlie our characterization of sounds from humans and animals (e.g., yawning, clapping), while acoustic information is relied on for other environmental sounds [6, 7, 8]. Furthermore, in [9] Dubois et al. demonstrated that, for complex scenes, the perception of pleasantness/unpleasantness was attributed to audible evidence of human activity instead of measurable acoustic features.

It is clear from the above research that any examination of sound phenomenology must start with a thorough characterization of a sound's interpreted cause. In many cases, however, a sound's cause can be ambiguous. In [10] Ballas introduced a measure of causal uncertainty ($H_{cu}$) based on a large set of elicited noun/verb descriptions for 41 everyday sounds:

$$H_{cu_i} = -\sum_{j=1}^{n} p_{ij} \log_2 p_{ij}$$

where, for sound $i$, $p_{ij}$ is the proportion of labels for that sound that fall into category $j$, as decided by experts reviewing the descriptions.
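To make the calculation concrete, the following minimal Python sketch computes H_cu for a single sound from a flat list of expert-assigned category labels; the function name and data layout are illustrative rather than the paper's implementation.

```python
import math
from collections import Counter

def causal_uncertainty(category_labels):
    """H_cu for one sound: Shannon entropy (base 2) of the distribution
    of its free-text descriptions over expert-assigned categories."""
    counts = Counter(category_labels)
    total = sum(counts.values())
    # H_cu_i = -sum_j p_ij log2 p_ij, written with log2(1/p) to stay >= 0,
    # where p_ij = c / total is the share of labels in category j
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Full agreement gives H_cu = 0; an even split maximizes uncertainty.
print(causal_uncertainty(["dog bark"] * 30))                        # 0.0
print(causal_uncertainty(["dog bark"] * 15 + ["door creak"] * 15))  # 1.0
```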
Ballas shows a complicated relationship between H_cu and the typicality of the sound, its familiarity, the average cognitive delay before an individual is able to produce a label, and the ecological frequency of the sound in his subjects' environment. H_cu was further explored in [11] using 96 kitchen sounds. Lemaitre et al. demonstrated that H_cu alters how we classify sounds: with low causal uncertainty, subjects cluster kitchen sounds by their source; otherwise they fall back to acoustic features.

In this paper, we introduce the HCU400 dataset: the largest dataset available for studying everyday sound phenomenology. In this dataset, we include 402 sounds that were chosen to (1) capture common environmental sounds from everyday life, and (2) fully sample the range of causal uncertainty. While many of the sounds in our dataset are unambiguous, over 100 of the sounds are modified to intentionally obscure their source, allowing explicit control of source-dependent effects.

As part of the dataset, we include high-level emotional features corresponding to each sound's valence and arousal, in line with previous work on affective sound measurement [12]. We also account for features that provide other insights into the mental processing of sound: familiarity and imageability [13, 14]. We explore the basic relationships between all of these features. Finally, we introduce word embeddings as a clustering technique to extend the original H_cu, and apply it to the free-response labels we gathered for each sound in the dataset.

Deep learning has provided a new tool to represent vast amounts of semantic data in a highly compressed form; these techniques will likely make it possible to model and generalize source-dependent auditory processing phenomena. The HCU400 represents a first step in that direction.

2. DATASET OVERVIEW

The HCU400 dataset consists of 402 sound samples and 3 groups of features: sound sample annotations and associated metadata, audio features, and semantic features. It is available at github.com/mitmedialab/HCU400.

2.1. Sourcing the Sounds

All sounds in the dataset are sourced from the Freesound archive (https://freesound.org). We built tools to rapidly explore the archive and re-label sound samples, searching for likely candidates based on tags and descriptions, and finally filtering by star and user ratings. Each candidate sound was split into 5-second increments (and shorter sounds were extended to 5 seconds) during audition.

A major goal in our curation was to find audio samples that spanned the space from "common and easy to identify" to "common but difficult to identify" and finally to "uncommon and difficult to identify". We explicitly sought an even distribution of sounds in each broad category (approximately 130 sounds) using rudimentary blind self-tests. In sourcing sounds for the first two categories, we attempted to select samples that form common scenes one might encounter, such as kitchen, restaurant, bar, home, office, factory, airport, street, cabin, jungle, river, beach, construction site, warzone, ship, farm, and human vocalization. We avoided any samples with explicit speech.

To source unfamiliar/ambiguous sounds, we include a handful of digitally synthesized samples in addition to artificially manipulated everyday sounds. Our manipulation pipeline applies a series of random effects and transforms to our existing samples from the former categories, from which we curated a subset of sufficiently unrecognizable results. Effects include reverberation, time reversal, echo, time stretch/shrink, pitch modulation, and amplitude modulation.
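The dataset release does not include this pipeline's code or parameter settings, so the sketch below is only a plausible reconstruction using numpy and librosa; the sample rate, effect parameters, and the decaying-noise reverb approximation are our assumptions, not the authors' values.

```python
import numpy as np
import librosa

SR = 16000  # assumed working sample rate; the paper does not specify one

def reverse(y, rng):           # time reversal
    return y[::-1]

def echo(y, rng, decay=0.5):   # single delayed copy mixed back in
    d = int(rng.uniform(0.1, 0.4) * SR)
    out = y.copy()
    out[d:] += decay * y[:-d]
    return out

def amp_mod(y, rng):           # slow sinusoidal tremolo
    t = np.arange(len(y)) / SR
    return y * (0.5 + 0.5 * np.sin(2 * np.pi * rng.uniform(1, 8) * t))

def reverb(y, rng):            # decaying-noise impulse response (unnormalized)
    ir = rng.standard_normal(SR // 2) * np.exp(-6 * np.linspace(0, 1, SR // 2))
    return np.convolve(y, ir)[: len(y)]

def stretch(y, rng):           # time stretch/shrink
    return librosa.effects.time_stretch(y, rate=rng.uniform(0.5, 2.0))

def pitch(y, rng):             # pitch modulation
    return librosa.effects.pitch_shift(y, sr=SR, n_steps=int(rng.integers(-6, 7)))

def obscure(y, rng=None, n_effects=3):
    """Apply a random chain of effects to hide a sound's source."""
    if rng is None:
        rng = np.random.default_rng()
    for f in rng.choice([reverse, echo, amp_mod, reverb, stretch, pitch],
                        size=n_effects, replace=False):
        y = f(y, rng)
    return y
```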
2.2. Annotated Features

We began by designing an Amazon Mechanical Turk (AMT) experiment in which participants were presented with a sound chosen at random. After listening as many times as they desired, they provided a free-text description alongside Likert ratings of its familiarity, imageability, arousal, and valence (as depicted by the commonly used self-assessment manikins [12]). The interface additionally captured metadata such as the time taken by each participant to complete their responses, the number of times a given sound was played, and the number of words used in the free-text response. Roughly 12,000 data points were collected through the experiment, resulting in approximately 30 evaluations per sound after discarding outliers (individual workers whose overall rankings deviate strongly from the global mean/standard deviation). A reference screenshot of the interface and its included questions can be found at github.com/mitmedialab/HCU400.

2.3. Audio Features

Low-level features were extracted using the Google VGGish audio classification network, which provides a 128-dimensional embedded representation of audio segments from a network trained to classify 600 types of sound events from YouTube [15]. This is a standard feature extraction tool, used in prominent datasets. A comprehensive set of standard features extracted using the OpenSMILE toolkit [?] is also included.

2.4. Semantic Features

A novel contribution of this work is the automation and extension of H_cu using word embeddings and knowledge graphs. Traditionally, these are used to geometrically capture semantic word relationships; here, we leverage the "clustering radius" of the set of label embeddings as a metric for each sound's H_cu.

We employed three major approaches to embed each label: (1) averaging all constituent words that are nouns, verbs, adjectives, and adverbs, a common and successful average-encoding technique [16]; (2) choosing only the first or last noun and verb; and (3) choosing a single "head word" for each embedding based on a greedy search across a heavily stemmed version of all of the labels (using the aggressive Lancaster Stemmer [17]). In cases where words are out-of-corpus, we auto-correct their spelling, and/or replace them with a synonym from WordNet where available [18]. Labels that fail to cluster are represented by the word with the smallest distance to an existing cluster for that sound (using WordNet path length). This greedy search technique is used to automatically generate the group of labels used in the H_cu calculation. Both Word2Vec [19] and ConceptNet Numberbatch [20] were tested to embed individual words.

After embedding each label, we derived a "cluster radius" score for the set of labels, using the mean and standard deviation of the distance of each label from the centroid as a baseline method. We also explore (k=3) nearest-neighbor intra-cluster distances to reduce the impact of outliers and increase tolerance of oblong shapes. Finally, we calculate the sum of weighted distance from each label subgroup to the largest "head word" cluster, a technique which emphasizes sounds with a single dominant label.
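As an illustration of the baseline cluster-radius method, the sketch below average-embeds each label over its content words and measures dispersion around the centroid. It assumes `kv` is a gensim KeyedVectors holding Word2Vec or ConceptNet Numberbatch vectors, and uses `nltk.pos_tag` as a stand-in for whatever part-of-speech tagger the authors used.

```python
import numpy as np
import nltk  # assumes the averaged_perceptron_tagger data is installed

def label_embedding(label, kv):
    """Average the vectors of a label's nouns, verbs, adjectives, adverbs."""
    tagged = nltk.pos_tag(label.lower().split())
    words = [w for w, pos in tagged
             if pos[:2] in ("NN", "VB", "JJ", "RB") and w in kv]
    return np.mean([kv[w] for w in words], axis=0) if words else None

def cluster_radius(labels, kv):
    """Mean and std of label-embedding distances from their centroid;
    a wide cluster indicates high causal uncertainty (H_cu)."""
    vecs = np.array([v for v in (label_embedding(l, kv) for l in labels)
                     if v is not None])
    dists = np.linalg.norm(vecs - vecs.mean(axis=0), axis=1)
    return dists.mean(), dists.std()
```

The (k=3) nearest-neighbor variant described above would replace the centroid distances with each label's mean distance to its three nearest neighbors.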
We also include a location-based embedding to capture information pertaining to the likelihood of concept co-location in a physical environment. In order to generate a co-location embedding, we implement a shallow-depth crawler that operates on ConceptNet's location relationships ('Located-Near', 'Located-At', etc.) to create a weighted intersection matrix of the set of unique nouns across all our labels as a pseudo-embedding. Again, we derive the centroid location and mean deviation from the centroid of the labels (represented by the first unique noun) for a given sound sample.

Given the number of techniques, we compare and include only the most representative pipelines in our dataset. All clustering approaches give a similar overall monotonic trend, but with variations in their derivative and noise. Analysis of cluster labels in conjunction with scores suggests that a distance-from-primary-cluster definition is most fitting. Most embedding types are similar, but we prefer ConceptNet embeddings over others because ConceptNet is explicitly designed to capture meaningful semantic relationships.

Our clustering results from a processed ConceptNet embedding are plotted in Figure 1. Intentionally modified sounds are plotted in red, and we see most sounds with divergent labeling fall into this category. Sounds that have not been modified are in other colors; here we see examples of completely unambiguous sounds, like human vocalizations, animal sounds, sirens, and instruments.

Fig. 1. Average ConceptNet embedding where the radius represents our H_cu metric; red bubbles and the 'mod' suffix indicate sounds that have been intentionally modified.

3. BASELINE ANALYSIS AND DISCUSSION

First, we find that the Likert annotations are reliable amongst online workers, using a split-ranking evaluation adapted from [21]. Each of the groups consisted of 50% of the workers, and the mean ranking was computed after averaging N=5 splits. The resulting Spearman rank coefficient value for each of the crowd-sourced features is given in Figure 2.

Fig. 2. Split-ranking correlation plots and Spearman rank coefficient values for the four Likert-annotated features.
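A sketch of that reliability check, assuming a long-format table with one row per (worker, sound) rating; the column names and the use of pandas/scipy are our assumptions. Per-half, per-sound means are averaged across N=5 random worker splits before the Spearman coefficient is taken.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def split_half_spearman(df, feature, n_splits=5, seed=0):
    """Split-ranking consistency of one crowd-sourced feature (cf. [21]).

    `df` has columns 'worker', 'sound', and `feature`; workers are split
    in half at random, per-sound means are computed for each half and
    averaged across splits, and the two mean rankings are correlated.
    """
    rng = np.random.default_rng(seed)
    workers = df["worker"].unique()
    halves_a, halves_b = [], []
    for _ in range(n_splits):
        rng.shuffle(workers)
        in_a = df["worker"].isin(set(workers[: len(workers) // 2]))
        halves_a.append(df[in_a].groupby("sound")[feature].mean())
        halves_b.append(df[~in_a].groupby("sound")[feature].mean())
    a = pd.concat(halves_a, axis=1).mean(axis=1)
    b = pd.concat(halves_b, axis=1).mean(axis=1)
    a, b = a.align(b, join="inner")  # keep sounds rated in both halves
    return spearmanr(a, b)[0]
```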
These reliable annotations provide the basis for several intuitive trends in our data, as shown by Figure 4: we find a near-linear correlation between imageability and familiarity, and a significant correlation between arousal and valence. We also find a strong correlation between imageability, familiarity, time-based individual measures of uncertainty (such as "time to first letter" or "number of times played"), and the label-based, aggregate measures of uncertainty (the cluster radii and H_cu).

Fig. 4. Correlation matrix displaying the absolute value of the Pearson correlation coefficient between the mean values of annotated features, metadata, and four representative word-embedding-based clustering techniques.

We next see strong evidence of the value of word embeddings as a measure of causal uncertainty: the automated technique aligns well with the split of modified/non-modified sounds (see Fig. 1) and a qualitative review of the data labels. Our measure also goes one step beyond H_cu, as the cluster centroid assigns representative content to the group of labels. Initial clustering of sounds by their embedded centroids reveals a relationship between clusters and emotion rankings when the source is unambiguous, which could be generalized to predict non-annotated sounds (e.g., sirens, horns, and traffic all cluster together and have very close positive arousal and negative valence rankings; similar kinds of trends hold for clusters of musical instruments and nature sounds).

Furthermore, we use this data to explore the causal relationship between average source uncertainty and individual assessment behavior. In Figure 3, we plot the distributions of pairs of features as a function of data points within the 15th (red) and greater than 85th (blue) percentile of a single cluster metric ("Processed CNET"). It confirms a strong relationship between the extremes of the metric and individual deliberation (bottom right), as reported by [10]. We further find that more ambiguous sounds have less extreme emotion ratings (top right); the data suggest this is not because of disagreement in causal attribution, but because individuals are less impacted when the source is less clear (bottom left). This trend is not true of imageability and familiarity, however; as sounds become more ambiguous, individuals are more likely to diverge in their responses (top center). Regardless, we find a strong downward trend in average familiarity/imageability scores as the source becomes more uncertain (top left).

Fig. 3. Feature distributions grouped by extremes in the "Processed CNET" cluster metric; red points represent data at ≤ 15th percentile (the most labeling agreement and least ambiguous); blue dots are ≥ 85th percentile (high H_cu).

4. CONCLUSION

It is known that aural phenomenology rests on a complex interaction between a presumed sound source, the certainty of that source, the sound's acoustic features, its ecological frequency, and its familiarity. We have introduced the HCU400: a dataset of everyday and intentionally obscured sounds that reliably captures a full set of affective features, self-reported cognitive features, timing, and free-text labels. We present a new technique to quantify H_cu using the distances between word embeddings of free-text labels. Our analysis demonstrates (1) the efficacy of a quantified approach to H_cu using word embeddings; (2) the quality of our crowd-sourced Likert ratings; and (3) the complex relationships between global uncertainty and individual rating behavior, which offers novel insight into our understanding of auditory perception.

5. REFERENCES

[1] William W. Gaver, "What in the world do we hear?: An ecological approach to auditory event perception," Ecological Psychology, vol. 5, no. 1, pp. 1–29, 1993.

[2] Michael M. Marcell, Diane Borella, Michael Greene, Elizabeth Kerr, and Summer Rogers, "Confrontation naming of environmental sounds," Journal of Clinical and Experimental Neuropsychology, vol. 22, no. 6, pp. 830–864, 2000.

[3] Brian Gygi and Valeriy Shafiro, "General functions and specific applications of environmental sound research," Frontiers in Bioscience, vol. 12, pp. 3152–3166, 2007.

[4] Oliver Bones, Trevor J. Cox, and William J. Davies, "Distinct categorization strategies for different types of environmental sounds," 2018.
[5] James W. Lewis, Frederic L. Wightman, Julie A. Brefczynski, Raymond E. Phinney, Jeffrey R. Binder, and Edgar A. DeYoe, "Human brain regions involved in recognizing environmental sounds," Cerebral Cortex, vol. 14, no. 9, pp. 1008–1021, 2004.

[6] Bruno L. Giordano, John McDonnell, and Stephen McAdams, "Hearing living symbols and nonliving icons: category specificities in the cognitive processing of environmental sounds," Brain and Cognition, vol. 73, no. 1, pp. 7–19, 2010.

[7] Salvatore M. Aglioti and Mariella Pazzaglia, "Representing actions through their sound," Experimental Brain Research, vol. 206, no. 2, pp. 141–151, 2010.

[8] L. Pizzamiglio, T. Aprile, G. Spitoni, S. Pitzalis, E. Bates, S. D'Amico, and F. Di Russo, "Separate neural systems for processing action- or non-action-related sounds," NeuroImage, vol. 24, no. 3, pp. 852–861, 2005.

[9] Danièle Dubois, Catherine Guastavino, and Manon Raimbault, "A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories," Acta Acustica united with Acustica, vol. 92, no. 6, pp. 865–874, 2006.

[10] James A. Ballas, "Common factors in the identification of an assortment of brief everyday sounds," Journal of Experimental Psychology: Human Perception and Performance, vol. 19, no. 2, pp. 250, 1993.

[11] Guillaume Lemaitre, Olivier Houix, Nicolas Misdariis, and Patrick Susini, "Listener expertise and sound identification influence the categorization of environmental sounds," Journal of Experimental Psychology: Applied, vol. 16, no. 1, pp. 16, 2010.

[12] Margaret M. Bradley and Peter J. Lang, "The international affective digitized sounds (IADS-2): Affective ratings of sounds and instruction manual," University of Florida, Gainesville, FL, Tech. Rep. B-3, 2007.

[13] Annett Schirmer, Yong Hao Soh, Trevor B. Penney, and Lonce Wyse, "Perceptual and conceptual priming of environmental sounds," Journal of Cognitive Neuroscience, vol. 23, no. 11, pp. 3241–3253, 2011.

[14] Bradley R. Buchsbaum, Rosanna K. Olsen, Paul Koch, and Karen Faith Berman, "Human dorsal and ventral auditory streams subserve rehearsal-based and echoic processes during verbal working memory," Neuron, vol. 48, no. 4, pp. 687–697, 2005.

[15] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson, "CNN architectures for large-scale audio classification," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[16] Lifu Huang, Heng Ji, et al., "Learning phrase embeddings from paraphrases with GRUs," in Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora, 2017, pp. 16–23.

[17] Chris D. Paice, "Another stemmer," in ACM SIGIR Forum, 1990, vol. 24, pp. 56–61.

[18] George A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.

[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[20] Robyn Speer, Joshua Chin, and Catherine Havasi, "ConceptNet 5.5: An open multilingual graph of general knowledge," in AAAI, 2017, pp. 4444–4451.
[21] Wilma A. Bainbridge, Phillip Isola, and Aude Oliva, "The intrinsic memorability of face photographs," Journal of Experimental Psychology: General, vol. 142, no. 4, pp. 1323, 2013.
