The Intrinsic Memorability of Everyday Sounds


Authors: David B. Ramsay, Ishwarya Ananthabhotla, Joseph A. Paradiso

The Intrinsic Memorability of Everyday Sounds
Audio Engineering Society Conference Paper
Presented at the Conference on Immersive and Interactive Audio, 2019 March 27–29, York, UK

This paper was peer-reviewed as a complete manuscript for presentation at this conference. This paper is available in the AES E-Library (http://www.aes.org/e-lib); all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

David B. Ramsay*, Ishwarya Ananthabhotla*, and Joseph A. Paradiso
Responsive Environments, MIT Media Lab
Correspondence should be addressed to Ishwarya Ananthabhotla (Ishwarya@mit.edu)
*Equal contribution. The authors would like to thank the AI Grant for their financial support of this work.

ABSTRACT

Our aural experience plays an integral role in the perception and memory of the events in our lives. Some of the sounds we encounter throughout the day stay lodged in our minds more easily than others; these, in turn, may serve as powerful triggers of our memories. In this paper, we measure the memorability of everyday sounds across 20,000 crowd-sourced aural memory games, and assess the degree to which a sound's memorability is constant across subjects. We then use this data to analyze the relationship between memorability and acoustic features like harmonicity, spectral skew, and models of cognitive salience; we also assess the relationship between memorability and high-level features that depend on the sound source itself, such as its familiarity, valence, arousal, source type, causal certainty, and verbalizability. We find (1) that our crowd-sourced measures of memorability and confusability are reliable and robust across participants; (2) that the authors' measure of collective causal uncertainty detailed in our previous work, coupled with measures of visualizability and valence, are the strongest individual predictors of memorability; (3) that acoustic and salience features play a heightened role in determining "confusability" (the false positive selection rate associated with a sound) relative to memorability; and (4) that, within the framework of our assessment, memorability is an intrinsic property of the sounds from the dataset, shown to be independent of surrounding context. We suggest that modeling these cognitive processes opens the door for human-inspired compression of sound environments, automatic curation of large-scale environmental recording datasets, and real-time modification of aural events to alter their likelihood of memorability.

1 Introduction

For a sound to enter our memory, it is first unconsciously processed by a change-sensitive, gestalt neural mechanism before passing through a conscious filtering process [1, 2, 3]. We then encode this auditory information via a complex and variable process; frequently we abstract our experiences into words, though we also utilize phonological-articulatory, visual/visuospatial, semantic, and echoic memory [4, 5]. Different types of memory may also drive more visceral forms of recollection and experience; non-semantic memory, for example, may underpin powerful recollection and nostalgia experiences similar to those reported with music [6].

In this work, we map out the features of everyday sounds that drive their memorability using an auditory memory game.
As a recall experiment, we hypothesize that it can provide useful insights into cognitive models for auditory capture and curation. Additionally, we design the task such that it is beyond the capacity of our working and echoic memory and engages long-term memory cognitive processes [7, 8]. With this work we hope to illuminate the role of top-down features – imageability, emotionality, causal certainty, and familiarity – in auditory memory. Using state-of-the-art cognitive saliency models, we also explore the relative importance of low-level acoustic descriptors against high-level conceptual ones for memory formation. To our knowledge, this is the first general treatment of auditory memorability that combines low-level auditory salience models with multi-domain, top-down cognitive gestalt features. This work enables more accurate models of auditory memory, and represents a step toward cognitively-inspired compression of everyday sound environments, automatic curation of large-scale environmental recording datasets, and real-time modification of aural events to alter their likelihood of memorability.

2 What Influences Memorability?

Many factors influence the cognitive processes underlying human aural processing and storage. Research shows a complicated interdependence between attention, acoustic feature salience, source concept salience, emotion, and memory; furthermore, verbal, pictorial, and phonological-articulatory mnemonics can have a significant impact on sound recall tasks.

Neuroscience research supports the idea that gestalt auditory pre-processing is followed by attentive filtering prior to conscious perception [1, 2]. These gestalt representations incorporate both 'bottom-up' and 'top-down' processes – sounds that are contextually novel based only on their acoustic features, as well as sounds that are only conceptually novel, lead to distinct and measurable variations in unconscious event-related potentials (ERPs) [9]. These data motivate the need to incorporate both high-level conceptual features and low-level acoustic features relative to a sound context, even for simple models of auditory processing, attention, and memory.

The stored gestalt representation of the current sound context – necessary to explain change-driven ERPs – can be thought of as the first stage of auditory memory [3]. This immediate store, known as 'echoic memory', starts decaying exponentially by 100 ms after sound onset [10]. Measurements of ERPs suggest immediate storage of rhythmic stimuli on the order of 100 ms with a resolution as low as 5 ms [11]; other studies have shown this immediate store is complemented by an additional echoic mechanism that lasts several seconds [12, 13]. On these time scales, our auditory system compresses its perceptual representation of textures based on time-averaged statistics [14].

These principles have been used to design 'bottom-up' cognitive saliency models [15, 16]. While other time-averaged low-level features have been used to quantify sound similarity [17], saliency models are now common in practical applications [18, 19]. Although the above work does not include higher-level gestalt processing, a few researchers have successfully combined low-level saliency modeling with a focused, task-specific top-down cognitive model [20, 21, 22].
These models are not designed to generalize outside of their domain, however.

In general, high-level 'top-down' features have been an area of intense study since the 1950s, when Colin Cherry demonstrated that his subjects noticed their name – and no other verbal content – spoken by a secondary speaker in a shadowing test [23]. Besides the ongoing work in verbal auditory processing, research into non-verbal stimuli and auditory memory also provides insight into the role of conceptual abstractions in modulating attention and memory in a more general sense.

One such abstraction – emotionality – is known to have a powerful effect on cognitive processing and memory formation [24]. For music, recall has been shown to improve with positive-valence, high-arousal sound events [6], though recent research has called the significance of arousal into question [25]. In noise pollution research, the high-level perception of human activity is considered 'pleasant' (more positively valenced) regardless of low-level acoustic features [26]. In general, the emotional impact of a sound is correlated with the clarity of its perceived source [27], though sounds can have emotional impact even without a direct mapping to an explicit abstract idea [28].

For recognition and recall memory exercises, verbalizing a sound or naming a sound (both of which may engage phonological-articulatory motor memory) is the most common and successful strategy [29]. This semantic abstraction has overshadowing effects, though; verbal descriptions can distort recollection of the sound itself, degrading recognition performance without altering confidence [30]. Some researchers specifically isolate and study echoic memory, separating it from naming (a process that does not involve outward verbalization) by using homophonic sound sources to ensure subjects are not relying internally on a naming mechanism [31]. In everyday life, though, we naturally rely on a complex mixture of echoic (perceptual), phonological-articulatory (motor), verbal (semantic), and visual memory [4, 5].

In our previous work [27] we curated a large dataset of everyday sounds to include high-level features that may influence a sound's memorability; most notably, its causal uncertainty, denoted H_cu (low for sounds that imply a clear, unambiguous source), the implied source itself as determined by crowd-sourced workers, and its acoustic features. We also collected ratings for the valence and arousal of each sound, its familiarity, and how easily it conjures a mental image (features that have strong correlations with H_cu). Combined, these data give us insight into a sample's emotionality, as well as the ease with which it can be stored in semantic or visual memory. Embeddings based on location relationships of sound sources ('at-location', 'located-near') give us additional insight into the conceptual distinctness of a sound compared to the contextualizing sounds of a soundscape, and can serve as a first-order proxy for ecological exposure.

In this paper, we explore the relationship of both low-level acoustic features and high-level conceptual features with the memorability of sound, in a context that engages long-term memory processes.
We hope that a thorough analysis of the major low- and high-level features from the literature might lay the foundation for a practical, generalized model of auditory memory. We set out to test a few important hypotheses, namely:

1. The cognitive processing of sound is similar enough across people that trends in recall across sound samples will be measurable and robust across users.

2. Higher-level gestalt features will be most predictive of successful recall performance. We see from the literature that naming and emotionality have very strong effects in similar tasks; we expect sounds with low H_cu (whose source is easy to name) and strong valence/high arousal to be the most memorable. High-H_cu (uncertain) sounds elicit weaker emotions, reinforcing this effect.

3. Low-level acoustic feature information will marginally predict memory performance. Gestalt features are not easily mapped to low-level feature space (and we expect gestalt features to dominate); however, the literature suggests a measurable, second-order contribution from low-level perceptual saliency modeling.

4. The likelihood of a sound eliciting a false memory will be best predicted by its conceptual familiarity as well as by low-level acoustic features.

5. The context a sound is presented against will have a marginal but measurable impact on whether it is recalled. In other words, we expect emotional and unambiguous sounds to be the most memorable regardless of presentation, but when a sound stands out against the immediately preceding sounds, we hypothesize that it will be slightly more memorable.

3 Samples and Feature Generation

Audio samples for this test were taken from the HCU400 dataset [27]. Standard low-level acoustic features were extracted from each sample based on prior precedent [17]. We used default configurations from three audio analysis tools – Librosa [32], pyAudioAnalysis [33], and Audio Commons [34] – which include basic features (e.g. spectral spread) as well as more advanced timbral modeling. We supplement these features with additional summary statistics like high/mid/bass energy ratios, percussive vs. harmonic energy, and pitch contour diversity.

Over the last decade there have been advances in cognitive models, inspired by the neuroscience of perception, that can determine the acoustic salience of sound [16, 19]. Here we follow the procedure proposed by [15], applying separate temporal, frequency, and intensity kernels to an input magnitude spectrogram to produce three time-frequency salience maps. Figure 1 shows a comparison of temporal salience between two sound samples in the HCU400 dataset with highly contrasting auditory properties. From these maps, we compute a series of summary statistics to be used as features.

High-level, top-down features were taken from our previous work in [27] and include causal uncertainty (H_cu), the cluster diameter of embedding vectors generated from user-provided labels (quantifying source agreement or source location), familiarity, imageability, valence, and arousal.

Fig. 1: The auditory salience model based on [15] applied to two contrasting audio samples from the HCU400 dataset ("vomit" and "marketplace"). The resulting salience scores (bottom) are summarized and used as features in predicting memorability.
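As a rough illustration (not the authors' implementation), the following Python sketch computes a few of the named low-level features with Librosa and a single-scale, center-surround approximation of the salience maps of [15]. The kernel widths, the rectification step, and the particular summary statistics are illustrative assumptions; the original model uses multiple scales and separate temporal, frequency, and intensity kernels.

```python
# Minimal sketch: low-level features plus a single-scale approximation of the
# center-surround salience maps of Kayser et al. [15]. Gaussian widths and
# summary statistics below are illustrative assumptions, not the paper's config.
import numpy as np
import librosa
from scipy import ndimage, stats

def basic_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y))                               # magnitude spectrogram
    spread = librosa.feature.spectral_bandwidth(S=S, sr=sr)   # "spectral spread"
    harmonic, percussive = librosa.effects.hpss(y)            # harmonic vs. percussive
    feats = {
        "avg_spectral_spread": float(spread.mean()),
        "peak_spectral_spread": float(spread.max()),
        "perc_harm_energy_ratio": float(np.sum(percussive**2) /
                                        (np.sum(harmonic**2) + 1e-9)),
    }
    return feats, S

def salience_map(S, axis, sigma_center=1.0, sigma_surround=4.0):
    # Difference-of-Gaussians along one axis of the spectrogram:
    # axis=1 approximates a temporal kernel, axis=0 a frequency kernel.
    sig_c = [sigma_center if a == axis else 0 for a in (0, 1)]
    sig_s = [sigma_surround if a == axis else 0 for a in (0, 1)]
    center = ndimage.gaussian_filter(S, sig_c)
    surround = ndimage.gaussian_filter(S, sig_s)
    return np.maximum(center - surround, 0)                   # half-wave rectified

def salience_summary(S):
    # Summary statistics over each map, used downstream as predictors.
    feats = {}
    for name, axis in [("temporal", 1), ("frequency", 0)]:
        m = salience_map(S, axis)
        feats[f"{name}_peak_energy"] = float(m.max())
        feats[f"{name}_skew"] = float(stats.skew(m.ravel()))
    return feats
```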
4 Measuring Memorability

In order to quantify memorability, we drew inspiration from work in [35], which used an online memory game to determine the features that make images memorable. We designed an analogous interface for the audio samples in the HCU400 dataset; this interface can be found at http://keyword.media.mit.edu/memory. The game opens with a short auditory phase-alignment-based assessment [36] to ensure that participants are wearing headphones, followed by a survey that captures data about where they spend their time (urban vs. rural areas, the workplace vs. home, etc.). Participants are then presented with a series of 5-second sound clips from the HCU400 dataset, and are asked to click when they encounter a sound that they've heard previously in the task. At the end of each round, consisting of roughly 70 sound clips, the participant is provided with a score. Screenshots of the interface at each stage are shown in Figure 2.

By design, each round of the game consisted of 1-2 pairs of target sounds and 20 pairs of vigilance sounds. Target sounds were defined as samples from the dataset that were separated by exactly 60 samples – the sounds for which memorability was being assessed in a given round. The vigilance sounds, pairs of sounds that were separated in the stream by 2 to 3 others, were used to ensure reliable engagement throughout the task, following the method in [35]. Roughly 20,000 rounds were crowd-sourced on Amazon Mechanical Turk such that a single task consisted of a single round of the game. Individual workers were limited to no more than 8 rounds to ensure that target samples were not repeated. Rounds that failed to meet a minimum vigilance score (>60%) or exceeded a maximum false positive rate (>40%) were discarded (a sketch of this construction and filtering logic follows at the end of Section 5).

Fig. 2: Screenshots of the auditory memory game interface presented to participants as part of our study. The game can be found at http://keyword.media.mit.edu/memory.

5 Summary of Participant Data

We recruited 4488 participants, consisting of a small (<50) number of volunteers from the university community and the rest from Amazon Mechanical Turk. Our survey data shows that our participants report a 51/37/12% split between urban, suburban, and rural communities. We see weak trends in the average time per location reported for each community type – urbanites self-report spending less time at home, in the kitchen, in cars, and watching media on average, while rural participants report spending more time in churches and in nature. Using KNN clustering and silhouette analysis, we find four latent clusters – students (590 users), office workers (1250 users), home-makers (1640 users), and none of these (1010 users). Split-rank comparisons between groups did not reveal meaningful differences in results across user groups; we speculate that any differences due to ecological exposure to sounds between environments are not consistent or influential enough at this group level to alter performance.

Fig. 3: Top: a histogram of the raw scores for each sound – sounds were successfully remembered and identified about 55% of the time on average, with a large standard deviation. Bottom: a histogram of "confusability" scores for each sound, with an average score of about 25%.
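The round structure and quality filters described in Section 4 can be sketched as follows. Function names, pool handling, and the placement strategy are hypothetical, chosen only to mirror the stated constraints (target pairs 60 positions apart, vigilance pairs separated by 2-3 sounds, vigilance score above 60%, false positive rate below 40%).

```python
# Hedged sketch of one game round and the quality filters described in Section 4.
# Names and data shapes are hypothetical, not taken from the study's codebase.
import random

def build_round(target_pool, filler_pool, n_targets=2, n_vigilance=20, length=70):
    """Place 1-2 target pairs exactly 60 clips apart and ~20 vigilance pairs
    with 2-3 intervening sounds, in a round of roughly 70 clips."""
    slots = [None] * length
    for _ in range(n_targets):
        free = [i for i in range(length - 60)
                if slots[i] is None and slots[i + 60] is None]
        first = random.choice(free)
        slots[first] = slots[first + 60] = target_pool.pop()
    placed = 0
    while placed < n_vigilance:
        gap = random.choice([3, 4])        # i.e. 2-3 sounds in between
        free = [i for i in range(length - gap)
                if slots[i] is None and slots[i + gap] is None]
        if not free:
            break                          # round is full; accept fewer pairs
        i = random.choice(free)
        slots[i] = slots[i + gap] = filler_pool.pop()
        placed += 1
    # Fill any remaining slots with unique, unpaired fillers.
    return [s if s is not None else filler_pool.pop() for s in slots]

def keep_round(vigilance_hit_rate, false_positive_rate):
    # Discard rounds with low engagement or excessive guessing (Section 4).
    return vigilance_hit_rate > 0.60 and false_positive_rate < 0.40
```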
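The split-rank reliability analyses referenced in Section 5 and detailed in Section 6 (Figure 4) can be approximated with a short sketch: participants are repeatedly split in half, per-sound scores are recomputed independently in each half, and rank agreement between halves is measured with Spearman's rho. The interpretation of the "5 splits" as five random half-splits, and all names below, are assumptions.

```python
# Hedged sketch of a split-rank reliability check in the spirit of [35].
import numpy as np
from scipy.stats import spearmanr

def split_rank_reliability(responses, n_splits=5, rng=None):
    """responses: dict mapping participant id -> {sound id: 0/1 recall outcome}.
    Returns one Spearman rho per random half-split of the participants."""
    rng = rng or np.random.default_rng(0)
    participants = list(responses)
    rhos = []
    for _ in range(n_splits):
        rng.shuffle(participants)
        half = len(participants) // 2
        halves = participants[:half], participants[half:]
        means = []
        for group in halves:
            per_sound = {}
            for p in group:
                for sound, outcome in responses[p].items():
                    per_sound.setdefault(sound, []).append(outcome)
            means.append({s: np.mean(v) for s, v in per_sound.items()})
        common = sorted(set(means[0]) & set(means[1]))   # sounds seen by both halves
        rho, _ = spearmanr([means[0][s] for s in common],
                           [means[1][s] for s in common])
        rhos.append(rho)
    return rhos
```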
6 Summary of Memory Data

The raw memorability score M for each sound is computed simply as the number of times it was correctly identified as the target divided by the number of its appearances. However, this does not account for the likelihood that the sound will be falsely remembered (i.e. clicked on without a prior presentation). We additionally compute a "confusability" score C_10 for each sound sample, defined as the false positive rate for sounds when they fall close to the second target presentation (i.e. in the last ten positions of the game). We can thus derive a "normalized memory score", M − C_10. In attempting to understand auditory memory, we consider both what makes a sound memorable and what makes a sound easily mis-attributed to other sounds, whether those sounds are encountered in our game or represent the broader set of sounds that one encounters on a habitual basis. We therefore model both normalized memorability and confusability in this work.

We confirm the reliability of both the normalized memory scores and the confusability scores across participants by performing a split-ranking analysis similar to [35] with 5 splits, shown in Figure 4 with their respective Spearman correlation coefficients. This confirms that memorability and confusability are consistent, user-independent properties. In Table 1, we show a short list of the most and least memorable and confusable sounds in our dataset as a function of the normalized memorability score and confusability score.

Fig. 4: The results of the split-ranking analysis for the normalized memorability score and confusability score, using 5 splits. The Spearman correlation coefficients demonstrate the reliability of these scores across study participants, enabling us to model both metrics in the later parts of the work.

Most Memorable         Least Memorable
man_screaming.wav      morphed_firecracker_fx.wav
woman_screaming.wav    truck_(idling).wav
flute.wav              morphed_turkey2_fx.wav
woman_crying.wav       morphed_airplane_fx.wav
opera.wav              morphed_metal_gate_fx.wav
yawn.wav               morphed_shovel_fx.wav

Most Confusable                Least Confusable
garage_opener.wav              clock.wav
lawn_mower.wav                 morphed_335538_fx.wav
washing_machine.wav            phone_ring.wav
rain.wav                       woman_crying.wav
morphed_tank_fx.wav            woman_screaming.wav
morphed_printing_press_fx.wav  vomit.wav

Table 1: A list of the most and least memorable and confusable sounds from the HCU400 dataset.

7 Feature Trends in Memorability and Confusability

We consider two objectives: (1) to determine the relationship between individual features and our measured memorability and confusability scores, and (2) to determine the relative importance of these features in predicting memorability and confusability. To address the former, we report the resulting R² value after applying a transform learned using support vector regression (SVR) for each individual feature. For the latter, we use a sampled Shapley value regression technique in the context of SVR: we first take N random features (N between 1 and 10), perform an SVR to predict the memorability or confusability scores for our 402 sounds, and compute the R² of the fit. We then measure the change in R² as we append every remaining feature to the model, each individually.
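A minimal sketch of this sampled Shapley-style procedure follows, assuming scikit-learn's SVR with its defaults (the paper does not specify the kernel or exact sampling scheme); the helper names are hypothetical.

```python
# Hedged sketch of the sampled Shapley-style Delta-R^2 procedure described above.
# Kernel choice and sampling details are assumptions; the paper reports average
# changes over 10k sampled models, which is computationally heavy as written.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score

def fit_r2(X, y):
    model = SVR().fit(X, y)                  # default RBF kernel assumed
    return r2_score(y, model.predict(X))     # R^2 of the fit, per the paper

def sampled_shapley_delta_r2(X, y, n_models=10_000, rng=None):
    rng = rng or np.random.default_rng(0)
    n_feats = X.shape[1]
    deltas = np.zeros(n_feats)
    counts = np.zeros(n_feats)
    for _ in range(n_models):
        k = rng.integers(1, 11)                              # N random features, 1..10
        base = rng.choice(n_feats, size=k, replace=False)
        r2_base = fit_r2(X[:, base], y)
        for f in range(n_feats):                             # append each remaining
            if f in base:                                    # feature individually
                continue
            cols = np.append(base, f)
            deltas[f] += fit_r2(X[:, cols], y) - r2_base
            counts[f] += 1
    return deltas / np.maximum(counts, 1)                    # average Delta-R^2
```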
The largest average changes over 10k models are reported in Table 2. This technique is robust to complex, nonlinear relationships between feature space and the predicted metric, as well as to feature collinearity.

We find that the strongest predictors of both memorability and confusability are the measures of imageability (how easy the sound is to visualize) and causal uncertainty. Memorability is dominated by high-level, gestalt features, with only one lower-level feature ('pitch diversity') among the ten most important. Low-level features, including those derived from the auditory salience models, play a more significant role in determining confusability.

The absolute R² values indicate that no individual feature is a significant predictor of memorability by itself. This implies a complex causal interplay in feature space, which we explore further in the set of plots presented in Figure 5. In each plot, we show the distribution of feature values for the 15% of sounds that are most memorable or least confusable (blue) contrasted against the least memorable or most confusable sounds (red). We first consider the effect of H_cu and valence on memory – low-memorability and high-confusability sounds exhibit a similar trend of high causal uncertainty and neutral valence (Column 1). In Column 2, we consider imageability and familiarity ratings, shown to be strongly collinear in [27]. Here, their relationships to memorability and confusability diverge; while both are positively correlated with memorability, neutral ratings are the stronger predictor of confusability. This suggests that we are most likely to confuse sounds if they are loosely familiar but neither strictly novel nor immediately recognizable. Finally, Column 3 reveals a discernible decision boundary in low-level feature space for confusability that does not exist in its memorability counterpart. The relative importance of low-level salience features, here represented by spectral spread, aligns with intuition – we hypothesize that, in the absence of strong causal uncertainty or affect feature values, our perception of sounds is driven by their spectral properties.

8 Per-game Modeling of Memorability

The aural context in which a sound is presented, which includes ecological exposure as well as the immediately preceding sounds in our audition task, may influence the memory formation process. The literature supports the notion that, given a context, unexpected sounds are more likely to grab our attention and engage memory [9]. To understand this effect in our test, we ran two studies based on a 5-sound context (approximating the limits of semantic working memory) and a 1-sound context (approximating the limits of echoic memory).

Table 3 shows the results of a model trained to predict whether the target in each game will be successfully recalled. This model was trained with the most memorable and least memorable sounds only (15th/85th percentiles) using a 5-fold cross-validation process, and results are reported on a 15% hold-out test set. To begin, a baseline model is trained using the absolute, immutable features of the target sound.
Because there are a limited number of sounds in our dataset relative to the number of games, the feature space is redundant and sparse, and we expect the accuracy of this model to converge to the average expected value over our set of sounds. We then introduce contextual features – the relative difference (z-score) of the target sound's features against those of the varying sounds that precede its first presentation in each game – to see if our model improves. Evidence of over-fitting on the train and validation sets when all of the contextual features are included (decreasing test set accuracy) motivates a second test, limited to a smaller set of the 50 most meaningful features (from our SVR analysis; 25 high-level and 25 low-level). In both cases, however, model performance does not improve as we would expect if the context contained additional useful information.

We also run a classifier that uses only contextual features, to ensure informative context has not been obscured, since useful information in the context could be subsumed by the absolute features in our first test. We run a noise baseline in which contextual features are calculated using a random context; these are still informative, as the z-score depends largely on the absolute features of the target sound. We then run a model with the proper context to measure the difference in performance. There is no improvement when the proper context is re-introduced.

This leads us to a meaningful insight, contrary to our hypothesis: context does not exert a measurable influence on our results. While context likely does matter in real-world settings, we suspect that our memory game framework indirectly primes participants to expect otherwise surprising sounds. This confirms that our data is the consequence of truly intrinsic properties of the sounds themselves, independent of immediate context and participant ecological exposure (as was demonstrated in the split-rank analysis).

9 Implications and Conclusion

In this work, we quantify the inherent likelihood that a sound will be remembered or incorrectly confused, and confirm that it is consistent across user groups. In line with our hypotheses, we show that the most important features that contribute to a sound being remembered are gestalt – namely, a clear sound source (low H_cu), ease of visualization, familiarity, and emotionality. We also show that high-H_cu sounds that are not familiar or easy to visualize are most likely to be mis-attributed, and that low-level features play a more important role in predicting this behavior. These relationships are not influenced by context, and are intrinsic properties of the sounds themselves.
Memorability
  Feature                      R²      Shapley ΔR²
  Imageability                 0.201   0.126
  Hcu                          0.224   0.125
  Familiarity                  0.176   0.123
  Valence                      0.178   0.120
  Location Embedding Density   0.147   0.117
  Familiarity std              0.103   0.117
  Pitch Diversity              0.084   0.113
  Imageability std             0.086   0.113
  Arousal                      0.072   0.112
  Arousal std                  0.056   0.111
  ------------------------------------------------
  Avg Spectral Spread          0.099   0.107
  Timbral Sharpness            0.094   0.091
  Max Energy                   0.091   0.100
  Treble Energy Ratio          0.090   0.020

Confusability
  Feature                                  R²      Shapley ΔR²
  Imageability                             0.065   0.078
  Hcu                                      0.073   0.078
  Avg Spectral Spread                      0.087   0.078
  Peak Spectral Spread                     0.037   0.076
  Peak Energy, Frequency Salience Map      0.059   0.076
  Location Embedding Density               0.100   0.076
  Frequency Skew, Frequency Salience Map   0.059   0.076
  Arousal                                  0.039   0.076
  Peak Energy, Intensity Salience Map      0.044   0.075
  Familiarity                              0.045   0.075
  Valence                                  0.100   0.075
  ------------------------------------------------------------
  Timbral Roughness                        0.094   0.047
  Avg Flux, Sub-band 1                     0.092   0.064
  Flux Entropy, Sub-band 1                 0.091   0.061

Table 2: The top-performing features from the Shapley regression analysis for both memorability and confusability, ordered by their respective contributions to the R² value; additional features with top-performing individual R² values are appended below the rule. The R² column indicates the individual predictive power of each feature; the Shapley ΔR² column indicates its relative importance in the context of the full feature set.

Fig. 5: Scatter plots showing the changes in distribution of select features based on extremes in memorability (top row) and confusability (bottom row); blue indicates sounds that are most (85th percentile) memorable or least (15th percentile) confusable; red indicates sounds that are least memorable or most confusable.

To our knowledge, this is the first body of work that combines top-down theories from psychology and cognition with bottom-up auditory salience frameworks to model the memorability of everyday sounds. We posit that the demonstration of memorability as an intrinsic, user- and context-independent property of sounds, along with the insights mentioned above, has significant implications for audio technology research. For example, knowing that gestalt features are the primary drivers of memorability might allow us to selectively choose audio samples in a stream or sound environment to be recorded and stored, as a way of mimicking human memory to perform compression at a level of abstraction higher than the sample level. An understanding of the most significant predictors of memorability and confusability might also allow us to artificially manipulate our sonic environments to make certain streams of audio more or less memorable, perhaps as a memory aid or as a mechanism to eliminate distractions vying for our attention.
Looking ahead, we aim to enable many of these applications by translating the principles from this work to an online, real-time model.

Memorability Per-Game Models
  Features                                                   Accuracy (%)
  Absolute + All 5-Sound Context Feats (working semantic)    68.0
  Absolute + Top 50 5-Sound Context Feats                    69.1
  Absolute Feature Only Baseline (~expected value)           70.3
  Contextual Only, 5-Sound Context (working semantic)        62.5
  5-Sound Context, Noise Baseline                            64.1
  Absolute + All 1-Sound Context Feats (echoic)              68.0
  Absolute + Top 50 1-Sound Context Feats                    69.5
  Absolute Feature Only Baseline (~expected value)           70.3
  Contextual Only, 1-Sound Context (echoic)                  60.0
  1-Sound Context, Noise Baseline                            61.3

Table 3: The influence of contextual sounds before the first presentation of the target on our ability to predict recall across games.

References

[1] Snyder, J. S. and Elhilali, M., "Recent advances in exploring the neural underpinnings of auditory scene perception," Annals of the New York Academy of Sciences, 1396(1), pp. 39–55, 2017.

[2] Winkler, I., Denham, S. L., and Nelken, I., "Modeling the auditory scene: predictive regularity representations and perceptual objects," Trends in Cognitive Sciences, 13(12), pp. 532–540, 2009.

[3] Inui, K., Urakawa, T., Yamashiro, K., Otsuru, N., Nishihara, M., Takeshima, Y., Keceli, S., and Kakigi, R., "Non-linear laws of echoic memory and auditory change detection in humans," BMC Neuroscience, 11(1), p. 80, 2010.

[4] Buchsbaum, B. R., Olsen, R. K., Koch, P., and Berman, K. F., "Human dorsal and ventral auditory streams subserve rehearsal-based and echoic processes during verbal working memory," Neuron, 48(4), pp. 687–697, 2005.

[5] Vaidya, C. J., Zhao, M., Desmond, J. E., and Gabrieli, J. D., "Evidence for cortical encoding specificity in episodic memory: memory-induced re-activation of picture processing areas," Neuropsychologia, 40(12), pp. 2136–2143, 2002.

[6] Jäncke, L., "Music, memory and emotion," Journal of Biology, 7(6), p. 21, 2008.

[7] Ma, W. J., Husain, M., and Bays, P. M., "Changing concepts of working memory," Nature Neuroscience, 17(3), p. 347, 2014.

[8] Loaiza, V. M., Duperreault, K. A., Rhodes, M. G., and McCabe, D. P., "Long-term semantic representations moderate the effect of attentional refreshing on episodic memory," Psychonomic Bulletin & Review, 22(1), pp. 274–280, 2015.

[9] Schirmer, A., Soh, Y. H., Penney, T. B., and Wyse, L., "Perceptual and conceptual priming of environmental sounds," Journal of Cognitive Neuroscience, 23(11), pp. 3241–3253, 2011.

[10] Lu, Z., Williamson, S., and Kaufman, L., "Behavioral lifetime of human auditory sensory memory predicted by physiological measures," Science, 258(5088), pp. 1668–1670, 1992.

[11] Nishihara, M., Inui, K., Morita, T., Kodaira, M., Mochizuki, H., Otsuru, N., Motomura, E., Ushida, T., and Kakigi, R., "Echoic memory: investigation of its temporal resolution by auditory offset cortical responses," PLoS ONE, 9(8), p. e106553, 2014.

[12] Cowan, N., "On short and long auditory stores," Psychological Bulletin, 96(2), p. 341, 1984.

[13] Alain, C., Woods, D. L., and Knight, R. T., "A distributed cortical network for auditory sensory memory in humans," Brain Research, 812(1-2), pp. 23–37, 1998.

[14] McDermott, J. H., Schemitsch, M., and Simoncelli, E. P., "Summary statistics in auditory perception," Nature Neuroscience, 16(4), p. 493, 2013.
[15] Kayser, C., Petkov, C. I., Lippert, M., and Logothetis, N. K., "Mechanisms for allocating auditory attention: an auditory saliency map," Current Biology, 15(21), pp. 1943–1947, 2005.

[16] Delmotte, V. D., Computational Auditory Saliency, Ph.D. thesis, Georgia Institute of Technology, 2012.

[17] Richard, G., Sundaram, S., and Narayanan, S., "An overview on perceptually motivated audio indexing and classification," Proceedings of the IEEE, 101(9), pp. 1939–1954, 2013.

[18] Schauerte, B., Kühn, B., Kroschel, K., and Stiefelhagen, R., "Multimodal saliency-based attention for object-based scene analysis," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pp. 1173–1179, IEEE, 2011.

[19] Kalinli, O., Sundaram, S., and Narayanan, S., "Saliency-driven unstructured acoustic scene classification using latent perceptual indexing," in Multimedia Signal Processing, 2009. MMSP'09. IEEE International Workshop on, pp. 1–6, IEEE, 2009.

[20] Kalinli, O. and Narayanan, S. S., "Combining task-dependent information with auditory attention cues for prominence detection in speech," in Ninth Annual Conference of the International Speech Communication Association, 2008.

[21] Kalinli, O. and Narayanan, S., "Prominence detection using auditory attention cues and task-dependent high level information," IEEE Transactions on Audio, Speech, and Language Processing, 17(5), pp. 1009–1024, 2009.

[22] Marchegiani, M. L., "Top-Down Attention Modelling in a Cocktail Party Scenario," 2012.

[23] Cherry, E. C., "Some experiments on the recognition of speech, with one and with two ears," The Journal of the Acoustical Society of America, 25(5), pp. 975–979, 1953.

[24] LeDoux, J. E., "Emotion, memory and the brain," Scientific American, 270(6), pp. 50–57, 1994.

[25] Eschrich, S., Münte, T. F., and Altenmüller, E. O., "Unforgettable film music: the role of emotion in episodic long-term memory for music," BMC Neuroscience, 9(1), p. 48, 2008.

[26] Dubois, D., Guastavino, C., and Raimbault, M., "A cognitive approach to urban soundscapes: Using verbal data to access everyday life auditory categories," Acta Acustica united with Acustica, 92(6), pp. 865–874, 2006.

[27] Ananthabhotla, I., Ramsay, D., and Paradiso, J., "HCU400: An Annotated Dataset for Exploring Aural Phenomenology through Causal Uncertainty," International Conference on Acoustics, Speech, and Signal Processing, 2018, under review; arXiv:1811.06439.

[28] Quirin, M., Kazén, M., and Kuhl, J., "When nonsense sounds happy or helpless: the implicit positive and negative affect test (IPANAT)," Journal of Personality and Social Psychology, 97(3), p. 500, 2009.

[29] Bartlett, J. C., "Remembering environmental sounds: The role of verbalization at input," Memory & Cognition, 5(4), pp. 404–414, 1977.

[30] Mitchell, H. F. and MacDonald, R. A., "Remembering, Recognizing and Describing Singers' Sound Identities," Journal of New Music Research, 40(1), pp. 75–80, 2011.

[31] Conrad, R., "The developmental role of vocalizing in short-term memory," Journal of Memory and Language, 11(4), p. 521, 1972.
[32] McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O., "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, pp. 18–25, 2015.

[33] Giannakopoulos, T., "pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis," PLoS ONE, 10(12), 2015.

[34] Font, F., Brookes, T., Fazekas, G., Guerber, M., La Burthe, A., Plans, D., Plumbley, M. D., Shaashua, M., Wang, W., and Serra, X., "Audio Commons: bringing Creative Commons audio content to the creative industries," in Audio Engineering Society Conference: 61st International Conference: Audio for Games, Audio Engineering Society, 2016.

[35] Bainbridge, W. A., Isola, P., and Oliva, A., "The intrinsic memorability of face photographs," Journal of Experimental Psychology: General, 142(4), p. 1323, 2013.

[36] Woods, K. J., Siegel, M. H., Traer, J., and McDermott, J. H., "Headphone screening to facilitate web-based auditory experiments," Attention, Perception, & Psychophysics, 79(7), pp. 2064–2072, 2017.
