A Controlled Set-Up Experiment to Establish Personalized Baselines for Real-Life Emotion Recognition

We design, conduct and present the results of a highly personalized baseline emotion recognition experiment, which aims to set reliable ground-truth estimates of the subject's emotional state for real-life prediction under similar conditions, using a small number of physiological sensors. We also propose an adaptive stimuli-selection mechanism that would use the user's feedback as a guide for future stimuli selection in the controlled-setup experiment and generate optimal ground-truth personalized sessions systematically. Initial results are very promising (85% accuracy), and variable-importance analysis shows that only a few features, which are easy to implement in portable devices, would suffice to predict the subject's emotional state.

Authors: Varvara Kollia, Noureddine Tayebi

1 The main part of this work took place in Q2-Q3, 2015. Corresponding affiliations refer to that period.

Contact: {Varvara.Kollia, Noureddine.Tayebi}@intel.com

Keywords: emotion recognition, baseline, random forest, classification, stimuli selection, personalized selection system, adaptive stimuli generation, ranking, user's feedback, physiological sensors

1. Introduction

Recent advances in sensor technology open new possibilities in all aspects of human-machine interaction. One of the new areas of interest is the understanding of human cognitive and emotional behavior through bio-signals. Miniaturized physiological sensors allow for seamless integration in portable devices and lend themselves to providing continuous insights into the user's emotional state. The interpretation of these signals into distinct emotions is possible through pattern matching, via a machine learning algorithm.
Typical physiological sensors used for emotion recognition are: electrocardiogram (ECG), electromyogram (EMG), body temperature (T), galvanic skin response (GSR), photoplethysmogram (PPG) and even brain waves via electroencephalogram (EEG) sensors. It has been demonstrated that there is a clear correlation between the emotional states of the subject and these sensor data. However, applying trained models to real life has always been challenging, with the main difficulties stemming from the fact that there is no unique definition of emotions, from the subjective nature of the problem, and from the variety of other factors that need to be taken into consideration when transferring controlled-setup models to real life.

In this work, we tackle the emotion recognition problem at a personalized level, to take into account its subjectivity. In addition, we treat emotions largely as states, rather than instances, to be able to characterize the sensor data more reliably, and the emotional states themselves are defined to be clear, distinct and easy to stimulate, based on the available material. The controlled-setup experiment itself is highly personalized. The final goal would be to employ the personalized models in similar static real-life conditions and improve them subsequently, based on the subject's feedback.

2. Problem Definition

The main target of this study is to set guidelines for establishing the ground truth for personalized models in emotion recognition problems using biosignals.
In particular, we are trying to answer the question of telling one's emotional state from biosignals, by breaking it down into the following sub-problems: i) Ground-truth problem: training a model to predict the emotional state of the subject under constant external conditions in a controlled set-up environment where the provoked emotion is known a priori, to set the baseline; and ii) Real-life prediction: repeating the experiment under similar conditions with real-life data from the subject to predict their emotional state in a summarizing fashion, and comparing it to (any) input from the subject, if available, using the trained baseline model of the first stage.

Due to the subjective nature of the emotion recognition problem, we opt for a personalized model, where separate baseline experiments, with different stimuli, are used to set the baseline for each subject individually. More specifically, the goal of the experiment is to establish the ground truth for six emotions at a personalized level. The user rankings are used as guidelines, as the experiment progresses, for future stimuli (clip) selection, as well as for data processing. The main criterion for clip selection is to provoke strong-intensity, clear emotions that can form a baseline. A secondary goal is to rank the effectiveness of different bio-signals in emotion classification. Throughout this study, we use easy-to-derive time-domain features, ideal for implementation in wearable devices.

The experiment is also conducted with a more general goal in mind: to establish the baseline for the subject so that we are able to predict their emotions (in general categories) in real-life situations in the future, under similar (static) conditions, e.g. while watching movies, doing office work, during a social/professional visit, or while being engaged in activities that do not require significant movement/exercise. If this study were to be used as an emotion baseline for other aspects of the subject's life, a systematic way of evaluating the effect of movement, as well as context information (e.g. environmental factors), would need to be in place. The baseline model can be extended to more complex environments and settings, once the additional factors we need to consider are evaluated systematically through the use of additional sensors and other corresponding sources of relevant information, such as other mobile/portable devices.

2.a Sensors

A schematic of the problem can be seen in Fig. 1. As input signals, we use the heart-rate (HR) signal and its derivatives, namely heart rate variability (HRV), heart rate percent (HRP) and breathing rate (BR), as provided by electrocardiogram (ECG) and/or photoplethysmogram (PPG) sensors, as well as galvanic skin response (GSR) and skin temperature (SKT). Brain waves and muscular movements, as recorded by electroencephalogram (EEG) and electromyogram (EMG) sensors, were also considered, but are not part of this study, to maintain the recording system's ease of use and portability. Once these signals are recorded, they are processed with machine learning methods to provide the estimate of the subject's emotional state.

Fig. 1. System for Personalized Emotion Recognition.

Two collection devices were used for this experiment: the Zephyr BioHarness was used to report HR (and its derivatives) from an ECG sensor, and an in-house device was used to provide GSR (from the fingers) and SKT (from the wrist). ECG and PPG sensors for HR recording were also available from our recording device.
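Since the two recording devices report at different rates and on different clocks, their streams must be brought onto one time base before feature extraction. The paper does not publish code; the following is a minimal illustrative sketch in Python/numpy (the original prototypes were in Matlab and R), with hypothetical channel names, that drops missing samples and resamples each channel onto a shared 1 Hz grid via linear interpolation:

```python
import numpy as np

def align_to_common_clock(channels, fs_out=1.0):
    """Resample irregularly sampled channels onto one global clock.

    channels: dict name -> (timestamps_sec, values); devices may have
    different sampling rates and gaps (missing values as NaN).
    Returns (time grid, dict name -> resampled values).
    """
    t0 = max(t[0] for t, _ in channels.values())   # start of overlap
    t1 = min(t[-1] for t, _ in channels.values())  # end of overlap
    grid = np.arange(t0, t1, 1.0 / fs_out)
    aligned = {}
    for name, (t, v) in channels.items():
        t, v = np.asarray(t, float), np.asarray(v, float)
        keep = ~np.isnan(v)  # drop missing samples before interpolating
        # np.interp holds edge values, giving simple extrapolation at the ends
        aligned[name] = np.interp(grid, t[keep], v[keep])
    return grid, aligned
```

Linear interpolation with edge-hold is only one reasonable choice here; the paper states that interpolation/extrapolation was needed, not which scheme was used.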
The Zephyr BioHarness is placed across the chest, and an interface is available for signal collection, pre-processing, feature extraction, and storage. It is noteworthy that, as two separate devices were used, special attention was necessary for their time-alignment and interpolation/extrapolation, to handle missing values and correct for different sampling rates.

2.b Targeted Emotions

We will try to identify three positive and three negative distinct emotions. The six emotions we report on in this experiment are the following: sadness/anger, fear, disgust, awe/reverence, contentment, joy/amusement.

The definitions we use for the three negative emotions are as follows. Sadness is considered the primary emotion; anger may or may not be elicited as a secondary emotion, or as a means to cover the primary feeling, which is sadness. Sadness is the emotion that provokes reactions of tears, nervousness and helplessness. It makes one feel unhappy, miserable, desolate and in pain. Fear is the feeling that one is in danger and that there is an imminent threat. This feeling may cause panic and a state of high alert. Disgust is the repugnance caused by something hard to look at (or sense in general), which is repulsive and horrid.

The three positive emotions are awe/reverence, contentment and joy/amusement. Awe/reverence is the feeling associated with admiration at the sight of a wonder, and it includes the element of surprise; it is used in the sense of admiration for something exceptional. In contrast, contentment refers to a state of relaxation and happiness; it is the state of calm satisfaction. Joy/amusement is used here in conjunction with laughter; it is the emotion associated with something that one finds entertaining. These categories were chosen in terms of distinctiveness and relative ease of stimulation, in order for the experiment to be effective.
By no means is this a complete definition of emotions. These emotions were not selected to cover the entire emotional spectrum of the individual, but rather as a clean subset that lends itself to identification, with distinct boundaries among its members. That is the reasoning behind grouping anger with sadness, since it is usually hard to clearly separate the two. Likewise, as joy is a very general category, we target specifically joyful emotions related to amusement (laughter) only. We should also note that a small set of targeted emotions should always be defined beforehand, in order to provide a starting point for material selection and experiment definition. However, the definition is not hard, and it should be adjustable to some degree to the individual; mostly towards the direction of merging (or excluding) categories, rather than towards greater resolution, to allow for greater accuracy. Therefore, the emotions for the current study were not chosen to cover the entire range of emotion of (any) subject; they are meant to be used as broad categories that do not have much overlap and can be stimulated clearly.

3. Experiment

In this section, we describe the experiment design, the stimulus selection, the ranking and the protocol. We also touch on how the user's feedback is currently incorporated into the experimental design and how this process can be automated and optimized to allow for more robust emotional-baseline generation. We note that even though this experiment was designed for and conducted with adults, it could be used to establish baselines on children, in order to extract clear distinct emotions, by adjusting the stimuli selection and the emotion definition appropriately.
The reasoning behind this is that children usually experience clear emotions of strong intensity and of easily distinguishable duration. A direct real-life application where the trained baseline model could be tested would be in school, during hours of teaching and performance evaluations.

3.a Stimuli Selection

Publicly available video clips are used, for the most part, in this experiment. There are two main reasons behind this choice: i) the availability of a large number of online selections from different categories and ii) the combination of visual and auditory elements, which leads to greater efficiency in evoking particular feelings. More specifically, the stimuli (clips) are carefully selected from publicly available YouTube clips. In order to be appealing to a greater audience, relatively popular clips that gathered positive feedback were chosen. However, as the element of surprise is typically important, the most popular clips were left out on purpose, even though we generally aimed at widely accepted material, in terms of its efficiency to stimulate the desired response, based on other users' comments. Clips are selected from the news, from movies and from documentaries/TV series for the most part. Likewise, we aim at selecting mostly relatively contemporary clips, to maximize their effectiveness, keeping in mind that it is better if the subject has not been exposed to them for a while. In the long run, all this manual work should be substituted by a smart automatic selection system based on an initial questionnaire and the user's feedback as the experiment progresses.

In evoking the negative emotions, there are some generally accepted tragic historical events that are bound to result in strong sadness for most people. These are used as the main pool of sadness and anger stimuli.
Obviously, factors such as the age of the subject and their origin are, among others and to some extent, factors that should be considered in further personalizing the material. With respect to fear, well-known scenes from movies aimed at provoking fear are selected, as well as publicly available user compilations of relevant clips. These scenes should be independent and complete, in the sense that they should bring a person from a neutral state to the state of the targeted emotion, without any other background/information necessary. Finally, there are many clips based on personal experiences, as well as scenes from movies and TV shows, targeting disgust.

It is noteworthy that negative emotions, especially sadness/anger and fear, take time to build up; therefore very short clips should be avoided. Additionally, very long clips may lose in intensity, as the subject gets tired, and should therefore not be considered. We aim more at characterizing biosignals when the subject is in a state of, e.g., fear, than at capturing strong brief moments of fear, as the second case would be much harder to identify consistently. Specifically, with respect to the negative feelings, no personal material should be selected, even though personalization is possible; e.g. there are clips which would provoke extremely negative feelings, such as certain news stories, which would be selected based on previous input by the user.

On the other hand, there are some generally accepted incredible talents, acts/performances or places also bound to provoke admiration (awe) or other positive feelings. High-ranked funny compilations, in terms of everyday events and short movie parts from various popular sources, are used to evoke joy, which is here used in terms of amusement. Note that long sessions of positive feelings should be fine; it is mainly the negative feelings that require interchanging.
Finally, calm, serene clips (e.g. from nature) are used to help the subject experience the feeling of contentment. Music can also be a very important element in these clips, in choosing among similar stimuli of the same category. In this manner, in designing the experiment, we can take into account some general guidelines in selecting our material, which would then be further personalized based on the subject.

Special attention should be drawn to the fact that the user's input is very important for all these selections. For example, someone's specific sensitivities (or lack thereof) and general beliefs on certain categories could guide the stimuli selection appropriately. For example, if someone perceives a certain category of clips aimed at provoking fear as funny, these clips should clearly be avoided when targeting fear in the next sessions. With respect to the fully personalized session, someone close to the subject should pick the stimuli, in order for the subject not to be exposed to the material beforehand and to fully experience the emotions during the session. The stimuli can be personal, by e.g. including family videos not seen for a while, or they can be merely personalized, by selecting material of special significance to the subject which is not personal (e.g. scenes from their favorite comedy).

The stimulus pool should be subject to a few fundamental rules that would make it applicable to most people. In this work, emphasis is placed on finding effective YouTube video clips that would stimulate the targeted emotions, whereas no hard limits are imposed on the total duration of each emotion/session, as long as it falls within a reasonable window (60-70 min), to facilitate the clip selection. If we have more data for some emotions, we can take out clips or sessions that received lower rankings.
On the other hand, if some sessions receive low scores, more data can be collected to replace those sessions/emotions as the experiment progresses. Ideally, the subject's baseline will only be constructed from (parts of) sessions that actually provoked the desired emotions, and any unreliable data will be taken out.

The length of the clips is subjective; however, in general, the clips should be neither too long nor too short. Clips of 5-10 min are usually ideal, except in the case where the clips are too strong, in which case ~3 min should suffice to evoke the targeted emotion. In the case of very short clips, compilations should be used. We used 10-15 min compilations of short clips for the positive feelings mostly, as compilations that consist of short entertaining or awe-inspiring clips hold one's focus and stimulate curiosity about what comes next. Longer clips lose in the intensity of the provoked emotion.

In general, each session should contain 2-3 emotions, with smooth transitions. It is also reasonable to expect that most people would prefer to transition from negative to positive feelings and close on a good note. Experiencing the negative emotions is harder; therefore sessions with predominantly negative emotions should alternate between emotions and possibly be shorter in total duration. Full sessions of one positive emotion (e.g. joy) may not be problematic, as long as they keep the subject's interest. In this particular experiment, we found that the subject had a preference for short clips that escalate. Additionally, the most provocative clips got the best rankings, and clips that included music had a great effect on the subject. Finally, the fully personalized session got consistently high rankings.

3.b Stimuli Ranking

The user should give their feedback at the end of each complete session, to avoid loss of focus. The interviewing process is kept simple and clear.
The main feedback the user provides is stimuli ranking: each clip is ranked on a scale of 1-10, 1 being the lowest and 10 the highest score, with respect to its effectiveness in provoking the targeted emotion. For example, if one clip is aimed at provoking sadness/anger and the user ranks it with 10, this translates to the clip being very efficient in making the subject feel really sad or angry. Other reporting techniques are also available and will be considered in the future; however, direct ranking helps keep the reporting simple and effective. In addition, the subject can provide feedback on which emotion(s) the clip actually provoked, in case they are different from the desired one. The user can additionally report on the length of each clip and which parts in particular were most effective (for the longer ones). This feedback is used to select future clips that would stimulate the desired emotions in the subject more efficiently.

3.c Experiment Protocol

In this section, we briefly describe the experiment's protocol. A general protocol is established, with a certain degree of personalization, as will be explained shortly. The protocol consists of nine sessions in total: eight sessions of ~60 min each and one session which is fully personalized, with 10-15 clips/session on average. The order of stimulated emotions changes in each session. The order of emotions in each session is of importance; abrupt transitions (e.g. from joy to sadness) were avoided as much as possible, to facilitate the actual emotional experience of the subject. The last session is personalized, based on the subject's preferences; the material itself does not have to be personal (though it could be), but it has to reflect the subject's ideas/tastes on what they think is most likely to provoke the feeling in question.
The number of emotions in each hour-long session should be limited to 3; otherwise the intensity of the stimulated emotion may be compromised. It is also noteworthy that in this experiment 5-min resting periods were used to separate different emotions. This should be adjustable based on the user's feedback; here, it was found that 3 min should suffice.

The first eight sessions are based on clips drawn from the general material and selected according to the user's feedback. Each of the eight sessions contains 2-3 emotions and lasts ~60-70 min on average. The sessions start with ~5 min of rest (calmness), and the emotions of each session are also separated by ~5 min of rest. Therefore, each session contains ~15-25 min of distinct emotions, in most cases. The duration of the clips varies, with the majority being ~3-12 min; however, we also included compilations of very short clips (~1 min) with total duration ~40 min. The compilations of short clips target the positive feelings (joy and awe/reverence), as the negative emotions need longer time (clips) to build up. A typical session contains 2-3 emotions, each of which consists of 2-4 clips, with each emotion separated from the next by a period of rest.

The last session is fully personalized and consists of ~10-min personalized clip selections, as provided by a person close to the subject. The order of emotions is as follows: sadness/anger → fear → disgust → awe/reverence → contentment → joy/amusement, separated by 3-min resting periods. This order was chosen to allow for as smooth transitions as possible and to close on a good note, and it was found to be effective. The personalized session can be broken into 2 sessions, if the number of emotions is found to pose a problem. A breakdown of the experiment can be seen in Table I.
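The timing rules above (an opening rest period, 2-3 emotion blocks of several clips each, rest between blocks) can be sketched as a small helper. This is purely illustrative; `build_session` is a hypothetical name, and the sketch only mirrors the layout stated in the protocol:

```python
def build_session(emotion_blocks, rest_min=5):
    """Lay out one session timeline.

    emotion_blocks: list of (emotion, [clip durations in minutes]).
    Returns (timeline of (label, start_min, end_min), total duration in min).
    """
    timeline = [("rest", 0, rest_min)]  # sessions start with rest (calmness)
    t = rest_min
    for emotion, clips in emotion_blocks:
        for dur in clips:
            timeline.append((emotion, t, t + dur))
            t += dur
        timeline.append(("rest", t, t + rest_min))  # rest separates emotions
        t += rest_min
    return timeline, t
```

For example, a session with a 3-clip fear block and a 2-clip joy block yields an alternating clip/rest timeline whose total stays within the ~60-70 min window when clip durations follow the guidelines above.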
In total, the experiment took ~9.1 hrs, with 3.8 hrs of positive (awe/reverence, joy/amusement, contentment) and 3.2 hrs of negative (fear, disgust, sadness/anger) emotions, the remainder being periods of calmness (rest in between). We have a total of 1027 instances, using a window of w=32 samples at a sampling frequency of fs=1 Hz, after undersampling. Note that the HRV/BR signals we use were already extracted at a much higher frequency. In terms of duration, we collected 192 min (3.2 hrs) of negative emotions and 228 min (3.8 hrs) of positive emotions. The remaining 2.1 hrs were resting periods (neutral), which are not included in the classification, because they included transition periods between emotions, as well as some motion artifacts. We should also note that the highest-ranked data of the fourth emotion (disgust) were not recorded, due to hardware failure, which is the reason for the relatively low number of instances of this emotion.

This controlled-setup experiment is a long experiment (most controlled-setup experiments are ~1 hr), and it was demanding on two levels: it required a lot of carefully selected material, and it was emotionally draining, due to the very strong material, as well as the focus and effort required of the subject. On the other hand, it is a reliable way of setting the ground truth with a limited number of inputs for real-life scenarios, and its shortcomings can be overcome with a smart automated system.

Table I: Summary of Experiment

Label      Symbol  Frequency (instances)  Duration (min)
Rest       0       240                    128.0
Fear       1       129                    68.8
Sad/Anger  2       122                    65.1
Awe/Rev    3       158                    84.3
Disgust    4       109                    58.1
Joy/Amus   5       149                    79.5
Content    6       120                    64.0

It is also noteworthy that the particular mood of the day matters and cannot be fully removed with signal normalization.
Ideally, the sessions should be provided in the beginning of the day, when the subject is alert, responsive and able to focus. As the day progresses, it is easier to lose focus and respond less to the positive feelings, and small distractions can be detrimental for the experiment. Further quantification of how the particular mood (as related to the time of day) affects the results is beyond the scope of this study and would make the experiment longer and more demanding. It is better to shorten the experiment by creating an effective adaptive selection system and offering the sessions in the beginning of the day.

3.d Protocol Extension for Automated Personalized Selection

The main target of this study is to set guidelines for establishing the ground truth for personalized models in emotion recognition problems. The user rankings and feedback are used as guidelines to select the material of future sessions. E.g., if the targeted emotion is fear and the subject finds supernatural clips funny, then supernatural clips should be excluded from future sessions of the controlled set-up experiment, to get an accurate baseline for this emotion. This adaptive material selection is novel and, even though it was presently done manually, an automated system can be built to adjust the material selection and re-define the future sessions of the experiment in an automated fashion, e.g. by using clustering to group stimuli (clips) based on keywords and recommend material for the specific user based on their feedback so far. A prerequisite for this automated smart session/experiment-generating system would be the existence of a large pool of emotion-stimulating material. The selection mechanism should make its recommendations from a large pool of initial stimuli and adaptively generate the most effective subset, to establish the baseline for the subject's emotional response.
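As one possible shape for such a selection mechanism (purely illustrative; in this work the step was done manually, and all names below are hypothetical), unseen clips can be scored by the average ranking that their keywords earned in earlier sessions, so that, e.g., a low-ranked "supernatural" fear clip pushes similar unseen clips down the list:

```python
from collections import defaultdict

def recommend_clips(catalog, feedback, target, n=3):
    """Pick unseen clips for the target emotion.

    catalog: dict clip_id -> {"emotion": str, "keywords": set of str}
    feedback: list of (clip_id, rank on the 1-10 scale) from past sessions
    Scores each candidate by the mean ranking its keywords received so far.
    """
    kw_ranks = defaultdict(list)
    seen = set()
    for clip_id, rank in feedback:
        seen.add(clip_id)
        for kw in catalog[clip_id]["keywords"]:
            kw_ranks[kw].append(rank)

    def score(clip_id):
        kws = catalog[clip_id]["keywords"]
        known = [sum(kw_ranks[k]) / len(kw_ranks[k]) for k in kws if k in kw_ranks]
        return sum(known) / len(known) if known else 5.0  # neutral prior

    pool = [c for c, meta in catalog.items()
            if meta["emotion"] == target and c not in seen]
    return sorted(pool, key=score, reverse=True)[:n]
```

A real system would likely add proper clustering and exploration, but even this keyword-averaging rule reproduces the paper's example: once supernatural fear clips score low, they stop being recommended.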
The initial stimuli would be created by a number of different people of different ages and backgrounds. It should also be mentioned that the order of the sessions can change, depending on the particular mood of the subject. However, conducting the experiments at a particular time of day under regular conditions should help with setting unbiased baselines and with generalization.

Based on the same idea of personalization by utilizing the user's feedback, our manual process can be automated (and shortened) via a smart selection system. Once a large pool of stimulating material is formed based on certain ground rules, some of which have been described above, the first sessions can be formed by selecting material according to the subject's responses to an initial questionnaire. The questionnaire should provide general personalized guidelines based on the user's answers regarding their general preferences and ideas on what categories/subjects of the pre-selected material may cause them the emotions in question. The following sessions should be based on the user's feedback and (so-far) rankings. The process should be repeated until a maximum number of iterations has been reached, or until convergence (i.e. good rankings have been collected). This is illustrated in Fig. 2. In the system of Fig. 2, not only are rankings of previous sessions used as guidelines for future choices, but the targeted emotions themselves can also be adjusted based on the user's feedback, the available material and the experiment's progress. Eventually, the smart stimuli selection would make the experiment short (in terms of duration) and accurate, so that the baseline can be used to predict the subject's response in similar settings in real life.

We should note that personalization should not lead to the development of a full recommendation system, given that existing categorizations can be used, e.g. in terms of movie genres, so that the initial selection of stimuli depends on the selection of certain broad categories, based on a (tree-structured) questionnaire, and it should be adaptable over the course of the experiment. Also, pre-ranked clips and (e.g.) YouTube recommendations can be utilized in personalizing the material further. The adaptive model will learn further and be improved based on the user's feedback in real life, but the controlled-setup experiment should provide a good starting point for establishing solid ground-truth baselines.

Fig. 2. Smart stimuli selection system to allow for personalized ground-truth detection. (Stimulus pool and ground rules feed a questionnaire-driven selection algorithm, which generates sessions for the targeted emotions; the user's feedback loops back, and the process repeats until conditions are met.)

4. Data Analysis

In this section, we present the data analysis methodology and discuss in detail the results of the experiment. We also rank the features in terms of their effect on classification accuracy, both for the six-emotion and for the binary setup. Prototype code in Matlab and R was used for this analysis.

4.a Pre-processing

Signal pre-processing is essential to formulate our problem mathematically. In our case, signal pre-processing consists mainly of time-alignment and normalization. We reference all our measurements to a global clock, to allow for a single reference with respect to features and labeling. Additionally, interpolation and extrapolation (to handle missing values) were necessary for uniform reporting. Our final feature report was generated with frequency 1 Hz (after sub-sampling). In the pre-classification stage, the data (features with labels) are randomized after being generated from normalized signals. Signal normalization was necessary to take out the bias of the specific day/time when the session was conducted. Note that some classifiers require that features (not only signals) also be normalized, especially if feature normalization is not included in their (default) implementation.
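The per-session normalization step can be sketched as follows. The paper does not specify the normalization scheme, so z-scoring within each session is an assumption here (and the original prototypes were in Matlab/R, not Python):

```python
import numpy as np

def normalize_per_session(signal, session_ids):
    """Z-score each signal within its own session, so that the offset of the
    particular day/time a session was recorded does not bias the features."""
    signal = np.asarray(signal, dtype=float)
    session_ids = np.asarray(session_ids)
    out = np.empty_like(signal)
    for s in np.unique(session_ids):
        m = session_ids == s
        mu, sd = signal[m].mean(), signal[m].std()
        out[m] = (signal[m] - mu) / sd if sd > 0 else 0.0
    return out
```

After this step, every session has zero mean and unit variance per signal, which is what removes the day/time bias mentioned above (though, as noted later, mood effects are not fully removed by normalization alone).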
Before feature extraction and labeling, synchronization was required for all measured signals. With respect to the GSR data, the signal was filtered with a median filter of order 10, for smoothing and noise removal. Moreover, the ECG data collected from the external device were found to be more accurate than the PPG-recorded ones and were used in the analysis below. Finally, it was observed that the skin temperature followed a specific pattern in each session, which led to a target leak in the system; therefore that signal had to be removed.

4.b Feature Extraction

We extract the features (based on the pre-processed signals, prior to undersampling) shown in Table II. A window of 32 sec is applied beforehand, which, after some investigation, was found to be efficient for algorithm robustness and generalization purposes. This will be illustrated in the Results section.

Table II: Features

Signal  Feature
HRV     Mean
HRV     St. Deviation
BR      Mean
BR      St. Deviation
HRP     Mean
HRP     St. Deviation
BR      Sum of Sq.
GSR     Sum of Sq.
HRV     Mean of Diff.
HRV     St. Dev. of Diff.
GSR     Mean
GSR     St. Deviation
SKT     Mean
HRV     Mean of Diff. Sq.
HRV     Std. of Diff. Sq.
HR      Mean
HR      St. Deviation

The features consist of the mean and standard deviation of the main collected signals, namely HR and its derivatives HRV and HRP, as well as BR and GSR. The mean of SKT, the mean and standard deviation of the squares of successive differences of HRV, and the sum of squares (power) of BR and GSR are also extracted. In total, 17 time-domain features are extracted, which are easy to implement in wearable devices.

4.c Results

We use random forests as our main classifier and present the results for both the six-emotion setup and the binary setup, which refers to positive-vs-negative emotion classification.
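The windowed extraction can be illustrated for a single signal as follows (a Python sketch; the paper's prototypes were in Matlab and R, and the full pipeline computes the 17 features of Table II across HR, HRV, HRP, BR, GSR and SKT, not this single-signal subset):

```python
import numpy as np

WINDOW = 32  # samples, at the 1 Hz reporting rate used in the paper

def window_features(sig, w=WINDOW):
    """Per-window time-domain features in the spirit of Table II: mean,
    standard deviation, sum of squares, and the mean/std of the successive
    differences and of their squares."""
    feats = []
    for i in range(len(sig) // w):
        x = np.asarray(sig[i * w:(i + 1) * w], dtype=float)
        d = np.diff(x)  # successive differences
        feats.append([x.mean(), x.std(), np.sum(x ** 2),
                      d.mean(), d.std(), (d ** 2).mean(), (d ** 2).std()])
    return np.array(feats)
```

All of these are simple running statistics over a fixed window, which is what makes them cheap to compute on a wearable device.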
Random forest was found to be the most effective classifier among the ones we tried. We also present a ranking of the relative importance of the features.

4.c.i Random Forest Classification

In this section, we present the classification results with random forests (RF), an ensemble classifier. The results are presented for both the binary and the six-emotion setup in terms of the prediction error, which is the out-of-bag (OOB) error in the case of RF. Throughout our analysis, we report the best observed OOB error using RF. However, it is noteworthy that we did not observe any significant OOB variability (~3%) after multiple runs from different seeds.

We start our analysis by modifying our initial setup to exclude SKT as input. The reasoning behind this is that, even though SKT ranked high on variable importance, its behavior was similar regardless of the session (increasing with time). This cannot be meaningful, as each session contains different emotions (in different order), and it can introduce bias if patterns are not clear enough. Therefore, to prevent a target leak and unrealistically optimistic results, we excluded SKT from our analysis. The effect is clear in Table III, where the OOB error is reported with and without SKT: we get ~36% OOB error excluding the SKT information, whereas we get only ~27% OOB error if we include it, for the 6-emotion problem. An improvement of ~7% is also observed for the binary case. We note that the negative emotions refer to fear, sadness/anger and disgust, whereas the positive emotions refer to awe/reverence, joy/amusement and contentment.
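The OOB error used throughout this section comes for free with bagging: each tree is evaluated on the samples left out of its bootstrap sample. A minimal sketch with scikit-learn is shown below; the data are synthetic stand-ins (the study's sensor data are not public), so the shapes, labels, and the injected class signal are all illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the (features, emotion-label) matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))            # 16 features (SKT excluded)
y = rng.integers(1, 7, size=600)          # emotion labels 1..6
X[np.arange(600), y - 1] += 2.0           # inject a per-class signal

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1.0 - rf.oob_score_           # out-of-bag error estimate
print(f"OOB error: {oob_error:.3f}")
```

Because each sample is predicted only by trees that did not see it, the OOB error behaves like a built-in cross-validation estimate, which is why the paper compares it directly to 10-fold cross-validation errors later on.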
Table III: Effect of SKT on results from all sessions

  Label                  OOB (%) with SKT   OOB (%) w/o SKT
  Fear                         32.6              45.7
  Sadness/Anger                25.4              35.2
  Awe/Reverence                23.4              31.6
  Disgust                      30.3              35.8
  Joy/Amusement                20.8              28.9
  Contentment                  29.1              43.3
  Average OOB                  26.6              36.3
  Negative Emotions            19.7              28.1
  Positive Emotions            11.7              18.0
  Average OOB (binary)         15.4              22.6

To improve the prediction accuracy, we repeat the analysis taking into account only the highest-ranking parts, that is, the parts of the sessions that were most effective in provoking the targeted emotions. Table IV shows the break-down of the highest-ranking parts of the experiment. In particular, we required a minimum rank of 7 on the 1-10 scale. We end up with ~50 min/emotion and a total of ~300 min (~5 hrs) of recorded data. It is noteworthy that all the clips from the last session, which was fully personalized, were included in the analysis, as these clips received consistently high rankings.

Table IV: Experiment summary using only clips that received high rankings

  Label           Symbol   Frequency   Prev. Frequency
  Rest/Excluded     0         465            240
  Fear              1         105            129
  Sad/Anger         2          84            122
  Awe/Rev           3         124            158
  Disgust           4          50            109
  Joy/Amus          5         125            149
  Content           6          74            120

By taking into account only the highest-ranked clips, we achieve ~7.6% OOB improvement for the 6-emotion setup and ~7.3% for the binary one, as we can see in Table V and Table VI, respectively. The performance could be further improved by tightening the ranking threshold, i.e., by taking into account only clips ranked 9 or 10. This is not feasible here, because the material would be severely limited; however, it emphasizes the importance of appropriate personalized stimuli selection. From the confusion matrices in these tables, we see that the largest error corresponds to the least populated classes, which is expected.
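The rank-based filtering behind Table IV amounts to a simple mask over the labeled samples. A toy sketch follows; the arrays are invented for illustration (each sample carries its clip's user ranking, and only samples from clips ranked at least 7 are retained):

```python
import numpy as np

# Toy data: per-sample emotion labels and the user ranking (1-10)
# of the clip each sample came from. Values are illustrative only.
labels    = np.array([1, 1, 2, 3, 3, 4, 5, 5, 6, 6])
clip_rank = np.array([9, 6, 8, 7, 5, 9, 10, 4, 8, 7])

keep = clip_rank >= 7                 # the threshold used in the paper
labels_kept = labels[keep]
print(labels_kept.tolist())           # [1, 2, 3, 4, 5, 6, 6]
print(int(keep.sum()), "of", len(labels), "samples retained")  # 7 of 10
```

In the actual experiment, the filtered-out samples fold into the Rest/Excluded class (symbol 0), which is why that class grows from 240 to 465 in Table IV while every emotion class shrinks.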
We note that for the cases of disgust and contentment we have underrepresented data, which leads to greater error, as random forests favor the majority classes. An additional factor was the subject's unconscious movement as a response to strong stimuli. We should also mention that we only take into consideration the main part of each emotion, excluding any initial transitional samples, i.e., the first 10-30 sec of each emotion.

Table V: Results on highest-ranked parts of experiment for the 6-emotion problem (rows: real label; columns: predicted label)

  Label                1    2    3    4    5    6   Class. Error (%)
  Fear (1)            71    8   10    0   13    3        32.3
  Sadness/Anger (2)   15   54    4    2    8    1        35.7
  Awe/Reverence (3)    9    2   96    6    9    2        22.6
  Disgust (4)          2    1    9   30    5    3        40.0
  Joy/Amusement (5)    8    1    8    1  105    2        16.0
  Contentment (6)      9    2   15    1    2   45        39.2
  Average OOB                                            28.7

Table VI: Results on highest-ranked parts of experiment for the binary setup (rows: real label; columns: predicted label)

  Real \ Pred    Negative   Positive   Class. Error (%)
  Negative          190        49           20.5
  Positive           37       286           11.5
  Average OOB                               15.3

Finally, with respect to the selection of window size, Table VII summarizes the RF classification results for different window sizes. Clear deterioration was observed outside this range. From these results, we conclude that a window of 32 samples minimizes the prediction error.

Table VII: Effect of window size on prediction error

  Window Size   Average OOB (%), 6 emotions   Average OOB (%), binary
      16                   30.2                        15.5
      32                   28.7                        15.3
      64                   36.0                        18.0

4.c.ii Comparison of Classifiers

Random forests outperform the other classifiers we tried at predicting whether the subject experiences positive or negative feelings (binary setup). A summary of the results for the binary problem can be seen in Table VIII. The reported results are generated using 10-fold cross-validation; the mean cross-validation error is reported.
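The per-class errors in these confusion matrices are just one minus the diagonal divided by each row sum. The short check below reproduces the binary-setup numbers of Table VI:

```python
import numpy as np

# Confusion matrix from Table VI (rows = real, columns = predicted).
cm = np.array([[190, 49],    # real negative
               [37, 286]])   # real positive

# Per-class error: misclassified fraction of each real class.
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)
print(np.round(100 * per_class_error, 1))  # [20.5 11.5]
```

The same computation over the rows of Table V recovers its class errors (e.g., fear: 34 of 105 samples misclassified, i.e. 32.3%).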
As the data are shuffled and different samples are left out each time to be used as test sets, we can compare the cross-validation error directly to the OOB error, since both are unbiased estimates of the prediction error. The other classifiers we report on are decision trees, artificial neural networks (ANN) and support vector machines (SVM). The ANN consists of one hidden layer with 10 neurons. It was trained with the back-propagation algorithm and radial basis functions. The reported results for the SVM classifier are derived after the model parameters were optimized, which led to choosing gamma equal to 0.1 and a cost equal to 10. We see that random forests outperform the other classifiers on the binary setup, followed by the SVM classifier. Additionally, for the 6-emotion problem, the prediction error for SVMs was found to be 33.6%, compared to 28.7% for RF.

Table VIII: Comparison of classifiers on the binary setup

  Classifier                  Prediction Error (%)
  Decision Tree                      23.6
  Artificial Neural Network          25.0
  Support Vector Machines            19.0
  Random Forests                     15.3

4.d Variable Importance

The most important features are derived based on the Gini index, which is a measure of node impurity, with RF classification. The results are shown in Table IX.

Table IX: Variable importance on the 6-emotion and binary setups

  Signal   Feature                6-emotion Ranking   Binary Ranking
  HRV      Mean                           1                 1
  HR       St. Deviation                  2                 7
  GSR      St. Deviation                  3                 5
  HR       Mean                           4                 4
  HRP      Mean                           5                 3
  GSR      Mean                           6                 9
  HRP      St. Deviation                  7                 2
  GSR      Sum of Sq.                     8                 8
  BR       St. Deviation                  9                 6
  BR       Mean                          10                11
  HRV      Std. of Diff. Sq.             11                12
  HRV      Mean of Diff. Sq.             12                13
  BR       Sum of Sq.                    13                10
  HRV      St. Deviation                 14                14
  HRV      Mean of Diff.                 15                15
  HRV      St. Dev. of Diff.             16                16

The features are ranked from most to least significant, with 1 denoting the most significant feature and 16 the least significant one. From Table IX, we can see that the HR-related features dominate as the most important ones.
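The Gini-based ranking of Table IX corresponds to the mean impurity-decrease importance exposed by random forest implementations. A minimal scikit-learn sketch is shown below; the data are synthetic (only features 0 and 2 carry signal by construction), so the recovered ranking is illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where, by construction, feature 0 matters most
# and feature 2 second; the rest are pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Gini (mean impurity decrease) importances, most important first.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking)  # feature 0 ranked first
```

One known caveat of Gini importance is a bias toward high-cardinality features; with the homogeneous continuous features used here, the ranking is a reasonable guide.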
For both the 6-emotion and the binary setup, the mean of HRV is the most important feature. Mean HRV is followed by the standard deviation of HR and the standard deviation of HRP as the second most important feature for the 6-emotion and the binary problem, respectively. Out of the top five significant features, for both setups, four are HR-based.

5. Summary

The entire pipeline of a controlled-setup experiment was presented, to set the baseline and train a model at a personalized level for future prediction of the subject's emotional state in real life under static conditions. The main components of this flow are the following: triggers (material selection), protocol, data collection and data analysis (pre-processing, feature extraction, classification and prediction). Random forest was found to be the most accurate classifier among the four we trained. We have very promising results on the six-emotion experiment, using only time-domain features that are easy to implement in wearables. The accuracy for the binary case is ~85%. Furthermore, the biosignals were ranked in terms of their effect on classification accuracy, and the heart rate and its derivatives were found to be the most significant ones. Finally, an adaptive stimuli-selection and controlled-setup session-generation system was proposed to allow for greater accuracy in baseline estimation for emotion recognition. Starting from a large initial stimuli selection, the system could potentially generate the optimal subset of stimuli material, based on the user's feedback, to establish emotion baselines at a personalized level.

6. Acknowledgements

The authors would like to thank R. Kumar and his team for providing the internal physiological sensor recording device, as well as G. Takos for his help and support.
