MeetSense: A Lightweight Framework for Group Identification using Smartphones

Snigdha Das, Soumyajit Chatterjee, Sandip Chakraborty, Bivas Mitra

Abstract—In an organization, individuals prefer to form various formal and informal groups for mutual interactions. Therefore, ubiquitous identification of such groups and understanding their dynamics are important to monitor the activities, behaviours and well-being of the individuals. In this paper, we develop a lightweight, yet near-accurate, methodology, called MeetSense, to identify various interacting groups based on collective sensing through users' smartphones. Group detection from sensor signals is not straightforward because users in proximity may not always be under the same group. Therefore, we use acoustic context extracted from audio signals to infer the interaction pattern among the subjects in proximity. We have developed an unsupervised and lightweight mechanism for user group detection by taking cues from network science and measuring the cohesivity of the detected groups in terms of modularity. Taking modularity into consideration, MeetSense can efficiently eliminate incorrect groups, as well as adapt the mechanism depending on the role played by the proximity and the acoustic context in a specific scenario. The proposed method has been implemented and tested under many real-life scenarios in an academic institute environment, and we observe that MeetSense can identify user groups with close to 90% accuracy even in a noisy environment.

Index Terms—group detection, smartphone, collective sensing.

1 INTRODUCTION

Workplace meetings and team formation among individuals are key factors behind organizational efficiency. In organizations and institutions, people formally as well as sporadically meet, interact and form groups for various purposes, including information sharing [1], teaching and learning [2], problem solving and decision making [3], brainstorming [4], and socialization [5]. Tracking the dynamics of group formation facilitates various utilities; for instance, organizational leaders may prefer to monitor the formation of teams, which benefits the overall efficiency and activeness of the organization [6], [7]; course instructors in flipped classrooms [8] in an academic environment may like to know how the students form groups among themselves to solve assignments and exercises. Unlike regular, pre-scheduled team meetings, the formation of sporadic and instantaneous groups (often observed in office breaks, conferences etc.) makes the problem challenging. On the other hand, the increasing availability of sensor-rich smartphones provides a unique opportunity for collecting wide sensor information in a seamless manner. In this backdrop, we investigate the potential of smartphones to develop a lightweight ubiquitous system for identifying and monitoring group formation. Notably, in this paper, we primarily concentrate on meeting groups whose co-located members occasionally interact with each other. In this line, we capture different types of real-life meeting group scenarios, such as outdoor roadside informal meetings, informal outdoor cafe meets, formal and informal laboratory meetings, and classroom interactions, as shown in Figure 1.

S. Das, S. Chatterjee, S. Chakraborty and B. Mitra are with the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India. E-mail: snigdhadas@sit.iitkgp.ernet.in, sjituit@gmail.com, sandipc@cse.iitkgp.ac.in, bivas@cse.iitkgp.ac.in
Identification of a meeting group primarily relies on the location proximity [9], [10] of the group members, which (apparently) can be conceptualized as a localization problem [11], [12]. In that direction, prior art explores three modalities – GPS, Bluetooth, and WiFi – for identifying location similarity in both supervised [10] and unsupervised [13] manners. In our context of group detection, vanilla localization-based solutions demand high accuracy, which pushes the system towards complex processing. Notably, location proximity alone is insufficient to correctly discriminate and identify meeting groups. For instance, consider a large conference hall where multiple meeting groups form simultaneously; here, members of different groups may exhibit location similarity among themselves, which makes group detection challenging. Close inspection reveals that, despite similar location proximity, the context [14], [15], [16], [17] of the members participating in individual meeting groups plays a critical role in identifying groups; for instance, all the members of a specific group in the conference hall share a substantial amount of contextual similarity (room illumination, ambience noise, member interactions, magnetic fluctuations) [11], [16], [17]. However, identifying suitable contextual information, which is computationally lightweight and also carries the signature of a meeting group, is an important problem.

We propose acoustic context, extracted from the audio signals received by individual smartphones, as a key context indicator. To compute the acoustic context, one can apply standard Mel-frequency Cepstral Coefficients (MFCC) [18] on the recorded audio signals for speaker identification by measuring tone and pitch. However, this solution comes with multiple challenges. (a) The process of using MFCC usually follows a supervised approach, which needs individuals' pre-trained information. (b) MFCC is computationally expensive, which makes it inappropriate for developing a lightweight system. (c) The MFCC technique is quite sensitive to noise, and hence is most suitable for a unidirectional microphone with a stereo channel. Unfortunately, most commercial smartphones are equipped with omnidirectional microphones, which makes them prone to noise, corrupting the speaker identification process¹.

[Fig. 1: Setup of Different Meeting Group Scenarios]

In Next2Me [19], Baker and Efstratiou attempted to detect social groups considering WiFi and sound fingerprints. First, WiFi signal strengths are used for detecting the co-located population; next, this filtered population is fed to the audio module for finding out the social groups. The audio module considers the top n frequencies of all the co-located individuals and computes the pairwise similarities. However, in a real-life environment, obtaining the actual top n frequencies is challenging, and a little variation in the selection of frequencies exerts a huge impact on the similarity computation.
Additionally, the audio signals captured on different smartphones can be time drifted, even if a single speaker acts as the audio source, since the clocks of different devices may not be time synchronized, and the subjects (devices) may be at different distances from the speaker. Once the co-located population has been identified and the audio-based context information has been extracted, state-of-the-art techniques perform naive component analysis [19] and community detection [20] to identify social groups. However, in most of these cases, the quality (cohesivity) of the discovered groups has been overlooked, which leads to the detection of incorrect communities (false positives).

In this paper, we develop MeetSense, a smartphone-driven ubiquitous platform for automatic detection of meeting groups. The proposed method is lightweight and unsupervised, hence equipped to detect instantaneously formed groups without any pre-training. First, we determine the co-located population using standard localization techniques [10], [19]. In our implementation, we relied on WiFi-based proximity; nevertheless, this can be extended to Bluetooth- and GPS-based techniques as well. The crux of the proposed method is the computation of the acoustic context of the identified co-located population, which is based on the following key intuition. Interaction between the participants of a meeting group switches from one speaker to another, where, at any time, there exists (mostly) one dominating speaker. Hence, the power of the dominating tone (say α1) captured by the smartphones (subjects²) in one group (say G6) is significantly different from the power of the dominating tone (α4) captured by the devices of another group G7. If both groups G6 and G7 are closely located, then all the devices might capture both tones with varying power. However, for the devices in group G6, the power of the dominating tone α1 should be higher than that of α4, whereas exactly the opposite is likely to happen for group G7. By discriminating the power of the dominating tone, one can differentiate the acoustic context of the members of different groups. Finally, leveraging the proximity of the co-located population and their acoustic context, we propose MeetSense, a community-driven group detection model. The advantages of this model are manifold. 1) The model is unsupervised and lightweight. 2) The model can perform group detection even in the absence of proximity indicators (say, WiFi etc.). 3) We take cues from network science and measure the cohesivity of the detected groups with the help of modularity. Taking modularity into consideration, MeetSense can efficiently eliminate incorrect groups (reduce false positives), as well as adapt the algorithm depending on the role played by the proximity and the acoustic context in a specific scenario. For instance, in the case of a noisy environment, MeetSense combines both modalities to identify meeting groups.

¹ https://www.fabathome.org/best-smartphone-microphone/ (last accessed Apr 12, 2018)
² In this paper, we use the term 'subject' to indicate a participant, a member or a smartphone, interchangeably.

The organization of the paper is as follows. In Section 2, we formally define the meeting group and state the problem of group detection.
We introduce two primary indicators and the related literature in those contexts – (a) proximity, to identify the co-located population, and (b) the audio signal, to compute the acoustic context. We also conduct pilot experiments to highlight the challenges in extracting the acoustic context amidst noisy environments, device heterogeneity etc. In Section 3, we propose a novel sound signal processing approach that can capture the acoustic context even with the low-power microphones available in smartphones. In Section 4, we develop MeetSense, a group detection model leveraging community detection algorithms. We have implemented MeetSense in an academic campus scenario and captured several kinds of groups, such as classroom teaching, lab meetings, seminars, cafeteria gatherings, and outdoor meetings. In Section 5, we show that MeetSense can detect such groups with more than 90% accuracy while incurring low computation overhead compared to state-of-the-art group identification methods.

2 PROBLEM DEFINITION AND BACKGROUND STUDY

In this section, first, we define the meeting group and state the problem of group detection in the context of smartphone-based sensing. Next, we identify the primary indicators (say, proximity, acoustic context etc.) facilitating group detection and explore their potential in the light of state-of-the-art endeavours. Finally, we concentrate on the acoustic context and conduct a pilot study to highlight the challenges in group detection from audio signatures.

2.1 Problem Statement

We start with the definition of a Meeting Group and subsequently state the problem of group detection.

Definition 1 (Meeting Group). Given a population of subjects U, we define a meeting group G^[t,t+T] ⊆ U for the time period [t, t+T] as the collection of co-located individuals {u_i ∈ U} sharing similar context.

For instance, two subjects u_i and u_j participate in a group G^[t,t+T] iff u_i and u_j are located in close proximity and share similar context for the time duration [t, t+T] [13]. In this paper, we hypothesize that the sound profile observed by the group members defines the acoustic context of a group. For instance, sensing the verbal interactions between the participating members can discriminate one meeting group from another. Notably, in the acoustic context, we only concentrate on the tone of the human voice signal, and categorically disregard the content of the interaction to preserve privacy.

Consider that each subject u_i ∈ U carries a smartphone equipped with various sensors. We collect the sensor log s_i from each subject u_i and populate the data in a central repository X. The sensor log s_i comprises the location information p_i and the acoustic information α_i. The location information may come from various indoor and outdoor localization techniques based on GPS, wireless signals etc. [12], [17], [21], [22], [23], [24]; similarly, the acoustic information can be extracted from the audio signals captured by the smartphones [25], [26], [27], [28]. We aim to discover the meeting groups G^[t,t+T] formed during the period [t, t+T] from the logged sensor repository X.
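To make the problem input concrete, the following minimal sketch shows one way the per-subject log s_i and the repository X could be represented. The class and field names are illustrative assumptions, not an API from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SensorLog:
    # s_i: the per-subject log, holding location samples p_i and audio samples alpha_i.
    subject_id: str
    location: List[dict]   # p_i: one entry per scan, e.g. {"bssid": ..., "rssi": ...}
    audio: List[float]     # alpha_i: raw audio samples captured by the smartphone

@dataclass
class Repository:
    # X: the central repository collecting the logs of all subjects in U.
    logs: Dict[str, SensorLog] = field(default_factory=dict)

    def add(self, log: SensorLog) -> None:
        self.logs[log.subject_id] = log
```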
2.2 Primary Indicators and Respective Prior Art

The definition of the meeting group mainly relies on two kinds of sensing modalities – (a) location proximity and (b) acoustic context. We explore the recent attempts in this direction and highlight their potential and challenges in group detection.

2.2.1 Location Proximity

Localizing the subjects within their proximity is the initial step towards group identification. In this line, the past literature explores mainly three modalities – GPS, Bluetooth, and WiFi. GPS [23] is an important modality (albeit energy-hungry) for localization and for detecting the population within proximity. Although GPS performs well in outdoor environments, its accuracy falls sharply in indoor environments due to interruption of the signal [21]. On the other side, a Bluetooth-based study [29] is one of the earliest attempts at localization in indoor environments. However, Bluetooth scanning is power-hungry [10]. Moreover, many Android smartphones (starting from version 4.4) have only partial support for Bluetooth Low-Energy (BLE) and are capable of detecting only other BLE devices [16]. Additionally, the Bluetooth signal as a medium of information is considered unreliable and noisy.

Recently, attempts have been made to detect proximity from WiFi fingerprints [10]. WiFi-based localization is considered a promising indicator for identifying the population in proximity. WiFi consumes significantly less power than Bluetooth and GPS. Although BLE appears as an alternative to WiFi in terms of power consumption, BLE suffers from data loss and fluctuations with increasing distance [30]. Furthermore, WiFi can work in any environment, irrespective of whether the device location is indoor or outdoor. Each modality has its positive and negative aspects in the context of localization. Hence, the selection of modalities is highly dependent on the application for which the proximity is computed. In [10], the authors developed a supervised learning approach for person-to-person proximity detection using WiFi fingerprints, such as access point (AP) coverage and signal strength measurements. On the other hand, the authors in [19] developed an unsupervised learning based approach for proximity detection using a novel WiFi-based metric computed using the Manhattan distance, which is the average of the pairwise signal strength differences among the APs from which the subject receives signals. Any of these existing mechanisms can be used for proximity detection for group identification. Once a set of subjects is detected to be in proximity, the contextual similarity further characterizes the subset that forms a meeting group.
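As a rough sketch of the unsupervised metric of [19], the pairwise distance can be computed as the average absolute RSSI difference over the access points that both subjects hear. Skipping APs heard by only one subject is our assumption; the treatment of such APs is not spelled out here.

```python
def wifi_manhattan_distance(rssi_a: dict, rssi_b: dict) -> float:
    """Average pairwise signal-strength difference over shared APs.

    rssi_a and rssi_b map a BSSID to its RSSI (in dBm) for one WiFi scan.
    """
    shared = rssi_a.keys() & rssi_b.keys()
    if not shared:
        return float("inf")  # no common AP: treat the pair as far apart
    return sum(abs(rssi_a[ap] - rssi_b[ap]) for ap in shared) / len(shared)

# Example: two nearby subjects see similar RSSI from the same APs.
a = {"ap1": -52, "ap2": -60, "ap3": -71}
b = {"ap1": -55, "ap2": -58, "ap4": -80}
print(wifi_manhattan_distance(a, b))  # (3 + 2) / 2 = 2.5
```

Subjects whose pairwise distance stays below a threshold are then taken as co-located.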
2.2.2 Acoustic Context

The microphone is an important indicator for identifying meeting group members. Participants, in general, avoid talking simultaneously in a meeting, although there can be a small overlap when the discussion switches from one speaker to another (utterance duration). Therefore, the voice properties, such as the pitch and tone of the current speaker in a group, dominate in the audio signals captured by the individual subjects in that group [31]. Pitch defines the perceived fundamental frequency of the sound [32], whereas tone is the variation or thickness of the pitch, indicating the quality of the sound [33]. Figure 2 explains the intuition behind using human voice characteristics for group identification.

[Fig. 2: Impact of Audio Signals in Group Detection – two speakers from two different groups talk simultaneously]

The blue audio signal dominates for the subjects of G6, whereas the red signal dominates for the subjects of G7. Therefore, human voice characteristics (aka acoustic context) may show a strong feature similarity if the subjects belong to the same group.

Audio pitch and tone extraction from the human voice signal is a well-studied problem in the literature [32], [33]. YIN [32] is a simple time-domain pitch calculation algorithm that is used in many existing applications, such as counting a crowd from human voice signals [33]. Although pitch is a good indicator for speaker identification, pitch alone fails to differentiate the relative distance of the speakers from the other subjects, since it only concentrates on the central frequency of the audio signal. Therefore, tone information needs to be extracted along with the pitch, and Mel-frequency Cepstral Coefficients (MFCC) based techniques [18] with a Gaussian Mixture Model (GMM) [34] can be applied for this purpose. However, in smartphones, the use of unidirectional microphones with a stereo channel is rare. A smartphone may capture the voice signals from the subjects of other nearby groups, apart from the primary speaker of its own group, as shown in Figure 2. Further, the environmental noise generated by a variety of external sources may pollute the recorded audio signal. For instance, the humming noise generated by ACs and other machines (indoor) and vehicles (outdoor) may corrupt the collected audio signals and make group detection challenging. Additionally, for instantaneous group detection, there is no a priori knowledge of the group members' tone information. Therefore, group identification from MFCC-based audio processing along with supervised techniques may pose additional challenges, although such techniques work well for applications like crowd counting [33]. In the following, we explore these challenges through observations from a pilot experiment.

2.3 Pilot Study: Revealing Challenges with Audio Signals

We launched a pilot study to examine the potential of the acoustic context in identifying meeting groups amidst challenging scenarios. We developed an Android app for collecting the audio signal log from the smartphones for conducting the pilot experiment. We recruited six subjects for this experiment for two weeks, installed the app on their smartphones, and instructed them to occasionally form pre-designed meeting groups (multiple times) for around T ≥ 15 minutes. Subjects were asked to record the group formation instances manually for validation. A detailed overview of the groups formed in this study is listed in Table 1.

TABLE 1: Pilot Study Minutiae

Group ID   Member IDs        Location           Primary Speaker
G1         U1, U3, U4        SMR Lab            U4
G2         U2, U5, U6        Class C-118        U2
G3         U1, U2, U3, U4    Cafeteria          U3
G4         U2, U5            SMR Lab            U5
G5         U1, U2, U3, U4    Way to Cafeteria   U4
G6         U1, U2, U3        Outdoor Roadside   U1
G7         U4, U5, U6        Outdoor Roadside   U4

During the experiments, we captured 16-bit audio signals at a 44.1 kHz sampling rate from the in-built microphones of the smartphones.
Notably, while forming these controlled groups, we paid special attention to incorporating two fundamental challenges that may affect group identification from audio signals – (1) device heterogeneity and (2) noisy environments. To capture device heterogeneity, we used smartphones of four different makes and models – 2 Moto X, 1 Moto G 2nd Gen, 2 OnePlus 3, 1 Samsung Note5. The noise environments can be summarized in the following generic scenarios. (a) Low-noise environment: This includes the formation of meeting groups where the surrounding environmental noise is low (audio amplitude less than 40 dB [35]). Subjects forming groups inside classrooms (group G2), during conferences, in formal gatherings, or while moving inside the laboratory (G1, G4) can be included in this scenario. (b) Noisy environment: In this scenario, subjects form groups in highly noisy environments (audio amplitude more than 40 dB). Subjects forming groups in cafeterias (group G3), in informal gatherings, marketplaces, or outdoor environments (G5) fall in this scenario.

2.3.1 Observations

We first normalize the amplitude of the audio signal and then compute the audio pressure as an indicator of the volume of the audio signal received by individual devices. We concentrate on the meeting group G3, where subject U3 primarily speaks while the other group members mostly remain silent. In Figure 3, we plot the audio pressure received by the individual subjects (U1, U2, U3, U4) of group G3. We observe that subjects (U1, U2, U4), participating in the same group G3, exhibit similar audio pressure. However, the audio pressure of U3 deviates from the rest of the subjects since this user is moving while speaking; therefore, the values are slightly different from those of the other group members.

The scenario gets compounded when we consider two groups G1 (U1, U3, U4) and G4 (U2, U5), which form inside the same laboratory during a similar time period. Figure 4 highlights the fact that, although the audio pressures of the subjects (U1, U4) participating in the same group (G1) exhibit similar behaviour, the same indicator fails to show clear discrimination between subjects (say U1, U5) participating in two different groups (G1, G4). For further investigation, we move to frequency domain based analysis. Importantly, Figure 5 shows that the frequency components present in subject U1 exhibit contrasting behaviour from subject U5, who belongs to a different group. However, the frequency components of subjects U1 and U4, present in the same meeting group (G1), also exhibit an (albeit minor) difference due to environmental noise, posing a new challenge.

[Fig. 3: Impact of Audio Pressure among the Subjects of the Same Group: U1, U2, U3, U4 in G3 – audio pressure (dB) vs. time (sec)]
[Fig. 4: Impact of Audio Pressure among the Subjects of Different Groups: U1, U4 in G1 and U5 in G4 – audio pressure (dB) vs. time (sec)]
[Fig. 5: Deviations of Frequencies in Groups – magnitude vs. frequency (Hz) for U1, U4, U5; similarity(U5,U1) = −0.2634, similarity(U1,U4) = −0.2042, similarity(U4,U5) = 0.2131]

Last but not least, Figure 6 demonstrates the variation of amplitude (the raw version of audio pressure) due to device heterogeneity. Smartphone microphones use an automatic gain control (AGC) circuit, which exaggerates the variation of amplitude for the same audio signal captured through different devices. In group G2, the subjects U2 and U5 carry devices of the same make and model, whereas another subject U6 carries a different build. Although all three of them belong to the same meeting group, Figure 6b exhibits a dissimilarity in amplitude for the subjects U2 and U6 (nevertheless, similarity can be observed for subjects U2 and U5, Figure 6a).

[Fig. 6: Audio amplitude in devices of same and different builds – (a) same build: U2 and U5 in G2, similarity = 0.1529; (b) different build: U2 and U6 in G2, similarity = 0.0733]

The detailed comparison of the devices is listed in Table 2, in terms of the similarity index for the same audio signal. We observe that the similarity index is sometimes quite low for two devices of two different makes and models.

TABLE 2: Audio Amplitude Similarity in Heterogeneous Devices

Device          MotoX     Samsung Note5   OnePlus3   MotoG
MotoX           0.3247    0.1178          -0.2781    -0.1138
Samsung Note5   0.1178    0.2977          0.0896     0.1287
OnePlus3        -0.2781   0.0896          0.5671     -0.0653
MotoG           -0.1138   0.1287          -0.0653    0.5822
2.3.2 Lessons Learnt

We observe that audio signals provide us with a good indicator for capturing the acoustic context of a group. However, due to the omnidirectional nature of smartphone microphones, significant audio pressure from the speakers of nearby groups also gets captured, as we observe in Figure 4 (formation of two groups G1 and G4 in the same lab). For MFCC-based techniques, the separability of the cepstral coefficients gets distorted in the presence of multiple speakers and environmental noise [31], [36], [37]. Hence, MFCC may be able to capture the presence of two speakers in the vicinity for groups G1 and G4, but it will fail to classify the subjects based on their primary speakers. Device heterogeneity exaggerates this problem further. In summary, although the microphone provides an important signature for uncovering group membership, it is inadequate in its current form for handling adverse scenarios.

3 MEASURING ACOUSTIC CONTEXT OF MEETING GROUPS

The pilot study demonstrates that audio signals are rich sources for capturing the context of a meeting group. However, we also comprehend that naive audio processing techniques are not sufficient to extract reliable information under various complicated scenarios. In this section, we develop a methodology for computing the acoustic context from smartphone audio signals, as shown in Figure 7. The different steps of this procedure are as follows.

[Fig. 7: Audio Signal Processing Flowchart]

3.1 Preprocessing of Vanilla Audio Signals

For audio-based feature extraction, we collect the audio data α_i from all the subjects u_i at a sampling rate of f_s, continuously for t units with an interval of t̂ units of time, where t and t̂ are specified by the application developers. We first extract the human speech signal between 300 Hz and 3400 Hz using Butterworth bandpass filtering. The human speech signals captured from the different smartphones are used for further processing.
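A minimal sketch of this preprocessing step with SciPy is shown below; the filter order is an assumption, as the paper names only the filter type and the 300–3400 Hz band.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_speech_band(signal: np.ndarray, fs: int, order: int = 4) -> np.ndarray:
    """Keep the 300-3400 Hz human-speech band with a Butterworth band-pass."""
    sos = butter(order, [300.0, 3400.0], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering avoids adding extra delay to the signal.
    return sosfiltfilt(sos, signal)
```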
3.1.1 Time Drift Adjustment

The audio signals captured from different devices can be time drifted, even if a single speaker acts as the audio source. There are broadly two reasons for this – (a) the clocks of different devices may not be time synchronized, and (b) the subjects may be at different distances from the speaker, which introduces propagation lag into the signals. Figure 8a shows the time-drifted signals, captured from two different subjects, with a single speaker. To compare two signals, we need to place both signals in the same time reference frame; therefore, eliminating the time drift is an important task for audio processing.

[Fig. 8: Computation of Time Drift – (a) two time-drifted signals from the same audio source; (b) correlation obtained by shifting the second signal with reference to the first; the peak appears at a drift of 11.8209 sec]

Although some existing studies have developed techniques for the time drift adjustment of audio signals captured by hand-held devices [38], they employ smoothing techniques over the raw signal and thus tend to lose the physical properties of the signal, such as its tone and pitch. However, such physical properties are important for capturing the nature of the human voice, which is essential for extracting the acoustic context. Therefore, we apply a simple technique in this paper to mitigate the time drift introduced into the signals coming from a single audio source.

To eliminate the time drift, we use the concept of a similarity measure between the signals in the time domain. Consider an audio signal coming from a single source, but captured at two different devices. Ideally, when both signals are placed in the same reference frame in the time domain (the drift is zero), the similarity between them should be maximum. To measure the similarity between the signals, we use statistical correlation. The procedure works as follows. We fix one signal as the reference, then shift the other signal by one time unit at every step and measure the correlation between the signals. Figure 8b plots the correlation between the signals shown in Figure 8a with respect to the amount of time shift applied to the second signal, while considering the first signal as the reference. A positive time shift indicates that the second signal has been shifted forwards in time, and a negative time shift indicates that it has been shifted backwards. In this example, we observe that the correlation is maximum when the second signal is shifted by 11.8209 seconds, indicating that the drift is 11.8209 seconds. Once the drift is calculated, one signal is shifted to make the drift zero with respect to the reference signal.
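The drift search can be sketched with NumPy as below; bounding the search window (here ±60 s, matching the range plotted in Figure 8b) is our assumption.

```python
import numpy as np

def estimate_drift_s(ref: np.ndarray, sig: np.ndarray, fs: int,
                     max_drift_s: float = 60.0) -> float:
    """Drift of `sig` relative to `ref`: the lag that maximizes correlation.

    Brute-force correlation is O(N^2); an FFT-based correlation would be
    faster for long recordings.
    """
    corr = np.correlate(sig, ref, mode="full")
    lags = np.arange(-len(ref) + 1, len(sig))        # lag per correlation entry
    mask = np.abs(lags) <= int(max_drift_s * fs)     # bound the search window
    return lags[mask][np.argmax(corr[mask])] / fs

def align(ref: np.ndarray, sig: np.ndarray, fs: int) -> np.ndarray:
    """Shift `sig` so that its estimated drift w.r.t. `ref` becomes zero."""
    lag = int(estimate_drift_s(ref, sig, fs) * fs)
    if lag > 0:                                      # sig lags ref: drop samples
        return sig[lag:]
    return np.concatenate([np.zeros(-lag), sig])     # sig leads ref: pad the front
```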
3.2 Audio Tone Extraction

The audio tones of the members of a meeting group should exhibit high similarity among themselves, whereas tone dissimilarity indicates different groups. Hence, the pairwise tone similarity between group members is an important property for determining the acoustic context of that group. Considering that group participants, in general, avoid talking simultaneously in a meeting, intuitively there exists one dominating tone that gets captured by the smartphones of all the subjects in a meeting group. Audio tone extraction is a well-studied problem [32], [33], and Mel-frequency cepstral coefficients (MFCC) based techniques [32] are widely applied for tone extraction from audio signals. However, we face the following challenges while extracting the tone from smartphone audio signals. (a) Smartphone microphones are omnidirectional, and they capture environmental noise along with the human voice. Moreover, the devices are heterogeneous. MFCC fails in the face of a noisy environment and device heterogeneity [36]. (b) Device heterogeneity is in general handled through various energy-based normalization techniques [31], [33], [37]; however, these fail for smartphone microphones due to the nonlinear gain of the amplifiers and the presence of automatic gain control (AGC) circuits¹. (c) As MFCC is mostly followed by a supervised scheme, the approach may require voice samples from each user for the correct identification of pitch and tone. However, most of the members of instantaneous groups are new and appear for the first time. Hence, pre-training is impossible in most scenarios.

In this paper, we apply the Complex Cepstrum (CCEP) to perform tone extraction. The CCEP of a signal S is computed as

$\mathrm{CCEP}(S) = \mathrm{IFT}\left(\log(\mathrm{FT}(S)) + j\,2\pi\ell\right)$   (1)

where FT(.) is the Fourier transform, IFT(.) is the inverse Fourier transform, and $j = \sqrt{-1}$. The imaginary part uses the complex logarithmic function, and ℓ is an integer required to properly unwrap the imaginary part of the complex log function. CCEP uses the log compression of the power spectrum, and is therefore less affected by environmental noise, the nonlinearity of the amplifiers, and the effect of AGC circuits. To extract the tone from an audio signal, we segment the signal into one-second units and then compute the CCEP for each audio segment. The CCEP of segment t̄ from subject u_i is denoted as cep^t̄_i. These CCEP values for all the subjects are then used for the tone similarity measure, as discussed next.
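Equation (1) maps directly onto an FFT-based sketch: the 2πℓ term is exactly the phase unwrapping that keeps the complex logarithm continuous. The epsilon guarding the log is a numerical-stability assumption.

```python
import numpy as np

def complex_cepstrum(segment: np.ndarray) -> np.ndarray:
    """CCEP(S) = IFT(log|FT(S)| + j * unwrapped phase), cf. Eq. (1)."""
    spectrum = np.fft.fft(segment)
    log_spec = np.log(np.abs(spectrum) + 1e-12) + 1j * np.unwrap(np.angle(spectrum))
    return np.fft.ifft(log_spec).real

def ccep_per_second(signal: np.ndarray, fs: int) -> list:
    """Split a (filtered, drift-aligned) signal into one-second segments and
    compute the CCEP of each, yielding cep_i for every segment of subject u_i."""
    return [complex_cepstrum(signal[s:s + fs])
            for s in range(0, len(signal) - fs + 1, fs)]
```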
3.3 Computing the Acoustic Context Feature C^t_ij

We compute the cross-correlation between the CCEP values to measure tone similarity, whereby high and low cross-correlations indicate similar and dissimilar acoustic contexts between a pair of subjects, respectively. Let cep^t̄_i and cep^t̄_j denote the CCEPs of segment t̄ from two different subjects u_i and u_j. We compute the segment-wise cross-correlation between cep^t̄_i and cep^t̄_j as cor^t̄_ij, and then average it over the time span t. This audio cepstrum cross-correlation is used as the acoustic context similarity C^t_ij for the subject pair u_i and u_j during time duration t.

To demonstrate the role of tone similarity in computing the acoustic context of meeting groups, we consider the groups G6 and G7 formed at the outdoor roadside, with the group scenario shown in Figure 2. Subjects U1, U2, U3 and U4, U5, U6 form groups G6 and G7, respectively. In the first scenario, subject U1 in group G6 is the dominating speaker, whereas the members of group G7 are mostly silent. Figure 9a shows the pairwise context similarity between each individual subject and the dominating speaker U1 (of group G6). We observe that the subjects in group G6 (say, U2 and U3) exhibit higher similarity with the dominating speaker U1 than the members of group G7 (say, subjects U5 and U6). Next, we consider two dominating speakers U1 and U4 in the two respective groups G6 and G7. We compute the context similarity between every pair of (non-speaking) subjects. In Figure 9b we observe that members belonging to the same group (say U2 and U3 in group G6, and U5 and U6 in group G7) exhibit higher context similarity than non-group pairs. Precisely, the context similarity between intragroup members is substantially higher (close to 1.0) than that between intergroup members (close to 0.0). This result indicates that the acoustic context within a single group exhibits substantial similarity.

[Fig. 9: Audio Similarity Variation in Different Scenarios – (a) single speaker, multiple groups: similarity with speaker U1; (b) multiple speakers, multiple groups: similarity among non-speakers; same-group vs. different-group pairs]

We also investigate the impact of the location of a subject on her acoustic context. We set up two groups G6 and G7, 18 m apart in the outdoor environment, with two dominating speakers, namely U1 and U4 respectively. We consider one moving subject U2, initially inside G6 (from Table 1), who walks towards group G7 (it takes around 66 sec to reach group G7 from G6). Figure 10 shows the variation in the acoustic context similarity between subject U2 and each dominating speaker over time. We observe that, when the subject is in group G6, the context similarity between U1 and the subject is high compared with that of U4. The reverse behaviour is noticed at the end of the experiment, when the subject reaches G7. However, the context is confusing while the subject is located in the middle of both groups.

[Fig. 10: Audio cross-correlation variation over time with the moving subject U2, for speakers U1 and U4]
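A sketch of the feature computation described at the start of this section follows; using the zero-lag normalized (Pearson-style) correlation per segment is an assumption, as the text says only "cross-correlation" between the cepstra.

```python
import numpy as np

def segment_correlation(cep_i: np.ndarray, cep_j: np.ndarray) -> float:
    """cor_ij for one segment: normalized correlation of two CCEP vectors."""
    a, b = cep_i - cep_i.mean(), cep_j - cep_j.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(np.dot(a, b) / denom)

def acoustic_context_similarity(ceps_i: list, ceps_j: list) -> float:
    """C_ij: segment-wise cepstrum correlation averaged over the time span t."""
    scores = [segment_correlation(a, b) for a, b in zip(ceps_i, ceps_j)]
    return float(np.mean(scores)) if scores else 0.0
```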
Finally, we perform an overall evaluation considering all the meeting groups formed in the pilot study. In Figure 11, we plot the empirical cumulative distribution (ECDF) of the acoustic context similarity over pairs of subjects. We observe that when the subject pair is in the same group, the acoustic context similarity is high. On the contrary, when one subject is outside the group, the context similarity exhibits a low value. This establishes the fact that tone similarity, computed from the cepstrum cross-correlation, reflects the acoustic context of a group, and more importantly, that the acoustic context within a single group exhibits substantial similarity.

[Fig. 11: ECDF of Cepstrum Cross-Correlation – same group vs. outside group]

The aforesaid methodology of extracting the acoustic context from the smartphone microphone has three broad advantages. First, as the feature is extracted from the dominating tone in an audio signal (captured by the cepstrum), it is sufficient if at least one subject in a group talks for a duration. The method can thus accommodate subjects who belong to a group but do not prefer to interact (consider a conference presentation), as long as some other subject from that group speaks. Second, the methodology does not violate the privacy of individual subjects. We extract only the tone information from the captured audio signal and do not leverage the exact conversation. Third, the proposed model is unsupervised. Therefore, there is no need to pre-train on the tone information of the group members.

4 DESIGN OF MeetSense

MeetSense is an unsupervised framework for detecting meeting groups based on subject proximity and acoustic context. Figure 12 shows the flow outline of the MeetSense framework. First, the sensor logger module records the microphone data along with the proximity indicators, followed by the pairwise feature computation. Leveraging the proximity and acoustic context features, we develop MeetSense for meeting group detection.

[Fig. 12: MeetSense Model (F^t: Proximity Feature, C^t: Acoustic Context Feature)]

4.1 Feature Construction

In this module, we compute the acoustic context similarity C^t_ij between the subject pair u_i and u_j at time t from the collected microphone log; the detailed computation procedure for the acoustic context feature has already been explained in Section 3. The pairwise proximity similarity feature F^t_ij at time t can be extracted using any of the state-of-the-art techniques [10], [19]. Now, considering a subject pair u_i and u_j, we need to compute the aggregated features F_ij and C_ij for the time duration T. The simplest way of aggregating is to compute the means F_ij and C_ij from the instantaneous features F^t_ij and C^t_ij over the time duration T. However, the signal samples collected from the proximity indicator and the microphone may suffer from sensitivity issues and fluctuations. Additionally, the audio signals can get muffled by obstacles and clothing materials, and are also impacted by interference. Evidently, the polluted mean features, computed from all the feature points F^t_ij and C^t_ij for the time duration T, may not provide a clear indication of the feature similarities between the subject pair u_i and u_j. Hence, we compute the refined mean features F_ij and C_ij by eliminating the low-frequency noise component. Here, we split all the feature points (say, for the proximity feature F^t_ij) into two clusters (via k-means clustering, with p_value < 0.05). Eliminating the minor cluster as the noisy component, we compute the mean F_ij from the feature points in the major cluster (see Algorithm 1). However, in case p_value ≥ 0.05, we compute the mean F_ij considering all the feature points as a single cluster. Similarly, we compute the refined mean acoustic context feature C_ij using Algorithm 1.

Algorithm 1 Feature Construction
Inputs: F_ij, T
Output: F_ij (refined mean)
1:  [Cl_1, Cl_2] ← kmeans((F_ij, T), 2)
2:  if p_value > 0.05 then                ▷ Single Cluster Scenario
3:      F_ij ← mean of all F^t_ij
4:  else
5:      if |Cl_1| > |Cl_2| then           ▷ Major Cluster Cl_1 Scenario
6:          F_ij ← mean of {F^t_ij ∈ Cl_1}
7:      else                              ▷ Major Cluster Cl_2 Scenario
8:          F_ij ← mean of {F^t_ij ∈ Cl_2}
9:      end if
10: end if
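Algorithm 1 can be sketched as follows. The paper does not name the test behind its p-value, so a Welch t-test between the two k-means clusters is used here as an assumption; scikit-learn and SciPy stand in for the clustering and the test.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans

def refined_mean(points: np.ndarray, alpha: float = 0.05) -> float:
    """Refined mean feature: drop the minor (noisy) k-means cluster and
    average the major one; fall back to the plain mean when the two
    clusters are statistically indistinguishable."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(points.reshape(-1, 1))
    cl1, cl2 = points[labels == 0], points[labels == 1]
    if min(len(cl1), len(cl2)) < 2:
        return float(points.mean())          # degenerate split
    p_value = ttest_ind(cl1, cl2, equal_var=False).pvalue
    if p_value >= alpha:                     # single-cluster scenario
        return float(points.mean())
    major = cl1 if len(cl1) > len(cl2) else cl2
    return float(major.mean())               # mean over the major cluster
```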
4.2 Model Development

Finally, leveraging the aforementioned features, we develop an unsupervised model for meeting group detection. The model executes Part A (Algorithm 3) or Part B (Algorithm 4) depending on the availability of location information (WiFi, Bluetooth, GPS etc.). If a subject possesses location information, the model exploits both the proximity and the acoustic features in Part A. Otherwise, the model relies only on the acoustic information in Part B. The outcome of the model is the union of the groups detected by the two individual parts. The outline of the model is described in Algorithm 2.

Algorithm 2 MeetSense: Group Detection Algorithm
Inputs: u_i(p^t_i, α^t_i) ∀u_i ∈ U, δ_p1, δ_p2, δ_α
Output: G^T
1:  if p^t_i ≠ ∅ then
2:      U_o ← U_o ∪ u_i
3:  end if
4:  U_no ← U − U_o
5:  if U_o ≠ ∅ then                       ▷ Proximity Available Scenario
6:      G^T_o ← ProximityAvailable(u_i(p^t_i, α^t_i) ∀u_i ∈ U_o, δ_p1, δ_p2, δ_α)
7:  end if
8:  if U_no ≠ ∅ then                      ▷ Proximity Not Available Scenario
9:      G^T_no ← ProximityNotAvailable(u_i(α^t_i) ∀u_i ∈ U_no, δ_α)
10: end if
11: G^T ← G^T_o ∪ G^T_no
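In Python-like form, the dispatch logic of Algorithm 2 reduces to the routing below; `proximity_available` and `proximity_not_available` are hypothetical stand-ins for Algorithms 3 and 4.

```python
def meetsense(subjects, delta_p1, delta_p2, delta_alpha):
    """Route subjects with location data through Part A and the rest through
    Part B, then return the union of the detected groups."""
    u_o = [u for u in subjects if u.location]        # proximity available
    u_no = [u for u in subjects if not u.location]   # audio only
    groups = []
    if u_o:
        groups += proximity_available(u_o, delta_p1, delta_p2, delta_alpha)  # Part A
    if u_no:
        groups += proximity_not_available(u_no, delta_alpha)                 # Part B
    return groups
```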
Part A: In this part, we first attempt to extract the cluster of co-located subjects based only on the pairwise proximity similarity. If we identify a highly cohesive cluster G^T_p based on proximity alone, we consider G^T_p a potential meeting group and execute the second step. In the second step, we leverage only the acoustic context features to detect the meeting group(s) G^T_α from the identified proximity clusters G^T_p. However, if we identify only a moderately cohesive cluster G^T_p from the proximity features, the model abandons the cluster G^T_p, considering proximity a critical albeit weak signal, and moves to the third step. In the third step, we combine the proximity and acoustic context similarity features together to detect a cohesive cluster G^T_α on the complete proximity-available population U_o. If G^T_α exhibits high cohesivity, we assert the cluster G^T_α as the meeting group. Poor cohesivity in any step rejects the existence of any group in the population. The overall procedure is illustrated in Algorithm 3. In the following, we introduce cohesive cluster detection, which is the core of the model.

Detection of cohesive clusters: Consider a weighted network CG(U, E), where u_i ∈ U is a subject and {e_ij, w_e_ij} ∈ E denotes the weighted link e_ij between the subject pair u_i and u_j. We apply a community detection algorithm [39] on CG to obtain a partition K = {K_1, K_2, ..., K_m} of the population U. Essentially, the community detection algorithm partitions the network into communities, ensuring dense connections within a community and sparser connections between communities. We consider each detected community K_i as a cluster of the population U. The cohesivity of the partition K can be measured with the modularity index M as

$M = \frac{1}{4\varphi} \sum_{ij} \left( w_{e_{ij}} - \frac{\rho_i \rho_j}{2\varphi} \right) f(\sigma_i, \sigma_j)$

which reflects the fraction of the links that fall within a given community, compared to the expected fraction if links were distributed at random [39]. Here φ, ρ_i, σ_i, and f(.) represent the sum of all the edge weights in the network, the sum of the edge weights attached to node u_i, the community of node u_i, and the delta function, respectively. Notably, the modularity of a weighted fully connected graph becomes zero if all the nodes form a single large community [40]. In this paper, we apply the Walktrap algorithm [20]; however, our methodology is not sensitive to any specific (weighted) community detection algorithm.

Algorithm 3 comprises the following three steps.

Step 1: We construct a complete proximity graph PG(U_o, F), where U_o denotes the complete proximity-available population and {e_ij, F_ij} ∈ F is a link between the subject pair u_i and u_j, weighted by the proximity feature F_ij computed over the time T. We apply the community detection algorithm on the proximity graph PG to discover the cluster G^T_p with modularity M_p. If M_p is above a threshold δ_p1, we consider G^T_p as the candidate meeting group and move to step 2. If M_p falls below a threshold δ_p2, we reject the existence of any meeting group in the population U. Otherwise, we move to step 3.

Step 2: We construct the complete acoustic context graph IG(G^T_p, C), where {e_ij, C_ij} ∈ C links the subject pair u_i and u_j ∈ G^T_p. Essentially, in IG, the link weight C_ij depicts the acoustic context similarity over the time T. Similar to step 1, we apply the community detection on IG to discover the cluster G^T_α with modularity M_α. If M_α is above a threshold δ_α, we confirm G^T_α as the detected meeting groups. Otherwise, we reject the existence of meeting groups in the population U_o.

Step 3: We construct a complete proximity-acoustic context graph MG(U_o, W), where {e_ij, W_ij} ∈ W links the subject pair u_i and u_j ∈ U_o, weighted by W_ij = (1 − w) × F_ij + w × C_ij. Essentially, in MG, the link weight W_ij carries the information from both the acoustic context and the proximity feature. Similar to step 1, we apply the community detection on MG to discover the cluster G^T_w with modularity M_w. If M_w is above a threshold δ_α, we confirm G^T_w as the detected meeting groups. Otherwise, we reject the presence of any group in the population U_o.

Algorithm 3 ProximityAvailable Function
Inputs: u_i(p^t_i, α^t_i) ∀u_i ∈ U_o, δ_p1, δ_p2, δ_α
Output: G^T
1:  Compute F^t_ij, C^t_ij                                   ▷ Feature Generation
2:  F_ij ← FeatureConstruction(F_ij, T) ∀u_i, u_j
3:  (G^T_p, M_p) ← CommunityDetection(U_o, F)
4:  if M_p ≥ δ_p1 then                                       ▷ Proximity Dominating Scenario
5:      C_ij ← FeatureConstruction(C_ij, T) ∀u_i, u_j ∈ G, G ∈ G^T_p
6:      (G^T_α, M_α) ← CommunityDetection(G^T_p, C)
7:      if M_α ≥ δ_α ∀(G_α, M_α) ∈ (G^T_α, M_α) then         ▷ Proximity & Audio Influence Scenario
8:          G^T ← G^T ∪ G_α
9:      else                                                 ▷ Proximity Influence & Audio Insignificance Scenario
10:         Failure
11:     end if
12: else
13:     if M_p < δ_p1 and M_p ≥ δ_p2 then
14:         C_ij ← FeatureConstruction(C_ij, T) ∀u_i, u_j
15:         (G^T_w, M_w) ← CommunityDetection(U_o, (1 − w) × F + w × C) ∀w ∈ [0, 1]   ▷ Weighted Features
16:         (G^T_w, M) ← max(G^T_w, M_w) ∀w ∈ [0, 1]
17:         if M ≥ δ_α then                                  ▷ Proximity Confused & Audio Influence Scenario
18:             G^T ← G^T_w
19:         else                                             ▷ Proximity Confused & Audio Insignificance Scenario
20:             Failure
21:         end if
22:     else                                                 ▷ Proximity Insignificance Scenario
23:         Failure
24:     end if
25: end if
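The cohesive-cluster step can be sketched with python-igraph, whose Walktrap implementation and weighted modularity match the quantities used above; clipping negative similarities to zero is our assumption, since Walktrap expects non-negative edge weights.

```python
import igraph as ig

def cohesive_clusters(names: list, weight: dict):
    """Community detection plus modularity on a complete weighted graph.

    `weight` maps an index pair (i, j), i < j, to the edge weight
    (e.g. the aggregated F_ij, C_ij, or the combined W_ij).
    """
    n = len(names)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    w = [max(weight[(i, j)], 0.0) for i, j in edges]   # clip negative weights
    g = ig.Graph(n=n, edges=edges)
    clustering = g.community_walktrap(weights=w).as_clustering()
    modularity = g.modularity(clustering.membership, weights=w)
    groups = [[names[v] for v in community] for community in clustering]
    # The caller accepts the groups only if modularity clears its threshold.
    return groups, modularity
```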
Part B: Due to the unavailability of location data, in this part we rely completely on the acoustic context similarity between the subjects. We first construct a complete acoustic context graph IG(U_no, C), where {e_ij, C_ij} ∈ C links the subject pair u_i and u_j ∈ U_no. Essentially, in IG, the link weight C_ij carries the information of the acoustic context feature over the time T. Similar to Part A, we apply the community detection on IG to discover the clusters G^T_α with modularity M_α. If M_α is above a threshold δ_α, we confirm G^T_α as the detected meeting groups. Otherwise, we reject the existence of meeting groups in the population U_no. The outline of the overall mechanism is portrayed in Algorithm 4.

Algorithm 4 ProximityNotAvailable Function
Inputs: u_i(α^t_i) ∀u_i ∈ U_no, δ_α
Output: G^T
1: Compute C^t_ij                                ▷ Feature Generation
2: C_ij ← FeatureConstruction(C_ij, T) ∀u_i, u_j
3: (G^T_α, M_α) ← CommunityDetection(U_no, C)
4: if M_α ≥ δ_α then                              ▷ Audio Influence Scenario
5:     G^T ← G^T_α
6: else                                           ▷ Audio Insignificance Scenario
7:     Failure
8: end if

5 PERFORMANCE EVALUATION

We evaluate MeetSense by developing a smartphone-based application and deploying it over the IIT Kharagpur campus, which spreads over 8.5 square kilometres and consists of administrative blocks and approximately 30 academic departments, along with campus residential, hostel and market areas. We first discuss the implementation of MeetSense, followed by the field study and a performance comparison with different baselines.

5.1 Field Study and Data Collection

The data collection for MeetSense is done through an Android app, DataGatherer, which has been launched on the smartphones of 40 subjects consisting of undergraduate and postgraduate students, summer interns, research scholars and faculty members of the institute. In our implementation of MeetSense, we have considered T ≥ 15 mins; that means if an interaction continues for at least 15 minutes, we consider it a group. Nevertheless, this is an application-specific tunable parameter. We have used different models of smartphones, with costs per phone ranging from approximately USD 150 to USD 700. We primarily gather WiFi (BSSID and signal strength) and audio data from the smartphones through DataGatherer, which sends the data to a central server. The WiFi data is used for proximity detection based on existing methodologies [10], [19], and then the audio data is used to detect the groups among the subjects in proximity. The app scans the available WiFi access points once per minute, and continuous audio signals are tracked at a sampling rate of 44.1 kHz for a one-minute span followed by an interval of three minutes.
Moreover, we have discarded the details of access points having a signal strength of less than −80 dBm, which is the minimum signal strength for basic connectivity³. The data has been collected for approximately six months. We collect the ground truth meeting group information from the participants for validation. For ground truth data collection, a questionnaire app periodically probes the participants regarding (a) the start time of the meeting, (b) the end time of the meeting, (c) the meeting venue, and (d) the details of the other participants of the meeting. In cases where a participant misses providing the ground truth information, we validate the detected meeting groups with the participants by forwarding an email every two hours of each day.

³ https://support.metageek.com/hc/en-us/articles/201955754-Understanding-WiFi-Signal-Strength (Accessed on Apr 12, 2018)

Based on the data collected in the field study, we identify seven typical meeting group scenarios, which occurred repeatedly (at least once a week) during the six-month field study. These situations have been highlighted keeping in mind the critical conditions of group formation developed in the pilot study (Section 2); thus they reflect realistic meeting group scenarios with high probability. We evaluate the performance of MeetSense and compare it with other baselines considering these typical scenarios, as well as other scenarios observed from the collected data. These scenarios are as follows.

S1 (Indoor: Two groups in neighbouring rooms): 3 subjects attend a lecture in classroom C-119, and 2 subjects have another meeting in the FV Lab opposite C-119 at the same instance of time.

S2 (Indoor: Three groups in different rooms of the same department): 4 subjects interact in the faculty office on the second floor, 2 subjects are in a meeting at the departmental library opposite that faculty office, and 2 subjects are in another meeting at the SMR Lab on the first floor.

S3 (Outdoor: Cafeteria interactions): Two different groups at the cafeteria, one with 3 subjects in front of the cafeteria and another with 3 subjects at the back of the cafeteria.

S4 (Indoor: Large single group): 7 subjects attend a presentation in the departmental conference room.

S5 (Indoor: Two different groups in a large lab): 3 subjects meet at cubicle K-1 and another 3 subjects meet at cubicle K-10 of the SMR Lab.

S6 (Indoor: Two roaming groups): 3 subjects together and 2 subjects together roam around the corridor of the department and move from one room to another, forming two non-static groups.

S7 (Outdoor: Two roaming groups): 5 subjects together and 2 subjects together roam within the campus maintaining a certain distance from each other, forming two non-static groups.

5.2 Preprocessing – Proximity Computation for Group Detection

As we discussed earlier, group detection first requires finding the subjects in proximity, and in MeetSense we utilize the existing proximity detection mechanisms that have been well studied in the literature. We focus on the following two approaches of proximity detection based on WiFi data.
(a) Supervised learning with WiFi-based proximity sensing (SLWP) [10]: Sapiezynski et al. developed a WiFi access point based supervised proximity detection mechanism, where Bluetooth data is considered as the ground truth. In this approach, a set of WiFi-based features is computed, such as overlapping access points and signal strength measurements from different access points, and then a support vector machine (SVM) is used to classify whether two subjects are in proximity or not.

(b) Next2Me [19]: This is a smartphone-based system for capturing social interactions among users in close proximity. Next2Me uses WiFi signal information for measuring the pairwise co-located Manhattan distance between the users, and then a threshold over the distance function is used to find the subjects in proximity. It can be noted that this is an unsupervised approach.

5.3 Baselines for Audio Based Interaction Detection

We have analyzed the performance of MeetSense against the following baselines, which utilize audio signals for acoustic context detection. We use proximity followed by audio-based acoustic context to detect the various meeting groups.

(a) Next2Me [19]: After determining the subjects in proximity, Next2Me utilizes the Jaccard similarity over the top n audio frequencies to capture the audio fingerprints of the various subjects. Finally, it generates social communities by applying the Louvain community detection algorithm.

(b) AudioMatch [13]: Casagranda et al. implemented a smartphone-based group detection system based on the joint usage of GPS and audio fingerprints. The GPS information is used for filtering out nearby devices. On top of the GPS-based clusters, the audio module is executed for identifying the groups. AudioMatch uses the short-time Fourier transform (STFT) with an overlapping Hamming window. Finally, it computes the Hamming distance between pairs of devices for detecting the nearby pairs.

Next, we discuss the experimental procedure combining the WiFi-based proximity detection and the audio-based acoustic context detection.

5.4 Experimental Procedure

For comparing the performance of MeetSense under different environments with different baselines, we consider the following combinations of proximity (P) and acoustic context (I) detection mechanisms. It can be noted that MeetSense primarily focuses on capturing the acoustic context, whereas the proximity module is borrowed from existing methodologies. The different combinations of proximity and acoustic context detection mechanisms used in our experiments are as follows.

I. SLWP (P) + Next2Me (I): In this arrangement, we extract the pairwise proximity information from SLWP, and the outcome is directly fed to the Next2Me audio module for group detection.

II. SLWP (P) + MeetSense (I): This arrangement uses the pairwise proximity information from SLWP. Then, the MeetSense audio-centric context detection is applied on top of the proximity outcome.

III. SLWP (P) + AudioMatch (I): In this arrangement, we apply SLWP for pairwise proximity detection. After that, AudioMatch is applied to the outcome of the proximity clusters for detecting the pairwise acoustic context from the audio signals. It can be noted that we have not used GPS for proximity detection as used in AudioMatch, since GPS gives a very poor signal in indoor scenarios. However, the audio module is used as-is, and finally, the community detection algorithm is used for group detection.
5.3 Baselines for Audio-Based Interaction Detection

We have analyzed the performance of MeetSense against the following baselines, which utilize audio signals for acoustic context detection. We use proximity detection followed by audio-based acoustic context detection to identify the various meeting groups.

(a) Next2Me [19]: After determining the subjects in proximity, Next2Me utilizes the Jaccard similarity over the top n audio frequencies to capture the audio fingerprints of the subjects. Finally, it generates social communities by applying the Louvain community detection algorithm.

(b) AudioMatch [13]: Casagranda et al. implemented a smartphone-based group detection system based on the joint usage of GPS and audio fingerprints. The GPS information is used for filtering out nearby devices. On top of the GPS-based clusters, the audio module is executed for identifying the groups. AudioMatch uses the short-time Fourier transform (STFT) with an overlapping Hamming window. Finally, it computes the Hamming distance between each pair of devices for detecting the nearby pairs.

Next, we discuss the experimental procedure by combining the WiFi-based proximity detection and the audio-based acoustic context detection.

5.4 Experimental Procedure

To compare the performance of MeetSense in different environments against the different baselines, we consider the following combinations of proximity (P) and acoustic context (I) detection mechanisms. It can be noted that MeetSense primarily focuses on capturing the acoustic context, whereas the proximity module is borrowed from existing methodologies. The combinations used in our experiments are as follows.

I. SLWP (P) + Next2Me (I): In this arrangement, we extract the pairwise proximity information from SLWP, and the outcome is directly fed to the Next2Me audio model for group detection.

II. SLWP (P) + MeetSense (I): This arrangement uses the pairwise proximity information from SLWP. Then, the MeetSense audio-centric context detection is applied on top of the proximity outcome.

III. SLWP (P) + AudioMatch (I): In this arrangement, we apply SLWP for pairwise proximity detection. After that, AudioMatch is applied on the outcome of the proximity clusters for detecting the pairwise acoustic context from the audio signals. It can be noted that we have not used GPS for proximity detection as done in AudioMatch, since GPS gives a very poor signal in indoor scenarios. However, the audio module is used as-is, and finally, the community detection algorithm is used for group detection.

IV. Next2Me (P) + Next2Me (I): This arrangement is analogous to the actual Next2Me system, where both the WiFi-based proximity detection and the audio-based acoustic context detection, as done in Next2Me, are used for group detection.

V. Next2Me (P) + MeetSense (I): In this arrangement, we compute the proximity-based pairwise distance following the Next2Me proximity module. The pairwise similarity is computed by reversing the pairwise distance value. Then, we apply the MeetSense feature construction algorithm (Algorithm 1) followed by the community detection module on the pairwise similarity values. Finally, the MeetSense acoustic context module is employed on top of the proximity outcome.

VI. Next2Me (P) + AudioMatch (I): This arrangement uses the proximity information from Next2Me, like the previous setup. After that, AudioMatch is applied on the outcome of the proximity clusters. The pairwise acoustic context information is finally fed to the community detection algorithm for meeting group detection.

5.5 MeetSense Performance

We first evaluate the overall performance of MeetSense in terms of the $F_1$-Score [41], defined as follows. Let $\Gamma$ and $\Upsilon$ be the sets of meeting groups in the ground truth data and the ones detected by MeetSense, respectively. Then

$F_1^{\kappa\nu} = \frac{2\,|\kappa \cap \nu|}{|\kappa| + |\nu|}$, where $\kappa \in \Gamma$ and $\nu \in \Upsilon$.

This parameter captures the accuracy of the detected group $\nu$ in terms of its membership overlap with the ground truth group $\kappa$ for the meeting duration $T$. Now, to obtain the final accuracy of MeetSense considering all the detected meeting groups, we compute the average $F_1$-Score as

$F_1 = \frac{\sum_{\kappa \in \Gamma,\, \nu \in \Upsilon} F_1^{\kappa\nu}}{|\Upsilon|}$.

Table 3 summarizes the performance of MeetSense in terms of the $F_1$-Score and the modularity ($M$) for the seven representative scenarios, as well as for all the observed scenarios combined. We set the model thresholds ($\delta$) experimentally, based on the best overall performance across scenarios. The modularity index $M$ indicates the cohesiveness of the detected groups; hence, even a detected group with a low $F_1$-Score but high modularity contributes more towards identifying the maximum participants of a meeting group. Although MeetSense performs marginally worse for certain scenarios, such as outdoor mobile groups, due to high environmental noise, we observe that the accuracy is more than 90% in most of the cases.

TABLE 3: Performance Comparison (each cell: F1-Score / Modularity)

       SLWP proximity                                Next2Me proximity
ID     Next2Me        MeetSense      AudioMatch      Next2Me        MeetSense      AudioMatch
S1     1.0000/0.1124  1.0000/0.2879  0.7273/0.0000   1.0000/0.0412  1.0000/0.2907  0.7273/0.0000
S2     0.9000/0.2030  1.0000/0.1760  0.6667/0.0000   0.9000/0.2030  1.0000/0.1760  0.6667/0.0000
S3     0.5333/0.1261  1.0000/0.3642  0.7273/0.0000   0.5333/0.1261  1.0000/0.3642  0.7273/0.0000
S4     0.8326/0.0772  1.0000/0.0000  1.0000/0.0000   0.8326/0.0772  1.0000/0.0000  1.0000/0.0000
S5     0.8571/0.0732  1.0000/0.3801  0.7273/0.0000   0.8571/0.0732  1.0000/0.3801  0.7273/0.0000
S6     1.0000/0.0000  1.0000/0.0000  1.0000/0.0000   1.0000/0.0000  1.0000/0.0000  1.0000/0.0000
S7     0.5833/0.1942  0.6500/0.0976  0.8333/0.0000   0.5833/0.1942  0.6500/0.0976  0.8333/0.0000
ALL    0.7971/0.0866  0.9421/0.2114  0.8212/0.0000   0.7971/0.0826  0.9421/0.2116  0.8212/0.0000
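For concreteness, the scoring defined above can be sketched in a few lines of Python over hypothetical ground truth and detected groups (sets of member identifiers); this is an illustration of the metric, not the evaluation code used in our experiments.

```python
from itertools import product

def pairwise_f1(kappa, nu):
    """F1 (Dice) overlap between a ground-truth group and a detected group."""
    return 2 * len(kappa & nu) / (len(kappa) + len(nu))

def average_f1(ground_truth, detected):
    """Sum the pairwise F1 over all (kappa, nu) pairs and normalize by the
    number of detected groups, following the definition above literally."""
    total = sum(pairwise_f1(k, n) for k, n in product(ground_truth, detected))
    return total / len(detected)

# Hypothetical example: two ground-truth groups, perfectly detected.
gt = [{"u1", "u2", "u3"}, {"u4", "u5"}]
det = [{"u1", "u2", "u3"}, {"u4", "u5"}]
print(average_f1(gt, det))  # 1.0; cross-group pairs contribute zero overlap
```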
5.5.1 Baseline Comparison

As mentioned earlier, we have set up six different arrangements of the proposed as well as existing schemes for investigating their performance in different scenarios. Table 3 compares the performance of MeetSense with AudioMatch and Next2Me under the two different proximity schemes. For the three arrangements that use SLWP as the proximity measure, we observe that MeetSense outperforms the other baselines. Although Next2Me uses audio-based features to capture the social interaction among the subjects (which is similar to the acoustic context of MeetSense), it uses the Jaccard similarity among the top n audio frequencies, which is susceptible to environmental noise. For example, in an outdoor environment, sound frequencies originating from external entities, such as moving vehicles, can fall within the top n frequency components. As a consequence, we observe that although Next2Me manages to perform well in indoor scenarios, it performs poorly in the outdoor environment. On the other side, AudioMatch applies a Hamming distance measure over the logarithmic amplitude of the audio signal to suppress the noise component. Although this scheme works for artificially generated Gaussian noise, it performs poorly in the presence of real environmental noise. The impact is visible in the outdoor scenarios.

Next, we have tested the scenarios with the three acoustic context measurement schemes along with the Next2Me proximity measure. The results are largely similar to those obtained with the SLWP proximity measure, except for the proximity-dominant scenario S1 under MeetSense. This similarity also indicates that the audio features are more dominant than the proximity features for meeting group detection. Moreover, it validates the importance of Algorithm 4 in the overall meeting group detection mechanism.
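The noise susceptibility of the top-n frequency feature discussed above can be illustrated with a small Python sketch, under stated assumptions: two synthetic "speakers" with disjoint harmonics are heard by phones in two different groups, yet a strong shared ambient component (e.g., traffic hum) pushes common bins into both top-n sets and inflates the Jaccard similarity. The tone frequencies and amplitudes are invented for this example; they are not Next2Me's actual parameters.

```python
import numpy as np

def top_n_bins(signal, fs, n=6):
    """Return the n dominant frequency bins (in Hz) of a one-sided FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return set(freqs[np.argsort(spectrum)[-n:]])

def jaccard(a, b):
    return len(a & b) / len(a | b)

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)  # 1-second window -> 1 Hz frequency bins

def tones(freqs, amps):
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

speaker1 = tones([220, 440, 660], [1.0, 0.6, 0.3])  # heard by group 1
speaker2 = tones([310, 620, 930], [1.0, 0.6, 0.3])  # heard by group 2
ambient  = tones([50, 100, 150], [2.0, 1.2, 0.8])   # shared environmental hum

a = top_n_bins(speaker1 + ambient, fs)
b = top_n_bins(speaker2 + ambient, fs)
# Different groups, yet the similarity is well above zero because the
# ambient bins {50, 100, 150} dominate both top-6 sets.
print(jaccard(a, b))  # 3/9 ~ 0.33
```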
5.5.2 Robustness of the Acoustic Context Measure

To investigate the variations in the performance of the different schemes, we report the box plot of the pairwise feature similarity values for the acoustic context in Figure 13. The box plot shows that there are significant mean differences between the various schemes; the medians of the different schemes are shown as red lines.

Fig. 13: Mean difference of the similarity values obtained through the different methodologies (box plots of the pairwise similarity for Next2Me, Next2Me + AudioMatch, Next2Me + MeetSense, SLWP + AudioMatch, SLWP + MeetSense, and SLWP + Next2Me).

Focusing on the upper and lower halves around the median, the results show that MeetSense captures significant variations in the pairwise similarity between the subjects. As we consider multiple meeting group scenarios, this variation of the pairwise similarity between the subjects is expected. It can be noted that the median is biased towards the lower values because the pairwise feature similarity becomes very close to zero whenever the two subjects in a pair are from different groups. However, a wide variation of similarity values greater than 0.1 is observed when both subjects are in the same group. On the contrary, the results for the other baselines show that Next2Me and AudioMatch exhibit only a minimal difference between the upper and lower halves around the median. Therefore, the constructed feature is incapable of distinguishing the acoustic context of subjects in the same group from that of subjects in different groups. Hence, the F1-Score drops significantly for those baselines.

Additionally, we also observe that the median value is closer to the first quartile. As we capture the proximity and audio signatures of the subjects in various environments, the similarity values between each pair of subjects vary significantly over the different meeting groups, causing a dense zone in the lower half below the median. The wide variation of the pairwise similarity values across different groups further indicates that a simple thresholding-based scheme is not suitable for detecting various types of meeting groups in diverse environments. Hence, it justifies the requirement of the more elaborate MeetSense scheme (Algorithm 2).

5.5.3 Dissecting the Methodologies

Next, we look into the performance of the various competing methodologies by exploring their internals. From the above experiments, we observe that the baselines perform poorly for scenario S3. Therefore, we further study the top n frequency based similarity, the thresholding-based Hamming distance measure, and the cepstrum similarity for that scenario. From Figure 14a, we find that the audio cross-correlations for 'same group' (target subjects within a group) and 'different groups' (target subjects from different groups) pairs of subjects are more distinct in MeetSense compared with the other two baseline methods. An outdoor environment like the cafeteria (scenario S3) is noisy due to the presence of non-member voices and environmental noise. As Next2Me considers only the top 6 frequency components (as per our implementation, n = 6), it unknowingly includes those noise frequencies, resulting in similar audio correlation values for 'same group' and 'different groups'. On the other side, AudioMatch compares the logarithmic amplitude of the STFT of the audio signal with its neighbouring points to generate a 16-bit fingerprint. Therefore, the 16-bit fingerprint generation relies entirely on the central amplitude value being compared. If the central value is corrupted by environmental noise, the entire 16-bit fingerprint is prone to corruption. These spurious fingerprints are further used for computing the Hamming distance between a pair of subjects, resulting in identical behaviour of the audio correlation values for 'same group' and 'different groups'. As MeetSense considers the cepstrum, which contains the tone information, for computing the audio correlation, the correlation values are closer to one for 'same group' and closer to zero for 'different groups'. Therefore, the audio features of MeetSense can distinguish this scenario: although Next2Me and AudioMatch fail to separate out the groups based on the audio features, MeetSense can correctly differentiate the groups.

We further evaluate Next2Me, AudioMatch and MeetSense in varying noise environments. As simulating completely random noise is nearly impossible, we generate Gaussian noise at different levels and superimpose the noise on the captured audio signal. Figure 14b shows that MeetSense is more noise resistant than Next2Me, whereas AudioMatch is as noise resistant as MeetSense. Analogous to the noisy environmental scenarios, Next2Me performs poorly in the presence of statistically generated Gaussian noise due to the improper selection of the top 6 frequencies. In the case of AudioMatch, the generation of the 16-bit fingerprint causes the drop, though it is much less prone to noise than Next2Me because it considers the logarithmic amplitude of the STFT.
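A minimal sketch of this noise-injection step is shown below. The plots report the level simply as "noise (dB)", so the exact convention is an assumption here; this version parameterizes the injected Gaussian noise by a signal-to-noise ratio in dB, which is one plausible realization rather than our exact experimental setup.

```python
import numpy as np

def superimpose_gaussian_noise(audio, snr_db, rng=None):
    """Superimpose zero-mean Gaussian noise on a captured audio signal.

    snr_db: target signal-to-noise ratio in dB (our assumed convention;
    lower values mean stronger injected noise)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# Usage: degrade a clean recording at several noise levels and re-run the
# acoustic context detection pipeline on each degraded copy.
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
clean = np.sin(2 * np.pi * 220 * t)
for snr in (40, 20, 10, 0):
    noisy = superimpose_gaussian_noise(clean, snr)
    print(snr, round(float(np.mean(noisy ** 2)), 3))  # power grows as SNR drops
```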
Next, we compare the three acoustic context detection mechanisms in terms of their computational resource requirements, as shown in Figure 15. We measured these performance statistics on a standard Linux (kernel version 4.4.0) workstation (Dell Precision Tower 7810), using the free command to obtain the primary memory consumption of the different methodologies. We compute the total execution time and the overall memory consumption during the execution of the three methods. We observe that (i) MeetSense takes much less time per iteration during the computation process compared to Next2Me and AudioMatch (Figure 15a), and (ii) the memory consumption of MeetSense is less than that of Next2Me (Figure 15b).

Fig. 14: Performance analysis in different environments. (a) Audio correlation comparison for scenario S3 (ECDF of the audio cross-correlation for 'same group' and 'different groups' pairs under MeetSense, Next2Me and AudioMatch). (b) Effect of noise on MeetSense, AudioMatch and Next2Me (F1-Score vs. noise level in dB).

Fig. 15: Performance in terms of computational cost for the unsupervised mechanisms. (a) Time elapsed (sec) vs. number of iterations. (b) Memory consumption (MiB) over time (sec).

MeetSense enjoys the benefit of lower resource consumption primarily because it computes only the cepstrum component for a few segments over the entire interaction time, whereas Next2Me uses several windowing operations along with smoothing and FFT computations. AudioMatch calculates the audio spectrogram using the short-time Fourier transform with a highly overlapping Hamming window, causing a higher elapsed time than MeetSense. In a nutshell, we observe that MeetSense can detect various meeting groups generically and in a device-independent way, while providing better group detection accuracy with lower resource usage compared to the baseline mechanisms.

5.5.4 MeetSense Internals

In this subsection, we discuss the importance of the modularity value in MeetSense, and how the proximity and acoustic context features improve the modularity of the proposed group detection mechanism. We plot the F1-Score with respect to the modularity, as shown in Figure 16a. We observe that the F1-Score converges to 1.0 when the modularity is more than 0.35. Hence, a group is detected with high accuracy when its cohesiveness is also high. This indicates the importance of the modularity index in MeetSense. Therefore, the community detection algorithm used in MeetSense tries to optimize the modularity in successive iterations.
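As an illustration of this modularity-driven grouping (a sketch, not our Algorithm 2), the following example builds a weighted graph from hypothetical pairwise similarity values, extracts communities with networkx's Louvain implementation, and uses the resulting modularity as the cohesiveness check; the similarity values and the acceptance threshold are invented for the example.

```python
import networkx as nx

def detect_groups(similarity, min_modularity=0.0):
    """Community detection over a weighted pairwise-similarity graph,
    returning the partition together with its modularity M."""
    G = nx.Graph()
    for (u, v), w in similarity.items():
        G.add_edge(u, v, weight=w)
    parts = nx.community.louvain_communities(G, weight="weight", seed=7)
    m = nx.community.modularity(G, parts, weight="weight")
    # M > 0: multiple cohesive groups (Type 1); M ~ 0: one large group (Type 2).
    if m > min_modularity:
        return parts, m
    return [set(G.nodes)], m

# Hypothetical similarities: {a, b, c} interact strongly, and so do {d, e}.
sim = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.85,
       ("d", "e"): 0.9, ("c", "d"): 0.05, ("a", "e"): 0.02}
groups, M = detect_groups(sim)
print(groups, round(M, 3))  # expect [{a, b, c}, {d, e}] with positive M
```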
In this line, Figure 16b highlights the importance of Step 2 of the MeetSense model, where we plot the F1-Score with respect to the weight (w) of the audio features. The figure indicates that the maximum modularity is obtained when both the proximity and the acoustic context attain non-zero weights, indicating that both features are important for the correct detection of meeting groups. However, the importance of the acoustic context is more prominent than that of the proximity feature.

Fig. 16: MeetSense insights. (a) Modularity and F1-Score. (b) Impact of the features (F1-Score vs. weight of the audio features). (c) Convergence of the algorithm (intermediate modularity over the number of iterations for Type 1 and Type 2 groups).

Next, we look into the convergence property of MeetSense. As mentioned earlier, the modularity of a weighted fully connected graph converges to zero when all the nodes form a single large community [40]. Accordingly, the group detection algorithm converges with two cases of the modularity (M) value: (a) M > 0.0, when there are multiple groups in the population of subjects (Type 1), and (b) M ≈ 0.0, when there is a single large group consisting of all the subjects in the population (Type 2). Figure 16c plots the change in the modularity value with respect to the number of iterations of the algorithm for these two cases. We observe that for a Type 1 group, corresponding to scenario S5, we obtain the maximum modularity, close to 0.4, within 3 iterations, whereas for a Type 2 group, corresponding to scenario S4, the modularity starts at a negative value and converges to zero by iteration 4.

6 CONCLUSION

In this paper, we have developed MeetSense, a smartphone-based lightweight methodology to infer various meeting groups by sensing the acoustic context around users in proximity. From the pilot study, we observed that although the audio levels captured by a smartphone give a good indication of the acoustic context of the environment, significant audio pressure from speakers of nearby groups also gets captured due to the omnidirectional nature of smartphone microphones. We have developed a novel unsupervised methodology to process audio signals to capture the context, and we have used the concept of cohesivity from network science to identify the groups based on the context information. The implementation and thorough testing of MeetSense show that it can significantly improve group detection accuracy compared to other baselines, and that the method is independent of the scenarios or devices used to capture the signals. However, our understanding is that MeetSense performs well when the underlying groups are sufficiently cohesive; it may fail in scenarios where multiple groups overlap in space, or where a group spatially overlaps with individuals who are not part of that group, for example, small groups in a crowded space.
Nevertheless, the proposed methodology has the advantages of device independence, unsupervised modelling and lightweight computation, which can be utilized to develop a wide range of applications that require user group identification and group behaviour analysis.

REFERENCES

[1] K. A. McComas, "Citizen satisfaction with public meetings used for risk communication," Journal of Applied Communication Research, vol. 31, no. 2, pp. 164–184, 2003.
[2] T. Clark, "Teaching students to enhance the ecology of small group meetings," Business Communication Quarterly, vol. 61, no. 4, pp. 40–52, 1998.
[3] K. McComas, L. S. Tuite, L. Waks, and L. A. Sherman, "Predicting satisfaction and outcome acceptance with advisory committee meetings: The role of procedural justice," Journal of Applied Social Psychology, vol. 37, no. 5, pp. 905–927, 2007.
[4] B. A. Reinig and B. Shin, "The dynamic effects of group support systems on group meetings," Journal of Management Information Systems, vol. 19, no. 2, pp. 303–325, 2002.
[5] A. P. Horan, "An effective workplace stress management intervention: Chicken soup for the soul at work employee groups," Work, vol. 18, no. 1, pp. 3–13, 2002.
[6] J. A. Allen, T. Beck, C. W. Scott, and S. G. Rogelberg, "Understanding workplace meetings: A qualitative taxonomy of meeting purposes," Management Research Review, vol. 37, no. 9, pp. 791–814, 2014.
[7] K. Jayarajah, Y. Lee, A. Misra, and R. K. Balan, "Need accurate user behaviour? Pay attention to groups!" in Proceedings of the ACM Conference on Pervasive and Ubiquitous Computing, 2015, pp. 855–866.
[8] M. B. Gilboy, S. Heinerichs, and G. Pazzaglia, "Enhancing student engagement using the flipped classroom," Journal of Nutrition Education and Behavior, vol. 47, no. 1, pp. 109–114, 2015.
[9] J. Weppner and P. Lukowicz, "Bluetooth based collaborative crowd density estimation with mobile phones," in Proceedings of the IEEE Conference on Pervasive Computing and Communications, 2013, pp. 193–200.
[10] P. Sapiezynski, A. Stopczynski, D. K. Wind, J. Leskovec, and S. Lehmann, "Inferring person-to-person proximity using WiFi signals," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, p. 24, 2017.
[11] H. Wang, S. Sen, A. Elgohary, M. Farid, M. Youssef, and R. R. Choudhury, "No need to war-drive: Unsupervised indoor localization," in Proceedings of the 10th ACM Conference on Mobile Systems, Applications, and Services, 2012, pp. 197–210.
[12] C. Wu, J. Xu, Z. Yang, N. D. Lane, and Z. Yin, "Gain without pain: Accurate WiFi-based localization using fingerprint spatial gradient," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 1, no. 2, p. 29, 2017.
[13] P. Casagranda, M. L. Sapino, and K. S. Candan, "Audio assisted group detection using smartphones," in Proceedings of the IEEE Conference on Multimedia & Expo Workshops, 2015, pp. 1–6.
[14] D. Santani and D. Gatica-Perez, "Loud and trendy: Crowdsourcing impressions of social ambiance in popular indoor urban places," in Proceedings of the 23rd ACM Conference on Multimedia, 2015, pp. 211–220.
[15] F. Calabrese, L. Ferrari, and V. D. Blondel, "Urban sensing using mobile phone network data: A survey of research," ACM Computing Surveys, vol. 47, no. 2, p. 25, 2015.
[16] R. Sen, Y. Lee, K. Jayarajah, A. Misra, and R. K. Balan, "GruMon: Fast and accurate group monitoring for heterogeneous urban spaces," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems, 2014, pp. 46–60.
[17] M. Azizyan, I. Constandache, and R. Roy Choudhury, "SurroundSense: Mobile phone localization via ambience fingerprinting," in Proceedings of the 15th ACM Conference on Mobile Computing and Networking, 2009, pp. 261–272.
[18] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[19] J. Baker and C. Efstratiou, "Next2Me: Capturing social interactions through smartphone devices using WiFi and audio signals," in Proceedings of the EAI Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, 2017.
[20] P. Pons and M. Latapy, "Computing communities in large networks using random walks," Journal of Graph Algorithms and Applications, vol. 10, no. 2, pp. 191–218, 2006.
[21] K. Chintalapudi, A. Padmanabha Iyer, and V. N. Padmanabhan, "Indoor localization without the pain," in Proceedings of the 16th ACM Conference on Mobile Computing and Networking, 2010, pp. 173–184.
[22] H. Abdelnasser, R. Mohamed, A. Elgohary, M. F. Alzantot, H. Wang, S. Sen, R. R. Choudhury, and M. Youssef, "SemanticSLAM: Using environment landmarks for unsupervised indoor localization," IEEE Transactions on Mobile Computing, vol. 15, no. 7, pp. 1770–1782, 2016.
[23] H. Aly, A. Basalamah, and M. Youssef, "Accurate and energy-efficient GPS-less outdoor localization," ACM Transactions on Spatial Algorithms and Systems, vol. 3, no. 2, p. 4, 2017.
[24] J. Paek, J. Kim, and R. Govindan, "Energy-efficient rate-adaptive GPS-based positioning for smartphones," in Proceedings of the 8th ACM Conference on Mobile Systems, Applications, and Services, 2010.
[25] T. M. T. Do and D. Gatica-Perez, "GroupUs: Smartphone proximity data and human interaction type mining," in Proceedings of the 15th Annual IEEE Symposium on Wearable Computers, 2011, pp. 21–28.
[26] H. Hong, C. Luo, and M. C. Chan, "SocialProbe: Understanding social interaction through passive WiFi monitoring," in Proceedings of the 13th ACM Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, 2016, pp. 94–103.
[27] Y. Lee, C. Min, C. Hwang, J. Lee, I. Hwang, Y. Ju, C. Yoo, M. Moon, U. Lee, and J. Song, "SocioPhone: Everyday face-to-face interaction monitoring platform using multi-phone sensor fusion," in Proceedings of the 11th ACM Conference on Mobile Systems, Applications, and Services, 2013, pp. 375–388.
[28] S. Zhang, Y. Zhu, and A. Roy-Chowdhury, "Tracking multiple interacting targets in a camera network," Computer Vision and Image Understanding, vol. 134, pp. 64–73, 2015.
[29] N. Eagle and A. S. Pentland, "Reality mining: Sensing complex social systems," Personal and Ubiquitous Computing, vol. 10, no. 4, pp. 255–268, 2006.
[30] R. Friedman, A. Kogan, and Y. Krivolapov, "On power and throughput tradeoffs of WiFi and Bluetooth in smartphones," IEEE Transactions on Mobile Computing, vol. 12, no. 7, pp. 1363–1376, 2013.
[31] Z. Liu, Z. Zhang, L. W. He, and P. Chou, "Energy-based sound source localization and gain normalization for ad hoc microphone arrays," in Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. 2, 2007, pp. II-761–II-764.
[32] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[33] C. Xu, S. Li, G. Liu, Y. Zhang, E. Miluzzo, Y.-F. Chen, J. Li, and B. Firner, "Crowd++: Unsupervised speaker count with smartphones," in Proceedings of the ACM Joint Conference on Pervasive and Ubiquitous Computing, 2013, pp. 43–52.
[34] T. Song, X. Cheng, H. Li, J. Yu, S. Wang, and R. Bie, "Detecting driver phone calls in a moving vehicle based on voice features," in Proceedings of the 35th IEEE Conference on Computer Communications, 2016, pp. 1–9.
[35] W. Passchier-Vermeer and W. F. Passchier, "Noise exposure and public health," Environmental Health Perspectives, vol. 108, pp. 123–131, 2000.
[36] M. L. Narayana and S. K. Kopparapu, "Effect of noise-in-speech on MFCC parameters," in Proceedings of the 9th WSEAS Conference on Signal, Speech and Image Processing, and 9th WSEAS Conference on Multimedia, Internet & Video Technologies, 2009, pp. 39–43.
[37] M. Chen, Z. Liu, L. W. He, P. Chou, and Z. Zhang, "Energy-based position estimation of microphones and speakers for ad hoc microphone arrays," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007, pp. 22–25.
[38] M. Guggenberger, M. Lux, and L. Böszörmenyi, "An analysis of time drift in hand-held recording devices," in Proceedings of the International Conference on Multimedia Modeling, 2015, pp. 203–213.
[39] A. Lancichinetti, S. Fortunato, and F. Radicchi, "Benchmark graphs for testing community detection algorithms," Physical Review E, vol. 78, no. 4, p. 046110, 2008.
[40] M. E. Newman, "Modularity and community structure in networks," Proceedings of the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[41] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, vol. 26, no. 3, pp. 297–302, 1945.

Snigdha Das is currently pursuing a PhD in the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India. She received an M.S. (by Research) from the School of Information Technology, Indian Institute of Technology Kharagpur, India, in 2015. Her current research interests include mobile systems and ubiquitous computing.

Soumyajit Chatterjee joined IIT Kharagpur in 2017 as a research scholar (doctoral programme). He received his M.Tech in Computer Science from the Indian Institute of Technology (Indian School of Mines), Dhanbad, in 2016, and his B.E. from the University Institute of Technology, University of Burdwan, in 2012. He also has one year and seven months of industry experience. His current research domain is mobile systems and ubiquitous computing.

Sandip Chakraborty received his Ph.D. and M.Tech degrees from the Indian Institute of Technology Guwahati, India, in 2014 and 2011, respectively, and his B.E. in Information Technology from Jadavpur University, Kolkata, India, in 2009. Currently, he is an Assistant Professor in the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India.
His research interests include computer systems and distributed computing.

Bivas Mitra is an Assistant Professor (since April 2013) in the Department of Computer Science & Engineering at IIT Kharagpur, India. Prior to that, he worked briefly with Samsung Electronics, Noida, as a Chief Engineer. He received his Ph.D. in Computer Science & Engineering from IIT Kharagpur, India. He did his first postdoc (May 2010–June 2011) at the French National Centre for Scientific Research (CNRS), Paris, France, and his second postdoc (July 2011–July 2012) at the Université catholique de Louvain (UCL), Belgium. His research interests include network science, multilayer networks and mobile affective computing.
