End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Chiori Hori†, Huda Alamri∗†, Jue Wang†, Gordon Wichern†, Takaaki Hori†, Anoop Cherian†, Tim K. Marks†, Vincent Cartillier∗, Raphael Gontijo Lopes∗, Abhishek Das∗, Irfan Essa∗, Dhruv Batra∗, Devi Parikh∗

† Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA
∗ School of Interactive Computing, Georgia Tech

Abstract

Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on ∼9,000 videos. Using this new dataset, we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code, and pretrained models will be made publicly available for a new Video Scene-Aware Dialog challenge.
1 Introduction

Spoken dialog technologies have been applied in real-world human-machine interfaces, including smartphone digital assistants, car navigation systems, voice-controlled smart speakers, and human-facing robots [1, 2, 3]. Generally, a dialog system consists of a pipeline of data processing modules, including automatic speech recognition, spoken language understanding, dialog management, sentence generation, and speech synthesis. However, all of these modules require significant hand engineering and domain knowledge for training. Recently, end-to-end dialog systems have been gathering attention, and they obviate this need for expensive hand engineering to some extent. In end-to-end approaches, dialog models are trained using only paired input and output sentences, without relying on pre-designed data processing modules or intermediate internal data representations such as concept tags and slot-value pairs. End-to-end systems can be trained to directly map from a user's utterance to a system response sentence and/or action. This significantly reduces the data preparation and system development cost. Several types of sequence-to-sequence models have been applied to end-to-end dialog systems, and it has been shown that they can be trained in a completely data-driven manner. End-to-end approaches have also been shown to better handle flexible conversations between the user and the system by training the model on large conversational datasets [4, 5].

Preprint. Work in progress.

In these applications, however, all conversation is triggered by user speech input, and the contents of system responses are limited by the training data (a set of dialogs). Current dialog systems cannot understand dynamic scenes using multimodal sensor-based input such as vision and non-speech audio, so machines using such dialog systems cannot have a conversation about what's going on in their surroundings.
To develop machines that can carry on a conversation about objects and events taking place around the machines or the users, dynamic scene-aware dialog technology is essential. To interact with humans about visual information, systems need to understand both visual scenes and natural language inputs. One naive approach could be a pipeline system in which the output of a visual description system is used as an input to a dialog system. In this cascaded approach, semantic frames such as "who" is doing "what" and "where" must be extracted from the video description results. The prediction of the frame type and the value of the frame must be trained using annotated data. In contrast, the recent revolution of neural network models allows us to combine different modules into a single end-to-end differentiable network. We can simultaneously input video features and user utterances into an encoder-decoder-based system whose outputs are natural-language responses. Using this end-to-end framework, visual question answering (VQA) has been intensively researched in the field of computer vision [6, 7, 8]. The goal of VQA is to generate answers to questions about an imaged scene, using the information present in a single static image. As a further step towards conversational visual AI, the new task of visual dialog was introduced [9], in which an AI agent holds a meaningful dialog with humans about an image using natural, conversational language [10]. While VQA and visual dialog take significant steps towards human-machine interaction, they only consider a single static image. To capture the semantics of dynamic scenes, recent research has focused on video description (natural-language descriptions of videos). The state of the art in video description uses a multimodal attention mechanism that selectively attends to different input modalities (feature types), such as spatiotemporal motion features and audio features, in addition to temporal attention [11].
In this paper, we propose a new research target: a dialog system that can discuss dynamic scenes with humans, which lies at the intersection of multiple avenues of research in natural language processing, computer vision, and audio processing. To advance this goal, we introduce a new model that incorporates technologies for multimodal attention-based video description into an end-to-end dialog system. We also introduce a new dataset of human dialogs about videos. We are making our dataset, code, and model publicly available for a new Video Scene-Aware Dialog Challenge.

2 Audio Visual Scene-Aware Dialog Dataset

We collected text-based conversation data about short videos for Audio Visual Scene-Aware Dialog (AVSD), as described in [12], using an existing video description dataset, Charades [13], for the 7th Dialog System Technology Challenge (DSTC7)¹. Charades is an untrimmed, multi-action dataset containing 11,848 videos split into 7,985 for training, 1,863 for validation, and 2,000 for testing. It has 157 action categories, with several fine-grained actions. Further, this dataset also provides 27,847 textual descriptions for the videos; each video is associated with 1–3 sentences. As these textual descriptions are only available for the training and validation sets, we report evaluation results on the validation set.

The data collection paradigm for dialogs was similar to the one described in [9], in which, for each image, two different Mechanical Turk workers interacted via a text interface to yield a dialog. In [9], each dialog consisted of a sequence of questions and answers about an image. In the video scene-aware dialog case, two Amazon Mechanical Turk (AMT) workers had a discussion about events in a video. One of the workers played the role of an answerer who had already watched the video. The answerer answered questions asked by the other AMT worker, the questioner.
The questioner was not allowed to watch the whole video, but only the first, middle, and last frames of the video, which were single static images. After asking about the events that happened between those frames through 10 rounds of QA, the questioner summarized the events in the video as a description.

In total, we collected dialogs for 7,043 videos from the Charades training set and for all of the validation set (1,863 videos). Since we did not have scripts for the test set, we split the validation set into 732 and 733 videos and used them as our validation and test sets, respectively.

[Figure 1: Our multimodal-attention based video scene-aware dialog system. The figure shows the audio and visual feature extractors, the question and dialog-history LSTM encoders, the attentional multimodal fusion layer producing g_n from g_n^(q), g_n^(av), and g_n^(h), and the LSTM decoder that generates the answer word sequence, illustrated with a sample dialog about a boy lying on his bed watching a TV program about Earth.]

¹ http://workshop.colips.org/dstc7/call.html
See Table 1 for statistics. The average numbers of words per question and answer are 8 and 10, respectively.

Table 1: Video Scene-aware Dialog Dataset on Charades

               training   validation      test
  #dialogs        6,172          732       733
  #turns        123,480       14,680    14,660
  #words      1,163,969      138,314   138,790

3 Video Scene-aware Dialog System

We built an end-to-end dialog system that can generate answers in response to user questions about events in a video sequence. Our architecture is similar to the Hierarchical Recurrent Encoder in Das et al. [9]. The question, visual features, and the dialog history are fed into corresponding LSTM-based encoders to build up a context embedding, and then the outputs of the encoders are fed into an LSTM-based decoder to generate an answer. The history consists of encodings of QA pairs. We feed multimodal attention-based video features into the LSTM encoder instead of single static image features. Figure 1 shows the architecture of our video scene-aware dialog system.

3.1 End-to-end Conversation Modeling

This section explains the neural conversation model of [4], which is designed as a sequence-to-sequence mapping process using recurrent neural networks (RNNs). Let X and Y be input and output sequences, respectively. The model is used to compute the posterior probability distribution P(Y | X). For conversation modeling, X corresponds to the sequence of previous sentences in a conversation, and Y is the system response sentence we want to generate. In our model, both X and Y are sequences of words. X contains all of the previous turns of the conversation, concatenated in sequence, separated by markers that indicate to the model not only that a new turn has started, but also which speaker said that sentence. The most likely hypothesis of Y is obtained as

    \hat{Y} = \arg\max_{Y \in \mathcal{V}^*} P(Y | X)                                               (1)
            = \arg\max_{Y \in \mathcal{V}^*} \prod_{m=1}^{|Y|} P(y_m | y_1, \dots, y_{m-1}, X),     (2)

where \mathcal{V}^* denotes the set of sequences of zero or more words in the system vocabulary \mathcal{V}.
Let X be the word sequence x_1, ..., x_T and Y be the word sequence y_1, ..., y_M. The encoder network is used to obtain hidden states h_t for t = 1, ..., T as:

    h_t = \mathrm{LSTM}(x_t, h_{t-1}; \theta_{enc}),     (3)

where h_0 is initialized with a zero vector, and LSTM(·) is an LSTM function with parameter set \theta_{enc}. The decoder network is used to compute the probabilities P(y_m | y_1, ..., y_{m-1}, X) for m = 1, ..., M as:

    s_0 = h_T                                                                (4)
    s_m = \mathrm{LSTM}(y_{m-1}, s_{m-1}; \theta_{dec})                      (5)
    P(y_m | y_1, \dots, y_{m-1}, X) = \mathrm{softmax}(W_o s_m + b_o),       (6)

where y_0 is set to <eos>, a special symbol representing the end of a sequence, and s_m is the m-th decoder state. \theta_{dec} is the set of decoder parameters, and W_o and b_o are a matrix and a vector. In this model, the initial decoder state s_0 is given by the final encoder state h_T as in Eq. (4), and the probability is estimated from each state s_m. To efficiently find \hat{Y} in Eq. (1), we use a beam search technique, since it is computationally intractable to consider all possible Y.

In the scene-aware-dialog scenario, a scene context vector including audio and visual features is also fed to the decoder. We modify the LSTM in Eqs. (4)–(6) as

    s_{n,0} = \mathbf{0}                                                                         (7)
    s_{n,m} = \mathrm{LSTM}([y_{n,m-1}^\top, g_n^\top]^\top, s_{n,m-1}; \theta_{dec}),           (8)
    P(y_{n,m} | y_{n,1}, \dots, y_{n,m-1}, X) = \mathrm{softmax}(W_o s_{n,m} + b_o),             (9)

where g_n is the concatenation of the question encoding g_n^{(q)}, the audio-visual encoding g_n^{(av)}, and the history encoding g_n^{(h)} for generating the n-th answer A_n = y_{n,1}, ..., y_{n,|Y_n|}. Note that unlike Eq. (4), we feed all contextual information to the LSTM at every prediction step. This architecture is more flexible, since the dimensions of the encoder and decoder states can be different.
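As a concrete illustration, the beam search used to approximate Eqs. (1)–(2) can be sketched in plain Python. The `toy_step` function below is a hypothetical stand-in for the trained decoder's per-step softmax (Eq. (6)), not the actual model:

```python
import math

def beam_search(step_fn, vocab, eos, beam_width=3, max_len=10):
    """Approximate argmax_Y prod_m P(y_m | y_1..y_{m-1}, X) with a beam.

    step_fn(prefix) -> dict mapping each word in `vocab` to its conditional
    probability given the prefix (a stand-in for the decoder's softmax).
    """
    beams = [(0.0, [])]          # (log-probability, word sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            probs = step_fn(seq)
            for w in vocab:
                p = probs.get(w, 0.0)
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [w]))
        # Keep only the top `beam_width` partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])[1]

# Toy decoder distribution: prefers "he" -> "is" -> "<eos>".
def toy_step(prefix):
    table = {(): {"he": 0.7, "she": 0.3},
             ("he",): {"is": 0.9, "<eos>": 0.1},
             ("she",): {"is": 0.9, "<eos>": 0.1},
             ("he", "is"): {"<eos>": 1.0},
             ("she", "is"): {"<eos>": 1.0}}
    return table[tuple(prefix)]

print(beam_search(toy_step, ["he", "she", "is", "<eos>"], "<eos>"))
# -> ['he', 'is', '<eos>']
```

In the actual system the per-step probabilities come from the decoder state s_{n,m}, and the context vector g_n is fed in at every step as in Eq. (8).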
g_n^{(q)} is encoded by another LSTM for the n-th question, and g_n^{(h)} is encoded with hierarchical LSTMs, where one LSTM encodes each question-answer pair and the other LSTM summarizes the question-answer encodings into g_n^{(h)}. The audio-visual encoding is obtained by the multimodal attention described in the next section.

3.2 Multimodal-attention based Video Features

To predict a word sequence in video description, prior work [14] extracted content vectors from image features of VGG-16 and spatiotemporal motion features of C3D, and combined them into one vector in the fusion layer as:

    g_n^{(av)} = \tanh\left( \sum_{k=1}^{K} d_{k,n} \right),     (10)

where

    d_{k,n} = W_{ck}^{(\lambda_D)} c_{k,n} + b_{ck}^{(\lambda_D)},     (11)

and c_{k,n} is a context vector obtained using the k-th input modality. We call this approach Naïve Fusion, in which multimodal feature vectors are combined using projection matrices W_{ck} for K different modalities (input sequences x_{k1}, ..., x_{kL} for k = 1, ..., K).

To fuse multimodal information, prior work [11] proposed a method that extends the attention mechanism. We call this fusion approach multimodal attention. The approach can pay attention to specific input modalities, based on the current state of the decoder, to predict the word sequence in video description. The number of modalities, i.e., the number of sequences of input feature vectors, is denoted by K. The following equation shows the attention-based feature fusion:

    g_n^{(av)} = \tanh\left( \sum_{k=1}^{K} \beta_{k,n} d_{k,n} \right).     (12)

A mechanism similar to temporal attention is applied to obtain the multimodal attention weights \beta_{k,n}:

    \beta_{k,n} = \frac{\exp(v_{k,n})}{\sum_{\kappa=1}^{K} \exp(v_{\kappa,n})},     (13)

where

    v_{k,n} = w_B^\top \tanh(W_B g_n^{(q)} + V_{Bk} c_{k,n} + b_{Bk}).     (14)

Here the multimodal attention weights are determined by the question encoding g_n^{(q)} and the context vector c_{k,n} of each modality, as well as by the temporal attention weights within each modality.
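Eqs. (11)–(14) can be sketched with NumPy as follows. The parameter shapes and random initializations are illustrative only; in the actual system all of these matrices and vectors are trained jointly with the rest of the network:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_q, d_c, d_att, d_out = 2, 8, 6, 5, 4   # illustrative sizes

# Parameters of Eqs. (11)-(14); random here, learned in practice.
W_B = rng.standard_normal((d_att, d_q))
V_B = [rng.standard_normal((d_att, d_c)) for _ in range(K)]
b_B = [rng.standard_normal(d_att) for _ in range(K)]
w_B = rng.standard_normal(d_att)
W_c = [rng.standard_normal((d_out, d_c)) for _ in range(K)]
b_c = [rng.standard_normal(d_out) for _ in range(K)]

def multimodal_attention(g_q, contexts):
    """Fuse per-modality context vectors c_{k,n} into g_n^(av)."""
    # Eq. (14): relevance score of each modality given the question encoding.
    v = np.array([w_B @ np.tanh(W_B @ g_q + V_B[k] @ contexts[k] + b_B[k])
                  for k in range(K)])
    # Eq. (13): softmax over modalities.
    e = np.exp(v - v.max())
    beta = e / e.sum()
    # Eq. (11): per-modality projections d_{k,n}.
    d = [W_c[k] @ contexts[k] + b_c[k] for k in range(K)]
    # Eq. (12): attention-weighted fusion.
    return np.tanh(sum(beta[k] * d[k] for k in range(K))), beta

g_q = rng.standard_normal(d_q)                           # question encoding g_n^(q)
contexts = [rng.standard_normal(d_c) for _ in range(K)]  # e.g. visual and audio context vectors
g_av, beta = multimodal_attention(g_q, contexts)
```

Because beta depends on g_n^{(q)}, a question about sound can shift the weight toward the audio modality while a question about appearance shifts it toward the visual one.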
W_B and V_{Bk} are matrices, w_B and b_{Bk} are vectors, and v_{k,n} is a scalar. The multimodal attention weights can change according to the question encoding and the feature vectors (shown in Figure 1). This enables the decoder network to attend to a different set of features and/or modalities when predicting each subsequent word in the description. Naïve Fusion can be considered a special case of attentional fusion in which all modality attention weights \beta_{k,n} are constantly 1.

4 Experiments for Multimodal Attention-based Video Features

To select the best video features for the video scene-aware dialog system, we first evaluate the performance of video description using multimodal attention-based video features.

4.1 Datasets

We evaluated our proposed feature fusion using the MSVD (YouTube2Text) [15], MSR-VTT [16], and Charades [13] video datasets.

• MSVD (YouTube2Text) covers a wide range of topics, including sports, animals, and music. We applied the same conditions defined by [15]: a training set of 1,200 video clips, a validation set of 100 clips, and a test set of the remaining 670 clips.

• MSR-VTT is split into training, validation, and testing sets of 6,513, 497, and 2,990 clips, respectively. However, approximately 12% of the MSR-VTT videos have been removed from YouTube. We used the available data, consisting of 5,763, 419, and 2,616 clips for training, validation, and testing, respectively, as defined by [11].

• Charades [13] is split into 7,985 clips for training and 1,863 clips for validation, and provides 27,847 textual descriptions for the videos. As these textual descriptions are only available for the training and validation sets, we report the evaluation results on the validation set.

Details of the textual descriptions are summarized in Table 2.
Table 2: Sizes of textual descriptions in MSVD (YouTube2Text), MSR-VTT, and Charades

  Dataset    #Clips   #Descriptions   #Descriptions/clip   #Words/description   Vocabulary size
  MSVD        1,970          80,839                41.00                 8.00            13,010
  MSR-VTT    10,000         200,000                20.00                 9.28            29,322
  Charades    9,848          16,140                 1.64                13.04             2,582

4.2 Video Processing

For the image features, we used a sequence of 4096-dimensional feature vectors from the fully-connected fc7 layer of a VGG-16 network pretrained on the ImageNet dataset. A pretrained C3D [17] model is used to generate features that model motion and short-term spatiotemporal activity. The C3D network reads sequential frames in the video and outputs a fixed-length feature vector every 16 frames; the 4096-dimensional activation vectors from the fully-connected fc6-1 layer were used as the spatiotemporal features.

In addition to the VGG-16 and C3D features, we also adopted the state-of-the-art I3D features [18], spatiotemporal features that were developed for action recognition. The I3D model inflates the 2D filters and pooling kernels in the Inception V3 network along their temporal dimension, building 3D spatiotemporal ones. We used the output from the "Mixed_5c" layer of the I3D network as the video features in our framework. As a pre-processing step, we normalized all the video features to have zero mean and unit norm; the mean was computed over all the sequences in the training set for the respective feature. In the experiments in this paper, we treated I3D-rgb (I3D features computed on a stack of 16 video frame images) and I3D-flow (I3D features computed on a stack of 16 frames of optical flow fields) as two separate modalities that are input to our multimodal attention model. To emphasize this, we refer to I3D in the results tables as I3D (rgb-flow).
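The feature normalization step above can be sketched as follows. Note that "unit norm" is read here as scaling each feature vector to unit L2 norm after subtracting the training-set mean; that convention, and the per-feature statistics, are assumptions:

```python
import numpy as np

def normalize_video_features(train_seqs, seqs):
    """Subtract the feature-wise mean computed over all training-set
    sequences, then scale each feature vector to unit L2 norm.
    (One plausible reading of "zero mean and unit norm".)"""
    mean = np.concatenate(train_seqs, axis=0).mean(axis=0)
    normalized = []
    for seq in seqs:
        centered = seq - mean
        norms = np.linalg.norm(centered, axis=1, keepdims=True)
        normalized.append(centered / np.maximum(norms, 1e-12))
    return normalized

# Illustrative: two I3D-like sequences of 1024-dim "Mixed_5c"-style vectors.
train = [np.random.randn(10, 1024), np.random.randn(7, 1024)]
normed = normalize_video_features(train, train)
```

The same routine is applied per feature type (I3D-rgb, I3D-flow, etc.), each with its own training-set mean.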
4.3 Audio Processing

While the original MSVD (YouTube2Text) dataset does not contain audio, we were able to collect audio data for 1,649 video clips (84% of the dataset) from the video URLs. In our previous work on multimodal attention for video description, we used two different types of audio features: concatenated mel-frequency cepstral coefficient (MFCC) features [19], and SoundNet [20] features [21]. In this paper, we also evaluate features extracted using a new state-of-the-art model, Audio Set VGGish [22].

Inspired by the VGG image classification architecture (Configuration A without the last group of convolutional/pooling layers), the Audio Set VGGish model operates on 0.96 s log Mel spectrogram patches extracted from 16 kHz audio, and outputs a 128-dimensional embedding vector. The model was trained to predict an ontology of labels from only the audio tracks of millions of YouTube videos. In this work, we overlap the input frames to the VGGish network by 50%, meaning an Audio Set VGGish feature vector is output every 0.48 s. For SoundNet [20], in which a fully convolutional architecture was trained to predict scenes and objects using a pretrained image model as a teacher, we take as input to the audio encoder the output of the second-to-last convolutional layer, which gives a 1024-dimensional feature vector every 0.67 s and has a receptive field of approximately 4.16 s. For raw MFCC features, sequences of 13-dimensional MFCC features are extracted from 50 ms windows every 25 ms; then 20 consecutive frames are concatenated into a 260-dimensional vector, normalized to zero mean and unit variance (computed over the training set), and used as input to the BLSTM audio encoder.

4.4 Experimental Setup

The caption generation model, i.e., the decoder network, is trained to minimize the cross-entropy criterion on the training set.
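The raw MFCC preparation described in Section 4.3 (13-dimensional frames every 25 ms, 20 consecutive frames stacked into a 260-dimensional vector, normalized with training-set statistics) can be sketched as follows; the MFCC extraction itself is assumed to have been done upstream, and the array sizes are illustrative:

```python
import numpy as np

def stack_mfcc(mfcc, context=20):
    """Concatenate `context` consecutive MFCC frames (here 13-dim, one
    every 25 ms) into single stacked feature vectors."""
    n_frames, n_ceps = mfcc.shape
    n_out = n_frames - context + 1
    return np.stack([mfcc[t:t + context].reshape(-1) for t in range(n_out)])

# Illustrative input: 100 frames of 13-dimensional MFCCs.
mfcc = np.random.randn(100, 13)
stacked = stack_mfcc(mfcc)          # 20 * 13 = 260-dimensional vectors

# Zero mean / unit variance, using (here, stand-in) training-set statistics.
mean, std = stacked.mean(axis=0), stacked.std(axis=0)
normalized = (stacked - mean) / (std + 1e-8)
```

The resulting normalized sequence is what the BLSTM audio encoder consumes.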
Image features and deep audio features (SoundNet and VGGish) are fed to the decoder network through one projection layer of 512 units, while MFCC audio features are fed to a BLSTM encoder (one projection layer of 512 units and bidirectional LSTM layers of 512 cells) followed by the decoder network. The decoder network has one LSTM layer with 512 cells. Each word is embedded into a 256-dimensional vector when it is fed to the LSTM layer. In this video description task, we used L2 regularization for all experimental conditions and used RMSprop optimization.

4.5 Evaluation

The quality of the automatically generated sentences is evaluated with objective measures of the similarity between the generated sentences and the ground truth sentences. We use the evaluation code for MS COCO caption generation² for objective evaluation of system outputs; it is a publicly available tool supporting various automated metrics for natural language generation, such as BLEU, METEOR, ROUGE_L, and CIDEr.

² https://github.com/tylin/coco-caption

Table 3: Video description evaluation results on the MSVD (YouTube2Text) test set.

  MSVD (YouTube2Text) Full Dataset
  Modalities (feature types)                      Evaluation metric
  Image     Spatiotemporal   Audio        BLEU4   METEOR   CIDEr
  VGG-16    C3D              -            0.524   0.320    0.688
  VGG-16    C3D              MFCC         0.539   0.322    0.674
  -         I3D (rgb-flow)   -            0.525   0.330    0.742
  -         I3D (rgb-flow)   MFCC         0.527   0.325    0.702
  -         I3D (rgb-flow)   SoundNet     0.529   0.319    0.719
  -         I3D (rgb-flow)   VGGish       0.554   0.332    0.743

Table 4: Video description evaluation results on the MSR-VTT Subset. Approximately 12% of the MSR-VTT videos have been removed from YouTube, so we train and test on the remaining subset of MSR-VTT videos that we were able to download. Normalization of the visual features was not applied to MSR-VTT in these experiments.
  MSR-VTT Subset
  Modalities (feature types)                      Evaluation metric
  Image     Spatiotemporal   Audio        BLEU4   METEOR   CIDEr
  VGG-16    C3D              MFCC         0.397   0.255    0.400
  -         I3D (rgb-flow)   -            0.347   0.241    0.349
  -         I3D (rgb-flow)   MFCC         0.364   0.253    0.393
  -         I3D (rgb-flow)   SoundNet     0.366   0.246    0.387
  -         I3D (rgb-flow)   VGGish       0.390   0.263    0.417

4.6 Results and Discussion

Tables 3, 4, and 5 show the evaluation results on the MSVD (YouTube2Text), MSR-VTT Subset, and Charades datasets. The I3D spatiotemporal features outperformed the combination of VGG-16 image features and C3D spatiotemporal features. We also tried a combination of VGG-16 image features plus I3D spatiotemporal features, but we do not report those results because they did not improve performance over I3D features alone. We believe this is because I3D features already include enough image information for the video description task. In comparison to C3D, which uses the VGG-16 base architecture and was trained on the Sports-1M dataset [23], I3D uses a more powerful Inception-V3 network architecture and was trained on the larger (and cleaner) Kinetics [24] dataset. As a result, I3D has demonstrated state-of-the-art performance on the task of human action recognition in video sequences [18]. Further, the Inception-V3 architecture has significantly fewer network parameters than the VGG-16 network, making it more efficient.

In terms of audio features, the Audio Set VGGish model provided the best performance. While we expected the deep features (SoundNet and VGGish) to provide improved performance compared to MFCC, there are several possibilities as to why VGGish performed better than SoundNet. First, the VGGish model was trained on more data and had audio-specific labels, whereas SoundNet used pre-trained image classification networks to provide labels for training the audio network.
Second, the large Audio Set ontology used to train VGGish likely provides the ability to learn features more relevant to text descriptions than the broad scene/object labels used by SoundNet.

Table 5: Video description evaluation results on Charades.

  Charades Dataset
  Modalities (feature types)                      Evaluation metric
  Image     Spatiotemporal   Audio        BLEU4   METEOR   CIDEr
  -         I3D (rgb-flow)   -            0.094   0.149    0.236
  -         I3D (rgb-flow)   MFCC         0.098   0.156    0.268
  -         I3D (rgb-flow)   SoundNet     -       -        -
  -         I3D (rgb-flow)   VGGish       0.100   0.157    0.270

Since it is intractable to enumerate all possible word sequences in the vocabulary V, we usually limit the search to the n-best hypotheses generated by the system. Although in theory the distribution P(Y'|X) should be the true distribution, we instead estimate it using the encoder-decoder model.

Table 6: System response generation evaluation results with objective measures.

                       Attentional
  Input features       fusion        BLEU1   BLEU2   BLEU3   BLEU4   METEOR   ROUGE_L   CIDEr
  QA                   -             0.236   0.142   0.094   0.065   0.101    0.257     0.595
  QA + Captions        -             0.245   0.152   0.103   0.073   0.109    0.271     0.705
  QA + VGG16           -             0.231   0.141   0.095   0.067   0.102    0.259     0.618
  QA + I3D             no            0.246   0.153   0.104   0.073   0.109    0.269     0.680
  QA + I3D             yes           0.250   0.157   0.108   0.077   0.110    0.274     0.724
  QA + I3D + VGGish    no            0.249   0.155   0.106   0.075   0.110    0.275     0.701
  QA + I3D + VGGish    yes           0.256   0.161   0.109   0.078   0.113    0.277     0.727

5 Experiments for Video Scene-aware Dialog

In this paper, we extended an end-to-end dialog system to scene-aware dialog with multimodal fusion. As shown in Fig. 1, we embed the video and audio features selected in Section 4.

5.1 Conditions

We evaluated our proposed system with the dialog data we collected for Charades. Table 1 shows the size of each data set. We compared the performance of models trained on various combinations of the QA text and the visual and audio features. In addition, we tested the efficacy of the multimodal-attention mechanism for dialog response generation.
We employed the Adam optimizer [25] with the cross-entropy criterion and iterated the training process for up to 20 epochs. For each of the encoder-decoder model types, we selected the model with the lowest perplexity on the expanded development set. The LSTMs for encoding history and question sentences used #layers=2 and #cells=128. Video features were projected into a 256-dimensional feature space before modality fusion. The decoder LSTM likewise had #layers=2 and #cells=128.

5.2 Evaluation Results

Table 6 shows the response sentence generation performance of our models, training, and decoding methods using the objective measures BLEU1-4, METEOR, ROUGE_L, and CIDEr, which were computed with the evaluation code for MS COCO caption generation, as done for video description. We investigated different input features, including the question-answering dialog history plus the last question (QA), human-annotated captions (Captions), video features (VGG16, or I3D rgb and flow features (I3D)), and audio features (VGGish).

First, we evaluated response generation quality with only the QA features as a baseline, without any video scene features. Then we added the caption features to QA, and the performance improved significantly. This is because each caption provided the scene information in natural language and helped the system answer the question correctly. However, such human annotations are not available for real systems. Next, we added VGG16 features to QA, but they did not increase the evaluation scores over those of the QA-only features. This result indicates that QA+VGG16 is not enough to let the system generate better responses than those of QA+Captions. After that, we replaced VGG16 with I3D and obtained a definite improvement over the QA-only case. As in video description, the I3D features have thus been shown to be useful for scene-aware dialog as well.
Furthermore, we applied the multimodal attention mechanism (attentional fusion) to the I3D rgb and flow features, and obtained further improvement in all the metrics. Finally, we examined the efficacy of audio features. The table shows that VGGish clearly contributed to increasing the response quality, especially when using attentional fusion. The following example system responses were obtained with and without VGGish features, which worked better for questions regarding audio:

Question: was there audio?
Ground truth: there is audio, i can hear music and background noise.
I3D: no, there is no sound in the video.
I3D+VGGish: yes there is sound in the video.

6 Conclusion

In this paper, we proposed a new research target: a dialog system that can discuss dynamic scenes with humans, which lies at the intersection of multiple avenues of research in natural language processing, computer vision, and audio processing. To advance this goal, we introduced a new model that incorporates technologies for multimodal attention-based video description into an end-to-end dialog system. We also introduced a new dataset of human dialogs about videos. Using this new dataset, we trained an end-to-end conversation model that generates system responses in a dialog about an input video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes. We are making our dataset and model publicly available for a new Video Scene-Aware Dialog challenge.

References

[1] Michael F. McTear, "Spoken dialogue technology: enabling the conversational user interface," ACM Computing Surveys (CSUR), vol. 34, no. 1, pp. 90–169, 2002.

[2] Steve J. Young, "Probabilistic methods in spoken-dialogue systems," Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 358, no. 1769, pp.
1389–1402, 2000.

[3] Victor Zue, Stephanie Seneff, James R. Glass, Joseph Polifroni, Christine Pao, Timothy J. Hazen, and Lee Hetherington, "Jupiter: a telephone-based conversational interface for weather information," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 85–96, 2000.

[4] Oriol Vinyals and Quoc Le, "A neural conversational model," arXiv preprint, 2015.

[5] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau, "The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems," arXiv preprint arXiv:1506.08909, 2015.

[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, "VQA: Visual Question Answering," in International Conference on Computer Vision (ICCV), 2015.

[7] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, "Yin and Yang: Balancing and answering binary visual questions," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[8] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, "Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering," in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[9] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra, "Visual dialog," CoRR, vol. abs/1611.08669, 2016.

[10] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra, "Learning cooperative visual dialog agents with deep reinforcement learning," in International Conference on Computer Vision (ICCV), 2017.

[11] Chiori Hori, Takaaki Hori, Teng-Yok Lee, Ziming Zhang, Bret Harsham, John R. Hershey, Tim K. Marks, and Kazuhiko Sumi, "Attention-based multimodal fusion for video description," in The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[12] Huda Alamri, V incent Cartillier , Raphael Gontijo Lopes, Abhishek Das, Jue W ang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, T im K Marks, and Chiori Hori, “ Audio visual scene-aware dialog (a vsd) challenge at dstc7, ” arXiv preprint , 2018. 9 [13] Gunnar A. Sigurdsson, Gül V arol, Xiaolong W ang, Ivan Lapte v , Ali F arhadi, and Abhinav Gupta, “Hollywood in homes: Crowdsourcing data collection for acti vity understanding, ” ArXiv , 2016. [14] Haonan Y u, Jiang W ang, Zhiheng Huang, Y i Y ang, and W ei Xu, “V ideo paragraph captioning using hierarchical recurrent neural networks, ” CoRR , vol. abs/1510.07712, 2015. [15] Sergio Guadarrama, Ni veda Krishnamoorthy , Girish Malkarnenkar , Subhashini V enugopalan, Raymond Mooney , Tre vor Darrell, and Kate Saenko, “Y outube2text: Recognizing and describ- ing arbitrary activities using semantic hierarchies and zero-shot recognition, ” in Pr oceedings of the IEEE International Confer ence on Computer V ision , 2013, pp. 2712–2719. [16] Jun Xu, T ao Mei, T ing Y ao, and Y ong Rui, “Msr-vtt: A large video description dataset for bridging video and language, ” in Pr oceedings of the IEEE Confer ence on Computer V ision and P attern Recognition (CVPR) , 2016. [17] Du T ran, Lubomir D. Bourdev , Rob Fergus, Lorenzo T orresani, and Manohar Paluri, “Learning spatiotemporal features with 3d con volutional networks, ” in 2015 IEEE International Con- fer ence on Computer V ision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , 2015, pp. 4489–4497. [18] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a ne w model and the kinetics dataset, ” in CVPR , 2017. [19] Chiori Hori, T akaaki Hori, T eng-Y ok Lee, Ziming Zhang, Bret Harsham, John R Hershey , T im K Marks, and Kazuhiko Sumi, “ Attention-based multimodal fusion for video description, ” in ICCV , 2017. [20] Y usuf A ytar, Carl V ondrick, and Antonio T orralba, “Soundnet: Learning sound representations from unlabeled video, ” in NIPS , 2016. 
[21] Chiori Hori, T akaaki Hori, T im K Marks, and John R Hershe y , “Early and late inte gration of audio features for automatic video description, ” in ASR U , 2017. [22] S. Hershey , S. Chaudhuri, D. P . W . Ellis, J. F . Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney , R. J. W eiss, and K. W ilson, “CNN architectures for large-scale audio classification, ” in ICASSP , 2017. [23] Andrej Karpathy , George T oderici, Sanketh Shetty , Thomas Leung, Rahul Sukthankar , and Li Fei-Fei, “Large-scale video classification with con volutional neural networks, ” in Pr oceedings of the IEEE confer ence on Computer V ision and P attern Recognition , 2014, pp. 1725–1732. [24] W ill Kay , Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra V ijaya- narasimhan, Fabio V iola, T im Green, Tre vor Back, Paul Natsev , et al., “The kinetics human action video dataset, ” arXiv , 2017. [25] Diederik Kingma and Jimmy Ba, “ Adam: A method for stochastic optimization, ” arXiv preprint arXiv:1412.6980 , 2014. 10