Compositional Structure Learning for Sequential Video Data


Authors: Kyoung-Woon On, Eun-Sol Kim, Yu-Jung Heo, Byoung-Tak Zhang

Kyoung-Woon On (1), Eun-Sol Kim (2), Yu-Jung Heo (1), Byoung-Tak Zhang (1)
(1) Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea; (2) Kakao Brain, Seongnam, South Korea

Abstract

Conventional sequential learning methods such as Recurrent Neural Networks (RNNs) focus on interactions between consecutive inputs, i.e. first-order Markovian dependency. However, most sequential data, such as videos, have complex temporal dependencies that imply variable-length semantic flows and their compositions, and these are hard to capture with conventional methods. Here, we propose Temporal Dependency Networks (TDNs) for learning video data by discovering these complex structures of the videos. TDNs represent a video as a graph whose nodes and edges correspond to frames of the video and their dependencies, respectively. Via a parameterized kernel with graph-cut and graph convolutions, TDNs find compositional temporal dependencies of the data in multilevel graph form. We evaluate the proposed method on the large-scale video dataset YouTube-8M. The experimental results show that our model efficiently learns the complex semantic structure of video data.

1. Introduction

A fundamental problem in learning sequential data is to find the semantic structures underlying the sequences for better representation learning. In particular, the most challenging problems are to segment a whole long sequence into multiple semantic units and to find their compositional structure.

In terms of neural network architectures, many problems with sequential inputs are solved using Recurrent Neural Networks (RNNs), as they naturally take sequential inputs frame by frame. However, as RNN-based methods take frames in (incremental) order, their parameters are trained to capture patterns in transitions between
Correspondence to: Byoung-Tak Zhang <btzhang@snu.ac.kr>. Presented at the ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Data. Copyright 2019 by the author(s).

successive frames, which makes it hard to find long-term temporal dependencies across the whole sequence. For this reason, variants such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Chung et al., 2014) propose to ignore noisy (unnecessary) frames and to maintain the semantic flow through the whole sequence by turning switches on and off. However, it is still hard to retain multiple semantic flows and to learn their hierarchical and compositional relationships.

In this work, we propose Temporal Dependency Networks (TDNs), which can discover the composite dependency structure of video inputs and use it for representation learning of videos. The composite structures are defined in a multilevel graph form, which makes it possible to find long-range dependencies and their hierarchical relationships effectively. A single input video is represented as a temporal graph, where nodes represent frames of the video and edges represent relationships between two nodes. From this input representation, TDNs find composite temporal structures in the graphs with two key operations: a temporally constrained normalized graph-cut and graph convolutions. A set of semantic units is found by cutting the input graph with temporally constrained normalized graph-cuts. The cutting operation is conducted on the weighted adjacency matrix of the graph, which is estimated by parameterized kernels. After obtaining a new adjacency matrix from the cutting operations, the representations of the inputs are updated by applying graph convolution operations. By stacking these operations, compositional structures over all frames are discovered in a multilevel graph form.
Furthermore, the proposed method can be learned in an end-to-end manner.

We evaluate our method on the YouTube-8M dataset for the video understanding task. As a qualitative analysis of the proposed model, we visualize the semantic temporal dependencies of sequential input frames, which are constructed automatically.

The remainder of the paper is organized as follows. To make the discussion clear, the problem statement of this paper is described in the next section. After that, Temporal Dependency Networks (TDNs) are presented in detail, and the experimental results on the real dataset YouTube-8M follow.

2. Problem Statement

We consider videos as inputs, and a video is represented as a graph G. The nodes of G correspond to the frames of the video, each with a feature vector, and the dependencies between two nodes are represented by the weight values of the corresponding edges.

Suppose that a video X has N successive frames and each frame has an m-dimensional feature vector x ∈ R^m. Each frame corresponds to a node v ∈ V of the graph G, and the dependency between two frames v_i, v_j is represented by a weighted edge e_ij ∈ E. From G = (V, E), the dependency structure among video frames is defined as the weighted adjacency matrix A, where A_ij = e_ij.

With the aforementioned notations and definitions, we can now formally define the problem of video representation learning as follows. Given the video frame representations X ∈ R^{N×m}, we seek to discover a weighted adjacency matrix A ∈ R^{N×N} which represents the dependencies among frames:

    f : X → A    (1)

With X and A, the final representation of the video h ∈ R^l is acquired by g:

    g : {X, A} → h    (2)

The obtained video representation h can be used for various video understanding tasks. In this paper, we mainly consider the multi-label classification problem for video understanding.
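The two-stage formulation of Eqs. (1)–(2) can be sketched as follows. This is a minimal illustration with random stand-ins for the learned mappings: the actual f is the parameterized kernel and g the graph-convolutional aggregator defined in Section 3, not the placeholder similarity and weighted mean used here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, l = 6, 4, 3                   # frames, feature dim, output dim
X = rng.standard_normal((N, m))     # frame representations X in R^{N x m}

def f(X):
    # Eq. (1), f: X -> A. Placeholder dependency estimator: non-negative
    # frame-to-frame similarities; the paper learns this kernel (Sec. 3.1).
    return np.maximum(X @ X.T, 0.0)

def g(X, A, W):
    # Eq. (2), g: {X, A} -> h. Placeholder aggregator: frames weighted by
    # their total dependency mass, then projected to R^l.
    w = A.sum(axis=1, keepdims=True)
    w = w / w.sum()
    return (w * X).sum(axis=0) @ W

A = f(X)                                   # A in R^{N x N}
h = g(X, A, rng.standard_normal((m, l)))   # h in R^l
```

The point of the split is that A is an explicit, inspectable object: the qualitative analysis in Section 4.2 visualizes exactly this matrix.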
3. Temporal Dependency Networks

The Temporal Dependency Networks (TDNs) consist of two modules: a structure learning module using graph-cuts and a representation learning module using graph convolutions.

In the structure learning module, the dependencies between frames Â are estimated via parameterized kernels and the temporally constrained graph-cut algorithm. The proposed graph-cut algorithm (i) sparsifies the dense dependencies, and (ii) forms a set of temporally non-overlapping semantic units (disjoint sub-graphs) from which the compositional hierarchies are constructed.

After obtaining the matrix Â, the representations of the inputs are updated by applying graph convolutions followed by pooling operations. Furthermore, by stacking these modules, compositional structures over all frames are discovered in a multilevel graph form. Figure 1 (a) illustrates the overall structure of TDNs. The following sections describe the operations of each module in detail.

3.1. Structure Learning Module

The structure learning module consists of two steps. The first step is to learn the dependencies over all frames via the parameterized kernel K:

    Â_ij = K(x_i, x_j) = ReLU(f(x_i)^T f(x_j))    (3)

where f(x) is a single-layer feed-forward network without non-linear activation:

    f(x) = W_f x + b_f    (4)

with W_f ∈ R^{m×m} and b_f ∈ R^m.

Then Â is refined by the normalized graph-cut algorithm (Shi & Malik, 2000). The objective of the normalized graph-cut is:

    Ncut(V_1, V_2) = (Σ_{v_i∈V_1, v_j∈V_2} Â_ij) / (Σ_{v_i∈V_1} Â_i·) + (Σ_{v_i∈V_1, v_j∈V_2} Â_ij) / (Σ_{v_j∈V_2} Â_j·)    (5)

This is formulated as a discrete optimization problem and is usually relaxed to a continuous one, which can be solved as an eigenvalue problem with O(n^2) time complexity (Shi & Malik, 2000). Video data is composed of temporally continuous sub-sequences, so no two partitioned sub-graphs should overlap in physical time.
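Eqs. (3)–(5) can be sketched in NumPy as follows. W_f and b_f are random stand-ins for the trained parameters, and the bipartition passed to the Ncut objective is chosen arbitrarily for illustration; in TDNs the bipartition is produced by the constrained graph-cut described next.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 8, 5
X = rng.standard_normal((N, m))

# Stand-ins for the learned parameters of f(x) = W_f x + b_f (Eq. 4).
W_f = rng.standard_normal((m, m))
b_f = rng.standard_normal(m)

def kernel_adjacency(X, W_f, b_f):
    # A_hat[i, j] = ReLU(f(x_i)^T f(x_j)) (Eq. 3), computed for all pairs.
    F = X @ W_f.T + b_f
    return np.maximum(F @ F.T, 0.0)

def ncut(A, V1, V2):
    # Normalized-cut objective of Eq. (5) for a bipartition (V1, V2):
    # the cut weight, normalized by the total association of each side.
    cut = A[np.ix_(V1, V2)].sum()
    return cut / A[V1].sum() + cut / A[V2].sum()

A_hat = kernel_adjacency(X, W_f, b_f)
value = ncut(A_hat, [0, 1, 2, 3], [4, 5, 6, 7])
```

Because each normalization term contains the cut in its numerator's row sums, each fraction is at most 1, so the objective lies in [0, 2].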
Therefore, we add the temporal constraint (Rasheed & Shah, 2005; Sakarya & Telatar, 2008) as follows:

    (i < j or i > j) for all v_i ∈ V_1, v_j ∈ V_2    (6)

Thus, a cut can only be made along the temporal axis, and the complexity of the graph partitioning is reduced to linear time. We apply the graph-cut recursively, which yields the refined Â and multiple partitioned sub-graphs. The number of sub-graphs K is determined by the length of the video N:

    K = 2^{⌊log_2 √N⌋ − 1}    (7)

Figure 1 (b) depicts the detailed operations of the structure learning module.

3.2. Representation Learning Module

After estimating the weighted adjacency matrix Â, the representation learning module updates the representation of each frame via a graph convolution operation (Kipf & Welling, 2016) followed by a position-wise fully connected network. We also employ a residual connection (He et al., 2016) around each layer, followed by layer normalization (Ba et al., 2016):

    Z' = LN(σ(D̂^{-1} Â X W_{Z'}) + X)    (8)
    Z = LN(σ(Z' W_Z + b_Z) + Z')    (9)

where W_{Z'}, W_Z ∈ R^{m×m} and b_Z ∈ R^m.

Figure 1. (a): Overall architecture of the Temporal Dependency Networks (TDNs) for a video classification task. (b), (c): Detailed illustrations of the structure learning module and the representation learning module.

Once the representations of each frame are updated, an average pooling operation is applied to each partitioned sub-graph. We then obtain higher-level representations Z ∈ R^{K×m}, where K is the number of partitioned sub-graphs (Figure 1 (c)). In the same way, Z is fed into the next structure learning module, and we obtain the video-level representation h ∈ R^m. Finally, the labels of the video can be predicted using a simple classifier.
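The pieces of one TDN layer described above (Eqs. 6–9) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the boundary search naively re-sums the cut for every candidate t rather than updating it incrementally as a truly linear-time version would, and σ is assumed to be ReLU, which Eqs. (8)–(9) leave unspecified.

```python
import numpy as np

def best_temporal_cut(A, lo, hi):
    # Under the temporal constraint of Eq. (6), a bipartition is just a
    # boundary t splitting frames [lo, t) from [t, hi); we scan every
    # boundary and score it with the Ncut objective of Eq. (5).
    best_t, best_val = lo + 1, np.inf
    for t in range(lo + 1, hi):
        V1, V2 = np.arange(lo, t), np.arange(t, hi)
        cut = A[np.ix_(V1, V2)].sum()
        val = cut / A[V1].sum() + cut / A[V2].sum()
        if val < best_val:
            best_t, best_val = t, val
    return best_t

def num_subgraphs(N):
    # K = 2^(floor(log2 sqrt(N)) - 1) (Eq. 7).
    return 2 ** (int(np.floor(np.log2(np.sqrt(N)))) - 1)

def layer_norm(Z, eps=1e-5):
    return (Z - Z.mean(-1, keepdims=True)) / (Z.std(-1, keepdims=True) + eps)

def representation_update(X, A_hat, W_Zp, W_Z, b_Z):
    # Eqs. (8)-(9) with sigma assumed to be ReLU: a degree-normalized
    # graph convolution, then a position-wise layer, each wrapped in a
    # residual connection and layer normalization.
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
    Zp = layer_norm(np.maximum(D_inv * (A_hat @ X) @ W_Zp, 0.0) + X)
    return layer_norm(np.maximum(Zp @ W_Z + b_Z, 0.0) + Zp)

K = num_subgraphs(300)   # 8 sub-graphs for a 300-frame video
```

On an adjacency matrix with two dense diagonal blocks, the boundary search recovers the block boundary, which is exactly the behavior the blue blocks in Figure 2 (a)-1 rely on.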
4. Experiments

4.1. YouTube-8M Dataset

YouTube-8M (Abu-El-Haija et al., 2016) is a benchmark dataset for video understanding, where the task is to determine the key topical themes of a video. It consists of 6.1M video clips collected from YouTube, and the video inputs consist of two modality sequences: image and audio. Each video is labeled with one or more tags referring to its main topics. The dataset is split into three partitions: 70% for training, 20% for validation and 10% for test. As we have no access to the test labels, all results in this paper are reported on the validation set.

Each video is encoded at 1 frame per second, up to the first 300 seconds. As the raw video data is too large to handle directly, each modality is pre-processed with pretrained models by the authors of the dataset. More specifically, the frame-level visual features were extracted by an Inception-v3 network (Szegedy et al., 2016) trained on ImageNet, and the audio features were extracted by a VGG-inspired architecture (Hershey et al., 2017) trained for audio classification. PCA and whitening are then applied to reduce the dimensions to 1024 for the visual features and 128 for the audio features.

Global Average Precision (GAP) is used as the evaluation metric for the multi-label classification task, as in the YouTube-8M competition. For each video, 20 labels are predicted with confidence scores; the GAP score then computes the average precision across all of the predictions and all the videos.

4.2. Qualitative results: Learning compositional temporal dependencies

In this section, we demonstrate the compositional learning capability of TDNs by analyzing the constructed multilevel graphs. To make the discussion clear, four terms are used to describe the levels of the compositional dependency structure in an input video: semantic units, scenes, sequences and a video. In Figure 2, a real example using a video titled "Rice Pudding" (https://youtu.be/cD3enxnS-JY) is described to show the results.
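Before turning to the figure, the GAP metric mentioned in Section 4.1 can be sketched as follows: pool every (confidence, is_correct) prediction across all videos, rank by confidence, and average the precision at each correctly ranked label. This is a rough sketch of the idea; the official competition implementation may differ in details such as tie handling.

```python
def gap(predictions, num_positives):
    # predictions: (confidence, is_correct) pairs pooled over all videos,
    # at most 20 per video; num_positives: total ground-truth labels.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    hits, total = 0, 0.0
    for k, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            total += hits / k   # precision at each correctly ranked label
    return total / num_positives

# Tiny example: two correct labels ranked 1st and 3rd out of three
# predictions, so GAP = (1/1 + 2/3) / 2.
score = gap([(0.9, True), (0.8, False), (0.7, True)], num_positives=2)
```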
In Figure 2 (a), the learned adjacency matrices in each layer are visualized as gray-scale images: the two on the left are obtained from the 1st layer and the two on the right from the 2nd layer. To denote multilevel semantic flows, four color-coded rectangles (blue, orange, red and green) are marked, and those colors are consistent across Figure 2 (b) and (c).

Along the diagonal elements of the adjacency matrix in the 1st layer (Figure 2 (a)-1), a set of semantic units is detected, corresponding to the bright blocks (blue). Interestingly, we find that each semantic unit contains highly correlated frames. For example, #1 and #2 are shots introducing the YouTube cooking channel and how to make rice pudding, respectively. #4 and #5 are shots showing a recipe for rice pudding and explaining the various kinds of rice pudding. #6 and #7 are shots putting ingredients into boiling water in the pot and bringing milk to a boil along with other ingredients. At the end of the video clip, #11 is a shot decorating the cooked rice pudding and #12 is an outro shot that invites the viewers to subscribe.

Figure 2. An example of the constructed temporal dependency structures for a real input video, titled "Rice Pudding" (https://youtu.be/cD3enxnS-JY). The topical themes (labels) of this video are {Food, Recipe, Cooking, Dish, Dessert, Cake, Baking, Cream, Milk, Pudding, Risotto}. (a): Learned adjacency matrices in layers 1 and 2 are visualized. The strength of each connection is encoded in gray-scale, with 1 as white and 0 as black. (a)-1: 12 bright blocks in layer 1 are detected (blue rectangles); each block (highly connected frames) represents a semantic unit. (a)-2: Sub-graphs of the input are denoted by orange rectangles, showing that semantically meaningful scenes are found by the temporally constrained graph-cut.
(a)-3 and (a)-4: Learned high-level dependency structures in layer 2, marked with red and green rectangles. (b): A conceptual illustration of the learned temporal dependencies. In the 1st layer, the temporal dependency structure is learned only within the sub-graphs; in the 2nd layer, interconnections between sub-graphs are learned to capture high-level temporal dependencies. (c): The whole composite temporal dependency structure.

These semantic units compose variable-length scenes of the video, and each scene corresponds to a sub-graph obtained via the graph-cut (Figure 2 (a)-2). For example, #13 is a scene introducing this cooking channel and rice pudding. Likewise, #15 is a scene making rice pudding step by step in detail, and #16 is an outro scene wrapping up with the cooked rice pudding. The 1st layer of the model updates the representations of the frame-level nodes with these dependency structures, then aggregates the frame-level nodes to form scene-level nodes (Layer 1 in Figure 2 (b)).

In Figure 2 (a)-3 and (a)-4, the sequence-level semantic dependencies (red) are shown. #17 denotes a sequence of making rice pudding from beginning to end, which contains much of the information needed to identify the topical themes of this video. Finally, the representations of the scenes are updated and aggregated to obtain the representation of the whole video (Layer 2 in Figure 2 (b)). Figure 2 (c) presents the whole composite temporal dependency structure.

5. Conclusions

In this paper, we proposed TDNs, which learn not only representations of multimodal sequences but also the composite temporal dependency structures within a sequence. A qualitative experiment conducted on a real large-scale video dataset shows that the proposed model efficiently learns the inherent dependency structure of temporal data.
Acknowledgements

This work was partly supported by the Korea government (2015-0-00310-SW.StarLab, 2017-0-01772-VTT, 2018-0-00622-RMI, 2019-0-01367-BabyMind, 10060086-RISF, P0006720-GENKO).

References

Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., and Vijayanarasimhan, S. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 131–135. IEEE, 2017.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Rasheed, Z. and Shah, M. Detection and representation of scenes in videos. IEEE Transactions on Multimedia, 7(6):1097–1105, 2005.

Sakarya, U. and Telatar, Z. Graph-based multilevel temporal video segmentation. Multimedia Systems, 14(5):277–290, 2008.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.
