OmniNet: A unified architecture for multi-modal multi-task learning


Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain

Subhojeet Pramanik (1), Priyanka Agrawal (2), and Aman Hussain (3)
1 IBM Cloud, email@subho.in
2 IBM Research, pagrawal.ml@gmail.com
3 University of Amsterdam, email@amanhussain.com

Abstract. The Transformer is a widely used neural network architecture, especially for language understanding. We introduce an extended and unified architecture that can be used for tasks involving a variety of modalities such as image, text and video. We propose a spatio-temporal cache mechanism that enables learning the spatial dimension of the input in addition to the hidden states corresponding to the temporal input sequence. The proposed architecture further enables a single model to support tasks with multiple input modalities as well as asynchronous multi-task learning; we therefore refer to it as OmniNet. For example, a single instance of OmniNet can concurrently learn to perform part-of-speech tagging, image captioning, visual question answering and video activity recognition. We demonstrate that training these four tasks together yields a model about three times smaller while retaining the performance of training them individually. We also show that this network, pre-trained on some modalities, assists in learning unseen tasks such as video captioning and video question answering. This illustrates the generalization capacity of the self-attention mechanism over the spatio-temporal cache present in OmniNet.

Keywords: multi-modal, multi-task learning, transformer, spatio-temporal, attention-networks, neural-network

1 Introduction

The Transformer [38] is currently one of the best performing models for sequence transduction tasks, especially those involving natural language. It was originally designed for a single task at a time.
In fact, most generic deep learning architectures [3, 37, 39] are designed to learn, albeit very well, a single task, and to handle one task-specific input domain such as image, text or audio. Furthermore, with these models we often rely on the generalization capability of the trained network to guarantee performance on unseen examples. Transfer learning [12, 33] is another popular paradigm used to adapt a model to a related task with a similar input domain. The success of neural networks across these challenges is known to be due to their ability to learn effective representations of the data. For example, the self-attention mechanism in Transformers captures the global temporal dependence in sequential data very well. Naturally, the question arises whether we can extend such architectures, like the Transformer, to learn shared representations from multiple input domains and to attend over these representations to perform a multitude of tasks concurrently.

Research into multi-task models that learn to solve varied tasks across a multitude of input domains is not new. Work in [25] demonstrates an architecture capable of learning a shared representation across audio and video modalities. Similarly, in [6] a convolutional architecture is designed to support a variety of NLP tasks. However, most of these architectures are designed to learn a specific set of tasks with known input domains. To the best of our knowledge, there does not exist a single unified architecture that works out of the box for any combination of multi-modal inputs. To address this gap, we extend the Transformer towards a unified architecture, namely OmniNet, which enables a single model to support tasks with multiple input modalities and asynchronous multi-task learning.
We consider that most real-life data, such as image, text, speech and video, is a direct conjunction of spatial and temporal components. Therefore, we employ a spatio-temporal cache mechanism to learn a shared representation of the input data across the spatial (space) and temporal (time) dimensions. Using a generalized encode() function, OmniNet can process and store a spatio-temporal representation for each of the input domains and then decode() predictions across a multitude of tasks. In our experiments, we train a single instance of OmniNet to solve a number of tasks spanning multiple domains: part-of-speech tagging, image captioning, visual question answering and video activity recognition. To make our work reproducible, open to scrutiny and open to further development, we will open source a demonstration of our system implemented in PyTorch [27].

2 Related Work

Multi-task learning has been extensively studied in the literature, with applications to a wide set of problems ranging from natural language processing (NLP) [6, 7, 9, 13] to speech recognition [18, 30] to vision [1, 4, 26, 40]. It has also found use in combinations of diverse tasks such as image captioning and text translation and parsing [22, 29, 41]. However, most of these architectures assume the set of tasks to be known in advance. Similarly, multi-modal learning has been essential for solving a broad range of interesting problems such as Visual Question Answering [15, 16] and Video Question Answering [20]. Again, the state-of-the-art models are highly specific to the objective at hand and not easily adaptable to different tasks or domains. [14] proposed the MultiModel architecture for learning multiple tasks, but it lacks support for multi-modal tasks with more than one input domain, such as visual question answering.
3 Proposed Model

We propose a unified architecture, namely OmniNet, to enable learning multi-modal tasks with multiple input domains and to support generic multi-tasking for any set of tasks. The OmniNet architecture consists of multiple sub-networks, called peripheral networks, connected to a common central neural network called the Central Neural Processor (CNP) (Figure 1). Each peripheral network encodes its domain-specific input into a feature representation. In this work, we describe image, text and video peripherals (Section 3.1); one can add more, say a speech peripheral, depending on the task. The output representation of a peripheral network is always a spatio-temporal tensor x ∈ R^(t×s×d_model), where t and s are the temporal and spatial dimensions of the input respectively, and d_model is the model dimension input to the CNP.

Fig. 1. OmniNet performing image captioning, visual question answering and POS tagging at once

The spatio-temporal representations generated by the peripheral networks corresponding to each input domain are then processed by the CNP. The CNP uses a fully attention-based encoder-decoder model [2, 5, 36] for sequence transduction, similar to the Transformer architecture [38], which is the state of the art for multiple language modeling tasks (Section 3.2). During the encoding stage, the CNP implements a generic encode(x, D) function to first process and store the spatio-temporal representations of the input, where x ∈ R^(t×s×d_model) is the spatio-temporal tensor produced by the peripheral networks, D ∈ Z, 0 ≤ D < D_len, is the domain id, and D_len is the maximum number of domains supported by the CNP. The encode() function is called multiple times, once for each multi-modal input from the respective peripheral.
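To make the calling convention concrete, here is a hypothetical stub-level sketch of the encode()/decode() call pattern for a VQA-style example. The domain and task ids are invented for illustration; the real functions operate on tensors and populate the caches described below.

```python
# Stub sketch of the CNP interface: one encode() call per input,
# then a single decode() call with the task id. The ids here are
# assumptions for illustration, not values from the paper.
calls = []

def encode(x, D):
    """Record that input x from domain D was encoded."""
    calls.append(("encode", D))

def decode(y_shifted, task_id):
    """Record the decoding call and stand in for softmax outputs."""
    calls.append(("decode", task_id))
    return "softmax probabilities"

VISION, TEXT = 0, 1   # hypothetical domain ids
VQA = 2               # hypothetical task id

encode("image tensor (1, 49, d_model)", VISION)     # first input
encode("question tensor (t, 1, d_model)", TEXT)     # second input
out = decode("shifted target labels", VQA)          # one decoding pass
```

The point of the pattern is that the CNP itself is task-agnostic: any number of encode() calls from any mix of peripherals may precede a decode() for any registered task.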
During the decoding stage, a decode(y_shifted, τ) function is used to decode predictions as softmax probabilities, where y_shifted ∈ Z^(N−1) are the target outputs shifted one time step to the right, N is the length of the output sequence, τ ∈ Z, 0 ≤ τ < τ_len, is the task id and τ_len is the total number of supported tasks. The decoding step is similar to [38], modified to incorporate a two-step attention mechanism over the spatial and temporal caches.

3.1 Peripheral networks

First, we elaborate on how we support multiple input domains using peripheral networks. A peripheral network can use a pre-trained model from the existing literature to encode a given domain input into a standardized feature representation x ∈ R^(t×s×d_model), where t and s are the temporal and spatial dimensions of the input respectively, and d_model is the model dimension input to the Central Neural Processor. Here we detail the text and vision peripherals; one can add more peripherals or alter the peripheral design depending on the task.

Vision peripheral: This peripheral uses a convolutional neural network to encode the image and video inputs of a task. For an image of dimension h × w × n_c, the peripheral down-samples it to h' × w' × n'_c, where h, w and n_c are the height, width and number of input channels respectively. For a video, each frame is input to the peripheral to produce F × h' × w' × n'_c, where F is the total number of frames in the video. The encoding vectors are then projected to dimension d_model using a fully connected layer. The output is then reshaped into a spatio-temporal tensor x ∈ R^(t×h'w'×d_model), where t = 1 for an image and t = F for a video. In our experiments, we use the pre-trained ResNet-152 model, a variant of ResNet [10] consisting of 152 convolutional layers.
We remove the final fully connected and average-pooling layers to generate spatial feature representations for a given image/video.

Language peripheral: The language peripheral uses byte-pair encoding [31] to generate subwords for a given input sentence. The subwords are passed to an embedding layer to generate subword embeddings of dimension d_emb, which are projected to dimension d_model using a fully connected layer. The output is then reshaped into a spatio-temporal tensor x ∈ R^(t×1×d_model), where t equals the number of subwords in the input sentence. As textual data has no spatial dimension, the spatial dimension of x from a language peripheral is always 1. In our experiments, we used pre-trained subword embeddings with d_emb = 300 and vocabulary size 25000 from [11], which provides pre-trained subword embeddings for over 275 languages, to initialize the weights of the embedding matrix.

3.2 Central Neural Processor (CNP)

To process the spatio-temporal information in the input data, the CNP implements a spatial cache C_s, a temporal cache C_t and a link array L. The spatial and temporal caches and the link array are lists of elements, initialized as empty before the encoding process. During the encoding stage, an encode() routine takes as input the tensor x generated by a peripheral and the corresponding domain/peripheral id D. This function processes the spatial and temporal information in the input x, stores them into the spatial cache C_s and the temporal cache C_t respectively, and stores their dimensions t and s in the link array. For a given task, this encode() routine is called K times, where K is the number of inputs in the task. Note that these inputs can belong to the same or different domains.

Fig. 2. Left: TemporalEncoder architecture; right: OmniNet decode() architecture.
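The shape conventions of the two peripherals described above can be sketched in a few lines. This is an illustration rather than the authors' code; the stride-32 downsampling factor is an assumption matching a ResNet-style backbone that reduces a 224×224 input to a 7×7 grid, and d_model = 512 follows the Transformer base configuration the paper reuses.

```python
def vision_peripheral_shape(frames, h, w, d_model, stride=32):
    """Output shape of the vision peripheral: (t, h'*w', d_model).

    A ResNet-style backbone downsamples each h x w frame by `stride`;
    t = 1 for an image and t = F for an F-frame video.
    """
    s = (h // stride) * (w // stride)   # flattened spatial grid h' * w'
    return (frames, s, d_model)

def language_peripheral_shape(num_subwords, d_model):
    """Output shape of the language peripheral: (t, 1, d_model).

    Subword embeddings (d_emb = 300 in the paper) are projected to
    d_model; text has no spatial extent, so s is always 1.
    """
    return (num_subwords, 1, d_model)

image = vision_peripheral_shape(1, 224, 224, 512)    # -> (1, 49, 512)
video = vision_peripheral_shape(16, 224, 224, 512)   # -> (16, 49, 512)
text = language_peripheral_shape(12, 512)            # -> (12, 1, 512)
```

Whatever the domain, the peripheral's job ends with a tensor in this common (t, s, d_model) layout, which is what lets the CNP stay domain-agnostic.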
Encode(x, D): For a given input x ∈ R^(t×s×d_model) and domain identifier D, the encode() routine is described in Algorithm 1. Since inputs can come from multiple peripherals, the algorithm first concatenates the input with the domain embedding to ensure a domain-aware encoding of the input (Steps 2 to 3). Steps 4 to 7 process the spatial information in x by unrolling the time dimension and adding these unrolled vectors to the spatial cache. Steps 8 to 10 process the temporal information in x by averaging over the spatial dimension of x and then passing the averaged tensor to a self-attention based TemporalEncoder. This TemporalEncoder, similar to the encoder used in [38] as shown in Figure 2, is used to calculate temporal embeddings of the input sequence. The output from the TemporalEncoder is appended to the temporal cache.

Algorithm 1 encode(): Encodes spatial and temporal representations into the spatial and temporal caches
Require: x ∈ R^(t×s×d_model), D, C_s, L, C_t
1: L ← L ∪ (t → s)
2: D_emb ← EmbedLayer(D)
3: x ← FC(Concat(x, D_emb), d_model)
4: if s > 1 then
5:   S ← Reshape(x, (ts, d_model))  {output S = [S_1, ..., S_ts] s.t. S_i ∈ R^d_model is a spatial feature vector}
6:   C_s ← C_s ∪ [S_1, ..., S_ts]  {append spatial representations to the spatial cache}
7: end if
8: T ← (Σ_{i=1}^{s} x[:, i, :]) / s
9: T ← TemporalEncoder(T)  {output T = [T_1, ..., T_t] s.t. T_j ∈ R^d_model is the encoding of the temporal dimension of x}
10: C_t ← C_t ∪ [T_1, ..., T_t]  {append temporal representations to the temporal cache}

The above encoding routine keeps appending spatio-temporal information to C_t and C_s for each input x^k ∈ R^(t_k×s_k×d_model). Note that the superscript k denotes correspondence to the k-th input of the task, where k ∈ 1, ..., K.
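A minimal pure-Python sketch of Algorithm 1's cache bookkeeping follows. The learned components (domain embedding, FC projection, TemporalEncoder) are deliberately omitted and replaced by a plain spatial average, so this shows only the unroll-and-append logic, not the real model.

```python
def encode(x, C_s, C_t, L):
    """Sketch of Algorithm 1; x is a nested list of shape (t, s, d_model).

    Appends t*s spatial vectors to C_s (only when s > 1) and t temporal
    vectors to C_t, and records (t, s) in the link array L. The learned
    steps (EmbedLayer, FC, TemporalEncoder) are omitted from this sketch.
    """
    t, s = len(x), len(x[0])
    L.append((t, s))                       # step 1: link array entry
    if s > 1:                              # steps 4-7: unroll time, cache spatial vectors
        for step in x:
            C_s.extend(step)
    for step in x:                         # steps 8-10: average over the spatial axis
        d = len(step[0])
        avg = [sum(vec[j] for vec in step) / s for j in range(d)]
        C_t.append(avg)                    # stand-in for the TemporalEncoder output

C_s, C_t, L = [], [], []
image = [[[0.0] * 8 for _ in range(49)]]   # (t=1, s=49, d=8)
text = [[[0.0] * 8] for _ in range(5)]     # (t=5, s=1, d=8)
encode(image, C_s, C_t, L)
encode(text, C_s, C_t, L)
# C_s now holds 49 spatial vectors; C_t holds 1 + 5 temporal vectors
```

Note how the text input contributes nothing to the spatial cache, exactly as the s > 1 guard in Algorithm 1 prescribes.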
After K calls, we have the temporal cache C_t = [T_1, ..., T_R], where R = Σ_{r=1}^{K} t_r; the spatial cache C_s = [S_1, ..., S_P], where P = Σ_{p : s_p > 1} t_p · s_p for p ∈ 1, ..., K; and the link array L = [(t_1 → s_1), ..., (t_K → s_K)]. Note that C_s can be empty if encode() is only called with inputs having s_k = 1 for all k. Next, we use the decode() routine to generate predictions as softmax probabilities.

Decode(y_shifted, τ): The architecture of the decode() function is shown in Figure 2. decode() takes as arguments the output labels y_shifted, shifted one time step to the right, and a task id τ, and generates predictions by attending over the spatial and temporal caches. The decode() function is structured similarly to the decoder used in the Transformer architecture [38] and jointly attends over the vectors stored in the temporal and spatial caches. As in [38], decoding first attends over the output embeddings using masked multi-head scaled dot-product attention. The attention layer for the temporal cache uses multi-head scaled dot-product attention as specified in [38]. The attention layer for the spatial cache uses gated multi-head attention to attend over the elements of the spatial cache. For inputs with both time and space dimensions (e.g. video), we want the spatial attention layer to attend more to frames which have relatively high attention scores in the temporal-cache attention layer. Therefore, the attention score output of the temporal-cache multi-head attention layer, A ∈ R^(n_h×N×R), is used to calculate the tensor G ∈ R^(n_h×N×P) that gates the attention scores of the spatial attention layer, where n_h is the number of heads in multi-head attention as described in [38]. The tensor G is calculated from A and L as detailed in Algorithm 2.
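The cache sizes after K encode() calls follow directly from the link array, as a small sketch under the definitions above shows:

```python
def cache_sizes(link_array):
    """Temporal-cache length R and spatial-cache length P from the link array.

    R sums t_k over all K inputs; P sums t_p * s_p only over inputs that
    actually have a spatial extent (s_p > 1), matching Algorithm 1's guard.
    """
    R = sum(t for t, s in link_array)
    P = sum(t * s for t, s in link_array if s > 1)
    return R, P

# VQA-style task: one image (t=1, s=49) plus one question (t=12, s=1)
R, P = cache_sizes([(1, 49), (12, 1)])   # R = 13, P = 49
```

A text-only task such as POS tagging gives P = 0, i.e. an empty spatial cache, which is why its decoding reduces to the plain Transformer decoder.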
Given Q, the matrix of queries, K, the keys of dimension d_k, and V, the values of dimension d_v (for brevity, we reuse the notation of [38]), the scaled dot-product attention for the spatial layer is modified as:

  Attention(Q, K, V, G) = (Softmax(QKᵀ/√d_k) ⊙ G) V   (1)

where ⊙ denotes the element-wise product with the gating tensor G.

In order to use the same CNP for multiple tasks with varying output vocabularies, we use multiple output embedding layers OutputEmbedLayer_1, ..., OutputEmbedLayer_τlen to generate the output embeddings for each task. At the final layer, we use multiple classification layers (FC + Softmax)_1, ..., (FC + Softmax)_τlen, one for each task. We also calculate a task embedding vector from τ and always start decoding with the task embedding vector.

Algorithm 2 Calculate G using the output scores from the temporal attention and the link array
Require: L, A
1: idx ← 0
2: G ← []
3: for each (t, s) in L do
4:   if s > 1 then
5:     A' ← A[:, :, idx : idx + t]
6:     A' ← Expand(A', (n_h, N, t, s))  {Expand(tensor, dimension) expands a tensor to the given dimension}
7:     A' ← Reshape(A', (n_h, N, ts))
8:     G ← G ∪ A'  {append the respective temporal attention scores to G}
9:   end if
10:  idx ← idx + t
11: end for
12: G ← Stack(G)  {stack the list of tensors to construct the tensor G of dimension (n_h, N, P)}

3.3 Multi-task learning

In order to train a single model simultaneously on multiple tasks, we use the HogWild training approach described in [28]. Similar to the approach described in [24], the main process holds a global copy of the model. We create a separate worker process for each task, and each process maintains a local copy of the model. At each training iteration, a process starts by synchronizing its local model with the global copy, performs forward and backward propagation on its local copy, and then copies the locally computed gradients to the global model asynchronously.
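A per-head, per-query sketch of Algorithm 2 follows. It works over plain Python lists for a single head and a single query position; the real tensor operation performs the same expansion for all n_h heads and N query positions at once.

```python
def build_gating_tensor(A, link_array):
    """Sketch of Algorithm 2 for one attention head and one query position.

    A is the list of R temporal attention scores. Each frame's score is
    repeated s times so it gates that frame's s spatial vectors, producing
    a list of P gating values (the full G has shape (n_h, N, P)).
    """
    G, idx = [], 0
    for t, s in link_array:
        if s > 1:
            for score in A[idx:idx + t]:   # scores for this input's t time steps
                G.extend([score] * s)      # Expand + Reshape over s positions
        idx += t                           # skip past this input's temporal slots
    return G

# A video input (t=2, s=3) followed by a question (t=2, s=1):
G = build_gating_tensor([0.4, 0.1, 0.3, 0.2], [(2, 3), (2, 1)])
# Each frame score is repeated over its 3 spatial positions, and the
# question (s=1) contributes nothing: [0.4, 0.4, 0.4, 0.1, 0.1, 0.1]
```

The effect in Equation 1 is that spatial positions belonging to temporally salient frames are amplified, while positions from frames the temporal attention ignores are suppressed.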
Each process then calls the global model optimizer asynchronously to update the weights of the global model. Instead of storing the model on the CPU as in [24], we always store the local copies across multiple GPUs.

4 Tasks and Setup

To evaluate the effectiveness of our proposed framework for tasks spanning diverse modalities, we choose a set covering all possible spatio-temporal data archetypes: Image Captioning, Part-of-Speech (POS) tagging, Visual Question Answering (VQA) and Video Activity Recognition. Each of these tasks explores a unique spatio-temporal configuration of the input, with different values of t and s, the temporal and spatial dimensions of the input. This enables us to perform a comprehensive study of the multi-modal and multi-input capabilities of the system. We elaborate on the properties of these tasks below.

For training, we always use the cross-entropy loss with the Adam optimizer [17] and schedule the learning rate using the Noam scheduler [32], similar to [38]. (The hyperparameter values used for n_h, d_model, N_layers, d_k and d_v are the same as those specified for the Transformer base model [38].) In the vision peripheral, we freeze the layers of the pre-trained ResNet model; the remaining parts of the architecture (peripherals and CNP) are kept trainable. In this section, we provide details on the datasets used and the model setup for each of these tasks.

Part-of-speech (POS) tagging: To illustrate a task with only a temporal modality (t > 1 and s = 1, where t and s are the temporal and spatial dimensions of the input respectively), we consider the POS tagging problem. Given an input sequence of words, the model should produce a sequence of POS tags corresponding to each word. We use the Penn Treebank [23] (https://catalog.ldc.upenn.edu/LDC99T42; we use splits 0-18 for training, 19-21 for development and 22-24 for test), which contains gold annotations on English WSJ articles by experts.
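The per-worker iteration described above can be simulated in a toy, single-threaded form. This is only a sketch of the sync/compute/apply pattern: the model is a list of weights, the "gradient" is supplied rather than computed by backpropagation, the optimizer is reduced to plain SGD, and in real HogWild training the workers run as asynchronous processes sharing the global model without locks.

```python
def worker_step(global_model, gradient, lr=0.1):
    """One HogWild-style iteration for a task worker (toy simulation).

    1) sync: copy the global weights into the local model,
    2) compute: forward/backward on the local copy (gradient given here),
    3) apply: write the SGD update into the global model without locking.
    """
    local_model = list(global_model)        # step 1: synchronize with global copy
    update = [lr * g for g in gradient]     # step 2: locally computed update
    for i, u in enumerate(update):          # step 3: asynchronous global update
        global_model[i] -= u
    return local_model

global_model = [1.0, 2.0]
worker_step(global_model, [1.0, -1.0])   # e.g. the POS-tagging worker
worker_step(global_model, [0.5, 0.5])    # e.g. the captioning worker
# global_model is now approximately [0.85, 2.05]
```

Because every worker writes into the same shared parameters, tasks implicitly regularize one another, which is the mechanism behind the shared-representation results in Section 5.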
During the encoding stage, each input sentence is processed by the language peripheral to generate a spatio-temporal tensor x ∈ R^(t×1×d_model), where t is the number of subwords in the input. The CNP encode() function is then used to encode x into the temporal cache. Note that the spatial cache is empty for text inputs, as s = 1. Therefore, the decoding stage is the same as in the Transformer, predicting the sequence of POS tags.

Image Captioning: This task represents those with inputs containing only a spatial modality (t = 1 and s > 1). The captioning model is required to predict a text caption for a given image. We use the MSCOCO 2014 dataset [21] for training and present results on the COCO validation set. During the encoding stage, the input image is resized to 224 × 224 and processed by the vision peripheral containing the pre-trained ResNet-152 to produce image embeddings x ∈ R^(1×49×d_model). x is then input to the encode() function, which populates the corresponding spatial and temporal caches. The decoding stage uses the decode() function with an output vocabulary size of 25000 to generate the captions.

Visual Question Answering: For a task with inputs from multiple domains, each containing either a spatial or a temporal modality (either t > 1 and s = 1, or t = 1 and s > 1 for each input), we choose visual question answering. Given a question over an image, the model is supposed to predict the correct answer label. We use the recently introduced VQA v2.0 dataset [8] for this purpose and perform evaluation on the VQA test-dev set.
All images are resized to 224 × 224 before training. The encoding stage of this task utilizes two peripherals: the vision peripheral generates a tensor x_1 ∈ R^(1×49×d_model) for the input image, and the language peripheral encodes the question into x_2 ∈ R^(t×1×d_model), where t is the number of subwords in the question. The encode() function is then called twice, first with x_1 and then with x_2 as input. Finally, decode() with an output vocabulary size of 3500 is used to generate the answers as softmax probabilities in a single decoding step.

Video Activity Recognition: For tasks which contain both spatial and temporal modalities in a single input (t > 1 and s > 1), we consider the action recognition task on videos. For this purpose, we use the HMDB dataset [19], which consists of over 5000 short clips of real-life actions in 51 classes. We present our results on train-test split 1. We use 16 frames per video and resize each of them to 224 × 224. During the encoding stage, each frame of the video is passed through the vision peripheral to cumulatively generate a video encoding x ∈ R^(16×49×d_model), which is then used as input to the encode() function. Finally, decode() with an output vocabulary size of 51 is used to predict the action as softmax probabilities in a single decoding step.

5 Results and Discussion

We present the evaluation on (a) tasks of the various modalities illustrated in Section 4, (b) the multi-tasking setup for these tasks (Table 1), and (c) reuse of the multi-task model for an unseen task (Figure 3). In addition, we provide ablation studies on the architecture (Table 2).

Performance of the proposed architecture on individual tasks: We choose a set of four tasks with diverse input modalities and combinations as described in Section 4, and train an OmniNet model independently on each of these tasks.
Each of the tasks demonstrates unique capabilities of this generic architecture. More specifically, in Table 1 we compare our results with the following state of the art (since most of these tasks are popular challenges, we compare with state-of-the-art methods that are generically applicable to the respective task rather than specific to the challenge dataset): POS tagging: [35]; image captioning and VQA: [1]; HMDB: [34]. It is important to note that we do not perform any hyperparameter optimization. We believe that, with more computational power, tuning the hyperparameters for these tasks should result in performance comparable or even superior to the state of the art. These results can serve as a baseline for any future work that aims to use a single architecture across the various possible spatio-temporal archetypes. It is interesting to note that the model is extensible to a new domain without any modification to the CNP, as long as one can add a specific peripheral to convert domain inputs into spatio-temporal tensors. This aspect of the architecture makes it applicable to several popular multi-modal tasks.

          POS    Captioning       Visual Question Answering      HMDB
          Acc.   BLEU-4  Meteor   Overall  Y/N    Num.   Other   Acc.    #PARAMS
SOTA      97.44  36.2    27.0     63.2     80.3   42.8   55.8    59.4    -
IND       95.61  28.9    25.2     55.31    74.09  35.17  46.35   55.29   450m
MULT-3    95.82  28.8    25.2     56.79    76.75  35.82  47.16   -       149.03m
MULT-4    95.44  27.4    24.5     55.76    75.49  35.64  46.08   54.44   149.07m

Table 1. Performance of OmniNet on a diverse set of tasks. IND: model trained individually on each of the given tasks; MULT-3: multi-task model trained on POS, Captioning and VQA; MULT-4: multi-task model trained on all four tasks.
Effect of training a diverse set of tasks together: We trained two multi-task models: (1) MULT-3 (POS + VQA + Captioning) and (2) MULT-4 (POS + VQA + Captioning + HMDB), using the HogWild approach. While the MULT-3 model attains similar and sometimes better performance, the final MULT-4 model attains slightly reduced performance compared to the independent task scores. We believe this is due to the skewness in the size of the HMDB dataset, which contains only 5000 training samples. As a tradeoff, however, adding the HMDB task yields the interesting zero-shot results demonstrated below. Using a multi-task model also results in a three-fold reduction in the total number of parameters: when a separate model is used for each task, we have a total of over 450 × 10^6 parameters, whereas with a single shared multi-task model we have a total of about 149 × 10^6 parameters while achieving similar performance. Interestingly, the model is able to attend over the spatio-temporal components of inputs from different tasks and concurrently generate predictions across them, demonstrating the generalization capability of our architecture.

Towards zero-shot learning: reuse of a pre-trained network for unseen tasks: Sharing representations across multiple tasks provides the benefit of transferring useful knowledge across multiple domains. Since images and videos are processed by the same vision peripheral, we conducted an experiment to see whether our model pre-trained on all four tasks (MULT-4) can perform video captioning and video question answering without any explicit training on these tasks, i.e. zero-shot learning. The results of the evaluation on randomly picked instances from HMDB test split 1 are shown in Figure 3.
Interestingly, the model performs quite well on related actions that were present in the COCO and VQA training sets, such as captions related to horse riding and baseball, or questions related to concepts present in VQA. Without training on any video captioning or video QA instance, the model could take what it learned from image captioning, image QA (VQA) and video action recognition (HMDB) and apply it to videos to generate meaningful predictions, demonstrating the capability of the model to transfer knowledge across related multi-modal tasks. However, on concepts that are not present in the training datasets, the model either describes the environment in the video or substitutes alternate known concepts. This case study, although not comprehensive, shows the capability of the model to learn shared representations and to transfer knowledge across domains. We believe that adding more tasks and domains will lead to more interesting zero-shot learning results across a wide range of problems in the future.

Fig. 3. Results of zero-shot video captioning and video question answering.

Impact of individual architectural components: In order to support different input domains, our architecture adds the spatial cache and link array components to the original Transformer architecture (which only contains mechanisms to handle temporal data). We conducted an ablation study on each of these components to verify their importance across the various tasks, as shown in Table 2. The ablation was conducted on the independent (IND) as well as the multi-tasking (MULT-4) models. The second row ablates the link array from our architecture, i.e. removes the multiplication by G in Equation 1. The link array was designed to assist in tasks with inputs, such as video, containing both spatial and temporal modalities in a single input.
The total number of spatial components becomes very large as the number of frames in the video increases, making it difficult to attend over the various spatial regions throughout the video. Using the link array, the spatial attention layer can attend more to specific important frames in the video. Therefore, removal of the link array leads to a large reduction in performance on HMDB compared to the other tasks, which do not have both spatio-temporal modalities in any single input. Removal of the spatial cache, on the other hand, has a significant effect on performance across all tasks containing a spatial modality. Since image captioning contains primarily a spatial modality, its BLEU score drops significantly after ablation. As the other tasks utilize the temporal cache for prediction, in the multi-task setting the captioning task learns to utilize the spatial average of the image stored in the temporal cache and hence retains some performance after ablation of the spatial cache; when trained independently on captioning, the network learns to utilize only the information in the spatial cache, and hence the score drops to zero after ablation. VQA, which leverages both spatial information from the image and temporal information from the question, retains some performance through its use of the temporal cache. Note that the POS tagging task is not affected by ablation of any of the components, since its input has only a temporal modality.

                   POS (Acc.)      Captioning (BLEU-4)   VQA (Overall)    HMDB (Acc.)
                   IND    MULT-4   IND    MULT-4         IND    MULT-4    IND    MULT-4
OmniNet            95.61  95.44    28.9   27.4           55.31  55.76     55.29  54.44
w/o link array     95.61  95.44    28.9   27.4           54.05  55.30     45.94  46.79
w/o spatial cache  95.61  95.44    0      11.9           39.86  44.24     10.91  11.50

Table 2. Ablation study on the effect of the proposed architectural components.
6 Conclusions and Future Work

We present a unified neural network architecture, OmniNet, capable of learning tasks with multiple inputs of varying modalities. The architecture can further be adopted for multi-task learning across any set of tasks containing spatio-temporal data. Sharing one model across multiple tasks also results in a significant reduction in the total number of parameters. We further demonstrate that this shared model can learn robust representations from various spatio-temporal inputs, which are reusable for unseen tasks. We believe that this proposed architecture is widely applicable to any task with spatio-temporal inputs. To extend its usability, we would like to introduce new peripherals supporting more domains such as speech. We are also keen on exploring aspects of the data beyond the temporal and spatial dimensions, such as graphs and relational data. Further, it would be interesting to investigate scheduling mechanisms for optimizing the multi-tasking framework.

References

1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086 (June 2018). https://doi.org/10.1109/CVPR.2018.00636
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
3. Battenberg, E., Chen, J., Child, R., Coates, A., Gaur, Y., Li, Y., Liu, H., Satheesh, S., Seetapun, D., Sriram, A., Zhu, Z.: Exploring neural transducers for end-to-end speech recognition. CoRR abs/1707.07413 (2017)
Chen, Y., Zhao, D., Lv, L., Zhang, Q.: Multi-task learning for dangerous object detection in autonomous driving. Information Sciences 432, 559–571 (2018). https://doi.org/10.1016/j.ins.2017.08.035, http://www.sciencedirect.com/science/article/pii/S0020025517308848
5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar (Oct 2014). https://doi.org/10.3115/v1/D14-1179, https://www.aclweb.org/anthology/D14-1179
6. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning. pp. 160–167. ICML '08, ACM, New York, NY, USA (2008). https://doi.org/10.1145/1390156.1390177, http://doi.acm.org/10.1145/1390156.1390177
7. Dong, D., Wu, H., He, W., Yu, D., Wang, H.: Multi-task learning for multiple language translation. In: ACL (2015)
8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
9. Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: Growing a neural network for multiple NLP tasks. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1923–1933. Association for Computational Linguistics, Copenhagen, Denmark (Sep 2017). https://doi.org/10.18653/v1/D17-1206, https://www.aclweb.org/anthology/D17-1206
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016). https://doi.org/10.1109/CVPR.2016.90
11. Heinzerling, B., Strube, M.: BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. In: Calzolari, N. (Conference chair), Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S., Tokunaga, T. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan (May 7-12, 2018)
12. Hu, L., Kan, M., Shan, S., Chen, X.: Duplex generative adversarial network for unsupervised domain adaptation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
13. Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., Dean, J.: Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351 (2017), https://www.aclweb.org/anthology/Q17-1024
14. Kaiser, L., Gomez, A.N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., Uszkoreit, J.: One model to learn them all. CoRR abs/1706.05137 (2017), http://arxiv.org/abs/1706.05137
15. Kim, J.H., Jun, J., Zhang, B.T.: Bilinear attention networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31, pp. 1564–1574. Curran Associates, Inc. (2018), http://papers.nips.cc/paper/7429-bilinear-attention-networks.pdf
16. Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual QA. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R.
(eds.) Advances in Neural Information Processing Systems 29, pp. 361–369. Curran Associates, Inc. (2016), http://papers.nips.cc/paper/6446-multimodal-residual-learning-for-visual-qa.pdf
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
18. Krishna, K., Toshniwal, S., Livescu, K.: Hierarchical multitask learning for CTC-based speech recognition. CoRR abs/1807.06234 (2018)
19. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: Proceedings of the International Conference on Computer Vision (ICCV) (2011)
20. Lei, J., Yu, L., Bansal, M., Berg, T.L.: TVQA: localized, compositional video question answering. CoRR abs/1809.01696 (2018)
21. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. pp. 740–755. Springer International Publishing, Cham (2014)
22. Luong, T., Le, Q.V., Sutskever, I., Vinyals, O., Kaiser, L.: Multi-task sequence to sequence learning. In: International Conference on Learning Representations (2016)
23. Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., Schasberger, B.: The Penn Treebank: Annotating predicate argument structure. In: Proceedings of the Workshop on Human Language Technology. pp. 114–119. HLT '94, Association for Computational Linguistics, Stroudsburg, PA, USA (1994). https://doi.org/10.3115/1075812.1075835
24.
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Harley, T., Lillicrap, T.P., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. pp. 1928–1937. ICML'16, JMLR.org (2016), http://dl.acm.org/citation.cfm?id=3045390.3045594
25. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning. pp. 689–696. ICML'11, Omnipress, USA (2011), http://dl.acm.org/citation.cfm?id=3104482.3104569
26. Pasunuru, R., Bansal, M.: Multi-task video captioning with video and entailment generation. CoRR abs/1704.07489 (2017)
27. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017)
28. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 693–701. Curran Associates, Inc. (2011), http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
29. Ruder, S., Bingel, J., Augenstein, I., Søgaard, A.: Latent multi-task architecture learning (2017)
30. Seltzer, M.L., Droppo, J.: Multi-task learning in deep neural networks for improved phoneme recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 6965–6969 (May 2013). https://doi.org/10.1109/ICASSP.2013.6639012
31. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units.
In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1715–1725. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-1162, https://www.aclweb.org/anthology/P16-1162
32. Shazeer, N., Stern, M.: Adafactor: Adaptive learning rates with sublinear memory cost. CoRR abs/1804.04235 (2018)
33. Shu, R., Bui, H.H., Narui, H., Ermon, S.: A DIRT-T approach to unsupervised domain adaptation. CoRR abs/1802.08735 (2018)
34. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 568–576. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf
35. Spoustová, D.j., Hajič, J., Raab, J., Spousta, M.: Semi-supervised training for the averaged perceptron POS tagger. In: Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009). pp. 763–771. Association for Computational Linguistics, Athens, Greece (Mar 2009), https://www.aclweb.org/anthology/E09-1087
36. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27, pp. 3104–3112. Curran Associates, Inc. (2014), http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
37. Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR abs/1602.07261 (2016)
38. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need.
In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
39. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., Dean, J.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)
40. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. pp. 94–108 (2014)
41. Zhao, W., Wang, B., Ye, J., Yang, M., Zhao, Z., Luo, R., Qiao, Y.: A multi-task learning approach for image captioning. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. pp. 1205–1211. International Joint Conferences on Artificial Intelligence Organization (7 2018). https://doi.org/10.24963/ijcai.2018/168
