Correlational Neural Networks

Sarath Chandar [1], Mitesh M. Khapra [2], Hugo Larochelle [3], Balaraman Ravindran [4]

[1] University of Montreal. apsarathchandar@gmail.com
[2] IBM Research India. mikhapra@in.ibm.com
[3] University of Sherbrooke. hugo.larochelle@usherbrooke.ca
[4] Indian Institute of Technology Madras. ravi@cse.iitm.ac.in

Keywords: Representation Learning, Deep Learning, Transfer Learning, Neural Networks, Autoencoders, Common Representations.

Abstract

Common Representation Learning (CRL), wherein different descriptions (or views) of the data are embedded in a common subspace, has been receiving a lot of attention recently. Two popular paradigms here are Canonical Correlation Analysis (CCA) based approaches and Autoencoder (AE) based approaches. CCA based approaches learn a joint representation by maximizing the correlation of the views when projected to the common subspace. AE based methods learn a common representation by minimizing the error of reconstructing the two views. Each of these approaches has its own advantages and disadvantages. For example, while CCA based approaches outperform AE based approaches for the task of transfer learning, they are not as scalable as the latter. In this work we propose an AE based approach, Correlational Neural Network (CorrNet), that explicitly maximizes correlation among the views when projected to the common subspace. Through a series of experiments, we demonstrate that the proposed CorrNet is better than the above mentioned approaches with respect to its ability to learn correlated common representations. Further, we employ CorrNet for several cross-language tasks and show that the representations learned using CorrNet perform better than the ones learned using other state-of-the-art approaches.

1 Introduction

In several real-world applications, the data contains more than one view. For example, a movie clip has three views (of different modalities): audio, video and text/subtitles.
[1] Work done while the first author was at IIT Madras and IBM Research India.

However, all the views may not always be available. For example, for many movie clips, audio and video may be available but subtitles may not. Recently there has been a lot of interest in learning a common representation for multiple views of the data (Ngiam et al., 2011; Klementiev et al., 2012; Chandar et al., 2013, 2014; Andrew et al., 2013; Hermann & Blunsom, 2014b; Wang et al., 2015), which can be useful in several downstream applications when some of the views are missing. We consider four applications to motivate the importance of learning common representations: (i) reconstruction of a missing view, (ii) transfer learning, (iii) matching corresponding items across views, and (iv) improving single-view performance by using data from other views.

In the first application, the learned common representations can be used to train a model to reconstruct all the views of the data (akin to autoencoders reconstructing the input view from a hidden representation). Such a model would allow us to reconstruct the subtitles even when only audio/video is available. Now, as an example of transfer learning, consider the case where a profanity detector trained on movie subtitles needs to detect profanities in a movie clip for which only video is available. If a common representation is available for the different views, then such detectors/classifiers can be trained by computing this common representation from the relevant view (subtitles, in the above example). At test time, a common representation can again be computed from the available view (video, in this case) and this representation can be fed to the trained model for prediction. Third, consider the case where items from one view (say, names written using the script of one language) need to be matched to their corresponding items from another view (names written using the script of another language).
One way of doing this is to project items from the two views to a common subspace such that the common representations of corresponding items from the two views are correlated. We can then match items across views based on the correlation between their projections. Finally, consider the case where we are interested in learning word representations for a language. If we have access to translations of these words in another language, then these translations can provide some context for disambiguation, which can lead to learning better word representations. In other words, jointly learning representations for a word in language L1 and its translation in language L2 can lead to better word representations in L1 (see section 9).

Having motivated the importance of Common Representation Learning (CRL), we now formally define this task. Consider some data Z = {z_i}_{i=1}^{N} which has two views: X and Y. Each data point z_i can be represented as a concatenation of these two views: z_i = (x_i, y_i), where x_i ∈ R^{d1} and y_i ∈ R^{d2}. In this work, we are interested in learning two functions, h_X and h_Y, such that h_X(x_i) ∈ R^k and h_Y(y_i) ∈ R^k are projections of x_i and y_i respectively into a common subspace (R^k) such that for a given pair (x_i, y_i):

1. h_X(x_i) and h_Y(y_i) should be highly correlated.
2. It should be possible to reconstruct y_i from x_i (through h_X(x_i)) and vice versa.

Canonical Correlation Analysis (CCA) (Hotelling, 1936) is a commonly used tool for learning such common representations for two-view data (Udupa & Khapra, 2010; Dhillon et al., 2011). By definition, CCA aims to produce correlated common representations, but it suffers from some drawbacks. First, it is not easily scalable to very large datasets. Of course, there are some approaches which try to make CCA scalable (for example, (Lu & Foster, 2014)), but such scalability comes at the cost of performance.
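Since CCA is referred to throughout, a minimal NumPy sketch of classical two-view CCA may help make the "correlated projections" requirement concrete: whiten each view's covariance and take an SVD of the cross-covariance. This is our own illustrative sketch, not an implementation from the paper; the function name and the small ridge term `reg` are our choices. Note the matrix decompositions are cubic in the input dimensions, which hints at why plain CCA scales poorly.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Classical CCA sketch: projection matrices A (d1 x k) and B (d2 x k)
    whose projections of the two views are maximally correlated.
    X: (N, d1), Y: (N, d2); returns (A, B, rho) where rho holds the top-k
    canonical correlations. `reg` is a small ridge term for stability."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    N = X.shape[0]
    Cxx = Xc.T @ Xc / N + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / N + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / N
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)  # Cxx = Lx Lx^T
    # Whitened cross-covariance; its singular values are the correlations.
    K = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(K)
    A = np.linalg.solve(Lx.T, U[:, :k])     # h_X(x) = A^T x
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])  # h_Y(y) = B^T y
    return A, B, s[:k]
```

When one view is (almost) a linear function of the other, the top canonical correlations approach 1, which is exactly the behaviour requirement 1 above asks of h_X and h_Y.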
Further, since CCA does not explicitly focus on reconstruction, reconstructing one view from the other might result in low-quality reconstruction. Finally, CCA cannot benefit from additional non-parallel, single-view data. This puts it at a severe disadvantage in several real-world situations where, in addition to some parallel two-view data, abundant single-view data is available for one or both views.

Recently, Multimodal Autoencoders (MAEs) (Ngiam et al., 2011) have been proposed to learn a common representation for two views/modalities. The idea in MAE is to train an autoencoder to perform two kinds of reconstruction. Given any one view, the model learns both self-reconstruction and cross-reconstruction (reconstruction of the other view). This makes the learnt representations predictive of each other. However, it should be noted that the MAE does not get any explicit learning signal encouraging it to share the capacity of its common hidden layer between the views. In other words, it could develop units whose activation is dominated by a single view. This makes the MAE unsuitable for transfer learning, since the views are not guaranteed to be projected to a common subspace. This is indeed verified by the results reported in (Ngiam et al., 2011), where they show that CCA performs better than deep MAE for the task of transfer learning.

These two approaches have complementary characteristics. On one hand, we have CCA and its variants, which aim to produce correlated common representations but lack reconstruction capabilities. On the other hand, we have MAE, which aims to do self-reconstruction and cross-reconstruction but does not guarantee correlated common representations. In this paper, we propose Correlational Neural Network (CorrNet) as a method for learning common representations which combines the advantages of the two approaches described above.
The main characteristics of the proposed method can be summarized as follows:

• It allows for self/cross reconstruction. Thus, unlike CCA (and like MAE), it has predictive capabilities. This can be useful in applications where a missing view needs to be reconstructed from an existing view.

• Unlike MAE (and like CCA), the training objective used in CorrNet ensures that the common representations of the two views are correlated. This is particularly useful in applications where we need to match items from one view to their corresponding items in the other view.

• CorrNet can be trained using Gradient Descent based optimization methods. In particular, when dealing with large high-dimensional data, one can use Stochastic Gradient Descent with mini-batches. Thus, unlike CCA (and like MAE), CorrNet is easy to scale.

• The procedure used for training CorrNet can be easily modified to benefit from additional single-view data. This makes CorrNet useful in many real-world applications where additional single-view data is available.

We evaluate CorrNet using four different experimental setups. First, we use the MNIST hand-written digit recognition dataset to compare CorrNet with other state-of-the-art CRL approaches. In particular, we evaluate its (i) ability to self/cross reconstruct, (ii) ability to produce correlated common representations and (iii) usefulness in transfer learning. In this setup, we use the left and right halves of the digit images as the two views. Next, we use CorrNet for a transfer learning task where the two views of the data come from two different languages. Specifically, we use CorrNet to project parallel documents in two languages to a common subspace. We then employ these common representations for the task of cross-language document classification (transfer learning) and show that they perform better than the representations learned using other state-of-the-art approaches.
Third, we use CorrNet for the task of transliteration equivalence, where the aim is to match a name written using the script of one language (first view) to the same name written using the script of another language (second view). Here again, we demonstrate that, with its ability to produce better correlated common representations, CorrNet performs better than CCA and MAE. Finally, we employ CorrNet for a bigram similarity task and show that jointly learning word representations for two languages (two views) leads to better word representations. Specifically, representations learnt using CorrNet help to improve performance on a bigram similarity task. We would like to emphasize that, unlike other models which have been tested mostly in only one of these scenarios, we demonstrate the effectiveness of CorrNet in all these different scenarios.

The remainder of this paper is organized as follows. In section 2 we describe the architecture of CorrNet and outline a training procedure for learning its parameters. In section 3 we propose a deep variant of CorrNet. In section 4 we briefly discuss some related models for learning common representations. In section 5 we present experiments to analyze the characteristics of CorrNet and compare it with CCA, KCCA and MAE. In section 6 we empirically compare Deep CorrNet with some other deep CRL methods. In sections 7, 8, and 9, we report results obtained by using CorrNet for the tasks of cross-language document classification, transliteration equivalence detection and bigram similarity respectively. Finally, we present concluding remarks in section 10 and highlight possible future work.
2 Correlational Neural Network

As described earlier, our aim is to learn a common representation from two views of the same data such that: (i) any single view can be reconstructed from the common representation, (ii) a single view can be predicted from the representation of the other view and (iii) like CCA, the representations learned for the two views are correlated. The first goal above can be achieved by a conventional autoencoder. The first and second can be achieved together by a Multimodal Autoencoder, but it is not guaranteed to project the two views to a common subspace. We propose a variant of autoencoders which can work with two views of the data, while being explicitly trained to achieve all of the above goals. In the following sub-sections, we describe our model and the training procedure.

2.1 Model

We start by proposing a neural network architecture which contains three layers: an input layer, a hidden layer and an output layer. Just as in a conventional single-view autoencoder, the input and output layers have the same number of units, whereas the hidden layer can have a different number of units. For illustration, we consider a two-view input z = (x, y). For all the discussions, [x, y] denotes a concatenated vector of size d1 + d2. Given z = (x, y), the hidden layer computes an encoded representation as follows:

h(z) = f(Wx + Vy + b)

where W is a k × d1 projection matrix, V is a k × d2 projection matrix and b is a k × 1 bias vector. Function f can be any non-linear activation function, for example sigmoid or tanh. The output layer then tries to reconstruct z from this hidden representation by computing

z' = g([W'h(z), V'h(z)] + b')

where W' is a d1 × k reconstruction matrix, V' is a d2 × k reconstruction matrix and b' is a (d1 + d2) × 1 output bias vector. Vector z' is the reconstruction of z. Function g can be any activation function.
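As a concrete illustration, the encoder and decoder above can be written in a few lines of NumPy. This is only a sketch, not the authors' Theano implementation: f = tanh and g = sigmoid are example choices, and passing a zero vector for a missing view yields the single-view encodings h(x) and h(y) used below.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, y, W, V, b):
    # h(z) = f(Wx + Vy + b), with f = tanh here.
    # Pass zeros for a missing view: h(x) = f(Wx + b), h(y) = f(Vy + b).
    return np.tanh(W @ x + V @ y + b)

def decode(h, W_out, V_out, b_out):
    # z' = g([W'h(z), V'h(z)] + b'), with g = sigmoid here.
    return sigmoid(np.concatenate([W_out @ h, V_out @ h]) + b_out)
```

For the MNIST halves used in section 5, the shapes would be d1 = d2 = 392 and k = 50.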
This architecture is illustrated in Figure 1. The parameters of the model are θ = {W, V, W', V', b, b'}. In the next sub-section we outline a procedure for learning these parameters.

Figure 1: Correlational Neural Network

2.2 Training

Restating our goals more formally, given two-view data Z = {z_i}_{i=1}^{N} = {(x_i, y_i)}_{i=1}^{N}, for each instance (x_i, y_i) we would like to:

• Minimize the self-reconstruction error, i.e., minimize the error in reconstructing x_i from x_i and y_i from y_i.

• Minimize the cross-reconstruction error, i.e., minimize the error in reconstructing x_i from y_i and y_i from x_i.

• Maximize the correlation between the hidden representations of both views.

We achieve this by finding the parameters θ = {W, V, W', V', b, b'} which minimize the following objective function:

J_Z(θ) = Σ_{i=1}^{N} [ L(z_i, g(h(z_i))) + L(z_i, g(h(x_i))) + L(z_i, g(h(y_i))) ] − λ corr(h(X), h(Y))

corr(h(X), h(Y)) = Σ_{i=1}^{N} (h(x_i) − h̄(X))(h(y_i) − h̄(Y)) / sqrt( Σ_{i=1}^{N} (h(x_i) − h̄(X))² · Σ_{i=1}^{N} (h(y_i) − h̄(Y))² )

where L is the reconstruction error, λ is a scaling parameter which scales the fourth term with respect to the remaining three terms, h̄(X) is the mean vector of the hidden representations of the first view and h̄(Y) is the mean vector of the hidden representations of the second view. If all dimensions in the input data take binary values then we use cross-entropy as the reconstruction error; otherwise we use squared error loss.

For simplicity, we use the shorthands h(x_i) and h(y_i) to denote the representations h((x_i, 0)) and h((0, y_i)) that are based on only a single view [2]. For each data point with two views x and y, h(x_i) simply means that we compute the hidden representation using only the x-view; in other words, in the equation h(z) = f(Wx + Vy + b), we set y = 0.
So, h(x) = f(Wx + b). h(x) = h((x, 0)) is not a choice per se, but a notation we are defining. h(z), h(x) and h(y) are certainly not guaranteed to be identical, though training will tend to make them so, because of the various reconstruction terms. The correlation term in the objective function is calculated by treating the hidden representation as a random vector.

In words, the objective function decomposes as follows. The first term is the usual autoencoder objective, which helps in learning meaningful hidden representations. The second term ensures that both views can be predicted from the shared representation of the first view alone. The third term ensures that both views can be predicted from the shared representation of the second view alone. The fourth term interacts with the other objectives to make sure that the hidden representations are highly correlated, so as to encourage the hidden units of the representation to be shared between views.

We can use stochastic gradient descent (SGD) to find the optimal parameters. For all our experiments, we used mini-batch SGD. The fourth term in the objective function is then approximated based on the statistics of a mini-batch. Approximating second-order statistics using mini-batches was also used successfully in the batch normalization training method of Ioffe & Szegedy (2015).

The model has four hyperparameters: (i) the number of units in the hidden layer, (ii) λ, (iii) the mini-batch size, and (iv) the SGD learning rate. The first hyperparameter depends on the specific task at hand and can be tuned using a validation set (exactly as is done by other competing algorithms). The second hyperparameter only serves to ensure that the correlation term in the objective function has the same range as the reconstruction errors. This is again easy to approximate based on the given data.
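The mini-batch approximation of the objective can be sketched in NumPy as follows. The correlation term is computed per hidden dimension and summed, matching how it is estimated from mini-batch statistics; the function names and the small `eps` stabilizer are our additions, and squared error stands in for the reconstruction loss L.

```python
import numpy as np

def correlation_term(Hx, Hy, eps=1e-8):
    """corr(h(X), h(Y)) over a mini-batch: per-dimension Pearson
    correlation of the two views' hidden units, summed over the k dims.
    Hx, Hy: (N, k) single-view hidden representations."""
    Hx_c = Hx - Hx.mean(axis=0)
    Hy_c = Hy - Hy.mean(axis=0)
    num = (Hx_c * Hy_c).sum(axis=0)
    den = np.sqrt((Hx_c ** 2).sum(axis=0) * (Hy_c ** 2).sum(axis=0)) + eps
    return (num / den).sum()

def corrnet_objective(Z, Z_from_z, Z_from_x, Z_from_y, Hx, Hy, lam):
    """J_Z: three squared-error reconstruction terms minus the scaled
    correlation term. Z is the (N, d1+d2) batch of true [x, y] vectors;
    Z_from_* are its reconstructions from h(z), h(x) and h(y)."""
    L = lambda A, B: ((A - B) ** 2).sum()
    return (L(Z, Z_from_z) + L(Z, Z_from_x) + L(Z, Z_from_y)
            - lam * correlation_term(Hx, Hy))
```

When the two sets of hidden representations are perfectly correlated, the correlation term attains its maximum value k, so minimizing the objective trades reconstruction error against the −λk bonus.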
The third hyperparameter governs how well a mini-batch approximates the correlation over the entire dataset, so larger mini-batches are preferred over smaller ones. The final hyperparameter, the learning rate, is common to all neural network based approaches.

Once the parameters are learned, we can use the CorrNet to compute representations of views that can potentially generalize across views. Specifically, given a new data instance for which only one view is available, we can compute its corresponding representation (h(x) if x is observed or h(y) if y is observed) and use it as the new data representation.

[2] They represent the generic functions h_X and h_Y mentioned in the introduction.

2.3 Using additional single-view data

In practice, it is often the case that we have abundant single-view data and comparatively little two-view data. For example, in the context of text documents from two languages (X and Y), typically the amount of monolingual (single-view) data available in each language is much larger than the parallel (two-view) data available between X and Y. Given the abundance of such single-view data, it is desirable to exploit it in order to improve the learned representation. CorrNet can achieve this by using the single-view data to improve the self-reconstruction error, as explained below.

Consider the case where, in addition to the data Z = {(x_i, y_i)}_{i=1}^{N}, we also have access to the single-view data X = {x_i}_{i=N+1}^{N1} and Y = {y_i}_{i=N+1}^{N2}. Now, during training, in addition to using Z as explained before, we also use X and Y by suitably modifying the objective function so that it matches that of a conventional autoencoder. Specifically, when we have only x_i, we try to minimize

J_X(θ) = Σ_{i=N+1}^{N1} L(x_i, g(h(x_i)))

and similarly for y_i. In all our experiments, when we have access to all three types of data (i.e.
X, Y and Z), we construct three sets of mini-batches by sampling data from X, Y and Z respectively. We then feed these mini-batches in random order to the model and perform a gradient update based on the corresponding objective function.

3 Deep Correlational Neural Networks

An obvious extension of CorrNets is to allow for multiple hidden layers. The main motivation for having such Deep Correlational Neural Networks is that a better correlation between the views of the data might be achievable with more non-linear representations. We use the following procedure to train a Deep CorrNet.

1. Train a shallow CorrNet with the given data (see step-1 in Figure 2). At the end of this step, we have learned the parameters W, V and b.

2. Modify the CorrNet model such that the first input view connects to a hidden layer using weights W and bias b. Similarly, connect the second view to a hidden layer using weights V and bias b. We have now decoupled the common hidden layer for each view (see step-2 in Figure 2).

3. Add a new common hidden layer which takes its input from the hidden layers created at step 2. We now have a CorrNet which is one layer deeper (see step-3 in Figure 2).

4. Train the new Deep CorrNet on the same data.

5. Repeat steps 2, 3 and 4 for as many hidden layers as required.

Figure 2: Stacking CorrNet to create a Deep Correlational Neural Network.

We would like to point out that we could have followed the procedure described in Chandar et al. (2013) for training a Deep CorrNet. In Chandar et al. (2013), a deep representation is learned for each view separately and used along with a shallow CorrNet to learn a common representation. However, feeding non-linear deep representations to a shallow CorrNet makes the CorrNet harder to train. Also, we chose not to use the deep training procedure described in Ngiam et al. (2011), since the objective functions they use during pre-training and training are different.
Specifically, during pre-training their objective is to minimize the self-reconstruction error, whereas during training the objective is to minimize both the self- and cross-reconstruction error. In contrast, in the stacking procedure outlined above, the objectives during training and pre-training are aligned.

Our current training procedure for Deep CorrNet is similar to greedy layer-wise pre-training of deep autoencoders. We believe that this procedure is more faithful to the global training objective of CorrNet, and it works well. We do not have strong empirical evidence that it is superior to other methods such as those described in Chandar et al. (2013) and Ngiam et al. (2011). When less parallel data is available, the method described in Chandar et al. (2013) makes more sense; each method has its own advantages. We leave a detailed comparison of these different alternatives for Deep CorrNet as future work.

4 Related Models

In this section, we describe other related models for common representation learning. We restrict our discussion to CCA based methods and Neural Network based methods only.

Canonical Correlation Analysis (CCA) (Hotelling, 1936) and its variants, such as regularized CCA (Vinod, 1976; Nielsen et al., 1998; Cruz-Cano & Lee, 2014), are the de-facto approaches used in the literature for learning a common representation of two different views (Udupa & Khapra, 2010; Dhillon et al., 2011). Kernel CCA (Akaho, 2001; Hardoon et al., 2004), another variant of CCA, uses the standard kernel trick to find pairs of non-linear projections of the two views. Deep CCA, a deep version of CCA, was introduced in (Andrew et al., 2013). One issue with CCA is that it is not easily scalable. Even though there are several works on scaling CCA (see (Lu & Foster, 2014)), they are all approximations to CCA and hence lead to a decrease in performance. Also, it is not trivial to extend CCA to multiple views.
However, there are some recent works along this line (Tenenhaus & Tenenhaus, 2011; Luo et al., 2015), which require complex computations. Lastly, conventional CCA based models can work only with parallel data. However, in real-life situations, parallel data is costly when compared to single-view data. The inability of CCA to leverage such single-view data is a drawback in many real-world applications. Representation Constrained CCA (RCCCA) (Mishra, 2009) is one model which can benefit from both single-view and multi-view data. It effectively uses a weighted combination of PCA (for single views) and CCA (for two views) by minimizing self-reconstruction errors and maximizing correlation. CorrNet, in contrast, minimizes both self- and cross-reconstruction error while maximizing correlation. RCCCA can also be considered a linear version of the DCCAE proposed in Wang et al. (2015).

Hsieh (2000) is one of the earliest Neural Network based models for nonlinear CCA. This method uses three feedforward neural networks. The first is a double-barreled architecture where two networks project the views to a single unit each, such that the projections are maximally correlated. This network is first trained to maximize the correlation. Then the inverse mapping for each view is learnt from the corresponding canonical covariate representation by minimizing the reconstruction error. There are clear differences between this Neural CCA model and CorrNet. First, CorrNet is a single neural network trained with a single objective function, while Neural CCA has three networks trained with different objective functions. Second, Neural CCA does only correlation maximization and self-reconstruction, whereas CorrNet does correlation maximization, self-reconstruction and cross-reconstruction, all at the same time.

Multimodal Autoencoder (MAE) (Ngiam et al., 2011) is another Neural Network based CRL approach.
Even though the architecture of MAE is similar to that of CorrNet, there are clear differences in the training procedures used by the two. First, MAE only aims to minimize the following three errors: (i) the error in reconstructing z_i from x_i (E1), (ii) the error in reconstructing z_i from y_i (E2) and (iii) the error in reconstructing z_i from z_i (E3). More specifically, unlike the fourth term in our objective function, the objective function used by MAE does not contain any term which forces the network to learn correlated common representations. Second, there is a difference in the manner in which these terms are considered during training. Unlike CorrNet, MAE only considers one of the above terms at a time. In other words, given an instance z_i = (x_i, y_i), it first tries to minimize E1 and updates the parameters accordingly. It then tries to minimize E2, followed by E3. Empirically, we observed that a training procedure which considers all three loss terms together performs better than one which considers them separately (refer to Section 5.5).

Deep Canonical Correlation Analysis (DCCA) (Andrew et al., 2013) is a recently proposed Neural Network approach for CCA. DCCA employs two deep networks, one per view. The model is trained in such a way that the final-layer projections of the data in both views are maximally correlated. DCCA maximizes only correlation, whereas CorrNet maximizes both correlation and reconstruction ability. Deep Canonically Correlated Auto Encoders (DCCAE) (Wang et al., 2015) (developed in parallel with our work) is an extension of DCCA which considers self-reconstruction and correlation. Unlike CorrNet, it does not consider cross-reconstruction.
5 Analysis of Correlational Neural Networks

In this section, we perform a set of experiments to compare CorrNet, CCA (Hotelling, 1936), Kernel CCA (KCCA) (Akaho, 2001) and MAE (Ngiam et al., 2011) based on:

• ability to reconstruct a view from itself
• ability to reconstruct one view given the other
• ability to learn correlated common representations for the two views
• usefulness of the learned common representations in transfer learning.

For CCA, we used a C++ library called dlib (King, 2009). For KCCA, we used an implementation provided by the authors of (Arora & Livescu, 2012). We implemented CorrNet and MAE using Theano (Bergstra et al., 2010).

5.1 Data Description

We used the standard MNIST handwritten digits image dataset for all our experiments. This data consists of 60,000 train images and 10,000 test images. Each image is a 28 × 28 matrix of pixels, each pixel representing one of 256 grayscale values. We treated the left half of the image as one view and the right half as the other view. Thus each view contains 14 × 28 = 392 dimensions. We split the train images into two sets. The first set contains 50,000 images and is used for training. The second set contains 10,000 images and is used as a validation set for tuning the hyperparameters of the four models described above.

5.2 Performance of Self and Cross Reconstruction

Among the four models listed above, only CorrNet and MAE have been explicitly trained to reconstruct a view from itself as well as from the other view. So, in this sub-section, we consider only these two models. Table 1 shows the Mean Squared Errors (MSEs) for self and cross reconstruction when the left half of the image is used as input.
Model     MSE for self reconstruction   MSE for cross reconstruction
CorrNet   3.6                           4.3
MAE       2.1                           4.2

Table 1: Mean Squared Error of CorrNet and MAE for self reconstruction and cross reconstruction.

The above table suggests that CorrNet has a higher self-reconstruction error and almost the same cross-reconstruction error as MAE. This is because, unlike MAE, CorrNet places its emphasis on maximizing the correlation between the common representations of the two views. This goal, captured by the fourth term in the objective function, naturally interferes with the goal of self-reconstruction. As we will see in the next sub-section, the embeddings learnt by CorrNet for the two views are better correlated, even though some self-reconstruction error is sacrificed in the process. Figure 3 shows the reconstruction of the right half from the left half for a few sample images. The figure reiterates our point that both CorrNet and MAE are equally good at cross reconstruction.

Figure 3: Reconstruction of the right half of the image given the left half. The first block shows the original images, the second block shows images where the right half is reconstructed by CorrNet and the third block shows images where the right half is reconstructed by MAE.

5.3 Correlation between representations of two views

As mentioned above, in CorrNet we place emphasis on learning highly correlated representations for the two views. To show that this is indeed the case, we follow (Andrew et al., 2013) and calculate the total/sum correlation captured in the 50 dimensions of the common representations learnt by the four models described above. The training, validation and test sets used for this experiment are as described in section 5.1. The results are reported in Table 2.

Model     Sum Correlation
CCA       17.05
KCCA      30.58
MAE       24.40
CorrNet   45.47

Table 2: Sum/total correlation captured in the 50 dimensions of the common representations learned by different models using MNIST data.
The total correlation captured in the 50 dimensions learnt by CorrNet is clearly better than that of the other models. Next, we check whether this remains the case when we change the number of dimensions. For this, we varied the number of dimensions from 5 to 80 and plotted the sum correlation for each model (see Figure 4). For all the models, we tuned the hyperparameters for dim = 50 and used the same hyperparameters for all dimensions.

Figure 4: Sum/total correlation as a function of the number of dimensions in the common representations learned by different models using MNIST data.

Again, we see that CorrNet clearly outperforms the other models. CorrNet thus achieves its primary goal of producing correlated embeddings with the aim of assisting transfer learning.

5.4 Transfer Learning across views

To demonstrate transfer learning, we take the task of predicting digits from only one half of the image. We first learn a common representation for the two views using 50,000 images from the MNIST training data. For each training instance, we take only one half of the image and compute its 50-dimensional common representation using one of the models described above. We then train a classifier using this representation. For each test instance, we consider only the other half of the image and compute its common representation. We then feed this representation to the classifier for prediction. We use the linear SVM implementation provided by (Pedregosa et al., 2011) as the classifier for all our experiments. For all the models considered in this experiment, representation learning is done using the 50,000 train images and the best hyperparameters are chosen using the 10,000 images from the validation set. With the chosen model, we report 5-fold cross-validation accuracy using the 10,000 images available in the standard test set of MNIST.
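The transfer protocol above (train a classifier on common representations of one view, test on common representations of the other) can be sketched as follows. The paper uses a linear SVM; here a nearest-centroid classifier is a lightweight, dependency-free stand-in, and all function names are our own.

```python
import numpy as np

def fit_centroids(H_train, labels):
    # One centroid per class in the common representation space.
    classes = np.unique(labels)
    centroids = np.stack([H_train[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(H_test, classes, centroids):
    # Assign each test representation to its nearest class centroid.
    d2 = ((H_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d2.argmin(axis=1)]
```

Transfer learning then amounts to calling fit_centroids on h(x) of the training instances (e.g. left halves) and predict on h(y) of the test instances (right halves); the better correlated the two views' representations, the smaller the accuracy drop relative to the single-view upper bound.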
We report accuracy for two settings: (i) Left to Right (training on the left view, testing on the right view) and (ii) Right to Left (training on the right view, testing on the left view).

Model        Left to Right    Right to Left
CCA          65.73            65.44
KCCA         68.1             75.71
MAE          64.14            68.88
CorrNet      77.05            78.81
Single view  81.62            80.06

Table 3: Transfer learning accuracy using the representations learned using different models on the MNIST dataset. Single view corresponds to a classifier trained and tested on the same view; this is the upper bound for the performance of any transfer learning algorithm.

Once again, we see that CorrNet performs significantly better than the other models. To verify that this holds even with less data, we decreased the data used for learning the common representation to 10,000 images. The results reported in Table 4 show that even with less data, CorrNet performs better than the other models.

Model        Left to Right    Right to Left
CCA          66.13            66.71
KCCA         70.68            70.83
MAE          68.69            72.54
CorrNet      76.6             79.51
Single view  81.62            80.06

Table 4: Transfer learning accuracy using the representations learned using different models trained with 10,000 instances from the MNIST dataset.

5.5 Relation with MAE

On the face of it, it may seem that CorrNet and MAE differ only in their objective functions. Specifically, if we remove the last correlation term from the objective function of CorrNet, it would become equivalent to MAE. To verify this, we conducted experiments using both MAE and CorrNet without the last term (denoted CorrNet(123)). When using SGD to train the networks, we found that the performance is almost identical. However, when we use a more advanced optimization technique such as RMSProp, CorrNet(123) starts performing better than MAE. The results are reported in Table 5.
Model         Optimization    Left to Right    Right to Left
MAE           SGD             63.9             67.98
CorrNet(123)  SGD             63.89            67.93
MAE           RMSProp         64.14            68.88
CorrNet(123)  RMSProp         67.82            72.13

Table 5: Results for transfer learning across views.

This experiment sheds some light on why CorrNet is better than MAE. Even though the objectives of MAE and CorrNet(123) are the same, MAE tries to solve it in a stochastic way, which adds more noise. CorrNet(123) performs better because it works on the combined objective function rather than on the stochastic version of it (one term at a time).

5.6 Analysis of Loss Terms

The objective function defined in Section 2.2 has the following four terms:

• L1 = Σ_{i=1}^{N} L(z_i, g(h(z_i)))
• L2 = Σ_{i=1}^{N} L(z_i, g(h(x_i)))
• L3 = Σ_{i=1}^{N} L(z_i, g(h(y_i)))
• L4 = λ corr(h(X), h(Y))

In this section, we analyze the importance of each of these terms in the loss function. For this, during training, we consider different loss functions which contain different combinations of these terms. In addition, we consider four more loss terms for our analysis:

• L5 = Σ_{i=1}^{N} L(y_i, g(h(x_i)))
• L6 = Σ_{i=1}^{N} L(x_i, g(h(y_i)))
• L7 = Σ_{i=1}^{N} L(x_i, g(h(x_i)))
• L8 = Σ_{i=1}^{N} L(y_i, g(h(y_i)))

where L5 and L6 capture the loss in reconstructing only one view (say, x_i) from the other view (y_i), while L7 and L8 capture the loss in self reconstruction. We first learn common representations using the different loss functions listed in the first column of Table 6. We then repeat the transfer learning experiments using the common representations learned from each of these models. For example, the sixth row in the table shows the results when the following loss function is used for learning the common representations:

J_Z(θ) = L1 + L2 + L3 + L4

which is the same as that used in CorrNet.
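The four core loss terms can be written down concretely. The sketch below is ours: a simplified single-layer CorrNet with sigmoid activations and squared-error loss L; the parameter shapes and initialization are illustrative assumptions. Note that h(x) means the common representation computed from x alone (the other view zeroed out), and likewise for h(y):

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy, k, lam = 6, 5, 4, 1.0            # illustrative sizes and lambda
W  = rng.standard_normal((k, dx)) * 0.1  # encoder for view x
V  = rng.standard_normal((k, dy)) * 0.1  # encoder for view y
Wp = rng.standard_normal((dx, k)) * 0.1  # decoder back to view x
Vp = rng.standard_normal((dy, k)) * 0.1  # decoder back to view y

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
h = lambda X, Y: sigmoid(X @ W.T + Y @ V.T)           # common representation
g = lambda H: (sigmoid(H @ Wp.T), sigmoid(H @ Vp.T))  # reconstruct both views
sqerr = lambda A, B: float(((A - B) ** 2).sum())

def loss_terms(X, Y):
    Zx, Zy = np.zeros_like(X), np.zeros_like(Y)
    rx, ry = g(h(X, Y));  L1 = sqerr(rx, X) + sqerr(ry, Y)  # from z = (x, y)
    rx, ry = g(h(X, Zy)); L2 = sqerr(rx, X) + sqerr(ry, Y)  # from x alone
    rx, ry = g(h(Zx, Y)); L3 = sqerr(rx, X) + sqerr(ry, Y)  # from y alone
    # L4: sum correlation between h(x) and h(y) (maximized during training)
    hx, hy = h(X, Zy), h(Zx, Y)
    hx, hy = hx - hx.mean(0), hy - hy.mean(0)
    num = (hx * hy).sum(0)
    den = np.sqrt((hx ** 2).sum(0) * (hy ** 2).sum(0)) + 1e-12
    L4 = lam * float((num / den).sum())
    return L1, L2, L3, L4
```

Each row of Table 6 then corresponds to training with a different sum of these terms.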
Loss function used for training    Left to Right    Right to Left
L1                                 24.59            22.56
L1 + L4                            65.9             67.54
L2 + L3                            71.54            75
L2 + L3 + L4                       76.54            80.57
L1 + L2 + L3                       67.82            72.13
L1 + L2 + L3 + L4                  77.05            78.81
L5 + L6                            35.62            32.26
L5 + L6 + L4                       62.05            63.49
L7 + L8                            10.26            10.33
L7 + L8 + L4                       73.03            76.08

Table 6: Comparison of the performance of transfer learning with representations learned using different loss functions.

Each even numbered row in the table reports the performance when the correlation term (L4) was used in addition to the terms in the row immediately before it. A pair-wise comparison of each even numbered row with the row immediately above it suggests that the correlation term (L4) in the loss function clearly produces representations that lead to better transfer learning.

6 Experiments using Deep Correlational Neural Network

In this section, we evaluate the performance of the deep extension of CorrNet. Having already compared with MAE in the previous section, we focus our evaluation here on a comparison with DCCA (Andrew et al., 2013). All the models were trained using 10,000 images from the MNIST training dataset, and we computed the sum correlation and transfer learning accuracy for each of these models. For transfer learning, we use the linear SVM implementation provided by (Pedregosa et al., 2011) for all our experiments and do 5-fold cross validation using 10,000 images from the MNIST test data. We report results for two settings: (i) Left to Right (training on the left view, testing on the right view) and (ii) Right to Left (training on the right view, testing on the left view). These results are summarized in Table 7. In this table, model-x-y means a model with x units in the first hidden layer and y units in the second hidden layer. For example, CorrNet-500-300-50 is a Deep CorrNet with three hidden layers containing 500, 300 and 50 units respectively.
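The naming convention above maps directly onto a stacked view encoder. Below is a minimal sketch of one view's encoder for a CorrNet-500-300-50-style model; the weights are random and the function names are ours, purely to make the layer-size convention concrete:

```python
import numpy as np

def make_encoder(layer_sizes, rng):
    """Build a stack of (weight, bias) pairs, e.g. layer_sizes =
    [784, 500, 300, 50] for one view of a CorrNet-500-300-50."""
    return [(rng.standard_normal((m, n)) * 0.05, np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def encode(params, x):
    """Apply the stack; the last layer's output (50 units here)
    is the common representation."""
    for W, b in params:
        x = np.tanh(x @ W.T + b)
    return x

rng = np.random.default_rng(0)
enc = make_encoder([784, 500, 300, 50], rng)
common = encode(enc, rng.standard_normal((8, 784)))  # shape (8, 50)
```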
The third layer, containing 50 units, is used as the common representation.

Model               Sum Correlation    Left to Right    Right to Left
CorrNet-500-50      47.21              77.68            77.95
DCCA-500-50         33.00              66.41            64.65
CorrNet-500-300-50  45.634             80.46            80.47
DCCA-500-500-50     33.77              70.06            72.43

Table 7: Comparison of sum correlation and transfer learning performance of different deep models.

Both Deep CorrNets (CorrNet-500-50 and CorrNet-500-300-50) clearly perform better than the corresponding DCCA. We notice that for both transfer learning tasks, the 3-layered CorrNet (CorrNet-500-300-50) performs better than the 2-layered CorrNet (CorrNet-500-50), but the sum correlation of the 2-layered CorrNet is better than that of the 3-layered CorrNet.

7 Cross Language Document Classification

In this section, we learn bilingual word representations using CorrNet and use these representations for the task of cross language document classification. We experiment with three language pairs and show that our approach achieves state-of-the-art performance.

Before we discuss bilingual word representations, let us consider the task of learning word representations for a single language. Consider a language X containing d words in its vocabulary. We represent a sentence in this language using a binary bag-of-words representation x. Specifically, each dimension x_i is set to 1 if the i-th vocabulary word is present in the sentence, and 0 otherwise. We wish to learn a k-dimensional vectorial representation of each word in the vocabulary from a training set of sentence bags-of-words {x_i}_{i=1}^{N}. We propose to achieve this by using a CorrNet which works with only a single view of the data (see Section 2.3). Effectively, one can view a CorrNet as encoding an input bag-of-words x as the sum of the columns in W corresponding to the words that are present in x, followed by a non-linearity.
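This column-sum view of the encoder can be made concrete. A small NumPy sketch (a sigmoid is chosen as the non-linearity for illustration; the names are ours):

```python
import numpy as np

def encode_bow(W, b, x):
    """Encode a binary bag-of-words x: summing the columns of W for the
    words present in x is exactly the matrix-vector product W x,
    followed by a non-linearity."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

rng = np.random.default_rng(0)
d, k = 10, 4                       # vocabulary size, embedding size
W, b = rng.standard_normal((k, d)), np.zeros(k)
x = np.zeros(d); x[[2, 5]] = 1.0   # sentence containing words 2 and 5

# Identical to explicitly summing the two word columns:
assert np.allclose(encode_bow(W, b, x),
                   1.0 / (1.0 + np.exp(-(W[:, 2] + W[:, 5]))))
```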
Thus, we can view W as a matrix whose columns act as vector representations (embeddings) for each word.

Let us now assume that for each sentence bag-of-words x in some source language X, we have an associated bag-of-words y for this sentence translated into some target language Y by a human expert. Assuming we have a training set of such (x, y) pairs, we would like to learn representations in both languages that are aligned, such that pairs of translated words have similar representations. The CorrNet allows us to achieve this. Indeed, it will effectively learn word representations (the columns of W and V) that are not only informative about the words present in sentences of each language, but that also ensure the representation space is aligned between the languages, as required by the cross-view reconstruction terms and the correlation term.

Note that, since the binary bags-of-words are very high-dimensional (the dimensionality corresponds to the size of the vocabulary, which is typically large), reconstructing each binary bag-of-words will be slow. Since we will later be training on millions of sentences, training on each individual sentence bag-of-words would be expensive. Thus, we propose a simple trick which exploits the bag-of-words structure of the input. Assuming we are performing mini-batch training (where a mini-batch contains a list of the bags-of-words of adjacent sentences), we simply merge the bags-of-words of the mini-batch into a single bag-of-words and perform an update based on that merged bag-of-words. The resulting effect is that each update is as efficient as in stochastic gradient descent, but the number of updates per training epoch is divided by the mini-batch size. As we will see in the experimental section, this trick produces good word representations while substantially reducing training time. We note that, additionally, we could have used the stochastic approach proposed by Dauphin et al.
(2011) for reconstructing binary bag-of-words representations of documents, to further improve the efficiency of training. They use importance sampling to avoid reconstructing the whole V-dimensional input vector.

7.1 Document representations

Once we learn the language specific word representation matrices W and V as described above, we can use them to construct document representations by using their columns as word vector representations. Given a document d, we represent it as the tf-idf weighted sum of its words' representations: ψ_X(d) = W · tf-idf(d) for language X and ψ_Y(d) = V · tf-idf(d) for language Y, where tf-idf(d) is the tf-idf weight vector of document d. We use the document representations thus obtained to train our document classifiers in the cross-lingual document classification task described in Section 7.3.

7.2 Related Work on Multilingual Word Representations

Recent work on learning bilingual representations of words has usually relied on word-level alignments. Klementiev et al. (2012) propose to simultaneously train two neural network language models, along with a regularization term that encourages pairs of frequently aligned words to have similar word embeddings. Thus, the use of this regularization term requires first obtaining word-level alignments from parallel corpora. Zou et al. (2013) use a similar approach, with a different form for the regularizer and neural network language models as in (Collobert et al., 2011). In our work, we specifically investigate whether a method that does not rely on word-level alignments can learn comparably useful multilingual embeddings in the context of document classification.

Looking more generally at neural networks that learn multilingual representations of words or phrases, we mention the work of Gao et al.
(2014), which showed that a useful linear mapping between separately trained monolingual skip-gram language models could be learned. They too, however, rely on the specification of pairs of words to align in the two languages. Mikolov et al. (2013) also propose a method for training a neural network to learn useful representations of phrases, in the context of a phrase-based translation model. In this case, phrase-level alignments (usually extracted from word-level alignments) are required. Recently, Hermann & Blunsom (2014b,a) proposed neural network architectures and a margin-based training objective that, as in this work, do not rely on word alignments. We briefly discuss this work in the experiments section. A tree based bilingual autoencoder with a similar objective function is also proposed in (Chandar et al., 2014).

7.3 Experiments

The technique proposed in this work enables us to learn bilingual embeddings which capture cross-language similarity between words. We propose to evaluate the quality of these embeddings by using them for the task of cross-language document classification. We closely followed the setup used by Klementiev et al. (2012) and compare with their method, for which word representations are publicly available.

The setup is as follows. A labeled data set of documents in some language X is available to train a classifier; however, we are interested in classifying documents in a different language Y at test time. To achieve this, we leverage a bilingual corpus which is not labeled with any document-level categories. This bilingual corpus is used to learn document representations that are coherent between languages X and Y. The hope is thus that we can successfully apply the classifier trained on document representations for language X directly to the document representations for language Y.
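The document representations from Section 7.1 that feed this classifier are just a matrix-vector product; a minimal sketch (the tf-idf weights are assumed precomputed, and the indices and values here are hypothetical):

```python
import numpy as np

def doc_representation(W_emb, tfidf_d):
    """psi_X(d) = W * tf-idf(d): the document embedding is the tf-idf
    weighted sum of the word-embedding columns of W."""
    return W_emb @ tfidf_d

rng = np.random.default_rng(0)
W_emb = rng.standard_normal((40, 1000))  # D = 40, vocabulary of 1000 words
tfidf_d = np.zeros(1000)
tfidf_d[[3, 17, 256]] = [0.5, 1.2, 0.8]  # hypothetical tf-idf weights
psi = doc_representation(W_emb, tfidf_d)
```

Because the mapping is linear in the tf-idf vector, a classifier trained on psi vectors in one language can be applied directly to psi vectors of the other language, provided the columns of W and V live in an aligned space.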
Following this setup, we performed experiments on 3 language pairs: English/German (EN/DE), English/French (EN/FR) and English/Spanish (EN/ES). For learning the bilingual embeddings, we used sections of the Europarl corpus (Koehn, 2005), which contains roughly 2 million parallel sentences. We used the same pre-processing as Klementiev et al. (2012): we tokenized the sentences using NLTK (Bird Steven & Klein, 2009), removed punctuation and lowercased all words. We did not remove stopwords.

As for the labeled document classification data sets, they were extracted from sections of the Reuters RCV1/RCV2 corpora, again for the 3 pairs considered in our experiments. Following Klementiev et al. (2012), we consider only documents which were assigned exactly one of the 4 top level categories in the topic hierarchy (CCAT, ECAT, GCAT and MCAT). These documents were also pre-processed using a procedure similar to that used for the Europarl corpus. We used the same vocabularies as Klementiev et al. (2012) (varying in size between 35,000 and 50,000).

Models were trained for up to 20 epochs using the same data as described earlier. We used mini-batch (of size 20) stochastic gradient descent. All results are for word embeddings of size D = 40, as in Klementiev et al. (2012). Further, to speed up the training of CorrNet, we merged each 5 adjacent sentence pairs into a single training instance, as described earlier. For all language pairs, λ was set to 4. The other hyperparameters were tuned for each task using a training/validation split of 80%/20%, using the performance on the validation set of an averaged perceptron trained on the smaller training portion (notice that this corresponds to a monolingual classification experiment, since the general assumption is that no labeled data is available in the test set language).
We compare our models with the following approaches:

• Klementiev et al. (2012): This model uses word embeddings learned by a multitask neural network language model with a regularization term that encourages pairs of frequently aligned words to have similar word embeddings. From these embeddings, document representations are computed as described in Section 7.1.

• MT: Here, test documents are translated to the language of the training documents using a standard phrase-based MT system, MOSES (http://www.statmt.org/moses/), trained using default parameters and a 5-gram language model on the Europarl corpus (the same corpus used for inducing our bilingual embeddings).

• Majority Class: Test documents are simply assigned the most frequent class in the training set.

For the EN/DE language pair, we directly report the results from Klementiev et al. (2012). For the other pairs (not reported in Klementiev et al. (2012)), we used the embeddings available online (http://klementiev.org/data/distrib/) and performed the classification experiments ourselves. Similarly, we generated the MT baseline ourselves.

Table 8 summarizes the results. They were obtained using 1000 RCV training examples. We report results in both directions, i.e. language X to Y and vice versa. The best performing method in all pairs except one is CorrNet. In particular, CorrNet often outperforms the approach of Klementiev et al. (2012) by a large margin.

In the last row of the table, we also include the results of recent work by Hermann & Blunsom (2014b,a). They proposed two neural network architectures for learning word and document representations using sentence-aligned data only. Instead of an autoencoder paradigm, they propose a margin-based objective that aims to make the representations of aligned sentences closer than those of non-aligned sentences.
While their trained embeddings are not publicly available, they report results for the EN/DE classification experiments, with representations of the same size as ours (D = 40) and trained on 500K EN/DE sentence pairs. Their best model in that setting reaches accuracies of 83.7% and 71.4% for the EN→DE and DE→EN tasks respectively. One clear advantage of our model is that, unlike theirs, it can use additional monolingual data. Indeed, when we train CorrNet with 500K EN/DE sentence pairs plus monolingual RCV documents (which come at no additional cost), we get accuracies of 87.9% (EN→DE) and 76.7% (DE→EN), still improving on their best model. If we do not use the monolingual data, CorrNet's performance is worse but still competitive at 86.1% for EN→DE and 68.8% for DE→EN. Finally, without constraining D to 40 (they use 128) and by using additional French data, the best results of Hermann & Blunsom (2014b) are 88.1% (EN→DE) and 79.1% (DE→EN), the latter being, to our knowledge, the current state-of-the-art, as reported in the last row of Table 8. (After we published our results in (Chandar et al., 2014), Soyer et al. (2015) improved the performance for EN→DE and DE→EN to 92.7% and 82.4% respectively.)

Model                EN→DE   DE→EN   EN→FR   FR→EN   EN→ES   ES→EN
CorrNet              91.8    74.2    84.6    74.2    49.0    64.4
Klementiev et al.    77.6    71.1    74.5    61.9    31.3    63.0
MT                   68.1    67.4    76.3    71.1    52.0    58.4
Majority Class       46.8    46.8    22.5    25.0    15.3    22.2
Hermann and Blunsom  88.1    79.1    N.A.    N.A.    N.A.    N.A.

Table 8: Cross-lingual classification accuracy for 3 language pairs, with 1000 labeled examples.

We also evaluate the effect of varying the amount of supervised training data for training the classifier. For brevity, we report only the results for the EN/DE pair, which
are summarized in Figure 5 and Figure 6. We observe that CorrNet clearly outperforms the other models at almost all data sizes. More importantly, it performs remarkably well at very low data sizes (100), suggesting it learns very meaningful embeddings, though the method can still benefit from more labeled data (as in the DE→EN case).

Table 9 also illustrates the properties captured within and across languages for the EN/DE pair. For a few English words, the words with the closest word representations (in Euclidean distance) are shown, for both English and German. We observe that words that form a translation pair are close, but also that close words within a language are syntactically/semantically similar as well.

january:   january/januar, march/märz, october/oktober, july/juli, december/dezember, 1999/jahres, june/juni, month/1999
president: president/präsident, i/präsidentin, mr/präsidenten, presidents/herr, thank/ich, president-in-office/ratspräsident, report/danken, voted/danke
said:      said/gesagt, told/sagte, say/sehr, believe/heute, saying/sagen, wish/heutigen, shall/letzte, again/hier
oil:       oil/öl, supply/boden, supplies/befindet, gas/gerät, fuel/erdöl, mineral/infolge, petroleum/abhängig, crude/folge
microsoft: microsoft/microsoft, cds/cds, insider/warner, ibm/tageszeitungen, acquisitions/ibm, shareholding/handelskammer, warner/exchange, online/veranstalter
market:    market/markt, markets/marktes, single/märkte, commercial/binnenmarkt, competition/märkten, competitive/handel, business/öffnung, goods/binnenmarktes

Table 9: Example English words along with the 8 closest words in both English (en) and German (de), using the Euclidean distance between the embeddings learned by CorrNet. Each entry is given as en/de.
Figure 5: Cross-lingual classification accuracy results for EN→DE.

Figure 6: Cross-lingual classification accuracy results for DE→EN.

The excellent performance of CorrNet suggests that merging several sentences into single bags-of-words can still yield good word embeddings. In other words, not only do we not need to rely on word-level alignments, but exact sentence-level alignment is also not essential to reach good performance. We experimented with merging 5, 25 and 50 adjacent sentences into a single bag-of-words. Results are shown in Table 10. They suggest that merging several sentences into single bags-of-words does not necessarily degrade the quality of the word embeddings, confirming that exact sentence-level alignment is not essential for good performance.

# sent.   EN→DE   DE→EN   EN→FR   FR→EN   EN→ES   ES→EN
5         91.75   72.78   84.64   74.2    49.02   64.4
25        88.0    64.5    78.1    70.02   68.3    54.68
50        90.2    49.2    82.44   75.5    38.2    67.38

Table 10: Cross-lingual classification accuracy of CorrNet for the 3 language pairs when merging the bags-of-words of different numbers of sentences. These results are based on 1000 labeled examples.

8 Transliteration Equivalence

In the previous section, we showed the application of CorrNet in a cross language learning setup. In addition to cross language learning, CorrNet can also be used for matching equivalent items across views. As a case study, we consider the task of determining transliteration equivalence of named entities: given a word u written in the script of language X and a word v written in the script of language Y, the goal is to determine whether u and v are transliterations of each other. Several approaches have been proposed for this task; the one most related to our work uses CCA for determining transliteration equivalence.
We consider English-Hindi as the language pair for which transliteration equivalence needs to be determined. For learning common representations, we used approximately 15,000 transliteration pairs from the NEWS 2009 English-Hindi training set (Li et al., 2009). We represent each Hindi word as a bag of 2860 character bigrams; this forms the first view (x_i). Similarly, we represent each English word as a bag of 651 character bigrams; this forms the second view (y_i). Each such pair (x_i, y_i) then serves as one training instance for the CorrNet.

For testing, we consider the standard NEWS 2010 transliteration mining test set (Kumaran et al., 2010). This test set contains approximately 1000 Wikipedia English-Hindi title pairs. The original task definition is as follows: for a given English title containing T1 words and the corresponding Hindi title containing T2 words, identify all pairs which form a transliteration pair. Specifically, for each title pair, consider all T1 × T2 word pairs and identify the correct transliteration pairs. In all, the test set contains 5468 word pairs, out of which 982 are transliteration pairs. For every word pair (x_i, y_i), we obtain a 50 dimensional common representation for x_i and y_i using the trained CorrNet. We then calculate the correlation between the representations of x_i and y_i. If the correlation is above a threshold, we mark the word pair as equivalent. This threshold is tuned using an additional 1000 pairs which were provided as training data for the NEWS 2010 transliteration mining task. As seen in Table 11, CorrNet clearly performs better than the other methods. Note that our aim is not to achieve state of the art performance on this task, but to compare the quality of the shared representations learned using the different CRL methods considered in this paper.
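The decision rule just described can be sketched as follows: the Pearson correlation is computed across the 50 dimensions of the pair's common representations. The threshold value below is a placeholder; the actual threshold is tuned on held-out pairs as described above:

```python
import numpy as np

def is_transliteration_pair(hx, hy, threshold=0.5):
    """Mark (x, y) as equivalent if the correlation between their common
    representations (computed across dimensions) exceeds a tuned threshold.
    The default threshold here is a placeholder, not a tuned value."""
    hx, hy = hx - hx.mean(), hy - hy.mean()
    r = float(hx @ hy / (np.linalg.norm(hx) * np.linalg.norm(hy) + 1e-12))
    return r > threshold
```

Each of the 5468 candidate word pairs in the test set is scored this way, and the F1-measure in Table 11 is computed against the 982 true transliteration pairs.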
Model      F1-measure (%)
CCA        49.68
KCCA       42.36
MAE        72.75
CorrNet    81.56

Table 11: Performance on the NEWS 2010 En-Hi Transliteration Mining Dataset.

9 Bigram similarity using multilingual word embeddings

In this section, we consider one more dataset/application to compare the performance of CorrNet with other state of the art methods. Specifically, the task at hand is to calculate the similarity score between two English bigrams based on their representations. These bigram representations are computed from word representations learnt using English-German word pairs. The motivation here is that the German word provides some context for disambiguating the English word and hence leads to better word representations. This task has already been considered in Mitchell & Lapata (2010), Lu et al. (2015) and Wang et al. (2015). We follow a setup similar to Wang et al. (2015) and use the same dataset. The English and German words are first represented using 640-dimensional monolingual word vectors trained via Latent Semantic Indexing (LSI) on the WMT 2011 monolingual news corpora. We used 36,000 such English-German monolingual word vector pairs for common representation learning. Each pair, consisting of one English (x_i) and one German (y_i) word vector, thus acts as one training instance, z_i = (x_i, y_i), for the CorrNet. Once a common representation is learnt, we project all the English words into this common subspace and use these word embeddings for computing the similarity of bigram pairs in English.

The bigram similarity dataset was initially used in Mitchell & Lapata (2010). We consider the adjective-noun (AN) and verb-object (VN) subsets of the bigram similarity dataset. We use the same tuning and test splits of size 649/1,972 for each subset. The vector representation of a bigram is computed by simply adding the vector representations of the two words in the bigram.
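The bigram composition, together with the cosine scoring used in the evaluation, is straightforward; a minimal sketch with a hypothetical embedding lookup (the words and vectors here are placeholders, not dataset entries):

```python
import numpy as np

def bigram_vector(emb, w1, w2):
    """A bigram's vector is the sum of its two words' embeddings."""
    return emb[w1] + emb[w2]

def cosine(u, v):
    """Cosine similarity used to score a pair of bigram vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 384-dimensional embeddings for a few example words.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(384) for w in ("dark", "night", "black", "hair")}
score = cosine(bigram_vector(emb, "dark", "night"),
               bigram_vector(emb, "black", "hair"))
```

The pairs are then ranked by this score, and the ranking is compared against human judgments via Spearman's correlation.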
Following previous work, we compute the cosine similarity between the two vectors of each bigram pair, order the pairs by similarity, and report Spearman's correlation (ρ) between the model's ranking and the human rankings. Following Wang et al. (2015), we fix the dimensionality of the vectors at L = 384. Other hyperparameters are tuned using the tuning data.

The results are reported in Table 12, where we compare CorrNet with the different methods considered in Wang et al. (2015). CorrNet performs better than the previous state-of-the-art (DCCAE) on the average score. The best results are obtained using CorrNet-500-384. This experiment suggests that, apart from multi-view applications such as (i) transfer learning, (ii) reconstructing a missing view and (iii) matching items across views, CorrNet can also be employed to exploit multi-view data to improve the performance of a single view task (such as monolingual bigram similarity).

Model           AN      VN      Avg.
Baseline (LSI)  45.0    39.1    42.1
CCA             46.6    37.7    42.2
SplitAE         47.0    45.0    46.0
CorrAE          43.0    42.0    42.5
DistAE          43.6    39.4    41.5
FKCCA           46.4    42.9    44.7
NKCCA           44.3    39.5    41.9
DCCA            48.5    42.5    45.5
DCCAE           49.1    43.2    46.2
CorrNet         46.2    47.4    46.8

Table 12: Spearman's correlation for the bigram similarity dataset. Results for the other models are taken from Wang et al. (2015).

10 Conclusion and Future Work

In this paper, we proposed Correlational Neural Networks as a method for learning common representations for two views of data. The proposed model can reconstruct one view from the other, and it ensures that the common representations learned for the two views are aligned and correlated. Its training procedure is also scalable. Further, the model can benefit from additional single view data, which is often available in many real world applications. We employ the common representations learned using CorrNet for two downstream applications, viz.
, cross language document classification and transliteration equivalence detection. For both these tasks, we show that the representations learned using CorrNet perform better than those learned using other methods.

We believe it should be possible to extend CorrNet to multiple views. This could be very useful in applications where varying amounts of data are available in the different views. For example, it is typically easy to find parallel data for English/German and English/Hindi, but harder to find parallel data for German/Hindi. If data from all these languages can be projected into a common subspace, then English can act as a pivot language to facilitate cross language learning between Hindi and German. We intend to investigate this direction in future work.

Acknowledgement

We would like to thank Alexander Klementiev and Ivan Titov for providing the code for the classifier and the data indices for the cross language document classification task. We would also like to thank Janarthanan Rajendran for valuable discussions.

References

Akaho, S. (2001). A kernel method for canonical correlation analysis. In Proc. Int'l Meeting of the Psychometric Society.

Andrew, G., Arora, R., Bilmes, J., and Livescu, K. (2013). Deep canonical correlation analysis. In ICML.

Arora, R. and Livescu, K. (2012). Kernel CCA for multi-view learning of acoustic features using articulatory measurements. In 2012 Symposium on Machine Learning in Speech and Language Processing, MLSLP 2012, Portland, Oregon, USA, September 14, 2012, pages 34–37.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).

Bird Steven, E. L. and Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc.

Chandar, S., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A.
(2013). Multilingual deep learning. NIPS Deep Learning Workshop.

Chandar, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. In Proceedings of NIPS.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Cruz-Cano, R. and Lee, M.-L. T. (2014). Fast regularized canonical correlation analysis. Computational Statistics & Data Analysis, 70:88–100.

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pages 945–952. Omnipress.

Dhillon, P., Foster, D., and Ungar, L. (2011). Multi-view learning of word embeddings via CCA. In NIPS.

Gao, J., He, X., Yih, W.-t., and Deng, L. (2014). Learning continuous phrase representations for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 699–709, Baltimore, Maryland.

Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664.

Hermann, K. M. and Blunsom, P. (2014a). Multilingual distributed representations without word alignment. In Proceedings of the International Conference on Learning Representations (ICLR).

Hermann, K. M. and Blunsom, P. (2014b). Multilingual models for compositional distributed semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 58–68.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28:321–377.

Hsieh, W. (2000).
Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10):1095–1105.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167.

King, D. E. (2009). Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758.

Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In Proceedings of the International Conference on Computational Linguistics (COLING).

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Kumaran, A., Khapra, M. M., and Li, H. (2010). Report of NEWS 2010 transliteration mining shared task. In Proceedings of the 2010 Named Entities Workshop, pages 21–28, Uppsala, Sweden.

Li, H., Kumaran, A., Zhang, M., and Pervouvhine, V. (2009). Whitepaper of NEWS 2009 machine transliteration shared task. In Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009), pages 19–26, Suntec, Singapore.

Lu, A., Wang, W., Bansal, M., Gimpel, K., and Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In NAACL-HLT.

Lu, Y. and Foster, D. P. (2014). Large scale canonical correlation analysis with iterative least squares. In NIPS.

Luo, Y., Tao, D., Wen, Y., Ramamohanarao, K., and Xu, C. (2015). Tensor canonical correlation analysis for multi-view dimension reduction. In arXiv.

Mikolov, T., Le, Q., and Sutskever, I. (2013). Exploiting similarities among languages for machine translation. Technical report, arXiv.

Mishra, S. (2009). Representation-constrained canonical correlation analysis: a hybridization of canonical correlation and principal component analyses. Journal of Applied Economic Sciences (JAES), pages 115–124.

Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics.
Cognitive Science, 34(8):1388–1429.

Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). Multimodal deep learning. In ICML.

Nielsen, F. Å., Hansen, L. K., and Strother, S. C. (1998). Canonical ridge analysis with ridge parameter optimization.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Soyer, H., Stenetorp, P., and Aizawa, A. (2015). Leveraging monolingual data for crosslingual compositional word representations. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.

Tenenhaus, A. and Tenenhaus, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257–284.

Udupa, R. and Khapra, M. M. (2010). Transliteration equivalence using canonical correlation analysis. In Proceedings of the 32nd European Conference on IR Research, pages 75–86.

Vinod, H. (1976). Canonical ridge and econometrics of joint production. Journal of Econometrics, 4(2):147–166.

Wang, W., Arora, R., Livescu, K., and Bilmes, J. (2015). On deep multi-view representation learning. In ICML.

Zou, W. Y., Socher, R., Cer, D., and Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).
