Domain-Adversarial Neural Networks



Hana Ajakan¹ (hana.ajakan.1@ulaval.ca), Pascal Germain¹ (pascal.germain@ift.ulaval.ca), Hugo Larochelle² (hugo.larochelle@usherbrooke.ca), François Laviolette¹ (francois.laviolette@ift.ulaval.ca), Mario Marchand¹ (mario.marchand@ift.ulaval.ca)

¹ Département d'informatique et de génie logiciel, Université Laval, Québec, Canada
² Département d'informatique, Université de Sherbrooke, Québec, Canada

All authors contributed equally to this work.

Abstract

We introduce a new representation learning algorithm suited to the context of domain adaptation, in which data at training and test time come from similar but different distributions. Our algorithm is directly inspired by theory on domain adaptation suggesting that, for effective domain transfer to be achieved, predictions must be made based on a data representation that cannot discriminate between the training (source) and test (target) domains. We propose a training objective that implements this idea in the context of a neural network, whose hidden layer is trained to be predictive of the classification task, but uninformative as to the domain of the input. Our experiments on a sentiment analysis classification benchmark, where the target domain data available at training time is unlabeled, show that our neural network for domain adaptation algorithm has better performance than either a standard neural network or an SVM, even when they are trained on input features extracted with the state-of-the-art marginalized stacked denoising autoencoders of Chen et al. (2012).

1. Introduction

The cost of generating labeled data for a new learning task is often an obstacle to applying machine learning methods. There is thus great incentive to develop ways of exploiting data from one problem that generalize to another.
Domain adaptation focuses on the situation where we have data generated from two different, but somehow similar, distributions. One example arises in sentiment analysis of written reviews, where we might want to distinguish positive reviews from negative ones. While we might have labeled data for reviews of one type of product (e.g., movies), we might want to be able to generalize to reviews of other products (e.g., books). Domain adaptation tries to achieve such a transfer by exploiting an extra set of unlabeled training data for the new problem to which we wish to generalize (e.g., unlabeled reviews of books).

One of the main approaches to achieve such a transfer is to learn a classifier and a representation that favor the transfer. A large body of work exists on training both a classifier and a representation that are linear (Bruzzone & Marconcini, 2010; Germain et al., 2013; Cortes & Mohri, 2014). However, recent research has shown that non-linear neural networks can also be successful (Glorot et al., 2011). Specifically, a variant of the denoising autoencoder (Vincent et al., 2008), known as marginalized stacked denoising autoencoders (mSDA) (Chen et al., 2012), has demonstrated state-of-the-art performance on this problem. By learning a representation which is robust to input corruption noise, mSDA obtains a representation which is also more stable across changes of domain and can thus allow cross-domain transfer.

In this paper, we propose to control the stability of the representation between domains explicitly within a neural network learning algorithm. This approach is motivated by theory on domain adaptation (Ben-David et al., 2006; 2010) suggesting that a good representation for cross-domain transfer is one for which an algorithm cannot learn to identify the domain of origin of the input observation.
We show that this principle can be implemented into a neural network learning objective that includes a term where the network's hidden layer works adversarially against output connections predicting domain membership. The neural network is then simply trained by gradient descent on this objective. The success of this domain-adversarial neural network (DANN) is confirmed by extensive experiments on both toy and real-world datasets. In particular, we show that DANN achieves better performance than a regular neural network and an SVM on a sentiment analysis classification benchmark. Moreover, we show that DANN can reach state-of-the-art performance by taking as input the representation learned by mSDA, confirming that explicitly minimizing domain discriminability improves over relying only on a representation which is robust to noise.

2. Domain Adaptation

We consider binary classification tasks where X ⊆ R^n is the input space and Y = {0, 1} is the label set. Moreover, we have two different distributions over X × Y, called the source domain D_S and the target domain D_T. A domain adaptation learning algorithm is then provided with a labeled source sample S drawn i.i.d. from D_S, and an unlabeled target sample T drawn i.i.d. from D_T^X, where D_T^X is the marginal distribution of D_T over X:

    S = \{(x_i^s, y_i^s)\}_{i=1}^{m} \sim (\mathcal{D}_S)^m ; \quad T = \{x_i^t\}_{i=1}^{m'} \sim (\mathcal{D}_T^X)^{m'} .

The goal of the learning algorithm is to build a classifier η : X → Y with a low target risk

    R_{\mathcal{D}_T}(\eta) \;\overset{\mathrm{def}}{=}\; \Pr_{(x^t, y^t) \sim \mathcal{D}_T}\big[\eta(x^t) \neq y^t\big] ,

while having no information about the labels of D_T.

2.1. Domain Divergence

To tackle the challenging domain adaptation task, many approaches bound the target error by the sum of the source error and a notion of distance between the source and the target distributions.
These methods are intuitively justified by a simple assumption: the source risk is expected to be a good indicator of the target risk when both distributions are similar. Several notions of distance have been proposed for domain adaptation (Ben-David et al., 2006; 2010; Mansour et al., 2009a;b; Germain et al., 2013). In this paper, we focus on the H-divergence used by Ben-David et al. (2006; 2010), based on the earlier work of Kifer et al. (2004).

Definition 1 (Ben-David et al. (2006; 2010); Kifer et al. (2004)). Given two domain distributions D_S^X and D_T^X over X, and a hypothesis class H, the H-divergence between D_S^X and D_T^X is

    d_{\mathcal{H}}(\mathcal{D}_S^X, \mathcal{D}_T^X) \;\overset{\mathrm{def}}{=}\; 2 \sup_{\eta \in \mathcal{H}} \Big| \Pr_{x^s \sim \mathcal{D}_S^X}\big[\eta(x^s) = 1\big] - \Pr_{x^t \sim \mathcal{D}_T^X}\big[\eta(x^t) = 1\big] \Big| .

That is, the H-divergence relies on the capacity of the hypothesis class H to distinguish examples generated by D_S^X from examples generated by D_T^X. Ben-David et al. (2006; 2010) proved that, for a symmetric hypothesis class H, one can compute the empirical H-divergence between two samples S ∼ (D_S^X)^m and T ∼ (D_T^X)^{m'} by computing

    \hat{d}_{\mathcal{H}}(S, T) \;\overset{\mathrm{def}}{=}\; 2 \left( 1 - \min_{\eta \in \mathcal{H}} \left[ \frac{1}{m} \sum_{i=1}^{m} I\big[\eta(x_i^s) = 1\big] + \frac{1}{m'} \sum_{i=1}^{m'} I\big[\eta(x_i^t) = 0\big] \right] \right) ,   (1)

where I[a] is the indicator function, which is 1 if predicate a is true, and 0 otherwise.

2.2. Proxy Distance

Ben-David et al. (2006; 2010) suggested that, even if it is generally hard to compute d̂_H(S, T) exactly (e.g., when H is the space of linear classifiers on X), we can easily approximate it by running a learning algorithm on the problem of discriminating between source and target examples. To do so, we construct a new dataset

    U = \{(x_i^s, 1)\}_{i=1}^{m} \cup \{(x_i^t, 0)\}_{i=1}^{m'} ,   (2)

where the examples of the source sample are labeled 1 and the examples of the target sample are labeled 0.
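As a concrete illustration, the empirical H-divergence of Equation (1) can be estimated by training a linear domain classifier on the dataset U of Equation (2). The sketch below uses a small logistic regression trained by gradient descent as a stand-in for the hypothesis class; it is an illustrative implementation, not the code used in the paper.

```python
import numpy as np

def empirical_h_divergence(xs, xt, epochs=500, lr=0.1):
    """Estimate the empirical H-divergence of Eq. (1) with H taken as
    linear classifiers. Source examples get domain label 1 and target
    examples get 0, as in the dataset U of Eq. (2); by symmetry of H,
    the trained classifier's per-domain errors play the "min" role."""
    X = np.vstack([xs, xt])
    z = np.concatenate([np.ones(len(xs)), np.zeros(len(xt))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
        g = p - z                                # gradient of the log loss
        w -= lr * (X.T @ g) / len(z)
        b -= lr * g.mean()
    pred = (X @ w + b > 0).astype(float)
    err_s = np.mean(pred[:len(xs)] != 1)         # misclassified source points
    err_t = np.mean(pred[len(xs):] != 0)         # misclassified target points
    return 2.0 * (1.0 - (err_s + err_t))         # Equation (1)
```

On two well-separated samples the estimate approaches 2 (domains fully distinguishable); on two samples from the same distribution it approaches 0.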
Then, the risk of the classifier trained on the new dataset U approximates the "min" part of Equation (1). Thus, given a test error ε on the problem of discriminating between source and target examples, the Proxy A-distance (PAD) is given by

    \hat{d}_A = 2(1 - 2\epsilon) .   (3)

In the experiments section of this paper, we compute the PAD value following the approach of Glorot et al. (2011) and Chen et al. (2012): we train a linear SVM on a subset of dataset U (Equation (2)), and we use the obtained classifier's error on the other subset as the value of ε in Equation (3).

2.3. Generalization Bound on the Target Risk

Ben-David et al. (2006; 2010) also showed that the H-divergence d_H(D_S^X, D_T^X) is upper bounded by its empirical estimate d̂_H(S, T) plus a constant complexity term that depends on the VC dimension of H and the sizes of the samples S and T. By combining this result with a similar bound on the source risk, the following theorem is obtained.

Theorem 2 (Ben-David et al. (2006)). Let H be a hypothesis class of VC dimension d. With probability 1 − δ over the choice of samples S ∼ (D_S)^m and T ∼ (D_T^X)^m, for every η ∈ H:

    R_{\mathcal{D}_T}(\eta) \;\leq\; R_S(\eta) + \sqrt{\frac{4}{m}\Big(d \log\frac{2em}{d} + \log\frac{4}{\delta}\Big)} + \hat{d}_{\mathcal{H}}(S, T) + 4\sqrt{\frac{1}{m}\Big(d \log\frac{2m}{d} + \log\frac{4}{\delta}\Big)} + \beta ,

with \beta \geq \inf_{\eta^* \in \mathcal{H}} \big[ R_{\mathcal{D}_S}(\eta^*) + R_{\mathcal{D}_T}(\eta^*) \big], and

    R_S(\eta) = \frac{1}{m} \sum_{i=1}^{m} I\big[\eta(x_i^s) \neq y_i^s\big]

is the empirical source risk.

The previous result tells us that R_{D_T}(η) can be low only when the β term is low, i.e., only when there exists a classifier that can achieve a low risk on both distributions. It also tells us that, to find a classifier with a small R_{D_T}(η) in a given class of fixed VC dimension, the learning algorithm should minimize (in that class) a trade-off between the source risk R_S(η) and the empirical H-divergence d̂_H(S, T). As pointed out by Ben-David et al.
(2006), a strategy to control the H-divergence is to find a representation of the examples in which both the source and the target domains are as indistinguishable as possible. Under such a representation, a hypothesis with a low source risk will, according to Theorem 2, perform well on the target data. In this paper, we present an algorithm that directly exploits this idea.

3. A Domain-Adversarial Neural Network

The originality of our approach is to explicitly implement the idea exhibited by Theorem 2 into a neural network classifier. That is, to learn a model that can generalize well from one domain to another, we ensure that the internal representation of the neural network contains no discriminative information about the origin of the input (source or target), while preserving a low risk on the source (labeled) examples.

3.1. Source Risk Minimization (Standard NN)

Let us consider the following standard neural network (NN) architecture with one hidden layer:

    h(x) = \mathrm{sigm}\big(b + Wx\big) ,
    f(x) = \mathrm{softmax}\big(c + V h(x)\big) ,   (4)

with

    \mathrm{sigm}(a) \overset{\mathrm{def}}{=} \Big[\frac{1}{1 + \exp(-a_i)}\Big]_{i=1}^{|a|} , \quad \mathrm{softmax}(a) \overset{\mathrm{def}}{=} \Big[\frac{\exp(a_i)}{\sum_{j=1}^{|a|} \exp(a_j)}\Big]_{i=1}^{|a|} .

Note that each component f_y(x) of f(x) denotes the conditional probability that the neural network assigns x to class y. Given a training source sample S = {(x_i^s, y_i^s)}_{i=1}^m, the natural classification loss to use is the negative log-probability of the correct label:

    \mathcal{L}\big(f(x), y\big) \overset{\mathrm{def}}{=} \log \frac{1}{f_y(x)} .

This leads to the following learning problem on the source domain:

    \min_{W, V, b, c} \left[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(f(x_i^s), y_i^s\big) \right] .   (5)

We view the output of the hidden layer h(·) (Equation (4)) as the internal representation of the neural network. Thus, we denote the source sample representations as h(S) = { h(x_i^s) }_{i=1}^m.

3.2.
A Domain Adaptation Regularizer

Now, consider an unlabeled sample from the target domain T = {x_i^t}_{i=1}^{m'} and the corresponding representations h(T) = {h(x_i^t)}_{i=1}^{m'}. Based on Equation (1), the empirical H-divergence of a symmetric hypothesis class H between samples h(S) and h(T) is given by

    \hat{d}_{\mathcal{H}}\big(h(S), h(T)\big) = 2 \left( 1 - \min_{\eta \in \mathcal{H}} \left[ \frac{1}{m} \sum_{i=1}^{m} I\big[\eta(h(x_i^s)) = 1\big] + \frac{1}{m'} \sum_{i=1}^{m'} I\big[\eta(h(x_i^t)) = 0\big] \right] \right) .   (6)

Let us consider H as the class of hyperplanes in the representation space. Inspired by the Proxy A-distance (see Section 2.2), we suggest estimating the "min" part of Equation (6) by a logistic regressor that models the probability that a given input (either x^s or x^t) is from the source domain D_S^X (denoted z = 1) or the target domain D_T^X (denoted z = 0):

    p(z = 1 \mid \phi) = o(\phi) \overset{\mathrm{def}}{=} \mathrm{sigm}(d + u^\top \phi) ,   (7)

where φ is either h(x^s) or h(x^t). Hence, the function o(·) is a domain regressor.

This enables us to add a domain adaptation term to the objective of Equation (5), giving the following problem to solve:

    \min_{W, V, b, c} \left[ \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(f(x_i^s), y_i^s\big) + \lambda \max_{u, d} \left( -\frac{1}{m} \sum_{i=1}^{m} \mathcal{L}_d\big(o(x_i^s), 1\big) - \frac{1}{m'} \sum_{i=1}^{m'} \mathcal{L}_d\big(o(x_i^t), 0\big) \right) \right] ,   (8)

where the hyper-parameter λ > 0 weights the domain adaptation regularization term and

    \mathcal{L}_d\big(o(x), z\big) = -z \log\big(o(x)\big) - (1 - z) \log\big(1 - o(x)\big) .

In line with Theorem 2, this optimization problem implements a trade-off between the minimization of the source risk R_S(·) and the divergence d̂_H(·, ·). The hyper-parameter λ is then used to tune the trade-off between these two quantities during the learning process.

[Figure 1. DANN architecture: the input x is mapped by the hidden layer (parameters W, b) to the representation h(x), which feeds both the classification output layer f(x) (parameters V, c) and the domain regressor o(x) (parameters u, d).]

3.3.
Learning Algorithm (DANN)

We see that Equation (8) involves a maximization operation. Hence, the neural network (parametrized by {W, V, b, c}) and the domain regressor (parametrized by {u, d}) are competing against each other, in an adversarial way, over that term. The obtained domain-adversarial neural network (DANN) is illustrated in Figure 1. In DANN, the hidden layer h(·) maps an example (either source or target) into a representation in which the output layer f(·) accurately classifies the source sample, while the domain regressor o(·) is unable to detect whether an example belongs to the source sample or the target sample.

To optimize Equation (8), one option would be to follow a hard-EM approach, where we would alternate between optimizing until convergence the adversarial parameters u, d and the other, regular neural network parameters W, V, b, c. However, we have found that a simpler stochastic gradient descent (SGD) approach is sufficient and works well in practice. Here, an SGD approach consists in sampling a pair of source and target examples x_i^s, x_j^t and performing a gradient step update of all parameters of DANN. Crucially, while the update of the regular parameters follows, as usual, the opposite direction of the gradient, for the adversarial parameters u, d the step must follow the gradient's direction (since we maximize with respect to them, instead of minimizing). The algorithm is detailed in Algorithm 1. In the pseudocode, e(y) represents a "one-hot" vector, consisting of all 0s except for a 1 at position y, and ⊙ is the element-wise product.

For each experiment described in this paper, we used early stopping as the stopping criterion: we split the source labeled sample to use 90% as the training set S and the remaining 10% as a validation set S_V. We stop the learning process when the risk on S_V is minimal.
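The per-example updates of Algorithm 1 can be sketched in plain numpy roughly as follows. This is an illustrative re-implementation for binary labels, not the authors' code; the function name, dimensions, and learning-rate values are arbitrary. Note the sign asymmetry at the end: the network parameters descend their gradient while the domain regressor parameters u, d ascend theirs.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def dann_epoch(Xs, ys, Xt, W, V, b, c, u, d, lam=0.1, lr=1e-2, rng=None):
    """One pass of the DANN stochastic updates (sketch of Algorithm 1):
    gradient descent on the classification loss for (W, V, b, c), with the
    domain regularizer pushing the hidden layer toward domain confusion,
    and gradient ascent for the domain regressor (u, d)."""
    if rng is None:
        rng = np.random.default_rng(0)
    for i in range(len(Xs)):
        # forward propagation on a source example
        h_s = sigm(b + W @ Xs[i])
        a = c + V @ h_s
        f = np.exp(a - a.max()); f /= f.sum()        # softmax output
        e_y = np.eye(2)[ys[i]]                       # one-hot label
        # backpropagation of the classification loss
        d_c = f - e_y
        d_V = np.outer(d_c, h_s)
        d_b = (V.T @ d_c) * h_s * (1 - h_s)
        d_W = np.outer(d_b, Xs[i])
        # domain regularizer, source side (domain label z = 1)
        o_s = sigm(d + u @ h_s)
        d_d = lam * (1 - o_s); d_u = lam * (1 - o_s) * h_s
        tmp = lam * (1 - o_s) * u * h_s * (1 - h_s)
        d_b += tmp; d_W += np.outer(tmp, Xs[i])
        # domain regularizer, random target example (domain label z = 0)
        j = rng.integers(len(Xt))
        h_t = sigm(b + W @ Xt[j])
        o_t = sigm(d + u @ h_t)
        d_d -= lam * o_t; d_u -= lam * o_t * h_t
        tmp = -lam * o_t * u * h_t * (1 - h_t)
        d_b += tmp; d_W += np.outer(tmp, Xt[j])
        # descend for the network, ascend for the domain regressor
        W -= lr * d_W; V -= lr * d_V; b -= lr * d_b; c -= lr * d_c
        u += lr * d_u; d += lr * d_d
    return W, V, b, c, u, d
```

A call on a small synthetic problem reduces the source classification loss epoch after epoch while the domain regressor is simultaneously trained and opposed.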
Algorithm 1 DANN
 1: Input: samples S = {(x_i^s, y_i^s)}_{i=1}^m and T = {x_i^t}_{i=1}^{m'},
 2:   hidden layer size l, adaptation parameter λ, learning rate α
 3: Output: neural network {W, V, b, c}
 4: W, V ← random_init(l)
 5: b, c, u, d ← 0
 6: while stopping criterion is not met do
 7:   for i from 1 to m do
 8:     # Forward propagation
 9:     h(x_i^s) ← sigm(b + W x_i^s)
10:     f(x_i^s) ← softmax(c + V h(x_i^s))
11:     # Backpropagation
12:     ∆c ← −(e(y_i^s) − f(x_i^s))
13:     ∆V ← ∆c h(x_i^s)^⊤
14:     ∆b ← (V^⊤ ∆c) ⊙ h(x_i^s) ⊙ (1 − h(x_i^s))
15:     ∆W ← ∆b · (x_i^s)^⊤
16:     # Domain adaptation regularizer...
17:     # ...from current domain
18:     o(x_i^s) ← sigm(d + u^⊤ h(x_i^s))
19:     ∆d ← λ(1 − o(x_i^s));  ∆u ← λ(1 − o(x_i^s)) h(x_i^s)
20:     tmp ← λ(1 − o(x_i^s)) u ⊙ h(x_i^s) ⊙ (1 − h(x_i^s))
21:     ∆b ← ∆b + tmp;  ∆W ← ∆W + tmp · (x_i^s)^⊤
22:     # ...from other domain
23:     j ← uniform_integer(1, ..., m')
24:     h(x_j^t) ← sigm(b + W x_j^t)
25:     o(x_j^t) ← sigm(d + u^⊤ h(x_j^t))
26:     ∆d ← ∆d − λ o(x_j^t);  ∆u ← ∆u − λ o(x_j^t) h(x_j^t)
27:     tmp ← −λ o(x_j^t) u ⊙ h(x_j^t) ⊙ (1 − h(x_j^t))
28:     ∆b ← ∆b + tmp;  ∆W ← ∆W + tmp · (x_j^t)^⊤
29:     # Update neural network parameters
30:     W ← W − α∆W;  V ← V − α∆V
31:     b ← b − α∆b;  c ← c − α∆c
32:     # Update domain classifier parameters
33:     u ← u + α∆u;  d ← d + α∆d
34:   end for
35: end while

4. Related Work

As mentioned previously, the general approach of achieving domain adaptation by learning a new data representation has been explored under many facets. A large part of the literature, however, has focused mainly on linear hypotheses (see for instance Blitzer et al., 2006; Bruzzone & Marconcini, 2010; Germain et al., 2013; Baktashmotlagh et al., 2013; Cortes & Mohri, 2014). More recently, non-linear representations have become increasingly studied, including neural network representations (Glorot et al.
, 2011; Li et al., 2014) and most notably the state-of-the-art mSDA (Chen et al., 2012). That literature has mostly focused on exploiting the principle of robust representations, based on the denoising autoencoder paradigm (Vincent et al., 2008). One of the contributions of this work is to show that domain discriminability is another principle, complementary to robustness, that can improve cross-domain adaptation.

What distinguishes this work from most of the domain adaptation literature is DANN's inspiration from the theoretical work of Ben-David et al. (2006; 2010). Indeed, DANN directly optimizes the notion of H-divergence. We do note the work of Huang & Yates (2012), in which HMM representations are learned for word tagging using a posterior regularizer that is also inspired by Ben-David et al.'s work. In addition to the tasks being different (word tagging versus sentiment classification), we would argue that DANN's learning objective more closely optimizes the H-divergence, with Huang & Yates (2012) relying on cruder approximations for efficiency reasons.

The idea of learning representations that are indiscriminate of some auxiliary label has also been explored in other contexts. For instance, Zemel et al. (2013) propose the notion of fair representations, which are indiscriminate of whether an example belongs to some identified set of groups. The resulting algorithm is different from DANN and is not directly derived from the H-divergence.

Finally, we mention some related research on using an adversarial (minimax) formulation to learn a model, such as a classifier or a neural network, from data. There has been work on learning linear classifiers that are robust to changes in the input distribution, based on a minimax formulation (Bagnell, 2005; Liu & Ziebart, 2014).
This work, however, assumes that a good feature representation of the input for a linear classifier is available and does not address the problem of learning it. We also note the work of Goodfellow et al. (2014), who propose generative adversarial networks to learn a good generative model of the true data distribution. This work shares with DANN the use of an adversarial objective, applying it instead to the unsupervised problem of generative modeling.

5. Experiments

5.1. Toy Problem

As a first experiment, we study the behavior of the proposed DANN algorithm on a variant of the inter-twinning moons 2D problem, where the target distribution is a rotation of the source distribution. For the source sample S, we generate a lower moon and an upper moon, labeled 0 and 1 respectively, each containing 150 examples. The target sample T is obtained by generating a sample in the same way as S (without keeping the labels) and then rotating each example by 35°. Thus, T contains 300 unlabeled examples. In Figure 2, the examples from S are represented by "+" and "−", and the examples from T are represented by black dots.

We study the adaptation capability of DANN by comparing it to the standard NN. In our experiments, both algorithms share the same network architecture, with a hidden layer size of 15 neurons. We even train NN using the same procedure as DANN. That is, we keep updating the domain regressor component using the target sample T (with a hyper-parameter λ = 6, the same value used for DANN), but we disable the adversarial back-propagation into the hidden layer. To do so, we execute Algorithm 1 while omitting the lines numbered 21 and 28. In this way, we obtain a NN learning algorithm – based on the source risk minimization of Equation (5) – and simultaneously train the domain regressor of Equation (7) to discriminate between the source and target domains.
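The toy data described above can be generated along the following lines. This is a sketch under assumptions: the paper does not specify its exact moon generator or noise level, so the parametrization below (a standard two-moons construction with a hypothetical Gaussian noise of 0.1) is illustrative only.

```python
import numpy as np

def make_moons_pair(n=150, angle_deg=35.0, noise=0.1, seed=0):
    """Build the toy setup: a labeled source sample of two inter-twinning
    moons (labels 0 and 1, n examples each) and an unlabeled target sample
    generated the same way, then rotated by angle_deg degrees."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, 2 * n)
    lower = np.c_[1 - np.cos(t[:n]), 0.5 - np.sin(t[:n])]   # lower moon, label 0
    upper = np.c_[np.cos(t[n:]), np.sin(t[n:])]             # upper moon, label 1
    Xs = np.vstack([lower, upper]) + rng.normal(0.0, noise, (2 * n, 2))
    ys = np.r_[np.zeros(n, int), np.ones(n, int)]
    # target sample: same construction, labels discarded, rotated
    Xt = np.vstack([np.c_[1 - np.cos(t[:n]), 0.5 - np.sin(t[:n])],
                    np.c_[np.cos(t[n:]), np.sin(t[n:])]])
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    Xt = (Xt + rng.normal(0.0, noise, Xt.shape)) @ R.T      # rotate each point
    return Xs, ys, Xt
```

Calling `make_moons_pair()` returns 300 labeled source points and 300 unlabeled target points whose distribution is the source distribution rotated by 35°.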
Using this toy experiment, we first illustrate how DANN adapts its decision boundary compared to NN. We also illustrate how the representation given by the hidden layer is less suited to domain classification with DANN than it is with NN. The results are illustrated in Figure 2, where the graphs in part (a) relate to the standard NN, and the graphs in part (b) relate to DANN. By looking at the corresponding (a) and (b) graphs in each column, we compare NN and DANN from four different perspectives, described in detail below.

Label classification. The first column of Figure 2 shows the decision boundaries of DANN and NN on the problem of predicting the labels of both source and target examples. As expected, NN accurately classifies the two classes of the source sample S, but is not fully adapted to the target sample T. On the contrary, the decision boundary of DANN perfectly classifies examples from both the source and target samples. DANN clearly adapts to the target distribution here.

Representation PCA. To analyze how the domain adaptation regularizer affects the representation h(·) provided by the hidden layer, the second column of Figure 2 presents a principal component analysis (PCA) on the set of all representations of source and target data points, i.e., h(S) ∪ h(T). Thus, given the trained network (NN or DANN), every point from S and T is mapped into a 15-dimensional feature space through the hidden layer, and projected back into a two-dimensional plane defined by the first two principal components. In the DANN-PCA representation, we observe that target points are homogeneously spread out among the source points. In the NN-PCA representation, clusters of target points containing very few source points are clearly visible. Hence, the task of labeling the target points seems easier to perform on the DANN-PCA representation.
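The PCA projection described above can be reproduced with a few lines of numpy. The sketch below projects a stack of hidden representations onto its first two principal components; the random matrix `H` merely stands in for the actual h(S) ∪ h(T) of a trained network.

```python
import numpy as np

def pca2(H):
    """Project representations onto their first two principal components,
    as used to visualize h(S) ∪ h(T) in the second column of Figure 2."""
    Hc = H - H.mean(axis=0)
    # right singular vectors of the centered data are the principal directions
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Hc @ Vt[:2].T

rng = np.random.default_rng(0)
H = rng.normal(size=(600, 15))   # stand-in for h(S) ∪ h(T), 15 hidden units
P = pca2(H)                      # 600 points in the 2D principal plane
```

For a real network, one would stack h(x) for every x in S and T and scatter-plot the two columns of the result.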
To push the analysis further, four crucial data points, identified by A, B, C and D in the graphs of the first column (which correspond to the moon extremities in the original space), are represented again in the graphs of the second column. We observe that points A and B are very close to each other in the NN-PCA representation, while they clearly belong to different classes. The same happens to points C and D. Conversely, these four points are located at opposite corners in the DANN-PCA representation. Note also that the target point A (resp. D) – which is difficult to classify in the original space – is located in the "+" cluster (resp. "−" cluster) in the DANN-PCA representation. Therefore, the representation promoted by DANN is better suited to the domain adaptation task.

[Figure 2. The inter-twinning moons toy problem. Columns: label classification, representation PCA, domain classification, and hidden neurons. Row (a): standard NN and a non-adversarial domain regressor (Algorithm 1, without lines 21 and 28). Row (b): DANN (Algorithm 1). We adopt the colors of Figure 1 for the classification output and the domain regressor value.]

Domain classification.
The third column of Figure 2 shows the decision boundary on the domain classification problem, which is given by the domain regressor o(·) of Equation (7). More precisely, x is classified as a source example when o(x) ≥ 0.5, and is classified as a target example otherwise. Remember that, during the learning process of DANN, the o(·) regressor struggles to discriminate between the source and target domains, while the hidden representation h(·) is adversarially updated to prevent it from succeeding. As explained above, we trained the domain regressor during the learning process of NN, but without allowing it to influence the learned representation h(·).

On one hand, the DANN domain regressor utterly fails to discriminate between the source and target distributions. On the other hand, the NN domain regressor discriminates better (although imperfectly). This again corroborates that the DANN representation does not allow discriminating between domains.

Hidden neurons. In the plots of the last column of Figure 2, the lines show the decision surfaces of the hidden layer neurons (defined by Equation (4)). In other words, each of the fifteen plotted lines corresponds to the points x ∈ R² for which the i-th component of h(x) equals 1/2, for i ∈ {1, ..., 15}.

We observe that the neurons of NN are grouped in three clusters, each one generating a straight-line part of the curved decision boundary for the label classification problem. However, most of these neurons are also able to (roughly) capture the rotation angle of the domain classification problem. Hence, we observe that the adaptation regularizer of DANN prevents these kinds of neurons from being produced. It is indeed striking to see that the two predominant patterns in the NN neurons (i.e., the two parallel lines crossing the plane from the lower left corner to the upper right corner) are absent among the DANN neurons.

5.2.
Sentiment Analysis Dataset

In this section, we compare the performance of our proposed DANN algorithm to a standard neural network with one hidden layer (NN), described by Equation (5), and a Support Vector Machine (SVM) with a linear kernel. To select the hyper-parameters of each of these algorithms, we use grid search and a very small validation set consisting of 100 labeled examples from the target domain. Finally, we select the classifiers having the lowest target validation risk.

We compare the algorithms on the Amazon reviews dataset, as pre-processed by Chen et al. (2012). This dataset includes four domains, each one composed of reviews of a specific kind of product (books, dvd disks, electronics, and kitchen appliances).

Table 1. Error rates on the Amazon reviews dataset (left), and pairwise Poisson binomial test (right).

(a) Error rates on the Amazon reviews dataset

                               Original data           mSDA representation
source → target              DANN    NN     SVM       DANN    NN     SVM
books → dvd                  0.201  0.199  0.206     0.176  0.171  0.175
books → electronics          0.246  0.251  0.256     0.197  0.228  0.244
books → kitchen              0.230  0.235  0.229     0.169  0.166  0.172
dvd → books                  0.247  0.261  0.269     0.176  0.173  0.176
dvd → electronics            0.247  0.256  0.249     0.181  0.234  0.220
dvd → kitchen                0.227  0.227  0.233     0.151  0.153  0.178
electronics → books          0.280  0.281  0.290     0.237  0.241  0.229
electronics → dvd            0.273  0.277  0.278     0.216  0.228  0.261
electronics → kitchen        0.148  0.149  0.163     0.118  0.126  0.137
kitchen → books              0.283  0.288  0.325     0.222  0.226  0.234
kitchen → dvd                0.261  0.261  0.274     0.208  0.214  0.209
kitchen → electronics        0.161  0.161  0.158     0.141  0.136  0.138

(b) Pairwise Poisson binomial test

Original data        DANN    NN    SVM
DANN                 0.50   0.90  0.97
NN                   0.10   0.50  0.87
SVM                  0.03   0.13  0.50

mSDA representations  DANN    NN    SVM
DANN                  0.50   0.82  0.88
NN                    0.18   0.50  0.90
SVM                   0.12   0.10  0.50
Reviews are encoded in 5,000-dimensional feature vectors of unigrams and bigrams, and labels are binary: "0" if the product is ranked up to 3 stars, and "1" if the product is ranked 4 or 5 stars.

We perform twelve domain adaptation tasks. For example, "books → dvd" corresponds to the task for which books is the source domain and dvd disks the target one. All learning algorithms are given 2,000 labeled source examples and 2,000 unlabeled target examples. Then, we evaluate them on separate target test sets (between 3,000 and 6,000 examples). Note that NN and SVM do not use the unlabeled target sample for learning. Here are more details about the procedure used for each learning algorithm.

DANN. The adaptation parameter λ is chosen among 9 values between 10⁻² and 1 on a logarithmic scale. The hidden layer size l is either 1, 5, 12, 25, 50, 75, 100, 150, or 200. Finally, the learning rate α is fixed at 10⁻³.

NN. We use exactly the same hyper-parameters and training procedure as for DANN above, except that we do not need an adaptation parameter. Note that one can train NN using the DANN implementation (Algorithm 1) with λ = 0.

SVM. The hyper-parameter C of the SVM is chosen among 10 values between 10⁻⁵ and 1 on a logarithmic scale. This range of values is the same as used by Chen et al. (2012) in their experiments.

The "Original data" part of Table 1(a) shows the target test risk of all algorithms, and Table 1(b) reports the probability that one algorithm is significantly better than another according to the Poisson binomial test (Lacoste et al., 2012). We note that DANN has a significantly better performance than NN and SVM, with respective probabilities 0.90 and 0.97. As the only difference between DANN and NN is the domain adaptation regularizer, we conclude that our approach successfully helps to find a representation suitable for the target domain.

5.3.
Combining DANN with Autoencoders

We now wonder whether our DANN algorithm can improve on the representation learned by the state-of-the-art Marginalized Stacked Denoising Autoencoders (mSDA) proposed by Chen et al. (2012). In brief, mSDA is an unsupervised algorithm that learns a new robust feature representation of the training samples. It takes the unlabeled parts of both source and target samples to learn a feature map from the input space X to a new representation space. As a denoising autoencoder, it finds a feature representation from which one can (approximately) reconstruct the original features of an example from its noisy counterpart. Chen et al. (2012) showed that using mSDA with a linear SVM classifier gives state-of-the-art performance on the Amazon reviews dataset. As an alternative to the SVM, we propose to apply our DANN algorithm on the same representations generated by mSDA (using representations of both source and target samples). Note that, even if mSDA and DANN are two representation learning approaches, they optimize different objectives, which can be complementary.

We perform this experiment on the same Amazon reviews dataset described in the previous subsection. For each source-target pair, we generate the mSDA representations using a corruption probability of 50% and a number of layers of 5. We then execute the three learning algorithms (DANN, NN, and SVM) on these representations. More precisely, following the experimental procedure of Chen et al. (2012), we use the concatenation of the output of the 5 layers and the original input as the new representation. Thus, each example is now encoded in a vector of 30,000 dimensions. Note that we use the same grid search as in Subsection 5.2, but with a learning rate α of 10^-4 for both DANN and NN. The results of the "mSDA representation" columns in Table 1(a) confirm that combining mSDA and DANN is a sound approach. Indeed, the Poisson binomial test shows that DANN has a better performance than NN and SVM with probabilities 0.82 and 0.88 respectively, as reported in Table 1(b).

Figure 3. Proxy A-distances (PAD): (a) DANN on original data; (b) DANN and NN with 100 hidden neurons; (c) DANN on mSDA representations. Note that the PAD values of mSDA representations are symmetric when swapping source and target samples. Also, some DANN results, when used on top of mSDA, have PAD values lower than 0.5 and do not appear on Figure (c).

5.4. Proxy A-distance

The theoretical foundation of DANN is the domain adaptation theory of Ben-David et al. (2006; 2010). We claimed that DANN finds a representation in which the source and the target examples are hardly distinguishable. Our toy experiment of Section 5.1 already points out some evidence, but we want to confirm it on real data. To do so, we compare the Proxy A-distance (PAD) on various representations of the Amazon reviews dataset. These representations are obtained by running either NN, DANN, mSDA, or mSDA and DANN combined. Recall that PAD, as described in Section 2.2, is a metric estimating the similarity of the source and the target representations.
More precisely, to obtain a PAD value, we use the following procedure: (1) we construct the dataset U of Equation (2) using both source and target representations of the training samples; (2) we randomly split U in two subsets of equal size; (3) we train linear SVMs on the first subset of U using a large range of C values; (4) we compute the error of all obtained classifiers on the second subset of U; and (5) we use the lowest error to compute the PAD value of Equation (3).

Firstly, Figure 3(a) compares the PAD of DANN representations obtained in the experiments of Section 5.2 (using the hyper-parameter values leading to the results of Table 1) to the PAD computed on raw data. As expected, the PAD values are driven down by the DANN representations.

Secondly, Figure 3(b) compares the PAD of DANN representations to the PAD of standard NN representations. As the PAD is influenced by the hidden layer size (the discriminating power tends to increase with the dimension of the representation), we fix the size to 100 neurons for both algorithms. We also fix the adaptation parameter of DANN to λ ≈ 0.31, as it was the value selected most often during our preceding experiments on the Amazon reviews dataset. Again, DANN clearly leads to the lowest PAD values.

Lastly, Figure 3(c) presents two sets of results related to the Section 5.3 experiments. On the one hand, we reproduce the results of Chen et al. (2012), who noticed that the mSDA representations give greater PAD values than those obtained with the original (raw) data. Although the mSDA approach clearly helps to adapt to the target task, this seems to contradict the theory of Ben-David et al. On the other hand, we observe that, when running DANN on top of mSDA (using the hyper-parameter values leading to the results of Table 1), the obtained representations have much lower PAD values.
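The five-step PAD procedure above can be sketched in a few lines. This is a minimal sketch assuming that Equation (3) takes the standard form PAD = 2(1 − 2ε), where ε is the lowest held-out domain-classification error; the function name and the `C` grid are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def proxy_a_distance(source_repr, target_repr,
                     C_grid=(0.001, 0.01, 0.1, 1.0, 10.0), seed=0):
    """Estimate the Proxy A-distance between two sets of representations.

    (1) Build the domain-labeled dataset U (source -> 0, target -> 1);
    (2) split U randomly into two halves; (3) train one linear SVM per C
    on the first half; (4) measure each classifier's error on the second
    half; (5) plug the lowest error eps into PAD = 2 * (1 - 2 * eps).
    """
    X = np.vstack([source_repr, target_repr])
    y = np.concatenate([np.zeros(len(source_repr)), np.ones(len(target_repr))])

    idx = np.random.default_rng(seed).permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]

    errors = [1.0 - LinearSVC(C=C).fit(X[train], y[train]).score(X[test], y[test])
              for C in C_grid]
    return 2.0 * (1.0 - 2.0 * min(errors))
```

Well-separated domains give a PAD close to 2, while indistinguishable domains give values near 0 (slightly negative values are possible, since the held-out error ε can exceed 0.5).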
These observations might explain the improvements provided by DANN when combined with the mSDA procedure.

6. Conclusion and Future Work

In this paper, we have proposed a neural network algorithm, named DANN, that is strongly inspired by the domain adaptation theory of Ben-David et al. (2006; 2010). The main idea behind DANN is to encourage the network's hidden layer to learn a representation which is predictive of the source example labels, but uninformative about the domain of the input (source or target). Extensive experiments on the inter-twinning moons toy problem and the Amazon reviews sentiment analysis dataset have shown the effectiveness of this strategy. Notably, we achieved state-of-the-art performance when combining DANN with the mSDA autoencoders of Chen et al. (2012), which turned out to be two complementary representation learning approaches.

We believe that the domain adaptation regularizer we developed for the DANN algorithm can be incorporated into many other learning algorithms. Natural extensions of our work would be deeper network architectures, multi-source adaptation problems, and other learning tasks beyond the basic binary classification setting. We also intend to meld the DANN approach with denoising autoencoders, to potentially improve on the two-step procedure of Section 5.3.

References

Bagnell, J. Andrew. Robust supervised learning. In Proceedings of the 20th National Conference on Artificial Intelligence, pp. 714–719. AAAI Press / The MIT Press, 2005.

Baktashmotlagh, Mahsa, Harandi, Mehrtash, Lovell, Brian, and Salzmann, Mathieu. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the 2013 IEEE International Conference on Computer Vision, pp. 769–776, 2013.

Ben-David, Shai, Blitzer, John, Crammer, Koby, and Pereira, Fernando. Analysis of representations for domain adaptation. In NIPS, pp. 137–144, 2006.
Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Blitzer, John, McDonald, Ryan T., and Pereira, Fernando. Domain adaptation with structural correspondence learning. In Conference on Empirical Methods in Natural Language Processing, pp. 120–128, 2006.

Bruzzone, Lorenzo and Marconcini, Mattia. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787, 2010.

Chen, Minmin, Xu, Zhixiang Eddie, Weinberger, Kilian Q., and Sha, Fei. Marginalized denoising autoencoders for domain adaptation. In ICML, pp. 767–774, 2012.

Cortes, Corinna and Mohri, Mehryar. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103–126, 2014.

Germain, Pascal, Habrard, Amaury, Laviolette, François, and Morvant, Emilie. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pp. 738–746, 2013.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, pp. 513–520, 2011.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Huang, Fei and Yates, Alexander. Biased representation learning for domain adaptation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1313–1323, 2012.

Kifer, Daniel, Ben-David, Shai, and Gehrke, Johannes. Detecting change in data streams. In Very Large Data Bases, pp. 180–191, 2004.

Lacoste, Alexandre, Laviolette, François, and Marchand, Mario.
Bayesian comparison of machine learning algorithms on single and multiple datasets. In AISTATS, pp. 665–675, 2012.

Li, Yujia, Swersky, Kevin, and Zemel, Richard. Unsupervised domain adaptation by domain invariant projection. In NIPS 2014 Workshop on Transfer and Multitask Learning, 2014.

Liu, Anqi and Ziebart, Brian D. Robust classification under sample selection bias. In NIPS, pp. 37–45, 2014.

Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Domain adaptation: Learning bounds and algorithms. In COLT, 2009a.

Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Multiple source adaptation and the Rényi divergence. In UAI, pp. 367–374, 2009b.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.

Zemel, Richard S., Wu, Yu, Swersky, Kevin, Pitassi, Toniann, and Dwork, Cynthia. Learning fair representations. In ICML, pp. 325–333, 2013.
