Domain Generalization for Object Recognition with Multi-task Autoencoders

Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi
Victoria University of Wellington
{muhammad.ghifary,bastiaan.kleijn,mengjie.zhang}@ecs.vuw.ac.nz, david.balduzzi@vuw.ac.nz

Abstract

The problem of domain generalization is to take knowledge acquired from a number of related domains where training data is available, and to then successfully apply it to previously unseen domains. We propose a new feature learning algorithm, Multi-Task Autoencoder (MTAE), that provides good generalization performance for cross-domain object recognition.

Our algorithm extends the standard denoising autoencoder framework by substituting artificially induced corruption with naturally occurring inter-domain variability in the appearance of objects. Instead of reconstructing images from noisy versions, MTAE learns to transform the original image into analogs in multiple related domains. It thereby learns features that are robust to variations across domains. The learnt features are then used as inputs to a classifier. We evaluated the performance of the algorithm on benchmark image recognition datasets, where the task is to learn features from multiple datasets and to then predict the image label from unseen datasets. We found that (denoising) MTAE outperforms alternative autoencoder-based models as well as the current state-of-the-art algorithms for domain generalization.

1. Introduction

Recent years have seen dramatic advances in object recognition by deep learning algorithms [23, 11, 32]. Much of the increased performance derives from applying large networks to massive labeled datasets such as PASCAL VOC [14] and ImageNet [22].
Unfortunately, dataset bias – which can include factors such as backgrounds, camera viewpoints, and illumination – often causes algorithms to generalize poorly across datasets [35] and significantly limits their usefulness in practical applications. Developing algorithms that are invariant to dataset bias is therefore a compelling problem.

Problem definition. In object recognition, the "visual world" can be considered as decomposing into views (e.g., perspectives or lighting conditions) corresponding to domains. For example, frontal views and 45° rotated views correspond to two different domains. Alternatively, we can associate views or domains with standard image datasets such as PASCAL VOC2007 [14] and Office [31].

The problem of learning from multiple source domains and testing on unseen target domains is referred to as domain generalization [6, 26]. A domain is a probability distribution P_k from which samples {x_i, y_i}_{i=1}^{N_k} are drawn. Source domains provide training samples, whereas distinct target domains are used for testing. In the standard supervised learning framework, it is assumed that the source and target domains coincide. Dataset bias becomes a significant problem when training and test domains differ: applying a classifier trained on one dataset to images sampled from another typically results in poor performance [35, 18]. The goal of this paper is to learn features that improve generalization performance across domains.

Contribution. The challenge is to build a system that recognizes objects in previously unseen datasets, given one or multiple training datasets. We introduce the Multi-task Autoencoder (MTAE), a feature learning algorithm that uses a multi-task strategy [8, 34] to learn unbiased object features, where the task is data reconstruction.
Autoencoders were introduced to address the problem of 'backpropagation without a teacher' by using inputs as labels – and learning to reconstruct them with minimal distortion [28, 5]. Denoising autoencoders in particular are a powerful basic circuit for unsupervised representation learning [36]. Intuitively, corrupting inputs forces autoencoders to learn representations that are robust to noise.

This paper proposes a broader view: that autoencoders are generic circuits for learning invariant features. The main contribution is a new training strategy based on naturally occurring transformations such as rotations in viewing angle, dilations in apparent object size, and shifts in lighting conditions. The resulting Multi-Task Autoencoder learns features that are robust to real-world image variability, and therefore generalize well across domains. Extensive experiments show that MTAE with a denoising criterion outperforms the prior state-of-the-art in domain generalization over various cross-dataset recognition tasks.

2. Related work

Domain generalization has recently attracted attention in classification tasks, including automatic gating of flow cytometry data [6, 26] and object recognition [16, 21, 38]. Khosla et al. [21] proposed a multi-task max-margin classifier, which we refer to as Undo-Bias, that explicitly encodes dataset-specific biases in feature space. These biases are used to push the dataset-specific weights to be similar to the global weights. Fang et al. [16] developed Unbiased Metric Learning (UML) based on a learning-to-rank framework. Validated on weakly-labeled web images, UML produces a less biased distance metric that provides good object recognition performance. More recently, Xu et al. [38] extended an exemplar-SVM to domain generalization by adding a nuclear norm-based regularizer that captures the likelihoods of all positive samples.
The proposed model is denoted by LRE-SVM.

Other works in object recognition address a similar problem, in the sense of having unknown targets, where the unseen dataset contains noisy images that are not in the training set [17, 33]. However, these were designed to be noise-specific and may suffer from dataset bias when observing objects with different types of noise.

A closely related task to domain generalization is domain adaptation, where unlabeled samples from the target dataset are available during training. Many domain adaptation algorithms have been proposed for object recognition (see, e.g., [2, 31]). Domain adaptation algorithms are not readily applicable to domain generalization, since no information is available about the target domain.

Our proposed algorithm is based on the feature learning approach. Feature learning has been of great interest in the machine learning community since the emergence of deep learning (see [4] and references therein). To the best of our knowledge, while some feature learning methods have been successfully applied to domain adaptation or transfer learning applications [9, 13], there is no prior work along these lines on the more difficult problem of domain generalization, i.e., creating useful representations without observing the target domain.

3. The Proposed Approach

Our goal is to learn features that provide good domain generalization. To do so, we extend the autoencoder [7] into a model that jointly learns multiple data-reconstruction tasks taken from related domains. Our strategy is motivated by prior work demonstrating that learning from multiple related tasks can improve performance on a novel, yet related, task – relative to methods trained on a single task [1, 3, 8, 34].

3.1. Autoencoders

Autoencoders (AE) have become established as a pre-training model for deep learning [5]. Autoencoder training consists of two stages: 1) encoding and 2) decoding.
Given an unlabeled input x ∈ R^{d_x}, a single hidden layer autoencoder f_Θ(x): R^{d_x} → R^{d_x} can be formulated as

    h = σ_enc(W^⊤ x),    x̂ = σ_dec(V^⊤ h) = f_Θ(x),    (1)

where W ∈ R^{d_x × d_h} and V ∈ R^{d_h × d_x} are the input-to-hidden and hidden-to-output connection weights,¹ respectively, h ∈ R^{d_h} is the hidden node vector, and σ_enc(·) = [s_enc(z_1), ..., s_enc(z_{d_h})]^⊤ and σ_dec(·) = [s_dec(z_1), ..., s_dec(z_{d_x})]^⊤ are element-wise non-linear activation functions; s_enc and s_dec are not necessarily identical. Popular choices for the activation function s(·) are, e.g., the sigmoid s(a) = (1 + exp(−a))^{−1} and the rectified linear unit (ReLU) s(a) = max(0, a).

Let Θ = {W, V} be the autoencoder parameters and {x_i}_{i=1}^{N} be a set of N input data points. Learning corresponds to minimizing the objective

    Θ̂ := arg min_Θ Σ_{i=1}^{N} L(f_Θ(x_i), x_i) + η R(Θ),    (2)

where L(·, ·) is the loss function, usually a least-squares or cross-entropy loss, and R(·) is a regularization term used to avoid overfitting. The objective (2) can be optimized by the backpropagation algorithm [29]. If we apply autoencoders to raw pixels of visual object images, the weights W usually form visually meaningful "filters" that can be interpreted qualitatively.

To create a discriminative model using the learnt autoencoder model, either of the following options can be considered: 1) the feature map φ(x) := σ_enc(Ŵ^⊤ x) is extracted and used as an input to supervised learning algorithms while keeping the weight matrix Ŵ fixed; 2) the learnt weight matrix Ŵ is used to initialize a neural network model and is updated during the supervised neural network training (fine-tuning).
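As a concrete illustration, the encoder-decoder pass of Eq. (1) and a plain stochastic gradient step on the least-squares form of Eq. (2) can be sketched as follows. This is a minimal NumPy sketch with sigmoid activations, not the authors' implementation; bias terms are omitted, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Autoencoder:
    """Single-hidden-layer autoencoder (Eq. 1); bias terms omitted."""

    def __init__(self, d_x, d_h, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.01, size=(d_x, d_h))  # input-to-hidden
        self.V = rng.normal(0.0, 0.01, size=(d_h, d_x))  # hidden-to-output

    def forward(self, x):
        h = sigmoid(self.W.T @ x)       # encoding stage
        x_hat = sigmoid(self.V.T @ h)   # decoding stage
        return h, x_hat

    def sgd_step(self, x, lr=0.1, eta=3e-4):
        """One SGD step on 0.5*||x_hat - x||^2 + eta*||Θ||^2 (Eq. 2)."""
        h, x_hat = self.forward(x)
        # backprop through the two sigmoid layers
        d_out = (x_hat - x) * x_hat * (1.0 - x_hat)   # grad at output pre-activation
        d_hid = (self.V @ d_out) * h * (1.0 - h)      # grad at hidden pre-activation
        self.V -= lr * (np.outer(h, d_out) + eta * self.V)
        self.W -= lr * (np.outer(x, d_hid) + eta * self.W)
        return 0.5 * np.sum((x_hat - x) ** 2)
```

Repeated calls to `sgd_step` on the same input drive the reconstruction loss down, mirroring how the objective (2) is optimized by backpropagation.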
Recently, several variants such as denoising autoencoders (DAE) [37] and contractive autoencoders (CAE) [27] have been proposed to extract features that are more robust to small changes of the input. In DAEs, the objective is to reconstruct a clean input x given its corrupted counterpart x̃ ∼ Q(x̃|x). Commonly used types of corruption are zero-masking, Gaussian, and salt-and-pepper noise. Features extracted by DAE have been proven to be more discriminative than ones extracted by AE [37].

¹While the bias terms are incorporated in our experiments, they are intentionally omitted from the equations for the sake of simplicity.

Figure 1. The Multi-task Autoencoder (MTAE) architecture, which consists of three layers with multiple separated outputs; each output corresponds to one task/domain.

3.2. Multi-task Autoencoders

We refer to our proposed domain generalization algorithm as the Multi-task Autoencoder (MTAE). From an architectural viewpoint, MTAE is an autoencoder with multiple output layers; see Fig. 1. The input-hidden weights represent shared parameters and the hidden-output weights represent domain-specific parameters. The architecture is similar to the supervised multi-task neural networks proposed by Caruana [8]. The main difference is that the output layers of MTAE correspond to different domains instead of different class labels.

The most important component of MTAE is the training strategy, which constructs a generalized denoising autoencoder that learns invariances to naturally occurring transformations. Denoising autoencoders focus on the special case where the transformation is simply noise. In contrast, MTAE training treats a specific perspective on an object as the "corrupted" counterpart of another perspective (e.g., a rotated digit 6 is the noisy pair of the original digit).
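The three DAE corruption types named above (zero-masking, Gaussian, salt-and-pepper) can be sketched as samplers from Q(x̃|x). The function name and the `level` parameter are illustrative choices, not from the paper; `level` is the fraction of affected pixels (or the noise standard deviation in the Gaussian case).

```python
import numpy as np

def corrupt(x, kind="zero-mask", level=0.3, rng=None):
    """Sample a corrupted input x~ ~ Q(x~|x) for denoising training (sketch)."""
    rng = rng or np.random.default_rng()
    x = x.copy()  # leave the clean input untouched
    if kind == "zero-mask":
        # set a random subset of pixels to zero
        x[rng.random(x.shape) < level] = 0.0
    elif kind == "gaussian":
        # additive isotropic Gaussian noise with std = level
        x = x + rng.normal(0.0, level, size=x.shape)
    elif kind == "salt-and-pepper":
        # flip a random subset of pixels to the min (0) or max (1) value
        mask = rng.random(x.shape) < level
        x[mask] = rng.choice([0.0, 1.0], size=int(mask.sum()))
    else:
        raise ValueError(f"unknown corruption kind: {kind}")
    return x
```

The clean input remains the reconstruction target; only the encoder's input is corrupted.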
The autoencoder objective is then reformulated along the lines of multi-task learning: the model aims to jointly achieve good reconstruction of all source views given a particular view. For example, applying the strategy to handwritten digit images with several views, MTAE learns representations that are invariant across the source views; see Section 4.

Two types of reconstruction tasks are performed during MTAE training: 1) self-domain reconstruction and 2) between-domain reconstruction. Given M source domains, there are M × M reconstruction tasks, of which M tasks are self-domain reconstructions and the remaining M × (M − 1) tasks are between-domain reconstructions. Note that the self-domain reconstruction is identical to the standard autoencoder reconstruction (1).

Formal description. Let {x_i^l}_{i=1}^{n_l} be a set of d_x-dimensional data points in the l-th domain, where l ∈ {1, ..., M}. Each domain's data points are combined into a matrix X^l ∈ R^{n_l × d_x}, where x_i^{l⊤} is its i-th row, such that (x_i^1, x_i^2, ..., x_i^M) form a category-level correspondence. This configuration enforces the number of samples in a category to be the same in every domain. Note that such a configuration is necessary to ensure that the between-domain reconstruction works (we discuss how to handle the case of unbalanced samples in Section 3.3). The input and output pairs used to train MTAE can then be written as the concatenated matrices

    X̄ = [X^1; X^2; ...; X^M],    X̄^l = [X^l; X^l; ...; X^l],    (3)

where X̄, X̄^l ∈ R^{N × d_x} and N = Σ_{l=1}^{M} n_l. In words, X̄ is the matrix of data points taken from all domains and X̄^l is the matrix of replicated datasets taken from the l-th domain. The replication imposed in X̄^l constructs input-output pairs for the autoencoder learning algorithm. In practice, the algorithm can be implemented efficiently – without replicating the matrix in memory.
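Putting the pairing of Eq. (3) together with the M × M reconstruction tasks, one epoch of MTAE-style training might look like the following sketch: a shared encoder W, one decoder V[l] per domain, sigmoid activations, and a least-squares loss. Pairs are iterated instead of materializing X̄ and X̄^l, and a `corruption` option emulates the denoising variant via zero-masking. This is an illustrative sketch under the balanced-domains assumption, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mtae_epoch(domains, W, V, lr=0.03, eta=3e-4, corruption=0.0, rng=None):
    """One epoch over all M x M reconstruction tasks; returns the summed loss.

    `domains` is a list of M arrays of shape (n, d_x) whose rows are in
    category-level correspondence, W is the shared (d_x, d_h) encoder, and
    V is a list of M (d_h, d_x) domain-specific decoders (sketch).
    """
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for l in range(len(domains)):        # task: reconstruct view l ...
        for X_k in domains:              # ... from every source view k
            for i in range(X_k.shape[0]):
                x, target = X_k[i], domains[l][i]
                if corruption > 0.0:     # zero-masking (denoising variant)
                    x = x * (rng.random(x.shape) > corruption)
                h = sigmoid(W.T @ x)
                out = sigmoid(V[l].T @ h)
                d_out = (out - target) * out * (1.0 - out)
                d_hid = (V[l] @ d_out) * h * (1.0 - h)
                V[l] -= lr * (np.outer(h, d_out) + eta * V[l])
                W -= lr * (np.outer(x, d_hid) + eta * W)
                total += 0.5 * np.sum((out - target) ** 2)
    return total
```

Note that when k = l the pair reduces to the standard self-domain reconstruction of Eq. (1), while k ≠ l gives the between-domain tasks.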
We now describe MTAE more formally. Let x̄_i^⊤ and x̄_i^{l⊤} be the i-th rows of the matrices X̄ and X̄^l, respectively. The feed-forward MTAE reconstruction is

    h_i = σ_enc(W^⊤ x̄_i),    f_{Θ^{(l)}}(x̄_i) = σ_dec(V^{(l)⊤} h_i),    (4)

where Θ^{(l)} = {W, V^{(l)}} contains the matrices of shared and individual weights, respectively.

MTAE training is achieved as follows. Let us define the loss function summed over the data points:

    J(Θ^{(l)}) = Σ_{i=1}^{N} L(f_{Θ^{(l)}}(x̄_i), x̄_i^l).    (5)

Given M domains, training MTAE corresponds to minimizing the objective

    Θ̂^{(l)} := arg min_{Θ^{(l)}} Σ_{l=1}^{M} J(Θ^{(l)}) + η R(Θ^{(l)}),    (6)

where R(Θ^{(l)}) is a regularization term. In this work, we use the standard l_2-norm weight penalty R(Θ^{(l)}) = ||W||_2^2 + ||V^{(l)}||_2^2. Stochastic gradient descent is applied on each reconstruction task to achieve the objective (6). Once training is completed, the optimal shared weights Ŵ are obtained. The stopping criterion is determined empirically by monitoring the average loss over all reconstruction tasks during training – the process is stopped when the average loss stabilizes. The detailed steps of MTAE training are summarized in Algorithm 1.

The training protocol can be supplemented with a denoising criterion as in [37] to induce more noise-robust features. To do so, simply replace x̄_i in (4) with its corrupted counterpart x̃̄_i ∼ Q(x̃̄_i|x̄_i). We name the MTAE model obtained after applying the denoising criterion the Denoising Multi-task Autoencoder (D-MTAE).

Algorithm 1: The MTAE feature learning algorithm.
Input:
• Data matrices based on (3): X̄ and X̄^l, ∀l ∈ {1, ..., M};
• Source labels: {y_i^l}_{i=1}^{n_l}, ∀l ∈ {1, ..., M};
• The learning rate: α;

1:  Initialize W ∈ R^{d_x × d_h} and V^{(l)} ∈ R^{d_h × d_x}, ∀l ∈ {1, ..., M}, with small random real values;
2:  while not end of epoch do
3:    Run RAND-SEL as described in Section 3.3 to balance the number of samples per category in X̄ and X̄^l;
4:    for l = 1 to M do
5:      for each row of X̄ do
6:        Do a forward pass based on (4);
7:        Update W and V^{(l)} towards the objective (6) with the rules
            V_{ij}^{(l)} ← V_{ij}^{(l)} − α ∂J({W, V^{(l)}})/∂V_{ij}^{(l)},
            W_{ij} ← W_{ij} − α ∂J({W, V^{(l)}})/∂W_{ij};
8:      end for
9:    end for
10: end while

Output:
• MTAE learnt weights: Ŵ and V̂^{(l)}, ∀l ∈ {1, ..., M};

3.3. Handling unbalanced samples per category

MTAE requires that every instance in a particular domain has a category-level corresponding pair in every other domain. MTAE's apparent applicability is therefore limited to situations where the number of source samples per category is the same in every domain. However, unbalanced samples per category occur frequently in applications. To overcome this issue, we propose a simple random selection procedure applied in the between-domain reconstructions, denoted by RAND-SEL, which balances the samples per category while keeping their category-level correspondence.

In detail, the RAND-SEL strategy is as follows. Let m_c be the number of subsamples in the c-th category, where m_c = min(n_c^1, n_c^2, ..., n_c^M) and n_c^l is the number of samples in the c-th category of domain l ∈ {1, ..., M}. For each category c and each domain l, select m_c samples randomly such that n_c^1 = n_c^2 = ... = n_c^M = m_c. This procedure is executed in every iteration of the MTAE algorithm; see Line 3 of Algorithm 1.
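The RAND-SEL balancing step (Line 3 of Algorithm 1) can be sketched as follows, assuming the data are organized per domain and per category. The dictionary layout and function name are illustrative assumptions, not from the paper.

```python
import numpy as np

def rand_sel(domains_by_cat, rng=None):
    """RAND-SEL (Section 3.3): per category c, subsample every domain down to
    m_c = min over domains of that category's count, preserving the
    category-level correspondence. `domains_by_cat[l][c]` is an (n_lc, d_x)
    array of the samples of category c in domain l (sketch)."""
    rng = rng or np.random.default_rng()
    M = len(domains_by_cat)
    balanced = [dict() for _ in range(M)]
    for c in domains_by_cat[0]:
        # m_c: smallest per-domain count for category c
        m_c = min(domains_by_cat[l][c].shape[0] for l in range(M))
        for l in range(M):
            # draw m_c samples without replacement from domain l, category c
            idx = rng.choice(domains_by_cat[l][c].shape[0], size=m_c, replace=False)
            balanced[l][c] = domains_by_cat[l][c][idx]
    return balanced
```

Re-running this at every epoch (as in Line 3) means different random subsamples participate in the between-domain reconstructions over the course of training.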
4. Experiments and Results

We conducted experiments on several real-world object datasets to evaluate the domain generalization ability of our proposed system. In Section 4.1, we investigate the behaviour of MTAE in comparison to standard single-task autoencoder models on raw pixels as a proof of principle. In Section 4.2, we evaluate the performance of MTAE against several state-of-the-art algorithms on modern object datasets such as Office [31], Caltech [20], PASCAL VOC2007 [14], LabelMe [30], and SUN09 [10].

4.1. Cross-recognition on the MNIST and ETH-80 datasets

In this part, we aim to understand MTAE's behavior when learning from multiple domains that form physically reasonable object transformations such as roll rotation, pitch rotation, and dilation. The task is to categorize objects in views (domains) that were not presented during training. We evaluate MTAE against several autoencoder models. To perform the evaluation, a variety of object views were constructed from the MNIST handwritten digit [24] and ETH-80 object [25] datasets.

Data setup. We created four new datasets from MNIST and ETH-80 images: 1) MNIST-r, 2) MNIST-s, 3) ETH80-p, and 4) ETH80-y. These new sets contain multiple domains such that every instance in one domain has a pair in another domain. The detailed setting for each dataset is as follows.

MNIST-r contains six domains, each corresponding to a degree of roll rotation. We randomly chose 1000 digit images of ten classes from the original MNIST training set to represent the basic view, i.e., 0 degrees of rotation;² each class has 100 images. Each image was subsampled to a 16 × 16 representation to simplify the computation. This subset of 1000 images is denoted by M. We then created 5 rotated views from M at 15° increments in the counterclockwise direction, denoted by M15°, M30°, M45°, M60°, and M75°.
MNIST-s is the counterpart of MNIST-r, where each domain corresponds to a dilation factor. The views are denoted by M, M∗0.9, M∗0.8, M∗0.7, and M∗0.6, where the subscripts represent the dilation factors with respect to M.

ETH80-p consists of eight object classes with 10 subcategories for each class. In each subcategory, there are 41 different views with respect to pose angles. We took five views from each class, denoted by Ep0°, Ep22°, Ep45°, Ep68°, and Ep90°, which represent the horizontal poses, i.e., pitch-rotated views starting from the top view to the side view. This makes the number of instances only 80 for each view. We then greyscaled and subsampled the images to 28 × 28. ETH80-y contains five views of ETH-80 representing the vertical poses, i.e., yaw-rotated views starting from the right-side view to the left-side view, denoted by E+y90°, E+y45°, Ey0°, E−y45°, and E−y90°. Other settings, such as the image dimensionality and preprocessing stage, are similar to ETH80-p. Examples of the resulting views are depicted in Fig. 2.

²Note that the rotation angle of the basic view is not perfectly 0° since the original MNIST images have varying appearances.

Table 1. The leave-one-domain-out classification accuracies (%) on MNIST-r and MNIST-s. Bold-red and bold-black (in the original) indicate the best and second best performance.

MNIST-r: leave-one-roll-rotation-out
Source                          Target   Raw    AE     DAE    CAE    uDICA  MTAE   D-MTAE
M15°, M30°, M45°, M60°, M75°    M        52.40  74.20  76.90  72.10  67.20  77.90  82.50
M, M30°, M45°, M60°, M75°       M15°     74.10  93.20  93.20  95.30  87.80  95.70  96.30
M, M15°, M45°, M60°, M75°       M30°     71.40  89.90  91.30  92.60  88.80  91.20  93.40
M, M15°, M30°, M60°, M75°       M45°     61.40  82.20  81.10  81.50  77.80  77.30  78.60
M, M15°, M30°, M45°, M75°       M60°     67.40  90.00  92.80  92.70  84.20  92.40  94.20
M, M15°, M30°, M45°, M60°       M75°     55.40  73.80  76.50  79.30  69.50  79.90  80.50
Average                                  63.68  83.88  85.30  85.58  79.22  85.73  87.58

MNIST-s: leave-one-dilation-out
M∗0.9, M∗0.8, M∗0.7, M∗0.6      M        54.00  67.50  71.80  75.80  75.80  74.50  76.00
M, M∗0.8, M∗0.7, M∗0.6          M∗0.9    80.40  95.10  94.00  94.90  88.60  97.80  98.00
M, M∗0.9, M∗0.7, M∗0.6          M∗0.8    82.60  94.60  92.90  94.90  86.60  96.30  96.40
M, M∗0.9, M∗0.8, M∗0.6          M∗0.7    78.20  93.70  91.60  92.50  87.40  95.80  94.90
M, M∗0.9, M∗0.8, M∗0.7          M∗0.6    64.70  74.80  76.10  77.50  75.30  78.00  78.30
Average                                  71.98  85.14  85.28  87.12  82.74  88.48  88.72

Figure 2. Some image examples from MNIST-r ((a) M, (b) M15°, (c) M30°, (d) M45°, (e) M60°, (f) M75°), MNIST-s ((g) M, (h) M∗0.9, (i) M∗0.8, (j) M∗0.7, (k) M∗0.6), and ETH80-p ((l) Ep0°, (m) Ep22°, (n) Ep45°, (o) Ep68°, (p) Ep90°).

Baselines. We compared the classification performance of our models with several single-task autoencoder models. Descriptions of the methods and their hyperparameter settings are provided below.

• AE [5]: the standard autoencoder model trained by stochastic gradient descent, where all object views were concatenated as one set of inputs. The number of hidden nodes was fixed at 500 on the MNIST dataset and at 1000 on the ETH-80 dataset. The learning rate, weight decay penalty, and number of iterations were empirically determined as 0.1, 3 × 10^{−4}, and 100, respectively.
• DAE [37]: the denoising autoencoder with zero-masking noise, where all object views were concatenated as one set of input data. The corruption level was fixed at 30% for all cases. Other hyper-parameter values were identical to AE.
• CAE [27]: the autoencoder model with Jacobian matrix norm regularization, referred to as the contractive autoencoder. The corresponding regularization constant λ was set at 0.1.
• MTAE: our proposed multi-task autoencoder model with hyper-parameter settings identical to AE, except for the learning rate, set at 0.03, which was also chosen empirically. This value provides a lower reconstruction error for each task and visually clearer first-layer weights.
• D-MTAE: MTAE with a denoising criterion. The learning rate was set the same as for MTAE; other hyper-parameters followed DAE.

We also evaluated unsupervised Domain-Invariant Component Analysis (uDICA) [26] on these datasets for completeness. The hyper-parameters were tuned using 10-fold cross-validation on the source domains. We also ran experiments using the supervised variant, DICA, with the same tuning strategy. Surprisingly, the peak performance of uDICA is consistently higher than that of DICA. A possible explanation is that the Dirac kernel function measuring label similarity is less appropriate in this application.

We normalized the raw pixels to the range [0, 1] for autoencoder-based models and to the l_2-unit ball for uDICA. We evaluated the classification accuracies of the learnt features using a multi-class SVM with linear kernel (L-SVM) [12]. Using a linear kernel keeps the classifier simple – our main focus is on the feature extraction process. The LIBLINEAR package [15] was used to run the L-SVM.

Cross-domain recognition results. We evaluated the object classification accuracies of each algorithm by the leave-one-domain-out test, i.e., taking one domain as the test set and the remaining domains as the training set. For all autoencoder-based algorithms, we repeated the experiments on each leave-one-domain-out case 30 times and report the average accuracies. The standard deviations are not reported since they are small (±0.1).
The detailed results on MNIST-r and MNIST-s can be seen in Table 1. On average, MTAE has the second best classification accuracies, and in particular outperforms the single-task autoencoder models. This indicates that the multi-task feature learning strategy can provide features that are more discriminative with respect to unseen object views than single-task feature learning.

The algorithm with the best performance on these datasets is D-MTAE. Specifically, D-MTAE performs best on average and also on 9 out of 11 individual cross-domain cases of MNIST-r and MNIST-s. The closest single-task feature learning competitor to D-MTAE is CAE. This suggests that the denoising criterion strongly benefits domain generalization. The denoising criterion is also useful for single-task feature learning, although it does not yield competitive accuracies; see the AE and DAE performance. We also observe a consistent trend on the ETH80-p and ETH80-y datasets, i.e., D-MTAE and MTAE are the best and second best models. In detail, D-MTAE and MTAE produce average accuracies of 87.85% and 87.50% on ETH80-p, and 97% and 96.50% on ETH80-y.

Observe that there is an anomaly in the MNIST-r dataset: the performance on M45° is far worse than on its neighbors (M30°, M60°). This anomaly appears to be related to the geometry of the MNIST-r digits. We found that the most frequently misclassified digits on M45° are 4, 6, and 9 – typically 4 as 9, 6 as 4, and 9 as 8 – which rarely occurs on the other MNIST-r domains. The same phenomenon applies to L-SVM.

Weight visualization. Useful insight is obtained by considering the qualitative outcome of MTAE training, visualized via the first-layer weights. Figure 4 depicts the weights of some autoencoder models, including ours, on the MNIST-r dataset.
Both the MTAE and D-MTAE weights form "filters" that tend to capture the underlying transformation across the MNIST-r views, which is the rotation. This effect is unseen in AE and DAE, the filters of which only explain the contents of handwritten digits in the form of Fourier component-like descriptors such as local blob detectors and stroke detectors [37]. This might be a reason that MTAE and D-MTAE features provide better domain generalization than AE and DAE features, since they implicitly capture the relationship among the source domains.

Next we discuss the difference between the MTAE and D-MTAE filters. The D-MTAE filters not only capture the object transformation, but also produce features that describe the object contents more distinctively. These filters basically combine the properties of both the DAE and MTAE filters, which might benefit domain generalization.

Invariance analysis. A possible explanation for the effectiveness of MTAE relates to the dimensionality of the manifold in feature space where samples concentrate. We hypothesize that if features concentrate near a low-dimensional submanifold, then the algorithm has found simple invariant features and will generalize well. To test the hypothesis, we examine the singular value spectrum of the Jacobian matrix J_x(z) = [∂z_i/∂x_j]_{ij}, where x and z are the input and feature vectors, respectively [27]. The spectrum describes the local dimensionality of the manifold around which samples concentrate. If the spectrum decays rapidly, then the manifold is locally of low dimension.

Figure 3 depicts the average singular value spectrum of the Jacobian matrix on test samples from the MNIST-r and MNIST-s datasets. The spectrum of D-MTAE decays the most rapidly, followed by MTAE and then DAE (with similar rates), with AE decaying the slowest.
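For a sigmoid encoder z = σ_enc(W^⊤x), the Jacobian examined above has the closed form diag(z ⊙ (1 − z)) W^⊤, so its singular value spectrum can be computed directly; a small sketch (with NumPy, assuming a sigmoid encoder as in our experiments):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jacobian_spectrum(W, x):
    """Singular values (descending) of J_x(z) = [dz_i/dx_j] for z = sigmoid(W^T x).

    dz_i/dx_j = z_i * (1 - z_i) * W_ji, i.e. J = diag(z*(1-z)) @ W.T.
    Rapid decay of the spectrum indicates that features concentrate near a
    locally low-dimensional manifold."""
    z = sigmoid(W.T @ x)                 # feature vector, shape (d_h,)
    J = (z * (1.0 - z))[:, None] * W.T   # Jacobian, shape (d_h, d_x)
    return np.linalg.svd(J, compute_uv=False)
```

Averaging these spectra over test samples gives curves of the kind shown in Figure 3.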
The ranking of the decay rates of the four algorithms matches their ranking in terms of empirical performance in Table 1. Figure 3 thus provides partial confirmation for our hypothesis. However, a more detailed analysis is necessary before drawing strong conclusions.

4.2. Cross-recognition on the Office, Caltech, VOC2007, LabelMe, and SUN09 datasets

In the second set of experiments, we evaluated the cross-recognition performance of the proposed algorithms on modern object datasets. The aim is to show that MTAE and D-MTAE are applicable and competitive in this more general setting. We used the Office, Caltech, PASCAL VOC2007, LabelMe, and SUN09 datasets, from which we formed two cross-domain datasets. Our general strategy is to extend the generalization of features extracted from the current best deep convolutional neural network [23].

Data setup. The first cross-domain dataset consists of images from the PASCAL VOC2007 (V), LabelMe (L), Caltech-101 (C), and SUN09 (S) datasets, each of which represents one domain. C is an object-centric dataset, while V, L, and S are scene-centric. This dataset, which we abbreviate as VLCS, shares five object categories: 'bird', 'car', 'chair', 'dog', and 'person'. Each domain in the VLCS dataset was divided into a training set (70%) and a test set (30%) by random selection from the overall dataset. The detailed training-test configuration for each domain is summarized in Table 2. Instead of using the raw features directly, we employed the DeCAF_6 features [13] as inputs to the algorithms. These features have dimensionality 4,096 and are publicly available.³

Figure 4. The 2D visualization of 100 randomly chosen weights of (a) AE, (b) DAE, (c) MTAE, and (d) D-MTAE after pretraining on the MNIST-r dataset. Each patch corresponds to a row of the learnt weight matrix W that represents a "filter". A weight value w_ij ≥ 3 is depicted in white, w_ij ≤ −3 in black, and gray otherwise.
The second cross-domain dataset is referred to as the Office+Caltech [31, 19] dataset; it contains four domains: Amazon (A), Webcam (W), DSLR (D), and Caltech-256 (C), which share ten common categories. This dataset has 8 to 151 instances per category per domain, and 2,533 instances in total. We also used the DeCAF_6 features extracted from this dataset, which are also publicly available.⁴

Table 2. The number of training and test instances for each domain in the VLCS dataset.

Domain     VOC2007  LabelMe  Caltech-101  SUN09
#training  2,363    1,859    991          2,297
#test      1,013    797      424          985

Table 3. The groundtruth L-SVM accuracies (%) in the standard training-test evaluation. The left-most column indicates the training set; the upper-most row indicates the test set.

Training/Test  VOC2007  LabelMe  Caltech-101  SUN09
VOC2007        66.34    34.50    65.09        52.49
LabelMe        44.03    68.76    43.87        41.02
Caltech-101    52.81    32.37    95.99        39.29
SUN09          52.42    42.03    40.33        74.21

Training protocol. On these datasets, we utilized MTAE or D-MTAE learning as pretraining for a fully-connected neural network with one hidden layer (1HNN). The number of hidden nodes was set at 2,000, which is less than the input dimensionality. In the pretraining stage, the number of output layers was the same as the number of source domains – each corresponds to a particular source domain. The sigmoid activation and linear activation functions were used for σ_enc(·) and σ_dec(·), respectively.

The MTAE pretraining was run with the learning rate at 5 × 10^{−4}, the number of epochs at 500, and the batch size at 10, which were determined empirically with respect to the smallest average reconstruction loss. D-MTAE has the same hyper-parameter setting as MTAE except for an additional zero-masking corruption level of 20%.

³http://www.cs.dartmouth.edu/~chenfang/proj_page/FXR_iccv13/index.php
⁴http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/
After the pretraining is completed, we performed back-propagation fine-tuning using 1HNN with a softmax output, where the first-layer weights were initialized with the MTAE or D-MTAE learnt weights. The supervised learning hyper-parameters were tuned using 10-fold cross-validation (10FCV) on the source domains. We denote the overall models by MTAE+1HNN and D-MTAE+1HNN.

Baselines. We compared our proposed models with six baselines:

1. L-SVM: an SVM with a linear kernel.
2. 1HNN: a single-hidden-layer neural network without pretraining.
3. DAE+1HNN: a two-layer neural network with denoising autoencoder pretraining.
4. Undo-Bias [21]: a multi-task SVM-based algorithm for undoing dataset bias. Three hyper-parameters (λ, C1, C2) require tuning by 10FCV.
5. UML [16]: a structural metric-learning algorithm that aims to learn a less biased distance metric for classification tasks. The original tuning proposal for this method used a set of weakly-labeled data retrieved by querying class labels on a search engine. Here, however, we tuned the hyper-parameters using the same strategy as for the other methods (10FCV), for a fair comparison.
6. LRE-SVM [38]: a non-linear exemplar-SVM model with nuclear-norm regularization to impose a low-rank likelihood matrix. Four hyper-parameters (λ1, λ2, C1, C2) were tuned using 10FCV.

The last three are state-of-the-art domain generalization algorithms for object recognition.

Table 4. The cross-recognition accuracy (%) on the VLCS dataset.

Source  Target  L-SVM  1HNN   DAE+1HNN  Undo-Bias  UML    LRE-SVM  MTAE+1HNN  D-MTAE+1HNN
L,C,S   V       58.86  59.10  62.00     54.29      56.26  60.58    61.09      63.90
V,C,S   L       52.49  58.20  59.23     58.09      58.50  59.74    59.24      60.13
V,L,S   C       77.67  86.67  90.24     87.50      91.13  88.11    90.71      89.05
V,L,C   S       49.09  57.86  57.45     54.21      58.49  54.88    60.20      61.33
Avg.            59.93  65.46  67.23     63.52      65.85  65.83    67.81      68.60

Table 5.
The cross-recognition accuracy (%) on the Office+Caltech dataset.

Source  Target  L-SVM  1HNN   DAE+1HNN  Undo-Bias  UML    LRE-SVM  MTAE+1HNN  D-MTAE+1HNN
A,C     D,W     82.08  83.41  82.05     80.49      82.29  84.59    84.23      85.35
D,W     A,C     76.12  76.49  79.04     69.98      79.54  81.17    79.30      80.52
C,D,W   A       90.61  92.13  92.02     90.98      91.02  91.87    92.20      93.13
A,W,D   C       84.51  85.89  85.17     85.95      84.59  86.38    85.98      86.15
Avg.            83.33  84.48  84.70     81.85      84.36  86.00    85.43      86.29

We report performance in terms of classification accuracy (%), following Xu et al. [38]. For all algorithms that are optimized stochastically, we ran ten independent training processes using the best-performing hyper-parameters and report the average accuracies. As in the previous experiment, we do not report the standard deviations because of their small values (±0.2).

Results on the VLCS dataset. We first conducted the standard training-test evaluation using L-SVM, i.e., learning the model on a training set from one domain and testing it on a test set from another domain, to establish the ground-truth performance and to verify the existence of dataset bias. The performance is summarized in Table 3. We can see that the bias indeed exists in every domain despite the use of DeCAF6, the sixth-layer features of the state-of-the-art deep convolutional neural network. The gap between the best cross-domain performance and the ground truth is large, with a difference of ≥ 14%.

We then evaluated the domain generalization performance of each algorithm. We conducted a leave-one-domain-out evaluation, which induces four cross-domain cases. The complete recognition results are shown in Table 4.
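The leave-one-domain-out protocol can be expressed generically: each domain in turn serves as the unseen target, and the model is trained on the union of the remaining source domains. The sketch below assumes a scikit-learn-style fit/predict classifier interface; the function name and the domain dictionary are illustrative, not part of the paper.

```python
import numpy as np

def leave_one_domain_out(domains, make_classifier):
    """Evaluate domain generalization: train on all domains but one, test on it.

    domains maps a domain name to an (X, y) pair; make_classifier returns a
    fresh model with fit(X, y) and predict(X) methods.  Returns the accuracy
    for each held-out target domain."""
    accuracies = {}
    for target in domains:
        sources = [name for name in domains if name != target]
        # Pool all source-domain data into one training set
        X_train = np.vstack([domains[name][0] for name in sources])
        y_train = np.concatenate([domains[name][1] for name in sources])
        X_test, y_test = domains[target]       # held-out, unseen target domain
        clf = make_classifier()
        clf.fit(X_train, y_train)
        accuracies[target] = float(np.mean(clf.predict(X_test) == y_test))
    return accuracies
```

On VLCS this induces the four cases of Table 4 (e.g. training on L, C, S and testing on V).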
In general, the dataset bias can be reduced by all algorithms after learning from multiple source domains (compare, e.g., the minimum accuracy over the first row of Table 4, with V as the target, to the maximum cross-recognition accuracy in the VOC2007 column of Table 3). Furthermore, Caltech-101, which is object-centric, appears to be the easiest dataset to recognize, consistent with an investigation in [35]: scene-centric datasets tend to generalize well to object-centric datasets. Surprisingly, 1HNN already achieves accuracy competitive with the more complicated state-of-the-art algorithms Undo-Bias, UML, and LRE-SVM. Furthermore, D-MTAE outperforms the other algorithms on three of the four cross-domain cases and on average, while MTAE has the second-best average performance.

Results on the Office+Caltech dataset. Table 5 summarizes the recognition accuracies of each algorithm over four cross-domain cases. D-MTAE+1HNN has the best performance on two of the four cross-domain cases and ranks second on the remaining two. On average, D-MTAE+1HNN outperforms the prior state of the art on this dataset, LRE-SVM [38].

5. Conclusions

We have proposed a new approach to multi-task feature learning that reduces dataset bias in object recognition. The main idea is to extract features shared across domains via a training protocol that, given an image from one domain, learns to reconstruct analogs of that image for all domains. The strategy yields two variants: the Multi-task Autoencoder (MTAE) and the Denoising MTAE (D-MTAE), which incorporates a denoising criterion. A comprehensive suite of cross-domain object recognition evaluations shows that the algorithms successfully learn domain-invariant features, yielding state-of-the-art performance when predicting the labels of objects from unseen target domains.
Our results suggest several directions for further study. First, it is worth investigating whether stacking MTAEs improves performance. Second, more effective procedures for handling unbalanced samples are required, since these occur frequently in practice. Finally, a natural application of MTAEs is to streaming data such as video, where the appearance of objects transforms in real time.

The problem of dataset bias remains far from solved: the best model on the VLCS dataset achieved accuracies of less than 70% on average. A partial explanation for the poor performance compared to supervised learning is insufficient training data: the class overlap across datasets is quite small (only 5 classes are shared across VLCS). Further progress in domain generalization requires larger datasets.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex Multi-Task Feature Learning. Machine Learning, 73(3):243–272, 2008.
[2] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Domain Adaptation on the Statistical Manifold. In CVPR, pages 2481–2488, 2014.
[3] J. Baxter. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[4] Y. Bengio, A. C. Courville, and P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[5] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy Layer-Wise Training of Deep Networks. In NIPS, pages 153–160, 2007.
[6] G. Blanchard, G. Lee, and C. Scott. Generalizing from Several Related Classification Tasks to a New Unlabeled Sample. In NIPS, volume 1, pages 2178–2186, 2011.
[7] H. Bourlard and Y. Kamp. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 59:291–294, 1988.
[8] R. Caruana. Multitask Learning. Machine Learning, 28:41–75, 1997.
[9] M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized Denoising Autoencoders for Domain Adaptation. In ICML, pages 767–774, 2012.
[10] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. Exploiting hierarchical context on a large database of object categories. In CVPR, pages 129–136, 2010.
[11] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In CVPR, pages 3642–3649, 2012.
[12] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2001.
[13] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In ICML, pages 647–655, 2014.
[14] M. Everingham, L. Van-Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
[15] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. JMLR, 9:1871–1874, 2008.
[16] C. Fang, Y. Xu, and D. N. Rockmore. Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias. In ICCV, pages 1657–1664, 2013.
[17] M. Ghifary, W. B. Kleijn, and M. Zhang. Deep hybrid networks with good out-of-sample object recognition. In ICASSP, pages 5437–5441, 2014.
[18] B. Gong, K. Grauman, and F. Sha. Reshaping Visual Datasets for Domain Adaptation. In NIPS, pages 1286–1294, 2013.
[19] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic Flow Kernel for Unsupervised Domain Adaptation. In CVPR, pages 2066–2073, 2012.
[20] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical report, California Inst. of Tech., 2007.
[21] A. Khosla, T. Zhou, T. Malisiewicz, A. Efros, and A. Torralba. Undoing the Damage of Dataset Bias. In ECCV, volume I, pages 158–171, 2012.
[22] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, Department of Computer Science, University of Toronto, Apr. 2009.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, volume 25, pages 1106–1114, 2012.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
[25] B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In CVPR, pages 409–415, 2003.
[26] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain Generalization via Invariant Feature Representation. In ICML, pages 10–18, 2013.
[27] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. In ICML, number 1, pages 833–840, 2011.
[28] D. Rumelhart, G. Hinton, and R. Williams. Parallel Distributed Processing. I: Foundations. MIT Press, 1986.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[30] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: A database and web-based tool for image annotation. In IJCV, volume 77, pages 157–173, 2008.
[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting Visual Category Models to New Domains. In ECCV, pages 213–226, 2010.
[32] I. Sutskever, O. Vinyals, and Q. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
[33] Y. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, pages 1055–1062, 2010.
[34] S. Thrun. Is learning the n-th thing any easier than learning the first? In NIPS, pages 640–646, 1996.
[35] A. Torralba and A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.
[36] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and Composing Robust Features with Denoising Autoencoders. In ICML, 2008.
[37] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.
[38] Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting Low-Rank Structure from Latent Domains for Domain Generalization. In ECCV, pages 628–643, 2014.