Augmenting Variational Autoencoders with Sparse Labels: A Unified Framework for Unsupervised, Semi-(un)supervised, and Supervised Learning
Authors: Felix Berkhahn, Richard Keys, Wajih Ouertani, Nikhil Shetty, and Dominik Geißler
A PREPRINT

Relayr GmbH, Munich
November 15, 2019

ABSTRACT

We present a new flavor of Variational Autoencoder (VAE) that interpolates seamlessly between the unsupervised, semi-supervised and fully supervised learning domains. We show that unlabeled datapoints not only boost unsupervised tasks, but also classification performance. Vice versa, every label not only improves classification, but also unsupervised tasks. The proposed architecture is simple: a classification layer is connected to the topmost encoder layer, and then combined with the resampled latent layer for the decoder. The usual evidence lower bound (ELBO) loss is supplemented with a supervised loss target on this classification layer that is only applied for labeled datapoints. This simplicity allows any existing VAE model to be extended to our proposed semi-supervised framework with minimal effort. In the context of classification, we found that this approach even outperforms a direct supervised setup.

Keywords: Machine Learning, Semi-supervised Learning, Variational Autoencoder, Anomaly Detection, Transfer Learning, Representation Learning

1 Introduction

In many domains, unlabeled data is abundant, whereas obtaining rich labels may be time consuming, expensive, and reliant on manual annotation. As such, the value proposition of semi-supervised learning algorithms is immense: they allow us to train well-performing predictive systems with only a fraction of labeled datapoints. In this paper, we present a new flavor of Variational Autoencoder (VAE) that enables semi-supervised learning.
The model architecture requires only minimal modifications to any given purely unsupervised VAE. The semi-supervised classification accuracy is similar to that of slightly more complex approaches known in the literature [1, 16]. This was benchmarked using the MNIST (section 3.1.1), Fashion-MNIST (section 3.2.1) and UCI-HAR (section 3.3) data sets. We verified that even if every single datapoint is labeled, framing the training process in the context of VAE training improves the classification accuracy compared to the common way of training the classification network in isolation. We conjecture that supplementing the classification loss with the VAE loss forces the network to learn better representations of the data. Here the VAE reconstruction task acts as a regularizer during training of the classification network.

∗ felix.berkhahn@relayr.io  † richard.keys@relayr.io  ‡ wajih.ouertani@relayr.io  § nikhil.shetty@relayr.io  ¶ dominik.geissler@relayr.io

We also verified that the availability of labels helps the model to find better latent representations of the data: we used the betaVAE disentanglement metric to assess the quality of the found representations (section 4). Furthermore, we applied the VAEs to the problem of anomaly detection, and observed that their performance increases when the model is trained with additional labeled samples; see sections 3.1.3, 3.2.3 and 3.3 for benchmarks on MNIST, Fashion-MNIST and UCI-HAR respectively. In that sense, not only is the reconstruction of the model boosted by the availability of unlabeled datapoints (which is the normal semi-supervised setup), but vice versa the anomaly detection performance is also improved by the availability of labels.
In summary, we have developed a model which adapts seamlessly over the full 0-100% range of available labels. The result is a 'unified' model in which the anomaly detection capability is improved by any available label, and vice versa in which the predictive capability is significantly boosted by the abundance of unlabeled data. This paper provides a more thorough investigation and benchmark of the concepts which were published in a blog post in 2018 [5].

2 Model

2.1 Model architecture

The general model architecture is depicted in Figure 1a. As can be seen, the model is an extension of the original VAE [2], which is depicted in Figure 1c. The only addition is that a classification layer π (typically a one-hot classifying layer using softmax activation) is introduced and attached to the topmost encoder layer. The μ and σ layers encode the mean and standard deviation of the Gaussian prior in the latent layer:

    p(z) = N(z | μ, σ)    (1)

After sampling the latent variable z from the probability distribution (1), z and the activations of π are merged and fed into the decoder p_θ:

    x_recon ∼ p_θ(π, z) = p_θ(π ⊕ z)    (2)

where x_recon denotes the reconstructed data of the decoder. Hence, the classification predictions also contribute to the reconstruction of the data.

(a) Semi-Supervised VAE, model SS. (b) Supervised Classifier, model ES. (c) Unsupervised Anomaly Detector, model EU.

Figure 1: Comparison of our model architecture (a) to the supervised (b) / unsupervised (c) equivalents. The greyed out cells shown in (b) and (c) are not part of the models, and highlight the difference to our model (a). Note that the π layer (and its loss) represents the extension to the standard VAE proposed in this paper.
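To make the architecture of Figure 1a concrete, the following is a minimal numpy sketch of one forward pass: encoder heads for μ, log σ² and the classification layer π, the reparameterization step of equation (1), and the merge π ⊕ z of equation (2). All layer sizes, weight names and the single-linear-layer encoder/decoder are hypothetical simplifications, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical dimensions: 784-dim input (e.g. flattened MNIST),
# 32-dim latent layer, 10 classes.
D_IN, D_LAT, D_CLS = 784, 32, 10
W_mu = rng.normal(0, 0.01, (D_IN, D_LAT))
W_logvar = rng.normal(0, 0.01, (D_IN, D_LAT))
W_pi = rng.normal(0, 0.01, (D_IN, D_CLS))
W_dec = rng.normal(0, 0.01, (D_CLS + D_LAT, D_IN))

def forward(x):
    """One pass through a semi-supervised VAE in the spirit of Figure 1a."""
    # Encoder heads: mean, log-variance, and the classification layer pi.
    mu = x @ W_mu
    logvar = x @ W_logvar
    pi = softmax(x @ W_pi)
    # Reparameterization trick: sample z ~ N(mu, sigma), equation (1).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # Merge pi with z (equation 2) and decode to pixel space.
    x_recon = 1 / (1 + np.exp(-(np.concatenate([pi, z], axis=-1) @ W_dec)))
    return x_recon, mu, logvar, pi

x = rng.random((4, D_IN))
x_recon, mu, logvar, pi = forward(x)
print(x_recon.shape, pi.shape)  # (4, 784) (4, 10)
```

Note that π enters the decoder alongside z, so the class prediction genuinely participates in reconstruction; dropping the π head and its concatenation recovers the plain VAE of Figure 1c.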
2.2 Loss function

We propose an ad-hoc modification of the standard VAE evidence lower bound L_ELBO loss function [2]:

    L = L_ELBO + L_cl    (3)

where

    L_ELBO = E_{z ∼ p(z)}[log p_θ(x | y, z)] − KL[p_φ(z) || N(0, I)]    (4)

    L_cl = −α(y) Σ_i^{#labeled} y_i · log(π_i)    (5)

KL[p_φ(z) || N(0, I)] denotes the Kullback-Leibler divergence of p_φ(z) and a standard normal distribution, which is only applied to the latent variables z, but not to the labels π. p_φ(z) is the probability distribution of the latent variable generated by the encoder. y represents the label of the datapoint. α(y) is equal to zero if there is no label (i.e. y belongs to the 'unlabeled class'), else it is one. Normalizing α by the number of labeled datapoints per batch aids in stabilizing training. L_cl denotes the classification log-loss.

2.3 Upsampling the labeled data

In order to prevent artificial noise from a stochastic number of labeled contributions in the log-loss term (5), we chose to not only normalize this term but also fix the number of labeled samples per batch: besides the completely (un)supervised edge cases, we sampled datasets such that each batch contained labeled and unlabeled samples in a ratio of 1:1. Additionally, this prevents unlabeled datapoints from dominating training in cases of very sparsely labeled datasets.

2.4 Differences to Kingma's VAE [1]

Our work is largely inspired by [1]. However, the model we are proposing differs from the model M2 of [1] in several aspects, as shown in Table 1.

Table 1: Differences to Kingma et al.
                             our model                                  Kingma et al.
encoder                      single encoder network sharing weights,    two independent encoder networks,
                             q_φ(z, y | x)                              q_φ(z, y | x) = q_φ(z | x) q_φ(y | x)
latent layer                 latent activations only depend on x:       latent activations depend on both x and y:
                             μ(x)                                       μ(x, y)
treatment of unlabeled data  α(y) omits the contribution to L_cl        unknown y is summed over

The simplicity of our model allows any existing VAE to be turned into a semi-supervised VAE by simply adding the π layer and extending the loss function. In particular, all learned weights can be reused directly when transitioning into the semi-supervised learning scenario. This is very useful, as in many real-world applications, a labeled dataset (even a partially labeled one) is only built up over time and is not available at project initiation.

2.5 Classification - Decoder as a regularizer

We benchmarked the classification performance of our model for various data sets (MNIST, Fashion-MNIST, UCI-HAR; see sections 3.1.1, 3.2.1 and 3.3) as a function of available labels. Not surprisingly, more labeled or unlabeled samples generally improves performance. Moreover, we also tested our model in the scenario where all datapoints were labeled. Interestingly, we found that the obtained model performed better than training the same classification model (Figure 1b) in a standard supervised scenario. In other words, framing the training process in the context of VAE training allows the classification network to learn better weights compared to training it the 'standard way' with only the π classification loss. The additional training target of reproducing the input via the decoder forces the network to learn more meaningful representations in its deeper layers, from which the classification benefits. The decoder and VAE training act as a regularizer, as they challenge the network to find more subtle and granular representations of the input data, i.e. they combat overfitting.
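The combined loss of equations (3)-(5) can be sketched as follows. This is a minimal numpy sketch under simplifying assumptions (Bernoulli reconstruction likelihood, diagonal-Gaussian closed-form KL); the function name and arguments are hypothetical.

```python
import numpy as np

def ss_vae_loss(x, x_recon, mu, logvar, pi, y_onehot, labeled_mask):
    """Negative ELBO plus masked classification loss, per equations (3)-(5).

    labeled_mask plays the role of alpha(y): 1 for labeled datapoints, 0
    otherwise. The classification term is normalized by the number of labeled
    samples in the batch, as suggested in Section 2.2.
    """
    eps = 1e-9
    # Bernoulli reconstruction log-likelihood, E[log p(x | y, z)].
    rec = np.sum(x * np.log(x_recon + eps)
                 + (1 - x) * np.log(1 - x_recon + eps), axis=1)
    # KL[q(z|x) || N(0, I)] in closed form for a diagonal Gaussian.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1 - logvar, axis=1)
    neg_elbo = np.mean(kl - rec)
    # Classification cross-entropy, applied only to labeled datapoints.
    n_labeled = max(labeled_mask.sum(), 1)
    ce = -np.sum(labeled_mask * np.sum(y_onehot * np.log(pi + eps), axis=1)) / n_labeled
    return neg_elbo + ce

# Toy check with a batch of 6 samples, the first 3 of them labeled.
rng = np.random.default_rng(0)
x = (rng.random((6, 20)) > 0.5).astype(float)
x_recon = np.clip(rng.random((6, 20)), 1e-3, 1 - 1e-3)
mu, logvar = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
pi = np.full((6, 3), 1 / 3)
y_onehot = np.eye(3)[rng.integers(3, size=6)]
loss_semi = ss_vae_loss(x, x_recon, mu, logvar, pi, y_onehot, np.array([1, 1, 1, 0, 0, 0]))
loss_unsup = ss_vae_loss(x, x_recon, mu, logvar, pi, y_onehot, np.zeros(6))
print(loss_semi, loss_unsup)
```

With an all-zero mask the classification term vanishes and the loss reduces to the plain negative ELBO, which is exactly the fully unsupervised edge case of the framework.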
At the same time, these representations are meaningful, as they contain valuable information about how to reconstruct the datapoint properly; hence it is expected that they enhance any task built on top of them (for instance, classification).

2.6 Semi-Unsupervised learning

As we have seen in the previous sections, the availability of unlabeled datapoints aids the model in forming better representations in its deeper layers, hence enabling semi-supervised learning. Maybe the opposite is true as well: does the availability of labels also aid in finding better representations? Does the model perform better on reconstruction-related tasks such as anomaly detection? This problem setup can be generally described as a flavor of 'transfer learning': can the model improve its unsupervised task by leveraging the availability of labels that are primarily associated with the supervised learning task? This was investigated in two different kinds of experiments: (a) we benchmarked the quality of the representations directly via the betaVAE score as a function of available labels (section 4). In this case the added π layer can be interpreted as an additional loss term directly reflecting the betaVAE score. And (b) we used the VAE as an anomaly detector (see sections 3.1.3, 3.2.3 and 3.3). In this case our approach can be viewed as feature engineering: usually labels incorporate domain knowledge of some very specific, yet important, property of the data set. Thereby our method can guide the π layer towards an extractor for those very specific high-level features. In this experiment, we contrasted the semi-supervised model with an equivalent purely unsupervised VAE, obtained by removing the π layer from the network and the loss function (this corresponds to the leftmost and rightmost panels of Figure 1).
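All of these experiments rely on the 1:1 labeled-to-unlabeled batch composition of Section 2.3. That sampling scheme can be sketched as follows; this is a minimal sketch, and the function and variable names are hypothetical.

```python
import numpy as np

def make_batches(labeled_idx, unlabeled_idx, batch_size, rng):
    """Yield index batches containing labeled and unlabeled samples 1:1.

    The scarcer pool is drawn with replacement (up-sampling), so every batch
    has a fixed, non-stochastic number of labeled contributions to the
    classification loss term (5).
    """
    half = batch_size // 2
    n_batches = max(len(labeled_idx), len(unlabeled_idx)) // half
    for _ in range(n_batches):
        lab = rng.choice(labeled_idx, size=half, replace=True)
        unl = rng.choice(unlabeled_idx, size=half, replace=True)
        yield np.concatenate([lab, unl])

# E.g. 100 labeled samples among 1,100 total: the labeled pool is up-sampled.
labeled = np.arange(100)
unlabeled = np.arange(100, 1100)
batches = list(make_batches(labeled, unlabeled, 32, np.random.default_rng(0)))
print(len(batches), len(batches[0]))
```

If the label fraction exceeds 50%, the same routine instead down-samples the labeled pool, matching the behavior described for the experiments below.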
We then compared its anomaly detection performance with that of our model trained with either a portion or all of the normal datapoints labeled. The term 'semi-unsupervised learning' is a perfect description of this task: as semi-supervised learning enhances the performance of a supervised task by using unlabeled data, 'semi-unsupervised' learning enhances the performance of an unsupervised task by using labeled data. The only other mention of the term was used to describe experiments [3, 4] on some other variations of the classic VAE [2]. This unsupervised task was however quite different, in that the objective was to cluster unlabeled datapoints and subsequently classify them using one-shot learning.

3 Experiments

The networks used are described in detail in appendix A. Each semi-supervised model SS (our architecture as described above in 2.1) was contrasted with two sibling networks: (a) an equivalent supervised network ES, corresponding to only the encoder plus the π layer of the SS, trained only on the cross-entropy loss; and (b) an equivalent unsupervised network EU, which is identical to our SS architecture, but with the π layer removed. Throughout this section, we will refer to our models using the abbreviations shown in Table 2. The error bars were generated by re-running each scenario at least 10 times (unless specified otherwise).

Table 2: Model abbreviations. Generally speaking, the equivalent models ES and EU are derived from SS by either removing the resampling step and decoder (thus making it supervised, ES) or by removing the π layer (thus making it fully unsupervised, EU).
                           Dense    Convolutional    Recurrent
Semi-Supervised (ours)     SS_D     SS_CNN           SS_RNN
Equivalent Supervised      ES_D     ES_CNN           ES_RNN
Equivalent Unsupervised    EU_D     EU_CNN           EU_RNN

3.1 MNIST

For almost any type of image classifier, the natural place to begin benchmarking is the MNIST [6] dataset, a well-known dataset containing 70,000 (60,000 training and 10,000 testing) grey-scale images of handwritten digits. Given the versatility of the model, we conducted the following three benchmarks.

3.1.1 Semi-Supervised performance

The first task is semi-supervised learning, the area within which the model was designed to bring the most benefit. To create a semi-supervised dataset, we simply discard a certain percentage of the labels, but not the images themselves. This means an equivalent supervised model will only be able to train on the sample of the dataset which is labeled, whereas the semi-supervised model will benefit from all the additional unlabeled samples. The semi-supervised model was trained for 10 epochs, whilst the supervised equivalent was trained for 20 epochs, such that the comparison is made between fully converged models. For comparison, 10 different labeled subsets of the dataset were taken, with each model trained on identical data; the results are displayed in Table 3.

Table 3: Semi-Supervised MNIST classification results

           100 labels                       1,000 labels
model      accuracy         log loss        accuracy           log loss
SS_D       0.808 ± 0.006    0.80 ± 0.01     0.9346 ± 0.0009    0.215 ± 0.002
ES_D       0.763 ± 0.003    0.75 ± 0.01     0.902 ± 0.006      0.41 ± 0.03
SS_CNN     0.811 ± 0.006    0.86 ± 0.03     0.945 ± 0.001      0.218 ± 0.06
ES_CNN     0.765 ± 0.003    0.744 ± 0.009   0.89 ± 0.02        0.6 ± 0.2

Unsurprisingly, both variants of the semi-supervised model outperform their purely supervised equivalents for both sets of labels. For the CNN variant, there was a clear increase in classification accuracy for both 100 and 1,000 labels, scoring 4.6% and 5.5% higher than the supervised equivalent respectively. Interestingly, on 100 labels the log loss was lower for the supervised model than for the semi-supervised model for both the dense and CNN variants; however, when trained on 1,000 labels this trend was reversed. A possible interpretation could be that the supervised model is trained on a much smaller dataset and hence starts to overfit, producing predictions with a higher certainty. This might be beneficial for the log loss since the supervised model nonetheless has predictive power, as corroborated by the accuracy score.

3.1.2 Decoder as a regularizer

The next test for the model is in the purely supervised domain, testing the hypothesis that the decoder acts as a regularizer and assists the model in finding better representations of the dataset. For this test, both models were trained on the full (60,000 sample) training dataset until converged, with the results displayed in Table 4.

Table 4: Supervised MNIST classification results

model      accuracy           log loss
SS_D       0.9855 ± 0.0003    0.062 ± 0.001
ES_D       0.9814 ± 0.0003    0.131 ± 0.003
SS_CNN     0.9916 ± 0.0003    0.036 ± 0.002
ES_CNN     0.9904 ± 0.0009    0.055 ± 0.009

For both the dense and CNN cases the accuracy scores were very similar and, in the case of the CNN model, the error bars are almost overlapping. The log loss of the semi-supervised model however was significantly lower; in particular, for the dense model there was a 50% reduction compared with that of the supervised model.
This strongly suggests that the addition of the reconstruction task introduces a much higher confidence in the classifications of the semi-supervised model.

3.1.3 Semi-Unsupervised Learning

The purpose of the semi-unsupervised task was to verify that the introduction of labels can improve the performance of an anomaly detection task. The results presented in Table 5 were obtained by training the model on 9 of the 10 classes (designated 'normal' using the terminology of anomaly detection) in the MNIST dataset and inferring on all classes, essentially declaring the left-out class to be 'anomalous' data. The anomaly score returned by the models is the log reconstruction probability, which is expected to be higher for the anomalous classes. The performance of the scores is evaluated using the AUC (area under the ROC curve) score. With the exception of the digit 1, there was a considerable improvement for each anomalous class, with an average improvement of 4.1% over the purely unsupervised model. This demonstrates that the addition of labels aids the model in learning a better representation of normal data. Again, as hypothesized, this is most likely due to the additional label information assisting the model in identifying the category as an important high-level feature.

Table 5: Label assisted results of anomaly detection on MNIST

anomalous class    AUC of EU_D        AUC of SS_D
0                  0.949 ± 0.001      0.969 ± 0.001
1                  0.47 ± 0.03        0.095 ± 0.006
2                  0.9610 ± 0.0009    0.9719 ± 0.0004
3                  0.848 ± 0.003      0.902 ± 0.002
4                  0.708 ± 0.004      0.751 ± 0.003
5                  0.860 ± 0.003      0.894 ± 0.001
6                  0.9295 ± 0.0008    0.960 ± 0.002
7                  0.669 ± 0.002      0.755 ± 0.005
8                  0.891 ± 0.004      0.922 ± 0.002
9                  0.62 ± 0.02        0.677 ± 0.005

The very poor performance when using the digit 1 as the anomaly class could also be evidence to support this. Given the similarities between the digits 1 and 7, it is likely that the dense representation found by the model for the digit 7 was also sufficient for reconstructing the 1's, especially given that the shape of the digit 1 is most often also found within the digit 7. This would also explain why there was not a similar drop in performance when considering 7 as the anomaly class: considering clustered embeddings, the dense representation of the digit 1 would not be enough to properly reconstruct a 7, leading to a higher anomaly score.

Based on these results, a further experiment was run to assess the effect of the amount of available labels. Classes '7' and '9' were chosen as they achieved the largest improvement over the unsupervised equivalent; the results are displayed in Table 6.

Table 6: Anomaly detection AUC w.r.t. label availability

label %    anomaly class 7    anomaly class 9
1%         0.751 ± 0.006      0.670 ± 0.004
10%        0.736 ± 0.009      0.679 ± 0.003
25%        0.744 ± 0.008      0.680 ± 0.004
50%        0.752 ± 0.004      0.677 ± 0.004
75%        0.744 ± 0.004      0.673 ± 0.004
99%        0.745 ± 0.009      0.652 ± 0.004

For this test, the model was trained by re-sampling the available labels such that the model is trained on an equal amount of labeled and unlabeled data, without making any changes to the testing data. If the label fraction is below 50%, this amounts to up-sampling the labeled data; vice versa, for a label fraction above 50% it amounts to down-sampling the labeled data. For both classes, there is almost no difference across label percentages, as the error bars all overlap. Not only is this result somewhat surprising, but it is an advantage of such a model. Firstly, it demonstrates that a tiny fraction of labels is all that is required to bring a substantial increase in performance compared to the unsupervised domain, i.e.
almost the maximum pay-off can be achieved straight away. Secondly, although one would intuitively have expected the performance of the anomaly detector to increase with the number of labels, this is not the case, and it does not conflict with the hypothesis. Given that the labels are up-sampled during training to balance the learning objective, increasing the percentage of labels in the training set simply increases the diversity of the labeled data rather than the quantity. Considering the hypothesis that learning a clustered representation of the data assists the model in identifying anomalies, a possible explanation for the lack of improvement in the anomaly detection performance could be reasoned as follows: if a small fraction of labels is enough to push the model to find such a clustered representation, a larger diversity within the label classes may not provide any further contribution. Perhaps a more diverse representation within the clusters would improve the performance by helping the model to identify anomalous data which lie on the class boundaries.

3.1.4 Data generation

The decoder part of a VAE samples the latent layer and attempts to reconstruct the original input. An advantage of the semi-supervised variant is that the decoder can be used as a generative model by providing both the target label and a sample from the latent layer. Given that the prior distribution of the latent layer is Gaussian, we can simply sample from a normal distribution to feed as an input to the decoder. Depending on where we sample the normal distribution, the digit which is generated will be a representation of a different region of the training data. In other words, we can separate both class and style.
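The generation procedure just described can be sketched as follows: the class is fixed via a one-hot π vector while style is varied by sampling z from the standard-normal prior of the latent layer. The single-linear-layer decoder stand-in and all names here are hypothetical; any trained decoder mapping π ⊕ z to pixel space would take its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder stand-in: a single linear map from (pi ⊕ z) to pixels.
D_LAT, D_CLS, D_OUT = 32, 10, 784
W_dec = rng.normal(0, 0.01, (D_CLS + D_LAT, D_OUT))

def generate(digit, n_samples=5):
    """Generate n_samples images of a chosen class.

    The class is pinned by a one-hot pi vector; style varies with z drawn
    from the N(0, I) prior of the latent layer.
    """
    pi = np.zeros((n_samples, D_CLS))
    pi[:, digit] = 1.0
    z = rng.standard_normal((n_samples, D_LAT))
    logits = np.concatenate([pi, z], axis=1) @ W_dec
    return 1 / (1 + np.exp(-logits))  # sigmoid to pixel intensities in (0, 1)

imgs = generate(7)
print(imgs.shape)  # (5, 784)
```

Sampling z from different regions of the prior yields different renderings of the same class, which is exactly the class/style separation described above.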
This is demonstrated in Figure 2, which illustrates the range of styles for each digit the model has learned.

Figure 2: Digits generated by our Semi-Supervised model when trained on MNIST

3.2 Fashion-MNIST

Fashion-MNIST [7] is an image dataset published by Zalando. It is designed as a drop-in replacement for MNIST, i.e. exactly like the original MNIST dataset, Fashion-MNIST contains 60,000 training (and 10,000 testing) 28×28 grey-scale images with a total of 10 classes. However, there is much more variability within a given Fashion-MNIST class than there is within a given MNIST class. For instance, the individual samples of the class 'Ankle Boot' vary much more wildly than any digit in the MNIST dataset. As a consequence, Fashion-MNIST is a much more challenging dataset than MNIST, and hence serves as a more realistic proxy to evaluate model performance, especially as most modern image classification models can almost perfectly solve MNIST. At the same time, Fashion-MNIST preserves the big advantage of MNIST: it is still a small dataset that allows rapid training and experimentation when researching new models.

3.2.1 Semi-Supervised performance

The set-up for the semi-supervised classification task is the same as with the original MNIST dataset. That is, only the labels for a subset of all samples of the dataset are retained. The models ES_D and ES_CNN are then only trained on that subset, whilst the SS_D and SS_CNN models are trained on all the samples, but only make use of the labels of the subset. The other samples are treated as unlabeled. The results are shown in Table 7.

Table 7: Semi-Supervised Fashion-MNIST classification results

           100 labels                    1,000 labels
model      accuracy         log loss     accuracy         log loss
SS_D       0.703 ± 0.007    1.3 ± 0.2    0.812 ± 0.004    1.19 ± 0.07
ES_D       0.668 ± 0.008    1.4 ± 0.1    0.766 ± 0.008    1.6 ± 0.1
SS_CNN     0.724 ± 0.008    1.19 ± 0.05  0.836 ± 0.001    0.83 ± 0.01
ES_CNN     0.66 ± 0.01      1.5 ± 0.1    0.803 ± 0.004    1.20 ± 0.05

The difference in the accuracy scores between the semi-supervised models and their supervised counterparts is almost identical to that of the original MNIST. Unlike the original MNIST, however, there is a clear improvement in the log loss from each of the semi-supervised models.

3.2.2 Decoder as a regularizer

The results of training our model and an equivalent supervised model on the full dataset with every datapoint labeled (see section 2.5) are displayed in Table 8.

Table 8: Supervised Fashion-MNIST classification results

model      mean accuracy     log loss
SS_D       0.877 ± 0.004     0.4 ± 0.02
ES_D       0.79 ± 0.01       4.8 ± 0.4
SS_CNN     0.925 ± 0.001     0.33 ± 0.025
ES_CNN     0.922 ± 0.0015    0.33 ± 0.015

As can be seen, the fully connected (dense) model profits massively from using the decoder as a regularizer. For the CNN model, the effect is less pronounced. However, the best accuracy is still achieved by SS_CNN.

3.2.3 Semi-unsupervised learning

The anomaly detection task was set up in the same way as with the original MNIST dataset. One of the 10 classes was designated anomalous, the others were declared normal. The model is trained on a subset of the samples from the remaining nine classes. The held-out samples from these nine classes and the samples from the anomalous class are used as a validation set, with the performance evaluated using the AUC score. The results are summarized in Table 9. The error bounds were obtained by rerunning every experiment five times.

Table 9: Label assisted results of anomaly detection on Fashion-MNIST

anomalous class    AUC of SS_D      AUC of EU_D
T-shirt/top        0.712 ± 0.002    0.695 ± 0.002
Trouser            0.371 ± 0.005    0.347 ± 0.002
Pullover           0.826 ± 0.005    0.8 ± 0.001
Dress              0.462 ± 0.003    0.416 ± 0.004
Coat               0.719 ± 0.007    0.681 ± 0.001
Sandal             0.413 ± 0.007    0.34 ± 0.002
Shirt              0.76 ± 0.002     0.762 ± 0.001
Sneaker            0.191 ± 0.006    0.21 ± 0.002
Bag                0.899 ± 0.009    0.871 ± 0.004
Ankle Boot         0.525 ± 0.006    0.547 ± 0.004

SS_D performs better than EU_D for most anomalous classes. The only exceptions are the 'Shirt' class, where there is a tie between both models within the denoted error bars, and the 'Sneaker' and 'Ankle Boot' classes, where EU_D wins, although the performance on the 'Sneaker' class is extremely bad for both models. In general, the model performance varies drastically depending on the anomalous class. While the models perform very well for classes like 'Pullover' or 'Bag', they struggle with the 'Sneaker', 'Ankle Boot', 'Trouser' and 'Dress' classes. This can be understood by looking at samples of each class (see Figure 4). In particular, the 'Sneaker' and 'Ankle Boot' classes bear a resemblance, which explains why an anomaly detector trained on the 'Sneaker' class has a hard time flagging the 'Ankle Boot' samples as anomalous (and vice versa). The same is true, though to a lesser degree, for the 'Trouser' and 'Dress' classes.

3.2.4 Data Generation

Figure 3 shows an example of generated Fashion-MNIST data, using the same strategy as described in section 3.1.4. Once again there is an indication of the styles the model has learned to embed within the latent space. The style itself is mostly captured in the shape of the class and also the position of highlighted features, for example the position of the straps on the sandals. Comparing the generated data to some of the example images from Figure 4, it can be seen that the generated data lacks any of the detailing of the original images.
Given that the dataset is more detailed than the digits, but the model architecture is the same, this is likely due to the size of the latent layer and its lack of capacity for storing these additional details.

Figure 3: Samples generated by our Semi-Supervised model when trained on Fashion-MNIST

3.3 Human activity recognition (UCI-HAR)

The UCI-HAR dataset [9] contains 7,352 samples of gyroscopic data recorded from humans, each labeled with one of the following six activities: walking, walking upstairs, walking downstairs, sitting, standing and laying. The testing conducted on the UCI-HAR dataset was not as thorough as with MNIST and Fashion-MNIST, and we performed only one run per test scenario (hence no error bars are given). The classification results are presented in Table 10 and compare the performance of the semi-supervised model to its supervised equivalent.

Table 10: UCI-HAR classification results

           100 labels            1,000 labels          7,352 labels
model      accuracy   log loss   accuracy   log loss   accuracy   log loss
SS_RNN     0.630      4.000      0.839      0.503      0.917      0.310
ES_RNN     0.381      1.538      0.690      0.820      0.909      0.428

Again, the semi-supervised model is able to outperform its supervised equivalent for each subset of labels. In this case, the performance gain is particularly high when labels are in short supply, as demonstrated by the 24% improvement over the supervised model in the 100-label test. The performance of the model as an anomaly detector was also briefly evaluated, using 'walking' as the anomaly class. The results, displayed in Table 11, again show an improvement in performance when providing an anomaly detector with labels.

Table 11: UCI-HAR anomaly detection results

model      AUC score
SS_RNN     0.682
EU_RNN     0.641

4 Disentangled representations

One major application of VAEs is to find low-dimensional representations of real-world data [11, 15]. For this task, disentanglement is considered an important quality metric [15, 14]. Generally speaking, disentanglement attempts to quantify how well a particular framework is able to identify important yet independent generating factors of its dataset. For this, multiple distinct metrics and benchmark data sets have been suggested, yet they have been shown to agree at least on a qualitative level [11]. In order to benchmark our semi-supervised model, we chose to benchmark via the betaVAE score [12] on the Small-NORB data set [13]. Our scores are calculated based on the latent layer (but not the π layer), and are shown in Table 12. The network architecture is described in the appendix, Table 16, and is similar to [14] with two modifications: (a) the labels of the data set were incorporated using multiple cross-entropy loss terms (and one-hot sigmoid π layers) for each of the four dimensions; (b) in this architecture the π layers are intended to only function as an additional and sparse loss term on the latent layer, and thus are not forwarded to the decoder (no connection between π and 'Merge' in Figure 1a). The latter step is necessary, since otherwise the network could bypass the latent layer using the π layer while maintaining reconstruction quality. There are two hyperparameters: α, the overall weight of the supervised cross-entropy term (equation 5), and β_norm, the overall weight of the KL-divergence term. Both parameters were kept fixed for all cases of this experiment. They were optimized for maximum disentanglement in the completely unsupervised case.
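The betaVAE-style scoring used in this section can be sketched as follows. A plain logistic regressor is trained on batch-averaged absolute differences between the representations of paired batches that share one fixed generative factor; the score is the regressor's accuracy at recovering which factor was fixed. The latent codes below are synthetic stand-ins (not encoder outputs), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def betavae_features(z_a, z_b):
    """One classifier datapoint from two batches of 64 latent codes that
    share one fixed generative factor: the batch-averaged absolute
    difference of their representations."""
    return np.mean(np.abs(z_a - z_b), axis=0)

def train_logreg(X, y, n_classes, lr=0.1, steps=500):
    """Plain multinomial logistic regression fit by gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

# Synthetic stand-in: when factor k is fixed across the pair, latent
# dimension k barely differs, so the classifier can identify k.
n_factors, d = 4, 8
X, y = [], []
for _ in range(512):
    k = rng.integers(n_factors)
    z_a, z_b = rng.normal(size=(2, 64, d))
    z_b[:, k] = z_a[:, k]  # the fixed factor leaves dimension k unchanged
    X.append(betavae_features(z_a, z_b))
    y.append(k)
X, y = np.array(X), np.array(y)
W = train_logreg(X, y, n_factors)
acc = np.mean(np.argmax(X @ W, axis=1) == y)
print(f"betaVAE-style score: {acc:.2f}")
```

In the actual benchmark the latent codes come from the trained encoder and the classifier is trained and tested on 2,048 such datapoints each; a higher score indicates that individual latent dimensions align with individual generative factors.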
Note that this puts all other cases at a disadvantage, since their optimal hyperparameters presumably differ from the unsupervised case - but this emulates a real life scenario in which labeled datapoints are sparse and thus cannot be used for hyperparameter tuning. We empirically found that good results are achieved when α is selected such that its average contribution to the total loss is of the same order of magnitude as the reconstruction loss (after training converges). The optimal β_norm was found to be rather small (0.25), in accordance with [11]. For the betaVAE score itself, we used a plain logistic regressor. Each datapoint for this classifier was based on the averaged absolute differences between the representations of 2 × 64 batches. The classifier was trained and tested with 2,048 datapoints each. The final results are shown in Table 12: we always used about 38,000 unlabeled datapoints, but augmented training by various amounts of labeled datapoints. As expected, our approach can significantly outperform both the purely unsupervised and the purely supervised scenario. Even a relatively small amount of labeled datapoints (300, ∼1%) seems sufficient. It should be noted that for few (around 100) labels there is a small, yet statistically significant, decrease in the betaVAE score. This could be due to the aforementioned fact that the hyperparameters were optimized for an unsupervised scenario, and 100 labels were not sufficient to offset this disadvantage.

Table 12: betaVAE score of the representations generated by our semi-supervised model for varying label availability. All results were obtained using the same hyperparameters, which were optimized for the first row.

  # unlabeled data   # labeled data   betaVAE score
       38,000              0            82 ± 0.7
       38,000            100            76 ± 1.6
       38,000            300            83 ± 1.8
       38,000          1,000            92 ± 0.6
       38,000          3,000            95 ± 0.6
            0          1,000            87 ± 0.8

5 Future work

It would be interesting to see how much larger networks would benefit from the suggested regularization technique. For instance, the same technique could directly be applied to state-of-the-art computer vision networks like [10]. We leave this avenue for future investigation.

While the results of the semi-unsupervised learning already look promising, this approach so far makes no use of an additional input that labels could provide: incorporating user feedback by labeling a false-positive or false-negative detection as such, with the goal of suppressing future false-positive/false-negative detections. In principle, our model architecture should allow us to incorporate such feedback. One way to achieve this would be as follows: prepare new classes for false-positive and false-negative anomaly detections. In the beginning, there will be no samples in these classes. However, once a sufficient amount of false-positive/false-negative detections has accumulated, the model is re-trained. The anomaly score x could then be heuristically adjusted, for instance:

    x → (1 − p_fp) / (1 − p_fn) · x    (6)

where p_fp and p_fn are the false-positive and false-negative probabilities, respectively, outputted by the model at inference time. This treatment would even work when there are no labels available except the false-positive/false-negative assignments.

6 Conclusions

Since one of the most common issues associated with training machine learning models is the availability of training data, semi-supervised models present the perfect opportunity to take advantage of every available scrap of data. This improvement is particularly valuable in the supervised domain, given how abundant unlabeled data is in comparison to labeled data.
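As a side note to the feedback mechanism proposed in Section 5, the adjustment of equation (6) amounts to a one-line rescaling of the anomaly score. A minimal sketch (p_fp and p_fn stand for the false-positive and false-negative class probabilities the re-trained model would output at inference time):

```python
def adjusted_anomaly_score(x, p_fp, p_fn):
    """Rescale an anomaly score x as in equation (6):
    x -> (1 - p_fp) / (1 - p_fn) * x.
    A high false-positive probability damps the score, while a high
    false-negative probability boosts it."""
    return (1.0 - p_fp) / (1.0 - p_fn) * x

# A detection the model deems a likely false positive is suppressed ...
print(adjusted_anomaly_score(0.8, p_fp=0.5, p_fn=0.0))  # 0.4
# ... while a likely false negative is amplified.
print(adjusted_anomaly_score(0.2, p_fp=0.0, p_fn=0.5))  # 0.4
```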
The versatility of the semi-supervised model proposed in this paper delivers concrete improvements across the entire spectrum of label availability. Within the labeled domain, the value added by a semi-supervised approach is even greater when considering such a model in the context of failure prediction. With a purely supervised approach, a training dataset would consist entirely of failure events, requiring the system in question to fail hundreds if not thousands of times to gather a sizeable dataset. Depending on the system in question, collecting such a dataset from scratch could take decades; an obstacle which can often make a predictive system unobtainable. Taking advantage of the often abundant and easy-to-produce unlabeled data, this semi-supervised approach demonstrates the ability to converge towards an accurate predictive system on only a fraction of the labels.

In addition to reducing the time to deployment, the model offers further benefits in the supervised learning domain, being able to outperform equivalent classifiers due to the regularizing effect of the decoder and its associated reconstruction task. In the purely unsupervised domain, the model achieves identical performance to a VAE, yet demonstrates a huge increase in performance with a tiny fraction of labels.
Traditionally, a VAE must find a suitable dense representation of the system it is modelling in the latent layer. With the introduction of labels, the latent activations must not only embed a representation of the system, but also a classification of the system state. Ultimately, this additional information results in an improved embedding of the system state, not only enabling classifications, but also improving reconstructions. In short, the labels provide the model with a better understanding of the system which it is reconstructing.

References

[1] Diederik P. Kingma, Danilo J. Rezende, Shakir Mohamed, Max Welling. Semi-supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems, 3581-3589, 2014.
[2] Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
[3] Matthew J.F. Willets, Stephen J. Roberts, Christopher C. Holmes. Semi-Unsupervised Learning with Deep Generative Models: Clustering and Classifying using Ultra-Sparse Labels. arXiv:1901.08560, 2019.
[4] Matthew J.F. Willets, Aiden Doherty, Stephen J. Roberts, Chris Holmes. Semi-Unsupervised Learning of Human Activity using Deep Generative Models. arXiv:1810.12176, 2018.
[5] Felix Berkhahn, Richard Keys, Wajih Ouertani, Nikhil Shetty. One model to rule them all. https://relayr.io/blog/one-model-to-rule-them-all/, 2018.
[6] Y. LeCun, C. Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
[7] Han Xiao, Kashif Rasul, Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747, 2017.
[8] Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. Overview of mini-batch gradient descent. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[9] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013), Bruges, Belgium, 24-26 April 2013.
[10] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21-26 July 2017.
[11] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. arXiv:1811.12359, 2018.
[12] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations, 2017.
[13] Yann LeCun, Fu Jie Huang, Léon Bottou. Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), 2004.
[14] Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798-1828, 2013.
[15] Michael Tschannen, Olivier Bachem, Mario Lucic. Recent Advances in Autoencoder-Based Representation Learning. arXiv:1812.05069, 2018.
[16] Ingo Kossyk, Zoltán-Csaba Márton. Discriminative Regularization of the Latent Manifold of Variational Auto-Encoders. Journal of Visual Communication and Image Representation, 2019.

A Network architectures and training

A.1 MNIST and Fashion-MNIST: Model 1, FCN

We used the raw pixels, normalized to (0, 1), as input, corresponding to feature vectors of size 28 · 28 = 784. The architecture is shown below in Table 13. All encoder and decoder layers are simply stacked. The latent layers (µ and σ for a gaussian prior) are forked from the last encoder layer, and merged together with the resampled latent layer as input to the decoder. The model is trained with RMSprop [8] without decay and a momentum parameter of 0.9. If not explicitly mentioned otherwise, the learning rate was set to 0.0005.

Table 13: Fully connected network architecture. The input image corresponds to a 28x28 image reshaped into a single feature vector.

  layer type           dimensions   comments
  encoder
    input layer            784
    fully connected       1024      relu activation
    fully connected       1024      relu activation
  latent layer
    fully connected          2      linear activation; latent gaussian mean
    fully connected          2      linear activation; latent gaussian variance
    fully connected         10      softmax activation; class prediction
  decoder
    fully connected       1024      relu activation
    fully connected       1024      relu activation
    output layer           784      sigmoid activation

A.2 MNIST and Fashion-MNIST: Model 2, CNN

The images were rescaled to (0, 1) but not reshaped. Here we used a series of convolutional layers as detailed in Table 14.
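The fork-and-merge wiring shared by the MNIST models (µ, σ and the π class layer forked from the last encoder layer; the resampled latent then concatenated with π as decoder input) can be sketched with plain numpy. Layer sizes follow Table 13, but the dense layers are random stand-ins for trained parameters; this is a shape-level sketch of the forward pass, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda h: np.maximum(h, 0.0)

def softmax(h):
    e = np.exp(h - h.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense(n_in, n_out):
    # random stand-in for a trained fully connected layer
    W = rng.standard_normal((n_in, n_out)) * 0.01
    b = np.zeros(n_out)
    return lambda h: h @ W + b

# encoder stack (Table 13): 784 -> 1024 -> 1024
enc1, enc2 = dense(784, 1024), dense(1024, 1024)
# three heads forked from the last encoder layer
mu_head, logvar_head, pi_head = dense(1024, 2), dense(1024, 2), dense(1024, 10)
# decoder stack: (2 latent + 10 classes) -> 1024 -> 1024 -> 784
dec1, dec2, out = dense(12, 1024), dense(1024, 1024), dense(1024, 784)

def forward(x):
    h = relu(enc2(relu(enc1(x))))
    mu, logvar, pi = mu_head(h), logvar_head(h), softmax(pi_head(h))
    # reparameterization trick: resample the latent from mu and sigma
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    # merge: resampled latent concatenated with the class layer pi
    merged = np.concatenate([z, pi], axis=-1)
    recon = 1.0 / (1.0 + np.exp(-out(relu(dec2(relu(dec1(merged)))))))
    return recon, pi

recon, pi = forward(rng.uniform(0, 1, size=(5, 784)))
print(recon.shape, pi.shape)  # (5, 784) (5, 10)
```

The supervised cross-entropy term is applied on pi for labeled datapoints only, on top of the usual ELBO loss.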
The last dimension in the dimensions column corresponds to the feature dimension, while the first two correspond to the image dimensions. Again, all encoder and decoder layers were simply stacked, whereas the latent and π layers were forked from the last encoder layer. The resampled latent layer and the π layer were concatenated as input for the decoder. The model is trained with RMSprop without decay and a momentum parameter of 0.9. If not explicitly mentioned otherwise, the learning rate was set to 0.0005.

Table 14: Convolutional network architecture (CNN). The input shape corresponds to a single greyscale image (x-axis, y-axis, channel). BN is an abbreviation for batch normalization layer, ConvCNN for a transposed convolution layer.

  layer type          dimensions      comments
  encoder
    input layer       (28, 28, 1)
    CNN               (28, 28, 64)    kernel 3 × 3; stride 1
    BN                (28, 28, 64)    with relu activation
    CNN               (28, 28, 64)    kernel 3 × 3; stride 1
    BN                (28, 28, 64)    with relu activation
    CNN               (14, 14, 64)    kernel 3 × 3; stride 2
    BN                (14, 14, 64)    with relu activation
    Flatten           12544
    fully connected   512
    BN                512             with relu activation
    Dropout           512             dropout rate = 0.5
  latent layer
    fully connected   2               linear activation; latent gaussian mean
    fully connected   2               linear activation; latent gaussian variance
    fully connected   10              softmax activation; class prediction
  decoder
    fully connected   12544
    BN                12544           with relu activation
    Dropout           12544           dropout rate = 0.5
    Reshape           (14, 14, 64)
    ConvCNN           (14, 14, 64)    kernel 3 × 3; stride 1
    BN                (14, 14, 64)    with relu activation
    ConvCNN           (14, 14, 64)    kernel 3 × 3; stride 1
    BN                (14, 14, 64)    with relu activation
    ConvCNN           (28, 28, 64)    kernel 3 × 3; stride 2
    BN                (28, 28, 64)    with relu activation
    ConvCNN (output)  (28, 28, 1)     kernel 1 × 1; stride 1

A.3 UCI-HAR dataset: RNN

The recurrent VAE flavor was applied to the UCI-HAR dataset [9]. We used a look-back dimension of 128 with the full architecture described in Table 15. The first dimension in the dimensions column corresponds to the look-back dimension, while the second corresponds to the feature dimension. The latent layers are forked from the last encoder layer, and concatenated prior to the first decoder layer. The model is trained with RMSprop without decay and a momentum parameter of 0.9. If not explicitly mentioned otherwise, the learning rate was set to 0.001.

Table 15: RNN network architecture. The input shape corresponds to a single time series with (128 time steps, 6 features).
  layer type                           dimensions   comments
  encoder
    input layer                        (128, 6)
    fully connected along feature dim  (128, 40)    relu activation
    LSTM                               (128, 40)
    fully connected along feature dim  (128, 30)    relu activation
  latent layer
    fully connected                    (128, 2)     linear activation; latent gaussian mean
    fully connected                    (128, 2)     linear activation; latent gaussian variance
    fully connected                    (128, 6)     softmax activation; class prediction
  decoder
    fully connected along feature dim  (128, 30)    relu activation
    LSTM                               (128, 40)
    fully connected along feature dim  (128, 40)    relu activation
  output
    fully connected along feature dim  (128, 6)     linear activation; gaussian mean µ
    fully connected along feature dim  (128, 6)     softplus activation; gaussian variance σ

Table 16: CNN network architecture used for generating representations of the Small-NORB data set. The input shape corresponds to a single stereo image pair (x-axis, y-axis, left/right). BN is an abbreviation for batch normalization layer, ConvCNN for a transposed convolution layer.

  layer type          dimensions      comments
  encoder
    input layer       (96, 96, 2)
    CNN               (48, 48, 32)    kernel 7 × 7; stride 2
    BN                (48, 48, 32)    with relu activation
    CNN               (24, 24, 32)    kernel 7 × 7; stride 2
    BN                (24, 24, 32)    with relu activation
    CNN               (12, 12, 64)    kernel 7 × 7; stride 2
    BN                (12, 12, 64)    with relu activation
    CNN               (6, 6, 64)      kernel 7 × 7; stride 2
    BN                (6, 6, 64)      with relu activation
    Flatten           2304
    fully connected   256
    BN                256             with relu activation
    Dropout           256             dropout rate = 0.5
  latent layer
    fully connected   32              linear activation; latent gaussian mean
    fully connected   32              linear activation; latent gaussian variance
    fully connected   5               softmax activation; one-hot prediction (category)
    fully connected   9               softmax activation; one-hot prediction (elevation)
    fully connected   18              softmax activation; one-hot prediction (azimuth)
    fully connected   6               softmax activation; one-hot prediction (lighting)
  decoder
    fully connected   2304            based on resampled gaussian latent layers
    BN                2304            with relu activation
    Dropout           2304            dropout rate = 0.5
    Reshape           (6, 6, 64)
    ConvCNN           (12, 12, 64)    kernel 7 × 7; stride 2
    BN                (12, 12, 64)    with relu activation
    ConvCNN           (24, 24, 64)    kernel 7 × 7; stride 2
    BN                (24, 24, 64)    with relu activation
    ConvCNN           (48, 48, 32)    kernel 7 × 7; stride 2
    BN                (48, 48, 32)    with relu activation
    ConvCNN           (96, 96, 32)    kernel 7 × 7; stride 2
    BN                (96, 96, 32)    with relu activation
    ConvCNN (output)  (96, 96, 2)     kernel 7 × 7; stride 1

A.4 Small-NORB dataset: CNN

The input images were rescaled to (0, 1). Each stereo image pair was stacked into a single feature vector of shape (96, 96, 2). The encoder and decoder layers are simply stacked and the full architecture is shown in Table 16. For each of the four generating factors of this data set (category, elevation, azimuth and lighting) a separate π layer was added after the encoder. The latent layers (µ and σ) of a gaussian prior are also added on top of the encoder. In contrast to the other models, only the resampled latent layer is used for the decoder. The model is trained with RMSprop without decay and a momentum parameter of 0.9. If not explicitly mentioned otherwise, the learning rate was set to 0.001.

B Samples from Fashion-MNIST data set

Figure 4: This figure shows some examples from the Fashion-MNIST dataset.
Every class always corresponds to three consecutive rows. The classes are (from top to bottom): T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag and Ankle Boot.