X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets

Petar Veličković∗‡, Duo Wang∗, Nicholas D. Lane†‡ and Pietro Liò∗
∗Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK
†Department of Computer Science, University College London, London WC1E 6BT, UK
‡Nokia Bell Labs, Cambridge CB3 0FA, UK
Email: {pv273, wd263, pl219}@cam.ac.uk, niclane@acm.org

Abstract: In this paper we propose cross-modal convolutional neural networks (X-CNNs), a novel biologically inspired type of CNN architecture, treating gradient descent-specialised CNNs as individual units of processing in a larger-scale network topology, while allowing for unconstrained information flow and/or weight sharing between analogous hidden layers of the network, thus generalising the already well-established concept of neural network ensembles (where information typically may flow only between the output layers of the individual networks). The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield the greatest returns in sparse data environments, which are typically less suitable for training CNNs. For evaluation purposes, we have compared a standard four-layer CNN as well as a sophisticated FitNet4 architecture against their cross-modal variants on the CIFAR-10 and CIFAR-100 datasets with differing percentages of the training data being removed, and find that at lower levels of data availability, the X-CNNs significantly outperform their baselines (typically providing a 2–6% benefit, depending on the dataset size and whether data augmentation is used), while still maintaining an edge on all of the full dataset tests.

I. INTRODUCTION

In recent years, the number of success stories of machine learning has seen an all-time rise across a wide range of fields and tasks, examples including computer vision [1], speech recognition [2], reinforcement learning [3] and guiding Monte Carlo tree search [4]. The unifying idea behind all of the above is deep learning: the utilisation of neural networks with many hidden layers for the purpose of learning complex feature representations from raw data, rather than relying on hand-crafted feature extraction. As the networks become deeper, however, they become more and more reliant on the amount of training examples provided for maximising their performance. While we are now able to extract large quantities of labelled information for many problems of interest, there remains a significant proportion of tasks for which "big data" simply isn't available at this time, which makes it extremely difficult to fully exploit a deep CNN architecture and properly learn generalisable features of the data.

Here we will present an architectural methodology that attempts to extract additional predictive power from a convolutional neural network (CNN) in such circumstances by instead focussing on the width of the data, i.e. the heterogeneity of information present within each training example.
The key idea constitutes appropriate partitioning of this information and training smaller CNNs on these partitions (allowing them to train faster and more effectively under sparse data environments), while allowing for information exchange between them at various stages (Fig. 1). A classic example where such an approach is bound to be useful are clinical studies, where there typically may not be that many patients, but for each patient there is potentially a heterogeneous wealth of information, such as various test results, patient history, ethnic background, body scans (CT, MRI, ...) and so on, depending on the type of study.

II. CROSS-MODAL CNNS

Our methodology is inspired by multilayer networks [5], mathematical structures encompassing several layers of graphs over the same set of nodes, allowing for unrestricted intra-layer as well as inter-layer connections. They have been a demonstrably valuable tool for modelling a variety of natural and social systems ([6], [7], [8]), and their applicability to machine learning (within the context of hidden Markov models) was already demonstrated by some of the authors [9], managing to achieve high performance on a sparse breast cancer classification dataset involving gene expression and methylation data.

The network design process is initiated by appropriately partitioning the input data. This may be done either manually (by exploiting existing domain knowledge) or through an unsupervised pre-training step, which will determine which (not necessarily disjoint) fragments of the input data are more likely to constructively influence one another. Afterwards, a cross-modal CNN is constructed such that a separate CNN superlayer is dedicated to each partition of the input data, attempting to learn the target function from its partition only. The purpose of the partitioning is to help the constituent CNNs become powerful predictors while requiring a smaller dimensionality of the input data, by allowing them access to those parts of the input which are most significantly related to each other in the context of the predictions that need to be made.

Fig. 1. Diagram of a simple cross-modal CNN for image classification, generated from a baseline CNN of the form [Conv → Pool] × 2 → FC → Softmax. Each of the three channels (RGB/YUV) of the input image receives its own CNN superlayer, with cross-connections inserted after the pooling operation, and full weight sharing in the fully connected layers. A more in-depth view of a potential cross-connection layout is provided by Fig. 2.

Finally, the superlayers may be interconnected by any sort of (feedforward) cross-connection as is best seen fit, and they may be combined in arbitrary ways at the output stage to produce the final output. Similarly, at any stage the weights of the superlayers may be shared; the simplest case, which we will explore in our analysis, constitutes complete weight sharing of the fully connected layers at the tail of the networks. This construction is biologically inspired by cross-modal systems [10] within the visual and auditory systems of the human brain (which in turn inspired the development of CNNs), wherein several cross-connections between various sensory networks have been discovered [11], [12].
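To make the construction concrete, the following is a minimal sketch of the Fig. 1 topology, written against the modern TensorFlow/Keras functional API (the paper's implementation used Keras 1.x with a Theano back-end, so the feature map counts and the `build_x_cnn` helper below are illustrative, not the authors' exact code):

```python
from tensorflow.keras import Input, Model, layers

def superlayer_block(x, n_filters):
    """One [Conv -> Pool] stage of a single-channel superlayer."""
    x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(2)(x)

def build_x_cnn(n_classes=10):
    # One single-channel input per superlayer (e.g. Y, U, V of a 32x32 image).
    inputs = [Input(shape=(32, 32, 1)) for _ in range(3)]

    # Stage 1: each superlayer processes its own partition independently.
    streams = [superlayer_block(x, 32) for x in inputs]

    # Cross-connections after pooling: every superlayer receives the other
    # superlayers' feature maps, passed through an extra 1x1 convolution,
    # and merges them with its own maps via concatenation.
    crossed = []
    for i, own in enumerate(streams):
        borrowed = [layers.Conv2D(16, 1, activation="relu")(other)
                    for j, other in enumerate(streams) if j != i]
        crossed.append(layers.Concatenate()([own] + borrowed))

    # Stage 2: another round of independent processing, then a global merge.
    streams = [superlayer_block(x, 64) for x in crossed]
    merged = layers.Concatenate()([layers.Flatten()(s) for s in streams])

    # A single fully connected tail; being shared by all superlayers, it
    # realises the "complete weight sharing" of the FC layers described above.
    hidden = layers.Dense(512, activation="relu")(merged)
    outputs = layers.Dense(n_classes, activation="softmax")(hidden)
    return Model(inputs, outputs)

model = build_x_cnn()
model.summary()
```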
To quantify the gains of this approach, our evaluation focusses on an already well-understood problem, coloured image classification, on the established CIFAR-10/100 [13] benchmarks, for which an abundance of data is available, so it is easier to investigate the effects of restricting the size of the training set on various CNN models. The partitioning of the input that we consider is per-channel: each of the three image channels will be an input to an individual superlayer, and these superlayers will have identical high-level architecture (differing only in the number of feature maps per hidden layer), as illustrated by Fig. 1. This also allows for a simple approach to cross-connections; namely, after every downsampling (pooling) operation we allow for the feature maps to be exchanged between superlayers, after being passed through another convolutional layer (Fig. 2).

While this model in itself constitutes a committee of CNNs, it differs from most traditional ensemble applications in two key ways:
• An ensemble's constituent models typically exchange information only in the output stage, while the cross-modal framework allows for arbitrary (feedforward) information flow at any stage of the processing pipeline;
• Constituent models of an ensemble usually receive a full copy of the input each, while superlayers within a cross-modal neural network receive only a fraction of the input, allowing for a decrease in degrees-of-freedom of the model compared to an unrestricted network.

In fact, this can be taken a step further: one may consider ensembles of cross-modal CNNs, which may compound on benefits already given by X-CNNs themselves, on examples where the networks are potentially struggling to choose a proper class with sufficient confidence. As the X-CNN model can be observed as an ordinary CNN from a high level, any ensemble strategies that are found useful for CNNs should be useful for X-CNNs as well.

Lastly, it should be noted that our approach is not restricted to CNNs, but it is then easiest to scrutinise, as the trained parameters are bound to obey a certain spatial structure. In line with this, an entire section of this manuscript will be dedicated to analysing the learned convolutional kernels within an X-CNN, as well as visualising the inputs that would maximise the model's cross-connection activations.

III. MODEL ARCHITECTURES

For the purposes of evaluating our proposed architecture's performance, we have implemented two baseline CNN models, along with their cross-modal variants, in Keras [14] (with Theano [15] back-end). For purposes of reproducibility, in this section we will expose their architectures and hyperparameters as used for the evaluation. The cross-modal variants' feature map counts have been altered in such a way as to make the overall number of parameters as close as possible to the baseline, making for a fair evaluation with respect to degrees-of-freedom.

Fig. 2. Illustration of a single cross-connection segment within an X-CNN with two superlayers. After each pooling operation, we exchange the feature maps between the superlayers, after first passing them through an additional convolutional layer. We may also perform an additional intra-superlayer convolution before merging the feature maps in each superlayer via concatenation.

For both of the models used, we represent images in the YUV colour space. As a linear transformation from RGB, it should not have an impact on the performance of the baselines, while it has the benefit of decoupling luminance from chrominance, allowing for a simpler analysis of cross-connections (and relating its learned kernels to human vision processes). We inject further domain knowledge into the model by favouring the CNN superlayer corresponding to the Y channel in terms of feature map counts (typically doubled compared to the U/V superlayers within the same hidden layer). This corresponds to the assumption that the majority of relevant information about an object is contained within its brightness channel, while colour usually represents auxiliary information.
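Since the YUV representation drives the whole design, here is a small sketch of the preprocessing step. The paper only states that the transform is linear, so the standard BT.601 coefficients below are an assumption:

```python
import numpy as np

# Standard BT.601 RGB -> YUV matrix (assumed; the paper does not give one).
RGB_TO_YUV = np.array([[ 0.299,    0.587,    0.114  ],
                       [-0.14713, -0.28886,  0.436  ],
                       [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(images):
    """images: float array of shape (N, H, W, 3) in [0, 1]."""
    return images @ RGB_TO_YUV.T

# Each channel then feeds its own superlayer:
# y, u, v = np.split(rgb_to_yuv(images), 3, axis=-1)
```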
A. KerasNet

Our initial model of choice represents a simple CNN with four convolutional ReLU [16] layers, followed by two fully connected layers, one of which is also ReLU. We will be referring to it as KerasNet throughout this manuscript, as it is based on the Keras CIFAR-10 CNN example [17]. It represents a likely style of "starting" model that one would attempt to apply to an image classification problem (without particular prior knowledge about it), perhaps especially bearing in mind that the training data may be sparse. The architecture of the model, as well as of its cross-modal variant (X-KerasNet), is outlined in Table I.

TABLE I. ARCHITECTURES FOR KERASNET AND X-KERASNET

Output size | KerasNet (~4.46M param.)  | X-KerasNet (~4.37M param.)
------------|---------------------------|-------------------------------------------
32 × 32     | [3 × 3, 64] × 2           | Y: [3 × 3, 32] × 2;  U/V: [3 × 3, 16] × 2
16 × 16     | 2 × 2 Max-Pool, stride 2 (both models)
            |                           | Y→Y, U→U, V→V: identity;
            |                           | Y→U/V: [1 × 1, 32];  U/V→Y: [1 × 1, 16]
            | [3 × 3, 128] × 2          | Y: [3 × 3, 64] × 2;  U/V: [3 × 3, 32] × 2
8 × 8       | 2 × 2 Max-Pool, stride 2 (both models)
1 × 1       | Fully connected, 512-D; 10/100-way softmax (both models)

Both models are trained for 200 epochs using the Adam SGD optimiser, with hyperparameters as described in [18], and a batch size of 32. Dropout [19] has been applied after both of the pooling operations (with p = 0.25) as well as after the first fully connected layer (with p = 0.5).
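A minimal sketch of this training regime, assuming `model`, `x_train` and `y_train` exist in the surrounding script; the learning rate is simply the default of [18]:

```python
from tensorflow.keras.optimizers import Adam

# Dropout belongs to the architecture itself: layers.Dropout(0.25) would be
# inserted after each pooling stage and layers.Dropout(0.5) after the first
# fully connected layer of the models in Table I.
model.compile(optimizer=Adam(learning_rate=0.001),  # default of [18]
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=200)
```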
B. FitNet4

We decided to implement FitNet4 by Romero et al. [20] as our second baseline, representing a sophisticated CNN close to the state-of-the-art on CIFAR-10/100. We opted for this model as it is prominently featured in a variety of recent neural networks research ([21], [22]), and due to its design goal of being a "thin & deep" network, managing to keep its parameter count relatively low compared to many other successful models; it therefore could still be a feasible first choice for handling a sparse dataset.

FitNet4 consists of 17 convolutional 2-way maxout [23] layers, followed by two fully connected layers, the first of which is a 5-way maxout layer. The full architecture of this model, as well as of its cross-modal variant (X-FitNet4), is presented in Table II.

TABLE II. ARCHITECTURES FOR FITNET4 AND X-FITNET4

Output size | FitNet4 (~2.75M param.)   | X-FitNet4 (~2.72M param.)
------------|---------------------------|-------------------------------------------
32 × 32     | [3 × 3, 32] × 3           | Y: [3 × 3, 24] × 3;  U/V: [3 × 3, 12] × 3
            | [3 × 3, 48] × 2           | Y: [3 × 3, 36] × 2;  U/V: [3 × 3, 18] × 2
16 × 16     | 2 × 2 Max-Pool, stride 2 (both models)
            |                           | Y→Y: [1 × 1, 36];  U→U, V→V: [1 × 1, 18];
            |                           | Y→U/V: [1 × 1, 12];  U/V→Y: [1 × 1, 12]
            | [3 × 3, 80] × 6           | Y: [3 × 3, 60] × 6;  U/V: [3 × 3, 30] × 6
8 × 8       | 2 × 2 Max-Pool, stride 2 (both models)
            |                           | Y→Y: [1 × 1, 60];  U→U, V→V: [1 × 1, 30];
            |                           | Y→U/V: [1 × 1, 18];  U/V→Y: [1 × 1, 18]
            | [3 × 3, 128] × 6          | Y: [3 × 3, 96] × 6;  U/V: [3 × 3, 48] × 6
1 × 1       | 8 × 8 (global) Max-Pool (both models)
            | Fully connected, 500-D; 10/100-way softmax (both models)

Both models are initialised using Xavier initialisation [24], and are then trained for 230 epochs using the Adam SGD optimiser with a batch size of 128. We have applied batch normalisation [25] to the output of each hidden layer to significantly accelerate the training procedure. L2 regularisation with λ = 0.0005 has been applied to all weights in the model. Finally, dropout (with p = 0.2) was applied on the input, after every pooling operation, and after the fully connected maxout layer.
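Modern Keras has no built-in maxout layer, but a k-way maxout [23] is simply the element-wise maximum over k parallel linear pieces, so FitNet4's 2-way maxout convolutions can be sketched as follows (the helper name is ours):

```python
from tensorflow.keras import layers

def maxout_conv2d(x, n_filters, kernel_size=3, k=2):
    """k-way maxout convolution: element-wise max over k linear Conv2D pieces."""
    pieces = [layers.Conv2D(n_filters, kernel_size, padding="same")(x)  # no activation
              for _ in range(k)]
    return layers.Maximum()(pieces)

# The 5-way maxout fully connected layer is analogous, with Dense pieces
# in place of the convolutions.
```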
IV. EVALUATION

To verify our insights, we have utilised two well-known image classification benchmark datasets, CIFAR-10 and CIFAR-100 [13], for which an abundance of data is available (50000 training and 10000 testing examples). This makes it easier to study the behaviour of the considered CNNs as different fractions of the training data are discarded. We hypothesise that, at lower levels of data availability (up to a threshold), our methodology will yield significant gains over an equivalent unrestricted CNN, and also that it will remain competitive at all higher training set sizes.

The validity of our claim is investigated by performing comparative evaluation, with KerasNet and FitNet4 as baselines against X-KerasNet and X-FitNet4, respectively. In each individual test we evaluate the accuracy of these four models on the entire test set of 10000 samples, when the training routine is presented with only p% of the entire training dataset (chosen deterministically). The schedule for the tests is as follows:
• Initially, test in increments of 5%, until reaching 20% (at which time the training and testing sets have equal sizes);
• Afterwards, test in increments of 10% until either reaching 50% or the accuracies of the two models get within 0.5% of each other (corresponding to a gain of ≤ 50 images properly classified), whichever is later;
• Specially, we always test on 1% (corresponding to a highly sparse environment with only 500 training images) and 100% of the training dataset.

The images are preprocessed by applying a single batch normalisation operation on them; we have found this to yield slightly better results compared to doing global contrast normalisation and ZCA whitening (the more common approach). Finally, given that it is, depending on the task, sometimes possible to significantly enhance results in a sparse environment by way of data augmentation, we have run all of the above tests twice, with and without random translations and horizontal reflections applied to the training images, providing insight as to whether data augmentation compounds the effects of a cross-modal architecture, and to what extent.
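A sketch of the p% protocol and the augmentation just described. The paper does not state the translation magnitude or how the deterministic subset is chosen, so taking the first p% of examples and 10% shifts are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

p = 5
n = int(p / 100 * len(x_train))                      # first p% of the data (assumed rule)
x_sub, y_sub = x_train[:n], y_train[:n]

datagen = ImageDataGenerator(width_shift_range=0.1,  # random translations
                             height_shift_range=0.1,
                             horizontal_flip=True)   # horizontal reflections
model.fit(datagen.flow(x_sub, y_sub, batch_size=32), epochs=200)
```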
V. RESULTS AND DISCUSSION

The full evaluation results on the aforementioned tests are presented in Tables III-VI. The results on tests without data augmentation are completely in line with the claim of Section IV: at sufficiently low training data sizes, both X-KerasNet and X-FitNet4 significantly outperform their respective baselines on the testing set, for both of the CIFAR-10/100 datasets.

For CIFAR-10, the threshold at which the baselines "catch up" (in terms of being able to manually learn the domain knowledge directly injected into their cross-modal variants) is at around p = 40%, corresponding to 20000 training examples being available. Furthermore, on CIFAR-100, such a threshold is never reached, most likely due to the extreme sparsity of per-class examples making this problem particularly suitable for the X-CNN models; the only exception is the 1% scenario for FitNet4, where the data sparsity is probably too extreme (five examples/class) for such a deep model to reach its potential. Regardless of when the threshold is surpassed, we report that the cross-modal CNNs will generally continue to have a slight edge over the baselines, outperforming them on all of the full training dataset experiments, sometimes significantly. This naturally invites the conclusion that converting a CNN into an X-CNN (if allowed by the task) is always a reasonable step; it can yield significant benefits (the significance depending on the relation between the sparsity of the training dataset and the complexity of the baseline model), while rarely making performance significantly worse.

TABLE III. COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITHOUT DATA AUGMENTATION (ACCURACY, %)

Model      |    1% |    5% |   10% |   15% |   20% |   30% |   40% |   50% |  100%
-----------|-------|-------|-------|-------|-------|-------|-------|-------|------
KerasNet   | 37.94 | 53.82 | 62.95 | 67.39 | 70.26 | 74.39 | 76.62 | 78.55 | 82.50
X-KerasNet | 41.19 | 57.84 | 65.01 | 68.25 | 71.36 | 74.79 | 76.96 | 78.57 | 82.62
FitNet4    | 38.97 | 56.78 | 70.37 | 75.07 | 78.50 | 81.95 | 83.95 | 85.22 | 89.56
X-FitNet4  | 39.21 | 60.57 | 70.82 | 76.09 | 79.40 | 83.36 | 84.25 | 86.14 | 90.13

(60-90% were not tested, per the schedule of Section IV.)

TABLE IV. COMPARATIVE EVALUATION RESULTS ON CIFAR-10 WITH DATA AUGMENTATION (ACCURACY, %)

Model      |    1% |    5% |   10% |   15% |   20% |   30% |   40% |   50% |  100%
-----------|-------|-------|-------|-------|-------|-------|-------|-------|------
KerasNet   | 45.45 | 67.01 | 70.89 | 78.83 | 80.97 | 83.23 | 83.64 | 85.02 | 86.66
X-KerasNet | 49.60 | 69.28 | 72.51 | 78.96 | 80.58 | 83.10 | 83.89 | 85.37 | 87.41
FitNet4    | 40.91 | 65.73 | 75.55 | 80.85 | 83.63 | 86.23 | 88.30 | 89.11 | 92.27
X-FitNet4  | 42.02 | 65.54 | 77.06 | 81.33 | 83.94 | 86.41 | 88.13 | 89.37 | 92.50

(60-90% were not tested, per the schedule of Section IV.)

TABLE V. COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITHOUT DATA AUGMENTATION (ACCURACY, %)

Model      |   1% |    5% |   10% |   15% |   20% |   30% |   40% |   50% |   60% |   70% |   80% |   90% |  100%
-----------|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
KerasNet   | 7.55 | 15.10 | 20.24 | 24.76 | 28.18 | 32.43 | 36.29 | 38.61 | 41.63 | 44.10 | 45.56 | 46.26 | 48.26
X-KerasNet | 8.05 | 16.45 | 23.04 | 26.91 | 30.08 | 35.39 | 39.13 | 41.88 | 42.50 | 45.96 | 46.73 | 48.25 | 49.98
FitNet4    | 6.48 | 16.84 | 22.12 | 28.30 | 35.52 | 39.28 | 43.59 | 49.69 | 50.42 | 55.83 | 56.62 | 58.00 | 59.78
X-FitNet4  | 6.64 | 18.73 | 27.57 | 33.59 | 38.38 | 45.53 | 49.68 | 52.21 | 55.55 | 57.22 | 59.52 | 60.87 | 62.20

TABLE VI. COMPARATIVE EVALUATION RESULTS ON CIFAR-100 WITH DATA AUGMENTATION (ACCURACY, %)

Model      |    1% |    5% |   10% |   15% |   20% |   30% |   40% |   50% |   60% |   70% |   80% |   90% |  100%
-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------
KerasNet   |  9.09 | 24.68 | 32.63 | 38.64 | 42.62 | 47.64 | 49.91 | 52.46 | 53.77 | 54.26 | 55.12 | 55.42 | 55.45
X-KerasNet | 10.16 | 27.15 | 35.58 | 42.05 | 43.77 | 48.80 | 50.48 | 54.25 | 54.90 | 55.33 | 55.68 | 56.82 | 57.18
FitNet4    |  7.25 | 17.94 | 23.55 | 29.24 | 38.76 | 48.07 | 50.06 | 56.01 | 58.55 | 59.80 | 62.38 | 63.60 | 65.59
X-FitNet4  |  7.35 | 20.39 | 28.69 | 37.86 | 43.75 | 50.48 | 55.40 | 57.92 | 60.70 | 62.76 | 66.18 | 66.27 | 67.19

To further verify this claim, we have performed experiments on the full datasets (with and without augmentation) where we monitored how the testing accuracy evolves as a function of the training epoch. The resulting plots are summarised in Fig. 3; it is clear that the X-CNNs are at least as powerful as their baselines, even when the full training sets are available. Furthermore, it is possible to detect a narrow edge for the cross-modal models in the CIFAR-10 experiments, and a significant edge in the CIFAR-100 experiments. The concluding remark is that even when the dataset under investigation is not very sparse, attempting to utilise a cross-modal variant of the considered models (if applicable) is a reasonable action, as it might yield noticeable returns in predictive power.

Fig. 3. Plots of the test accuracy of the four CNN models under consideration as a function of the number of training epochs, with 100% of the training set available. The experiments have been carried out on both CIFAR-10 and CIFAR-100, with and without data augmentation. The cross-modal CNNs are consistently competitive with their respective baselines across all four settings, with a significant edge present for CIFAR-100.

The analysis of the interplay between data augmentation and cross-modal networks on CIFAR-100 remains straightforward: the X-CNN models remain consistently and significantly ahead of their baselines throughout the entire spectrum of training set sizes. On CIFAR-10, however, this is slightly more complicated; while the catch-up threshold is, as expected, decreased (to around p = 20%), the behaviour of X-CNNs for smaller training set sizes does not always significantly compound the benefits of data augmentation. Specifically, at 5% of the training set the FitNet4 model manages to outperform X-FitNet4 (the roles do get reversed starting from 10%, however). As a possible cause of this phenomenon, we note that, at this data availability level, both of the FitNet4 models are significantly inferior in performance to the KerasNet models, for which there is a significant benefit to the usage of X-CNNs. The takeaway lesson here is that, while the cross-modal architecture need not always compound nicely with data augmentation, an occurrence of such an event could signify that the baseline was not particularly suitable for properly accommodating data augmentation at this training set size in the first place. If this happens, one should attempt to use a more suitable/shallower CNN; the X-CNN variant should then produce the desired benefits.

Finally, we have taken advantage of some of the smaller training set sizes to perform statistical significance tests, typically scarce in the deep learning literature. For training set sizes up to 15%, we trained the models five times (from different initial conditions) and then performed t-tests, choosing p < 0.05 as our significance threshold. Our findings show that, under these assumptions, the best-performing X-CNN model's performance advantages are statistically significant in all scenarios, aside from the data-augmented CIFAR-10.
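A sketch of this significance test with SciPy; the accuracy values below are placeholders for illustration, not the paper's measurements:

```python
from scipy.stats import ttest_ind

# Five test accuracies per model from independently initialised runs
# (hypothetical numbers).
baseline_acc = [0.561, 0.570, 0.565, 0.559, 0.568]
xcnn_acc     = [0.601, 0.609, 0.598, 0.605, 0.611]

t, p = ttest_ind(xcnn_acc, baseline_acc)
print(f"t = {t:.3f}, p = {p:.4f}, significant: {p < 0.05}")
```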
VI. CROSS-CONNECTION ANALYSIS

A key element of the X-CNN architecture are the cross-connection layers, as they enable information flow between individual channels. It will therefore be of interest to understand and visualise the mode of operation of these layers. All of the visualisations in this section correspond to the learned weights after fully training on 100% of CIFAR-10 with data augmentation.

We will first demonstrate that the cross-connections inserted in the considered models, though being 1 × 1 convolutions, learn more complex functions than simple feature map passing. First, we note that the weights of a 1 × 1 convolutional layer may be represented as a 2D table that maps input channels to output channels (akin to an adjacency matrix, where columns are the input channels and rows are the output channels). Rather than displaying the raw table values, we decided to visualise the weights in a heatmap style; Fig. 4 showcases this visualisation for the first cross-connection layer of X-FitNet4. Green colours indicate that an input channel has a positive connection weight to the respective output channel, while blue colours indicate negative weights. The colour intensities are proportional to the absolute weight values. It can be seen that each output channel of the cross-connection layer is obtained through a nontrivial weighted combination of input channels. We hypothesise that the cross-connection layers selectively filter and combine input features that are more utilisable in another processing stream.

Fig. 4. Weight visualisation of the first-level cross-connection layer for the X-FitNet4 CNN. The columns correspond to input channels, while rows correspond to output channels. Green colour indicates a positive-weight connection between an input channel and an output channel, while blue colour indicates a negative-weight connection. The colour intensities are proportional to the absolute weight values. Top: Y → U/V (36 input channels, 12 output channels). Bottom, left-to-right: U → Y and V → Y (18 input channels, 12 output channels).

To delve deeper into what kinds of features the cross-connection layers are filtering, combining and passing, we applied the layer-wise feature-map activation technique proposed by Simonyan et al. [26]. This technique performs gradient ascent on a white-noise input image to maximise the activations of a specific channel of feature maps at any of the layers within a pre-trained model. The objective for the gradient ascent is

I_0 = \arg\max_I \Sigma(I) - \lambda \|I\|_2^2    (1)

where I is the input image, Σ(I) is the activation of the considered neuron when provided with I as input, and λ is a regularisation factor. After iterating for a number of gradient ascent steps, the original white-noise image will be modified into patterns that approximate the detection function of a specific neuron.
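A sketch of the gradient ascent of Eq. (1) in modern TensorFlow, written for a single-input model for brevity (an X-CNN would be given one array per superlayer); the step count, step size and λ below are assumptions, as the paper does not report them:

```python
import tensorflow as tf

def maximise_channel(model, layer_name, channel, steps=200, lr=1.0, lam=1e-4):
    """Gradient ascent on a white-noise image to maximise one feature-map channel."""
    sub = tf.keras.Model(model.inputs, model.get_layer(layer_name).output)
    img = tf.Variable(tf.random.normal((1, 32, 32, 1), stddev=0.1))  # white noise
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = sub(img)[..., channel]
            # Eq. (1): activation term minus the L2 regularisation penalty.
            loss = tf.reduce_sum(activation) - lam * tf.reduce_sum(tf.square(img))
        grad = tape.gradient(loss, img)
        img.assign_add(lr * grad / (tf.norm(grad) + 1e-8))  # normalised ascent step
    return img.numpy()
```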
Lower-level convolutional layers are well known to learn filters approximating Gabor wavelets that act as edge detectors, corner detectors, etc.; we can confirm that in our experiments this has indeed been the case. For the first cross-connection layer of X-FitNet4, we have visualised a selection of channel activations in Fig. 5. This visualisation indicates that the cross-connection layer is indeed passing combined lower-level features, such as the addition of horizontal and vertical stripes in the upper-right image of the figure.

Fig. 5. Artificially generated images (from white noise) that cause strong activations of specific channels in the first cross-connection layer of the X-FitNet4 model. Top: three channels from the Y → U/V cross-connections. Middle: three channels from the U → Y cross-connections. Bottom: three channels from the V → Y cross-connections.

We observe further that the pattern frequency for the Y channel's cross-connection layer is higher than the one for the U and V layers. This observation reflects the fact that the human vision system is able to detect higher-frequency variations in intensity than in chrominance. This is a solid indicator that the X-CNN architecture, when faced with an image classification task in the YUV colour scheme, is actually attempting to mimic human vision.

Our final analysis focusses on the X-KerasNet model, where we transformed feature maps of arbitrary depths into RGB images by a colour-mapping scheme. Fig. 6 shows the feature maps of the inputs and outputs of the Y → U/V cross-connections for representative images of the truck and airplane classes. These were easier to comparatively analyse on the X-KerasNet, as its cross-connection layers do not alter the number of feature maps, and therefore the same colour-mapping scheme remained meaningful for both. We observe that the cross-connection output maps have the background and some features emphasised, while other features are de-emphasised, which further indicates that the cross-connection layers are performing more complex inter-superlayer feature integration than simply passing feature maps between superlayers.

Fig. 6. Visualisation of input and output feature maps of the Y → U/V cross-connection layer of X-KerasNet. Left: input images (truck/airplane). Middle: input feature maps to the cross-connection layer. Right: output feature maps of the cross-connection layer.
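A Fig. 6-style readout can be obtained with a sub-model that returns a cross-connection layer's input and output; `xconn_y_to_uv` and the `y_img`/`u_img`/`v_img` arrays are hypothetical names for illustration:

```python
import tensorflow as tf

xconn = model.get_layer("xconn_y_to_uv")           # hypothetical layer name
viewer = tf.keras.Model(model.inputs, [xconn.input, xconn.output])

# One YUV image, split into its three single-channel superlayer inputs.
fmaps_in, fmaps_out = viewer.predict([y_img, u_img, v_img])
print(fmaps_in.shape, fmaps_out.shape)             # e.g. (1, 16, 16, C)
```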
VII. CONCLUSION

We have introduced cross-modal convolutional neural networks (X-CNNs), a novel architecture that decouples the convolutional processing of (typically image-based) input partitions, while allowing for periodical information flow between the processing pipelines, in order to achieve performance improvements in sparse data environments. We have applied this methodology to the popular CIFAR-10/100 image classification datasets for two baseline models, managing to significantly outperform them in low-data environments, while remaining competitive in high-data environments, outperforming them on all of the full-dataset experiments. Aside from reinforcing the claim that the X-CNN architecture can only be beneficial to a baseline model (depending on the levels of training data sparsity, potentially highly significantly), we have further verified that the introduced cross-connection layers perform rather complex functions (thus they are not limited to simple feature map passing) and are capable of mimicking human vision processes, confirming that the biological inspiration behind such a model is justified.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, 2012, pp. 1097-1105.
[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 2015.
[4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[5] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, "Multilayer networks," Journal of Complex Networks, vol. 2, no. 3, pp. 203-271, 2014.
[6] M. De Domenico, C. Granell, M. A. Porter, and A. Arenas, "The physics of multilayer networks," arXiv preprint, 2016.
[7] E. Estrada and J. Gómez-Gardeñes, "Communicability reveals a transition to coordinated behavior in multiplex networks," Physical Review E, vol. 89, no. 4, p. 042819, 2014.
[8] C. Granell, S. Gómez, and A. Arenas, "Competing spreading processes on multiplex networks: awareness and epidemics," Physical Review E, vol. 90, no. 1, p. 012808, 2014.
[9] P. Veličković and P. Liò, "Molecular multiplex network inference using Gaussian mixture hidden Markov models," Journal of Complex Networks, 2015.
[10] M. A. Eckert, N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D. Greicius, and V. Menon, "A cross-modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis," Human Brain Mapping, vol. 29, no. 7, pp. 848-857, 2008.
[11] A. L. Beer, T. Plank, and M. W. Greenlee, "Diffusion tensor imaging shows white matter tracts between human auditory and visual cortex," Experimental Brain Research, vol. 213, no. 2, pp. 299-308, 2011.
[12] W. Yang, J. Yang, Y. Gao, X. Tang, Y. Ren, S. Takahashi, and J. Wu, "Effects of sound frequency on audiovisual integration: An event-related potential study," PLoS ONE, vol. 10, no. 9, pp. 1-15, 2015.
[13] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[14] F. Chollet, "Keras," https://github.com/fchollet/keras, 2015.
[15] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv preprint arXiv:1605.02688, 2016.
[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[17] F. Chollet, "Keras CNN example for CIFAR-10," https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py, 2015.
[18] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[19] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," arXiv preprint, 2014.
[21] R. K. Srivastava, K. Greff, and J. Schmidhuber, "Training very deep networks," in Advances in Neural Information Processing Systems, 2015, pp. 2377-2385.
[22] D. Mishkin and J. Matas, "All you need is a good init," arXiv preprint arXiv:1511.06422, 2015.
[23] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, "Maxout networks," in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1319-1327.
[24] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249-256.
[25] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[26] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
