Many could be better than all: A novel instance-oriented algorithm for Multi-modal Multi-label problem
Yi Zhang, Cheng Zeng, Hao Cheng, Chongjun Wang, Lei Zhang*
National Key Laboratory for Novel Software Technology at Nanjing University, Nanjing University, Nanjing 210023, China
{mg1733091, njuzengc, chengh}@smail.nju.edu.cn, {chjwang, zhangl}@nju.edu.cn

ABSTRACT

With the emergence of diverse data collection techniques, objects in real applications can be represented by multi-modal features. Moreover, objects may have multiple semantic meanings, so the Multi-modal Multi-label [1] (MMML) problem has become a universal phenomenon. The quality of the data collected from different channels is inconsistent, and some channels may not benefit prediction; in real life, not all the modalities are needed for prediction. We therefore propose a novel instance-oriented Multi-modal Classifier Chains (MCC) algorithm for the MMML problem, which can make a confident prediction with only a subset of the modalities. MCC extracts different modalities for different instances in the testing phase. Extensive experiments on one real-world herbs dataset and two public datasets validate the proposed algorithm and reveal that it may be better to extract many, rather than all, of the modalities at hand.

Index Terms: multi-modal, multi-label, extraction cost

1. INTRODUCTION

In many natural scenarios, objects are complicated, carrying multi-modal features and multiple semantic meanings simultaneously. For one thing, data is collected from diverse channels and exhibits heterogeneous properties: each channel presents a different view of the same object, and each modality can have its own representation space and semantic meaning. Such data is known as multi-modal data. In a multi-modal setting, different modalities come with different extraction costs.
Previous research, e.g., dimensionality reduction methods, generally assumes that all the multi-modal features of test instances have already been extracted, without considering the extraction cost. In practical applications, however, no multi-modal features are prepared in advance; modality extraction must first be performed in the testing phase. For the complex multi-modal data collected nowadays, the heavy computational burden of extracting features for different modalities has become the dominant factor hurting efficiency.

For another, real-world objects may have multiple semantic meanings. To account for them, one direct solution is to assign a set of proper labels to the object to explicitly express its semantics. In multi-label learning, each object is associated with a set of labels instead of a single label. Among previous approaches, the classifier chains algorithm is a high-order approach that considers the relationships among labels, but it is affected by the ordering of the predicted labels.

To address the above challenges, this paper introduces a novel algorithm called Multi-modal Classifier Chains (MCC), inspired by Long Short-Term Memory (LSTM) [2][3]. Information about previously selected modalities can be regarded as stored in the memory cell. The deep-learning framework simultaneously generates the next modality of features and conducts classification from the raw input signals in a data-driven way, which can avoid some biases from feature engineering and reduce the mismatch between feature extraction and the classifier. The main contributions are:

• We propose a novel MCC algorithm that considers not only the interrelations among different modalities but also the relationships among different labels.

* Corresponding author
• The MCC algorithm utilizes multi-modal information under a budget, showing that MCC can make a confident prediction with a lower average modality extraction cost.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the proposed MCC model. Section 4 gives empirical evaluations showing the superiority of MCC. Finally, Section 5 presents conclusions and future work.

2. RELATED WORK

In this section, we briefly review state-of-the-art methods in the multi-modal and multi-label [4] fields. Since modality extraction in multi-modal learning is closely related to feature extraction [5], we review related work on both aspects.

Multi-label learning is a fundamental problem in machine learning with a wide range of applications. In multi-label learning, each instance is associated with multiple interdependent labels. Binary Relevance (BR) [6] is the simplest and most efficient multi-label algorithm; however, its effectiveness may be suboptimal because it ignores label correlations. To tackle this problem, Classifier Chains (CC) [7] was proposed as a high-order approach that considers correlations between labels. The performance of CC is, however, seriously affected by the training order of the labels. To account for the effect of ordering, Ensembles of Classifier Chains (ECC) [7] builds an ensemble of CC models from n random permutations instead of inducing a single classifier chain. Entropy Chain Classifier (ETCC) [8] extends CC by calculating the contribution between two labels using information-entropy theory, while Latent Dirichlet Allocation Multi-Label (LDAML) [9] exploits global correlations among labels; LDAML mainly addresses the large proportion of single-label instances in some multi-label datasets.
Due to the high dimensionality of the data, dimensionality reduction [10] or feature extraction must also be taken into consideration. Feature selection and dimensionality reduction are generally used to reduce the cost of feature extraction. [11] proposed regularized multilinear regression and selection to automatically select a set of features while optimizing prediction for high-dimensional data. Feature selection algorithms do not alter the original representation of the variables but merely select a subset of them. Most existing multi-label feature selection algorithms either boil down to solving multiple single-label feature selection problems or directly make use of imperfect labels; therefore, they may be unable to find discriminative features shared by multiple labels. To reduce the negative effects of imperfect label information when finding label correlations, [12] decomposes the multi-label information into a low-dimensional space and then employs the reduced space to steer the feature selection process. Furthermore, since we usually extract multiple features rather than a single feature for classification, several adaptive decision methods for multi-modal feature extraction have been proposed [13][14]. To further reduce the number of features needed at test time, [15] proposes a novel Discriminative Modal Pursuit (DMP) approach.

In this paper, taking both multi-label learning and feature extraction into consideration, we propose the MCC model, an end-to-end approach [16] for the MMML problem inspired by adaptive decision methods. Different from previous feature selection or dimensionality reduction methods, MCC extracts different modalities for different instances and different labels. Consequently, when presented with an unseen instance, we extract the most informative and cost-effective modalities for it.
Empirical study shows the efficiency and effectiveness of MCC, which achieves better classification performance with fewer modalities on average.

3. METHODOLOGY

This section first summarizes the formal symbols and definitions used throughout this paper, then introduces the formulation of the proposed MCC model. An overview of the MCC algorithm is shown in Fig. 1.

[Figure: in the training phase, N instances with P modalities of features and their labels are fed to MCC; in the testing phase, only a per-instance subset of the modalities is extracted.]
Fig. 1. Diagrammatic illustration of the MCC model. In the testing phase, circles shaded in red represent the features used for categorization prediction.

3.1. Notation

In the following, a bold character denotes a vector (e.g., X). The task of this paper is to learn a function h : X → 2^Y from a training dataset with N samples, D = {(X_i, Y_i)}_{i=1}^N. The i-th instance (X_i, Y_i) contains a feature vector X_i ∈ X and a label vector Y_i ∈ Y. X_i = [X_i^1, X_i^2, ..., X_i^P] ∈ R^{d_1 + d_2 + ... + d_P} is the concatenation of all modalities, where d_m is the dimensionality of the features in the m-th modality. Y_i = [y_i^1, y_i^2, ..., y_i^L] ∈ {−1, 1}^L denotes the label vector of X_i. P is the number of modalities and L is the number of labels.

Moreover, we define c = {c_1, c_2, ..., c_P} as the extraction costs of the P modalities. The modality extraction sequence of X_i is denoted S_i = {S_i^1, S_i^2, ..., S_i^m}, m ≤ P, where S_i^m ∈ {1, 2, ..., P} is the m-th modality of features extracted for X_i and satisfies: for all m ≠ n, S_i^m ≠ S_i^n. Note that different instances not only may have different extraction sequences, but those sequences may also differ in length.

Furthermore, we define some notation used in the testing phase.
Suppose there is a testing dataset with M samples, T = {(X_i, Y_i)}_{i=1}^M. We denote the predicted labels of T by Z = {Z_i}_{i=1}^M, in which Z_i = (z_i^1, z_i^2, ..., z_i^L) represents all predicted labels of X_i in T, and Z^j = (z_1^j, z_2^j, ..., z_M^j)^T represents the j-th predicted label over the whole testing dataset.

3.2. MCC algorithm

On one hand, MMML is related to multi-label learning, so we extend Classifier Chains to deal with it. On the other hand, each binary classification problem in the chain can be transformed into a multi-modal problem, the goal being a confident prediction with a lower average modality extraction cost.

3.2.1. Classifier Chains

Considering the correlations among labels, we extend Classifier Chains to the multi-modal multi-label problem. The Classifier Chains algorithm transforms multi-label learning into a chain of binary classification problems, where each subsequent binary classifier in the chain is built upon the predictions of the preceding ones [4], thereby taking label relativity fully into account. The greatest challenge for CC is how to form the chain ordering τ. In this paper, we propose a heuristic Gini-index-based Classifier Chains algorithm to specify τ.

First, we split the multi-label dataset into several single-label datasets: for the j-th label in {y^1, y^2, ..., y^L}, we rebuild the dataset D_j = {(X_i, y_i^j)}_{i=1}^N as the j-th single-label dataset. Second, we calculate the Gini index [17] of each rebuilt single-label dataset D_j (j = 1, 2, ..., L):

    Gini(D_j) = Σ_{k=1}^{|Y|} Σ_{k'≠k} p_k p_{k'} = 1 − Σ_{k=1}^{|Y|} p_k^2    (1)

where p_k is the proportion of samples in D_j belonging to the k-th class and |Y| is the number of classes in D_j.
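The Gini-based chain ordering can be sketched as follows (a minimal NumPy sketch; the array layout and function names are illustrative, not from the paper's code):

```python
import numpy as np

def gini_index(y):
    """Gini index of one single-label dataset D_j (Eq. 1): 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def label_chain(Y):
    """Sort labels by descending Gini index to obtain the chain tau."""
    ginis = [gini_index(Y[:, j]) for j in range(Y.shape[1])]
    return [int(j) for j in np.argsort(ginis)[::-1]]

# toy label matrix: 6 instances, 3 labels in {-1, +1}
Y = np.array([[ 1,  1, -1],
              [ 1,  1, -1],
              [ 1, -1, -1],
              [ 1, -1,  1],
              [ 1,  1,  1],
              [ 1,  1,  1]])
tau = label_chain(Y)  # most impure (hardest to split) label comes first
```

Here the constant first label has Gini index 0 and is placed last, while the evenly split third label heads the chain.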
We then obtain the predicted label chain τ = {τ_i}_{i=1}^L, composed of the indices of {Gini(D_i)}_{i=1}^L sorted in descending order. For the L class labels {y^1, y^2, ..., y^L}, we process the label set one by one according to τ and train L binary classifiers. For the j-th label y^{τ_j} (j = 1, 2, ..., L) in the ordered list τ, a corresponding binary training dataset is constructed by appending to each instance X_i the set of labels preceding y^{τ_j}:

    D_{τ_j} = {([X_i, xd_i^{τ_j}], y_i^{τ_j})}_{i=1}^N    (2)

where xd_i^{τ_j} = (y_i^{τ_1}, ..., y_i^{τ_{j−1}}) represents the binary assignment of the labels preceding y^{τ_j} on X_i (specifically, xd_i^{τ_1} = ∅), and [X_i, xd_i^{τ_j}] denotes the concatenation of X_i and xd_i^{τ_j}. We denote the extraction cost of xd_i^{τ_j} by c_l. If j > 1, we treat xd_i^{τ_j} as a new modality in D_{τ_j}; each instance in D_{τ_j} is then composed of P + 1 modalities of features, and the extraction cost vector is updated by appending c_l to c. We denote the new extraction cost vector by c′ = [c, c_l].

Meanwhile, a corresponding binary testing dataset is constructed by appending to each instance its relevance to the labels preceding y^{τ_j}:

    T_{τ_j} = {([X_i, xt_i^{τ_j}], y_i^{τ_j})}_{i=1}^M    (3)

where xt_i^{τ_j} = (z_i^{τ_1}, ..., z_i^{τ_{j−1}}) represents the binary assignment of the predicted labels preceding z_i^{τ_j} on X_i (specifically, xt_i^{τ_1} = ∅), and [X_i, xt_i^{τ_j}] denotes the concatenation of X_i and xt_i^{τ_j}. The extraction cost of xt_i^{τ_j} is the same c_l. If j > 1, each instance in T_{τ_j} is composed of P + 1 modalities of features and one label y_i^{τ_j}.

On this basis, we propose an efficient Multi-modal Classifier Chains (MCC) algorithm, introduced in the following subsection.
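The construction of D_{τ_j} in Eq. 2 can be sketched as follows (names are illustrative; at test time, Eq. 3 is the same construction with the predicted labels z substituted for the true labels):

```python
import numpy as np

def chained_dataset(X, Y, tau, j):
    """Binary dataset for the j-th label in the chain (Eq. 2): the labels
    preceding tau_j are appended to the features as one extra 'modality'."""
    if j > 0:
        extra = Y[:, tau[:j]]          # binary assignments of preceding labels
        X_aug = np.hstack([X, extra])  # [X_i, xd_i]
    else:
        X_aug = X                      # xd for the first label is empty
    return X_aug, Y[:, tau[j]]

X = np.arange(12).reshape(6, 2).astype(float)   # 6 instances, 2 features
Y = np.array([[ 1, -1,  1],
              [ 1,  1, -1],
              [-1,  1,  1],
              [ 1, -1, -1],
              [-1,  1,  1],
              [ 1,  1, -1]])
tau = [2, 0, 1]
X1, y1 = chained_dataset(X, Y, tau, 1)  # appends y^{tau_1} = y^2 as a new modality
```

Because the appended columns form one extra modality, the extraction cost vector grows by one entry (c′ = [c, c_l]) exactly as described above.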
By passing the training dataset D_{τ_j} together with the extraction cost c′ as parameters of MCC, we obtain Z^{τ_j}. The final prediction for T is the concatenation of the Z^{τ_j} (j = 1, 2, ..., L), i.e., Z = (Z^{τ_1}, Z^{τ_2}, ..., Z^{τ_L}).

3.2.2. Multi-modal Classifier Chains

To induce each binary classifier f_l : X → {−1, 1} with a lower average modality extraction cost and better performance, we design the Multi-modal Classifier Chains (MCC) algorithm, inspired by LSTM. MCC extracts modalities of features one by one until it is able to make a confident prediction. MCC extracts modality sequences of different lengths for different instances, whereas previous feature extraction methods extract all modalities and use the same features for every instance.

MCC adopts an LSTM network to convert the input into a set of hidden representations H_i^t = [h_i^1, h_i^2, ..., h_i^t], h_i^t ∈ R^h. Here, X̂_i = [X̂_i^1, ..., X̂_i^m, ..., X̂_i^P] is an adaptation of X_i: in the t-th step, the modality to be extracted is denoted S_i^t, and X̂_i^m = X_i^{S_i^t} if m = S_i^t, and 0 otherwise. For example, if S_i^t = 3, then X̂_i = [0, 0, X_i^3, ..., 0].

Similar to a peephole LSTM, MCC has three gates as well as two states (a forget gate layer, an input gate layer, a cell state layer, an output gate layer, and a hidden state layer):

    f_t = σ([W_{fc}, W_{fh}, W_{fx}] [C_{t−1}, h_{t−1}, X̂_t]^T + b_f)
    i_t = σ([W_{ic}, W_{ih}, W_{ix}] [C_{t−1}, h_{t−1}, X̂_t]^T + b_i)
    C_t = f_t · C_{t−1} + i_t · tanh([W_{ch}, W_{cx}] [h_{t−1}, X̂_t]^T + b_C)
    o_t = σ([W_{oc}, W_{oh}, W_{ox}] [C_t, h_{t−1}, X̂_t]^T + b_o)
    h_t = o_t · tanh(C_t)

Different from a standard LSTM, MCC adds two fully connected layers to predict the current label and the next modality to be extracted.
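The masked input X̂ and one step of the peephole-style recurrence above can be sketched as follows (a minimal NumPy sketch with randomly initialized weights; the weight shapes follow the concatenations in the gate equations, and all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_input(modalities, s):
    """X-hat: keep only the modality selected at this step, zero out the rest."""
    return np.concatenate([x if m == s else np.zeros_like(x)
                           for m, x in enumerate(modalities)])

def mcc_lstm_step(x, h_prev, C_prev, W, b):
    """One peephole LSTM step following the gate equations of Sec. 3.2.2."""
    chx = np.concatenate([C_prev, h_prev, x])   # [C_{t-1}, h_{t-1}, x-hat_t]
    f = sigmoid(W["f"] @ chx + b["f"])          # forget gate
    i = sigmoid(W["i"] @ chx + b["i"])          # input gate
    C = f * C_prev + i * np.tanh(W["c"] @ np.concatenate([h_prev, x]) + b["c"])
    o = sigmoid(W["o"] @ np.concatenate([C, h_prev, x]) + b["o"])  # peeks at C_t
    h = o * np.tanh(C)                          # hidden state
    return h, C

rng = np.random.default_rng(0)
mods = [rng.standard_normal(2) for _ in range(3)]   # P = 3 modalities, d_m = 2
x = masked_input(mods, s=1)                         # extract only modality 1
H, D = 4, x.size
W = {k: 0.1 * rng.standard_normal((H, 2 * H + D)) for k in ("f", "i", "o")}
W["c"] = 0.1 * rng.standard_normal((H, H + D))
b = {k: np.zeros(H) for k in W}
h, C = mcc_lstm_step(x, np.zeros(H), np.zeros(H), W, b)
```

Note that the masked vector keeps the full P-modality layout, so the network always sees inputs of the same dimensionality regardless of which modality was extracted.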
First, there is a full connection between the hidden layer and the label prediction layer, with weight vector Ŵ_l; second, there is a full connection between the hidden layer and the modality prediction layer, with weight vector Ŵ_m. The corresponding bias vectors are denoted b_l and b_m, respectively.

• Label prediction layer: predicts the label according to a nonlinear softmax function f_j^l(·):

    f_j^l(H_i^t) = σ(H_i^t Ŵ_l + b_l)    (4)

• Modality prediction layer: predicts the next modality according to a linear function f_j^m(·) and selects the maximum as the next modality to extract:

    f_j^m(H_i^t) = H_i^t Ŵ_m + b_m    (5)

We use F_L = [f_1^l, f_2^l, ..., f_L^l] and F_M = [f_1^m, f_2^m, ..., f_L^m] to denote the label prediction function set and the modality prediction function set, respectively.

Next, we design a loss function composed of a loss term and a regularization term to produce optimal results faster. First, the loss of instance X̂_i at step t with modality S_i^t is:

    L_i^t = L_l(f_j^l(H_i^t), y_i) + L_m(f_j^m(H_i^t), X̂_i^t)    (6)

We adopt the log loss for the label prediction loss L_l and the hinge loss for the modality prediction loss L_m, where the modality prediction target is measured by distances to the K nearest neighbors [18]. Meanwhile, we add a ridge (L2) regularization term to the overall loss:

    Ω_i^t = ||Ŵ_m||^2 + ||Ŵ_l||^2 + ||c · f_j^m(H_i^t)||    (7)

where ||·|| denotes the L2 norm and c is the vector of per-modality extraction costs. The loss term is the sum of the losses over all instances at the t-th step, so the overall loss is:

    L^t = Σ_{i}^{N} (L_i^t + λ · Ω_i^t)    (8)

where λ = 0.1 trades off the loss and the regularization.

To optimize the loss function L^t, we adopt AdaDelta [19], a per-dimension learning rate method for gradient descent. We denote all the parameters in Eq. 8 by W = [Ŵ_m, Ŵ_l, λ].
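Under the definitions of Eqs. 6-8, a per-instance loss can be sketched as follows. This is a simplified illustration: the paper derives the modality target from distances to K nearest neighbors, which is abstracted here into a given target index m_true, and a standard multiclass hinge loss stands in for L_m.

```python
import numpy as np

def mcc_instance_loss(p_label, y, m_scores, m_true, W_l, W_m, cost, lam=0.1):
    """L_i^t + lam * Omega_i^t (Eqs. 6-8): log loss on the label, hinge loss
    on the next-modality scores, L2 regularization plus a cost penalty."""
    log_loss = -np.log(p_label) if y == 1 else -np.log(1.0 - p_label)  # L_l
    margins = np.maximum(0.0, 1.0 + m_scores - m_scores[m_true])       # L_m
    margins[m_true] = 0.0
    hinge = margins.sum()
    omega = (np.sum(W_m ** 2) + np.sum(W_l ** 2)
             + np.linalg.norm(cost * m_scores))                        # Eq. 7
    return log_loss + hinge + lam * omega

# toy values: confident correct label, modality 1 clearly preferred
loss = mcc_instance_loss(p_label=0.8, y=1,
                         m_scores=np.array([0.5, 2.0]), m_true=1,
                         W_l=np.zeros(3), W_m=np.zeros(3),
                         cost=np.ones(2))
```

The cost-weighted norm in Ω penalizes high scores on expensive modalities, which is what steers the modality prediction layer toward cheap-but-informative choices.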
At the t-th step, we start by computing the gradient g_t = ∂L^t/∂W_t and accumulating a decaying average of the squared gradients:

    E[g^2]_t = ρ E[g^2]_{t−1} + (1 − ρ) g_t^2    (9)

where ρ = 0.95 is a decay constant. The resulting parameter update is:

    ΔW_t = − (√(E[ΔW^2]_{t−1} + ε) / √(E[g^2]_t + ε)) · g_t    (10)

where ε = 1e−8 is a small constant. We then accumulate the squared updates:

    E[ΔW^2]_t = ρ E[ΔW^2]_{t−1} + (1 − ρ) ΔW_t^2    (11)

The pseudo-code of MCC is summarized in Algorithm 1. N_b denotes the batch size in the training phase; N_iter is the maximum number of iterations; C_th is the cost threshold; A_th is the accuracy threshold for the predicted label; ĉ_i^t denotes the accumulated extraction cost; and a_i^t denotes the accuracy of the current predicted label.

Algorithm 1: The pseudo-code of the MCC algorithm
Input: D = {(X_i, Y_i)}_{i=1}^N: training dataset; c = {c_i}_{i=1}^P: extraction costs of the P modalities
Output: F_L: set of label prediction functions; F_M: set of modality prediction functions
 1: Calculate the predicted label chain τ = {τ_i}_{i=1}^L with Eq. 1
 2: for j in τ do
 3:   Construct D_{τ_j} with Eq. 2
 4:   while cnt < N_iter, cnt++ do
 5:     Initialize E[g^2]_0 = E[ΔW^2]_0 = 0
 6:     Choose N_b samples from D_{τ_j}
 7:     for i = 1 : N_b do
 8:       for t = 1 : P do
 9:         Select S_i^t with Eq. 5 and calculate X̂_i^{S_i^t}
10:         ĉ_i^t = ĉ_i^t + c_{S_i^t}
11:         if ĉ_i^t > C_th or a_i^t > A_th then
12:           break
13:         end if
14:         Calculate L^t with Eq. 8
15:         Compute the gradient g_t = ∂L^t/∂W_t
16:         Accumulate E[g^2]_t with Eq. 9
17:         Compute the update ΔW_t with Eq. 10
18:         Accumulate E[ΔW^2]_t with Eq. 11
19:         Update W_{t+1} = W_t + ΔW_t
20:       end for
21:     end for
22:   end while
23:   Update f_j^l and f_j^m as in Eq. 4 and Eq. 5
24: end for
25: return F_L, F_M

4. EXPERIMENT

4.1. Dataset Description

We manually collected one real-world Herbs dataset and adapted two publicly available datasets, Emotions [20] and Scene [6].
For Herbs, there are 5 modalities with explicit modal partitions: channel tropism, symptom, function, dosage, and flavor. For Emotions and Scene, we divide the features into modalities according to information-entropy gain. The details are summarized in Table 1.

Table 1. Dataset description. N, L, and P denote the number of instances, labels, and modalities in each dataset, respectively; D lists the dimensionality of each modality.

Datasets  | N     | L  | P | D
Herbs     | 11104 | 29 | 5 | [13, 653, 433, 768, 36]
Emotions  | 593   | 6  | 3 | [32, 32, 8]
Scene     | 2407  | 6  | 6 | [49, 49, 49, 49, 49, 49]

4.2. Experimental Settings

All experiments were run on a machine with a 3.2 GHz Intel Core i7 processor and 64 GB of main memory. We compare MCC with four multi-label algorithms (BR, CC, ECC, and ML-KNN [21]) and one state-of-the-art multi-modal algorithm (DMP [15]). For the multi-label learners, all modalities of a dataset are concatenated into a single modal input; for the multi-modal method, each label is treated independently.

The F-measure is one of the most popular metrics for evaluating binary classification [22]. For a fair comparison, we employ three widely adopted standard metrics: Micro-average, Hamming-Loss, and Subset-Accuracy [4]. In addition, we use Cost-average to measure the average modality extraction cost. For convenience in computing the regularization function, the extraction cost of each modality is set to 1.0 in the experiments. Furthermore, we set the cost of the new modality (the predicted labels) to 0.1 to demonstrate its advantage compared with DMP.

4.3. Experimental Results

For all algorithms, we report the best results with optimal parameters in terms of classification performance. We perform 10-fold cross-validation (CV) and report the average of the results. Table 2 shows the experimental results of the proposed MCC algorithm and the five comparison algorithms.
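The three standard classification metrics employed above can be computed as follows for label matrices in {−1, +1} (a straightforward sketch, not the evaluation code used in the paper):

```python
import numpy as np

def micro_f1(Y_true, Y_pred):
    """Micro-average F-measure: pool TP/FP/FN over all labels and instances."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == -1) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == -1))
    return 2.0 * tp / (2.0 * tp + fp + fn)

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs that are misclassified."""
    return np.mean(Y_true != Y_pred)

def subset_accuracy(Y_true, Y_pred):
    """Fraction of instances whose entire label vector is predicted exactly."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

# toy example: 2 instances, 2 labels; one false negative on the second instance
Y_true = np.array([[ 1, -1], [ 1,  1]])
Y_pred = np.array([[ 1, -1], [ 1, -1]])
```

Subset-Accuracy is the strictest of the three, since one wrong label makes the whole instance count as a miss; Hamming-Loss is the most forgiving.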
MCC outperforms the other five algorithms on all metrics. Moreover, as shown in Table 3, MCC incurs a lower average modality extraction cost than DMP, while the other four multi-label algorithms use all the modalities.

Table 2. Comparison results (mean ± std). ↑/↓ indicates that larger/smaller values of a criterion are better. The best performance on each dataset is bolded.

Dataset  | Algorithm | Micro-average ↑ | Hamming-Loss ↓ | Subset-Accuracy ↑
Herbs    | BR        | 0.621 ± 0.061   | 0.033 ± 0.004  | 0.349 ± 0.060
Herbs    | CC        | 0.624 ± 0.060   | 0.033 ± 0.004  | 0.363 ± 0.072
Herbs    | ECC       | 0.675 ± 0.010   | 0.035 ± 0.001  | 0.376 ± 0.018
Herbs    | MLKNN     | 0.544 ± 0.047   | 0.039 ± 0.005  | 0.281 ± 0.068
Herbs    | DMP       | 0.635 ± 0.073   | 0.032 ± 0.004  | 0.398 ± 0.063
Herbs    | MCC       | 0.706 ± 0.014   | 0.029 ± 0.004  | 0.437 ± 0.067
Emotions | BR        | 0.536 ± 0.036   | 0.240 ± 0.021  | 0.175 ± 0.036
Emotions | CC        | 0.541 ± 0.034   | 0.240 ± 0.022  | 0.184 ± 0.037
Emotions | ECC       | 0.647 ± 0.038   | 0.202 ± 0.023  | 0.283 ± 0.042
Emotions | MLKNN     | 0.529 ± 0.030   | 0.278 ± 0.022  | 0.212 ± 0.038
Emotions | DMP       | 0.607 ± 0.072   | 0.206 ± 0.023  | 0.253 ± 0.061
Emotions | MCC       | 0.659 ± 0.081   | 0.190 ± 0.023  | 0.292 ± 0.084
Scene    | BR        | 0.672 ± 0.140   | 0.098 ± 0.037  | 0.552 ± 0.160
Scene    | CC        | 0.678 ± 0.119   | 0.109 ± 0.041  | 0.624 ± 0.117
Scene    | ECC       | 0.705 ± 0.015   | 0.094 ± 0.005  | 0.596 ± 0.016
Scene    | MLKNN     | 0.660 ± 0.114   | 0.113 ± 0.033  | 0.553 ± 0.112
Scene    | DMP       | 0.815 ± 0.077   | 0.061 ± 0.020  | 0.704 ± 0.118
Scene    | MCC       | 0.842 ± 0.040   | 0.057 ± 0.015  | 0.738 ± 0.082

Table 3. Comparison results for the average modality extraction cost (the smaller the better). OTHERS denotes the BR, CC, ECC, and MLKNN algorithms. The best performance on each dataset is bolded.

Algorithm | Herbs         | Emotions      | Scene
OTHERS    | 5.0 ± 0.0     | 3.0 ± 0.0     | 6.0 ± 0.0
DMP       | 3.031 ± 0.258 | 1.738 ± 0.279 | 3.202 ± 0.432
MCC       | 2.520 ± 0.182 | 1.519 ± 0.313 | 2.109 ± 0.324

5. CONCLUSION

Complex objects, e.g., articles and images, can often be represented with multi-modal and multi-label information. However, the quality of the modalities extracted from various channels is inconsistent, and using data from all modalities is not a wise decision.
In this paper, we propose a novel Multi-modal Classifier Chains (MCC) algorithm to improve categorization prediction for the MMML problem. Experiments on one real-world dataset and two public datasets validate the effectiveness of our algorithm. MCC makes efficient use of modalities and can make a confident prediction with many, rather than all, of them. Consequently, MCC reduces the modality extraction cost, though it is more time-consuming than the other algorithms. Improving extraction parallelism is an interesting direction for future work.

6. ACKNOWLEDGEMENT

This paper is supported by the National Key Research and Development Program of China (Grant No. 2016YFB1001102), the National Natural Science Foundation of China (Grant No. 61876080), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.

7. REFERENCES

[1] Han-Jia Ye, De-Chuan Zhan, Xiaolin Li, Zhen-Chuan Huang, and Yuan Jiang, "College student scholarships and subsidies granting: A multi-modal multi-label approach," in Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 559-568.

[2] Alex Graves and Jürgen Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602-610, 2005.

[3] R. Bertolami, H. Bunke, S. Fernandez, A. Graves, M. Liwicki, and J. Schmidhuber, "A novel connectionist system for improved unconstrained handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.

[4] Min-Ling Zhang and Zhi-Hua Zhou, "A review on multi-label learning algorithms," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 8, pp. 1819-1837, 2014.

[5] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp.
2507-2517, 2007.

[6] Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown, "Learning multi-label scene classification," Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.

[7] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank, "Classifier chains for multi-label classification," Machine Learning, vol. 85, no. 3, pp. 333, 2011.

[8] Yue Peng, Ming Fang, Chongjun Wang, and Junyuan Xie, "Entropy chain multi-label classifiers for traditional medicine diagnosing Parkinson's disease," in Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE, 2015, pp. 856-862.

[9] Yue Peng, Chi Tang, Gang Chen, Junyuan Xie, and Chongjun Wang, "Multi-label learning by exploiting label correlations for TCM diagnosing Parkinson's disease," in Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on. IEEE, 2017, pp. 590-594.

[10] Sam T. Roweis and Lawrence K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.

[11] Xiaonan Song and Haiping Lu, "Multilinear regression for embedded feature selection with application to fMRI analysis," in AAAI, 2017, pp. 2562-2568.

[12] Ling Jian, Jundong Li, Kai Shu, and Huan Liu, "Multi-label informed feature selection," in IJCAI, 2016, pp. 1627-1633.

[13] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama, "An LP for sequential learning under budgets," in Artificial Intelligence and Statistics, 2014, pp. 987-995.

[14] Joseph Wang, Kirill Trapeznikov, and Venkatesh Saligrama, "Efficient learning by directed acyclic graph for resource constrained prediction," in Advances in Neural Information Processing Systems, 2015, pp. 2152-2160.

[15] Yang Yang, De-Chuan Zhan, Ying Fan, and Yuan Jiang, "Instance specific discriminative modal pursuit: A serialized approach," in Asian Conference on Machine Learning, 2017, pp. 65-80.
[16] Wei Li, Zheng Yang, and Xu Sun, "Exploration on generating traditional Chinese medicine prescription from symptoms with an end-to-end method," arXiv preprint arXiv:1801.09030, 2018.

[17] Leo Breiman, Classification and Regression Trees, Routledge, 2017.

[18] Shemim Begum, Debasis Chakraborty, and Ram Sarkar, "Data classification using feature selection and KNN machine learning approach," in Computational Intelligence and Communication Networks (CICN), 2015 International Conference on. IEEE, 2015, pp. 811-814.

[19] Matthew D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint, 2012.

[20] Konstantinos Trohidis, Grigorios Tsoumakas, George Kalliris, and Ioannis P. Vlahavas, "Multi-label classification of music into emotions," in ISMIR, 2008, vol. 8, pp. 325-330.

[21] Min-Ling Zhang and Zhi-Hua Zhou, "ML-KNN: A lazy learning approach to multi-label learning," Pattern Recognition, vol. 40, no. 7, pp. 2038-2048, 2007.

[22] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier, "Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization," in International Conference on Machine Learning, 2013, pp. 1130-1138.