Softmax Q-Distribution Estimation for Structured Prediction: A Theoretical Interpretation for RAML


Authors: Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig, Eduard Hovy

Under review as a conference paper at ICLR 2018

SOFTMAX Q-DISTRIBUTION ESTIMATION FOR STRUCTURED PREDICTION: A THEORETICAL INTERPRETATION FOR RAML

Xuezhe Ma, Pengcheng Yin, Jingzhou Liu, Graham Neubig & Eduard Hovy
Language Technologies Institute
Carnegie Mellon University
{xuezhem, pcyin, liujingzhou, gneubig, hovy}@cs.cmu.edu

ABSTRACT

Reward augmented maximum likelihood (RAML), a simple and effective learning framework to directly optimize towards the reward function in structured prediction tasks, has led to a number of impressive empirical successes. RAML incorporates task-specific reward by performing maximum-likelihood updates on candidate outputs sampled according to an exponentiated payoff distribution, which gives higher probabilities to candidates that are close to the reference output. While RAML is notable for its simplicity, efficiency, and impressive empirical successes, the theoretical properties of RAML, especially the behavior of the exponentiated payoff distribution, have not been examined thoroughly. In this work, we introduce softmax Q-distribution estimation, a novel theoretical interpretation of RAML which reveals the relation between RAML and Bayesian decision theory. The softmax Q-distribution can be regarded as a smooth approximation of the Bayes decision boundary, and the Bayes decision rule is achieved by decoding with this Q-distribution. We further show that RAML is equivalent to approximately estimating the softmax Q-distribution, with the temperature τ controlling the approximation error. We perform two experiments, one on synthetic data for multi-class classification and one on real data for image captioning, to demonstrate the relationship between RAML and the proposed softmax Q-distribution estimation method, verifying our theoretical analysis.
Additional experiments on three structured prediction tasks with rewards defined on sequential (named entity recognition), tree-based (dependency parsing) and irregular (machine translation) structures show notable improvements over maximum likelihood baselines.

1 INTRODUCTION

Many problems in machine learning involve structured prediction, i.e., predicting a group of outputs that depend on each other. Recent advances in sequence labeling (Ma & Hovy, 2016), syntactic parsing (McDonald et al., 2005) and machine translation (Bahdanau et al., 2015) benefit from the development of more sophisticated discriminative models for structured outputs, such as the seminal work on conditional random fields (CRFs) (Lafferty et al., 2001) and large margin methods (Taskar et al., 2004), demonstrating the importance of joint predictions across multiple output components.

A principal problem in structured prediction is direct optimization towards the task-specific metrics (i.e., rewards) used in evaluation, such as token-level accuracy for sequence labeling or BLEU score for machine translation. In contrast to maximum likelihood (ML) estimation, which uses likelihood as a surrogate for the task-specific metric, a number of techniques (Taskar et al., 2004; Gimpel & Smith, 2010; Volkovs et al., 2011; Shen et al., 2016) have emerged to incorporate task-specific rewards into optimization. Among these methods, reward augmented maximum likelihood (RAML) (Norouzi et al., 2016) has stood out for its simplicity and effectiveness, leading to state-of-the-art performance on several structured prediction tasks, such as machine translation (Wu et al., 2016) and image captioning (Liu et al., 2016). Instead of only maximizing the log-likelihood of the ground-truth output as in ML, RAML attempts to maximize the expected log-likelihood of all possible candidate outputs w.r.t.
the exponentiated payoff distribution, which is defined as the normalized exponentiated reward. By incorporating task-specific reward into the payoff distribution, RAML combines the computational efficiency of ML with the conceptual advantages of reinforcement learning (RL) algorithms that optimize the expected reward (Ranzato et al., 2016; Bahdanau et al., 2017).

Simple as RAML appears to be, its empirical success has piqued interest in analyzing and justifying RAML from both theoretical and empirical perspectives. In their pioneering work, Norouzi et al. (2016) showed that both RAML and RL optimize the KL divergence between the exponentiated payoff distribution and the model distribution, but in opposite directions. Moreover, when applied to log-linear models, RAML can be shown to be equivalent to the softmax-margin training method (Gimpel & Smith, 2010; Gimpel, 2012). Nachum et al. (2016) applied the payoff distribution to improve the exploration properties of policy gradient for model-free reinforcement learning.

Despite these efforts, the theoretical properties of RAML, especially the interpretation and behavior of the exponentiated payoff distribution, have largely remained under-studied (§2). First, RAML attempts to match the model distribution with the heuristically designed exponentiated payoff distribution, whose behavior has largely remained under-appreciated, resulting in a non-intuitive asymptotic property. Second, there is no direct theoretical proof showing that RAML can deliver a prediction function better than ML. Third, no attempt (to the best of our knowledge) has been made to further improve RAML from the algorithmic and practical perspectives.

In this paper, we attempt to resolve the above-mentioned under-studied problems by providing a theoretical interpretation of RAML.
Our contributions are three-fold: (1) Theoretically, we introduce the framework of softmax Q-distribution estimation, through which we are able to interpret the role the payoff distribution plays in RAML (§3). Specifically, the softmax Q-distribution serves as a smooth approximation to the Bayes decision boundary. By comparing the payoff distribution with this softmax Q-distribution, we show that RAML approximately estimates the softmax Q-distribution, therefore approximating the Bayes decision rule. Hence, our theoretical results provide an explanation of what distribution RAML asymptotically models, and why the prediction function provided by RAML outperforms the one provided by ML. (2) Algorithmically, we further propose softmax Q-distribution maximum likelihood (SQDML), which improves RAML by achieving the exact Bayes decision boundary asymptotically. (3) Experimentally, through one experiment using synthetic data on multi-class classification and one using real data on image captioning, we verify our theoretical analysis, showing that SQDML is consistently as good as or better than RAML on the task-specific metrics we desire to optimize. Additionally, through three structured prediction tasks in natural language processing (NLP) with rewards defined on sequential (named entity recognition), tree-based (dependency parsing) and complex irregular structures (machine translation), we deepen the empirical analysis of Norouzi et al. (2016), showing that RAML consistently leads to improved performance over ML on task-specific metrics, while ML yields better exact match accuracy (§4).

2 BACKGROUND

2.1 NOTATIONS

Throughout we use uppercase letters for random variables (and occasionally for matrices as well), and lowercase letters for realizations of the corresponding random variables. Let X ∈ X be the input, and Y ∈ Y be the desired structured output; e.g., in machine translation X and Y are French and English sentences, respectively.
We assume that the set of all possible outputs Y is finite; for instance, in machine translation all English sentences are up to a maximum length. r(y, y*) denotes the task-specific reward function (e.g., BLEU score), which evaluates a predicted output y against the ground-truth y*. Let P denote the true distribution of the data, i.e., (X, Y) ~ P, and D = {(x_i, y_i)}_{i=1}^n be our training samples, where {x_i, i = 1, ..., n} (resp. y_i) are usually i.i.d. samples of X (resp. Y). Let P = {P_θ : θ ∈ Θ} denote a parametric statistical model indexed by parameter θ ∈ Θ, where Θ is the parameter space. Some widely used parametric models are conditional log-linear models (Lafferty et al., 2001) and deep neural networks (Sutskever et al., 2014) (details in Appendix D.2). Once the parametric statistical model is learned, given an input x, model inference (a.k.a. decoding) is performed by finding an output y* achieving the highest conditional probability:

    y^* = \operatorname*{argmax}_{y \in \mathcal{Y}} P_{\hat{\theta}}(y \mid x)    (1)

where θ̂ is the set of parameters learned on the training data D.

2.2 MAXIMUM LIKELIHOOD

Maximum likelihood minimizes the negative log-likelihood of the parameters given the training data:

    \hat{\theta}_{ML} = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} -\log P_{\theta}(y_i \mid x_i) = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X)}\left[ \mathrm{KL}\!\left( \tilde{P}(\cdot \mid X) \,\|\, P_{\theta}(\cdot \mid X) \right) \right]    (2)

where P̃(X) and P̃(·|X) are derived from the empirical distribution of the training data D:

    \tilde{P}(X = x, Y = y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(x_i = x, y_i = y)    (3)

and I(·) is the indicator function. From (2), ML attempts to learn a conditional model distribution P_{θ̂_ML}(·|X = x) that is as close to the conditional empirical distribution P̃(·|X = x) as possible, for each x ∈ X.
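To make the objective in (2) concrete: for a model flexible enough to treat each input separately, the optimum of (2) is exactly the empirical conditional distribution P̃(·|x) in (3). A minimal sketch (the toy data and helper name are ours, not from the paper):

```python
from collections import Counter

def empirical_conditional(data, x):
    """P~(y | x): relative frequency of each y among training pairs with input x."""
    ys = [y for (xi, y) in data if xi == x]
    counts = Counter(ys)
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()}

# Toy dataset in which input "a" appears with two different outputs.
data = [("a", "y1"), ("a", "y1"), ("a", "y2"), ("b", "y2")]
print(empirical_conditional(data, "a"))  # -> {'y1': 0.666..., 'y2': 0.333...}
```

An unconstrained categorical model that matches these frequencies attains the minimum of the negative log-likelihood in (2).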
Theoretically, under certain regularity conditions (Wasserman, 2013), asymptotically as n → ∞, P_{θ̂_ML}(·|X = x) converges to the true distribution P(·|X = x), since P̃(·|X = x) converges to P(·|X = x) for each x ∈ X.

2.3 REWARD AUGMENTED MAXIMUM LIKELIHOOD

As proposed in Norouzi et al. (2016), RAML incorporates task-specific rewards by re-weighting the log-likelihood of each possible candidate output proportionally to its exponentiated scaled reward:

    \hat{\theta}_{RAML} = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} \left\{ -\sum_{y \in \mathcal{Y}} q(y \mid y_i; \tau) \log P_{\theta}(y \mid x_i) \right\}    (4)

where the reward information is encoded by the exponentiated payoff distribution, with the temperature τ controlling its smoothness:

    q(y \mid y^*; \tau) = \frac{\exp(r(y, y^*)/\tau)}{\sum_{y' \in \mathcal{Y}} \exp(r(y', y^*)/\tau)} = \frac{\exp(r(y, y^*)/\tau)}{Z(y^*; \tau)}    (5)

Norouzi et al. (2016) showed that (4) can be re-expressed in terms of KL divergence as follows:

    \hat{\theta}_{RAML} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{P}(X, Y)}\left[ \mathrm{KL}\!\left( q(\cdot \mid Y; \tau) \,\|\, P_{\theta}(\cdot \mid X) \right) \right]    (6)

where P̃ is the empirical distribution in (3). As discussed in Norouzi et al. (2016), the globally optimal solution of RAML is achieved when the learned model distribution matches the exponentiated payoff distribution, i.e., P_{θ̂_RAML}(·|X = x) = q(·|Y = y; τ) for each (x, y) ∈ D and for some fixed value of τ.

Open Problems in RAML. We identify three open issues in the theoretical interpretation of RAML: i) Though both P_{θ̂_RAML}(·|X = x) and q(·|Y = y; τ) are distributions defined over the output space Y, the former is conditioned on the input X while the latter is conditioned on the output Y, which appears to serve as ground-truth but is itself sampled from the data distribution P. This makes the behavior of RAML, which attempts to match the two, unintuitive; ii) Suppose that the training data contain two instances with the same input but different outputs, i.e., (x, y), (x, y′) ∈ D.
Then P_{θ̂_RAML}(·|X = x) has two "targets" q(·|Y = y; τ) and q(·|Y = y′; τ), making it unclear what distribution P_{θ̂_RAML}(·|X = x) asymptotically converges to. iii) There is no rigorous theoretical evidence showing that generating from P_{θ̂_RAML}(y|x) yields a better prediction function than generating from P_{θ̂_ML}(y|x). To the best of our knowledge, no attempt has been made to theoretically address these problems.

The main goal of this work is to theoretically analyze the properties of RAML, in the hope that we may eventually better understand it by answering these questions, and further improve it by proposing a new training framework. To this end, in the next section we introduce the softmax Q-distribution estimation framework, facilitating our later analysis.

3 SOFTMAX Q-DISTRIBUTION ESTIMATION

With the end goal of theoretically interpreting RAML in mind, in this section we present the softmax Q-distribution estimation framework. We first provide background on Bayesian decision theory (§3.1) and the softmax approximation of deterministic distributions (§3.2). Then, we propose the softmax Q-distribution (§3.3) and establish the framework of estimating the softmax Q-distribution from training data, called softmax Q-distribution maximum likelihood (SQDML, §3.4). In §3.5, we analyze SQDML, which is central in linking RAML and softmax Q-distribution estimation.

3.1 BAYESIAN DECISION THEORY

Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification, which quantifies the trade-offs between various classification decisions using the probabilities and rewards (losses) that accompany such decisions. Based on the notation set up in §2.1, let H denote the set of all possible prediction functions from input to output space, i.e., H = {h : X → Y}.
Then, the expected reward of a prediction function h is:

    R(h) = \mathbb{E}_{P(X, Y)}\left[ r(h(X), Y) \right]    (7)

where r(·, ·) is the reward function accompanying the structured prediction task. Bayesian decision theory states that the global maximum of R(h), i.e., the optimal expected prediction reward, is achieved when the prediction function is the so-called Bayes decision rule:

    h^*(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbb{E}_{P(Y \mid X = x)}\left[ r(y, Y) \right] = \operatorname*{argmax}_{y \in \mathcal{Y}} R(y \mid x)    (8)

where R(y|x) = E_{P(Y|X=x)}[r(y, Y)] is called the conditional reward. Thus, the Bayes decision rule states that to maximize the overall reward, one computes the conditional reward for each output y ∈ Y and then selects the output y for which R(y|x) is maximized. Importantly, when the reward function is the indicator function, i.e., I(y = y′), the Bayes decision rule reduces to a specific instantiation called the Bayes classifier:

    h_c(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid X = x)    (9)

where P(Y|X = x) is the true conditional distribution of the data defined in §2.1.

In §2.2, we saw that ML attempts to learn the true distribution P. Thus, in the optimal case, decoding from the distribution learned with ML, i.e., P_{θ̂_ML}(Y|X = x), produces the Bayes classifier h_c(x), but not the more general Bayes decision rule h^*(x). In the rest of this section, we derive a theoretical proof showing that decoding from the distribution learned with RAML, i.e., P_{θ̂_RAML}(Y|X = x), approximately achieves h^*(x), illustrating why RAML yields a prediction function with improved performance towards the optimized reward function r(·, ·) over ML.

3.2 SOFTMAX APPROXIMATION OF DETERMINISTIC DISTRIBUTIONS

Aiming to provide a smooth approximation of the Bayes decision boundary determined by the Bayes decision rule in (8), we first describe a widely used approximation of deterministic distributions using the softmax function. Let F = {f_k : k ∈ K} denote a class of functions, where f_k : X → R for all k ∈ K. We assume that K is finite. Then, we define the random variable Z = argmax_{k ∈ K} f_k(X), where X ∈ X is our input random variable. Obviously, Z is deterministic when X is given, i.e.,

    P(Z = z \mid X = x) = \begin{cases} 1, & \text{if } z = \operatorname*{argmax}_{k \in \mathcal{K}} f_k(x) \\ 0, & \text{otherwise} \end{cases}    (10)

for each z ∈ K and x ∈ X.

The softmax function provides a smooth approximation of the point distribution in (10), with a temperature parameter τ > 0 serving as a hyper-parameter that controls the smoothness of the approximating distribution around the target one:

    Q(Z = z \mid X = x; \tau) = \frac{\exp(f_z(x)/\tau)}{\sum_{k \in \mathcal{K}} \exp(f_k(x)/\tau)}    (11)

It should be noted that as τ → 0, the distribution Q reduces to the original deterministic distribution P in (10), and in the limit as τ → ∞, Q is equivalent to the uniform distribution Unif(K).

3.3 SOFTMAX Q-DISTRIBUTION

We are now ready to propose the softmax Q-distribution, which is central in revealing the relationship between RAML and the Bayes decision rule. We first define the random variable Z = h^*(X) = argmax_{y ∈ Y} E_{P(Y|X)}[r(y, Y)]. Then, Z is deterministic given X, and according to (11), we define the softmax Q-distribution to approximate the conditional distribution of Z given X:

    Q(Z = z \mid X = x; \tau) = \frac{\exp\left( \mathbb{E}_{P(Y \mid X = x)}[r(z, Y)]/\tau \right)}{\sum_{y \in \mathcal{Y}} \exp\left( \mathbb{E}_{P(Y \mid X = x)}[r(y, Y)]/\tau \right)}    (12)

for each x ∈ X and z ∈ Y.
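Eqs. (8), (9) and (12) can be checked on a toy example (the distribution and reward matrix below are our own illustration, not from the paper): for any τ > 0, the argmax of the softmax Q-distribution coincides with the Bayes decision rule h*(x), which may differ from the Bayes classifier h_c(x).

```python
import math

# Toy setup for a single input x: an output space of three candidates.
Y = [0, 1, 2]
P_y_given_x = [0.4, 0.35, 0.25]   # true conditional P(Y | X = x)
# Reward r(z, y): full credit on the diagonal, partial credit for confusing 1 and 2.
r = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.8],
     [0.0, 0.8, 1.0]]

def conditional_reward(z):
    """R(z | x) = E_{P(Y|X=x)}[r(z, Y)], as in Eq. (8)."""
    return sum(p * r[z][y] for y, p in zip(Y, P_y_given_x))

def softmax_q(tau):
    """Q(z | x; tau) from Eq. (12)."""
    logits = [conditional_reward(z) / tau for z in Y]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

bayes_rule = max(Y, key=conditional_reward)              # h*(x), Eq. (8)
bayes_classifier = max(Y, key=lambda y: P_y_given_x[y])  # h_c(x), Eq. (9)

# Decoding from Q recovers h*(x) at every temperature.
for tau in (0.05, 0.5, 2.0):
    q = softmax_q(tau)
    assert max(Y, key=lambda z: q[z]) == bayes_rule
print(bayes_classifier, bayes_rule)  # prints: 0 1
```

Here class 1 wins under the expected reward (0.55 vs. 0.4 for class 0) because of the partial credit it earns against class 2, so the Bayes decision rule and the Bayes classifier disagree.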
Importantly, one can verify that decoding from the softmax Q-distribution provides us with the Bayes decision rule:

    h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} Q(y \mid x; \tau) = \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbb{E}_{P(Y \mid X = x)}\left[ r(y, Y) \right] = h^*(x)    (13)

for any value of τ > 0.

3.4 SOFTMAX Q-DISTRIBUTION MAXIMUM LIKELIHOOD

Because making predictions according to the softmax Q-distribution is equivalent to the Bayes decision rule, we would like to construct a (parametric) statistical model P to directly model the softmax Q-distribution in (12), similarly to how ML models the true data distribution P. We call this framework softmax Q-distribution maximum likelihood (SQDML). The framework is model-agnostic, so any probabilistic model used in ML, such as conditional log-linear models and deep neural networks, can be directly applied to modeling the softmax Q-distribution.

Suppose that we use a parametric statistical model P = {P_θ : θ ∈ Θ} to model the softmax Q-distribution. In order to learn "optimal" parameters θ from training data D = {(x_i, y_i)}_{i=1}^n, an intuitive and well-motivated objective function is the KL divergence between the empirical conditional distribution of Q(·|X), denoted Q̃(·|X), and the model distribution P_θ(·|X):

    \hat{\theta}_{SQDML} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{Q}(X)}\left[ \mathrm{KL}\!\left( \tilde{Q}(\cdot \mid X) \,\|\, P_{\theta}(\cdot \mid X) \right) \right]    (14)

We can directly set Q̃(X) = P̃(X), which leaves the problem of defining the empirical conditional distribution Q̃(Z|X). Before defining Q̃(Z|X), we first note that if the defined empirical distribution Q̃(X, Z) asymptotically converges to the true Q-distribution Q(X, Z), then the learned model distribution P_{θ̂_SQDML}(·|X = x) converges to Q(·|X = x). Therefore, decoding from P_{θ̂_SQDML}(·|X = x) ideally achieves the Bayes decision rule h^*(x).
A straightforward way to define Q̃(Z|X = x) is to use the empirical distribution P̃(Y|X = x):

    \tilde{Q}(Z = z \mid X = x) = \frac{\exp\left( \mathbb{E}_{\tilde{P}(Y \mid X = x)}[r(z, Y)]/\tau \right)}{\sum_{y \in \mathcal{Y}} \exp\left( \mathbb{E}_{\tilde{P}(Y \mid X = x)}[r(y, Y)]/\tau \right)}    (15)

where P̃ is the empirical distribution of P defined in (3). Asymptotically as n → ∞, P̃ converges to P; thus, Q̃ asymptotically converges to Q. (In the following derivations we omit τ in Q(Z|X; τ) for simplicity when there is no ambiguity.)

Unfortunately, the empirical distribution Q̃ in (15) is not efficient to compute, since the expectation term is inside the exponential function (see Appendix D.2 for approximately learning θ̂_SQDML in practice). This leads us to seek an approximation of the softmax Q-distribution and its corresponding empirical distribution. Here we propose the following Q′ distribution to approximate the softmax Q-distribution Q defined in (12):

    Q'(Z = z \mid X = x; \tau) = \mathbb{E}_{P(Y \mid X = x)}\left[ \frac{\exp(r(z, Y)/\tau)}{\sum_{y \in \mathcal{Y}} \exp(r(y, Y)/\tau)} \right]    (16)

where we move the expectation term outside the exponential function. Then, the corresponding empirical distribution of Q′(X, Z) can be written in the following form:

    \tilde{Q}'(X = x, Z = z) = \frac{1}{n} \sum_{i=1}^{n} \left\{ \sum_{y \in \mathcal{Y}} \frac{\exp(r(z, y)/\tau)}{\sum_{y' \in \mathcal{Y}} \exp(r(y', y)/\tau)} \, \mathbb{I}(x_i = x, y_i = y) \right\}    (17)

Approximating Q̃(X, Z) with Q̃′(X, Z), and plugging (17) into the RHS of (14), we have:

    \hat{\theta}_{SQDML} \approx \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}_{\tilde{Q}'(X)}\left[ \mathrm{KL}\!\left( \tilde{Q}'(\cdot \mid X) \,\|\, P_{\theta}(\cdot \mid X) \right) \right] = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^{n} \left\{ -\sum_{y \in \mathcal{Y}} q(y \mid y_i; \tau) \log P_{\theta}(y \mid x_i) \right\} = \hat{\theta}_{RAML}    (18)

where q(y|y*; τ) is the exponentiated payoff distribution of RAML in (5). Equation (18) states that RAML is an approximation of our proposed SQDML, obtained by approximating Q̃ with Q̃′. Interestingly, and most commonly in practice, when each input is unique in the training data, i.e., ¬∃ (x_1, y_1), (x_2, y_2) ∈ D s.t.
x_1 = x_2 ∧ y_1 ≠ y_2, we have Q̃ = Q̃′, resulting in θ̂_SQDML = θ̂_RAML. That is, the estimated distributions P_{θ̂_SQDML} and P_{θ̂_RAML} are exactly the same when the input x is unique in the training data, since the empirical distributions Q̃ and Q̃′ estimated from the training data are the same.

3.5 ANALYSIS AND DISCUSSION OF SQDML

In §3.4, we provided a theoretical interpretation of RAML by establishing the relationship between RAML and SQDML. In this section, we answer the questions about RAML raised in §2.3 using this interpretation, and further analyze the level of approximation from the softmax Q-distribution Q in (12) to Q′ in (16) by proving an upper bound on the approximation error.

Let us first use our interpretation to answer the three questions regarding RAML in §2.3. First, instead of optimizing the KL divergence between the artificially designed exponentiated payoff distribution and the model distribution, RAML in our formulation approximately matches the model distribution P_θ(·|X = x) with the softmax Q-distribution Q(·|X = x; τ). Second, based on our interpretation, asymptotically as n → ∞, RAML learns a distribution that converges to Q′ in (16), and therefore approximately converges to the softmax Q-distribution. Third, as mentioned in §3.3, generating from the softmax Q-distribution produces the Bayes decision rule, which theoretically outperforms the prediction function from ML w.r.t. the expected reward.

It is worth mentioning that both RAML and SQDML learn distributions, decoding from which (approximately) delivers the Bayes decision rule. There are other directions that can also achieve the Bayes decision rule, such as minimum Bayes risk decoding (Kumar & Byrne, 2004), which attempts to estimate the Bayes decision rule directly by computing expectations w.r.t. the data distribution learned from the training data.
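The claim that Q̃ = Q̃′ under unique inputs can be checked numerically. A minimal sketch (the toy output space and 0/1 reward are ours): Eq. (15) puts the expectation over references inside the exponential, while Eq. (17) averages per-reference payoff distributions; the two coincide for a single reference and differ once an input repeats with different references.

```python
import math

def softmax(scores, tau):
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def q_tilde(refs, Y, r, tau):
    """Eq. (15): average reward over references taken INSIDE the exponential."""
    avg = [sum(r(z, y) for y in refs) / len(refs) for z in Y]
    return softmax(avg, tau)

def q_tilde_prime(refs, Y, r, tau):
    """Eq. (17): average of per-reference payoff distributions (the RAML target)."""
    dists = [softmax([r(z, y) for z in Y], tau) for y in refs]
    return [sum(d[i] for d in dists) / len(dists) for i in range(len(Y))]

Y = [0, 1, 2]
r = lambda z, y: 1.0 if z == y else 0.0   # 0/1 reward for illustration
tau = 0.5

# Duplicated input x with two different references: the distributions differ.
print(q_tilde([0, 1], Y, r, tau))
print(q_tilde_prime([0, 1], Y, r, tau))

# Unique input (a single reference): the two definitions coincide exactly.
assert q_tilde([0], Y, r, tau) == q_tilde_prime([0], Y, r, tau)
```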
So far our discussion has concentrated on the theoretical interpretation and analysis of RAML, without any concern for how well Q′(X, Z) approximates Q(X, Z). Now, we characterize the approximation error by proving an upper bound on the KL divergence between them:

Theorem 1. Let X ∈ X and Y ∈ Y be the input and output random variables with data distribution P(X, Y). Suppose that the reward function is bounded, 0 ≤ r(y, y*) ≤ R. Let Q(Z|X; τ) and Q′(Z|X; τ) be the softmax Q-distribution and its approximation defined in (12) and (16), and assume that Q(X) = Q′(X) = P(X). Then,

    \mathrm{KL}\!\left( Q(\cdot, \cdot) \,\|\, Q'(\cdot, \cdot) \right) \leq 2R/\tau    (19)

From Theorem 1 (proof in Appendix A.1) we observe that the level of approximation mainly depends on two factors: the upper bound R of the reward function and the temperature parameter τ. In practice, R is often less than or equal to 1 when metrics like accuracy or BLEU are applied. It should be noted that, at one extreme, as τ grows the approximation error tends to zero. At the same time, however, the softmax Q-distribution becomes closer to the uniform distribution Unif(Y), providing less information for prediction. Thus, in practice, it is necessary to consider the trade-off between approximation error and predictive power.

What about the other extreme, τ "as close to zero as possible"? With suitable assumptions about the data distribution P, we can characterize the approximation error using the same KL divergence:

Theorem 2. Suppose that the reward function is bounded, 0 ≤ r(y, y*) ≤ R, and that ∀ y′ ≠ y, r(y, y) − r(y′, y) ≥ γR, where γ ∈ (0, 1) is a constant. Suppose additionally that, like a sub-Gaussian, for every x ∈ X, P(Y|X = x) satisfies an exponential tail bound w.r.t.
r; that is, for each x ∈ X there exists a unique y* ∈ Y such that for every t ∈ [0, 1),

    P\left( r(y^*, y^*) - r(Y, y^*) \geq tR \mid X = x \right) \leq e^{-c\, t^2/(1-t)^2}    (20)

where c is a distribution-dependent constant. Assume that Q(X) = Q′(X) = P(X), and denote b = γ²/(1−γ)². Then, as τ → 0,

    \mathrm{KL}\!\left( Q(\cdot, \cdot) \,\|\, Q'(\cdot, \cdot) \right) \leq \frac{1}{1 + e^{cb}}    (21)

Theorem 2 (proof in Appendix A.2) indicates that RAML also achieves a small approximation error when τ is close to zero.

4 EXPERIMENTS

In this section, we perform two sets of experiments to verify our theoretical analysis of the relation between SQDML and RAML. As discussed in §3.4, RAML and SQDML deliver the same predictions when the input x is unique in the data. Thus, in order to compare SQDML against RAML, the first set of experiments is designed on two data sets in which x is not unique: synthetic data for cost-sensitive multi-class classification, and the MSCOCO benchmark dataset (Chen et al., 2015) for image captioning. To further confirm the advantages of RAML (and SQDML) over ML, and thus the necessity for better theoretical understanding, we perform the second set of experiments on three structured prediction tasks in NLP. In these cases SQDML reduces to RAML, as the input is unique in these three data sets.

4.1 EXPERIMENTS ON SQDML

4.1.1 COST-SENSITIVE MULTI-CLASS CLASSIFICATION

First, we perform experiments on synthetic data for cost-sensitive multi-class classification, designed to demonstrate that RAML learns a distribution approximately producing the Bayes decision rule, which is asymptotically the prediction function delivered by SQDML. The synthetic data set is for a 4-class classification task, where x ∈ X = [−1, +1] × [−1, +1] ⊂ R² and y ∈ Y = {0, 1, 2, 3}.
We define four base points, one for each class:

    (x_0, x_1, x_2, x_3) = \big( (+1, +1),\ (+1, -1),\ (-1, -1),\ (-1, +1) \big)

For data generation, the distribution P(X) is the uniform distribution on X, and the log of the conditional distribution P(Y|X = x) for each x ∈ X is proportional to the negative distance to each base point:

    \log P(Y = y \mid X = x) \propto -d(x, x_y), \quad \text{for } y \in \{0, 1, 2, 3\}    (22)

where d(·, ·) is the Euclidean distance between two points. To generate training data, we first draw 1 million inputs x from P(X). Then, we independently generate 10 outputs y from P(Y|X = x) for each x to build a data set with multiple references; thus, the total number of training instances is 10 million. For validation and test data, we independently generate 0.1 million pairs (x, y) from P(X, Y), respectively.

[Figure 1: Average reward relative to the temperature parameter τ, ranging from 0.1 to 3.0, on (a) validation and (b) test sets.]

[Figure 2: Average reward relative to a wide range of τ (from 1.0 to 10,000) on (a) validation and (b) test sets.]

The model we use is a feed-forward (dense) neural network with 2 hidden layers, each of which has 8 units. Optimization is performed with mini-batch stochastic gradient descent (SGD) with learning rate 0.1 and momentum 0.9. Each model is trained for 100 epochs, and we apply early stopping (Caruana et al., 2001) based on performance on the validation set. The reward function r(·, ·) is designed to distinguish the four classes. For "correct" predictions, the specific reward values assigned to the four classes are:

    \big( r(0, 0),\ r(1, 1),\ r(2, 2),\ r(3, 3) \big) = \big( e^{2.0},\ e^{1.6},\ e^{1.2},\ e^{1.1} \big)

For "wrong" predictions, rewards are always zero, i.e., r(y, y*) = 0 when y ≠ y*.
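The data-generating process above can be sketched as follows (a scaled-down reproduction under our reading of Eq. (22); function and variable names are ours):

```python
import math
import random

random.seed(0)
BASE = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # base points x_0 .. x_3

def sample_y(x):
    """Draw y from P(Y | X = x), with log P(y | x) proportional to -d(x, x_y)."""
    logits = [-math.dist(x, b) for b in BASE]
    m = max(logits)
    probs = [math.exp(l - m) for l in logits]
    s = sum(probs)
    probs = [p / s for p in probs]
    return random.choices(range(4), weights=probs)[0]

def sample_dataset(n_inputs, refs_per_input):
    """P(X) uniform on [-1,1]^2; multiple i.i.d. references per input."""
    data = []
    for _ in range(n_inputs):
        x = (random.uniform(-1, 1), random.uniform(-1, 1))
        data.append((x, [sample_y(x) for _ in range(refs_per_input)]))
    return data

data = sample_dataset(1000, 10)   # the paper uses 1M inputs x 10 references
print(len(data), len(data[0][1]))  # prints: 1000 10
```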
Figure 1 depicts the effect of varying the temperature parameter τ on model performance, ranging from 0.1 to 3.0 with step 0.1. For each fixed τ, we report the mean performance over 5 repetitions. Figure 1 shows the average rewards obtained as a function of τ on both validation and test datasets for ML, RAML and SQDML, respectively. From Figure 1 we can see that as τ increases, the performance gap between SQDML and RAML keeps decreasing, indicating that RAML achieves a better approximation to SQDML. This evidence verifies the statement in Theorem 1 that the approximation error between RAML and SQDML decreases as τ grows.

Table 1: Average reward (sentence-level BLEU) and corpus-level BLEU (standard evaluation metric) scores for the image captioning task with different τ.

    τ     | RAML Reward | RAML BLEU | SQDML Reward | SQDML BLEU
    0.80  | 10.77       | 27.02     | 10.82        | 27.08
    0.85  | 10.81       | 27.27     | 10.78        | 26.92
    0.90  | 10.88       | 27.62     | 10.91        | 27.54
    0.95  | 10.82       | 27.33     | 10.79        | 27.02
    1.00  | 10.84       | 27.26     | 10.82        | 27.03
    1.05  | 10.82       | 27.29     | 10.80        | 27.20
    1.10  | 10.74       | 26.89     | 10.78        | 26.98
    1.15  | 10.77       | 27.01     | 10.72        | 26.66

The results in Figure 1 raise a question: does larger τ necessarily yield better performance for RAML? To further illustrate the effect of τ on the model performance of RAML and SQDML, we perform experiments with a wide range of τ, from 1 to 10,000 with step 200, again repeating each experiment 5 times. The results are shown in Figure 2. We see that the model performance (average reward) does not keep growing with increasing τ. As discussed in §3.5, the softmax Q-distribution becomes closer to the uniform distribution as τ becomes larger, making it less expressive for prediction. Thus, when applying RAML in practice, the trade-off between approximation error and the predictive power of the model needs to be considered.
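The flattening effect of a large τ can be seen directly from the softmax form of (12). A minimal sketch (the conditional-reward values are made up for illustration):

```python
import math

def softmax(scores, tau):
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical conditional rewards R(z | x) for four candidate outputs.
rewards = [1.0, 0.6, 0.2, 0.0]

sharp = softmax(rewards, 0.05)    # small tau: mass concentrates on the argmax
flat = softmax(rewards, 1000.0)   # large tau: nearly uniform, little predictive power
print(round(sharp[0], 3), [round(p, 3) for p in flat])  # prints: 1.0 [0.25, 0.25, 0.25, 0.25]
```

Both extremes preserve the argmax (so the Bayes decision rule is unchanged), but the large-τ distribution carries almost no information for a finite-capacity model to fit.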
More details, results and analysis of the conducted experiments are provided in Appendix B.

4.1.2 IMAGE CAPTIONING WITH MULTIPLE REFERENCES

Second, to show that optimizing toward our proposed SQDML objective yields better predictions than RAML on real-world structured prediction tasks, we evaluate on the MSCOCO image captioning dataset. This dataset contains 123,000 images, each of which is paired with at least five manually annotated captions. We follow the offline evaluation setting of Karpathy & Li (2015), and reserve 5,000 images each for validation and testing. We implemented a simple neural image captioning model using a pre-trained VGGNet as the encoder and a Long Short-Term Memory (LSTM) network as the decoder. Details of the experimental setup are in Appendix C.

As in §4.1.1, for the sake of comparing SQDML with RAML to verify our theoretical analysis, we use the average reward as the performance measure, simply defining the reward as the pairwise sentence-level BLEU score between the model's prediction and each reference caption,² even though the standard benchmark metric commonly used in image captioning (e.g., corpus-level BLEU-4) is not defined as an average over the pairwise rewards between the prediction and the reference captions.

We use stochastic gradient descent to optimize the objectives for SQDML (14) and RAML (4). However, the denominators of the softmax Q-distribution Q̃(Z|X; τ) (15) for SQDML and the payoff distribution q(y|y*; τ) (5) for RAML contain summations over the intractable exponential hypothesis space Y. We therefore propose a simple heuristic approach that approximates the denominator by restricting the exponential space Y to a fixed set S of sampled targets, i.e., Y ≈ S.
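The restricted normalizer Y ≈ S can be sketched as follows (the overlap-based reward and the candidate set here are placeholders of our own, not the paper's exact implementation):

```python
import math

def payoff_over_sample_set(y_star, S, reward, tau):
    """q(y | y*; tau) from Eq. (5), with the normalizer Z(y*; tau) restricted to S."""
    scores = [reward(y, y_star) / tau for y in S]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {y: e / z for y, e in zip(S, exps)}

# Placeholder reward: unigram-overlap ratio between token sequences.
def unigram_overlap(y, y_star):
    ref = set(y_star)
    return sum(1 for t in y if t in ref) / max(len(y), 1)

y_star = ("a", "group", "of", "people")
S = [y_star,                                # the ground-truth reference is kept in S
     ("a", "group", "of", "dogs"),          # perturbed candidates (placeholders)
     ("some", "random", "words", "here")]

q = payoff_over_sample_set(y_star, S, unigram_overlap, tau=0.9)
assert abs(sum(q.values()) - 1.0) < 1e-9   # a proper distribution over S
assert q[y_star] == max(q.values())        # the reference gets the highest mass
```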
Approximating the intractable hypothesis space using sampling is not new in structured prediction, and has been shown effective in optimizing neural structured prediction models (Shen et al., 2016). Specifically, the sampled candidate set S is constructed by (i) including each ground-truth reference y* in S; and (ii) uniformly replacing an n-gram (n ∈ {1, 2, 3}) in one (randomly sampled) reference y* with a randomly sampled n-gram. We refer to this approach as n-gram replacement. We provide more details of the training procedure in Appendix C.

Table 1 lists the results. We evaluate on both the average reward and the benchmark metric (corpus-level BLEU-4). We also tested a vanilla ML baseline, which achieves 10.71 average reward and 26.91 corpus-level BLEU. Both SQDML and RAML outperform ML according to the two metrics. Interestingly, comparing SQDML with RAML, we did not observe a significant improvement in average reward. We hypothesize that this is because the reference captions for each image are largely different, making it highly non-trivial for the model to predict a "consensus" caption that agrees with multiple references. As an example, we randomly sampled 300 images from the validation set and computed the average sentence-level BLEU between two references, which is only 10.09. Nevertheless, through case studies we still found interesting examples demonstrating that SQDML is capable of generating predictions that match multiple candidates. Figure 3 gives two examples, in which SQDML's predictions match multiple references, registering the highest average reward. RAML gives sub-optimal predictions in terms of average reward, since it is an approximation of SQDML. Finally, since ML's objective solely maximizes the reward w.r.t. a single reference, it gives the lowest average reward while achieving a higher maximum reward.

² Note that this is different from standard multi-reference sentence-level BLEU, which counts n-gram matches w.r.t. all sentences and then uses these sufficient statistics to calculate a final score.

Example 1:
  Ref 1: A group of people standing around a wine cellar
  Ref 2: A couple of people are standing around a table with wine
  Ref 3: Men and women are gathered around the table
  Ref 4: People standing at a table with a lot of wine glasses and different flavors of wine
  Ref 5: A bunch of people at a table filled with wine glasses
  SQDML: A group of people standing around a table with wine glasses (Avg. Reward: 0.3040, Max Reward: 0.5900)
  RAML:  A group of people standing around a table (Avg. Reward: 0.2557, Max Reward: 0.7421)
  ML:    A group of people standing around a table (Avg. Reward: 0.2557, Max Reward: 0.7421)

Example 2:
  Ref 1: A one way sign that is on a pole
  Ref 2: A black and white picture of a traffic signal in a city
  Ref 3: A black and white image of some buildings and a street light
  Ref 4: Intersection with traffic signals in large metropolitan area
  Ref 5: Traffic lights in front of large buildings with a one way sign
  SQDML: A black and white photo of a street sign on a pole (Avg. Reward: 0.1424, Max Reward: 0.2620)
  RAML:  A black and white photo of a traffic light (Avg. Reward: 0.1409, Max Reward: 0.3093)
  ML:    A black and white photo of a street sign (Avg. Reward: 0.1253, Max Reward: 0.2643)

Figure 3: Test examples from the MSCOCO image captioning task.

4.2 EXPERIMENTS ON STRUCTURED PREDICTION

Norouzi et al. (2016) already evaluated the effectiveness of RAML on the sequence prediction tasks of speech recognition and machine translation using neural sequence-to-sequence models.
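A minimal sketch of the n-gram replacement heuristic from §4.1.2 (also reused for MT in §4.2.1). The vocabulary below, and the choice of drawing replacement tokens uniformly from it, are illustrative assumptions.

```python
import random

def ngram_replace(reference, vocab, ns=(1, 2, 3), rng=None):
    """Perturb a reference by replacing one randomly chosen n-gram
    with randomly sampled tokens (length-preserving)."""
    rng = rng or random.Random()
    tokens = list(reference)
    n = rng.choice([k for k in ns if k <= len(tokens)])
    start = rng.randrange(len(tokens) - n + 1)
    tokens[start:start + n] = [rng.choice(vocab) for _ in range(n)]
    return tokens

vocab = ["a", "the", "man", "dog", "table", "standing", "red"]
ref = ["a", "group", "of", "people", "standing"]
print(ngram_replace(ref, vocab, rng=random.Random(0)))
```

Each call yields one perturbed candidate; repeating it (and adding the reference itself) builds the sampled set S.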
In this section, we further confirm the empirical success of RAML (and SQDML) over ML: (i) We apply RAML to three structured prediction tasks in NLP, namely named entity recognition (NER), dependency parsing, and machine translation (MT), using both classical feature-based log-linear models (NER and parsing) and state-of-the-art attentional recurrent neural networks (MT). (ii) Different from Norouzi et al. (2016), where edit distance is uniformly used as a surrogate training reward and the learning objective in (4) is approximated through sampling, we use task-specific rewards defined on sequential (NER), tree-based (parsing), and complex irregular structures (MT). Specifically, instead of sampling, we apply efficient dynamic programming algorithms (NER and parsing) to directly compute the analytical solution of (4). (iii) We present further analysis comparing RAML with ML, showing that due to their different learning objectives, RAML registers better results under task-specific metrics, while ML yields better exact-match accuracy.

4.2.1 SETUP

In this section we describe the experimental setups for the three evaluation tasks. We refer readers to Appendix D for dataset statistics, modeling details, and training procedures.

Method        Dev Acc   Dev F1   Test Acc   Test F1
ML Baseline     98.2     90.4      97.0      84.9
τ = 0.1         98.3     90.5      97.0      85.0
τ = 0.2         98.4     91.2      97.3      86.0
τ = 0.3         98.3     90.2      97.1      84.7
τ = 0.4         98.3     89.6      97.1      84.0
τ = 0.5         98.3     89.4      97.1      83.3
τ = 0.6         98.3     88.9      97.0      82.8
τ = 0.7         98.3     88.6      97.0      82.2
τ = 0.8         98.2     88.5      96.9      81.9
τ = 0.9         98.2     88.5      97.0      82.1

Table 2: Token accuracy and official F1 for NER.

Method        Dev UAS   Test UAS
ML Baseline     91.3      90.7
τ = 0.1         91.0      90.6
τ = 0.2         91.5      91.0
τ = 0.3         91.7      91.1
τ = 0.4         91.4      90.8
τ = 0.5         91.2      90.7
τ = 0.6         91.0      90.6
τ = 0.7         90.8      90.4
τ = 0.8         90.8      90.3
τ = 0.9         90.7      90.1

Table 3: UAS scores for dependency parsing.

Named Entity Recognition (NER) For NER, we experimented on the English data from the CoNLL 2003 shared task (Tjong Kim Sang & De Meulder, 2003). There are four predefined types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC. The dataset includes 15K training sentences, 3.4K for validation, and 3.7K for testing. We built a linear CRF model (Lafferty et al., 2001) with the same features used in Finkel et al. (2005). Instead of using the official F1 score over complete span predictions, we use token-level accuracy as the training reward, since this metric factorizes over words, and hence there exists an efficient dynamic programming algorithm to compute the expected log-likelihood objective in (4).

Dependency Parsing For dependency parsing, we evaluate on the English Penn Treebank (PTB) (Marcus et al., 1993). We follow the standard splits of PTB, using sections 2-21 for training, section 22 for validation, and section 23 for testing. We adopt the Stanford Basic Dependencies (De Marneffe et al., 2006) using the Stanford parser v3.3.0.³ We applied the same data preprocessing procedure as in Dyer et al. (2015). We adopt an edge-factorized tree-structured log-linear model with the same features used in Ma & Zhao (2012). We use the unlabeled attachment score (UAS) as the training reward, which is also the official evaluation metric of parsing performance. As with NER, the expectation in (4) can be computed efficiently using dynamic programming, since UAS factorizes over edges.

Machine Translation (MT) We tested on the German-English machine translation task of the IWSLT 2014 evaluation campaign (Cettolo et al., 2014), a widely-used benchmark for evaluating optimization techniques for neural sequence-to-sequence models. The dataset contains 153K training sentence pairs.
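The dynamic programming used for NER and parsing above rests on the fact that a token-decomposable reward (such as token-level accuracy) makes the exponentiated payoff distribution factorize over positions. The brute-force check below, on a tiny illustrative label set (not the CRF implementation itself), confirms the per-position closed form.

```python
import math
from itertools import product

def payoff_full(y_star, labels, tau):
    """q(y | y*; tau) by brute-force enumeration over the product space,
    with reward = number of matching tokens (token-level accuracy)."""
    seqs = list(product(labels, repeat=len(y_star)))
    def r(y):
        return sum(a == b for a, b in zip(y, y_star))
    weights = {y: math.exp(r(y) / tau) for y in seqs}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}

def payoff_factored(y, y_star, labels, tau):
    """The same probability via the per-position closed form:
    q(y | y*) = prod_i exp(1[y_i = y_i*]/tau) / (e^{1/tau} + K - 1)."""
    denom = math.exp(1.0 / tau) + len(labels) - 1
    p = 1.0
    for a, b in zip(y, y_star):
        p *= math.exp((1.0 if a == b else 0.0) / tau) / denom
    return p

y_star = ("B", "I", "O")
labels = ("B", "I", "O")
full = payoff_full(y_star, labels, tau=0.7)
max_err = max(abs(full[y] - payoff_factored(y, y_star, labels, 0.7)) for y in full)
print(max_err)  # agreement up to floating-point precision
```

The same decomposition is what lets forward-backward (for sequences) and inside-outside (for trees) compute the RAML objective exactly.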
We follow previous work (Wiseman & Rush, 2016; Bahdanau et al., 2017; Li et al., 2017) and use an attentional neural encoder-decoder model with LSTM networks. The size of the LSTM hidden states is 256. As in §4.1.2, we use the sentence-level BLEU score as the training reward and approximate the learning objective using n-gram replacement (n ∈ {1, 2, 3, 4}). We evaluate using standard corpus-level BLEU.

³ http://nlp.stanford.edu/software/lex-parser.shtml

4.2.2 MAIN RESULTS

The results for NER and dependency parsing are shown in Table 2 and Table 3, respectively. We observe that the RAML model obtains the best results at τ = 0.2 for NER and τ = 0.3 for dependency parsing. Beyond τ = 0.4, RAML models become worse than the ML baseline on both tasks, showing that in practice the temperature τ needs to be selected carefully. In addition, the rewards we directly optimize during training (token-level accuracy for NER and UAS for dependency parsing) are more stable w.r.t. τ than the evaluation metrics (F1 for NER), illustrating that in practice it is important to choose a training reward that correlates well with the evaluation metric.

Table 4 summarizes the results for MT. We also compare our model with previous work on incorporating task-specific rewards (i.e., BLEU score) in optimizing neural sequence-to-sequence models (c.f. Table 5). Our approach, albeit simple, surprisingly outperforms previous work.

τ          S-B     C-B
τ = 0.1   28.67   27.42
τ = 0.2   29.44   28.38
τ = 0.3   29.59   28.40
τ = 0.4   29.80   28.77
τ = 0.5   29.55   28.45
τ = 0.6   29.37   28.49
τ = 0.7   29.52   28.59
τ = 0.8   29.54   28.63
τ = 0.9   29.48   28.58
τ = 1.0   29.34   28.40

Table 4: Sentence-level BLEU (S-B, training reward) and corpus-level BLEU (C-B, standard evaluation metric) scores for RAML with different τ.

Methods                  ML Baseline   Proposed Model
Ranzato et al. (2016)       20.10          21.81
Wiseman & Rush (2016)       24.03          26.36
Li et al. (2017)            27.90          28.30
Bahdanau et al. (2017)      27.56          28.53
This Work                   27.66          28.77

Table 5: Comparison of our proposed approach with previous work. All previous methods require pre-training with an ML baseline, while RAML learns from scratch.

          NER                  Parsing        MT
Metric    Acc.   F1    E.M.    UAS    E.M.    S-B     C-B     E.M.
ML        97.0   84.9  78.8    90.7   39.9    29.15   27.66   3.79
RAML      97.3   86.0  80.1    91.1   39.4    29.80   28.77   3.35

Table 6: Performance of ML and RAML under different metrics for the three tasks on the test sets. E.M. refers to exact-match accuracy.

Specifically, all previous methods require a pre-trained ML baseline to initialize the model, while RAML learns from scratch. This suggests that RAML is easier and more stable to optimize than existing RL-style approaches (e.g., Ranzato et al. (2016) and Bahdanau et al. (2017)), which require sampling from the moving model distribution and suffer from high variance. Finally, we note that RAML performs consistently better than ML (27.66) across most temperature settings.

4.2.3 FURTHER COMPARISON WITH MAXIMUM LIKELIHOOD

Table 6 shows the performance of ML and RAML under different metrics on the three tasks. We observe that RAML outperforms ML on both the directly optimized rewards (token-level accuracy for NER, UAS for dependency parsing, and sentence-level BLEU for MT) and the task-specific evaluation metrics (F1 for NER and corpus-level BLEU for MT). Interestingly, we find that ML obtains better results on two of the three tasks under exact-match accuracy, which is the reward that ML implicitly optimizes (as discussed in (9)). This is in line with our theoretical analysis, in that RAML and ML achieve better prediction functions w.r.t. the respective rewards they optimize.

5 CONCLUSION

In this work, we propose the framework of estimating the softmax Q-distribution from training data.
Based on our theoretical analysis, asymptotically, the prediction function learned by RAML approximately achieves the Bayes decision rule. Experiments on three structured prediction tasks demonstrate that RAML consistently outperforms ML baselines.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, San Diego, California, 2015.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In Proceedings of ICLR, Toulon, France, 2017.

Rich Caruana, Steve Lawrence, and Giles Lee. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Proceedings of NIPS, volume 13, pp. 402. MIT Press, 2001.

Mauro Cettolo, Jan Niehues, Sebastian Stuker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, pp. 2–11, 2014.

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pp. 449–454, 2006.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. Transition-based dependency parsing with stack long short-term memory. In Proceedings of ACL, pp. 334–343, Beijing, China, July 2015.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of ACL, pp.
363–370, Ann Arbor, Michigan, June 2005.

K. Gimpel. Discriminative Feature-Rich Modeling for Syntax-Based Machine Translation. PhD thesis, Carnegie Mellon University, 2012.

Kevin Gimpel and Noah A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Proceedings of NAACL, pp. 733–736, Los Angeles, California, June 2010.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In Proceedings of CVPR, pp. 3128–3137, Boston, MA, USA, June 2015.

Shankar Kumar and William Byrne. Minimum Bayes-risk decoding for statistical machine translation. Technical report, 2004.

John Lafferty, Andrew McCallum, Fernando Pereira, et al. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, volume 1, pp. 282–289, San Francisco, California, 2001.

Jiwei Li, Will Monroe, and Dan Jurafsky. Learning to decode for future success. CoRR, abs/1701.06549, 2017.

Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. CoRR, abs/1612.00370, 2016.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pp. 1412–1421, Lisbon, Portugal, 2015.

Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of ACL, pp. 1064–1074, Berlin, Germany, August 2016.

Xuezhe Ma and Hai Zhao. Probabilistic models for high-order projective dependency parsing. Technical report, arXiv:1502.04174, 2012.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Ryan McDonald, Koby Crammer, and Fernando Pereira.
Online large-margin training of dependency parsers. In Proceedings of ACL, pp. 91–98, Ann Arbor, Michigan, June 2005.

Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Improving policy gradient by exploring under-appreciated rewards. arXiv preprint arXiv:1611.09321, 2016.

Mohammad Norouzi, Samy Bengio, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans, et al. Reward augmented maximum likelihood for neural structured prediction. In Proceedings of NIPS, pp. 1723–1731, Barcelona, Spain, 2016.

Mark A. Paskin. Cubic-time parsing and learning algorithms for grammatical bigram models. Citeseer, 2001.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In Proceedings of ICLR, San Juan, Puerto Rico, 2016.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. Minimum risk training for neural machine translation. In Proceedings of ACL, pp. 1683–1692, Berlin, Germany, August 2016.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR, San Diego, California, 2015.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pp. 3104–3112, Montreal, Canada, 2014.

Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. Advances in Neural Information Processing Systems, 16:25, 2004.

Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003 - Volume 4, pp. 142–147, Edmonton, Canada, 2003.

Maksims N. Volkovs, Hugo Larochelle, and Richard S. Zemel. Loss-sensitive training of probabilistic conditional random fields. arXiv preprint arXiv:1107.1805, 2011.

Hanna M. Wallach.
Conditional random fields: An introduction. 2004.

Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2013.

Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of EMNLP, pp. 1296–1306, Austin, Texas, 2016.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Zhilin Yang, Ye Yuan, Yuexin Wu, William W. Cohen, and Ruslan R. Salakhutdinov. Review networks for caption generation. In Proceedings of NIPS, pp. 2361–2369, 2016.

APPENDIX: SOFTMAX Q-DISTRIBUTION ESTIMATION FOR STRUCTURED PREDICTION: A THEORETICAL INTERPRETATION FOR RAML

A SOFTMAX Q-DISTRIBUTION MAXIMUM LIKELIHOOD

A.1 PROOF OF THEOREM 1

Proof.
Since the reward function is bounded, $0 \le r(\mathbf{y}, \mathbf{y}^*) \le R$ for all $\mathbf{y}, \mathbf{y}^* \in \mathcal{Y}$, we have
$$1 \le \exp(r(\mathbf{y}, \mathbf{y}^*)/\tau) \le e^{R/\tau}.$$
Then,
$$\frac{1}{|\mathcal{Y}| e^{R/\tau}} < \frac{1}{1 + (|\mathcal{Y}|-1) e^{R/\tau}} \le \frac{\exp(r(\mathbf{y}, \mathbf{y}^*)/\tau)}{\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(r(\mathbf{y}', \mathbf{y}^*)/\tau)} \le \frac{e^{R/\tau}}{|\mathcal{Y}| - 1 + e^{R/\tau}} < \frac{e^{R/\tau}}{|\mathcal{Y}|} \quad (1)$$
Now we can bound the conditional distributions $Q(\mathbf{z}|\mathbf{x})$ and $Q'(\mathbf{z}|\mathbf{x})$:
$$\frac{1}{|\mathcal{Y}| e^{R/\tau}} < Q(Z = \mathbf{z} \mid X = \mathbf{x}; \tau) = \frac{\exp\left(\mathrm{E}_P[r(\mathbf{z}, Y) \mid X = \mathbf{x}]/\tau\right)}{\sum_{\mathbf{y} \in \mathcal{Y}} \exp\left(\mathrm{E}_P[r(\mathbf{y}, Y) \mid X = \mathbf{x}]/\tau\right)} < \frac{e^{R/\tau}}{|\mathcal{Y}|} \quad (2)$$
and
$$\frac{1}{|\mathcal{Y}| e^{R/\tau}} < Q'(Z = \mathbf{z} \mid X = \mathbf{x}; \tau) = \mathrm{E}_P\!\left[ \frac{\exp(r(\mathbf{z}, Y)/\tau)}{\sum_{\mathbf{y} \in \mathcal{Y}} \exp(r(\mathbf{y}, Y)/\tau)} \,\middle|\, X = \mathbf{x} \right] < \frac{e^{R/\tau}}{|\mathcal{Y}|} \quad (3)$$
Thus, for all $\mathbf{x} \in \mathcal{X}$ and $\mathbf{z} \in \mathcal{Y}$,
$$\log \frac{Q(\mathbf{z}|\mathbf{x})}{Q'(\mathbf{z}|\mathbf{x})} < \frac{2R}{\tau}.$$
To sum up, since the marginals of $Q$ and $Q'$ over $\mathbf{x}$ coincide, we have
$$\mathrm{KL}(Q(\cdot,\cdot) \,\|\, Q'(\cdot,\cdot)) = \sum_{\mathbf{x} \in \mathcal{X}} Q(\mathbf{x}) \sum_{\mathbf{z} \in \mathcal{Y}} Q(\mathbf{z}|\mathbf{x}) \log \frac{Q(\mathbf{z}|\mathbf{x})\, Q(\mathbf{x})}{Q'(\mathbf{z}|\mathbf{x})\, Q'(\mathbf{x})} = \sum_{\mathbf{x}} Q(\mathbf{x}) \sum_{\mathbf{z}} Q(\mathbf{z}|\mathbf{x}) \log \frac{Q(\mathbf{z}|\mathbf{x})}{Q'(\mathbf{z}|\mathbf{x})} < \frac{2R}{\tau}.$$

A.2 PROOF OF THEOREM 2

Lemma 3. For every $\mathbf{x} \in \mathcal{X}$, $P(Y \ne \mathbf{y}^* \mid X = \mathbf{x}) \le e^{-cb}$, where $b \triangleq \gamma^2(1-\gamma)^2$.

Proof. From the assumption in Theorem 2 (Eq. (20) in §3.5), we have
$$P(Y \ne \mathbf{y}^* \mid X = \mathbf{x}) = P\left(r(\mathbf{y}^*, \mathbf{y}^*) - r(Y, \mathbf{y}^*) \ge \gamma R\right) \le e^{-c\gamma^2(1-\gamma)^2} = e^{-cb}.$$

Lemma 4.
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le q(\mathbf{y}|\mathbf{y}^*) \le \frac{1}{e^{\gamma R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{y} \ne \mathbf{y}^*$$
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{-\gamma R/\tau}} \le q(\mathbf{y}|\mathbf{y}^*) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{y} = \mathbf{y}^*$$

Proof. From Eq. (1), we have
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le q(\mathbf{y}|\mathbf{y}^*) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}.$$
If $\mathbf{y} \ne \mathbf{y}^*$,
$$q(\mathbf{y}|\mathbf{y}^*) = \frac{\exp(r(\mathbf{y},\mathbf{y}^*)/\tau)}{\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(r(\mathbf{y}',\mathbf{y}^*)/\tau)} \le \frac{e^{r(\mathbf{y},\mathbf{y}^*)/\tau}}{e^{r(\mathbf{y}^*,\mathbf{y}^*)/\tau} + (|\mathcal{Y}|-1)} \le \frac{1}{e^{\gamma R/\tau} + (|\mathcal{Y}|-1)e^{-R/\tau}}.$$
If $\mathbf{y} = \mathbf{y}^*$,
$$q(\mathbf{y}^*|\mathbf{y}^*) = \frac{1}{1 + \sum_{\mathbf{y}' \ne \mathbf{y}^*} e^{\left(r(\mathbf{y}',\mathbf{y}^*) - r(\mathbf{y}^*,\mathbf{y}^*)\right)/\tau}} \ge \frac{1}{1+(|\mathcal{Y}|-1)e^{-\gamma R/\tau}}.$$

Lemma 5.
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le Q'(\mathbf{z}|\mathbf{x}) \le \frac{1}{e^{\gamma R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}} + \frac{e^{-cb}}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{z} \ne \mathbf{y}^*$$
$$\frac{1-e^{-cb}}{1+(|\mathcal{Y}|-1)e^{-\gamma R/\tau}} \le Q'(\mathbf{z}|\mathbf{x}) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{z} = \mathbf{y}^*$$

Proof.
$$Q'(\mathbf{z}|\mathbf{x}) = \mathrm{E}_P[q(\mathbf{z}|Y)] = \sum_{\mathbf{y} \in \mathcal{Y}} P(\mathbf{y}|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}).$$
From Eq. (1) we have
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le Q'(\mathbf{z}|\mathbf{x}) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}.$$
If $\mathbf{z} \ne \mathbf{y}^*$,
$$Q'(\mathbf{z}|\mathbf{x}) = P(\mathbf{y}^*|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}^*) + \sum_{\mathbf{y} \ne \mathbf{y}^*} P(\mathbf{y}|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}) \le q(\mathbf{z}|\mathbf{y}^*) + \frac{P(Y \ne \mathbf{y}^* \mid X = \mathbf{x})}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}.$$
From Lemma 3 and Lemma 4,
$$Q'(\mathbf{z}|\mathbf{x}) \le \frac{1}{e^{\gamma R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}} + \frac{e^{-cb}}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}.$$
If $\mathbf{z} = \mathbf{y}^*$,
$$Q'(\mathbf{z}|\mathbf{x}) = P(\mathbf{y}^*|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}^*) + \sum_{\mathbf{y} \ne \mathbf{y}^*} P(\mathbf{y}|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}) \ge P(\mathbf{y}^*|\mathbf{x})\, q(\mathbf{z}|\mathbf{y}^*).$$
From Lemma 3 and Lemma 4,
$$Q'(\mathbf{z}|\mathbf{x}) \ge \frac{1-e^{-cb}}{1+(|\mathcal{Y}|-1)e^{-\gamma R/\tau}}.$$

Lemma 6.
$$0 \le \mathrm{E}[r(\mathbf{z},Y)/\tau] \le P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{z},\mathbf{y}^*)/\tau + e^{-cb} R/\tau, \quad \text{if } \mathbf{z} \ne \mathbf{y}^*$$
$$P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{y}^*,\mathbf{y}^*)/\tau \le \mathrm{E}[r(\mathbf{z},Y)/\tau] \le R/\tau, \quad \text{if } \mathbf{z} = \mathbf{y}^*$$

Proof.
$$\mathrm{E}[r(\mathbf{z},Y)/\tau] = \sum_{\mathbf{y} \in \mathcal{Y}} P(\mathbf{y}|\mathbf{x})\, r(\mathbf{z},\mathbf{y})/\tau.$$
Since $0 \le r(\mathbf{y},\mathbf{y}') \le R$ for every $\mathbf{y}, \mathbf{y}' \in \mathcal{Y}$, we have $0 \le \mathrm{E}[r(\mathbf{z},Y)/\tau] \le R/\tau$.
If $\mathbf{z} \ne \mathbf{y}^*$,
$$\mathrm{E}[r(\mathbf{z},Y)/\tau] = P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{z},\mathbf{y}^*)/\tau + \sum_{\mathbf{y} \ne \mathbf{y}^*} P(\mathbf{y}|\mathbf{x})\, r(\mathbf{z},\mathbf{y})/\tau \le P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{z},\mathbf{y}^*)/\tau + e^{-cb} R/\tau.$$
If $\mathbf{z} = \mathbf{y}^*$,
$$\mathrm{E}[r(\mathbf{z},Y)/\tau] = P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{z},\mathbf{y}^*)/\tau + \sum_{\mathbf{y} \ne \mathbf{y}^*} P(\mathbf{y}|\mathbf{x})\, r(\mathbf{z},\mathbf{y})/\tau \ge P(\mathbf{y}^*|\mathbf{x})\, r(\mathbf{y}^*,\mathbf{y}^*)/\tau.$$

Lemma 7.
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le Q(\mathbf{z}|\mathbf{x}) \le \frac{1}{e^{\alpha R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{z} \ne \mathbf{y}^*$$
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{-\alpha R/\tau}} \le Q(\mathbf{z}|\mathbf{x}) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}, \quad \text{if } \mathbf{z} = \mathbf{y}^*$$
where $\alpha = \gamma - (1+\gamma)e^{-cb} < \gamma$.

Proof.
$$Q(\mathbf{z}|\mathbf{x}) = \frac{e^{\mathrm{E}[r(\mathbf{z},Y)/\tau]}}{\sum_{\mathbf{y}' \in \mathcal{Y}} e^{\mathrm{E}[r(\mathbf{y}',Y)/\tau]}}.$$
From Lemma 6,
$$\frac{1}{1+(|\mathcal{Y}|-1)e^{R/\tau}} \le Q(\mathbf{z}|\mathbf{x}) \le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}}.$$
If $\mathbf{z} \ne \mathbf{y}^*$,
$$\begin{aligned}
Q(\mathbf{z}|\mathbf{x}) &\le \frac{e^{\mathrm{E}[r(\mathbf{z},Y)/\tau]}}{e^{\mathrm{E}[r(\mathbf{y}^*,Y)/\tau]} + |\mathcal{Y}|-1} \\
&\le \left( e^{\mathrm{E}[r(\mathbf{y}^*,Y)/\tau] - \mathrm{E}[r(\mathbf{z},Y)/\tau]} + (|\mathcal{Y}|-1)e^{-R/\tau} \right)^{-1} \\
&\le \left( e^{P(\mathbf{y}^*|\mathbf{x})\left(r(\mathbf{y}^*,\mathbf{y}^*)/\tau - r(\mathbf{z},\mathbf{y}^*)/\tau\right) - e^{-cb}R/\tau} + (|\mathcal{Y}|-1)e^{-R/\tau} \right)^{-1} \\
&\le \left( e^{(1-e^{-cb})\gamma R/\tau - e^{-cb}R/\tau} + (|\mathcal{Y}|-1)e^{-R/\tau} \right)^{-1} = \frac{1}{e^{\alpha R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}}.
\end{aligned}$$
If $\mathbf{z} = \mathbf{y}^*$,
$$Q(\mathbf{z}|\mathbf{x}) = \left(1 + \sum_{\mathbf{y} \ne \mathbf{y}^*} e^{\mathrm{E}[r(\mathbf{y},Y)/\tau] - \mathrm{E}[r(\mathbf{y}^*,Y)/\tau]}\right)^{-1} \ge \frac{1}{1+(|\mathcal{Y}|-1)e^{-\alpha R/\tau}}.$$

Now, we can prove Theorem 2 with the above lemmas.

Proof of Theorem 2.
$$\begin{aligned}
\mathrm{KL}\left(Q(\cdot \mid X=\mathbf{x}) \,\|\, Q'(\cdot \mid X=\mathbf{x})\right) &= \sum_{\mathbf{y} \in \mathcal{Y}} Q(\mathbf{y}|\mathbf{x}) \log \frac{Q(\mathbf{y}|\mathbf{x})}{Q'(\mathbf{y}|\mathbf{x})} \\
&= Q(\mathbf{y}^*|\mathbf{x}) \log \frac{Q(\mathbf{y}^*|\mathbf{x})}{Q'(\mathbf{y}^*|\mathbf{x})} + \sum_{\mathbf{y} \ne \mathbf{y}^*} Q(\mathbf{y}|\mathbf{x}) \log \frac{Q(\mathbf{y}|\mathbf{x})}{Q'(\mathbf{y}|\mathbf{x})} \\
&\le \frac{1}{1+(|\mathcal{Y}|-1)e^{-R/\tau}} \log\!\left( \frac{1}{1-e^{-cb}} \cdot \frac{1+(|\mathcal{Y}|-1)e^{-\gamma R/\tau}}{1+(|\mathcal{Y}|-1)e^{-R/\tau}} \right) \\
&\quad + \frac{|\mathcal{Y}|-1}{e^{\alpha R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}} \log \frac{1+(|\mathcal{Y}|-1)e^{R/\tau}}{e^{\alpha R/\tau}+(|\mathcal{Y}|-1)e^{-R/\tau}}
\end{aligned}$$
Then,
$$\begin{aligned}
\lim_{\tau \to 0} \mathrm{KL}\left(Q(\cdot \mid X=\mathbf{x}) \,\|\, Q'(\cdot \mid X=\mathbf{x})\right) &\le \log \frac{1}{1-e^{-cb}} + \lim_{\tau \to 0} \frac{|\mathcal{Y}|-1}{e^{\alpha R/\tau}} \log\!\left((|\mathcal{Y}|-1)\, e^{(1-\alpha)R/\tau}\right) \\
&= \log \frac{1}{1-e^{-cb}} + \lim_{\tau \to 0} \frac{|\mathcal{Y}|-1}{e^{\alpha R/\tau}} \left( \log(|\mathcal{Y}|-1) + (1-\alpha)R/\tau \right) \\
&= \log \frac{1}{1-e^{-cb}} = \log\!\left(1 + \frac{e^{-cb}}{1-e^{-cb}}\right) \le \frac{e^{-cb}}{1-e^{-cb}} = \frac{1}{e^{cb}-1}.
\end{aligned}$$

Figure 4: Decision boundaries of different models, together with the Bayes decision rule: (a) Bayes decision rule; (b) ML; (c) RAML (τ = 0.5); (d) RAML (best, τ = 2.4); (e) RAML (τ = 10000); (f) SQDML (τ = 0.5); (g) SQDML (best, τ = 1.1); (h) SQDML (τ = 10000).
B COST-SENSITIVE MULTI-CLASS CLASSIFICATION

To better illustrate the properties of ML, RAML, and SQDML, we display the decision boundaries of the learned models in Figure 4. Figure 4a gives the boundary of the Bayes decision rule, and Figure 4b the boundary of ML. We can see that, as expected, ML gives an "unbiased" boundary, because it does not incorporate any information about the task-specific reward.

Figures 4c and 4f are the decision boundaries of RAML and SQDML with τ = 0.5. We can see that, even with small τ, SQDML is able to achieve a good decision boundary similar to that of the Bayes decision rule, while the boundary of RAML is similar to that of ML. This suggests that RAML, as an approximation of SQDML, may "degenerate" to ML due to approximation error. Figures 4d and 4g provide the boundaries of RAML and SQDML that achieve the best performance (τ = 2.4 for RAML and τ = 1.1 for SQDML). RAML is able to produce surprisingly good decisions with a proper τ, comparable with SQDML. Figures 4e and 4h are the decision boundaries of RAML and SQDML with a large τ = 10000. We can see that, consistent with our analysis, neither RAML nor SQDML learns a reasonable prediction function. The reason is, as discussed in §3.5, that as τ becomes larger, the softmax Q-distribution becomes closer to the uniform distribution, providing less information for prediction, even though the approximation error tends to zero.

C IMAGE CAPTIONING WITH MULTIPLE REFERENCES

Encoder Following (Yang et al., 2016), we adopt the widely-used CNN architecture VGGNet (Simonyan & Zisserman, 2015) as the image encoder. Specifically, we use the last fully connected layer fc7 as the image representation (4096-dimensional), which is further fed into a decoder to generate captions.
We use a pre-trained VGGNet model⁴ and keep it fixed during the training of the decoder.

Decoder We use an LSTM network as the decoder to predict a sequence of target words {y₁, y₂, ..., y_T}. Formally, the decoder uses its internal hidden state s_t at each time step to track the generation process, defined as s_t = f_LSTM(ȳ_{t-1}, s_{t-1}), where ȳ_{t-1} is the embedding of the previous word y_{t-1}. We initialize the memory cell of the decoder by passing the fixed-length image representation x through an affine transformation layer. The probability of the target word y_t is then given by p(y_t | y_{<t}, x). The resulting vocabulary size is 10,102. The dimensionality of word embeddings and LSTM hidden states is 256 and 512, respectively. For decoding, we use beam search with a beam size of 5. We use a batch size of 10 for the ML baseline and a larger size of 100 for SQDML and RAML for the sake of efficiency.⁵

⁴ Downloaded from https://github.com/kimiyoung/review_net
⁵ Since each image has around five reference captions, this ensures that the number of sampled candidates y for each training example is roughly the same for SQDML and RAML.

D EXPERIMENTS ON STRUCTURED PREDICTION

D.1 DATASET STATISTICS

We present statistics of the datasets we used in Table 7.

                   CoNLL2003    PTB        IWSLT2014
TRAIN   # Sent        14,987     39,832       153,326
        # Token      204,567    843,029     2,687,420 / 2,836,554
DEV     # Sent         3,466      1,700         6,969
        # Token       51,578     35,508       122,327 / 129,091
TEST    # Sent         3,684      2,416         6,750
        # Token       46,666     49,892       125,738 / 131,141

Table 7: Dataset statistics. # Sent and # Token refer to the number of sentences and tokens in each data set, respectively (for IWSLT, they refer to the number of sentence pairs and tokens of the source/target languages).

D.2 MODELS FOR STRUCTURED PREDICTION
D.2.1 LOG-LINEAR MODEL

A commonly used log-linear model defines a family of conditional probabilities $P_\theta(\mathbf{y}|\mathbf{x})$ over $\mathcal{Y}$ with the following form:
$$P_\theta(\mathbf{y}|\mathbf{x}) = \frac{\Phi(\mathbf{y},\mathbf{x};\theta)}{\sum_{\mathbf{y}' \in \mathcal{Y}} \Phi(\mathbf{y}',\mathbf{x};\theta)} = \frac{\exp(\theta^T \phi(\mathbf{y},\mathbf{x}))}{\sum_{\mathbf{y}' \in \mathcal{Y}} \exp(\theta^T \phi(\mathbf{y}',\mathbf{x}))} \quad (4)$$
where $\phi(\mathbf{y},\mathbf{x})$ are the feature functions, $\theta$ are the parameters of the model, and $\Phi(\mathbf{y},\mathbf{x};\theta)$ captures the dependency between the input and output variables. We define the partition function $Z(\mathbf{x};\theta) = \sum_{\mathbf{y}' \in \mathcal{Y}} \exp(\theta^T \phi(\mathbf{y}',\mathbf{x}))$. Then the conditional probability in (4) can be written as
$$P_\theta(\mathbf{y}|\mathbf{x}) = \frac{\exp(\theta^T \phi(\mathbf{y},\mathbf{x}))}{Z(\mathbf{x};\theta)}.$$
Now, the objective of RAML for one training instance $(\mathbf{x}, \mathbf{y})$ is
$$L(\theta) = -\sum_{\mathbf{y}' \in \mathcal{Y}} q(\mathbf{y}'|\mathbf{y};\tau) \log P_\theta(\mathbf{y}'|\mathbf{x}) = -\theta^T \left( \sum_{\mathbf{y}' \in \mathcal{Y}} q(\mathbf{y}'|\mathbf{y};\tau)\, \phi(\mathbf{y}',\mathbf{x}) \right) + \log Z(\mathbf{x};\theta) \quad (5)$$
and the gradient is
$$\frac{\partial L(\theta)}{\partial \theta} = -\sum_{\mathbf{y}' \in \mathcal{Y}} q(\mathbf{y}'|\mathbf{y};\tau)\, \phi(\mathbf{y}',\mathbf{x}) + \frac{\partial \log Z(\mathbf{x};\theta)}{\partial \theta} = \sum_{\mathbf{y}' \in \mathcal{Y}} \left( P_\theta(\mathbf{y}'|\mathbf{x}) - q(\mathbf{y}'|\mathbf{y};\tau) \right) \phi(\mathbf{y}',\mathbf{x}) \quad (6)$$
using $\partial \log Z(\mathbf{x};\theta)/\partial \theta = \sum_{\mathbf{y}' \in \mathcal{Y}} P_\theta(\mathbf{y}'|\mathbf{x})\, \phi(\mathbf{y}',\mathbf{x})$. To optimize $L(\theta)$, we need to efficiently compute the objective and its gradient. In the next two sections, we show that when the features $\phi(\mathbf{y},\mathbf{x})$ and the reward $r(\mathbf{y},\mathbf{y}^*)$ admit certain factorizations, efficient dynamic programming algorithms exist.

D.2.2 SEQUENCE CRF

In a sequence CRF, $\Phi$ usually factorizes as a product of potential functions defined on pairs of successive labels:
$$\Phi(\mathbf{y},\mathbf{x};\theta) = \prod_{i=1}^{L} \psi_i(y_{i-1}, y_i, \mathbf{x};\theta)$$
where $\psi_i(y_{i-1}, y_i, \mathbf{x};\theta) = \exp(\theta^T \phi_i(y_{i-1}, y_i, \mathbf{x}))$. When we use token-level label accuracy as the reward, the reward function factorizes as
$$r(\mathbf{y},\mathbf{y}^*) = \sum_{i=1}^{L} \mathbb{I}(y_i = y_i^*)$$
where $y_i$ is the label of the $i$-th token (word). Then the objective and gradient in (5) and (6) can be computed using the forward-backward algorithm (Wallach, 2004).
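For intuition, when Y is small enough to enumerate, (5) and (6) can be evaluated directly, with no dynamic programming needed. The feature map and reward below are illustrative placeholders, not the features used in the experiments.

```python
import math

def raml_objective_and_grad(theta, x, y_star, Y, phi, reward, tau):
    """RAML objective L(theta) and its gradient for a log-linear model,
    by explicit enumeration of Y (Eqs. (5)-(6)):
      grad = sum_y (P_theta(y|x) - q(y|y*;tau)) * phi(y, x)."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    feats = [phi(y, x) for y in Y]
    # Model distribution P_theta(y | x).
    logits = [dot(theta, f) for f in feats]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    zt = sum(exps)
    p = [e / zt for e in exps]
    # Exponentiated payoff distribution q(y | y*; tau).
    ws = [math.exp(reward(y, y_star) / tau) for y in Y]
    zq = sum(ws)
    q = [w / zq for w in ws]
    obj = -sum(qi * (li - m - math.log(zt)) for qi, li in zip(q, logits))
    grad = [sum((p[i] - q[i]) * feats[i][j] for i in range(len(Y)))
            for j in range(len(theta))]
    return obj, grad

# Toy instance: 3 labels, 2-dimensional features (illustrative placeholders).
Y = [0, 1, 2]
phi = lambda y, x: [1.0 if y == 0 else 0.0, float(y) * x]
reward = lambda y, y_star: 1.0 if y == y_star else 0.0
obj, grad = raml_objective_and_grad([0.2, -0.1], 1.0, 1, Y, phi, reward, tau=1.0)
print(obj, grad)
```

The gradient vanishes exactly when the model distribution matches the payoff distribution, which is the stationarity condition implied by (6).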
D.2.3 EDGE-FACTORIZED TREE-STRUCTURED MODEL

In dependency parsing, $\mathbf{y}$ represents a generic dependency tree, which consists of directed edges between heads and their dependents (modifiers). The edge-factorized model factorizes the potential function $\Phi$ over the set of edges:
$$\Phi(\mathbf{y},\mathbf{x};\theta) = \prod_{e \in \mathbf{y}} \psi_e(e, \mathbf{x};\theta)$$
where $e$ is an edge belonging to the tree $\mathbf{y}$ and $\psi_e(e, \mathbf{x};\theta) = \exp(\theta^T \phi_e(e, \mathbf{x}))$. The UAS reward factorizes as
$$r(\mathbf{y},\mathbf{y}^*) = \sum_{i=1}^{L} \mathbb{I}(y_i = y_i^*)$$
where $y_i$ is the head of the $i$-th word in the sentence $\mathbf{x}$. Then we have
$$\sum_{\mathbf{y} \in \mathcal{Y}} P_\theta(\mathbf{y}|\mathbf{x})\, \phi(\mathbf{y},\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{Y}} \sum_{e \in \mathbf{y}} P_\theta(\mathbf{y}|\mathbf{x})\, \phi_e(e,\mathbf{x}) = \sum_{e \in \mathcal{E}} \phi_e(e,\mathbf{x}) \left( \sum_{\mathbf{y} \in \mathcal{Y}(e)} P_\theta(\mathbf{y}|\mathbf{x}) \right) \quad (7)$$
where $\mathcal{E}$ is the set of all possible edges for the sentence $\mathbf{x}$ and $\mathcal{Y}(e) = \{\mathbf{y} \in \mathcal{Y} : e \in \mathbf{y}\}$. With a similar derivation, we have
$$\sum_{\mathbf{y}' \in \mathcal{Y}} q(\mathbf{y}'|\mathbf{y};\tau)\, \phi(\mathbf{y}',\mathbf{x}) = \sum_{e \in \mathcal{E}} \phi_e(e,\mathbf{x}) \left( \sum_{\mathbf{y}' \in \mathcal{Y}(e)} q(\mathbf{y}'|\mathbf{y}) \right) \quad (8)$$
Both (7) and (8) can be computed using the inside-outside algorithm (Paskin, 2001; Ma & Zhao, 2012).

D.2.4 ATTENTIONAL NEURAL MACHINE TRANSLATION MODEL

Model Overview We apply a neural encoder-decoder model with attention and input feeding (Luong et al., 2015). Given a source sentence $\mathbf{x}$ of $N$ words $\{x_i\}_{i=1}^{N}$, the conditional probability of the target sentence $\mathbf{y} = \{y_i\}_{i=1}^{T}$, $p(\mathbf{y}|\mathbf{x})$, is factorized as $p(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid \mathbf{y}$