Meta-Learning with Warped Gradient Descent



Published as a conference paper at ICLR 2020

Sebastian Flennerhag,¹²³ Andrei A. Rusu,³ Razvan Pascanu,³ Francesco Visin,³ Hujun Yin,¹² Raia Hadsell³
¹ The University of Manchester, ² The Alan Turing Institute, ³ DeepMind
{flennerhag,andreirusu,razp,visin,raia}@google.com, hujun.yin@manchester.ac.uk

ABSTRACT

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process, in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems.
We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.

1 INTRODUCTION

Learning (how) to learn implies inferring a learning strategy from some set of past experiences via a meta-learner that a task-learner can leverage when learning a new task. One approach is to directly parameterise an update rule via the memory of a recurrent neural network (Andrychowicz et al., 2016; Ravi & Larochelle, 2016; Li & Malik, 2016; Chen et al., 2017). Such memory-based methods can, in principle, represent any learning rule by virtue of being universal function approximators. They can also scale to long learning processes by using truncated backpropagation through time, but they lack an inductive bias as to what constitutes a reasonable learning rule. This renders them hard to train and brittle to generalisation, as their parameter updates have no guarantees of convergence.

An alternative family of approaches defines a gradient-based update rule and meta-learns a shared initialisation that facilitates task adaptation across a distribution of tasks (Finn et al., 2017; Nichol et al., 2018; Flennerhag et al., 2019). Such methods are imbued with a strong inductive bias (gradient descent), but restrict knowledge transfer to the initialisation. Recent work has shown that it is beneficial to more directly control gradient descent by meta-learning an approximation of a parameterised matrix (Li et al., 2017; Lee & Choi, 2018; Park & Oliva, 2019) that preconditions gradients during task training, similarly to second-order and Natural Gradient Descent methods (Nocedal & Wright, 2006; Amari & Nagaoka, 2007). To meta-learn preconditioning, these methods backpropagate through the gradient descent process, limiting them to few-shot learning.
1 Published as a conference paper at ICLR 2020 x f ( x ) h (1) ω (1) h (2) ω (2) θ θ (1) φ (1) θ (2) φ (2) ∇ f ( x ) L ( θ ; φ ) P (1) ∇ θ (1) L D ω (1) P (2) ∇ θ (2) L D ω (2) θ θ 0 θ 00 ∇ L P ∇ L min E θ [ J ( φ )] φ T ask Adaptation Meta-Learning T ask-learners Shared W arp Figure 1: Schematics of W arpGrad. W arpGrad preconditioning is embedded in task-learners f by in- terleaving warp-l ayers ( ω (1) , ω (2) ) between each task-learner’ s layers ( h (1) , h (2) ). W arpGrad achiev e preconditioning by modulating layer acti vations in the forward pass and gradients in the backward pass by backpropagating through warp-layers ( D ω ), which implicitly preconditions gradients by some matrix ( P ). W arp-parameters ( φ ) are meta-learned over the joint search space induced by task adaptation ( E θ [ J ( φ )] ) to form a geometry that facilitates task learning. In this paper , we propose a no vel frame work called W arped Gradient Descent (W arpGrad) 1 , that relies on the inducti ve bias of gradient-based meta-learners by defining an update rule that preconditions gradients, but that is meta-learned using insights from memory-based methods. In particular , we lev erage that gradient preconditioning is defined point-wise in parameter space and can be seen as a recurrent operator of order 1. W e use this insight to define a trajectory agnostic meta-objectiv e over a joint parameter search space where knowledge transfer is encoded in gradient preconditioning. T o achieve a scalable and flexible form of preconditioning, we take inspiration from works that embed preconditioning in task-learners ( Desjardins et al. , 2015 ; Lee & Choi , 2018 ), but we relax the assumption that task-learners are feed-forward and replace their linear projection with a generic neural network ω , referred to as a warp layer . By introducing non-linearity , preconditioning is rendered data-dependent. 
This allows WarpGrad to model preconditioning beyond the block-diagonal structure of prior works and enables it to meta-learn over arbitrary adaptation processes. We empirically validate WarpGrad and show it surpasses baseline gradient-based meta-learners on standard few-shot learning tasks (miniImageNet, tieredImageNet; Vinyals et al., 2016; Ravi & Larochelle, 2016; Ren et al., 2018), while scaling beyond few-shot learning to standard supervised settings on the "multi"-shot Omniglot benchmark (Flennerhag et al., 2019) and a multi-shot version of tieredImageNet. We further find that WarpGrad outperforms competing methods in a reinforcement learning (RL) setting where previous gradient-based meta-learners fail (maze navigation with recurrent neural networks; Miconi et al., 2019) and can be used to meta-learn an optimiser that prevents catastrophic forgetting in a continual learning setting. An open-source implementation is available at https://github.com/flennerhag/warpgrad.

2 WARPED GRADIENT DESCENT

2.1 GRADIENT-BASED META-LEARNING

WarpGrad belongs to the family of optimisation-based meta-learners that parameterise an update rule θ ← U(θ; ξ) with some meta-parameters ξ. Specifically, gradient-based meta-learners define an update rule by relying on gradient descent, U(θ; ξ) := θ − α∇L(θ), for some objective L and learning rate α. A task is defined by a training set D^τ_train and a test set D^τ_test, which define learning objectives L_{D^τ}(θ) := E_{(x,y)∼D^τ}[ℓ(f(x, θ), y)] over the task-learner f for some loss ℓ.

Figure 2: Gradient-based meta-learning. Colours denote different tasks (τ), dashed lines denote backpropagation through the adaptation process, and solid black lines denote optimiser parameter (φ) gradients w.r.t. one step of task parameter (θ) adaptation. Left: a meta-learned initialisation compresses trajectory information into a single initial point (θ_0). Middle: MAML-based optimisers interact with adaptation trajectories at every step and backpropagate through each interaction. Right: WarpGrad is trajectory agnostic. Task adaptation defines an empirical distribution p(τ, θ) over which WarpGrad learns a geometry for adaptation by optimising for steepest descent directions.

MAML (Finn et al., 2017) meta-learns a shared initialisation θ_0 by backpropagating through K steps of gradient descent across a given task distribution p(τ),

    C_MAML(ξ) := Σ_{τ∼p(τ)} L_{D^τ_test}( θ_0 − α Σ_{k=0}^{K−1} U_{D^τ_train}(θ^τ_k; ξ) ).    (1)

Subsequent works on gradient-based meta-learning primarily differ in the parameterisation of U. Meta-SGD (MSGD; Li et al., 2017) learns a vector of learning rates, Meta-Curvature (MC; Park & Oliva, 2019) defines a block-diagonal preconditioning matrix B, and T-Nets (Lee & Choi, 2018) embed block-diagonal preconditioning in feed-forward learners via linear projections:

    U(θ_k; θ_0)    := θ_k − α ∇L(θ_k)              MAML      (2)
    U(θ_k; θ_0, φ) := θ_k − α diag(φ) ∇L(θ_k)      MSGD      (3)
    U(θ_k; θ_0, φ) := θ_k − α B(θ_k; φ) ∇L(θ_k)    MC        (4)
    U(θ_k; θ_0, φ) := θ_k − α ∇L(θ_k; φ)           T-Nets.   (5)

These methods optimise for meta-parameters ξ = {θ_0, φ} by backpropagating through the gradient descent process (Eq. 1). This trajectory dependence limits them to few-shot learning, as they become (1) computationally expensive, (2) susceptible to exploding/vanishing gradients, and (3) susceptible to a credit assignment problem (Wu et al., 2018; Antoniou et al., 2019). Our goal is to develop a meta-learner that overcomes all three limitations.
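To make the contrast between these parameterisations concrete, the update rules of Eqs. 2-5 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; in particular, representing the feed-forward T-Nets case by an explicit block T^T T preconditioner is an assumption based on the special case discussed in Section 2.2.

```python
import numpy as np

def maml_update(theta, grad, alpha):
    # MAML (Eq. 2): plain gradient descent on task parameters.
    return theta - alpha * grad

def msgd_update(theta, grad, alpha, phi):
    # Meta-SGD (Eq. 3): meta-learned per-parameter learning rates, diag(phi).
    return theta - alpha * phi * grad

def mc_update(theta, grad, alpha, B):
    # Meta-Curvature (Eq. 4): meta-learned preconditioning matrix B.
    return theta - alpha * B @ grad

def tnet_update(theta, grad, alpha, T):
    # T-Nets (Eq. 5), feed-forward special case: a linear projection T in the
    # layer yields an implicit block preconditioner T^T T on the gradient.
    return theta - alpha * (T.T @ T) @ grad
```

With phi = 1, B = I, and T = I, all four rules reduce to the same plain gradient step; the methods depart from MAML only through what φ, B, or T are meta-learned to be.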
To do so, we depart from the paradigm of backpropagating to the initialisation and exploit the fact that learning to precondition gradients can be seen as a Markov Process of order 1 that depends on the state but not the trajectory (Li et al., 2017). To develop this notion, we first establish a general-purpose form of preconditioning (Section 2.2). Based on this, we obtain a canonical meta-objective from a geometrical point of view (Section 2.3), from which we derive a trajectory-agnostic meta-objective (Section 2.4).

2.2 GENERAL-PURPOSE PRECONDITIONING

A preconditioned gradient descent rule, U(θ; φ) := θ − α P(θ; φ) ∇L(θ), defines a geometry via P. To disentangle the expressive capacity of this geometry from the expressive capacity of the task-learner f, we take inspiration from T-Nets, which embed linear projections T in feed-forward layers, h = σ(TWx + b). This in itself is not sufficient to achieve disentanglement, since the parameterisation of T is directly linked to that of W, but it can be achieved under non-linear preconditioning. To this end, we relax the assumption that the task-learner is feed-forward and consider an arbitrary neural network, f = h^(L) ∘ ··· ∘ h^(1). We insert warp-layers that are universal function approximators parameterised by neural networks into the task-learner without restricting their form or how they

Figure 3: Left: synthetic experiment illustrating how WarpGrad warps gradients (see Appendix D for full details). Each task f ∼ p(f) defines a distinct loss surface (W, bottom row). Gradient descent (black) on these surfaces struggles to find a minimum. WarpGrad meta-learns a warp ω to produce better update directions (magenta; Section 2.4).
In doing so, WarpGrad learns a meta-geometry P in which standard gradient descent is well behaved (top row). Right: gradient descent in P is equivalent to first-order Riemannian descent in W under a meta-learned Riemann metric (Section 2.3).

interact with f. In the simplest case, we interleave warp-layers between layers of the task-learner to obtain f̂ = ω^(L) ∘ h^(L) ∘ ··· ∘ ω^(1) ∘ h^(1), but other forms of interaction can be beneficial (see Appendix A for practical guidelines). Backpropagation automatically induces gradient preconditioning, as in T-Nets, but in our case via the Jacobians of the warp-layers:

    ∂L/∂θ^(i) = E[ ∇ℓᵀ ( Π_{j=0}^{L−(i+1)} D_x ω^(L−j) D_x h^(L−j) ) D_x ω^(i) D_θ h^(i) ],    (6)

where D_x and D_θ denote the Jacobian with respect to input and parameters, respectively. In the special case where f is feed-forward and each ω a linear projection, we obtain an instance of WarpGrad that is akin to T-Nets, since preconditioning is given by D_x ω = T. Conversely, by making warp-layers non-linear, we can induce interdependence between warp-layers, allowing WarpGrad to model preconditioning beyond the block-diagonal structure imposed by prior works. Further, this enables a form of task-conditioning by making the Jacobians of warp-layers data-dependent. As we have made no assumptions on the form of the task-learner or warp-layers, WarpGrad methods can act on any neural network through any form of warping, including recurrence. We show that increasing the capacity of the meta-learner by defining warp-layers as Residual Networks improves performance on classification tasks (Section 4.1). We also introduce recurrent warp-layers for agents in a gradient-based meta-learner that is the first, to the best of our knowledge, to outperform memory-based meta-learners on a maze navigation task that requires memory (Section 4.3). Warp-layers imbue WarpGrad with three powerful properties.
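For intuition, consider the linear special case of Eq. 6: a single task layer h(x) = Wx followed by a linear warp ω(h) = Th. The sketch below is our own illustration, not the paper's code; the loss L = sum(·) is an arbitrary stand-in. It checks that backpropagating through the warp-layer preconditions the task gradient by the warp Jacobian D_x ω = T.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((4, 4))   # task parameters theta of layer h(x) = Wx
T = rng.standard_normal((4, 4))   # warp parameters phi of layer omega(h) = Th
x = rng.standard_normal(4)

h = W @ x                # task layer forward pass
out = T @ h              # warp layer forward pass: f_hat = omega(h(x))

# Backward pass for L = sum(out): the gradient reaching the task layer is
# preconditioned by the warp Jacobian, D_x omega = T (transposed in backprop).
grad_out = np.ones(4)              # dL/d(out)
grad_h = T.T @ grad_out            # chain rule through the warp layer
grad_W = np.outer(grad_h, x)       # dL/dW, implicitly warped by T

# Cross-check against direct differentiation of L = sum(T @ W @ x) w.r.t. W.
grad_W_direct = np.outer(T.T @ np.ones(4), x)
assert np.allclose(grad_W, grad_W_direct)
```

Making ω non-linear simply replaces the fixed matrix T by an input-dependent Jacobian, which is the source of the data-dependent preconditioning described above.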
First, due to preconditioned gradients, WarpGrad inherits gradient descent properties, importantly guarantees of convergence. Second, warp-layers form a distributed representation of preconditioning that disentangles the expressiveness of the geometry it encodes from the expressive capacity of the task-learner. Third, warp-layers are meta-learned across tasks and trajectories and can therefore capture properties of the task distribution beyond local information. Figure 3 illustrates these properties in a synthetic scenario, where we construct a family of tasks f : R² → R (see Appendix D for details) and meta-learn across the task distribution. WarpGrad learns to produce warped loss surfaces (illustrated on two tasks τ and τ′) that are smoother and better behaved than their respective native loss surfaces.

2.3 THE GEOMETRY OF WARPED GRADIENT DESCENT

If the preconditioning matrix P is invertible, it defines a valid Riemann metric (Amari, 1998) and therefore enjoys similar convergence guarantees to gradient descent. Thus, if warp-layers represent a valid (meta-learned) Riemann metric, WarpGrad is well-behaved. For T-Nets, it is sufficient to require T to be full rank, since T explicitly defines P as a block-diagonal matrix with block entries TᵀT. In contrast, non-linearity in warp-layers precludes such an explicit identification. Instead, we must consider the geometry that warp-layers represent. For this, we need a metric tensor, G, a positive-definite, smoothly varying matrix that measures curvature on a manifold W. The metric tensor defines the steepest direction of descent by −G⁻¹∇L (Lee, 2003), hence our goal is to establish that warp-layers approximate some G⁻¹.
Let Ω represent the effect of warp-layers by a reparameterisation h^(i)(x; Ω(θ; φ)^(i)) = ω^(i)(h^(i)(x; θ^(i)); φ) for all x, i, that maps from a space P onto the manifold W with γ = Ω(θ; φ). We induce a metric G on W by push-forward (Figure 2):

    ∆θ := ∇(L ∘ Ω)(θ; φ) = [D_x Ω(θ; φ)]ᵀ ∇L(γ)    P-space    (7)
    ∆γ := D_x Ω(θ; φ) ∆θ = G(γ; φ)⁻¹ ∇L(γ)         W-space,   (8)

where G⁻¹ := [D_x Ω][D_x Ω]ᵀ. Provided Ω is not degenerate (G is non-singular), G⁻¹ is positive-definite, hence a valid Riemann metric. While this is the metric induced on W by warp-layers, it is not the metric used to precondition gradients, since we take gradient steps in P, which introduces an error term (Figure 2). We can bound the error by a first-order Taylor series expansion to establish first-order equivalence between the WarpGrad update in P (Eq. 7) and the ideal update in W (Eq. 8):

    (L ∘ Ω)(θ − α∆θ) = L(γ − α∆γ) + O(α²).    (9)

Consequently, gradient descent under warp-layers (in P-space) is first-order equivalent to warping the native loss surface under a metric G to facilitate task adaptation. Warp parameters φ control the geometry induced by warping, and therefore what task-learners converge to. By meta-learning φ we can accumulate information that is conducive to task adaptation but that may not be available during that process. This suggests that an ideal geometry (in W-space) should yield preconditioning that points in the direction of steepest descent, accounting for global information across tasks:

    min_φ E_{L,γ∼p(L,γ)}[ L( γ − α G(γ; φ)⁻¹ ∇L(γ) ) ].    (10)

In contrast to MAML-based approaches (Eq. 1), this objective avoids backpropagation through learning processes. Instead, it defines task learning abstractly by introducing a joint distribution over objectives and parameterisations, opening up for general-purpose meta-learning at scale.

2.4 META-LEARNING WARP PARAMETERS

The canonical objective in Eq. 10 describes a meta-objective for learning a geometry from first principles, which we can render into a trajectory-agnostic update rule for warp-layers. To do so, we define a task τ = (h^τ, L^τ_meta, L^τ_task) by a task-learner f̂ that is embedded with a shared WarpGrad optimiser, a meta-training objective L^τ_meta, and a task adaptation objective L^τ_task. We use L^τ_task to adapt task parameters θ and L^τ_meta to adapt warp parameters φ. Note that we allow meta and task objectives to differ in arbitrary ways, but both are expectations over some data, as above. In the simplest case, they differ in terms of validation versus training data, but they may also differ in terms of learning paradigm, as we demonstrate in a continual learning experiment (Section 4.3).

To obtain our meta-objective, we recast the canonical objective (Eq. 10) in terms of θ using the first-order equivalence of gradient steps (Eq. 9). Next, we factorise p(τ, θ) into p(θ | τ) p(τ). Since p(τ) is given, it remains to consider a sampling strategy for p(θ | τ). For meta-learning of warp-layers, we assume this distribution is given. We later show how to incorporate meta-learning of a prior p(θ_0 | τ). While any sampling strategy is valid, in this paper we exploit the fact that task learning under stochastic gradient descent can be seen as sampling from an empirical prior p(θ | τ) (Grant et al., 2018); in particular, each iterate θ^τ_k can be seen as a sample from p(θ^τ_k | θ^τ_{k−1}, φ). Thus, K steps of gradient descent form a Monte-Carlo chain θ^τ_0, ..., θ^τ_K, and sampling such chains defines an empirical distribution p(θ | τ) around some prior p(θ_0 | τ), which we discuss in Section 2.5. The joint distribution p(τ, θ) defines a joint search space across tasks.
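The push-forward relationship between Eqs. 7 and 8 can be verified numerically for any warp Jacobian. In this sketch (our own, not from the paper; a random 3x3 matrix stands in for D_x Ω), pushing the P-space update through the warp recovers exactly the preconditioned update G⁻¹∇L(γ), and the induced G⁻¹ is positive-definite whenever the Jacobian is full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.standard_normal((3, 3))       # stand-in for D_x Omega, the warp Jacobian
grad_gamma = rng.standard_normal(3)   # stand-in for nabla L(gamma) on W

delta_theta = J.T @ grad_gamma        # Eq. 7: the update computed in P-space
delta_gamma = J @ delta_theta         # Eq. 8: its push-forward onto W

G_inv = J @ J.T                       # induced inverse metric [D_x Omega][D_x Omega]^T
assert np.allclose(delta_gamma, G_inv @ grad_gamma)

# A full-rank Jacobian gives a positive-definite (hence valid Riemann) metric:
assert np.all(np.linalg.eigvalsh(G_inv) > 0)
```

The error term of Eq. 9 does not appear here because the sketch compares single update directions, not loss values after a step; the O(α²) discrepancy only arises once the step is taken on the warped surface.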
Meta-learning therefore learns a geometry over this space with the steepest expected direction of descent. This direction, however, is not with respect to the objective that produced the gradient, L^τ_task, but with respect to L^τ_meta:

    L(φ) := Σ_{τ∼p(τ)} Σ_{θ^τ∼p(θ|τ)} L^τ_meta( θ^τ − α ∇L^τ_task(θ^τ; φ); φ ).    (11)

Decoupling the task gradient operator ∇L^τ_task from the geometry learned by L^τ_meta lets us infuse global knowledge in warp-layers, a promising avenue for future research (Metz et al., 2019; Mendonca et al., 2019). For example, in Section 4.3, we meta-learn an update rule that mitigates catastrophic forgetting by defining L^τ_meta over current and previous tasks. In contrast to other gradient-based meta-learners, the WarpGrad meta-objective is an expectation over gradient update steps sampled from the search space induced by task adaptation (for example, K steps of stochastic gradient descent; Figure 2). It is therefore trajectory agnostic and hence compatible with arbitrary task learning processes. Because the meta-gradient is independent of the number of task gradient steps, it avoids vanishing/exploding gradients and the credit assignment problem by design. It does rely on second-order gradients, a requirement we can relax by detaching the task parameter gradients (∇L^τ_task) in Eq. 11:

    L̂(φ) := Σ_{τ∼p(τ)} Σ_{θ^τ∼p(θ|τ)} L^τ_meta( sg[ θ^τ − α ∇L^τ_task(θ^τ; φ) ]; φ ),    (12)

where sg is the stop-gradient operator. In contrast to the first-order approximation of MAML (Finn et al., 2017), which ignores the entire trajectory except for the final gradient, this approximation retains all gradient terms and only discards local second-order effects, which are typically dominated by first-order effects in long parameter trajectories (Flennerhag et al., 2019).
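The effect of the stop-gradient in Eq. 12 can be seen on a scalar toy problem of our own construction (not from the paper): take L_task(θ; φ) = ½(φθ)² and L_meta(θ; φ) = ½(φθ − y)². The full meta-gradient differentiates through the inner update θ′ = θ − α∇L_task(θ; φ); the approximation treats θ′ as a constant and keeps only the explicit dependence of L_meta on φ.

```python
alpha, phi, theta, y = 0.1, 1.5, 2.0, 1.0

def task_grad(theta, phi):
    # nabla_theta L_task for L_task = 0.5 * (phi * theta)^2
    return phi**2 * theta

def meta_loss(theta, phi):
    # L_meta = 0.5 * (phi * theta - y)^2
    return 0.5 * (phi * theta - y)**2

theta_new = theta - alpha * task_grad(theta, phi)   # one inner step

eps = 1e-6
# Full meta-gradient dL_meta/dphi: differentiate through the inner update too.
full = (meta_loss(theta - alpha * task_grad(theta, phi + eps), phi + eps)
        - meta_loss(theta - alpha * task_grad(theta, phi - eps), phi - eps)) / (2 * eps)

# Approximate meta-gradient (Eq. 12): stop-gradient on the updated parameters,
# so phi enters only through the explicit argument of L_meta.
approx = (meta_loss(theta_new, phi + eps) - meta_loss(theta_new, phi - eps)) / (2 * eps)

# The two gradients differ by exactly the trajectory term that Eq. 12 discards.
```

For these values, theta_new = 1.55, the approximate gradient is (φθ′ − y)θ′ ≈ 2.054, and the full gradient adds the trajectory term (φθ′ − y)φ · dθ′/dφ ≈ −1.193, giving ≈ 0.861.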
Empirically, we find that our approximation incurs only a minor loss of performance in an ablation study on Omniglot (Appendix F). Interestingly, this approximation is a form of multitask learning with respect to φ (Li & Hoiem, 2016; Bilen & Vedaldi, 2017; Rebuffi et al., 2017) that marginalises over task parameters θ^τ.

Algorithm 1 WarpGrad: online meta-training

    Require: p(τ): distribution over tasks
    Require: α, β, λ: hyper-parameters
     1: initialise φ and p(θ_0 | τ)
     2: while not done do
     3:   sample mini-batch of tasks T from p(τ)
     4:   g_φ, g_θ0 ← 0
     5:   for all τ ∈ T do
     6:     θ^τ_0 ∼ p(θ_0 | τ)
     7:     for all k in 0, ..., K^τ − 1 do
     8:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
     9:       g_φ ← g_φ + ∇L(φ; θ^τ_k)
    10:       g_θ0 ← g_θ0 + ∇C(θ_0; θ^τ_{0:k})
    11:     end for
    12:   end for
    13:   φ ← φ − β g_φ
    14:   θ_0 ← θ_0 − λβ g_θ0
    15: end while

Algorithm 2 WarpGrad: offline meta-training

    Require: p(τ): distribution over tasks
    Require: α, β, λ, η: hyper-parameters
     1: initialise φ and p(θ_0 | τ)
     2: while not done do
     3:   initialise buffer B = {}
     4:   sample mini-batch of tasks T from p(τ)
     5:   for all τ ∈ T do
     6:     θ^τ_0 ∼ p(θ_0 | τ)
     7:     B[τ] = [θ^τ_0]
     8:     for all k in 0, ..., K^τ − 1 do
     9:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
    10:       B[τ].append(θ^τ_{k+1})
    11:     end for
    12:   end for
    13:   i, g_φ, g_θ0 ← 0
    14:   for all (τ, k) ∈ B do
    15:     g_φ ← g_φ + ∇L(φ; θ^τ_k)
    16:     g_θ0 ← g_θ0 + ∇C(θ^τ_0; θ^τ_{0:k})
    17:     i ← i + 1
    18:     if i = η then
    19:       φ ← φ − β g_φ
    20:       θ_0 ← θ_0 − λβ g_θ0
    21:       i, g_φ, g_θ0 ← 0
    22:     end if
    23:   end for
    24: end while

2.5 INTEGRATION WITH LEARNED INITIALISATIONS

WarpGrad is a method for learning warp-layer parameters φ over a joint search space defined by p(τ, θ).
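Algorithm 1 can be exercised end-to-end on a toy problem. The sketch below is our own construction, not the released implementation: a scalar warp parameter φ, quadratic task losses L_task = ½||φθ − t||², an L_meta of the same form in the stop-gradient style of Eq. 12, and a Reptile-style surrogate for ∇C(θ_0) are all illustrative assumptions. It follows the structure of the online algorithm: inner steps on L_task, accumulation of warp and initialisation meta-gradients, then meta-updates.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, lam, K = 0.1, 0.01, 0.1, 5   # hyper-parameters (arbitrary choices)
phi = 1.0                                  # scalar warp parameter
theta0 = np.zeros(2)                       # shared initialisation

def task_grad(theta, phi, target):
    # nabla_theta of the toy task objective L_task = 0.5 * ||phi*theta - target||^2
    return phi * (phi * theta - target)

def warp_meta_grad(theta, phi, target):
    # d/dphi of L_meta = 0.5 * ||phi*theta - target||^2 at a sampled iterate,
    # treating theta as a constant (the stop-gradient form of Eq. 12).
    return float(np.sum((phi * theta - target) * theta))

for _ in range(100):                       # "while not done"
    targets = rng.standard_normal((4, 2))  # mini-batch of toy "tasks"
    g_phi, g_theta0 = 0.0, np.zeros(2)
    for target in targets:
        theta = theta0.copy()              # theta_0^tau ~ p(theta_0 | tau)
        for _ in range(K):
            theta = theta - alpha * task_grad(theta, phi, target)  # line 8
            g_phi += warp_meta_grad(theta, phi, target)            # line 9
        # Reptile-style surrogate for nabla C(theta_0) (an assumption here):
        g_theta0 += theta0 - theta
    phi -= beta * g_phi                    # line 13
    theta0 -= lam * beta * g_theta0        # line 14
```

On this toy problem the warp parameter grows until a K-step chain of warped gradient steps reaches the task solution, mirroring how warp-layers are shaped to make short task trajectories effective.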
Because WarpGrad takes this distribution as given, we can integrate WarpGrad with methods that define or learn some form of "prior" p(θ_0 | τ) over θ^τ_0. For instance: (a) Multi-task solution: in online learning, we can alternate between updating a multi-task solution and tuning warp parameters; we use this approach in our reinforcement learning experiment (Section 4.3). (b) Meta-learned point-estimate: when task adaptation occurs in batch mode, we can meta-learn a shared initialisation θ_0; our few-shot and supervised learning experiments take this approach (Section 4.1). (c) Meta-learned prior: WarpGrad can be combined with Bayesian methods that define a full prior (Rusu et al., 2019; Oreshkin et al., 2018; Lacoste et al., 2018; Kim et al., 2018). We incorporate such methods via some objective C (potentially vacuous) over θ_0 that we optimise jointly with WarpGrad:

    J(φ, θ_0) := L(φ) + λ C(θ_0),    (13)

where L̂ may be substituted for L and λ ∈ [0, ∞) is a hyper-parameter. We train the WarpGrad optimiser via stochastic gradient descent and solve Eq. 13 by alternating between sampling task parameters from p(τ, θ), given the current parameter values for φ, and taking meta-gradient steps over these samples to update φ. As such, our method can also be seen as a generalised form of gradient descent, namely Mirror Descent with a meta-learned dual space (Desjardins et al., 2015; Beck & Teboulle, 2003). The details of the sampling procedure may vary depending on the specifics of the tasks (static, sequential), the design of the task-learner (feed-forward, recurrent), and the learning objective (supervised, self-supervised, reinforcement learning). In Algorithm 1 we illustrate a simple online algorithm with constant memory and linear complexity in K, assuming the same holds for C.
A drawback of this approach is that it is relatively data inefficient; in Appendix B we detail a more complex offline training algorithm that stores task parameters in a replay buffer for mini-batched training of φ. The gains of the offline variant can be dramatic: in our Omniglot experiment (Section 4.1), offline meta-training allows us to update warp parameters 2000 times with each meta-batch, improving final test accuracy from 76.3% to 84.3% (Appendix F).

3 RELATED WORK

Learning to learn, or meta-learning, has previously been explored in a variety of settings. Early work focused on evolutionary approaches (Schmidhuber, 1987; Bengio et al., 1991; Thrun & Pratt, 1998). Hochreiter et al. (2001) introduced gradient descent methods to meta-learning, specifically for recurrent meta-learning algorithms, an approach extended to RL by Wang et al. (2016). A similar approach was taken by Andrychowicz et al. (2016) and Ravi & Larochelle (2016) to meta-learn a parameterised update rule in the form of a Recurrent Neural Network (RNN). A related idea is to separate parameters into "slow" and "fast" weights, where the former capture meta-information and the latter encapsulate rapid adaptation (Hinton & Plaut, 1987; Schmidhuber, 1992; Ba et al., 2016). This can be implemented by embedding a neural network that dynamically adapts the parameters of a main architecture (Ha et al., 2016). WarpGrad can be seen as learning slow warp-parameters that precondition adaptation of fast weights.

Recent meta-learning focuses almost exclusively on few-shot learning, where tasks are characterised by severe data scarcity. In this setting, tasks must be sufficiently similar that a new task can be learned from a single example or a handful of examples (Lake et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Ren et al., 2018).
Several meta-learners have been proposed that directly predict the parameters of the task-learner (Bertinetto et al., 2016; Munkhdalai et al., 2018; Gidaris & Komodakis, 2018; Qiao et al., 2018). To scale, such methods typically pretrain a feature extractor and predict a small subset of the parameters. Closely related to our work are gradient-based few-shot learning methods that extend MAML by sharing some subset of parameters between task-learners that is fixed during task training but meta-learned across tasks, which may reduce overfitting (Mishra et al., 2018; Lee & Choi, 2018; Munkhdalai et al., 2018) or induce more robust convergence (Zintgraf et al., 2019). It can also be used to model latent variables for concept or task inference, which implicitly induce gradient modulation (Zhou et al., 2018; Oreshkin et al., 2018; Rusu et al., 2019; Lee et al., 2019). Our work is also related to gradient-based meta-learning of a shared initialisation that scales beyond few-shot learning (Nichol et al., 2018; Flennerhag et al., 2019).

Meta-learned preconditioning is closely related to parallel work on second-order optimisation methods for high-dimensional non-convex loss surfaces (Nocedal & Wright, 2006; Kingma & Ba, 2015). In this setting, second-order optimisers typically struggle to improve upon first-order baselines (Sutskever et al., 2013). As second-order curvature is typically intractable to compute, such methods resort to low-rank approximations (Nocedal & Wright, 2006; Martens, 2010; Martens & Grosse, 2015) and suffer from instability (Byrd et al., 2016). In particular, Natural Gradient Descent (Amari, 1998) is a method that uses the Fisher Information Matrix as a curvature metric (Amari & Nagaoka, 2007).
Several methods have been proposed for amortising the cost of estimating the metric (Pascanu & Bengio, 2014; Martens & Grosse, 2015; Desjardins et al., 2015). As noted by Desjardins et al. (2015), expressing preconditioning through interleaved projections can be seen as a form of Mirror Descent (Beck & Teboulle, 2003). WarpGrad offers a new perspective on gradient preconditioning by introducing a generic form of model-embedded preconditioning that exploits global information beyond the task at hand.

4 EXPERIMENTS

Table 1: Mean test accuracy after task adaptation on held-out evaluation tasks. † Multi-headed. ‡ No meta-training; see Appendix E and Appendix H.

    miniImageNet      5-way 1-shot    5-way 5-shot
    Reptile           50.0 ± 0.3      66.0 ± 0.6
    Meta-SGD          50.5 ± 1.9      64.0 ± 0.9
    (M)T-Net          51.7 ± 1.8      −
    CAVIA (512)       51.8 ± 0.7      65.9 ± 0.6
    MAML              48.7 ± 1.8      63.2 ± 0.9
    Warp-MAML         52.3 ± 0.8      68.4 ± 0.6

    tieredImageNet    5-way 1-shot    5-way 5-shot
    MAML              51.7 ± 1.8      70.3 ± 1.8
    Warp-MAML         57.2 ± 0.9      74.1 ± 0.7

                      tieredImageNet     Omniglot
                      10-way 640-shot    20-way 100-shot
    SGD ‡             58.1 ± 1.5         51.0
    KFAC ‡            −                  56.0
    Finetuning †      −                  76.4 ± 2.2
    Reptile           76.52 ± 2.1        70.8 ± 1.9
    Leap              73.9 ± 2.2         75.5 ± 2.6
    Warp-Leap         80.4 ± 1.6         83.6 ± 1.9

We evaluate WarpGrad in a set of experiments designed to answer three questions: (1) Do WarpGrad methods retain the inductive bias of MAML-based few-shot learners? (2) Can WarpGrad methods scale to problems beyond the reach of such methods? (3) Can WarpGrad generalise to complex meta-learning problems?

4.1 FEW-SHOT LEARNING

For few-shot learning, we test whether WarpGrad retains the inductive bias of gradient-based meta-learners while avoiding backpropagation through the gradient descent process.
To isolate the effect of the WarpGrad objective, we use linear warp-layers that we train using online meta-training (Algorithm 1) to make WarpGrad as close to T-Nets as possible. For a fair comparison, we meta-learn the initialisation using MAML (Warp-MAML) with J(θ_0, φ) := L(φ) + λ C_MAML(θ_0). We evaluate the importance of meta-learning the initialisation in Appendix G and find that WarpGrad achieves similar performance under random task parameter initialisation. All task-learners use a convolutional architecture that stacks 4 blocks, each made up of a 3×3 convolution, max-pooling, batch-norm, and ReLU activation. We define Warp-MAML by inserting warp-layers in the form of 3×3 convolutions after each block in the baseline task-learner. All baselines are tuned with identical and independent hyper-parameter searches (including filter sizes; full experimental settings in Appendix H), and we report the best results from our experiments or the literature. Warp-MAML outperforms all baselines (Table 1), improving 1- and 5-shot accuracy by 3.6 and 5.2 percentage points on miniImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2016) and by 5.5 and 3.8 percentage points on tieredImageNet (Ren et al., 2018), which indicates that WarpGrad retains the inductive bias of MAML-based meta-learners.

Figure 4: Left: Omniglot test accuracies on held-out tasks after meta-training on a varying number of tasks. Shading represents standard deviation across 10 independent runs.
We compare Warp-Leap, Leap, Reptile, and multi-headed finetuning, as well as SGD and KFAC, which use random initialisation but a 4x larger batch size and a 10x larger learning rate. Right: Mean cumulative return on an RL maze navigation task. Shading represents inter-quartile ranges across 10 independent runs. † Simple modulation and ‡ retroactive modulation (Miconi et al., 2019).

4.2 MULTI-SHOT LEARNING

Next, we evaluate whether WarpGrad can scale beyond few-shot adaptation on similar supervised problems. We propose a new protocol for tieredImageNet that increases the number of adaptation steps to 640 and use 6 convolutional blocks in task-learners, which are otherwise defined as above. Since MAML-based approaches cannot backpropagate through 640 adaptation steps for models of this size, we evaluate WarpGrad against two gradient-based meta-learners that meta-learn an initialisation without such backpropagation, Reptile (Nichol et al., 2018) and Leap (Flennerhag et al., 2019), and we define a Warp-Leap meta-learner by J(θ₀, φ) := L(φ) + λ C^Leap(θ₀). Leap is an attractive complement as it minimises the expected length of gradient descent trajectories across tasks. Under WarpGrad, this becomes a joint search for a geometry in which task adaptation defines geodesics (shortest paths; see Appendix C for details). While Reptile outperforms Leap by 2.6 percentage points on this benchmark, Warp-Leap surpasses both, with a margin of 3.88 percentage points over Reptile (Table 1).

We further evaluate Warp-Leap on the multi-shot Omniglot (Lake et al., 2011) protocol proposed by Flennerhag et al. (2019), where each of the 50 alphabets is a 20-way classification task. Task adaptation involves 100 gradient steps on random samples that are preprocessed by random affine transformations.
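As an illustration, such a preprocessing step might look as follows; the rotation, scale, and shear ranges below are assumptions for the sketch, not the protocol's exact values (nearest-neighbour sampling, zero padding):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_affine(img, rng):
    """Apply a random rotation/scale/shear to a square grayscale image."""
    n = img.shape[0]
    angle = rng.uniform(-np.pi / 6, np.pi / 6)   # assumed range
    scale = rng.uniform(0.8, 1.2)                # assumed range
    shear = rng.uniform(-0.2, 0.2)               # assumed range
    A = scale * np.array([[np.cos(angle), -np.sin(angle) + shear],
                          [np.sin(angle),  np.cos(angle)]])
    c = (n - 1) / 2.0
    ys, xs = np.mgrid[0:n, 0:n].astype(float)
    coords = np.stack([ys - c, xs - c])               # centre the grid
    src = np.linalg.inv(A) @ coords.reshape(2, -1) + c  # inverse mapping
    src = np.rint(src).astype(int).reshape(2, n, n)
    inside = ((src >= 0) & (src < n)).all(axis=0)     # valid source pixels
    out = np.zeros_like(img)
    out[inside] = img[src[0][inside], src[1][inside]]
    return out

img = rng.random((28, 28))
aug = random_affine(img, rng)
assert aug.shape == img.shape
```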
We report results for Warp-Leap under offline meta-training (Algorithm 2), which updates warp parameters 2000 times per meta step (see Appendix E for experimental details). Warp-Leap enjoys similar performance on this task as well, improving over Leap and Reptile by 8.1 and 12.8 points respectively (Table 1). We also perform an extensive ablation study varying the number of tasks in the meta-training set. Except for the case of a single task, Warp-Leap substantially outperforms all baselines (Figure 4), achieving a higher rate of convergence and reducing the final test error from ~30% to ~15%. Non-linear warps, which go beyond block-diagonal preconditioning, reach ~11% test error (refer to Appendix F and Table 2 for the full results). Finally, in an ablation study (Appendix G) we find that WarpGrad methods behave distinctly differently from Natural Gradient Descent methods. WarpGrad reduces the final test error from ~42% to ~19%, controlling for initialisation, while its preconditioning matrices differ from what the literature suggests (Desjardins et al., 2015).

4.3 COMPLEX META-LEARNING: REINFORCEMENT AND CONTINUAL LEARNING

(c.1) Reinforcement Learning. To illustrate how WarpGrad may be used both with recurrent neural networks and in meta-reinforcement learning, we evaluate it on a maze navigation task proposed by Miconi et al. (2018). The environment is a fixed maze and a task is defined by randomly choosing a goal location. The agent's objective is to find the location as many times as possible, being teleported to a random location each time it finds it. We use advantage actor-critic with a basic recurrent neural network (Wang et al., 2016) as the task-learner, and we design a Warp-RNN as a HyperNetwork (Ha et al., 2016) that uses an LSTM that is fixed during task training.
This LSTM modulates the weights of the task-learning RNN (defined in Appendix I), which in turn is trained on mini-batches of 30 episodes for 200 000 steps. We accumulate the gradient of fixed warp-parameters continually (Algorithm 3, Appendix B) at each task parameter update. Warp parameters are updated on every 30th step on task parameters (we control for meta-LSTM capacity in Appendix I). We compare against Learning to Reinforcement Learn (Wang et al., 2016) and Hebbian meta-learning (Miconi et al., 2018; 2019); see Appendix I for details. Notably, linear warps (T-Nets) do worse than the baseline RNN on this task, while the Warp-RNN converges to a mean cumulative reward of ~160 in 60 000 episodes, compared to baselines that reach at most a mean cumulative reward of ~125 after 100 000 episodes (Figure 4), reaching ~150 after 200 000 episodes (Appendix I).

Figure 5: Continual learning experiment. Average log-loss over 100 randomly sampled tasks, each comprised of 5 sub-tasks. Left: learned sequentially as seen during meta-training. Right: learned in random order [sub-task 1, 3, 4, 2, 0].

(c.2) Continual Learning. We test whether WarpGrad can prevent catastrophic forgetting (French, 1999) in a continual learning scenario. To this end, we design a continual learning version of the sine regression meta-learning experiment in Finn et al. (2017) by splitting the input interval [−5, 5] ⊂ ℝ into 5 consecutive sub-tasks (an alternative protocol was recently proposed independently by Javed & White, 2019). Each sub-task is a regression problem with the target being a mixture of two random sine waves. We train a 4-layer feed-forward task-learner with interleaved warp-layers incrementally on one sub-task at a time (see Appendix J for details). To isolate the behaviour of WarpGrad parameters, we use a fixed random initialisation for each task sequence.
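The sub-task construction can be sketched as follows; the amplitude, frequency, and phase ranges are illustrative assumptions (the exact sampling distributions are given in Appendix J):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(rng, n_subtasks=5):
    """Sample one continual-learning task: a mixture of two random sine
    waves on [-5, 5], split into consecutive sub-task intervals."""
    amps = rng.uniform(0.1, 5.0, size=2)     # assumed amplitude range
    freqs = rng.uniform(0.5, 2.0, size=2)    # assumed frequency range
    phases = rng.uniform(0.0, np.pi, size=2) # assumed phase range

    def target(x):
        return sum(a * np.sin(w * x + p)
                   for a, w, p in zip(amps, freqs, phases))

    edges = np.linspace(-5.0, 5.0, n_subtasks + 1)
    subtasks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        x = rng.uniform(lo, hi, size=32)     # inputs restricted to the interval
        subtasks.append((x, target(x)))
    return subtasks

subtasks = sample_task(rng)
assert len(subtasks) == 5
```

The task-learner then sees the five `(x, y)` sets one at a time, in order, during adaptation.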
Warp parameters are meta-learned to prevent catastrophic forgetting by defining L^τ_meta as the average task loss over the current and all previous sub-tasks, for each sub-task in a task sequence. This forces warp-parameters to disentangle the adaptation processes of current and previous sub-tasks. We train on each sub-task for 20 steps, for a total of 100 task adaptation steps. We evaluate WarpGrad on 100 random tasks and find that it learns new sub-tasks well, with mean losses on the order of 10⁻³. When switching sub-task, performance immediately deteriorates to ~10⁻² but remains stable for the remainder of training (Figure 5). Our results indicate that WarpGrad can be an effective mechanism against catastrophic forgetting, a promising avenue for further research. For detailed results, see Appendix J.

5 CONCLUSION

We propose WarpGrad, a novel meta-learner that combines the expressive capacity and flexibility of memory-based meta-learners with the inductive bias of gradient-based meta-learners. WarpGrad meta-learns to precondition gradients during task adaptation without backpropagating through the adaptation process, and we find empirically that it retains the inductive bias of MAML-based few-shot learners while being able to scale to complex problems and architectures. Further, by expressing preconditioning through warp-layers that are universal function approximators, WarpGrad can express geometries beyond the block-diagonal structure of prior works. WarpGrad provides a principled framework for general-purpose meta-learning that integrates learning paradigms, such as continual learning, an exciting avenue for future research. We introduce novel means of preconditioning, for instance with residual and recurrent warp-layers.
Understanding how WarpGrad manifolds relate to second-order optimisation methods will further our understanding of gradient-based meta-learning and aid us in designing warp-layers with stronger inductive bias. In their current form, WarpGrad methods share some of the limitations of many popular meta-learning approaches. While WarpGrad avoids backpropagating through the task training process, as in Warp-Leap, the WarpGrad objective samples from parameter trajectories and therefore has linear computational complexity in the number of adaptation steps, currently an unresolved limitation of gradient-based meta-learning. Algorithm 2 hints at exciting possibilities for overcoming this limitation.

ACKNOWLEDGEMENTS

The authors would like to thank Guillaume Desjardins for helpful discussions on an early draft, as well as anonymous reviewers for their comments. SF gratefully acknowledges support from the North West Doctoral Training Centre under ESRC grant ES/J500094/1 and The Alan Turing Institute under EPSRC grant EP/N510129/1.

REFERENCES

Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew W, Pfau, David, Schaul, Tom, Shillingford, Brendan, and De Freitas, Nando. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

Antoniou, Antreas, Edwards, Harrison, and Storkey, Amos J. How to train your MAML. In International Conference on Learning Representations, 2019.

Ba, Jimmy, Hinton, Geoffrey E, Mnih, Volodymyr, Leibo, Joel Z, and Ionescu, Catalin. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, 2016.
Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

Bengio, Yoshua, Bengio, Samy, and Cloutier, Jocelyn. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1991.

Bertinetto, Luca, Henriques, João F, Valmadre, Jack, Torr, Philip, and Vedaldi, Andrea. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.

Bilen, Hakan and Vedaldi, Andrea. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.

Byrd, R., Hansen, S., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Chen, Yutian, Hoffman, Matthew W, Colmenarejo, Sergio Gómez, Denil, Misha, Lillicrap, Timothy P, Botvinick, Matt, and de Freitas, Nando. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning, 2017.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In International Conference on Computer Vision and Pattern Recognition, 2009.

Desjardins, Guillaume, Simonyan, Karen, Pascanu, Razvan, and Kavukcuoglu, Koray. Natural neural networks. In Advances in Neural Information Processing Systems, 2015.

Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.

Flennerhag, Sebastian, Yin, Hujun, Keane, John, and Elliot, Mark. Breaking the activation function bottleneck through adaptive parameterization. In Advances in Neural Information Processing Systems, 2018.
Flennerhag, Sebastian, Moreno, Pablo G., Lawrence, Neil D., and Damianou, Andreas. Transferring knowledge across learning processes. In International Conference on Learning Representations, 2019.

French, Robert M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

Gidaris, Spyros and Komodakis, Nikos. Dynamic few-shot visual learning without forgetting. In International Conference on Computer Vision and Pattern Recognition, 2018.

Grant, Erin, Finn, Chelsea, Levine, Sergey, Darrell, Trevor, and Griffiths, Thomas L. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018.

Ha, David, Dai, Andrew M., and Le, Quoc V. HyperNetworks. In International Conference on Learning Representations, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In International Conference on Computer Vision and Pattern Recognition, 2016.

Hinton, Geoffrey E and Plaut, David C. Using fast weights to deblur old memories. In 9th Annual Conference of the Cognitive Science Society, 1987.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hochreiter, Sepp, Younger, A. Steven, and Conwell, Peter R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

Javed, Khurram and White, Martha. Meta-learning representations for continual learning. arXiv preprint arXiv:1905.12588, 2019.

Kim, Taesup, Yoon, Jaesik, Dia, Ousmane, Kim, Sungwoong, Bengio, Yoshua, and Ahn, Sungjin. Bayesian model-agnostic meta-learning.
arXiv preprint arXiv:1806.03836, 2018.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Lacoste, Alexandre, Oreshkin, Boris, Chung, Wonchang, Boquet, Thomas, Rostamzadeh, Negar, and Krueger, David. Uncertainty in multitask transfer learning. In Advances in Neural Information Processing Systems, 2018.

Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2011.

Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Lecun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.

Lee, John M. Introduction to Smooth Manifolds. Springer, 2003.

Lee, Kwonjoon, Maji, Subhransu, Ravichandran, Avinash, and Soatto, Stefano. Meta-learning with differentiable convex optimization. In CVPR, 2019.

Lee, Sang-Woo, Kim, Jin-Hwa, Ha, JungWoo, and Zhang, Byoung-Tak. Overcoming catastrophic forgetting by incremental moment matching. In Advances in Neural Information Processing Systems, 2017.

Lee, Yoonho and Choi, Seungjin. Meta-learning with adaptive layerwise metric and subspace. In International Conference on Machine Learning, 2018.

Li, Ke and Malik, Jitendra. Learning to optimize. In International Conference on Machine Learning, 2016.

Li, Zhenguo, Zhou, Fengwei, Chen, Fei, and Li, Hang. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint, 2017.

Li, Zhizhong and Hoiem, Derek. Learning without forgetting. In European Conference on Computer Vision, 2016.

Martens, James. Deep learning via Hessian-free optimization.
In International Conference on Machine Learning, 2010.

Martens, James and Grosse, Roger. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.

Mendonca, Russell, Gupta, Abhishek, Kralev, Rosen, Abbeel, Pieter, Levine, Sergey, and Finn, Chelsea. Guided meta-policy search. arXiv preprint arXiv:1904.00956, 2019.

Metz, Luke, Maheswaranathan, Niru, Cheung, Brian, and Sohl-Dickstein, Jascha. Meta-learning update rules for unsupervised representation learning. In International Conference on Learning Representations, 2019.

Miconi, Thomas, Clune, Jeff, and Stanley, Kenneth O. Differentiable plasticity: training plastic neural networks with backpropagation. In International Conference on Machine Learning, 2018.

Miconi, Thomas, Clune, Jeff, and Stanley, Kenneth O. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In International Conference on Learning Representations, 2019.

Mishra, Nikhil, Rohaninejad, Mostafa, Chen, Xi, and Abbeel, Pieter. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.

Mujika, Asier, Meier, Florian, and Steger, Angelika. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, 2017.

Munkhdalai, Tsendsuren, Yuan, Xingdi, Mehri, Soroush, Wang, Tong, and Trischler, Adam. Learning rapid-temporal adaptations. In International Conference on Machine Learning, 2018.

Nichol, Alex, Achiam, Joshua, and Schulman, John. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer, 2006.

Oreshkin, Boris N, Lacoste, Alexandre, and Rodriguez, Paul. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 2018.

Park, Eunbyung and Oliva, Junier B. Meta-curvature.
arXiv preprint arXiv:1902.03356, 2019.

Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.

Perez, Ethan, Strub, Florian, De Vries, Harm, Dumoulin, Vincent, and Courville, Aaron. FiLM: Visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence, 2018.

Qiao, Siyuan, Liu, Chenxi, Shen, Wei, and Yuille, Alan L. Few-shot image recognition by predicting parameters from activations. In International Conference on Computer Vision and Pattern Recognition, 2018.

Rakelly, Kate, Zhou, Aurick, Quillen, Deirdre, Finn, Chelsea, and Levine, Sergey. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint, 2019.

Ravi, Sachin and Larochelle, Hugo. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.

Rebuffi, Sylvestre-Alvise, Bilen, Hakan, and Vedaldi, Andrea. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.

Ren, Mengye, Triantafillou, Eleni, Ravi, Sachin, Snell, Jake, Swersky, Kevin, Tenenbaum, Joshua B., Larochelle, Hugo, and Zemel, Richard S. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.

Rusu, Andrei A., Rao, Dushyant, Sygnowski, Jakub, Vinyals, Oriol, Pascanu, Razvan, Osindero, Simon, and Hadsell, Raia. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.

Schmidhuber, Jürgen. Evolutionary Principles in Self-Referential Learning. PhD thesis, Technische Universität München, 1987.

Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
Snell, Jake, Swersky, Kevin, and Zemel, Richard S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

Suarez, Joseph. Language modeling with recurrent highway hypernetworks. In Advances in Neural Information Processing Systems, 2017.

Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.

Thrun, Sebastian and Pratt, Lorien. Learning to learn: Introduction and overview. In Learning To Learn. Springer, 1998.

Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.

Wang, Jane X., Kurth-Nelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo, Joel Z., Munos, Rémi, Blundell, Charles, Kumaran, Dharshan, and Botvinick, Matthew. Learning to reinforcement learn. In Annual Meeting of the Cognitive Science Society, 2016.

Wu, Yuhuai, Ren, Mengye, Liao, Renjie, and Grosse, Roger B. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.

Zhou, Fengwei, Wu, Bin, and Li, Zhenguo. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.

Zintgraf, Luisa M., Shiarlis, Kyriacos, Kurin, Vitaly, Hofmann, Katja, and Whiteson, Shimon. Fast context adaptation via meta-learning. In International Conference on Machine Learning, 2019.
APPENDIX A WARPGRAD DESIGN PRINCIPLES FOR NEURAL NETS

Figure 6: Illustration of possible WarpGrad architectures: (a) Warp-ConvNet, (b) Warp-ResNet, (c) Warp-LSTM, (d) Warp-HyperNetwork. Orange represents task layers and blue represents warp-layers; ⊕ denotes residual connections and ⊙ any form of gating mechanism. We can obtain warped architectures by interleaving task- and warp-layers (a, c) or by designating some layers in standard architectures as task-adaptable and some as warp-layers (b, d).

WarpGrad is a model-embedded meta-learned optimiser that allows for several implementation strategies. To embed warp-layers given a task-learner architecture, we may either insert new warp-layers into the given architecture or designate some layers as warp-layers and some as task layers. We found that WarpGrad can be used both in a high-capacity mode, where task-learners are relatively weak to avoid overfitting, and in a low-capacity mode, where task-learners are powerful and warp-layers are relatively weak. The best approach depends on the problem at hand. We highlight three approaches to designing WarpGrad optimisers, starting from a given architecture:

(a) Model partitioning. Given a desired architecture, designate some operations as task-adaptable and the rest as warp-layers. Task layers do not have to interleave exactly with warp-layers, as gradient warping arises both through the forward pass and through backpropagation. This was how we approached the tieredImageNet and miniImageNet experiments.

(b) Model augmentation. Given a model, designate all layers as task-adaptable and interleave warp-layers.
Warp-layers can be relatively weak, as backpropagation through non-linear activations ensures expressive gradient warping. This was our approach to the Omniglot experiment; our main architecture interleaves linear warp-layers in a standard architecture.

(c) Information compression. Given a model, designate all layers as warp-layers and interleave weak task layers. In this scenario, task-learners are prone to overfitting. Pushing capacity into the warp allows it to encode general information the task-learner can draw on during task adaptation. This approach is similar to approaches in transfer and meta-learning that restrict the number of free parameters during task training (Rebuffi et al., 2017; Lee & Choi, 2018; Zintgraf et al., 2019).

Note that in either case, once warp-layers have been chosen, standard backpropagation automatically warps gradients for us. Thus, WarpGrad is fully compatible with any architecture, for instance Residual Neural Networks (He et al., 2016) or LSTMs. For convolutional neural networks, we may use any form of convolution, learned normalisation (e.g. Ioffe & Szegedy, 2015), or adaptor module (e.g. Rebuffi et al., 2017; Perez et al., 2018) to design task- and warp-layers. For recurrent networks, we can use stacked LSTMs to interleave warped layers, as well as any type of HyperNetwork architecture (e.g. Ha et al., 2016; Suarez, 2017; Flennerhag et al., 2018) or partitioning of fast and slow weights (e.g. Mujika et al., 2017). Figure 6 illustrates this process.

B WARPGRAD META-TRAINING ALGORITHMS

In this section, we provide the variants of WarpGrad training algorithms used in this paper. Algorithm 1 describes a simple online algorithm, which accumulates meta-gradients online during task adaptation. This algorithm has constant memory and scales linearly in the length of task trajectories.
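As a minimal illustration of this online scheme, the toy sketch below meta-learns a diagonal warp exp(φ) that preconditions gradient descent on random ill-conditioned quadratic tasks. The meta-gradient is taken with task parameters held fixed (the stop-gradient in the WarpGrad objective); the losses, dimensions, and hyper-parameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.array([10.0, 0.1])   # shared curvatures: tasks are ill-conditioned quadratics
alpha, beta = 0.01, 0.001   # inner-loop / meta learning rates (illustrative)
phi = np.zeros(2)           # warp parameters; preconditioner P = diag(exp(phi))

def grad(theta, c):
    # Gradient of the task loss 0.5 * sum(a * (theta - c)**2)
    return a * (theta - c)

for meta_step in range(200):
    g_phi = np.zeros(2)
    for _ in range(4):                           # mini-batch of tasks
        c = rng.normal(size=2)                   # a task = a random optimum location
        theta = rng.normal(size=2)               # random task initialisation
        for _ in range(10):                      # task adaptation (inner loop)
            P = np.exp(phi)
            g = grad(theta, c)
            theta_next = theta - alpha * P * g   # warped gradient step
            # Meta-gradient of the one-step-lookahead loss w.r.t. phi,
            # with theta treated as a constant (the stop-gradient):
            g_phi += a * (theta_next - c) * (-alpha * P * g)
            theta = theta_next
    phi -= beta * g_phi                          # meta update of the warp

def mean_final_loss(phi, n_tasks=50):
    # Average post-adaptation loss on held-out tasks under a given warp.
    r = np.random.default_rng(1)
    total = 0.0
    for _ in range(n_tasks):
        c, theta = r.normal(size=2), r.normal(size=2)
        for _ in range(10):
            theta = theta - alpha * np.exp(phi) * grad(theta, c)
        total += 0.5 * np.sum(a * (theta - c) ** 2)
    return total / n_tasks

# The meta-learned warp speeds up adaptation relative to the unwarped baseline.
assert mean_final_loss(phi) < mean_final_loss(np.zeros(2))
```

Along the steep direction, the meta-gradient drives exp(φ) toward the inverse curvature (up to the learning-rate scale), which is the diagonal analogue of the preconditioning that warp-layers learn.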
In Algorithm 2, we describe an offline meta-training algorithm. This algorithm is similar to Algorithm 1 in many respects, but differs in that we do not compute meta-gradients online during task adaptation. Instead, we accumulate them in a replay buffer of sampled task parameterisations. This buffer is a Monte-Carlo sample of the expectation in the meta objective (Eq. 13) that can be thought of as a dataset in its own right. Hence, we can apply standard mini-batching with respect to the buffer and perform mini-batch gradient descent on warp parameters. This allows us to update warp parameters several times for a given sample of task parameter trajectories, which can greatly improve data efficiency. In our Omniglot experiment, we found offline meta-training to converge faster: in fact, a mini-batch size of 1 (i.e. η = 1 in Algorithm 2) converges rapidly without any instability.

Finally, in Algorithm 3, we present a continual meta-training process where meta-training occurs throughout a stream of learning experiences. Here, C represents a multi-task objective, such as the average task loss, C^multi = Σ_{τ∼p(τ)} L^τ_task. Meta-learning arises by collecting experiences continuously (across different tasks) and using them to accumulate the meta-gradient online. Warp parameters are updated intermittently with the accumulated meta-gradient. We use this algorithm in our maze navigation experiment, where task adaptation is internalised within the RNN task-learner.

C WARPGRAD OPTIMISERS

In this section, we detail the WarpGrad methods used in our experiments.

Warp-MAML. We use this model for few-shot learning (Section 4.1). We use the full warp-objective in Eq. 11 together with the MAML objective (Eq. 1),

    J^{Warp-MAML} := L(φ) + λ C^{MAML}(θ₀),    (14)

where C^{MAML} = L^{MAML} under the constraint P = I. In our experiments, we trained Warp-MAML using the online training algorithm (Algorithm 1).
Warp-Leap. We use this model for multi-shot meta-learning. It is defined by applying Leap (Flennerhag et al., 2019) to θ₀ (Eq. 16),

    J^{Warp-Leap} := L(φ) + λ C^{Leap}(θ₀),    (15)

where the Leap objective is defined by minimising the expected cumulative chordal distance,

    C^{Leap}(θ₀) := Σ_{τ∼p(τ)} Σ_{k=1}^{K^τ} ‖ sg[ϑ^τ_k] − ϑ^τ_{k−1} ‖₂,    ϑ^τ_k = (θ^τ_{k,0}, …, θ^τ_{k,n}, L^τ_task(θ^τ_k; φ)).    (16)

Note that the Leap meta-gradient makes a first-order approximation to avoid backpropagating through the adaptation process. It is given by

    ∇C^{Leap}(θ₀) ≈ − Σ_{τ∼p(τ)} Σ_{k=1}^{K^τ} [ ΔL^τ_task(θ^τ_k; φ) ∇L^τ_task(θ^τ_{k−1}; φ) + Δθ^τ_k ] / ‖ ϑ^τ_k − ϑ^τ_{k−1} ‖₂,    (17)

where ΔL^τ_task(θ^τ_k; φ) := L^τ_task(θ^τ_k; φ) − L^τ_task(θ^τ_{k−1}; φ) and Δθ^τ_k := θ^τ_k − θ^τ_{k−1}. In our experiments, we train Warp-Leap using Algorithm 1 in the multi-shot tieredImageNet experiment and Algorithm 2 in the Omniglot experiment. We perform an ablation study for training algorithms, comparing exact (Eq. 11) versus approximate (Eq. 12) meta-objectives, and several implementations of the warp-layers on Omniglot in Appendix F.

Warp-RNN. For our reinforcement learning experiment, we define a WarpGrad optimiser by meta-learning an LSTM that modulates the weights of the task-learner (see Appendix I for details). For this algorithm, we face a continuous stream of experiences (episodes) that we meta-learn over using our continual meta-training algorithm (Algorithm 3). In our experiment, both L^τ_task and L^τ_meta are the advantage actor-critic objective (Wang et al., 2016); C is computed on one batch of 30 episodes, whereas L is accumulated over η = 30 such batches, for a total of 900 episodes. As each episode involves 300 steps in the environment, we cannot apply the exact meta objective, but use the approximate meta objective (Eq. 12).
Specifically, let E^τ = {s₀, a₁, r₁, s₁, …, s_T, a_T, r_T, s_{T+1}} denote an episode on task τ, where s denotes states, a actions, and r instantaneous rewards. Denote a mini-batch of randomly sampled task episodes by E = {E^τ}_{τ∼p(τ)} and an ordered set of k consecutive mini-batches by E_k = {E_{k−i}}_{i=0}^{k−1}. Then

    L̂(φ; E_k) = 1/n Σ_{E_i ∈ E_k} Σ_{E^τ_{i,j} ∈ E_i} L^τ_meta(φ; θ, E^τ_{i,j})    and    C^multi(θ; E_k) = 1/n′ Σ_{E^τ_{k,j} ∈ E_k} L^τ_task(θ; φ, E^τ_{k,j}),

where n and n′ are normalising constants. The Warp-RNN objective is defined by

    J^{Warp-RNN} := L̂(φ; E_k) + λ C^multi(θ; E_k) if k = η, and J^{Warp-RNN} := λ C^multi(θ; E_k) otherwise.    (18)

Algorithm 1 WarpGrad: online meta-training
Require: p(τ): distribution over tasks; α, β, λ: hyper-parameters
 1: initialise φ and θ₀
 2: while not done do
 3:   sample mini-batch of tasks B from p(τ)
 4:   g_φ, g_{θ₀} ← 0
 5:   for all τ ∈ B do
 6:     θ^τ₀ ← θ₀
 7:     for all k in 0, …, K^τ − 1 do
 8:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
 9:       g_φ ← g_φ + ∇L(φ; θ^τ_k)
10:       g_{θ₀} ← g_{θ₀} + ∇C(θ^τ₀; θ^τ_{0:k})
11:     end for
12:   end for
13:   φ ← φ − β g_φ
14:   θ₀ ← θ₀ − λβ g_{θ₀}
15: end while

Algorithm 2 WarpGrad: offline meta-training
Require: p(τ): distribution over tasks; α, β, λ, η: hyper-parameters
 1: initialise φ and θ₀
 2: while not done do
 3:   sample mini-batch of tasks B from p(τ)
 4:   T ← {τ: [θ₀] for τ in B}
 5:   for all τ ∈ B do
 6:     θ^τ₀ ← θ₀
 7:     for all k in 0, …, K^τ − 1 do
 8:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
 9:       T[τ].append(θ^τ_{k+1})
10:     end for
11:   end for
12:   i, g_φ, g_{θ₀} ← 0
13:   while T not empty do
14:     sample τ, k without replacement
15:     g_φ ← g_φ + ∇L(φ; θ^τ_k)
16:     g_{θ₀} ← g_{θ₀} + ∇C(θ^τ₀; θ^τ_{0:k})
17:     i ← i + 1
18:     if i = η then
19:       φ ← φ − β g_φ
20:       θ₀ ← θ₀ − λβ g_{θ₀}
21:       i, g_φ, g_{θ₀} ← 0
22:     end if
23:   end while
24: end while

Algorithm 3 WarpGrad: continual meta-training
Require: p(τ): distribution over tasks; α, β, λ, η: hyper-parameters
 1: initialise φ and θ
 2: i, g_φ, g_θ ← 0
 3: while not done do
 4:   sample mini-batch of tasks B from p(τ)
 5:   for all τ ∈ B do
 6:     g_φ ← g_φ + ∇L(φ; θ)
 7:     g_θ ← g_θ + ∇C(θ; φ)
 8:   end for
 9:   θ ← θ − λβ g_θ
10:   g_θ, i ← 0, i + 1
11:   if i = η then
12:     φ ← φ − β g_φ
13:     i, g_θ ← 0
14:   end if
15: end while

WarpGrad for Continual Learning. For this experiment, we focus on meta-learning warp-parameters. Hence, the initialisation for each task sequence is a fixed random initialisation (i.e. λC(θ₀) = 0). For the warp meta-objective, we take expectations over N task sequences, where each task sequence is a sequence of T = 5 sub-tasks that the task-learner observes one at a time; thus, while the task loss is defined over the current sub-task, the meta-loss averages over the current and all prior sub-tasks, for each sub-task in the sequence. See Appendix J for detailed definitions. Importantly, because WarpGrad defines task adaptation abstractly through a probability distribution, we can readily implement a continual learning objective by modifying the joint task parameter distribution p(τ, θ) that we use in the meta-objective (Eq. 11). A task defines a sequence of sub-tasks over which we generate parameter trajectories θ^τ. Thus, the only difference from multi-task meta-learning is that parameter trajectories are not generated under a fixed task, but arise as a function of the continual learning algorithm used for adaptation.
We define the conditional distribution p(θ | τ) as before, by sampling sub-task parameters θ^τ_t from a mini-batch of such task trajectories, keeping track of which sub-task t each parameterisation belongs to and which sub-tasks came before it in the given task sequence τ. The meta-objective is constructed, for any sub-task parameterisation θ^τ_t, as L^τ_meta(θ^τ_t) = (1/t) Σ_{i=1}^{t} L^τ_task(θ^τ_t, D_i; φ), where D_i is data from sub-task i (Appendix J). The full meta-objective is an expectation over task parameterisations,

$$
\mathcal{L}^{\text{CL}}(\phi) := \sum_{\tau \sim p(\tau)} \sum_{t=1}^{T} \sum_{\theta^\tau_t \sim p(\theta \mid \tau_t)} \mathcal{L}^{\tau}_{\text{meta}}\big(\theta^\tau_t; \phi\big).
\tag{19}
$$

D SYNTHETIC EXPERIMENT

To build intuition for what it means to warp space, we construct a simple 2-D problem over loss surfaces. A learner is faced with the task of minimising an objective function of the form

f_τ(x_1, x_2) = g^τ_1(x_1) exp(g^τ_2(x_2)) − g^τ_3(x_1) exp(g^τ_4(x_1, x_2)) − g^τ_5 exp(g^τ_6(x_1)),

where each task f_τ is defined by scale and rotation functions g^τ that are randomly sampled from a predefined distribution. Specifically, each task is defined by the objective function

$$
f_\tau(x_1, x_2) = b^\tau_1 (a^\tau_1 - x_1)^2 \exp\!\big(-x_1^2 - (x_2 + a^\tau_2)^2\big) - b^\tau_2 \big(x_1 / s^\tau - x_1^3 - x_2^5\big) \exp\!\big(-x_1^2 - x_2^2\big) - b^\tau_3 \exp\!\big(-(x_1 + a^\tau_3)^2 - x_1^2\big),
$$

where a, b, and s are parameters randomly sampled from s^τ ∼ Cat(1, 2, …, 9, 10), a^τ_i ∼ Cat(−1, 0, 1), and b^τ_i ∼ Cat(−5, −4, …, 4, 5). The task is to minimise the given objective from a randomly sampled initialisation, x_{1,2} ∼ U(−3, 3). During meta-training, we train on a task for 100 steps using a learning rate of 0.1. Each task has a unique loss surface that the learner traverses from the randomly sampled initialisation. While each loss surface is unique, they share an underlying structure.
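This task distribution can be sampled in a few lines. The sketch below follows the formula and sampling distributions stated above; the function names are ours.

```python
import numpy as np

# Sample a synthetic 2-D loss surface f_tau from the stated distribution:
# s ~ Cat(1..10), a_i ~ Cat(-1, 0, 1), b_i ~ Cat(-5..5).
rng = np.random.default_rng(0)

def sample_task():
    s = rng.integers(1, 11)
    a = rng.integers(-1, 2, size=3)
    b = rng.integers(-5, 6, size=3)
    def f(x1, x2):
        t1 = b[0] * (a[0] - x1) ** 2 * np.exp(-x1 ** 2 - (x2 + a[1]) ** 2)
        t2 = b[1] * (x1 / s - x1 ** 3 - x2 ** 5) * np.exp(-x1 ** 2 - x2 ** 2)
        t3 = b[2] * np.exp(-(x1 + a[2]) ** 2 - x1 ** 2)
        return t1 - t2 - t3
    return f

f_tau = sample_task()
x1, x2 = rng.uniform(-3, 3, size=2)  # random initialisation, U(-3, 3)
loss = f_tau(x1, x2)
```

Gradient descent on `f_tau` from the sampled `(x1, x2)` then reproduces the per-task trajectories described in the text.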
Thus, by meta-learning a warp over trajectories on randomly sampled loss surfaces, we expect WarpGrad to learn a warp that is close to invariant to spurious descent directions. In particular, WarpGrad should produce a smooth warped space that is quasi-convex for any given task, ensuring that the task-learner finds a minimum as fast as possible regardless of initialisation.

To visualise the geometry, we use an explicit warp Ω defined by a 2-layer feed-forward network with a hidden-state size of 30 and tanh non-linearities. We train warp parameters for 100 meta-training steps; in each meta-step we sample a new task surface and a mini-batch of 10 random initialisations that we train separately. We train to convergence and accumulate the warp meta-gradient online (Algorithm 1). We evaluate against gradient descent on randomly sampled loss surfaces (Figure 7). Both optimisers start from the same initialisation, chosen such that standard gradient descent struggles; we expect the WarpGrad optimiser to learn a geometry that is robust to the initialisation (top row). This is indeed what we find; the geometry learned by WarpGrad smoothly warps the native loss surface into a well-behaved space where gradient descent converges to a local minimum.

Figure 7: Example trajectories on three task loss surfaces. We start gradient descent (black) and WarpGrad (magenta) from the same initialisation; while SGD struggles with the curvature, the WarpGrad optimiser has learned a warp such that gradient descent in the representation space (top) leads to rapid convergence in model parameter space (bottom).

E OMNIGLOT

We follow the protocol of Flennerhag et al. (2019), including the choice of hyper-parameters unless otherwise noted. In this setup, each of the 50 alphabets that comprise the dataset constitutes a distinct task. Each task is treated as a 20-way classification problem.
Four alphabets have fewer than 20 characters and are discarded, leaving 46 alphabets in total. 10 alphabets are held out for final meta-testing; which alphabets are held out depends on the seed, to account for variations across alphabets; we train and evaluate all baselines on 10 seeds. For each character in an alphabet, there are 20 raw samples. Of these, 5 are held out for final evaluation on the task while the remainder is used to construct a training set. Raw samples are pre-processed by random affine transformations in the form of (a) scaling between [0.8, 1.2], (b) rotation [0, 360), and (c) cropping height and width by a factor of [−0.2, 0.2] in each dimension. This ensures tasks are too hard for few-shot learning. During task adaptation, mini-batches are sampled at random without ensuring class balance (in contrast to few-shot classification protocols (Vinyals et al., 2016)). Note that benchmarks under this protocol are not compatible with few-shot learning benchmarks.

We use the same convolutional neural network architecture and hyper-parameters as in Flennerhag et al. (2019). This learner stacks a convolutional block, comprised of a 3×3 convolution with 64 filters followed by 2×2 max-pooling, batch normalisation, and ReLU activation, four times. All images are down-sampled to 28×28, resulting in a 1×1×64 feature map that is passed on to a final linear layer. We create a Warp-Leap meta-learner that inserts warp-layers between each convolutional block, W ∘ ω^4 ∘ h^4 ∘ ⋯ ∘ ω^1 ∘ h^1, where each h is defined as above. In our main experiment, each ω^i is simply a 3×3 convolutional layer with zero padding; in Appendix F we consider both simpler and more sophisticated versions. We find that relatively simple warp-layers do quite well. However, adding capacity does improve generalisation performance.
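The interleaving W ∘ ω^4 ∘ h^4 ∘ ⋯ ∘ ω^1 ∘ h^1 can be made explicit with a minimal sketch. The real h's are convolutional blocks and the ω's are 3×3 convolutions; below, both are toy scale-and-ReLU maps so that only the composition structure is shown, and all weight values are illustrative.

```python
import numpy as np

# Toy stand-in for a conv + ReLU block: elementwise scale followed by ReLU.
def block(w):
    return lambda x: np.maximum(0.0, w * x)

task_layers = [block(w) for w in (1.0, 0.5, 2.0, 1.5)]  # h^1..h^4 (adapted per task)
warp_layers = [block(w) for w in (1.1, 0.9, 1.2, 0.8)]  # w^1..w^4 (fixed during adaptation)

def forward(x):
    for h, omega in zip(task_layers, warp_layers):
        x = omega(h(x))  # a warp-layer follows every task block
    return x.sum()       # stand-in for the final linear layer W

out = forward(np.ones(4))
```

During task adaptation only `task_layers` would be updated; gradients flowing back through the fixed `warp_layers` are what preconditions the task-layer updates.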
We meta-learn the initialisation of task parameters using the Leap objective (Eq. 16), detailed in Appendix C. Both L^τ_meta and L^τ_task are defined as the negative log-likelihood loss; importantly, we evaluate them on different batches of task data to ensure warp-layers encourage generalisation. We found no additional benefit in this experiment from using held-out data to evaluate L^τ_meta. We use the offline meta-training algorithm (Appendix B, Algorithm 2); in particular, during meta-training, we sample mini-batches of 20 tasks and train task-learners for 100 steps to collect 2000 task parameterisations into a replay buffer. Task-learners share a common initialisation and warp parameters that are held fixed during task adaptation. Once collected, we iterate over the buffer by randomly sampling mini-batches of task parameterisations without replacement. Unless otherwise noted, we used a batch size of η = 1. For each mini-batch, we update φ by applying gradient descent under the canonical meta-objective (Eq. 11), where we evaluate L^τ_meta on a randomly sampled mini-batch of data from the corresponding task.

Figure 8: Omniglot results. Top: test accuracies on held-out tasks after meta-training on a varying number of tasks. Bottom: AUC under the accuracy curve on held-out tasks after meta-training on a varying number of tasks. Shading represents standard deviation across 10 independent runs. We compare Warp-Leap, Leap, Reptile, and multi-headed finetuning, as well as SGD and KFAC, which used random initialisation but a 10x larger learning rate.
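The buffer mechanics just described (20 tasks × 100 steps collected, then consumed without replacement in meta-batches of size η = 1) can be sketched as follows; the `(task, step)` tuples stand in for full parameter vectors.

```python
import random

# Sketch of the offline meta-training buffer (Algorithm 2): collect all task
# parameterisations, then sample them without replacement in meta-batches.
buffer = [(task, step) for task in range(20) for step in range(100)]
rng = random.Random(0)
rng.shuffle(buffer)  # shuffling + sequential slicing = without replacement

eta = 1  # meta-batch size used in the experiment
meta_batches = [buffer[i:i + eta] for i in range(0, len(buffer), eta)]
```

Every collected parameterisation is visited exactly once per pass over the buffer, which is why a single meta-batch of tasks can yield up to 2000 meta-gradient steps.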
Consequently, for each meta-batch, we take (up to) 2000 meta-gradient steps on warp parameters φ. We find that this form of mini-batching causes the meta-training loop to converge much faster and induces no discernible instability.

We compare Warp-Leap against no meta-learning with standard gradient descent (SGD) or KFAC (Martens & Grosse, 2015). We also benchmark against baselines provided in Flennerhag et al. (2019): Leap, Reptile (Nichol et al., 2018), MAML, and multi-headed fine-tuning. All learners benefit substantially from large batch sizes, as this enables higher learning rates. To render no pre-training a competitive option within a fair computational budget, we allow SGD and KFAC to use 4x larger batch sizes, enabling 10x larger learning rates.

Table 2: Mean test accuracy (%) after 100 training steps on held-out evaluation tasks. † Multi-headed. ‡ No meta-training, but 10x larger learning rates.

No. meta-training tasks | WarpGrad | Leap | Reptile | Finetuning† | MAML | KFAC‡ | SGD‡
 1 | 49.5 ± 7.8 | 37.6 ± 4.8 | 40.4 ± 4.0 | 53.8 ± 5.0 | 40.0 ± 2.6 | 56.0 | 51.0
 3 | 68.8 ± 2.8 | 53.4 ± 3.1 | 53.1 ± 4.2 | 64.6 ± 3.3 | 48.6 ± 2.5 | 56.0 | 51.0
 5 | 75.0 ± 3.6 | 59.5 ± 3.7 | 58.3 ± 3.3 | 67.7 ± 2.8 | 51.6 ± 3.8 | 56.0 | 51.0
10 | 81.2 ± 2.4 | 67.4 ± 2.4 | 65.0 ± 2.1 | 71.3 ± 2.0 | 54.1 ± 2.8 | 56.0 | 51.0
15 | 82.7 ± 3.3 | 70.0 ± 2.4 | 66.6 ± 2.9 | 73.5 ± 2.4 | 54.8 ± 3.4 | 56.0 | 51.0
20 | 82.0 ± 2.6 | 73.3 ± 2.3 | 69.4 ± 3.4 | 75.4 ± 3.2 | 56.6 ± 2.0 | 56.0 | 51.0
25 | 83.8 ± 1.9 | 74.8 ± 2.7 | 70.8 ± 1.9 | 76.4 ± 2.2 | 56.7 ± 2.1 | 56.0 | 51.0

F ABLATION STUDY: WARP-LAYERS, META-OBJECTIVE, AND META-TRAINING

WarpGrad provides a principled approach to model-informed meta-learning and offers several degrees of freedom.
To evaluate these design choices, we conduct an ablation study on Warp-Leap in which we vary the design of warp-layers as well as the meta-training approach. For the ablation study, we fixed the number of pre-training tasks to 25 and report final test accuracy over 4 independent runs. All ablations use the same hyper-parameters, except for online meta-training, which uses a learning rate of 0.001.

First, we vary the meta-training protocol by (a) using the approximate objective (Eq. 12), (b) using online meta-training (Algorithm 1), and (c) testing whether meta-learning the learning rate used for task adaptation is beneficial in this experiment. We meta-learn a single scalar learning rate (as warp parameters can learn layer-wise scaling). Meta-gradients for the learning rate are clipped at 0.001 and we use a learning rate of 0.001. Note that when using offline meta-training, we store both task parameterisations and the momentum buffer in that phase and use them in the update rule when computing the canonical objective (Eq. 11).

Further, we vary the architecture used for warp-layers. We study simpler versions that use channel-wise scaling and more complex versions that use non-linearities and residual connections. We also evaluate a version where each warp-layer has two stacked convolutions, the first outputting 128 filters and the second outputting 64 filters. Finally, in the two-layer warp architecture, we evaluate a version that inserts a FiLM layer between the two warp convolutions. These are adapted during task training from a 0 initialisation; they amount to task embeddings that condition gradient warping on task statistics. Full results are reported in Table 3.

G ABLATION STUDY: WARPGRAD AND NATURAL GRADIENT DESCENT

Table 4: Ablation study: mean test accuracy after 100 training steps on held-out evaluation tasks from a random initialisation.
Mean and standard deviation over 4 seeds.

Method | Preconditioning | Accuracy
SGD | None | 40.1 ± 6.1
KFAC (NGD) | Linear (block-diagonal) | 58.2 ± 3.2
WarpGrad | Linear (block-diagonal) | 68.0 ± 4.4
WarpGrad | Non-linear (full) | 81.3 ± 4.0

Here, we perform ablation studies to compare the geometry that a WarpGrad optimiser learns to the geometry that Natural Gradient Descent (NGD) methods represent (approximately). For consistency, we run the ablation on Omniglot. As computing the true Fisher Information Matrix is intractable, we compare WarpGrad against two common block-diagonal approximations, KFAC (Martens & Grosse, 2015) and Natural Neural Nets (Desjardins et al., 2015).

First, we isolate the effect of warping task loss surfaces by fixing a random initialisation and only meta-learning warp parameters; that is, in this experiment we set λC(θ_0) = 0. We compare against two baselines, stochastic gradient descent (SGD) and KFAC, both trained from a random initialisation. We use task mini-batch sizes of 200 and task learning rates of 1.0; otherwise, we use the same hyper-parameters as in the main experiment. For WarpGrad, we meta-train with these hyper-parameters as well. We evaluate two WarpGrad architectures: in one, we use linear warp-layers, which gives a block-diagonal preconditioning, as in KFAC; in the other, we use our most expressive warp configuration from the ablation experiment in Appendix F, where each warp-layer is a two-layer convolutional block with residual connections, batch normalisation, and ReLU activation. We find that warped geometries facilitate task adaptation on held-out tasks to a greater degree than either SGD or KFAC by a significant margin (Table 4). We further find that going beyond block-diagonal preconditioning yields a significant improvement in performance.

Table 3: Ablation study: mean test accuracy after 100 training steps on held-out evaluation tasks. Mean and standard deviation over 4 independent runs. Offline refers to offline meta-training (Appendix B), online to online meta-training (Algorithm 1); full denotes Eq. 11 and approx denotes Eq. 12. † Batch Normalisation (Ioffe & Szegedy, 2015); ‡ equivalent to FiLM layers (Perez et al., 2018); § residual connection (He et al., 2016), when combined with BN similar to the Residual Adapter architecture (Rebuffi et al., 2017); ¶ FiLM task embeddings.

Architecture | Meta-training | Meta-objective | Accuracy
None (Leap) | Online | None | 74.8 ± 2.7
3×3 conv (default) | Offline | full (L, Eq. 11) | 84.4 ± 1.7
3×3 conv | Offline | approx (L̂, Eq. 12) | 83.1 ± 2.7
3×3 conv | Online | full | 76.3 ± 2.1
3×3 conv | Offline | full, learned α | 83.1 ± 3.3
Scaling‡ | Offline | full | 77.5 ± 1.8
1×1 conv | Offline | full | 79.4 ± 2.2
3×3 conv + ReLU | Offline | full | 83.4 ± 1.6
3×3 conv + BN† | Offline | full | 84.7 ± 1.7
3×3 conv + BN† + ReLU | Offline | full | 85.0 ± 0.9
3×3 conv + BN† + Res§ + ReLU | Offline | full | 86.3 ± 1.1
2-layer 3×3 conv + BN† + Res§ | Offline | full | 88.0 ± 1.0
2-layer 3×3 conv + BN† + Res§ + TA¶ | Offline | full | 88.1 ± 1.0

Figure 9: Ablation study. Left: comparison of mean activation value E[h(x)] across layers, pre- and post-warping. Right: Schatten-1 norm of Cov(h(x), h(x)) − I, pre- and post-warping. Statistics are gathered on the held-out test set and averaged over tasks and adaptation steps.

Second, we explore whether the geometry that we meta-learn under the full Warp-Leap algorithm is approximately Fisher. In this experiment, we use the main Warp-Leap architecture. We use a meta-learner trained on 25 tasks and evaluate it on 10 held-out tasks.
Because warp-layers are linear in this configuration, if the learned geometry is approximately Fisher, post-warp activations should be zero-centred and the layer-wise covariance matrix should satisfy Cov(ω^(i)(h^(i)(x)), ω^(i)(h^(i)(x))) = I, where I is the identity matrix (Desjardins et al., 2015). If true, Warp-Leap would learn a block-diagonal approximation to the inverse Fisher matrix, as Natural Neural Nets do.

To test this, during task adaptation on held-out tasks, we compute the mean activation in each convolutional layer pre- and post-warping. We also compute the Schatten-1 norm of the difference between the layer activation covariance and the identity matrix, pre- and post-warping, as described above. We average statistics over tasks and adaptation steps (we found no significant variation in these dimensions). Figure 9 summarises our results. We find that, in general, Warp-Leap has zero-centred post-warp activations. That pre-warp activations are positive is an artefact of the ReLU activation function. However, we find that the correlation structure is significantly different from what we would expect if Warp-Leap were to represent the Fisher matrix; post-warp covariances are significantly dissimilar from the identity matrix and vary across layers.

These results indicate that WarpGrad methods behave distinctly differently from Natural Gradient Descent methods. One possibility is that WarpGrad methods do approximate the Fisher Information Matrix, but with higher accuracy than other methods. A more likely explanation is that WarpGrad methods encode a different geometry, since they can learn to leverage global information beyond the task at hand, which enables them to express geometries that standard Natural Gradient Descent cannot.
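The diagnostic behind Figure 9 can be reproduced in a few lines. The sketch below is ours, not the paper's implementation: it computes the mean activation and the Schatten-1 (nuclear) norm of Cov(h, h) − I for a batch of activations; on whitened inputs (zero mean, identity covariance), both statistics are near zero, which is the post-warp signature Natural Neural Nets would predict.

```python
import numpy as np

def diagnostics(acts):
    # acts: (batch, features) activations from one layer
    mean_act = float(acts.mean())
    dev = np.cov(acts, rowvar=False) - np.eye(acts.shape[1])
    # Schatten-1 (nuclear) norm = sum of singular values of the deviation
    schatten1 = float(np.linalg.svd(dev, compute_uv=False).sum())
    return mean_act, schatten1

rng = np.random.default_rng(0)
whitened = rng.normal(size=(20000, 4))  # stand-in for post-warp activations
m, s = diagnostics(whitened)
```

Running the same function on raw ReLU activations (positive mean, non-identity covariance) would yield the large pre-warp values reported in the figure.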
H miniIMAGENET AND tieredIMAGENET

miniImageNet. This dataset is a subset of 100 classes sampled randomly from the 1000 base classes in the ILSVRC-12 training set, with 600 images for each class. Following Ravi & Larochelle (2016), classes are split into non-overlapping meta-training, meta-validation and meta-test sets with 64, 16, and 20 classes in each, respectively.

tieredImageNet. As described in Ren et al. (2018), this dataset is a subset of ILSVRC-12 that stratifies 608 classes into 34 higher-level categories in the ImageNet human-curated hierarchy (Deng et al., 2009). In order to increase the separation between meta-training and meta-evaluation splits, 20 of these categories are used for meta-training, while 6 and 8 are used for meta-validation and meta-testing, respectively. Slicing the class hierarchy closer to the root creates more similarity within each split, and correspondingly more diversity between splits, rendering the meta-learning problem more challenging. High-level categories are further divided into 351 classes used for meta-training, 97 for meta-validation and 160 for meta-testing, for a total of 608 base categories. All the training images in ILSVRC-12 for these base classes are used to generate problem instances for tieredImageNet, of which there are a minimum of 732 and a maximum of 1300 images per class.

For all experiments, N-way K-shot classification problem instances were sampled following the standard image classification methodology for meta-learning proposed in Vinyals et al. (2016). A subset of N classes was sampled at random from the corresponding split. For each class, K arbitrary images were chosen without replacement to form the training dataset of that problem instance. As usual, a disjoint set of L images per class were selected for the validation set.
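The N-way K-shot sampling procedure just described can be sketched directly; the toy dataset of image identifiers below is illustrative, and the helper name is ours.

```python
import random

# Sample one N-way K-shot problem instance: N classes at random, then K
# training and L validation images per class, drawn without replacement so
# that the two sets are disjoint (Vinyals et al., 2016).
def sample_task(class_to_images, N, K, L, rng):
    classes = rng.sample(sorted(class_to_images), N)
    train, val = [], []
    for label, cls in enumerate(classes):
        imgs = rng.sample(class_to_images[cls], K + L)
        train += [(img, label) for img in imgs[:K]]
        val += [(img, label) for img in imgs[K:]]
    return train, val

# toy "dataset": 10 classes with 30 image identifiers each
data = {c: [f"{c}_{i}" for i in range(30)] for c in range(10)}
rng = random.Random(0)
train, val = sample_task(data, N=5, K=1, L=15, rng=rng)
```

For the 5-way 1-shot setting this yields 5 training examples and 5 × 15 = 75 validation examples per problem instance.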
Few-shot classification. In these experiments we used the established experimental protocol for evaluation in meta-validation and meta-testing: 600 task instances were selected, all using N = 5, K = 1 or K = 5, as specified, and L = 15. During meta-training we used N = 5, K = 5 or K = 15, respectively, and L = 15.

Task-learners used 4 convolutional blocks with 128 filters (or fewer, chosen by hyper-parameter tuning), 3×3 kernels and strides set to 1, followed by batch normalisation with learned scales and offsets, a ReLU non-linearity and 2×2 max-pooling. The output of the convolutional stack (5×5×128) was flattened and mapped, using a linear layer, to the 5 output units. The last 3 convolutional layers were followed by warp-layers with 128 filters each. Only the final 3 task-layer parameters and their corresponding scale and offset batch-norm parameters were adapted during task training, with the corresponding warp-layers and the initial convolutional layer kept fixed and meta-learned using the WarpGrad objective. Note that, with the exception of CAVIA, other baselines do worse with 128 filters as they overfit; MAML and T-Nets achieve 46% and 49% 5-way 1-shot test accuracy with 128 filters, compared to their best reported results (48.7% and 51.7%, respectively).

Hyper-parameters were tuned independently for each condition using random grid search for highest test accuracy on meta-validation left-out tasks. Grid sizes were 50 for all experiments. We chose the optimal hyper-parameters (using early stopping at the meta-level) in terms of meta-validation test set accuracy for each condition, and we report test accuracy on the meta-test set of tasks. 60,000 meta-training steps were performed using meta-gradients over a single randomly selected task instance and its entire trajectory of 5 adaptation steps.
Task-specific adaptation was done using stochastic gradient descent without momentum. We use Adam (Kingma & Ba, 2015) for meta-updates.

Multi-shot classification. For these experiments we used N = 10, K = 640 and L = 50. Task-learners are defined similarly, but stack 6 convolutional blocks defined by 3×3 kernels and strides set to 1, followed by batch normalisation with learned scales and offsets, a ReLU non-linearity and 2×2 max-pooling (first 5 layers). The sizes of convolutional layers were chosen by hyper-parameter tuning to {64, 64, 160, 160, 256, 256}. The output of the convolutional stack (2×2×256) was flattened and mapped, using a linear layer, to the 10 output units.

Hyper-parameters were tuned independently for each algorithm, version, and baseline using random grid search for highest test accuracy on meta-validation left-out tasks. Grid sizes were 200 for all multi-shot experiments. We chose the optimal hyper-parameters in terms of mean meta-validation test set accuracy AUC (using early stopping at the meta-level) for each condition, and we report test accuracy on the meta-test set of tasks. 2000 meta-training steps were performed using averaged meta-gradients over 5 random task instances and their entire inner-loop trajectories of 100 adaptation steps with batch size 64. Task-specific adaptation was done using stochastic gradient descent with momentum (0.9). Meta-gradients were passed to Adam in the outer loop.

We test WarpGrad against Leap, Reptile, and training from scratch with large batches and tuned momentum. We tune all meta-learners for optimal performance on the validation set. WarpGrad outperforms all baselines both in terms of rate of convergence and final test performance (Figure 10).

Figure 10: Multi-shot tieredImageNet results. Top: mean learning curves (test classification accuracy) on held-out meta-test tasks.
Bottom: mean test classification performance on held-out meta-test tasks during meta-training. Training from scratch is omitted as it is not meta-trained.

I MAZE NAVIGATION

To illustrate how WarpGrad may be used both with recurrent neural networks in an online meta-learning setting and in a reinforcement learning environment, we evaluate it on a maze navigation task proposed by Miconi et al. (2018). The environment is a fixed maze and a task is defined by randomly choosing a goal location in the maze. During a task episode of length 200, the goal location is fixed but the agent gets teleported once it finds it. Thus, during an episode the agent must first locate the goal, then return to it as many times as possible, each time being randomly teleported to a new starting location. We use an identical setup to Miconi et al. (2019), except that our grid is of size 11×11 as opposed to 9×9. We compare our Warp-RNN to Learning to Reinforcement Learn (Wang et al., 2016) and Hebbian meta-learners (Miconi et al., 2018; 2019).

The task-learner in all cases is an advantage actor-critic (Wang et al., 2016), where the actor and critic share an underlying basic RNN whose hidden state is projected into a policy and a value function by two separate linear layers. The RNN has a hidden-state size of 100 and tanh non-linearities. Following Miconi et al. (2019), for all benchmarks we train the task-learner using Adam with a learning rate of 1e−3 for 200,000 steps using batches of 30 episodes, each of length 200. Meta-learning arises in this setting as each episode encodes a different task, since the goal location moves; by learning across episodes, the RNN encodes meta-information in its parameters that it can leverage during task adaptation (via its hidden state (Hochreiter & Schmidhuber, 1997; Wang et al., 2016)). See Miconi et al. (2019) for further details.
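The modulation scheme used by the Warp-RNN (defined formally in Eq. (20) below) can be previewed with a minimal sketch. Here the meta-LSTM is replaced by a fixed vector z standing in for its hidden state z_t; each projection P maps z to a diagonal warp U = diag(tanh(P z)), represented as a vector applied elementwise. All sizes, scales, and names are illustrative, and this is a sketch of the cell only, not of the training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_z = 8, 3, 4

# task-adaptable parameters W, V, b
V = rng.normal(scale=0.3, size=(n_h, n_h))
W = rng.normal(scale=0.3, size=(n_h, n_x))
b = rng.normal(scale=0.3, size=n_h)

# meta-learned warp projections P (one per warp matrix U in the cell)
P = {"h1": rng.normal(scale=0.3, size=(n_h, n_z)),
     "h2": rng.normal(scale=0.3, size=(n_h, n_z)),
     "x1": rng.normal(scale=0.3, size=(n_x, n_z)),
     "x2": rng.normal(scale=0.3, size=(n_h, n_z)),
     "b":  rng.normal(scale=0.3, size=(n_h, n_z))}

def warp(name, z):
    return np.tanh(P[name] @ z)  # diagonal warp U = diag(tanh(P z))

def warp_rnn_step(h, x, z):
    pre = (warp("h2", z) * (V @ (warp("h1", z) * h))
           + warp("x2", z) * (W @ (warp("x1", z) * x))
           + warp("b", z) * b)
    return np.tanh(pre)

z = rng.normal(size=n_z)  # stand-in for the meta-LSTM hidden state
h = np.zeros(n_h)
for t in range(5):
    h = warp_rnn_step(h, rng.normal(size=n_x), z)
```

During task adaptation only V, W, and b would be updated; the projections P (and the meta-LSTM producing z) are the frozen warp parameters.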
We design a Warp-RNN by introducing a warp-layer in the form of an LSTM that is frozen for most of the training process. Following Flennerhag et al. (2018), we use this meta-LSTM to modulate the task RNN. Given an episode with input vector x_t, the task RNN is defined by

$$
h_t = \tanh\!\big( U^2_{h,t} V U^1_{h,t} h_{t-1} + U^2_{x,t} W U^1_{x,t} x_t + U^b_t b \big),
\tag{20}
$$

where W, V, b are task-adaptable parameters; each U^(i)_{j,t} is a diagonal warp matrix produced by projecting from the hidden state of the meta-LSTM, U^(i)_{j,t} = diag(tanh(P^(i)_j z_t)), where z_t is the hidden state of the meta-LSTM. See Flennerhag et al. (2018) for details. Thus, our Warp-RNN is a form of HyperNetwork (see Figure 6, Appendix A). Because the meta-LSTM is frozen for most of the training process, task-adaptable parameters correspond to those of the baseline RNN. To control for the capacity of the meta-LSTM, we also train a HyperRNN where the LSTM is updated with every task adaptation; we find this model does worse than the Warp-RNN.

We also compare the non-linear preconditioning that we obtain in our Warp-RNN to the linear forms of preconditioning defined in prior works. We implement a T-Nets-RNN meta-learner, defined by embedding meta-learned linear projections T_h, T_x and T_b in the task RNN: h_t = tanh(T_h V h_{t−1} + T_x W x_t + b). Note that we cannot backpropagate to these meta-parameters as per the T-Nets (MAML) framework. Instead, we train T_h, T_x, T_b with the meta-objective and meta-training algorithm we use for the Warp-RNN. The T-Nets-RNN does worse than the baseline RNN and generally fails to learn.

We meta-train the Warp-RNN using the continual meta-training algorithm (Algorithm 3; see Appendix B for details), which accumulates meta-gradients continuously during training. Because task training is a continuous stream of batches of episodes, we accumulate the meta-gradient using the approximate objective (Eq. 12, where L^τ_task and L^τ_meta are both the same advantage actor-critic objective) and update warp parameters on every 30th task parameter update. We detail the meta-objective in Appendix C (see Eq. 18). Our implementation of a Warp-RNN can be seen as meta-learning "slow" weights to facilitate learning of "fast" weights (Schmidhuber, 1992; Mujika et al., 2017). Implementing the Warp-RNN requires four lines of code on top of the standard training script. The task-learner is the same in all experiments, with the same number of learnable parameters and hidden-state size. Compared to all baselines, we find that the Warp-RNN converges faster and achieves a higher cumulative reward (Figure 4 and Figure 11).

Figure 11: Mean cumulative return for the maze navigation task over 200,000 training steps. Shading represents inter-quartile ranges across 10 independent runs. † Simple modulation and ‡ retroactive modulation, respectively (Miconi et al., 2019).

J META-LEARNING FOR CONTINUAL LEARNING

Online SGD and related optimisation methods tend to adapt neural network models to the data distribution encountered last during training, usually leading to what has been termed "catastrophic forgetting" (French, 1999). In this experiment, we investigate whether WarpGrad optimisers can meta-learn to avoid this problem altogether and directly minimise the joint objective over all tasks with every update, in the fully online learning setting where no past data is retained.
Continual Sine Regression. We propose a continual learning version of the sine regression meta-learning experiment in Finn et al. (2017). We split the input interval [−5, 5] ⊂ R evenly into 5 consecutive sub-intervals, corresponding to 5 regression tasks. These are presented one at a time to a task-learner, which adapts to each sub-task using 20 gradient steps on data from the given sub-task only. Batch sizes were set to 5 samples. Sub-tasks thus differ in their input domain. A task sequence is defined by a target function composed of two randomly mixed sine functions of the form g_{a_i,b_i}(x) = a_i sin(x − b_i), each with randomly sampled amplitude a_i ∈ [0.1, 5] and phase b_i ∈ [0, π]. A task τ = (a_1, b_1, a_2, b_2, o) is therefore defined by sampling the parameters that specify this mixture; a task specifies a target function g_τ by

$$
g_\tau(x) = \alpha_o(x)\, g_{a_1, b_1}(x) + \big(1 - \alpha_o(x)\big)\, g_{a_2, b_2}(x),
\tag{21}
$$

where α_o(x) = σ(x + o) for a randomly sampled offset o ∈ [−5, 5], with σ being the sigmoid activation function.

Model. We define a task-learner as a 4-layer feed-forward network with hidden-layer size 200 and ReLU non-linearities to learn the mapping between inputs and regression targets, f(·, θ, φ). For each task sequence τ, a task-learner is initialised from a fixed random initialisation θ_0 (that is not meta-learned). Each non-linearity is followed by a residual warping block consisting of a 2-layer feed-forward network with 100 hidden units and tanh non-linearities, with meta-learned parameters φ that are fixed during the task adaptation process.

Continual learning as task adaptation. The input domain of the task target function g_τ is partitioned into 5 sub-tasks.
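A task sequence from the distribution of Eq. (21) can be sampled as follows; this is a sketch using the stated ranges, with function names of our choosing.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # a_i in [0.1, 5], b_i in [0, pi], offset o in [-5, 5]
    a = rng.uniform(0.1, 5.0, size=2)
    b = rng.uniform(0.0, np.pi, size=2)
    o = rng.uniform(-5.0, 5.0)
    sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
    def g(x):
        alpha = sigmoid(x + o)  # mixing weight alpha_o(x)
        return (alpha * a[0] * np.sin(x - b[0])
                + (1.0 - alpha) * a[1] * np.sin(x - b[1]))
    return g

g_tau = sample_task()
xs = np.linspace(-5.0, 5.0, 11)  # spans all 5 sub-intervals
ys = g_tau(xs)
```

Restricting `xs` to one of the 5 sub-intervals then yields the data stream a single sub-task exposes to the task-learner.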
The task-learner sees one partition at a time and is given n = 20 gradient steps to adapt, for a total of K = 100 steps of online gradient descent updates for the full task sequence; recall that every such sequence starts from a fixed random initialisation θ_0. The adaptation is completely online since at each step k = 1, …, K we sample a new mini-batch D^k_task of 5 samples from a single sub-task (sub-interval). The data distribution changes after every n = 20 steps, with inputs x coming from the next sub-interval and targets from the same function g_τ(x). During meta-training we always present tasks in the same order, presenting intervals from left to right. The online (sub-)task loss is defined on the current mini-batch D^k_task at step k:

$$
\mathcal{L}^{\tau}_{\text{task}}\big(\theta^\tau_k, D^k_{\text{task}}; \phi\big) = \frac{1}{2\,|D^k_{\text{task}}|} \sum_{x \in D^k_{\text{task}}} \big( f(x, \theta^\tau_k; \phi) - g_\tau(x) \big)^2.
\tag{22}
$$

Adaptation to each sub-task uses sub-task data only to form task parameter updates θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k, D^k_task; φ). We used a constant learning rate α = 0.001. Warp parameters φ are fixed across the full task sequence during adaptation and are meta-learned across random samples of task sequences, which we describe next.

Meta-learning an optimiser for continual learning. To investigate the ability of WarpGrad to learn an optimiser for continual learning that mitigates catastrophic forgetting, we fix a random initialisation prior to meta-training that is not meta-learned; every task-learner is initialised with these parameters. To meta-learn an optimiser for continual learning, we need a meta-objective that encourages such behaviour. Here, we take a first step towards a framework for meta-learned continual learning. We define the meta-objective L^τ_meta as an incremental multitask objective that, for each sub-task τ_t in a given task sequence τ, averages the validation sub-task losses (Eq. 22) for the current and every preceding sub-task in the task sequence.
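The equal-weighting property of this incremental objective (made precise in Eq. (23) below) can be checked numerically. In the sketch, n and T follow the experiment; sub-task i contributes to the meta-objective at every parameterisation θ^τ_t with t ≥ i, and there are n parameterisations per sub-task, so weighting its loss by 1/(n(T − i + 1)) gives every sub-task the same total weight over a task sequence.

```python
T, n = 5, 20  # sequence length and adaptation steps per sub-task

def total_weight(i, n, T):
    # n parameterisations for each t in {i, ..., T}, each weighting the
    # sub-task-i validation loss by 1 / (n * (T - i + 1))
    return sum(n * (1.0 / (n * (T - i + 1))) for t in range(i, T + 1))

weights = [total_weight(i, n, T) for i in range(1, T + 1)]
```

An unweighted sum would instead over-weight early sub-tasks (the first appears nT times, the last only n times), which is the bias the normalisation removes.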
The task meta-objective is obtained by summing over all sub-tasks in the task sequence. For a parameterisation $\theta^\tau_t$ obtained while adapting to sub-task $t$, we have

$$\mathcal{L}^\tau_{\text{meta}}\big(\theta^\tau_t; \phi\big) = \sum_{i=1}^{t} \frac{1}{n (T - i + 1)}\, \mathcal{L}^\tau_{\text{task}}\big(\theta^\tau_t, \mathcal{D}^i_{\text{val}}; \phi\big). \qquad (23)$$

As before, the full meta-objective is an expectation over the joint task parameter distribution (Eq. 11); for further details on the meta-objective, see Appendix C, Eq. 19. This meta-objective gives equal weight to all sub-tasks in the sequence by averaging each sub-task's validation loss over all steps during which that sub-task should be learned or remembered. For example, losses from the first sub-task, defined on the interval $[-5, -3]$, appear $nT$ times in the meta-objective. Conversely, the last sub-task in a sequence, defined on the interval $[3, 5]$, is learned only during the last $n = 20$ steps of task adaptation and hence appears $n$ times. Normalising by the number of appearances corrects for this bias. We trained warp-parameters using Adam with a meta-learning rate of 0.001, sampling 5 random task sequences to form a meta-batch and repeating the process for 20,000 steps of meta-training.

Results  Figure 12 shows a breakdown of the validation loss across the 5 sequentially learned sub-tasks over the 100 steps of online learning during task adaptation. Results are averaged over 100 random regression problem instances. The meta-learned WarpGrad optimiser reduces the loss on the sub-task currently being learned in each interval while largely retaining performance on previous sub-tasks. There is an immediate but relatively minor loss of performance on previous sub-tasks, after which their performance is retained. We hypothesise that this is because the meta-objective averages over the full learning curve, as opposed to only the performance once a task has been adapted to; the WarpGrad optimiser may therefore tolerate some degree of performance loss.
Intriguingly, in all cases, after the initial spike in previous sub-task losses when switching to a new sub-task, the loss partially reverts towards optimal performance, suggesting that the WarpGrad optimiser facilitates positive backward transfer without this being explicitly enforced in the meta-objective. Deriving a principled meta-objective for continual learning is an exciting area for future research.

Figure 12: Continual learning regression experiment. Average log-loss over 100 randomly sampled task sequences. Each sequence contains 5 sub-tasks learned (a) sequentially, in the order seen during meta-training, or (b) in random order [sub-tasks 1, 3, 4, 2, 0]. We train on each sub-task for 20 steps, for a total of K = 100 task adaptation steps.

Figure 13: Continual learning regression: evaluation after partial task adaptation. We plot the ground truth (black), task-learner predictions before adaptation (dashed green), and task-learner predictions after adaptation (red). Each row illustrates how task-learner predictions evolve after training on sub-tasks up to and including the current one (shown in each panel). In (a), sub-tasks are presented in the order seen during meta-training; in (b), sub-tasks are presented at meta-test time in the random order [1, 3, 4, 2, 0].
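The appearance-count argument behind the meta-objective's normalisation can be checked numerically. In the sketch below (plain Python, helper names ours), sub-task $i$'s validation loss is used at every one of the $n$ adaptation steps of sub-tasks $i$ through $T$, i.e. $n(T - i + 1)$ times in total, so weighting it by $1/(n(T - i + 1))$ gives every sub-task unit total weight:

```python
def appearance_counts(T=5, n=20):
    """How many times sub-task i's validation loss enters the meta-objective:
    once per adaptation step of sub-tasks i..T, i.e. n*(T - i + 1) times."""
    return [n * (T - i + 1) for i in range(1, T + 1)]

def meta_weights(t, T=5, n=20):
    """Per-sub-task weights 1/(n*(T - i + 1)) used at current sub-task t."""
    return [1.0 / (n * (T - i + 1)) for i in range(1, t + 1)]
```

With $T = 5$ and $n = 20$, the first sub-task's loss appears $nT = 100$ times and the last sub-task's $n = 20$ times, matching the counts stated above; each weight times its count equals one.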
