Building Machines That Learn and Think Like People


Authors: Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman

In press at Behavioral and Brain Sciences.

Brenden M. Lake [1], Tomer D. Ullman [2,4], Joshua B. Tenenbaum [2,4], and Samuel J. Gershman [3,4]
[1] Center for Data Science, New York University
[2] Department of Brain and Cognitive Sciences, MIT
[3] Department of Psychology and Center for Brain Science, Harvard University
[4] Center for Brains, Minds and Machines

Abstract

Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.

1 Introduction

Artificial intelligence (AI) has been a story of booms and busts, yet by any traditional measure of success, the last few years have been marked by exceptional progress.
Much of this progress has come from recent advances in "deep learning," characterized by learning large neural-network-style models with multiple layers of representation. These models have achieved remarkable gains in many domains spanning object recognition, speech recognition, and control (LeCun, Bengio, & Hinton, 2015; Schmidhuber, 2015). In object recognition, Krizhevsky, Sutskever, and Hinton (2012) trained a deep convolutional neural network (convnet; LeCun et al., 1989) that nearly halved the error rate of the previous state of the art on the most challenging benchmark to date. In the years since, convnets continue to dominate, recently approaching human-level performance on some object recognition benchmarks (He, Zhang, Ren, & Sun, 2015; Russakovsky et al., 2015; Szegedy et al., 2014). In automatic speech recognition, hidden Markov models (HMMs) have been the leading approach since the late 1980s (Juang & Rabiner, 1990), yet this framework has been chipped away piece by piece and replaced with deep learning components (Hinton et al., 2012). Now, the leading approaches to speech recognition are fully neural network systems (Graves, Mohamed, & Hinton, 2013; Weng, Yu, Watanabe, & Juang, 2014). Ideas from deep learning have also been applied to learning complex control problems. V. Mnih et al. (2015) combined ideas from deep learning and reinforcement learning to make a "deep reinforcement learning" algorithm that learns to play large classes of simple video games from just frames of pixels and the game score, achieving human- or superhuman-level performance on many of these games (see also Guo, Singh, Lee, Lewis, & Wang, 2014; Schaul, Quan, Antonoglou, & Silver, 2016; Stadie, Levine, & Abbeel, 2016). These accomplishments have helped neural networks regain their status as a leading paradigm in machine learning, much as they were in the late 1980s and early 1990s.
The recent success of neural networks has captured attention beyond academia. In industry, companies such as Google and Facebook have active research divisions exploring these technologies, and object and speech recognition systems based on deep learning have been deployed in core products on smartphones and the web. The media has also covered many of the recent achievements of neural networks, often expressing the view that neural networks have achieved this recent success by virtue of their brain-like computation and thus their ability to emulate human learning and human cognition.

In this article, we view this excitement as an opportunity to examine what it means for a machine to learn or think like a person. We first review some of the criteria previously offered by cognitive scientists, developmental psychologists, and AI researchers. Second, we articulate what we view as the essential ingredients for building such a machine that learns or thinks like a person, synthesizing theoretical ideas and experimental data from research in cognitive science. Third, we consider contemporary AI (and deep learning in particular) in light of these ingredients, finding that deep learning models have yet to incorporate many of them and so may be solving some problems in different ways than people do. We end by discussing what we view as the most plausible paths towards building machines that learn and think like people. This includes prospects for integrating deep learning with the core cognitive ingredients we identify, inspired in part by recent work fusing neural networks with lower-level building blocks from classic psychology and computer science (attention, working memory, stacks, queues) that have traditionally been seen as incompatible.

Beyond the specific ingredients in our proposal, we draw a broader distinction between two different computational approaches to intelligence.
The statistical pattern recognition approach treats prediction as primary, usually in the context of a specific classification, regression, or control task. In this view, learning is about discovering features that have high-value states in common – a shared label in a classification setting or a shared value in a reinforcement learning setting – across a large, diverse set of training data. The alternative approach treats models of the world as primary, where learning is the process of model-building. Cognition is about using these models to understand the world, to explain what we see, to imagine what could have happened that didn't, or what could be true that isn't, and then planning actions to make it so. The difference between pattern recognition and model-building, between prediction and explanation, is central to our view of human intelligence. Just as scientists seek to explain nature, not simply predict it, we see human thought as fundamentally a model-building activity. We elaborate this key point with numerous examples below. We also discuss how pattern recognition, even if it is not the core of intelligence, can nonetheless support model-building, through "model-free" algorithms that learn through experience how to make essential inferences more computationally efficient.

Before proceeding, we provide a few caveats about the goals of this article and a brief overview of the key ideas.

1.1 What this article is not

For nearly as long as there have been neural networks, there have been critiques of neural networks (Crick, 1989; Fodor & Pylyshyn, 1988; Marcus, 1998, 2001; Minsky & Papert, 1969; Pinker & Prince, 1988). While we are critical of neural networks in this article, our goal is to build on their successes rather than dwell on their shortcomings.
We see a role for neural networks in developing more human-like learning machines: They have been applied in compelling ways to many types of machine learning problems, demonstrating the power of gradient-based learning and deep hierarchies of latent variables. Neural networks also have a rich history as computational models of cognition (McClelland, Rumelhart, & the PDP Research Group, 1986; Rumelhart, McClelland, & the PDP Research Group, 1986) – a history we describe in more detail in the next section. At a more fundamental level, any computational model of learning must ultimately be grounded in the brain's biological neural networks. We also believe that future generations of neural networks will look very different from the current state of the art. They may be endowed with intuitive physics, theory of mind, causal reasoning, and other capacities we describe in the sections that follow. More structure and inductive biases could be built into the networks or learned from previous experience with related tasks, leading to more human-like patterns of learning and development. Networks may learn to effectively search for and discover new mental models or intuitive theories, and these improved models will, in turn, enable subsequent learning, allowing systems that learn-to-learn – using previous knowledge to make richer inferences from very small amounts of training data.

It is also important to draw a distinction between AI that purports to emulate or draw inspiration from aspects of human cognition, and AI that does not. This article focuses on the former. The latter is a perfectly reasonable and useful approach to developing AI algorithms – avoiding cognitive or neural inspiration as well as claims of cognitive or neural plausibility. Indeed, this is how many researchers have proceeded, and this article has little pertinence to work conducted under this research strategy.[1]
On the other hand, we believe that reverse engineering human intelligence can usefully inform AI and machine learning (and has already done so), especially for the types of domains and tasks that people excel at. Despite recent computational achievements, people are better than machines at solving a range of difficult computational problems, including concept learning, scene understanding, language acquisition, language understanding, speech recognition, etc. Other human cognitive abilities remain difficult to understand computationally, including creativity, common sense, and general purpose reasoning. As long as natural intelligence remains the best example of intelligence, we believe that the project of reverse engineering the human solutions to difficult computational problems will continue to inform and advance AI.

Finally, while we focus on neural network approaches to AI, we do not wish to give the impression that these are the only contributors to recent advances in AI.

[1] In their influential textbook, Russell and Norvig (2003) state that "The quest for 'artificial flight' succeeded when the Wright brothers and others stopped imitating birds and started using wind tunnels and learning about aerodynamics." (p. 3).

Table 1: Glossary

Neural network: A network of simple neuron-like processing units that collectively perform complex computations. Neural networks are often organized into layers, including an input layer that presents the data (e.g., an image), hidden layers that transform the data into intermediate representations, and an output layer that produces a response (e.g., a label or an action). Recurrent connections are also popular when processing sequential data.

Deep learning: A neural network with at least one hidden layer (some networks have dozens).
Most state-of-the-art deep networks are trained using the backpropagation algorithm to gradually adjust their connection strengths.

Backpropagation: Gradient descent applied to training a deep neural network. The gradient of the objective function (e.g., classification error or log-likelihood) with respect to the model parameters (e.g., connection weights) is used to make a series of small adjustments to the parameters in a direction that improves the objective function.

Convolutional network (convnet): A neural network that uses trainable filters instead of (or in addition to) fully-connected layers with independent weights. The same filter is applied at many locations across an image (or across a time series), leading to neural networks that are effectively larger but with local connectivity and fewer free parameters.

Model-free and model-based reinforcement learning: Model-free algorithms directly learn a control policy without explicitly building a model of the environment (reward and state transition distributions). Model-based algorithms learn a model of the environment and use it to select actions by planning.

Deep Q-learning: A model-free reinforcement learning algorithm used to train deep neural networks on control tasks such as playing Atari games. A network is trained to approximate the optimal action-value function Q(s, a), which is the expected long-term cumulative reward of taking action a in state s and then optimally selecting future actions.

Generative model: A model that specifies a probability distribution over the data. For instance, in a classification task with examples X and class labels y, a generative model specifies the distribution of data given labels P(X | y), as well as a prior on labels P(y), which can be used for sampling new examples or for classification by using Bayes' rule to compute P(y | X).
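The generative-model entry above can be made concrete with a small sketch. The example below is only an illustration, not anything from the article: the two classes ("cat", "dog"), the single discrete feature, and all probabilities are invented. It specifies P(y) and P(X | y), classifies with Bayes' rule, and samples new examples, which a purely discriminative model could not do.

```python
import random

# Invented toy generative model: P(y) and P(X | y) for a single
# discrete feature with three possible values.
prior = {"cat": 0.5, "dog": 0.5}                       # P(y)
likelihood = {                                          # P(X | y)
    "cat": {"meow": 0.8, "bark": 0.1, "silent": 0.1},
    "dog": {"meow": 0.1, "bark": 0.7, "silent": 0.2},
}

def posterior(x):
    """Classification via Bayes' rule: P(y | X) is proportional to P(X | y) P(y)."""
    joint = {y: prior[y] * likelihood[y][x] for y in prior}
    evidence = sum(joint.values())                      # P(X = x)
    return {y: p / evidence for y, p in joint.items()}

def sample(y, rng=random):
    """Generative models also support sampling new examples from P(X | y)."""
    r, acc = rng.random(), 0.0
    for x, p in likelihood[y].items():
        acc += p
        if r < acc:
            return x
    return x                                            # guard against rounding

print(posterior("bark"))   # "dog" is far more probable given "bark"
print(sample("cat"))
```

Reversing the conditional by hand like this is exactly the classification route the glossary describes; the sampling function is what makes the model "generative" rather than discriminative.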
A discriminative model specifies P(y | X) directly, possibly by using a neural network to predict the label for a given data point, and cannot directly be used to sample new examples or to compute other queries regarding the data. We will generally be concerned with directed generative models (such as Bayesian networks or probabilistic programs) which can be given a causal interpretation, although undirected (non-causal) generative models (such as Boltzmann machines) are also possible.

Program induction: Constructing a program that computes some desired function, where that function is typically specified by training data consisting of example input-output pairs. In the case of probabilistic programs, which specify candidate generative models for data, an abstract description language is used to define a set of allowable programs and learning is a search for the programs likely to have generated the data.

On the contrary, some of the most exciting recent progress has been in new forms of probabilistic machine learning (Ghahramani, 2015). For example, researchers have developed automated statistical reasoning techniques (Lloyd, Duvenaud, Grosse, Tenenbaum, & Ghahramani, 2014), automated techniques for model building and selection (Grosse, Salakhutdinov, Freeman, & Tenenbaum, 2012), and probabilistic programming languages (e.g., Gelman, Lee, & Guo, 2015; Goodman, Mansinghka, Roy, Bonawitz, & Tenenbaum, 2008; Mansinghka, Selsam, & Perov, 2014). We believe that these approaches will play important roles in future AI systems, and they are at least as compatible with the ideas from cognitive science we discuss here, but a full discussion of those connections is beyond the scope of the current article.

1.2 Overview of the key ideas

The central goal of this paper is to propose a set of core ingredients for building more human-like learning and thinking machines.
We will elaborate on each of these ingredients and topics in Section 4, but here we briefly overview the key ideas.

The first set of ingredients focuses on developmental "start-up software," or cognitive capabilities present early in development. There are several reasons for this focus on development. If an ingredient is present early in development, it is certainly active and available well before a child or adult would attempt to learn the types of tasks discussed in this paper. This is true regardless of whether the early-present ingredient is itself learned from experience or innately present. Also, the earlier an ingredient is present, the more likely it is to be foundational to later development and learning.

We focus on two pieces of developmental start-up software (see Wellman & Gelman, 1992, for a review of both). First is intuitive physics (Section 4.1.1): Infants have primitive object concepts that allow them to track objects over time and allow them to discount physically implausible trajectories. For example, infants know that objects will persist over time and that they are solid and coherent. Equipped with these general principles, people can learn more quickly and make more accurate predictions. While a task may be new, physics still works the same way. A second type of software present in early development is intuitive psychology (Section 4.1.2): Infants understand that other people have mental states like goals and beliefs, and this understanding strongly constrains their learning and predictions. A child watching an expert play a new video game can infer that the avatar has agency and is trying to seek reward while avoiding punishment. This inference immediately constrains other inferences, allowing the child to infer what objects are good and what objects are bad. These types of inferences further accelerate the learning of new tasks.
Our second set of ingredients focuses on learning. While there are many perspectives on learning, we see model building as the hallmark of human-level learning, or explaining observed data through the construction of causal models of the world (Section 4.2.2). Under this perspective, the early-present capacities for intuitive physics and psychology are also causal models of the world. A primary job of learning is to extend and enrich these models, and to build analogous causally structured theories of other domains.

Compared to state-of-the-art algorithms in machine learning, human learning is distinguished by its richness and its efficiency. Children come with the ability and the desire to uncover the underlying causes of sparsely observed events and to use that knowledge to go far beyond the paucity of the data. It might seem paradoxical that people are capable of learning these richly structured models from very limited amounts of experience. We suggest that compositionality and learning-to-learn are ingredients that make this type of rapid model learning possible (Sections 4.2.1 and 4.2.3, respectively).

A final set of ingredients concerns how the rich models our minds build are put into action, in real time (Section 4.3). It is remarkable how fast we are to perceive and to act. People can comprehend a novel scene in a fraction of a second, and a novel utterance in little more than the time it takes to say it and hear it. An important motivation for using neural networks in machine vision and speech systems is to respond as quickly as the brain does. Although neural networks are usually aiming at pattern recognition rather than model-building, we will discuss ways in which these "model-free" methods can accelerate slow model-based inferences in perception and cognition (Section 4.3.1).
By learning to recognize patterns in these inferences, the outputs of inference can be predicted without having to go through costly intermediate steps. Integrating neural networks that "learn to do inference" with rich model-building learning mechanisms offers a promising way to explain how human minds can understand the world so well, so quickly.

We will also discuss the integration of model-based and model-free methods in reinforcement learning (Section 4.3.2), an area that has seen rapid recent progress. Once a causal model of a task has been learned, humans can use the model to plan action sequences that maximize future reward; when rewards are used as the metric for success in model-building, this is known as model-based reinforcement learning. However, planning in complex models is cumbersome and slow, making the speed-accuracy trade-off unfavorable for real-time control. By contrast, model-free reinforcement learning algorithms, such as current instantiations of deep reinforcement learning, support fast control but at the cost of inflexibility and possibly accuracy. We will review evidence that humans combine model-based and model-free learning algorithms both competitively and cooperatively, and that these interactions are supervised by metacognitive processes. The sophistication of human-like reinforcement learning has yet to be realized in AI systems, but this is an area where crosstalk between cognitive and engineering approaches is especially promising.

2 Cognitive and neural inspiration in artificial intelligence

The questions of whether and how AI should relate to human cognitive psychology are older than the terms 'artificial intelligence' and 'cognitive psychology.' Alan Turing suspected that it is easier to build and educate a child-machine than to try to fully capture adult human cognition (Turing, 1950).
Turing pictured the child's mind as a notebook with "rather little mechanism and lots of blank sheets," and the mind of a child-machine as filling in the notebook by responding to rewards and punishments, similar to reinforcement learning. This view on representation and learning echoes behaviorism, a dominant psychological tradition in Turing's time. It also echoes the strong empiricism of modern connectionist models, the idea that we can learn almost everything we know from the statistical patterns of sensory inputs.

Cognitive science repudiated the over-simplified behaviorist view and came to play a central role in early AI research (Boden, 2006). Newell and Simon (1961) developed their "General Problem Solver" as both an AI algorithm and a model of human problem solving, which they subsequently tested experimentally (Newell & Simon, 1972). AI pioneers in other areas of research explicitly referenced human cognition, and even published papers in cognitive psychology journals (e.g., Bobrow & Winograd, 1977; Hayes-Roth & Hayes-Roth, 1979; Winograd, 1972). For example, Schank (1972), writing in the journal Cognitive Psychology, declared that

  "We hope to be able to build a program that can learn, as a child does, how to do what we have described in this paper instead of being spoon-fed the tremendous information necessary."

A similar sentiment was expressed by Minsky (1974):

  "I draw no boundary between a theory of human thinking and a scheme for making an intelligent machine; no purpose would be served by separating these today since neither domain has theories good enough to explain—or to produce—enough mental capacity."

Much of this research assumed that human knowledge representation is symbolic and that reasoning, language, planning and vision could be understood in terms of symbolic operations.
Parallel to these developments, a radically different approach was being explored, based on neuron-like "sub-symbolic" computations (e.g., Fukushima, 1980; Grossberg, 1976; Rosenblatt, 1958). The representations and algorithms used by this approach were more directly inspired by neuroscience than by cognitive psychology, although ultimately it would flower into an influential school of thought about the nature of cognition – parallel distributed processing (PDP) (McClelland et al., 1986; Rumelhart, McClelland, & the PDP Research Group, 1986). As its name suggests, PDP emphasizes parallel computation by combining simple units to collectively implement sophisticated computations. The knowledge learned by these neural networks is thus distributed across the collection of units rather than localized as in most symbolic data structures. The recent resurgence of interest in neural networks, more commonly referred to as "deep learning," shares the same representational commitments and often even the same learning algorithms as the earlier PDP models. "Deep" refers to the fact that more powerful models can be built by composing many layers of representation (see LeCun et al., 2015; Schmidhuber, 2015, for recent reviews), still very much in the PDP style while utilizing recent advances in hardware and computing capabilities, as well as massive datasets, to learn deeper models.

It is also important to clarify that the PDP perspective is compatible with "model building" in addition to "pattern recognition." Some of the original work done under the banner of PDP (Rumelhart, McClelland, & the PDP Research Group, 1986) is closer to model building than pattern recognition, whereas the recent large-scale discriminative deep learning systems more purely exemplify pattern recognition (see Bottou, 2014, for a related discussion).
But, as discussed, there is also a question of the nature of the learned representations within the model – their form, compositionality, and transferability – and the developmental start-up software that was used to get there. We focus on these issues in this paper.

Neural network models and the PDP approach offer a view of the mind (and intelligence more broadly) that is sub-symbolic and often populated with minimal constraints and inductive biases to guide learning. Proponents of this approach maintain that many classic types of structured knowledge, such as graphs, grammars, rules, objects, structural descriptions, programs, etc., can be useful yet misleading metaphors for characterizing thought. These structures are more epiphenomenal than real – emergent properties of more fundamental sub-symbolic cognitive processes (McClelland et al., 2010). Compared to other paradigms for studying cognition, this position on the nature of representation is often accompanied by a relatively "blank slate" vision of initial knowledge and representation, much like Turing's blank notebook.

When attempting to understand a particular cognitive ability or phenomenon within this paradigm, a common scientific strategy is to train a relatively generic neural network to perform the task, adding additional ingredients only when necessary. This approach has shown that neural networks can behave as if they learned explicitly structured knowledge, such as a rule for producing the past tense of words (Rumelhart & McClelland, 1986), rules for solving simple balance-beam physics problems (McClelland, 1988), or a tree to represent types of living things (plants and animals) and their distribution of properties (Rogers & McClelland, 2004).
Training large-scale, relatively generic networks is also the best current approach for object recognition (He et al., 2015; Krizhevsky et al., 2012; Russakovsky et al., 2015; Szegedy et al., 2014), where the high-level feature representations of these convolutional nets have also been used to predict patterns of neural response in human and macaque IT cortex (Khaligh-Razavi & Kriegeskorte, 2014; Kriegeskorte, 2015; Yamins et al., 2014) as well as human typicality ratings (Lake, Zaremba, Fergus, & Gureckis, 2015) and similarity ratings (Peterson, Abbott, & Griffiths, 2016) for images of common objects. Moreover, researchers have trained generic networks to perform structured and even strategic tasks, such as the recent work on using a Deep Q-learning Network (DQN) to play simple video games (V. Mnih et al., 2015). If neural networks have such broad application in machine vision, language, and control, and if they can be trained to emulate the rule-like and structured behaviors that characterize cognition, do we need more to develop truly human-like learning and thinking machines? How far can relatively generic neural networks bring us towards this goal?

3 Challenges for building more human-like machines

While cognitive science has not yet converged on a single account of the mind or intelligence, the claim that a mind is a collection of general purpose neural networks with few initial constraints is rather extreme in contemporary cognitive science. A different picture has emerged that highlights the importance of early inductive biases, including core concepts such as number, space, agency and objects, as well as powerful learning algorithms that rely on prior knowledge to extract knowledge from small amounts of training data. This knowledge is often richly organized and theory-like in structure, capable of the graded inferences and productive capacities characteristic of human thought.
Here we present two challenge problems for machine learning and AI: learning simple visual concepts (Lake, Salakhutdinov, & Tenenbaum, 2015) and learning to play the Atari game Frostbite (V. Mnih et al., 2015). We also use the problems as running examples to illustrate the importance of core cognitive ingredients in the sections that follow.

3.1 The Characters Challenge

The first challenge concerns handwritten character recognition, a classic problem for comparing different types of machine learning algorithms. Hofstadter (1985) argued that the problem of recognizing characters in all the ways people do – both handwritten and printed – contains most if not all of the fundamental challenges of AI. Whether or not this statement is right, it highlights the surprising complexity that underlies even "simple" human-level concepts like letters. More practically, handwritten character recognition is a real problem that children and adults must learn to solve, with practical applications ranging from reading envelope addresses to processing checks at an ATM. Handwritten character recognition is also simpler than more general forms of object recognition – the object of interest is two-dimensional, separated from the background, and usually unoccluded. Compared to how people learn and see other types of objects, it seems possible, in the near term, to build algorithms that can see most of the structure in characters that people can see.

The standard benchmark is the MNIST data set for digit recognition, which involves classifying images of digits into the categories '0'-'9' (LeCun, Bottou, Bengio, & Haffner, 1998). The training set provides 6,000 images per class for a total of 60,000 training images.
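To give a sense of what the simplest classifiers evaluated on this kind of benchmark involve, here is a minimal nearest-neighbor sketch. It is an illustration only: the real benchmark uses 28x28 grayscale digit images, while the 3x3 binary "images" and the two class labels below are invented stand-ins.

```python
# Minimal k-nearest-neighbor classification. Images are flattened to
# tuples of pixels; the 3x3 binary patterns and labels are made up
# for illustration and are not MNIST digits.

def distance(a, b):
    """Euclidean distance between two flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, image, k=1):
    """Label an image by majority vote among its k nearest training images."""
    nearest = sorted(train, key=lambda ex: distance(ex[0], image))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

train = [
    ((0, 1, 0,  0, 1, 0,  0, 1, 0), "vertical"),
    ((1, 0, 0,  1, 0, 0,  1, 0, 0), "vertical"),
    ((0, 0, 0,  1, 1, 1,  0, 0, 0), "horizontal"),
    ((1, 1, 1,  0, 0, 0,  0, 0, 0), "horizontal"),
]

# A noisy vertical bar: one extra pixel in the corner.
query = (0, 1, 0,  0, 1, 0,  0, 1, 1)
print(knn_predict(train, query, k=1))   # vertical
```

The method has no training step at all: every prediction compares the query against the full training set, which is why its accuracy on MNIST depends so heavily on having thousands of examples per class.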
With a large amount of training data available, many algorithms achieve respectable performance, including K-nearest neighbors (5% test error), support vector machines (about 1% test error), and convolutional neural networks (below 1% test error; LeCun et al., 1998). The best results achieved using deep convolutional nets are very close to human-level performance at an error rate of 0.2% (Ciresan, Meier, & Schmidhuber, 2012). Similarly, recent results applying convolutional nets to the far more challenging ImageNet object recognition benchmark have shown that human-level performance is within reach on that data set as well (Russakovsky et al., 2015).

While humans and neural networks may perform equally well on the MNIST digit recognition task and other large-scale image classification tasks, this does not mean that they learn and think in the same way. There are at least two important differences: people learn from fewer examples and they learn richer representations, a comparison true for both learning handwritten characters as well as learning more general classes of objects (Figure 1). People can learn to recognize a new handwritten character from a single example (Figure 1A-i), allowing them to discriminate between novel instances drawn by other people and similar-looking non-instances (Lake, Salakhutdinov, & Tenenbaum, 2015; E. G. Miller, Matsakis, & Viola, 2000). Moreover, people learn more than how to do pattern recognition: they learn a concept – that is, a model of the class that allows their acquired knowledge to be flexibly applied in new ways. In addition to recognizing new examples, people can also generate new examples (Figure 1A-ii), parse a character into its most important parts and relations (Figure 1A-iii; Lake, Salakhutdinov, & Tenenbaum, 2012), and generate new characters given a small set of related characters (Figure 1A-iv).
These additional abilities come for free along with the acquisition of the underlying concept. Even for these simple visual concepts, people are still better and more sophisticated learners than the best algorithms for character recognition. People learn a lot more from a lot less, and capturing these human-level learning abilities in machines is the Characters Challenge. We recently reported progress on this challenge using probabilistic program induction (Lake, Salakhutdinov, & Tenenbaum, 2015), yet aspects of the full human cognitive ability remain out of reach. While both people and the model represent characters as a sequence of pen strokes and relations, people have a far richer repertoire of structural relations between strokes. Furthermore, people can efficiently integrate across multiple examples of a character to infer which have optional elements, such as the horizontal cross-bar in ‘7’s, combining different variants of the same character into a single coherent representation. Additional progress may come by combining deep learning and probabilistic program induction to tackle even richer versions of the Characters Challenge.

Figure 1: The characters challenge: human-level learning of novel handwritten characters (A), with the same abilities also illustrated for a novel two-wheeled vehicle (B). A single example of a new visual concept (red box) can be enough information to support the (i) classification of new examples, (ii) generation of new examples, (iii) parsing an object into parts and relations, and (iv) generation of new concepts from related concepts. Adapted from Lake, Salakhutdinov, and Tenenbaum (2015).

3.2 The Frostbite Challenge

The second challenge concerns the Atari game Frostbite (Figure 2), which was one of the control problems tackled by the DQN of V. Mnih et al. (2015).
The DQN was a significant advance in reinforcement learning, showing that a single algorithm can learn to play a wide variety of complex tasks. The network was trained to play 49 classic Atari games, proposed as a test domain for reinforcement learning (Bellemare, Naddaf, Veness, & Bowling, 2013), impressively achieving human-level performance or above on 29 of the games. It did, however, have particular trouble with Frostbite and other games that required temporally extended planning strategies. In Frostbite, players control an agent (Frostbite Bailey) tasked with constructing an igloo within a time limit. The igloo is built piece by piece as the agent jumps on ice floes in water (Figure 2A–C). The challenge is that the ice floes are in constant motion (moving either left or right), and ice floes only contribute to the construction of the igloo if they are visited in an active state (white rather than blue). The agent may also earn extra points by gathering fish while avoiding a number of fatal hazards (falling in the water, snow geese, polar bears, etc.). Success in this game requires a temporally extended plan to ensure the agent can accomplish a sub-goal (such as reaching an ice floe) and then safely proceed to the next sub-goal.

Figure 2: Screenshots of Frostbite, a 1983 video game designed for the Atari game console. A) The start of a level in Frostbite. The agent must construct an igloo by hopping between ice floes and avoiding obstacles such as birds. The floes are in constant motion (either left or right), making multi-step planning essential to success. B) The agent receives pieces of the igloo (top right) by jumping on the active ice floes (white), which then deactivates them (blue). C) At the end of a level, the agent must safely reach the completed igloo. D) Later levels include additional rewards (fish) and deadly obstacles (crabs, clams, and bears).
Ultimately, once all of the pieces of the igloo are in place, the agent must proceed to the igloo and thus complete the level before time expires (Figure 2C). The DQN learns to play Frostbite and other Atari games by combining a powerful pattern recognizer (a deep convolutional neural network) and a simple model-free reinforcement learning algorithm (Q-learning; Watkins & Dayan, 1992). These components allow the network to map sensory inputs (frames of pixels) onto a policy over a small set of actions, and both the mapping and the policy are trained to optimize long-term cumulative reward (the game score). The network embodies the strongly empiricist approach characteristic of most connectionist models: very little is built into the network apart from the assumptions about image structure inherent in convolutional networks, so the network has to essentially learn a visual and conceptual system from scratch for each new game. In V. Mnih et al. (2015), the network architecture and hyper-parameters were fixed, but the network was trained anew for each game, meaning the visual system and the policy are highly specialized for the games it was trained on. More recent work has shown how these game-specific networks can share visual features (Rusu et al., 2016) or be used to train a multi-task network (Parisotto, Ba, & Salakhutdinov, 2016), achieving modest benefits of transfer when learning to play new games. Although it is interesting that the DQN learns to play games at human-level performance while assuming very little prior knowledge, the DQN may be learning to play Frostbite and other games in a very different way than people do. One way to examine the differences is by considering the amount of experience required for learning. In V. Mnih et al.
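The model-free component of the DQN is the classic Q-learning update (Watkins & Dayan, 1992): after each transition, the estimated value Q(s, a) is nudged toward the observed reward plus the discounted value of the best next action. A minimal tabular sketch on a hypothetical two-state toy problem (the DQN replaces this table with a deep convolutional network over pixels):

```python
import random

def q_learning_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q[s][a] toward the bootstrapped target
    reward + gamma * max_a' Q[s'][a']."""
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

# Hypothetical toy MDP (not Frostbite): from state 0, action 'right'
# reaches state 1 with reward 1; 'left' stays in state 0 with reward 0.
Q = {0: {'left': 0.0, 'right': 0.0}, 1: {'left': 0.0, 'right': 0.0}}
random.seed(0)
for _ in range(100):
    a = random.choice(['left', 'right'])          # exploratory behavior policy
    s_next, r = (1, 1.0) if a == 'right' else (0, 0.0)
    q_learning_update(Q, 0, a, r, s_next)

print(Q[0]['right'] > Q[0]['left'])  # True: 'right' is learned to be better
```

Note that the update uses only sampled transitions, never an explicit model of the environment; this is exactly the property the text contrasts with human model-building.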
(2015), the DQN was compared with a professional gamer who received approximately two hours of practice on each of the 49 Atari games (although he or she likely had prior experience with some of the games). The DQN was trained on 200 million frames from each of the games, which equates to approximately 924 hours of game time (about 38 days), or almost 500 times as much experience as the human received.[2] Additionally, the DQN incorporates experience replay, where each of these frames is replayed approximately 8 more times on average over the course of learning. With the full 924 hours of unique experience and additional replay, the DQN achieved less than 10% of human-level performance during a controlled test session (see DQN in Fig. 3). More recent variants of the DQN have demonstrated superior performance (Schaul et al., 2016; Stadie et al., 2016; van Hasselt, Guez, & Silver, 2016; Wang et al., 2016), reaching 83% of the professional gamer’s score by incorporating smarter experience replay (Schaul et al., 2016) and 96% by using smarter replay and more efficient parameter sharing (Wang et al., 2016) (see DQN+ and DQN++ in Fig. 3).[3] But they require a lot of experience to reach this level: the learning curve provided in Schaul et al. (2016) shows performance is around 46% after 231 hours, 19% after 116 hours, and below 3.5% after just 2 hours (which is close to random play, approximately 1.5%). The differences between the human and machine learning curves suggest that they may be learning different kinds of knowledge, using different learning mechanisms, or both. The contrast becomes even more dramatic if we look at the very earliest stages of learning. While both the original DQN and these more recent variants require multiple hours of experience to perform reliably better than random play, even non-professional humans can grasp the basics of the game after just a few minutes of play.
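The experience gap above is simple unit conversion: 200 million frames at the Atari’s 60 frames per second works out to just over 900 hours, versus the human’s roughly two hours per game (the figures in the text are the same quantities up to rounding). A quick check of the arithmetic:

```python
FRAMES = 200_000_000          # training frames per game (V. Mnih et al., 2015)
FPS = 60                      # Atari 2600 frame rate
HUMAN_HOURS = 2               # approximate human practice per game

hours = FRAMES / FPS / 3600   # frames -> seconds -> hours
days = hours / 24
ratio = hours / HUMAN_HOURS

print(f"{hours:.1f} hours (~{days:.1f} days)")  # 925.9 hours (~38.6 days)
print(f"~{ratio:.0f}x the human's experience")  # ~463x, i.e., almost 500x
```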
[2] The time required to train the DQN (compute time) is not the same as the game (experience) time. Compute time can be longer.

[3] The reported scores use the “human starts” measure of test performance, designed to prevent networks from just memorizing long sequences of successful actions from a single starting point. Both faster learning (Blundell et al., 2016) and higher scores (Wang et al., 2016) have been reported using other metrics, but it is unclear how well the networks are generalizing with these alternative metrics.

Figure 3: Comparing learning speed for people versus Deep Q-Networks (DQNs). Test performance on the Atari 2600 game “Frostbite” is plotted as a function of game experience (in hours at a frame rate of 60 fps), which does not include additional experience replay. Learning curves (if available) and scores are shown from different networks: DQN (V. Mnih et al., 2015), DQN+ (Schaul et al., 2016), and DQN++ (Wang et al., 2016). Random play achieves a score of 66.4. The “human starts” performance measure is used (van Hasselt et al., 2016).

We speculate that people do this by inferring a general schema to describe the goals of the game and the object types and their interactions, using the kinds of intuitive theories, model-building abilities and model-based planning mechanisms we describe below. While novice players may make some mistakes, such as inferring that fish are harmful rather than helpful, they can learn to play better than chance within a few minutes. If humans are able to first watch an expert playing for a few minutes, they can learn even faster. In informal experiments with two of the authors playing Frostbite on a Javascript emulator (http://www.virtualatari.org/soft.php?soft=Frostbite), after watching videos of expert play on YouTube for just two minutes, we found that we were able to reach scores comparable to or better than the human expert reported in V.
Mnih et al. (2015) after at most 15-20 minutes of total practice.[4]

[4] More precisely, the human expert in V. Mnih et al. (2015) scored an average of 4335 points across 30 game sessions of up to five minutes of play. In individual sessions lasting no longer than five minutes, author TDU obtained scores of 3520 points after approximately 5 minutes of gameplay, 3510 points after 10 minutes, and 7810 points after 15 minutes. Author JBT obtained 4060 after approximately 5 minutes of gameplay, 4920 after 10-15 minutes, and 6710 after no more than 20 minutes. TDU and JBT each watched approximately two minutes of expert play on YouTube (e.g., https://www.youtube.com/watch?v=ZpUFztf9Fjc, but there are many similar examples that can be found in a YouTube search).

There are other behavioral signatures that suggest fundamental differences in representation and learning between people and the DQN. For instance, the game of Frostbite provides incremental rewards for reaching each active ice floe, providing the DQN with the relevant sub-goals for completing the larger task of building an igloo. Without these sub-goals, the DQN would have to take random actions until it accidentally builds an igloo and is rewarded for completing the entire level. In contrast, people likely do not rely on incremental scoring in the same way when figuring out how to play a new game. In Frostbite, it is possible to figure out the higher-level goal of building an igloo without incremental feedback; similarly, sparse feedback is a source of difficulty in other Atari 2600 games such as Montezuma’s Revenge, where people substantially outperform current DQN approaches. The learned DQN network is also rather inflexible to changes in its inputs and goals: changing the color or appearance of objects or changing the goals of the network would have devastating consequences on performance if the network is not retrained. While any specific model is necessarily simplified and should not be held to the standard of general human intelligence, the contrast between DQN and human flexibility is striking nonetheless. For example, imagine you are tasked with playing Frostbite with any one of these new goals:

• Get the lowest possible score.
• Get closest to 100, or 300, or 1000, or 3000, or any level, without going over.
• Beat your friend, who’s playing next to you, but just barely, not by too much, so as not to embarrass them.
• Go as long as you can without dying.
• Die as quickly as you can.
• Pass each level at the last possible minute, right before the temperature timer hits zero and you die (i.e., come as close as you can to dying from frostbite without actually dying).
• Get to the furthest unexplored level without regard for your score.
• See if you can discover secret Easter eggs.
• Get as many fish as you can.
• Touch all the individual ice floes on screen once and only once.
• Teach your friend how to play as efficiently as possible.

This range of goals highlights an essential component of human intelligence: people can learn models and use them for arbitrary new tasks and goals. While neural networks can learn multiple mappings or tasks with the same set of stimuli – adapting their outputs depending on a specified goal – these models require substantial training or reconfiguration to add new tasks (e.g., Collins & Frank, 2013; Eliasmith et al., 2012; Rougier, Noelle, Braver, Cohen, & O’Reilly, 2005). In contrast, people require little or no retraining or reconfiguration, adding new tasks and goals to their repertoire with relative ease. The Frostbite example is a particularly telling contrast when compared with human play.
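The flexibility illustrated by the goal list above is natural for a model-based agent: if the environment’s transition model is known or learned, a brand-new goal is just a new reward function, and the agent can re-plan without relearning anything about the world. A minimal value-iteration sketch on a hypothetical five-state corridor, where swapping the reward changes only the planner’s input, not the model (toy environment, for illustration only):

```python
def plan(n_states, reward, gamma=0.9, sweeps=100):
    """Value iteration on a 1-D corridor. Actions move left (-1) or
    right (+1), clipped at the ends. The transition model never changes;
    a new goal is expressed purely as a new reward function."""
    def step(s, a):
        return min(max(s + a, 0), n_states - 1)
    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [max(reward[step(s, a)] + gamma * V[step(s, a)] for a in (-1, 1))
             for s in range(n_states)]
    # Greedy policy: best first action from each state.
    return [max((-1, 1), key=lambda a: reward[step(s, a)] + gamma * V[step(s, a)])
            for s in range(n_states)]

# Original goal: reward at the right end of the corridor.
print(plan(5, [0, 0, 0, 0, 1]))  # [1, 1, 1, 1, 1] -> always move right
# New goal, same world model: reward at the left end. No retraining, just re-plan.
print(plan(5, [1, 0, 0, 0, 0]))  # [-1, -1, -1, -1, -1] -> always move left
```

A model-free learner like the DQN, by contrast, bakes the old reward into its learned policy and value estimates, so a goal change of this kind requires retraining.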
Even the best deep networks learn gradually over many thousands of game episodes, take a long time to reach good performance, and are locked into particular input and goal patterns. Humans, after playing just a small number of games over a span of minutes, can understand the game and its goals well enough to perform better than deep networks do after almost a thousand hours of experience. Even more impressively, people understand enough to invent or accept new goals, generalize over changes to the input, and explain the game to others. Why are people different? What core ingredients of human intelligence might the DQN and other modern machine learning methods be missing?

One might object that both the Frostbite and Characters challenges draw an unfair comparison between the speed of human learning and neural network learning. We discuss this objection in detail in Section 5, but we feel it is important to anticipate here as well. To paraphrase one reviewer of an earlier draft of this article, “It is not that DQN and people are solving the same task differently. They may be better seen as solving different tasks. Human learners – unlike DQN and many other deep learning systems – approach new problems armed with extensive prior experience. The human is encountering one in a years-long string of problems, with rich overlapping structure. Humans as a result often have important domain-specific knowledge for these tasks, even before they ‘begin.’ The DQN is starting completely from scratch.” We agree, and indeed this is another way of putting our point here. Human learners fundamentally take on different learning tasks than today’s neural networks, and if we want to build machines that learn and think like people, our machines need to confront the kinds of tasks that human learners do, not shy away from them.
People never start completely from scratch, or even close to “from scratch,” and that is the secret to their success. The challenge of building models of human learning and thinking then becomes: How do we bring to bear rich prior knowledge to learn new tasks and solve new problems so quickly? What form does that prior knowledge take, and how is it constructed, from some combination of inbuilt capacities and previous experience? The core ingredients we propose in the next section offer one route to meeting this challenge.

4 Core ingredients of human intelligence

In the Introduction, we laid out what we see as core ingredients of intelligence. Here we consider the ingredients in detail and contrast them with the current state of neural network modeling. While these are hardly the only ingredients needed for human-like learning and thought (see our discussion of language in Section 5), they are key building blocks which are not present in most current learning-based AI systems – certainly not all present together – and for which additional attention may prove especially fruitful. We believe that integrating them will produce significantly more powerful and more human-like learning and thinking abilities than we currently see in AI systems. Before considering each ingredient in detail, it is important to clarify that by “core ingredient” we do not necessarily mean an ingredient that is innately specified by genetics or must be “built in” to any learning algorithm. We intend our discussion to be agnostic with regards to the origins of the key ingredients. By the time a child or an adult is picking up a new character or learning how to play Frostbite, they are armed with extensive real-world experience that deep learning systems do not benefit from – experience that would be hard to emulate in any general sense.
Certainly, the core ingredients are enriched by this experience, and some may even be a product of the experience itself. Whether learned, built in, or enriched, the key claim is that these ingredients play an active and important role in producing human-like learning and thought, in ways contemporary machine learning has yet to capture.

4.1 Developmental start-up software

Early in development, humans have a foundational understanding of several core domains (Spelke, 2003; Spelke & Kinzler, 2007). These domains include number (numerical and set operations), space (geometry and navigation), physics (inanimate objects and mechanics) and psychology (agents and groups). These core domains cleave cognition at its conceptual joints, and each domain is organized by a set of entities and abstract principles relating the entities. The underlying cognitive representations can be understood as “intuitive theories,” with a causal structure resembling a scientific theory (Carey, 2004, 2009; Gopnik et al., 2004; Gopnik & Meltzoff, 1999; Gweon, Tenenbaum, & Schulz, 2010; L. Schulz, 2012; Wellman & Gelman, 1992, 1998). The “child as scientist” proposal further views the process of learning itself as also scientist-like, with recent experiments showing that children seek out new data to distinguish between hypotheses, isolate variables, test causal hypotheses, make use of the data-generating process in drawing conclusions, and learn selectively from others (Cook, Goodman, & Schulz, 2011; Gweon et al., 2010; L. E. Schulz, Gopnik, & Glymour, 2007; Stahl & Feigenson, 2015; Tsividis, Gershman, Tenenbaum, & Schulz, 2013). We will address the nature of learning mechanisms in Section 4.2. Each core domain has been the target of a great deal of study and analysis, and together the domains are thought to be shared cross-culturally and partly with non-human animals.
All of these domains may be important augmentations to current machine learning, though below we focus in particular on the early understanding of objects and agents.

4.1.1 Intuitive physics

Young children have rich knowledge of intuitive physics. Whether learned or innate, important physical concepts are present at ages far earlier than when a child or adult learns to play Frostbite, suggesting these resources may be used for solving this and many everyday physics-related tasks. At the age of 2 months, and possibly earlier, human infants expect inanimate objects to follow principles of persistence, continuity, cohesion and solidity. Young infants believe objects should move along smooth paths, not wink in and out of existence, not inter-penetrate and not act at a distance (Spelke, 1990; Spelke, Gutheil, & Van de Walle, 1995). These expectations guide object segmentation in early infancy, emerging before appearance-based cues such as color, texture, and perceptual goodness (Spelke, 1990). These expectations also go on to guide later learning. At around 6 months, infants have already developed different expectations for rigid bodies, soft bodies and liquids (Rips & Hespos, 2015). Liquids, for example, are expected to go through barriers, while solid objects cannot (Hespos, Ferry, & Rips, 2009). By their first birthday, infants have gone through several transitions of comprehending basic physical concepts such as inertia, support, containment and collisions (Baillargeon, 2004; Baillargeon, Li, Ng, & Yuan, 2009; Hespos & Baillargeon, 2008). There is no single agreed-upon computational account of these early physical principles and concepts, and previous suggestions have ranged from decision trees (Baillargeon et al., 2009), to cues, to lists of rules (Siegler & Chen, 1998).
A promising recent approach sees intuitive physical reasoning as similar to inference over a physics software engine, the kind of simulators that power modern-day animations and games (Bates, Yildirim, Tenenbaum, & Battaglia, 2015; Battaglia, Hamrick, & Tenenbaum, 2013; Gerstenberg, Goodman, Lagnado, & Tenenbaum, 2015; Sanborn, Mansinghka, & Griffiths, 2013). According to this hypothesis, people reconstruct a perceptual scene using internal representations of the objects and their physically relevant properties (such as mass, elasticity, and surface friction), and forces acting on objects (such as gravity, friction, or collision impulses). Relative to physical ground truth, the intuitive physical state representation is approximate and probabilistic, and oversimplified and incomplete in many ways.

Figure 4: The intuitive physics-engine approach to scene understanding, illustrated through tower stability. (A) The engine takes in inputs through perception, language, memory and other faculties. It then constructs a physical scene with objects, physical properties and forces, simulates the scene’s development over time and hands the output to other reasoning systems. (B) Many possible ‘tweaks’ to the input can result in very different scenes, requiring the potential discovery, training and evaluation of new features for each tweak. Example changes to the input: add blocks, blocks made of styrofoam, blocks made of lead, blocks made of goo, table is made of rubber, table is actually quicksand, pour water on the tower, pour honey on the tower, blue blocks are glued together, red blocks are magnetic, gravity is reversed, wind blows over table, table has slippery ice on top... Adapted from Battaglia et al. (2013).
Still, it is rich enough to support mental simulations that can predict how objects will move in the immediate future, either on their own or in response to forces we might apply. This “intuitive physics engine” approach enables flexible adaptation to a wide range of everyday scenarios and judgments in a way that goes beyond perceptual cues. For example (Figure 4), a physics-engine reconstruction of a tower of wooden blocks from the game Jenga can be used to predict whether (and how) a tower will fall, finding close quantitative fits to how adults make these predictions (Battaglia et al., 2013) as well as simpler kinds of physical predictions that have been studied in infants (Téglás et al., 2011). Simulation-based models can also capture how people make hypothetical or counterfactual predictions: What would happen if certain blocks are taken away, more blocks are added, or the table supporting the tower is jostled? What if certain blocks were glued together, or attached to the table surface? What if the blocks were made of different materials (Styrofoam, lead, ice)? What if the blocks of one color were much heavier than other colors? Each of these physical judgments may require new features or new training for a pattern recognition account to work at the same level as the model-based simulator. What are the prospects for embedding or acquiring this kind of intuitive physics in deep learning systems? Connectionist models in psychology have previously been applied to physical reasoning tasks such as balance-beam rules (McClelland, 1988; Shultz, 2003) or rules relating distance, velocity, and time in motion (Buckingham & Shultz, 2000), but these networks do not attempt to work with complex scenes as input or a wide range of scenarios and judgments as in Figure 4. A recent paper from Facebook AI researchers (Lerer, Gross, & Fergus, 2016) represents an exciting step in this direction.
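The probabilistic-simulation idea can be sketched in a few lines: represent the perceived scene with noisy estimates of physical state, run a simple deterministic stability rule forward many times under that perceptual noise, and read off the probability of an outcome. The toy stability rule below (a tower falls if any block’s center is not over the block beneath it) and the noise level are assumptions for illustration, not the actual model of Battaglia et al. (2013):

```python
import random

def tower_falls(centers, half_width=0.5):
    """Toy stability rule: the tower falls if any block's center is not
    within the footprint (half-width) of the block below it."""
    return any(abs(centers[i] - centers[i - 1]) > half_width
               for i in range(1, len(centers)))

def prob_falls(perceived_centers, noise_sd=0.15, n_samples=5000, seed=0):
    """Monte Carlo 'mental simulation': sample scenes consistent with a
    noisy percept and report the fraction in which the tower falls."""
    rng = random.Random(seed)
    falls = 0
    for _ in range(n_samples):
        sampled = [c + rng.gauss(0, noise_sd) for c in perceived_centers]
        falls += tower_falls(sampled)
    return falls / n_samples

stable_tower = [0.0, 0.05, 0.0]      # well-aligned block x-coordinates
precarious_tower = [0.0, 0.4, 0.8]   # each block near the edge of support
print(prob_falls(stable_tower))      # low probability of falling
print(prob_falls(precarious_tower))  # much higher probability of falling
```

Counterfactual questions like those above (“what if the table is jostled?”) amount to re-running the same simulator with a perturbed state, with no new features or retraining required.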
Lerer et al. (2016) trained a deep convolutional network-based system (PhysNet) to predict the stability of block towers from simulated images similar to those in Figure 4A but with much simpler configurations of two, three or four cubical blocks stacked vertically. Impressively, PhysNet generalized to simple real images of block towers, matching human performance on these images while exceeding human performance on synthetic images. Human and PhysNet confidence were also correlated across towers, although not as strongly as for the approximate probabilistic simulation models and experiments of Battaglia et al. (2013). One limitation is that PhysNet currently requires extensive training – between 100,000 and 200,000 scenes – to learn judgments for just a single task (will the tower fall?) on a narrow range of scenes (towers with two to four cubes). It has been shown to generalize, but only in limited ways (e.g., from towers of two and three cubes to towers of four cubes). In contrast, people require far less experience to perform any particular task, and can generalize to many novel judgments and complex scenes with no new training required (although they receive large amounts of physics experience through interacting with the world more generally). Could deep learning systems such as PhysNet capture this flexibility, without explicitly simulating the causal interactions between objects in three dimensions? We are not sure, but we hope this is a challenge they will take on. Alternatively, instead of trying to make predictions without simulating physics, could neural networks be trained to emulate a general-purpose physics simulator, given the right type and quantity of training data, such as the raw input experienced by a child? This is an active and intriguing area of research, but it too faces significant challenges.
For networks trained on object classification, deeper layers often become sensitive to successively higher-level features, from edges to textures to shape-parts to full objects (Yosinski, Clune, Bengio, & Lipson, 2014; Zeiler & Fergus, 2014). For deep networks trained on physics-related data, it remains to be seen whether higher layers will encode objects, general physical properties, forces and approximately Newtonian dynamics. A generic network trained on dynamic pixel data might learn an implicit representation of these concepts, but would it generalize broadly beyond training contexts as people’s more explicit physical concepts do? Consider for example a network that learns to predict the trajectories of several balls bouncing in a box (Kodratoff & Michalski, 2014). If this network has actually learned something like Newtonian mechanics, then it should be able to generalize to interestingly different scenarios – at a minimum, different numbers of differently shaped objects, bouncing in boxes of different shapes and sizes and orientations with respect to gravity, not to mention more severe generalization tests such as all of the tower tasks discussed above, which also fall under the Newtonian domain. Neural network researchers have yet to take on this challenge, but we hope they will. Whether such models can be learned with the kind (and quantity) of data available to human infants is not clear, as we discuss further in Section 5. It may be difficult to integrate object and physics-based primitives into deep neural networks, but the payoff in terms of learning speed and performance could be great for many tasks. Consider the case of learning to play Frostbite.
Although it can be difficult to discern exactly how a network learns to solve a particular task, the DQN probably does not parse a Frostbite screenshot in terms of stable objects or sprites moving according to the rules of intuitive physics (Figure 2). But incorporating a physics-engine-based representation could help DQNs learn to play games such as Frostbite in a faster and more general way, whether the physics knowledge is captured implicitly in a neural network or more explicitly in a simulator. Beyond reducing the amount of training data and potentially improving the level of performance reached by the DQN, it could eliminate the need to retrain a Frostbite network if the objects (e.g., birds, ice floes and fish) are slightly altered in their behavior, reward structure, or appearance. When a new object type such as a bear is introduced, as in the later levels of Frostbite (Figure 2D), a network endowed with intuitive physics would also have an easier time adding this object type to its knowledge (the challenge of adding new objects was also discussed in Marcus, 1998, 2001). In this way, the integration of intuitive physics and deep learning could be an important step towards more human-like learning algorithms.

4.1.2 Intuitive psychology

Intuitive psychology is another early-emerging ability with an important influence on human learning and thought. Pre-verbal infants distinguish animate agents from inanimate objects. This distinction is partially based on innate or early-present detectors for low-level cues, such as the presence of eyes, motion initiated from rest, and biological motion (Johnson, Slaughter, & Carey, 1998; Premack & Premack, 1997; Schlottmann, Ray, Mitchell, & Demetriou, 2006; Tremoulet & Feldman, 2000). Such cues are often sufficient but not necessary for the detection of agency.
Beyond these low-level cues, infants also expect agents to act contingently and reciprocally, to have goals, and to take efficient actions towards those goals subject to constraints (Csibra, 2008; Csibra, Biro, Koos, & Gergely, 2003; Spelke & Kinzler, 2007). These goals can be socially directed; at around three months of age, infants begin to discriminate anti-social agents that hurt or hinder others from neutral agents (Hamlin, 2013; Hamlin, Wynn, & Bloom, 2010), and they later distinguish between anti-social, neutral, and pro-social agents (Hamlin, Ullman, Tenenbaum, Goodman, & Baker, 2013; Hamlin, Wynn, & Bloom, 2007). It is generally agreed that infants expect agents to act in a goal-directed, efficient, and socially sensitive fashion (Spelke & Kinzler, 2007). What is less agreed on is the computational architecture that supports this reasoning, and whether it includes any reference to mental states and explicit goals. One possibility is that intuitive psychology is simply cues “all the way down” (Schlottmann, Cole, Watts, & White, 2013; Scholl & Gao, 2013), though this would require more and more cues as the scenarios become more complex. Consider for example a scenario in which an agent A is moving towards a box, and an agent B moves in a way that blocks A from reaching the box. Infants and adults are likely to interpret B’s behavior as ‘hindering’ (Hamlin, 2013). This inference could be captured by a cue that states ‘if an agent’s expected trajectory is prevented from completion, the blocking agent is given some negative association.’ While the cue is easily calculated, the scenario is also easily changed to necessitate a different type of cue. Suppose A was already negatively associated (a ‘bad guy’); acting negatively towards A could then be seen as good (Hamlin, 2013). Or suppose something harmful was in the box which A didn’t know about.
Now B would be seen as helping, protecting, or defending A. Suppose A knew there was something bad in the box and wanted it anyway. B could be seen as acting paternalistically. A cue-based account would be twisted into gnarled combinations such as 'If an expected trajectory is prevented from completion, the blocking agent is given some negative association, unless that trajectory leads to a negative outcome or the blocking agent is previously associated as positive, or the blocked agent is previously associated as negative, or...'

One alternative to a cue-based account is to use generative models of action choice, as in the Bayesian inverse planning (or "Bayesian theory-of-mind") models of Baker, Saxe, and Tenenbaum (2009) or the "naive utility calculus" models of Jara-Ettinger, Gweon, Tenenbaum, and Schulz (2015) (see also Jern and Kemp (2015) and Tauber and Steyvers (2011), and a related alternative based on predictive coding from Kilner, Friston, and Frith (2007)). These models formalize explicitly mentalistic concepts such as 'goal,' 'agent,' 'planning,' 'cost,' 'efficiency,' and 'belief,' used to describe core psychological reasoning in infancy. They assume adults and children treat agents as approximately rational planners who choose the most efficient means to their goals. Planning computations may be formalized as solutions to Markov Decision Processes (or POMDPs), taking as input utility and belief functions defined over an agent's state-space and the agent's state-action transition functions, and returning a series of actions the agent should perform to most efficiently fulfill their goals (or maximize their utility). By simulating these planning processes, people can predict what agents might do next, or reason backwards from an observed series of actions to infer the utilities and beliefs of agents in a scene.
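To make the inverse-planning idea concrete, here is a minimal sketch in Python. The toy 1-D world, the softmax ("noisily rational") policy, and the uniform prior over two candidate goals are all illustrative simplifications of our own devising, not the published models:

```python
import math

# Minimal sketch of Bayesian inverse planning: an agent moves on a 1-D line
# of positions 0..4; candidate goals are the endpoints. We score observed
# actions under a noisily-rational policy and invert with Bayes' rule.

ACTIONS = (-1, +1)  # step left, step right
BETA = 3.0          # rationality: higher = more deterministic planner

def action_value(pos, action, goal):
    """Negative distance-to-goal after acting: efficient moves score higher."""
    new_pos = min(max(pos + action, 0), 4)
    return -abs(goal - new_pos)

def policy(pos, goal):
    """Softmax distribution over actions given a goal."""
    weights = {a: math.exp(BETA * action_value(pos, a, goal)) for a in ACTIONS}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def posterior_over_goals(start, actions, goals=(0, 4)):
    """P(goal | actions), assuming a uniform prior over candidate goals."""
    scores = {}
    for goal in goals:
        pos, like = start, 1.0
        for a in actions:
            like *= policy(pos, goal)[a]
            pos = min(max(pos + a, 0), 4)
        scores[goal] = like
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

# Watching two rightward steps from position 2 strongly favors the goal at 4.
post = posterior_over_goals(start=2, actions=[+1, +1])
assert post[4] > post[0]
```

The same machinery runs forward (simulate the planner to predict the next action) or backward (invert it to infer goals), which is the flexibility the cue-based account lacks.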
This is directly analogous to how simulation engines can be used for intuitive physics, to predict what will happen next in a scene or to infer objects' dynamical properties from how they move. It yields similarly flexible reasoning abilities: utilities and beliefs can be adjusted to take into account how agents might act for a wide range of novel goals and situations. Importantly, unlike in intuitive physics, simulation-based reasoning in intuitive psychology can be nested recursively to understand social interactions – we can think about agents thinking about other agents.

As in the case of intuitive physics, the success that generic deep networks will have in capturing intuitive psychological reasoning will depend in part on the representations humans use. Although deep networks have not yet been applied to scenarios involving theory-of-mind and intuitive psychology, they could probably learn visual cues, heuristics, and summary statistics of a scene that happens to involve agents. [5] If that is all that underlies human psychological reasoning, a data-driven deep learning approach can likely find success in this domain. However, it seems to us that any full formal account of intuitive psychological reasoning needs to include representations of agency, goals, efficiency, and reciprocal relations. As with objects and forces, it is unclear whether a complete representation of these concepts (agents, goals, etc.) could emerge from deep neural networks trained in a purely predictive capacity. Similar to the intuitive physics domain, it is possible that with a tremendous number of training trajectories in a variety of scenarios, deep learning techniques could approximate the reasoning found in infancy even without learning anything about goal-directed or socially directed behavior more generally.
But this is also unlikely to resemble how humans learn, understand, and apply intuitive psychology unless the concepts are genuine. In the same way that altering the setting of a scene or the target of inference in a physics-related task may be difficult to generalize without an understanding of objects, altering the setting of an agent or their goals and beliefs is difficult to reason about without understanding intuitive psychology.

In introducing the Frostbite challenge, we discussed how people can learn to play the game extremely quickly by watching an experienced player for just a few minutes and then playing a few rounds themselves. Intuitive psychology provides a basis for efficient learning from others, especially in teaching settings with the goal of communicating knowledge efficiently (Shafto, Goodman, & Griffiths, 2014). In the case of watching an expert play Frostbite, whether or not there is an explicit goal to teach, intuitive psychology lets us infer the beliefs, desires, and intentions of the experienced player. For instance, we can learn that the birds are to be avoided from seeing how the experienced player appears to avoid them. We do not need to experience a single example of encountering a bird – and watching the Frostbite Bailey die because of the bird – in order to infer that birds are probably dangerous. It is enough to see that the experienced player's avoidance behavior is best explained as acting under that belief. Similarly, consider how a sidekick agent (increasingly popular in video games) is expected to help a player achieve their goals.

[Footnote 5] While connectionist networks have been used to model the general transition that children undergo between the ages of 3 and 4 regarding false belief (e.g., Berthiaume, Shultz, & Onishi, 2013), we are referring here to scenarios which require inferring goals, utilities, and relations.
This agent can be useful in different ways under different circumstances, such as getting items, clearing paths, fighting, defending, healing, and providing information – all under the general notion of being helpful (Macindoe, 2013). An explicit agent representation can predict how such an agent will be helpful in new circumstances, while a bottom-up pixel-based representation is likely to struggle.

There are several ways that intuitive psychology could be incorporated into contemporary deep learning systems. While it could be built in, intuitive psychology may arise in other ways. Connectionists have argued that innate constraints in the form of hard-wired cortical circuits are unlikely (Elman, 2005; Elman et al., 1996), but a simple inductive bias, for example the tendency to notice things that move other things, can bootstrap reasoning about more abstract concepts of agency (S. Ullman, Harari, & Dorfman, 2012). [6] Similarly, a great deal of goal-directed and socially directed action can also be boiled down to a simple utility calculus (e.g., Jara-Ettinger et al., 2015), in a way that could be shared with other cognitive abilities. While the origins of intuitive psychology are still a matter of debate, it is clear that these abilities are early-emerging and play an important role in human learning and thought, as exemplified in the Frostbite challenge and when learning to play novel video games more broadly.

4.2 Learning as rapid model building

Since their inception, neural network models have stressed the importance of learning. There are many learning algorithms for neural networks, including the perceptron algorithm (Rosenblatt, 1958), Hebbian learning (Hebb, 1949), the BCM rule (Bienenstock, Cooper, & Munro, 1982), backpropagation (Rumelhart, Hinton, & Williams, 1986), the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995), and contrastive divergence (Hinton, 2002).
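As a concrete illustration of the simplest entry on this list, here is the classic perceptron rule on a toy problem. The AND-gate data set, learning rate, and epoch count are our own illustrative choices:

```python
# The perceptron rule (Rosenblatt, 1958): weights are gradually adjusted
# whenever the current linear threshold unit misclassifies an input.

def train_perceptron(data, epochs=20, lr=0.1):
    w = [0.0, 0.0]  # one weight per input feature
    b = 0.0         # bias term
    for _ in range(epochs):
        for x, target in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            error = target - pred          # 0 when correct: no change
            w[0] += lr * error * x[0]      # nudge weights toward the target
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_DATA)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
assert [predict(x) for x, _ in AND_DATA] == [0, 0, 0, 1]
```

Every algorithm on the list above shares this character of gradual, error- or statistics-driven adjustment of connection strengths, which is the point developed next.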
Whether the goal is supervised or unsupervised learning, these algorithms implement learning as a process of gradual adjustment of connection strengths. For supervised learning, the updates are usually aimed at improving the algorithm's pattern recognition capabilities. For unsupervised learning, the updates work towards gradually matching the statistics of the model's internal patterns with the statistics of the input data.

[Footnote 6] We must be careful here about what "simple" means. An inductive bias may appear simple in the sense that we can compactly describe it, but it may require complex computation (e.g., motion analysis, parsing images into objects, etc.) just to produce its inputs in a suitable form.

In recent years, machine learning has found particular success using backpropagation and large data sets to solve difficult pattern recognition problems. While these algorithms have reached human-level performance on several challenging benchmarks, they are still far from matching human-level learning in other ways. Deep neural networks often need more data than people do in order to solve the same types of problems, whether it is learning to recognize a new type of object or learning to play a new game. When learning the meanings of words in their native language, children make meaningful generalizations from very sparse data (Carey & Bartlett, 1978; Landau, Smith, & Jones, 1988; E. M. Markman, 1989; Smith, Jones, Landau, Gershkoff-Stowe, & Samuelson, 2002; F. Xu & Tenenbaum, 2007; although see Horst and Samuelson 2008 regarding memory limitations). Children may only need to see a few examples of the concepts hairbrush, pineapple, or lightsaber before they largely 'get it,' grasping the boundary of the infinite set that defines each concept from the infinite set of all possible objects.
Children are far more practiced than adults at learning new concepts – learning roughly nine or ten new words each day after beginning to speak through the end of high school (Bloom, 2000; Carey, 1978) – yet the ability for rapid "one-shot" learning does not disappear in adulthood. An adult may need to see just a single image or movie of a novel two-wheeled vehicle to infer the boundary between this concept and others, allowing him or her to discriminate new examples of that concept from similar-looking objects of a different type (Fig. 1B-i).

In contrast with the efficiency of human learning, neural networks – by virtue of their generality as highly flexible function approximators – are notoriously data hungry (the bias/variance dilemma; Geman, Bienenstock, & Doursat, 1992). Benchmark tasks such as the ImageNet data set for object recognition provide hundreds or thousands of examples per class (Krizhevsky et al., 2012; Russakovsky et al., 2015) – 1000 hairbrushes, 1000 pineapples, etc. In the context of learning new handwritten characters or learning to play Frostbite, the MNIST benchmark includes 6000 examples of each handwritten digit (LeCun et al., 1998), and the DQN of V. Mnih et al. (2015) played each Atari video game for approximately 924 hours of unique training experience (Figure 3). In both cases, the algorithms are clearly using information less efficiently than a person learning to perform the same tasks.

It is also important to mention that there are many classes of concepts that people learn more slowly. Concepts that are learned in school are usually far more challenging and more difficult to acquire, including mathematical functions, logarithms, derivatives, integrals, atoms, electrons, gravity, DNA, evolution, etc. There are also domains for which machine learners outperform human learners, such as combing through financial or weather data.
But for the vast majority of cognitively natural concepts – the types of things that children learn as the meanings of words – people are still far better learners than machines. This is the type of learning we focus on in this section, which is more suitable for the enterprise of reverse engineering and articulating additional principles that make human learning successful. It also opens the possibility of building these ingredients into the next generation of machine learning and AI algorithms, with potential for making progress on learning concepts that are both easy and difficult for humans to acquire.

Even with just a few examples, people can learn remarkably rich conceptual models. One indicator of richness is the variety of functions that these models support (A. B. Markman & Ross, 2003; Solomon, Medin, & Lynch, 1999). Beyond classification, concepts support prediction (Murphy & Ross, 1994; Rips, 1975), action (Barsalou, 1983), communication (A. B. Markman & Makin, 1998), imagination (Jern & Kemp, 2013; Ward, 1994), explanation (Lombrozo, 2009; Williams & Lombrozo, 2010), and composition (Murphy, 1988; Osherson & Smith, 1981). These abilities are not independent; rather they hang together and interact (Solomon et al., 1999), coming for free with the acquisition of the underlying concept. Returning to the previous example of a novel two-wheeled vehicle, a person can sketch a range of new instances (Figure 1B-ii), parse the concept into its most important components (Figure 1B-iii), or even create a new complex concept through the combination of familiar concepts (Figure 1B-iv). Likewise, as discussed in the context of Frostbite, a learner who has acquired the basics of the game could flexibly apply their knowledge to an infinite set of Frostbite variants (Section 3.2).
The acquired knowledge supports reconfiguration to new tasks and new demands, such as modifying the goals of the game to survive while acquiring as few points as possible, or to efficiently teach the rules to a friend. This richness and flexibility suggests that learning as model building is a better metaphor than learning as pattern recognition. Furthermore, the human capacity for one-shot learning suggests that these models are built upon rich domain knowledge rather than starting from a blank slate (Mikolov, Joulin, & Baroni, 2016; Mitchell, Keller, & Kedar-Cabelli, 1986).

In contrast, much of the recent progress in deep learning has been on pattern recognition problems, including object recognition, speech recognition, and (model-free) video game learning, that utilize large data sets and little domain knowledge. There has been recent work on other types of tasks, including learning generative models of images (Denton, Chintala, Szlam, & Fergus, 2015; Gregor, Danihelka, Graves, Rezende, & Wierstra, 2015), caption generation (Karpathy & Fei-Fei, 2015; Vinyals, Toshev, Bengio, & Erhan, 2014; K. Xu et al., 2015), question answering (Sukhbaatar, Szlam, Weston, & Fergus, 2015; Weston, Chopra, & Bordes, 2015), and learning simple algorithms (Graves, Wayne, & Danihelka, 2014; Grefenstette, Hermann, Suleyman, & Blunsom, 2015); we discuss question answering and learning simple algorithms in Section 6.1. Yet, at least for image and caption generation, these tasks have been mostly studied in the big data setting that is at odds with the impressive human ability for generalizing from small data sets (although see Rezende, Mohamed, Danihelka, Gregor, & Wierstra, 2016, for a deep learning approach to the Character Challenge). And it has been difficult to learn neural-network-style representations that effortlessly generalize to new tasks that they were not trained on (see Davis & Marcus, 2015; Marcus, 1998, 2001).
What additional ingredients may be needed in order to rapidly learn more powerful and more general-purpose representations? A relevant case study comes from our own work on the Characters Challenge (Section 3.1; Lake, 2014; Lake, Salakhutdinov, & Tenenbaum, 2015). People and various machine learning approaches were compared on their ability to learn new handwritten characters from the world's alphabets. In addition to evaluating several types of deep learning models, we developed an algorithm using Bayesian Program Learning (BPL) that represents concepts as simple stochastic programs – that is, structured procedures that generate new examples of a concept when executed (Figure 5A). These programs allow the model to express causal knowledge about how the raw data are formed, and the probabilistic semantics allow the model to handle noise and perform creative tasks. Structure sharing across concepts is accomplished by the compositional reuse of stochastic primitives that can combine in new ways to create new concepts.

Figure 5: A causal, compositional model of handwritten characters. A) New types are generated compositionally by choosing primitive actions (color coded) from a library (i), combining these sub-parts (ii) to make parts (iii), and combining parts with relations to define simple programs (iv). These programs can create different tokens of a concept (v) that are rendered as binary images (vi). B) Probabilistic inference allows the model to generate new examples from just one example of a new concept, shown here in a visual Turing test. An example image of a new concept is shown above each pair of grids. One grid was generated by 9 people and the other is 9 samples from the BPL model. Which grid in each pair (A or B) was generated by the machine? Answers by row: 1,2;1,1. Adapted from Lake, Salakhutdinov, and Tenenbaum (2015).

Note that we are overloading the word "model" to refer to both the BPL framework as a whole (which is a generative model), as well as the individual probabilistic models (or concepts) that it infers from images to represent novel handwritten characters. There is a hierarchy of models: a higher-level program generates different types of concepts, which are themselves programs that can be run to generate tokens of a concept. Here, describing learning as "rapid model building" refers to the fact that BPL constructs generative models (lower-level programs) that produce tokens of a concept (Figure 5B).

Learning models of this form allows BPL to perform a challenging one-shot classification task at human-level performance (Figure 1A-i) and to outperform current deep learning models such as convolutional networks (Koch, Zemel, & Salakhutdinov, 2015). [7] The representations that BPL learns also enable it to generalize in other, more creative, human-like ways, as evaluated using "visual Turing tests" (e.g., Figure 5B). These tasks include generating new examples (Figure 1A-ii and Figure 5B), parsing objects into their essential components (Figure 1A-iii), and generating new concepts in the style of a particular alphabet (Figure 1A-iv). The following sections discuss the three main ingredients – compositionality, causality, and learning-to-learn – that were important to the success of this framework and that we believe are important to understanding human learning as rapid model building more broadly. While these ingredients fit naturally within a BPL or a probabilistic program induction framework, they could also be integrated into deep learning models and other types of machine learning algorithms, prospects we discuss in more detail below.
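The type/token hierarchy can be caricatured in a few lines of Python. The stroke library, relation types, and motor-noise model below are invented for illustration and are far simpler than the actual BPL model:

```python
import random

# Caricature of Bayesian Program Learning: a character "type" is a small
# stochastic program (stroke primitives plus relations), and running the
# program with motor noise yields distinct "tokens" of that type.

PRIMITIVES = ["arc", "line", "loop", "hook"]  # hypothetical stroke library

def sample_type(rng):
    """Higher-level program: compose a new concept from reusable primitives."""
    n_parts = rng.randint(1, 3)
    parts = [rng.choice(PRIMITIVES) for _ in range(n_parts)]
    relations = [rng.choice(["attach_start", "attach_end"])
                 for _ in range(n_parts - 1)]
    return {"parts": parts, "relations": relations}

def sample_token(concept, rng):
    """Lower-level program: run the type with per-token motor variability."""
    return [(part, round(rng.gauss(1.0, 0.05), 3))  # jittered stroke scale
            for part in concept["parts"]]

rng = random.Random(0)
concept = sample_type(rng)                               # one new character type
tokens = [sample_token(concept, rng) for _ in range(3)]  # three of its tokens
# Every token shares the type's parts; only the motor noise differs.
assert all([p for p, _ in tok] == concept["parts"] for tok in tokens)
```

One-shot learning in this framing amounts to inferring the most probable type-level program given a single observed token, which the real model does with probabilistic inference over strokes and relations.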
[Footnote 7] A new approach using convolutional "matching networks" achieves good one-shot classification performance when discriminating between characters from different alphabets (Vinyals, Blundell, Lillicrap, Kavukcuoglu, & Wierstra, 2016). It has not yet been directly compared with BPL, which was evaluated on one-shot classification with characters from the same alphabet.

4.2.1 Compositionality

Compositionality is the classic idea that new representations can be constructed through the combination of primitive elements. In computer programming, primitive functions can be combined together to create new functions, and these new functions can be further combined to create even more complex functions. This function hierarchy provides an efficient description of higher-level functions, like a part hierarchy for describing complex objects or scenes (Bienenstock, Geman, & Potter, 1997). Compositionality is also at the core of productivity: an infinite number of representations can be constructed from a finite set of primitives, just as the mind can think an infinite number of thoughts, utter or understand an infinite number of sentences, or learn new concepts from a seemingly infinite space of possibilities (Fodor, 1975; Fodor & Pylyshyn, 1988; Marcus, 2001; Piantadosi, 2011).

Compositionality has been broadly influential in both AI and cognitive science, especially as it pertains to theories of object recognition, conceptual representation, and language. Here we focus on compositional representations of object concepts for illustration. Structural description models represent visual concepts as compositions of parts and relations, which provides a strong inductive bias for constructing models of new concepts (Biederman, 1987; Hummel & Biederman, 1992; Marr & Nishihara, 1978; van den Hengel et al., 2015; Winston, 1975).
For instance, the novel two-wheeled vehicle in Figure 1B might be represented as two wheels connected by a platform, which provides the base for a post, which holds the handlebars, etc. Parts can themselves be composed of sub-parts, forming a "partonomy" of part-whole relationships (G. A. Miller & Johnson-Laird, 1976; Tversky & Hemenway, 1984). In the novel vehicle example, the parts and relations can be shared and reused from existing related concepts, such as cars, scooters, motorcycles, and unicycles. Since the parts and relations are themselves a product of previous learning, their facilitation of the construction of new models is also an example of learning-to-learn – another ingredient that is covered below. While compositionality and learning-to-learn fit naturally together, there are also forms of compositionality that rely less on previous learning, such as the bottom-up parts-based representation of Hoffman and Richards (1984).

Learning models of novel handwritten characters can be operationalized in a similar way. Handwritten characters are inherently compositional, where the parts are pen strokes and relations describe how these strokes connect to each other. Lake, Salakhutdinov, and Tenenbaum (2015) modeled these parts using an additional layer of compositionality, where parts are complex movements created from simpler sub-part movements. New characters can be constructed by combining parts, sub-parts, and relations in novel ways (Figure 5). Compositionality is also central to the construction of other types of symbolic concepts beyond characters, where new spoken words can be created through a novel combination of phonemes (Lake, Lee, Glass, & Tenenbaum, 2014) or a new gesture or dance move can be created through a combination of more primitive body movements.

An efficient representation for Frostbite should be similarly compositional and productive.
A scene from the game is a composition of various object types, including birds, fish, ice floes, igloos, etc. (Figure 2). Representing this compositional structure explicitly is both more economical and better for generalization, as noted in previous work on object-oriented reinforcement learning (Diuk, Cohen, & Littman, 2008). Many repetitions of the same objects are present at different locations in the scene, and thus representing each as an identical instance of the same object with the same properties is important for efficient representation and quick learning of the game. Further, new levels may contain different numbers and combinations of objects, where a compositional representation of objects – using intuitive physics and intuitive psychology as glue – would aid in making these crucial generalizations (Figure 2D).

Figure 6: Perceiving scenes without intuitive physics, intuitive psychology, compositionality, and causality. Image captions are generated by a deep neural network (Karpathy & Fei-Fei, 2015) using code from github.com/karpathy/neuraltalk2. The generated captions read "an airplane is parked on the tarmac at an airport," "a group of people standing on top of a beach," and "a woman riding a horse on a dirt road." Image credits: Gabriel Villena Fernández (left), TVBS Taiwan / Agence France-Presse (middle), and AP Photo / Dave Martin (right). Similar examples using images from Reuters news can be found at twitter.com/interesting_jpg.

Deep neural networks have at least a limited notion of compositionality. Networks trained for object recognition encode part-like features in their deeper layers (Zeiler & Fergus, 2014), whereby the presentation of new types of objects can activate novel combinations of feature detectors.
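A minimal sketch of what such an explicit, object-oriented scene representation might look like. The object types, their properties, and the layout are invented for illustration:

```python
# Object-oriented scene representation: each object type is defined once,
# and a scene is a list of instances with positions. Knowledge attached to
# a type (e.g., "birds are dangerous") transfers to every instance, in any
# level, without retraining on pixels.

OBJECT_TYPES = {
    "bird":     {"dangerous": True,  "moves": True},
    "fish":     {"dangerous": False, "moves": True},
    "ice_floe": {"dangerous": False, "moves": True},
}

def make_scene(counts):
    """Instantiate a scene from per-type counts; properties are shared."""
    scene = []
    for obj_type, n in counts.items():
        for i in range(n):
            scene.append({"type": obj_type, "pos": (i * 10, 0)})
    return scene

def dangerous_objects(scene):
    """The avoidance rule lives with the type, so it generalizes to any level."""
    return [o for o in scene if OBJECT_TYPES[o["type"]]["dangerous"]]

level_1 = make_scene({"bird": 3, "ice_floe": 4})
level_9 = make_scene({"bird": 6, "fish": 2, "ice_floe": 8})  # new object mix
assert len(dangerous_objects(level_1)) == 3
assert len(dangerous_objects(level_9)) == 6
```

A new level with different numbers and combinations of objects is just a different instance list over the same shared types, which is the economy and generalization the text argues for.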
Similarly, a DQN trained to play Frostbite may learn to represent multiple replications of the same object with the same features, facilitated by the invariance properties of a convolutional neural network architecture. Recent work has shown how this type of compositionality can be made more explicit, where neural networks are used for efficient inference in more structured generative models (both neural networks and 3D scene models) that explicitly represent the number of objects in a scene (Eslami et al., 2016). Beyond the compositionality inherent in parts, objects, and scenes, compositionality can also be important at the level of goals and sub-goals. Recent work on hierarchical-DQNs shows that by providing explicit object representations to a DQN, and then defining sub-goals based on reaching those objects, DQNs can learn to play games with sparse rewards (such as Montezuma's Revenge) by combining these sub-goals together to achieve larger goals (Kulkarni, Narasimhan, Saeedi, & Tenenbaum, 2016). We look forward to seeing these new ideas continue to develop, potentially providing even richer notions of compositionality in deep neural networks that lead to faster and more flexible learning.

To capture the full extent of the mind's compositionality, a model must include explicit representations of objects, identity, and relations – all while maintaining a notion of "coherence" when understanding novel configurations. Coherence is related to our next principle, causality, which is discussed in the section that follows.

4.2.2 Causality

In concept learning and scene understanding, causal models represent hypothetical real-world processes that produce the perceptual observations. In control and reinforcement learning, causal models represent the structure of the environment, such as modeling state-to-state transitions or action/state-to-state transitions.
Concept learning and vision models that utilize causality are usually generative (as opposed to discriminative; see Glossary in Table 1), but not every generative model is also causal. While a generative model describes a process for generating data, or at least assigns a probability distribution over possible data points, this generative process may not resemble how the data are produced in the real world. Causality refers to the subclass of generative models that resemble, at an abstract level, how the data are actually generated. While generative neural networks such as Deep Belief Networks (Hinton, Osindero, & Teh, 2006) or variational auto-encoders (Gregor, Besse, Rezende, Danihelka, & Wierstra, 2016; Kingma, Rezende, Mohamed, & Welling, 2014) may generate compelling handwritten digits, they mark one end of the "causality spectrum," since the steps of the generative process bear little resemblance to the steps of the actual process of writing. In contrast, the generative model for characters using Bayesian Program Learning (BPL) does resemble the steps of writing, although even more causally faithful models are possible.

Causality has been influential in theories of perception. "Analysis-by-synthesis" theories of perception maintain that sensory data can be more richly represented by modeling the process that generated it (Bever & Poeppel, 2010; Eden, 1962; Halle & Stevens, 1962; Neisser, 1966). Relating data to its causal source provides strong priors for perception and learning, as well as a richer basis for generalizing in new ways and to new tasks. The canonical examples of this approach are speech and visual perception.
For instance, Liberman, Cooper, Shankweiler, and Studdert-Kennedy (1967) argued that the richness of speech perception is best explained by inverting the production plan, at the level of vocal tract movements, in order to explain the large amounts of acoustic variability and the blending of cues across adjacent phonemes. As discussed, causality does not have to be a literal inversion of the actual generative mechanisms, as proposed in the motor theory of speech. In the BPL model of learning handwritten characters, causality is operationalized by treating concepts as motor programs, or abstract causal descriptions of how to produce examples of the concept, rather than concrete configurations of specific muscles (Figure 5A). Causality is an important factor in the model's success in classifying and generating new examples after seeing just a single example of a new concept (Lake, Salakhutdinov, & Tenenbaum, 2015) (Figure 5B).

Causal knowledge has also been shown to influence how people learn new concepts; providing a learner with different types of causal knowledge changes how they learn and generalize. For example, the structure of the causal network underlying the features of a category influences how people categorize new examples (Rehder, 2003; Rehder & Hastie, 2001). Similarly, as related to the Characters Challenge, the way people learn to write a novel handwritten character influences later perception and categorization (Freyd, 1983, 1987).

To explain the role of causality in learning, conceptual representations have been likened to intuitive theories or explanations, providing the glue that lets core features stick while other equally applicable features wash away (Murphy & Medin, 1985).
Borrowing examples from Murphy and Medin (1985), the feature "flammable" is more closely attached to wood than to money due to the underlying causal roles of the concepts, even though the feature is equally applicable to both; these causal roles derive from the functions of objects. Causality can also glue some features together by relating them to a deeper underlying cause, explaining why some features such as "can fly," "has wings," and "has feathers" co-occur across objects while others do not.

Beyond concept learning, people also understand scenes by building causal models. Human-level scene understanding involves composing a story that explains the perceptual observations, drawing upon and integrating the ingredients of intuitive physics, intuitive psychology, and compositionality. Perception without these ingredients, and absent the causal glue that binds them together, can lead to revealing errors. Consider the image captions generated by a deep neural network (Figure 6; Karpathy & Fei-Fei, 2015). In many cases, the network gets the key objects in a scene correct but fails to understand the physical forces at work, the mental states of the people, or the causal relationships between the objects – in other words, it does not build the right causal model of the data.

There have been steps towards deep neural networks and related approaches that learn causal models. Lopez-Paz, Muandet, Schölkopf, and Tolstikhin (2015) introduced a discriminative, data-driven framework for distinguishing the direction of causality from examples. While it outperforms existing methods on various causal prediction tasks, it is unclear how to apply the approach to inferring rich hierarchies of latent causal variables, as needed for the Frostbite Challenge and (especially) the Characters Challenge.
Graves (2014) learned a generative model of cursive handwriting using a recurrent neural network trained on handwriting data. While it synthesizes impressive examples of handwriting in various styles, it requires a large training corpus and has not been applied to other tasks. The DRAW network performs both recognition and generation of handwritten digits using recurrent neural networks with a window of attention, producing a limited circular area of the image at each time step (Gregor et al., 2015). A more recent variant of DRAW was applied to generating examples of a novel character from just a single training example (Rezende et al., 2016). While the model demonstrates an impressive ability to make plausible generalizations that go beyond the training examples, it generalizes too broadly in other cases, in ways that are not especially human-like. It is not clear that it could yet pass any of the “visual Turing tests” in Lake, Salakhutdinov, and Tenenbaum (2015) (Figure 5B), although we hope DRAW-style networks will continue to be extended and enriched, and could be made to pass these tests.

Incorporating causality may greatly improve these deep learning models; they were trained without access to causal data about how characters are actually produced, and without any incentive to learn the true causal process. An attentional window is only a crude approximation to the true causal process of drawing with a pen, and in Rezende et al. (2016) the attentional window is not pen-like at all, although a more accurate pen model could be incorporated.
We anticipate that these sequential generative neural networks could make sharper one-shot inferences – with the goal of tackling the full Characters Challenge – by incorporating additional causal, compositional, and hierarchical structure (and by continuing to utilize learning-to-learn, described next), potentially leading to a more computationally efficient and neurally grounded variant of the BPL model of handwritten characters (Figure 5).

A causal model of Frostbite would have to be more complex, gluing together object representations and explaining their interactions with intuitive physics and intuitive psychology, much like the game engine that generates the game dynamics and ultimately the frames of pixel images. Inference is the process of inverting this causal generative model, explaining the raw pixels as objects and their interactions, such as the agent stepping on an ice floe to deactivate it or a crab pushing the agent into the water (Figure 2). Deep neural networks could play a role in two ways: serving as a bottom-up proposer to make probabilistic inference more tractable in a structured generative model (Section 4.3.1), or serving as the causal generative model itself if imbued with the right set of ingredients.

4.2.3 Learning-to-learn

When humans or machines make inferences that go far beyond the data, strong prior knowledge (or inductive biases or constraints) must be making up the difference (Geman et al., 1992; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010; Tenenbaum, Kemp, Griffiths, & Goodman, 2011).
One way people acquire this prior knowledge is through “learning-to-learn,” a term introduced by Harlow (1949) and closely related to the machine learning notions of “transfer learning,” “multi-task learning,” and “representation learning.” These terms refer to ways that learning a new task (or a new concept) can be accelerated through previous or parallel learning of other related tasks (or other related concepts). The strong priors, constraints, or inductive bias needed to learn a particular task quickly are often shared to some extent with other related tasks. A range of mechanisms have been developed to adapt the learner's inductive bias as they learn specific tasks, and then apply these inductive biases to new tasks.

In hierarchical Bayesian modeling (Gelman, Carlin, Stern, & Rubin, 2004), a general prior on concepts is shared by multiple specific concepts, and the prior itself is learned over the course of learning the specific concepts (Salakhutdinov, Tenenbaum, & Torralba, 2012, 2013). These models have been used to explain the dynamics of human learning-to-learn in many areas of cognition, including word learning, causal learning, and learning intuitive theories of physical and social domains (Tenenbaum et al., 2011). In machine vision, for deep convolutional networks or other discriminative methods that form the core of recent recognition systems, learning-to-learn can occur through the sharing of features between the models learned for old objects (or old tasks) and the models learned for new objects (or new tasks) (Anselmi et al., 2016; Baxter, 2000; Bottou, 2014; Lopez-Paz, Bottou, Schölkopf, & Vapnik, 2016; Rusu et al., 2016; Salakhutdinov, Torralba, & Tenenbaum, 2011; Srivastava & Salakhutdinov, 2013; Torralba, Murphy, & Freeman, 2007; Zeiler & Fergus, 2014).
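To make the hierarchical Bayesian route to learning-to-learn concrete, here is a minimal empirical-Bayes sketch. It is not the BPL model; the generative setup, distributions, and all numbers are invented for illustration. A shared Beta prior over concept-level feature probabilities is estimated from many background concepts, and that learned prior then sharpens a one-shot inference about a brand-new concept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate "background" concepts: each concept c has a feature probability
# theta_c drawn from a shared Beta(a_true, b_true) prior (values illustrative).
a_true, b_true = 8.0, 2.0
n_concepts, n_obs = 200, 20
thetas = rng.beta(a_true, b_true, n_concepts)
counts = rng.binomial(n_obs, thetas)        # feature observations per concept

# Learning-to-learn step: estimate the shared prior from the background
# concepts by moment matching (binomial noise correction omitted for brevity).
p_hat = counts / n_obs
m, v = p_hat.mean(), p_hat.var()
common = m * (1 - m) / v - 1
a_hat, b_hat = m * common, (1 - m) * common

# A new concept observed exactly once (one positive example): posterior mean
# under the learned prior versus a flat Beta(1, 1) prior.
one_shot_learned = (a_hat + 1) / (a_hat + b_hat + 1)
one_shot_flat = (1 + 1) / (1 + 1 + 1)
print(one_shot_learned, one_shot_flat)
```

The learned prior pulls the one-shot estimate toward the typical feature probability of the background concepts, while the flat prior leaves the learner far less certain; this is the sense in which the prior "makes up the difference" when data are sparse.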
Neural networks can also learn-to-learn by optimizing hyperparameters, including the form of their weight update rule (Andrychowicz et al., 2016), over a set of related tasks. While transfer learning and multi-task learning are already important themes across AI, and in deep learning in particular, they have not yet led to systems that learn new tasks as rapidly and flexibly as humans do. Capturing more human-like learning-to-learn dynamics in deep networks and other machine learning approaches could facilitate much stronger transfer to new tasks and new problems. To gain the full benefit that humans get from learning-to-learn, however, AI systems might first need to adopt the more compositional (or more language-like, see Section 5) and causal forms of representations that we have argued for above.

We can see this potential in both of our Challenge problems. In the Characters Challenge as presented in Lake, Salakhutdinov, and Tenenbaum (2015), all viable models use “pre-training” on many character concepts in a background set of alphabets to tune the representations they use to learn new character concepts in a test set of alphabets. But to perform well, current neural network approaches require much more pre-training than do people or our Bayesian program learning approach, and they are still far from solving the Characters Challenge.⁸ We cannot be sure how people get to the knowledge they have in this domain, but we do understand how this works in BPL, and we think people might be similar. BPL transfers readily to new concepts because it learns about object parts, sub-parts, and relations, capturing learning about what each concept is like and what concepts are like in general. It is crucial that learning-to-learn occurs at multiple levels of the hierarchical generative process.
Previously learned primitive actions and larger generative pieces can be re-used and re-combined to define new generative models for new characters (Figure 5A). Further transfer occurs by learning about the typical levels of variability within a typical generative model; this provides knowledge about how far and in what ways to generalize when we have seen only one example of a new character, which on its own could not possibly carry any information about variance. BPL could also benefit from deeper forms of learning-to-learn than it currently does: Some of the important structure it exploits to generalize well is built into the prior and not learned from the background pre-training, whereas people might learn this knowledge, and ultimately a human-like machine learning system should as well.

Analogous learning-to-learn occurs for humans in learning many new object models, in vision and cognition: Consider the novel two-wheeled vehicle in Figure 1B, where learning-to-learn can operate through the transfer of previously learned parts and relations (sub-concepts such as wheels, motors, handlebars, attached, powered by, etc.) that reconfigure compositionally to create a model of the new concept. If deep neural networks could adopt similarly compositional, hierarchical, and causal representations, we expect they might benefit more from learning-to-learn.

In the Frostbite Challenge, and in video games more generally, there is a similar interdependence between the form of the representation and the effectiveness of learning-to-learn. People seem to transfer knowledge at multiple levels, from low-level perception to high-level strategy, exploiting compositionality at all levels. Most basically, they immediately parse the game environment into objects, types of objects, and causal relations between them.
People also understand that video games like this have goals, which often involve approaching or avoiding objects based on their type. Whether the person is a child or a seasoned gamer, it seems obvious that interacting with the birds and fish will change the game state in some way, either good or bad, because video games typically yield costs or rewards for these types of interactions (e.g., dying or points). These types of hypotheses can be quite specific and rely on prior knowledge: When the polar bear first appears and tracks the agent's location during advanced levels (Figure 2D), an attentive learner is sure to avoid it. Depending on the level, ice floes can be spaced far apart (Figure 2A-C) or close together (Figure 2D), suggesting the agent may be able to cross some gaps but not others. In this way, general world knowledge and previous video games may help inform exploration and generalization in new scenarios, helping people learn maximally from a single mistake or avoid mistakes altogether.

Deep reinforcement learning systems for playing Atari games have had some impressive successes in transfer learning, but they still have not come close to learning to play new games as quickly as humans can. For example, Parisotto et al. (2016) present the “Actor-Mimic” algorithm, which first learns 13 Atari games by watching an expert network play and trying to mimic the expert network's action selection and/or internal states (for about four million frames of experience each, or 18.5 hours per game). This algorithm can then learn new games faster than a randomly initialized DQN: Scores that might have taken four or five million frames of learning to reach might now be reached after one or two million frames of practice. But anecdotally we find that humans can still reach these scores with a few minutes of practice, requiring far less experience than the DQNs.

In sum, the interaction between representation and previous experience may be key to building machines that learn as fast as people do. A deep learning system trained on many video games may not, by itself, be enough to learn new games as quickly as people do. Yet if such a system aims to learn compositionally structured causal models of each game – built on a foundation of intuitive physics and psychology – it could transfer knowledge more efficiently and thereby learn new games much more quickly.

⁸ Humans typically have direct experience with only one or a few alphabets, and even with related drawing experience, this likely amounts to the equivalent of a few hundred character-like visual concepts at most. For BPL, pre-training with characters in only five alphabets (for around 150 character types in total) is sufficient to perform human-level one-shot classification and generation of new examples. The best neural network classifiers (deep convolutional networks) have error rates approximately five times higher than humans when pre-trained with five alphabets (23% versus 4% error), and two to three times higher when pre-training on six times as much data (30 alphabets) (Lake, Salakhutdinov, & Tenenbaum, 2015). The current need for extensive pre-training is illustrated for deep generative models by Rezende et al. (2016), who present extensions of the DRAW architecture capable of one-shot learning.

4.3 Thinking Fast

The previous section focused on learning rich models from sparse data and proposed ingredients for achieving these human-like learning abilities. These cognitive abilities are even more striking when considering the speed of perception and thought – the amount of time required to understand a scene, think a thought, or choose an action.
In general, richer and more structured models require more complex (and slower) inference algorithms – similar to how complex models require more data – making the speed of perception and thought all the more remarkable. The combination of rich models with efficient inference suggests another way psychology and neuroscience may usefully inform AI. It also suggests an additional way to build on the successes of deep learning, where efficient inference and scalable learning are important strengths of the approach. This section discusses possible paths towards resolving the conflict between fast inference and structured representations, including Helmholtz-machine-style approximate inference in generative models (Dayan, Hinton, Neal, & Zemel, 1995; Hinton et al., 1995) and cooperation between model-free and model-based reinforcement learning systems.

4.3.1 Approximate inference in structured models

Hierarchical Bayesian models operating over probabilistic programs (Goodman et al., 2008; Lake, Salakhutdinov, & Tenenbaum, 2015; Tenenbaum et al., 2011) are equipped to deal with theory-like structures and rich causal representations of the world, yet there are formidable algorithmic challenges for efficient inference. Computing a probability distribution over an entire space of programs is usually intractable, and often even finding a single high-probability program poses an intractable search problem. In contrast, while representing intuitive theories and structured causal models is less natural in deep neural networks, recent progress has demonstrated the remarkable effectiveness of gradient-based learning in high-dimensional parameter spaces. A complete account of learning and inference must explain how the brain does so much with limited computational resources (Gershman, Horvitz, & Tenenbaum, 2015; Vul, Goodman, Griffiths, & Tenenbaum, 2014).
Popular algorithms for approximate inference in probabilistic machine learning have been proposed as psychological models (see Griffiths, Vul, & Sanborn, 2012, for a review). Most prominently, it has been proposed that humans can approximate Bayesian inference using Monte Carlo methods, which stochastically sample the space of possible hypotheses and evaluate these samples according to their consistency with the data and prior knowledge (Bonawitz, Denison, Griffiths, & Gopnik, 2014; Gershman, Vul, & Tenenbaum, 2012; T. D. Ullman, Goodman, & Tenenbaum, 2012; Vul et al., 2014). Monte Carlo sampling has been invoked to explain behavioral phenomena ranging from children's response variability (Bonawitz et al., 2014) to garden-path effects in sentence processing (Levy, Reali, & Griffiths, 2009) and perceptual multistability (Gershman et al., 2012; Moreno-Bote, Knill, & Pouget, 2011). Moreover, we are beginning to understand how such methods could be implemented in neural circuits (Buesing, Bill, Nessler, & Maass, 2011; Huang & Rao, 2014; Pecevski, Buesing, & Maass, 2011).⁹

While Monte Carlo methods are powerful and come with asymptotic guarantees, it is challenging to make them work on complex problems like program induction and theory learning. When the hypothesis space is vast and only a few hypotheses are consistent with the data, how can good models be discovered without exhaustive search? In at least some domains, people may not have an especially clever solution to this problem, instead grappling with the full combinatorial complexity of theory learning (T. D. Ullman et al., 2012). Discovering new theories can be slow and arduous, as testified by the long timescale of cognitive development, and learning in a saltatory fashion (rather than through gradual adaptation) is characteristic of aspects of human intelligence, including discovery and insight during development (L.
Schulz, 2012), problem-solving (Sternberg & Davidson, 1995), and epoch-making discoveries in scientific research (Langley, Bradshaw, Simon, & Zytkow, 1987). Discovering new theories can also happen much more quickly – a person learning the rules of Frostbite will probably undergo a loosely ordered sequence of “Aha!” moments: they will learn that jumping on ice floes causes them to change color, that changing the color of ice floes causes an igloo to be constructed piece-by-piece, that birds make you lose points, that fish make you gain points, that you can change the direction of an ice floe at the cost of one igloo piece, and so on. These little fragments of a “Frostbite theory” are assembled to form a causal understanding of the game relatively quickly, in what seems more like a guided process than arbitrary proposals in a Monte Carlo inference scheme. Similarly, as described in the Characters Challenge, people can quickly infer motor programs to draw a new character in a similarly guided process.

For domains where program or theory learning happens quickly, it is possible that people employ inductive biases not only to evaluate hypotheses, but also to guide hypothesis selection. L. Schulz (2012) has suggested that abstract structural properties of problems contain information about the abstract forms of their solutions. Even without knowing the answer to the question “Where is the deepest point in the Pacific Ocean?” one still knows that the answer must be a location on a map. The answer “20 inches” to the question “What year was Lincoln born?” can be invalidated a priori, even without knowing the correct answer.

⁹ In the interest of brevity, we do not discuss here another important vein of work linking neural circuits to variational approximations (Bastos et al., 2012), which have received less attention in the psychological literature.
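The kind of unguided Monte Carlo inference that such guided learning is being contrasted with can be sketched in a few lines. Below is a Metropolis sampler over a toy hypothesis space (the hypotheses, prior, and data are invented for illustration): hypotheses are proposed blindly and evaluated only by prior and likelihood, with no structural guidance from the form of the problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy hypothesis space: candidate coin biases with an illustrative prior.
hypotheses = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
prior = np.array([0.2, 0.15, 0.3, 0.15, 0.2])
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])       # observed flips

def log_score(i):
    """Unnormalized log posterior: log prior + log likelihood."""
    theta = hypotheses[i]
    k, n = data.sum(), len(data)
    return np.log(prior[i]) + k * np.log(theta) + (n - k) * np.log(1 - theta)

# Metropolis sampling: propose a hypothesis at random, accept stochastically
# according to how well it explains the data relative to the current one.
samples = []
current = rng.integers(len(hypotheses))
for _ in range(5000):
    proposal = rng.integers(len(hypotheses))
    if np.log(rng.random()) < log_score(proposal) - log_score(current):
        current = proposal
    samples.append(current)

# Empirical posterior from samples versus the exact normalized posterior.
approx = np.bincount(samples, minlength=len(hypotheses)) / len(samples)
exact = np.exp([log_score(i) for i in range(len(hypotheses))])
exact /= exact.sum()
print(np.round(approx, 3), np.round(exact, 3))
```

On this five-hypothesis space the sampler matches the exact posterior quickly; the point of the surrounding discussion is that on a space of programs or theories, blind proposals like these become hopeless, which is why guided hypothesis selection matters.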
In recent experiments, Tsividis, Tenenbaum, and Schulz (2015) found that children can use high-level abstract features of a domain to guide hypothesis selection, by reasoning about distributional properties like the ratio of seeds to flowers, and dynamical properties like periodic or monotonic relationships between causes and effects (see also Magid, Sheskin, & Schulz, 2015).

How might efficient mappings from questions to a plausible subset of answers be learned? Recent work in AI spanning both deep learning and graphical models has attempted to tackle this challenge by “amortizing” probabilistic inference computations into an efficient feed-forward mapping (Eslami, Tarlow, Kohli, & Winn, 2014; Heess, Tarlow, & Winn, 2013; A. Mnih & Gregor, 2014; Stuhlmüller, Taylor, & Goodman, 2013). We can also think of this as “learning to do inference,” which is independent from the ideas of learning as model building discussed in the previous section. These feed-forward mappings can be learned in various ways, for example, using paired generative/recognition networks (Dayan et al., 1995; Hinton et al., 1995) and variational optimization (Gregor et al., 2015; A. Mnih & Gregor, 2014; Rezende, Mohamed, & Wierstra, 2014), or nearest-neighbor density estimation (Kulkarni, Kohli, Tenenbaum, & Mansinghka, 2015; Stuhlmüller et al., 2013). One implication of amortization is that solutions to different problems will become correlated due to the sharing of amortized computations; some evidence for inferential correlations in humans was reported by Gershman and Goodman (2014).
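A stripped-down example of amortized inference, using an invented conjugate Gaussian model where the exact answer is known in closed form: instead of running inference anew for each dataset, a feed-forward mapping from data to posterior is fit offline on simulated (data, latent) pairs, and then applied at test time in a single pass. A linear map stands in for the recognition network.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative generative model: theta ~ N(0, tau^2); each of n observations
# x_i ~ N(theta, sigma^2). For this conjugate model the posterior mean is
# exactly linear in the sample mean:
#   E[theta | x] = c * mean(x),  with  c = n*tau^2 / (n*tau^2 + sigma^2)
tau, sigma, n = 2.0, 1.0, 5
c_exact = n * tau**2 / (n * tau**2 + sigma**2)

# Amortization ("learning to do inference"): simulate (data, latent) pairs
# from the generative model and regress the latent on the data summary.
thetas = rng.normal(0, tau, 20000)
xbars = thetas + rng.normal(0, sigma / np.sqrt(n), 20000)
c_learned = (xbars @ thetas) / (xbars @ xbars)    # least-squares slope

# At test time, "inference" is a single feed-forward multiplication.
x_new = rng.normal(1.5, sigma, n)
posterior_mean = c_learned * x_new.mean()
print(c_exact, c_learned, posterior_mean)
```

The regression recovers the exact posterior mapping because the true mapping happens to be linear here; in richer models the same recipe motivates training a neural recognition network on samples from the generative model.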
This trend is an avenue of potential integration of deep learning models with probabilistic models and probabilistic programming: training neural networks to help perform probabilistic inference in a generative model or a probabilistic program (Eslami et al., 2016; Kulkarni, Whitney, Kohli, & Tenenbaum, 2015; Yildirim, Kulkarni, Freiwald, & Tenenbaum, 2015). Another avenue for potential integration is through differentiable programming (Dalrymple, 2016) – by ensuring that the program-like hypotheses are differentiable and thus learnable via gradient descent – a possibility discussed in the concluding section (Section 6.1).

4.3.2 Model-based and model-free reinforcement learning

The DQN introduced by V. Mnih et al. (2015) used a simple form of model-free reinforcement learning in a deep neural network that allows for fast selection of actions. There is indeed substantial evidence that the brain uses similar model-free learning algorithms in simple associative learning or discrimination learning tasks (see Niv, 2009, for a review). In particular, the phasic firing of midbrain dopaminergic neurons is qualitatively (Schultz, Dayan, & Montague, 1997) and quantitatively (Bayer & Glimcher, 2005) consistent with the reward prediction error that drives updating of model-free value estimates.

Model-free learning is not, however, the whole story. Considerable evidence suggests that the brain also has a model-based learning system, responsible for building a “cognitive map” of the environment and using it to plan action sequences for more complex tasks (Daw, Niv, & Dayan, 2005; Dolan & Dayan, 2013). Model-based planning is an essential ingredient of human intelligence, enabling flexible adaptation to new tasks and goals; it is where all of the rich model-building abilities discussed in the previous sections earn their value as guides to action.
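The model-free value update driven by reward prediction error can be written in a few lines. The toy chain and parameters below are invented for illustration, not a model of any experiment; the update is standard TD(0), with delta playing the role argued for the dopaminergic prediction-error signal.

```python
import numpy as np

# Model-free value learning on a tiny deterministic chain:
# s0 -> s1 -> s2 (terminal), reward 1 on entering the terminal state.
# Values are updated by the reward prediction error:
#   delta = r + gamma * V(s') - V(s)
gamma, alpha = 0.9, 0.1
V = np.zeros(3)                          # V[2] stays 0 (terminal state)

for _ in range(500):                     # repeated episodes of experience
    for s, s_next, r in [(0, 1, 0.0), (1, 2, 1.0)]:
        delta = r + gamma * V[s_next] - V[s]   # prediction error
        V[s] += alpha * delta

print(np.round(V, 3))   # converges toward V[1] = 1.0, V[0] = gamma * V[1] = 0.9
```

Early in learning, the prediction error fires on reward delivery; as V converges, the error at the rewarded transition vanishes and value has propagated back to the predictive state, mirroring the shift in dopaminergic responses from rewards to their predictors.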
As we argued in our discussion of Frostbite, one can design numerous variants of this simple video game that are identical except for the reward function – that is, governed by an identical environment model of state-action-dependent transitions. We conjecture that a competent Frostbite player can easily shift behavior appropriately, with little or no additional learning, and it is hard to imagine a way of doing that other than having a model-based planning approach in which the environment model can be modularly combined with arbitrary new reward functions and then deployed immediately for planning. One boundary condition on this flexibility is the fact that skills become “habitized” with routine application, possibly reflecting a shift from model-based to model-free control. This shift may arise from a rational arbitration between learning systems to balance the trade-off between flexibility and speed (Daw et al., 2005; Keramati, Dezfouli, & Piray, 2011).

Similarly to how probabilistic computations can be amortized for efficiency (see previous section), plans can be amortized into cached values by allowing the model-based system to simulate training data for the model-free system (Sutton, 1990). This process might occur offline (e.g., in dreaming or quiet wakefulness), suggesting a form of consolidation in reinforcement learning (Gershman, Markman, & Otto, 2014). Consistent with the idea of cooperation between learning systems, a recent experiment demonstrated that model-based behavior becomes automatic over the course of training (Economides, Kurth-Nelson, Lübbert, Guitart-Masip, & Dolan, 2015). Thus, a marriage of flexibility and efficiency might be achievable if we use the human reinforcement learning systems as guidance.

Intrinsic motivation also plays an important role in human learning and behavior (Berlyne, 1966; Deci & Ryan, 1975; Harlow, 1950).
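The amortization of plans into cached values described above can be sketched in the style of Sutton's (1990) Dyna architecture. The corridor environment, parameters, and update counts below are all invented for illustration: a learned environment model generates simulated transitions that train the same model-free value table that real experience trains, so planning is cached into values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 5-state corridor: states 0..4, actions {0: left, 1: right},
# reward 1 on reaching state 4 (then the episode restarts at state 0).
n_states, gamma, alpha = 5, 0.9, 0.5
Q = np.zeros((n_states, 2))             # model-free cached values
model = {}                              # learned model: (s, a) -> (r, s')

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_states - 1 else 0.0), s2

s = 0
for _ in range(300):                    # real experience, exploratory policy
    a = rng.integers(2)
    r, s2 = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    model[(s, a)] = (r, s2)             # learn the (deterministic) model
    for _ in range(10):                 # planning: replay simulated transitions
        (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == n_states - 1 else s2

# Cached values now encode the plan: go right in every non-terminal state.
print(Q[:4].argmax(axis=1))
```

The planning replays let value propagate back through the corridor far faster than real experience alone would allow, which is the sense in which simulated experience from the model-based system consolidates plans into fast model-free values.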
While much of the previous discussion assumes the standard view of behavior as seeking to maximize reward and minimize punishment, all externally provided rewards are reinterpreted according to the “internal value” of the agent, which may depend on the current goal and mental state. There may also be an intrinsic drive to reduce uncertainty and construct models of the environment (Edelman, 2015; Schmidhuber, 2015), closely related to learning-to-learn and multi-task learning. Deep reinforcement learning is only just starting to address intrinsically motivated learning (Kulkarni et al., 2016; Mohamed & Rezende, 2015).

5 Responses to common questions

In discussing the arguments in this paper with colleagues, three lines of questioning or critiques have come up frequently. We think it is helpful to address these points directly, to maximize the potential for moving forward together.

1. Comparing the learning speeds of humans and neural networks on specific tasks is not meaningful, because humans have extensive prior experience.

It may seem unfair to compare neural networks and humans on the amount of training experience required to perform a task, such as learning to play new Atari games or learning new handwritten characters, when humans have had extensive prior experience that these networks have not benefited from. People have had many hours playing other games, and experience reading or writing many other handwritten characters, not to mention experience in a variety of more loosely related tasks. If neural networks were “pre-trained” on the same experience, the argument goes, then they might generalize similarly to humans when exposed to novel tasks.
This has been the rationale behind multi-task learning or transfer learning, a strategy with a long history that has shown some promising results recently with deep networks (e.g., Donahue et al., 2013; Luong, Le, Sutskever, Vinyals, & Kaiser, 2015; Parisotto et al., 2016). Furthermore, some deep learning advocates argue, the human brain effectively benefits from even more experience through evolution. If deep learning researchers see themselves as trying to capture the equivalent of humans' collective evolutionary experience, this would be equivalent to a truly immense “pre-training” phase.

We agree that humans have a much richer starting point than neural networks when learning most new tasks, including learning a new concept or learning to play a new video game. That is the point of the “developmental start-up software” and other building blocks that we argued are key to creating this richer starting point. We are less committed to a particular story regarding the origins of the ingredients, including the relative roles of genetically programmed and experience-driven developmental mechanisms in building these components in early infancy. Either way, we see them as fundamental building blocks for facilitating rapid learning from sparse data.

Learning-to-learn across multiple tasks is conceivably one route to acquiring these ingredients, but simply training conventional neural networks on many related tasks may not be sufficient to generalize in human-like ways for novel tasks. As we argued in Section 4.2.3, successful learning-to-learn – or at least, human-level transfer learning – is enabled by having models with the right representational structure, including the other building blocks discussed in this paper.
Learning-to-learn is a powerful ingredient, but it can be more powerful when operating over compositional representations that capture the underlying causal structure of the environment, while also building on intuitive physics and psychology.

Finally, we recognize that some researchers still hold out hope that if only they can get big enough training datasets, sufficiently rich tasks, and enough computing power – far beyond what has been tried out so far – then deep learning methods might be sufficient to learn representations equivalent to what evolution and learning provide humans with. We can sympathize with that hope and believe it deserves further exploration, although we are not sure it is a realistic one. We understand in principle how evolution could build a brain with the cognitive ingredients we discuss here. Stochastic hill-climbing is slow – it may require massively parallel exploration, over millions of years with innumerable dead-ends – but it can build complex structures with complex functions if we are willing to wait long enough. In contrast, trying to build these representations from scratch using backpropagation, deep Q-learning, or any stochastic gradient-descent weight update rule in a fixed network architecture may be unfeasible regardless of how much training data are available. To build these representations from scratch might require exploring fundamental structural variations in the network's architecture, which gradient-based learning in weight space is not prepared to do. Although deep learning researchers do explore many such architectural variations, and have been devising increasingly clever and powerful ones recently, it is the researchers who are driving and directing this process. Exploration and creative innovation in the space of network architectures have not yet been made algorithmic.
Perhaps they could be, using genetic programming methods (Koza, 1992) or other structure-search algorithms (Yamins et al., 2014). We think this would be a fascinating and promising direction to explore, but we may have to acquire more patience than machine learning researchers typically express with their algorithms: the dynamics of structure-search may look much more like the slow random hill-climbing of evolution than the smooth, methodical progress of stochastic gradient descent. An alternative strategy is to build in appropriate infant-like knowledge representations and core ingredients as the starting point for our learning-based AI systems, or to build learning systems with strong inductive biases that guide them in this direction.

Regardless of which way an AI developer chooses to go, our main points are orthogonal to this objection. There is a set of core cognitive ingredients for human-like learning and thought. Deep learning models could incorporate these ingredients through some combination of additional structure and perhaps additional learning mechanisms, but for the most part have yet to do so. Any approach to human-like AI, whether based on deep learning or not, is likely to gain from incorporating these ingredients.

2. Biological plausibility suggests theories of intelligence should start with neural networks.

We have focused on how cognitive science can motivate and guide efforts to engineer human-like AI, in contrast to some advocates of deep neural networks who cite neuroscience for inspiration. Our approach is guided by a pragmatic view that the clearest path to a computational formalization of human intelligence comes from understanding the “software” before the “hardware.” In the case of this article, we proposed key ingredients of this software in previous sections. Nonetheless, a cognitive approach to intelligence should not ignore what we know about the brain.
Neuroscience can provide valuable inspiration for both cognitive models and AI researchers: the centrality of neural networks and model-free reinforcement learning in our proposals for “Thinking fast” (Section 4.3) are prime exemplars. Neuroscience can also in principle impose constraints on cognitive accounts, both at the cellular and systems levels. If deep learning embodies brain-like computational mechanisms and those mechanisms are incompatible with some cognitive theory, then this is an argument against that cognitive theory and in favor of deep learning. Unfortunately, what we “know” about the brain is not all that clear-cut. Many seemingly well-accepted ideas regarding neural computation are in fact biologically dubious, or uncertain at best – and thus should not disqualify cognitive ingredients that pose challenges for implementation within that approach.

For example, most neural networks use some form of gradient-based (e.g., backpropagation) or Hebbian learning. It has long been argued, however, that backpropagation is not biologically plausible; as Crick (1989) famously pointed out, backpropagation seems to require that information be transmitted backwards along the axon, which does not fit with realistic models of neuronal function (although recent models circumvent this problem in various ways; Liao, Leibo, & Poggio, 2015; Lillicrap, Cownden, Tweed, & Akerman, 2014; Scellier & Bengio, 2016). This has not prevented backpropagation being put to good use in connectionist models of cognition or in building deep neural networks for AI. Neural network researchers must regard it as a very good thing, in this case, that concerns of biological plausibility did not hold back research on this particular algorithmic approach to learning.¹⁰
We strongly agree: Although neuroscientists have not found any mechanisms for implementing backpropagation in the brain, neither have they produced definitive evidence against it. The existing data simply offer little constraint either way, and backpropagation has been of obviously great value in engineering today's best pattern recognition systems. (Michael Jordan made this point forcefully in his 2015 speech accepting the Rumelhart Prize.)

Hebbian learning is another case in point. In the form of long-term potentiation (LTP) and spike-timing dependent plasticity (STDP), Hebbian learning mechanisms are often cited as biologically supported (Bi & Poo, 2001). However, the cognitive significance of any biologically grounded form of Hebbian learning is unclear. Gallistel and Matzel (2013) have persuasively argued that the critical interstimulus interval for LTP is orders of magnitude smaller than the intervals that are behaviorally relevant in most forms of learning. In fact, experiments that simultaneously manipulate the interstimulus and intertrial intervals demonstrate that no critical interval exists. Behavior can persist for weeks or months, whereas LTP decays to baseline over the course of days (Power, Thompson, Moyer, & Disterhoft, 1997). Learned behavior is rapidly reacquired after extinction (Bouton, 2004), whereas no such facilitation is observed for LTP (de Jonge & Racine, 1985). Most relevantly for our focus, it would be especially challenging to try to implement the ingredients described in this article using purely Hebbian mechanisms.

Claims of biological plausibility or implausibility usually rest on rather stylized assumptions about the brain that are wrong in many of their details. Moreover, these claims usually pertain to the cellular and synaptic levels, with few connections made to systems-level neuroscience and subcortical brain organization (Edelman, 2015).
Understanding which details matter and which do not requires a computational theory (Marr, 1982). Moreover, in the absence of strong constraints from neuroscience, we can turn the biological argument around: Perhaps a hypothetical biological mechanism should be viewed with skepticism if it is cognitively implausible. In the long run, we are optimistic that neuroscience will eventually place more constraints on theories of intelligence. For now, we believe cognitive plausibility offers a surer foundation.

3. Language is essential for human intelligence. Why is it not more prominent here? We have said little in this article about people's ability to communicate and think in natural language, a distinctively human cognitive capacity where machine capabilities lag strikingly. Certainly one could argue that language should be included on any short list of key ingredients in human intelligence: for instance, Mikolov et al. (2016) featured language prominently in their recent paper sketching challenge problems and a road map for AI. Moreover, while natural language processing is an active area of research in deep learning (e.g., Bahdanau, Cho, & Bengio, 2015; Mikolov, Sutskever, & Chen, 2013; K. Xu et al., 2015), it is widely recognized that neural networks are far from implementing human language abilities. The question is, how do we develop machines with a richer capacity for language? We ourselves believe that understanding language and its role in intelligence goes hand-in-hand with understanding the building blocks discussed in this article. It is also true that language builds on the core abilities for intuitive physics, intuitive psychology, and rapid learning with compositional, causal models that we do focus on.
These capacities are in place before children master language, and they provide the building blocks for linguistic meaning and language acquisition (Carey, 2009; Jackendoff, 2003; Kemp, 2007; O'Donnell, 2015; Pinker, 2007; F. Xu & Tenenbaum, 2007). We hope that by better understanding these earlier ingredients and how to implement and integrate them computationally, we will be better positioned to understand linguistic meaning and acquisition in computational terms, and to explore other ingredients that make human language possible. What else might we need to add to these core ingredients to get language? Many researchers have speculated about key features of human cognition that give rise to language and other uniquely human modes of thought: Is it recursion, or some new kind of recursive structure-building ability (Berwick & Chomsky, 2016; Hauser, Chomsky, & Fitch, 2002)? Is it the ability to reuse symbols by name (Deacon, 1998)? Is it the ability to understand others intentionally and build shared intentionality (Bloom, 2000; Frank, Goodman, & Tenenbaum, 2009; Tomasello, 2010)? Is it some new version of these things, or is it just more of the aspects of these capacities that are already present in infants? These are important questions for future work with the potential to expand the list of key ingredients; we did not intend our list to be complete.

Finally, we should keep in mind all the ways that acquiring language extends and enriches the ingredients of cognition we focus on in this article. The intuitive physics and psychology of infants is likely limited to reasoning about objects and agents in their immediate spatial and temporal vicinity, and to their simplest properties and states. But with language, older children become able to reason about a much wider range of physical and psychological situations (Carey, 2009).
Language also facilitates more powerful learning-to-learn and compositionality (Mikolov et al., 2016), allowing people to learn more quickly and flexibly by representing new concepts and thoughts in relation to existing concepts (Lupyan & Bergen, 2016; Lupyan & Clark, 2015). Ultimately, the full project of building machines that learn and think like humans must have language at its core.

6 Looking forward

In the last few decades, AI and machine learning have made remarkable progress: Computer programs beat chess masters; AI systems beat Jeopardy champions; apps recognize photos of your friends; machines rival humans on large-scale object recognition; smart phones recognize (and, to a limited extent, understand) speech. The coming years promise still more exciting AI applications, in areas as varied as self-driving cars, medicine, genetics, drug design, and robotics. As a field, AI should be proud of these accomplishments, which have helped move research from academic journals into systems that improve our daily lives.

We should also be mindful of what AI has achieved and what it has not. While the pace of progress has been impressive, natural intelligence is still by far the best example of intelligence. Machine performance may rival or exceed human performance on particular tasks, and algorithms may take inspiration from neuroscience or aspects of psychology, but it does not follow that the algorithm learns or thinks like a person. This is a higher bar worth reaching for, potentially leading to more powerful algorithms while also helping unlock the mysteries of the human mind.

When comparing people and the current best algorithms in AI and machine learning, people learn from less data and generalize in richer and more flexible ways.
Even for relatively simple concepts such as handwritten characters, people need to see just one or a few examples of a new concept before being able to recognize new examples, generate new examples, and generate new concepts based on related ones (Figure 1A). So far, these abilities elude even the best deep neural networks for character recognition (Ciresan et al., 2012), which are trained on many examples of each concept and do not flexibly generalize to new tasks. We suggest that the comparative power and flexibility of people's inferences come from the causal and compositional nature of their representations.

We believe that deep learning and other learning paradigms can move closer to human-like learning and thought if they incorporate psychological ingredients including those outlined in this paper. Before closing, we discuss some recent trends that we see as some of the most promising developments in deep learning, trends we hope will continue and lead to more important advances.

6.1 Promising directions in deep learning

There has been recent interest in integrating psychological ingredients with deep neural networks, especially selective attention (Bahdanau et al., 2015; V. Mnih, Heess, Graves, & Kavukcuoglu, 2014; K. Xu et al., 2015), augmented working memory (Graves et al., 2014, 2016; Grefenstette et al., 2015; Sukhbaatar et al., 2015; Weston et al., 2015), and experience replay (McClelland, McNaughton, & O'Reilly, 1995; V. Mnih et al., 2015). These ingredients are lower-level than the key cognitive ingredients discussed in this paper, yet they suggest a promising trend of using insights from cognitive psychology to improve deep learning, one that may be furthered even more by incorporating higher-level cognitive ingredients.
Paralleling the human perceptual apparatus, selective attention forces deep learning models to process raw perceptual data as a series of high-resolution "foveal glimpses" rather than all at once. Somewhat surprisingly, the incorporation of attention has led to substantial performance gains in a variety of domains, including machine translation (Bahdanau et al., 2015), object recognition (V. Mnih et al., 2014), and image caption generation (K. Xu et al., 2015). Attention may help these models in several ways. It helps to coordinate complex (often sequential) outputs by attending to only specific aspects of the input, allowing the model to focus on smaller sub-tasks rather than solving an entire problem in one shot. For instance, during caption generation, the attentional window has been shown to track the objects as they are mentioned in the caption, where the network may focus on a boy and then a Frisbee when producing a caption like, "A boy throws a Frisbee" (K. Xu et al., 2015). Attention also allows larger models to be trained without requiring every model parameter to affect every output or action. In generative neural network models, attention has been used to concentrate on generating particular regions of the image rather than the whole image at once (Gregor et al., 2015). This could be a stepping stone towards building more causal generative models in neural networks, such as a neural version of the Bayesian Program Learning model that could be applied to tackling the Characters Challenge (Section 3.1).

Researchers are also developing neural networks with "working memories" that augment the shorter-term memory provided by unit activation and the longer-term memory provided by the connection weights (Graves et al., 2014, 2016; Grefenstette et al., 2015; Reed & de Freitas, 2016; Sukhbaatar et al., 2015; Weston et al., 2015).
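The soft attention mechanism behind several of the systems above can be sketched in a few lines. In this illustrative sketch of ours, each input position is scored against a query, the scores are normalized with a softmax, and the output is the resulting weighted average; the vectors below are made up, and the dot-product scoring rule stands in for the learned alignment function that models like Bahdanau et al. (2015) actually train.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values):
    """Score each key against the query, then return the softmax weights
    and the attention-weighted mixture of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Three illustrative "annotation" positions; the query matches the second best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([0.0, 4.0], keys, values)
print("weights:", weights)
print("context:", context)
```

Because the weights are a differentiable function of the scores, gradients flow through the whole selection step, which is what lets attention be trained end-to-end alongside the rest of the network.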
These developments are also part of a broader trend towards "differentiable programming": the incorporation of classic data structures, such as random access memory, stacks, and queues, into gradient-based learning systems (Dalrymple, 2016). For example, the Neural Turing Machine (NTM; Graves et al., 2014) and its successor the Differentiable Neural Computer (DNC; Graves et al., 2016) are neural networks augmented with a random access external memory with read and write operations that maintain end-to-end differentiability. The NTM has been trained to perform sequence-to-sequence prediction tasks such as sequence copying and sorting, and the DNC has been applied to solving block puzzles and finding paths between nodes in a graph (after memorizing the graph). Additionally, Neural Programmer-Interpreters learn to represent and execute algorithms such as addition and sorting from fewer examples by observing input-output pairs (like the NTM and DNC) as well as execution traces (Reed & de Freitas, 2016). Each model seems to learn genuine programs from examples, albeit in a representation more like assembly language than a high-level programming language. While this new generation of neural networks has yet to tackle the types of challenge problems introduced in this paper, differentiable programming suggests the intriguing possibility of combining the best of program induction and deep learning.

The types of structured representations and model-building ingredients discussed in this paper – objects, forces, agents, causality, and compositionality – help to explain important facets of human learning and thinking, yet they also bring challenges for performing efficient inference (Section 4.3.1). Deep learning systems have not yet shown they can work with these representations, but they have demonstrated the surprising effectiveness of gradient descent in large models with high-dimensional parameter spaces.
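The "soft" reads and writes that keep an external memory of this kind differentiable can be sketched simply: every access mixes all memory rows according to a softmax over key similarity, so no discrete (non-differentiable) choice is ever made. This sketch of ours uses the content-based addressing idea from the NTM; the memory contents, lookup key, and sharpness constant are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def address(memory, key, sharpness=5.0):
    """Content-based addressing: a softmax over scaled cosine similarities."""
    return softmax([sharpness * cosine(row, key) for row in memory])

def read(memory, w):
    """Soft read: a w-weighted mixture of all memory rows."""
    cols = len(memory[0])
    return [sum(wi * row[i] for wi, row in zip(w, memory)) for i in range(cols)]

def write(memory, w, erase, add):
    """Soft write: each row is erased and added to in proportion to its weight."""
    return [[m * (1 - wi * e) + wi * a for m, e, a in zip(row, erase, add)]
            for row, wi in zip(memory, w)]

memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w = address(memory, [0.9, 0.1, 0.0])   # key most similar to the first row
r = read(memory, w)
print("weights:", w)
print("read:", r)
memory = write(memory, w, erase=[1.0, 1.0, 1.0], add=[0.5, 0.5, 0.5])
```

Because reading and writing are smooth functions of the key and the weights, gradient descent can learn where to store and retrieve information, which is the core trick shared by the NTM, the DNC, and memory networks.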
A synthesis of these approaches, able to perform efficient inference over programs that richly model the causal structure an infant sees in the world, would be a major step forward for building human-like AI.

Another example of combining pattern recognition and model-based search comes from recent AI research into the game Go. Go is considerably more difficult for AI than chess, and it was only recently that a computer program – AlphaGo – first beat a world-class player (Chouard, 2016) by using a combination of deep convolutional neural networks (convnets) and Monte Carlo Tree Search (Silver et al., 2016). Each of these components has made gains against artificial and real Go players (Gelly & Silver, 2008, 2011; Silver et al., 2016; Tian & Zhu, 2016), and the notion of combining pattern recognition and model-based search goes back decades in Go and other games. Showing that these approaches can be integrated to beat a human Go champion is an important AI accomplishment (see Figure 7). Just as important, however, are the new questions and directions it opens up for the long-term project of building genuinely human-like AI.

One worthy goal would be to build an AI system that beats a world-class player with the amount and kind of training human champions receive, rather than overpowering them with Google-scale computational resources. AlphaGo is initially trained on 28.4 million positions and moves from 160,000 unique games played by human experts; it then improves through reinforcement learning, playing 30 million more games against itself. Between the publication of Silver et al. (2016) and its match against world champion Lee Sedol, AlphaGo was iteratively retrained several times in this way; the basic system always learned from 30 million games, but it played against successively stronger versions of itself, effectively learning from 100 million or more games altogether (Silver, 2016).
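The way a learned policy prior and Monte Carlo statistics combine during search can be illustrated with a toy version of the kind of selection rule AlphaGo uses: each candidate move holds a prior P from the policy network, a visit count N, and a mean value Q, and the search repeatedly picks the move maximizing Q plus an exploration bonus proportional to P. In this sketch of ours, the three moves, their priors, and the random "rollout" outcomes are all made up; a real system would get P from a convnet and the outcomes from simulated games.

```python
import math
import random

class Node:
    def __init__(self, prior):
        self.prior = prior       # P(s, a), as if scored by a policy network
        self.visits = 0          # N(s, a)
        self.value_sum = 0.0     # accumulated rollout outcomes

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select(children, c_puct=1.0):
    """Pick the move maximizing Q + c * P * sqrt(total visits) / (1 + N)."""
    total = sum(ch.visits for ch in children.values())
    def score(ch):
        return ch.q + c_puct * ch.prior * math.sqrt(total + 1) / (1 + ch.visits)
    return max(children, key=lambda m: score(children[m]))

random.seed(0)
children = {"a": Node(0.2), "b": Node(0.5), "c": Node(0.3)}
win_rate = {"a": 0.3, "b": 0.7, "c": 0.4}   # stand-in for rollout outcomes

for _ in range(500):
    move = select(children)
    node = children[move]
    node.visits += 1
    node.value_sum += 1.0 if random.random() < win_rate[move] else 0.0

print({m: ch.visits for m, ch in children.items()})
```

After a few hundred simulations the search concentrates its visits on the move that both the prior and the rollouts favor, which is the sense in which the convnet's pattern recognition steers model-based search.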
In contrast, Lee has probably played around 50,000 games in his entire life. Looking at numbers like these, it is impressive that Lee can even compete with AlphaGo at all. What would it take to build a professional-level Go AI that learns from only 50,000 games? Perhaps a system that combines the advances of AlphaGo with some of the complementary ingredients for intelligence we argue for here would be a route to that end.

AI could also gain much by trying to match the learning speed and flexibility of normal human Go players. People take a long time to master the game of Go, but as with the Frostbite and Characters challenges (Sections 3.1 and 3.2), humans can learn the basics of the game quickly through a combination of explicit instruction, watching others, and experience. Playing just a few games teaches a human enough to beat someone who has just learned the rules but never played before. Could AlphaGo model these earliest stages of real human learning curves? Human Go players can also adapt what they have learned to innumerable game variants. The Wikipedia page
Figure 7: An AI system for playing Go combining a deep convolutional network (convnet) and model-based search through Monte Carlo Tree Search (MCTS). (A) The convnet on its own can be used to predict the next k moves given the current board. (B) A search tree with the current board state as its root and the current "win/total" statistics at each node. A new MCTS rollout selects moves along the tree according to the MCTS policy (red arrows) until it reaches a new leaf (red circle), where the next move is chosen by the convnet. From there, play proceeds until the game's end according to a pre-defined default policy based on the Pachi program (Baudiš & Gailly, 2012), itself based on MCTS. (C) The end-game result of the new leaf is used to update the search tree. Adapted from Tian and Zhu (2016) with permission.

"Go variants" describes versions such as playing on bigger or smaller board sizes (ranging from 9×9 to 38×38, not just the usual 19×19 board), or playing on boards of different shapes and connectivity structures (rectangles, triangles, hexagons, even a map of the English city Milton Keynes). The board can be a torus, a Möbius strip, a cube, or a diamond lattice in three dimensions. Holes can be cut in the board, in regular or irregular ways. The rules can be adapted to what is known as First Capture Go (the first player to capture a stone wins), NoGo (the player who goes longer without capturing any enemy stones wins), or Time Is Money Go (players begin with a fixed amount of time and at the end of the game, the number of seconds remaining on each player's clock is added to their score). Players may receive bonuses for creating certain stone patterns or capturing territory near certain landmarks. There could be four or more players, competing individually or in teams.
In each of these variants, effective play needs to change from the basic game, but a skilled player can adapt and does not simply have to relearn the game from scratch. Could AlphaGo? While techniques for handling variable-sized inputs in convnets may help for playing on different board sizes (Sermanet et al., 2014), the value functions and policies that AlphaGo learns seem unlikely to generalize as flexibly and automatically as people do. Many of the variants described above would require significant reprogramming and retraining, directed by the smart humans who programmed AlphaGo, not the system itself. As impressive as AlphaGo is in beating the world's best players at the standard game – and it is extremely impressive – the fact that it cannot even conceive of these variants, let alone adapt to them autonomously, is a sign that it does not understand the game as humans do. Human players can understand these variants and adapt to them because they explicitly represent Go as a game, with a goal to beat an adversary who is playing to achieve the same goal they are, governed by rules about how stones can be placed on a board and how board positions are scored. Humans represent their strategies as a response to these constraints, such that if the game changes, they can begin to adjust their strategies accordingly.

In sum, Go presents compelling challenges for AI beyond matching world-class human performance, in trying to match human levels of understanding and generalization, based on the same kinds and amounts of data, explicit instructions, and opportunities for social learning afforded to people. In learning to play Go as quickly and as flexibly as they do, people are drawing on most of the cognitive ingredients this paper has laid out. They are learning-to-learn with compositional knowledge.
They are using their core intuitive psychology, and aspects of their intuitive physics (spatial and object representations). And like AlphaGo, they are also integrating model-free pattern recognition with model-based search. We believe that Go AI systems could be built to do all of these things, potentially capturing better how humans learn and understand the game. We believe it would be richly rewarding for AI and cognitive science to pursue this challenge together, and that such systems could be a compelling testbed for the principles this paper argues for – as well as building on all of the progress to date that AlphaGo represents.

6.2 Future applications to practical AI problems

In this paper, we suggested some ingredients for building computational models with more human-like learning and thought. These principles were explained in the context of the Characters and Frostbite Challenges, with special emphasis on reducing the amount of training data required and facilitating transfer to novel yet related tasks. We also see ways these ingredients can spur progress on core AI problems with practical applications. Here we offer some speculative thoughts on these applications.

1. Scene understanding. Deep learning is moving beyond object recognition and towards scene understanding, as evidenced by a flurry of recent work focused on generating natural language captions for images (Karpathy & Fei-Fei, 2015; Vinyals et al., 2014; K. Xu et al., 2015). Yet current algorithms are still better at recognizing objects than understanding scenes, often getting the key objects right but their causal relationships wrong (Figure 6). We see compositionality, causality, intuitive physics and intuitive psychology as playing an increasingly important role in reaching true scene understanding.
For example, picture a cluttered garage workshop with screwdrivers and hammers hanging from the wall, wood pieces and tools stacked precariously on a work desk, and shelving and boxes framing the scene. In order for an autonomous agent to effectively navigate and perform tasks in this environment, the agent would need intuitive physics to properly reason about stability and support. A holistic model of the scene would require the composition of individual object models, glued together by relations. Finally, causality helps infuse the recognition of existing tools (or the learning of new ones) with an understanding of their use, helping to connect different object models in the proper way (e.g., hammering a nail into a wall, or using a sawhorse to support a beam being cut by a saw). If the scene includes people acting or interacting, it will be nearly impossible to understand their actions without thinking about their thoughts, and especially their goals and intentions towards the other objects and agents they believe are present.

2. Autonomous agents and intelligent devices. Robots and personal assistants (such as cellphones) cannot be pre-trained on all possible concepts they may encounter. Like a child learning the meaning of new words, an intelligent and adaptive system should be able to learn new concepts from a small number of examples, as they are encountered naturally in the environment. Common concept types include new spoken words (names like "Ban Ki-moon" or "Kofi Annan"), new gestures (a secret handshake or a "fist bump"), and new activities, and a human-like system would be able to learn both to recognize and to produce new instances from a small number of examples.
As with handwritten characters, a system may be able to quickly learn new concepts by constructing them from pre-existing primitive actions, informed by knowledge of the underlying causal process and learning-to-learn.

3. Autonomous driving. Perfect autonomous driving requires intuitive psychology. Beyond detecting and avoiding pedestrians, autonomous cars could more accurately predict pedestrian behavior by inferring mental states, including their beliefs (e.g., Do they think it is safe to cross the street? Are they paying attention?) and desires (e.g., Where do they want to go? Do they want to cross? Are they retrieving a ball lost in the street?). Similarly, other drivers on the road have similarly complex mental states underlying their behavior (e.g., Do they want to change lanes? Pass another car? Are they swerving to avoid a hidden hazard? Are they distracted?). This type of psychological reasoning, along with other types of model-based causal and physical reasoning, is likely to be especially valuable in challenging and novel driving circumstances for which there is little relevant training data (e.g., navigating unusual construction zones, natural disasters, etc.).

4. Creative design. Creativity is often thought to be a pinnacle of human intelligence: chefs design new dishes, musicians write new songs, architects design new buildings, and entrepreneurs start new businesses. While we are still far from developing AI systems that can tackle these types of tasks, we see compositionality and causality as central to this goal. Many commonplace acts of creativity are combinatorial, meaning they are unexpected combinations of familiar concepts or ideas (Boden, 1998; Ward, 1994).
As illustrated in Figure 1-iv, novel vehicles can be created as a combination of parts from existing vehicles, and similarly novel characters can be constructed from the parts of stylistically similar characters, or familiar characters can be re-conceptualized in novel styles (Rehling, 2001). In each case, the free combination of parts is not enough on its own: While compositionality and learning-to-learn can provide the parts for new ideas, causality provides the glue that gives them coherence and purpose. 6.3 Towards more human-like learning and thinking machines Since the birth of AI in the 1950s, people have wanted to build machines that learn and think like people. We hope researchers in AI, machine learning, and cognitive science will accept our challenge problems as a testbed for progress. Rather than just building systems that recognize handwritten characters and play Frostbite or Go as the end result of an asymptotic process, we suggest that deep learning and other computational paradigms should aim to tackle these tasks using as little training data as people need, and also to evaluate models on a range of human-like generalizations beyond the one task the model was trained on. We hope that the ingredients outlined in this article will prove useful for working towards this goal: seeing objects and agents rather than features, building causal models and not just recognizing patterns, recombining representations without needing to retrain, and learning-to-learn rather than starting from scratch.
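The interplay of the three ingredients above can be illustrated with a toy sketch: a small inventory of primitive parts (standing in for what learning-to-learn would acquire from prior experience), novel concepts formed as ordered combinations of those parts (compositionality), and new exemplars of a concept produced by re-running its generative process with small perturbations (a crude stand-in for a causal generative model). All names and primitives here are hypothetical illustrations for exposition, not the authors' actual models.

```python
import itertools
import random

# Hypothetical primitive parts, assumed to be learned from prior
# experience (the "learning-to-learn" ingredient).
PRIMITIVES = ["line", "arc", "hook", "loop"]

def novel_concepts(max_parts=2):
    """Enumerate candidate concepts as ordered combinations of
    primitives (the "compositionality" ingredient)."""
    concepts = []
    for n in range(1, max_parts + 1):
        concepts.extend(itertools.permutations(PRIMITIVES, n))
    return concepts

def generate_exemplar(concept, rng):
    """Produce a new instance of a concept by re-running its generative
    process with small perturbations to each part (a toy stand-in for
    the "causality" ingredient); each part gets a noisy scale factor."""
    return [(part, round(rng.gauss(1.0, 0.1), 2)) for part in concept]

rng = random.Random(0)
concepts = novel_concepts()
print(len(concepts))  # 4 one-part + 12 two-part concepts = 16

# Two exemplars of the same concept share its parts and their order,
# but differ in token-level details -- recognition and production from
# a single shared causal description.
target = ("line", "arc")
a = generate_exemplar(target, rng)
b = generate_exemplar(target, rng)
assert [p for p, _ in a] == [p for p, _ in b] == list(target)
```

Even in this toy form, the sketch shows why free combination of parts alone is not enough: the exemplar generator is what ties a combination of parts to a coherent family of instances, just as causality gives composed ideas their coherence and purpose.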
Acknowledgments

We are grateful to Peter Battaglia, Matt Botvinick, Y-Lan Boureau, Shimon Edelman, Nando de Freitas, Anatole Gershman, George Kachergis, Leslie Kaelbling, Andrej Karpathy, George Konidaris, Tejas Kulkarni, Tammy Kwan, Michael Littman, Gary Marcus, Kevin Murphy, Steven Pinker, Pat Shafto, David Sontag, Pedro Tsividis, and four anonymous reviewers for helpful comments on early versions of this manuscript. Tom Schaul was very helpful in answering questions regarding the DQN learning curves and Frostbite scoring. This work was supported by the Center for Brains, Minds and Machines (CBMM), under NSF STC award CCF-1231216, and the Moore-Sloan Data Science Environment at NYU.

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., & de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. arXiv preprint.
Anselmi, F., Leibo, J. Z., Rosasco, L., Mutch, J., Tacchetti, A., & Poggio, T. (2016). Unsupervised learning of invariant representations. Theoretical Computer Science.
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
Baillargeon, R. (2004). Infants' physical world. Current Directions in Psychological Science, 13, 89–94. doi: 10.1111/j.0963-7214.2004.00281.x
Baillargeon, R., Li, J., Ng, W., & Yuan, S. (2009). An account of infants' physical reasoning. Learning and the Infant Mind, 66–116.
Baker, C. L., Saxe, R., & Tenenbaum, J. B. (2009). Action understanding as inverse planning. Cognition, 113(3), 329–349.
Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11(3), 211–227.
Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding. Neuron, 76, 695–711.
Bates, C. J., Yildirim, I., Tenenbaum, J. B., & Battaglia, P. W. (2015). Humans predict liquid dynamics using probabilistic simulation. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.
Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.
Baudiš, P., & Gailly, J.-l. (2012). Pachi: State of the art open source Go program. In Advances in Computer Games (pp. 24–38). Springer.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149–198.
Bayer, H. M., & Glimcher, P. W. (2005). Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129–141.
Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, 253–279.
Berlyne, D. E. (1966). Curiosity and exploration. Science, 153, 25–33.
Berthiaume, V. G., Shultz, T. R., & Onishi, K. H. (2013). A constructivist connectionist model of transitions on false-belief tasks. Cognition, 126(3), 441–458.
Berwick, R. C., & Chomsky, N. (2016). Why Only Us: Language and Evolution. Cambridge, MA: MIT Press.
Bever, T. G., & Poeppel, D. (2010). Analysis by synthesis: a (re-)emerging program of research for language and vision. Biolinguistics, 4, 174–200.
Bi, G.-q., & Poo, M.-m. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annual Review of Neuroscience, 24, 139–166.
Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2), 115–147.
Bienenstock, E., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. The Journal of Neuroscience, 2(1), 32–48.
Bienenstock, E., Geman, S., & Potter, D. (1997). Compositionality, MDL priors, and object recognition. In Advances in Neural Information Processing Systems.
Bloom, P. (2000). How Children Learn the Meanings of Words. Cambridge, MA: MIT Press.
Blundell, C., Uria, B., Pritzel, A., Li, Y., Ruderman, A., Leibo, J. Z., … Hassabis, D. (2016). Model-free episodic control. arXiv preprint.
Bobrow, D. G., & Winograd, T. (1977). An overview of KRL, a knowledge representation language. Cognitive Science, 1, 3–46.
Boden, M. A. (1998). Creativity and artificial intelligence. Artificial Intelligence, 103, 347–356.
Boden, M. A. (2006). Mind as Machine: A History of Cognitive Science. Oxford University Press.
Bonawitz, E., Denison, S., Griffiths, T. L., & Gopnik, A. (2014). Probabilistic models, learning algorithms, and response variability: sampling in cognitive development. Trends in Cognitive Sciences, 18, 497–500.
Bottou, L. (2014). From machine learning to machine reasoning. Machine Learning, 94(2), 133–149.
Bouton, M. E. (2004). Context and behavioral processes in extinction. Learning & Memory, 11, 485–494.
Buckingham, D., & Shultz, T. R. (2000). The developmental course of distance, time, and velocity concepts: A generative connectionist model. Journal of Cognition and Development, 1(3), 305–345.
Buesing, L., Bill, J., Nessler, B., & Maass, W. (2011). Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Computational Biology, 7, e1002211.
Carey, S. (1978). The child as word learner. In J. Bresnan, G. Miller, & M. Halle (Eds.), Linguistic Theory and Psychological Reality (pp. 264–293).
Carey, S. (2004). Bootstrapping and the origin of concepts. Daedalus, 133(1), 59–68.
Carey, S. (2009). The Origin of Concepts. New York, NY: Oxford University Press.
Carey, S., & Bartlett, E. (1978). Acquiring a single new word. Papers and Reports on Child Language Development, 15, 17–29.
Chouard, T. (2016, March). The Go files: AI computer wraps up 4-1 victory against human champion. ([Online; posted 15-March-2016])
Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR) (pp. 3642–3649).
Collins, A. G. E., & Frank, M. J. (2013). Cognitive control over learning: Creating, clustering, and generalizing task-set structure. Psychological Review, 120(1), 190–229.
Cook, C., Goodman, N. D., & Schulz, L. E. (2011). Where science starts: spontaneous experiments in preschoolers' exploratory play. Cognition, 120(3), 341–349.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Csibra, G. (2008). Goal attribution to inanimate agents by 6.5-month-old infants. Cognition, 107, 705–717.
Csibra, G., Biro, S., Koos, O., & Gergely, G. (2003). One-year-old infants use teleological representations of actions productively. Cognitive Science, 27, 111–133.
Dalrymple, D. (2016). Differentiable programming. Retrieved from https://www.edge.org/response-detail/26794
Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9), 92–103.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8, 1704–1711.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.
Deacon, T. W. (1998). The Symbolic Species: The Co-evolution of Language and the Brain. W. W. Norton & Company.
Deci, E. L., & Ryan, R. M. (1975). Intrinsic Motivation. Wiley Online Library.
de Jonge, M., & Racine, R. J. (1985). The effects of repeated induction of long-term potentiation in the dentate gyrus. Brain Research, 328, 181–185.
Denton, E., Chintala, S., Szlam, A., & Fergus, R. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems.
Diuk, C., Cohen, A., & Littman, M. L. (2008). An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning (ICML) (pp. 240–247).
Dolan, R. J., & Dayan, P. (2013). Goals and habits in the brain. Neuron, 80, 312–325.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531.
Economides, M., Kurth-Nelson, Z., Lübbert, A., Guitart-Masip, M., & Dolan, R. J. (2015). Model-based reasoning in humans becomes automatic with training. PLoS Computational Biology, 11, e1004463.
Edelman, S. (2015). The minority report: some common assumptions to reconsider in the modelling of the brain and behaviour. Journal of Experimental & Theoretical Artificial Intelligence, 28(4), 751–776.
Eden, M. (1962). Handwriting and pattern recognition. IRE Transactions on Information Theory, 160–166.
Eliasmith, C., Stewart, T. C., Choo, X., Bekolay, T., DeWolf, T., Tang, Y., & Rasmussen, D. (2012). A large-scale model of the functioning brain. Science, 338(6111), 1202–1205.
Elman, J. L. (2005). Connectionist models of cognitive development: Where next? Trends in Cognitive Sciences, 9(3), 111–117.
Elman, J. L., Bates, E. A., Johnson, M. H., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking Innateness. Cambridge, MA: MIT Press.
Eslami, S. M. A., Heess, N., Weber, T., Tassa, Y., Kavukcuoglu, K., & Hinton, G. E. (2016). Attend, infer, repeat: Fast scene understanding with generative models. arXiv preprint arXiv:1603.08575.
Eslami, S. M. A., Tarlow, D., Kohli, P., & Winn, J. (2014). Just-in-time learning for fast and flexible inference. In Advances in Neural Information Processing Systems (pp. 154–162).
Fodor, J. A. (1975). The Language of Thought. Harvard University Press.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
Frank, M. C., Goodman, N. D., & Tenenbaum, J. B. (2009). Using speakers' referential intentions to model early cross-situational word learning. Psychological Science, 20, 578–585.
Freyd, J. (1983). Representing the dynamics of a static form. Memory and Cognition, 11(4), 342–346.
Freyd, J. (1987). Dynamic mental representations. Psychological Review, 94(4), 427–438.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Gallistel, C., & Matzel, L. D. (2013). The neuroscience of learning: beyond the Hebbian synapse. Annual Review of Psychology, 64, 169–200.
Gelly, S., & Silver, D. (2008). Achieving master level play in 9 x 9 computer Go.
Gelly, S., & Silver, D. (2011). Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11), 1856–1875.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data Analysis. Chapman and Hall/CRC.
Gelman, A., Lee, D., & Guo, J. (2015). Stan: A probabilistic programming language for Bayesian inference and optimization. Journal of Educational and Behavioral Statistics, 40, 530–543.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gershman, S. J., & Goodman, N. D. (2014). Amortized inference in probabilistic reasoning. In Proceedings of the 36th Annual Conference of the Cognitive Science Society.
Gershman, S. J., Horvitz, E. J., & Tenenbaum, J. B. (2015). Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 349, 273–278.
Gershman, S. J., Markman, A. B., & Otto, A. R. (2014). Retrospective revaluation in sequential decision making: A tale of two systems. Journal of Experimental Psychology: General, 143, 182–194.
Gershman, S. J., Vul, E., & Tenenbaum, J. B. (2012). Multistability and perceptual inference. Neural Computation, 24, 1–24.
Gerstenberg, T., Goodman, N. D., Lagnado, D. A., & Tenenbaum, J. B. (2015). How, whether, why: Causal judgments as counterfactual contrasts. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.
Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521, 452–459.
Goodman, N. D., Mansinghka, V. K., Roy, D. M., Bonawitz, K., & Tenenbaum, J. B. (2008). Church: A language for generative models. In Uncertainty in Artificial Intelligence.
Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111(1), 3–32.
Gopnik, A., & Meltzoff, A. N. (1999). Words, thoughts, and theories. Mind: A Quarterly Review of Philosophy, 108.
Graves, A. (2014). Generating sequences with recurrent neural networks. arXiv preprint.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 6645–6649).
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv preprint.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.
Grefenstette, E., Hermann, K. M., Suleyman, M., & Blunsom, P. (2015). Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems.
Gregor, K., Besse, F., Rezende, D. J., Danihelka, I., & Wierstra, D. (2016). Towards conceptual compression. arXiv preprint.
Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning (ICML).
Griffiths, T. L., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. B. (2010). Probabilistic models of cognition: exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8), 357–364.
Griffiths, T. L., Vul, E., & Sanborn, A. N. (2012). Bridging levels of analysis for probabilistic models of cognition. Current Directions in Psychological Science, 21, 263–268.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grosse, R., Salakhutdinov, R., Freeman, W. T., & Tenenbaum, J. B. (2012). Exploiting compositionality to explore a large space of model structures. In Uncertainty in Artificial Intelligence.
Guo, X., Singh, S., Lee, H., Lewis, R. L., & Wang, X. (2014). Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems (pp. 3338–3346).
Gweon, H., Tenenbaum, J. B., & Schulz, L. E. (2010). Infants consider both the sample and the sampling process in inductive generalization. Proceedings of the National Academy of Sciences, 107, 9066–9071. doi: 10.1073/pnas.1003095107
Halle, M., & Stevens, K. (1962). Speech recognition: A model and a program for research. IRE Transactions on Information Theory, 8(2), 155–159.
Hamlin, K. J. (2013). Moral judgment and action in preverbal infants and toddlers: Evidence for an innate moral core. Current Directions in Psychological Science, 22, 186–193. doi: 10.1177/0963721412470687
Hamlin, K. J., Ullman, T., Tenenbaum, J., Goodman, N. D., & Baker, C. (2013). The mentalistic basis of core social cognition: Experiments in preverbal infants and a computational model. Developmental Science, 16, 209–226. doi: 10.1111/desc.12017
Hamlin, K. J., Wynn, K., & Bloom, P. (2007). Social evaluation by preverbal infants. Nature, 450, 557–560.
Hamlin, K. J., Wynn, K., & Bloom, P. (2010). Three-month-olds show a negativity bias in their social evaluations. Developmental Science, 13, 923–929. doi: 10.1111/j.1467-7687.2010.00951.x
Harlow, H. F. (1949). The formation of learning sets. Psychological Review, 56(1), 51–65.
Harlow, H. F. (1950). Learning and satiation of response in intrinsically motivated complex puzzle performance by monkeys. Journal of Comparative and Physiological Psychology, 43, 289–294.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: what is it, who has it, and how did it evolve? Science, 298, 1569–1579.
Hayes-Roth, B., & Hayes-Roth, F. (1979). A cognitive model of planning. Cognitive Science, 3, 275–310.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint.
Hebb, D. O. (1949). The Organization of Behavior. Wiley.
Heess, N., Tarlow, D., & Winn, J. (2013). Learning to pass expectation propagation messages. In Advances in Neural Information Processing Systems (pp. 3219–3227).
Hespos, S. J., & Baillargeon, R. (2008). Young infants' actions reveal their developing knowledge of support variables: Converging evidence for violation-of-expectation findings. Cognition, 107, 304–316.
Hespos, S. J., Ferry, A. L., & Rips, L. J. (2009). Five-month-old infants have different expectations for solids and liquids. Psychological Science, 20(5), 603–611.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214), 1158–1161.
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., … Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29, 82–97.
Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
Hoffman, D. D., & Richards, W. A. (1984). Parts of recognition. Cognition, 18, 65–96.
Hofstadter, D. R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern. New York: Basic Books.
Horst, J. S., & Samuelson, L. K. (2008). Fast mapping but poor retention by 24-month-old infants. Infancy, 13(2), 128–157.
Huang, Y., & Rao, R. P. (2014). Neurons as Monte Carlo samplers: Bayesian inference and learning in spiking networks. In Advances in Neural Information Processing Systems (pp. 1943–1951).
Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychological Review, 99(3), 480–517.
Jackendoff, R. (2003). Foundations of Language. Oxford University Press.
Jara-Ettinger, J., Gweon, H., Tenenbaum, J. B., & Schulz, L. E. (2015). Children's understanding of the costs and rewards underlying rational action. Cognition, 140, 14–23.
Jern, A., & Kemp, C. (2013). A probabilistic account of exemplar and category generation. Cognitive Psychology, 66(1), 85–125.
Jern, A., & Kemp, C. (2015). A decision network account of reasoning about other people's choices. Cognition, 142, 12–38.
Johnson, S. C., Slaughter, V., & Carey, S. (1998). Whose gaze will infants follow? The elicitation of gaze-following in 12-month-olds. Developmental Science, 1, 233–238. doi: 10.1111/1467-7687.00036
Juang, B. H., & Rabiner, L. R. (1990). Hidden Markov models for speech recognition. Technometrics, 33(3), 251–272.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition (CVPR).
Kemp, C. (2007). The acquisition of inductive constraints. Unpublished doctoral dissertation, MIT.
Keramati, M., Dezfouli, A., & Piray, P. (2011). Speed/accuracy trade-off between the habitual and the goal-directed processes. PLoS Computational Biology, 7, e1002055.
Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11), e1003915.
Kilner, J. M., Friston, K. J., & Frith, C. D. (2007). Predictive coding: An account of the mirror neuron system. Cognitive Processing, 8(3), 159–166.
Kingma, D. P., Rezende, D. J., Mohamed, S., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Neural Information Processing Systems (NIPS).
Koch, G., Zemel, R. S., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop.
Kodratoff, Y., & Michalski, R. S. (2014). Machine Learning: An Artificial Intelligence Approach (Vol. 3). Morgan Kaufmann.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection (Vol. 1). MIT Press.
Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1, 417–446.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (pp. 1097–1105).
Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., & Mansinghka, V. (2015). Picture: A probabilistic programming language for scene perception. In Computer Vision and Pattern Recognition (CVPR).
Kulkarni, T. D., Narasimhan, K. R., Saeedi, A., & Tenenbaum, J. B. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. arXiv preprint.
Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015). Deep convolutional inverse graphics network. In Computer Vision and Pattern Recognition (CVPR).
Lake, B. M. (2014). Towards more human-like concept learning in machines: Compositionality, causality, and learning-to-learn. Unpublished doctoral dissertation, MIT.
Lake, B. M., Lee, C.-y., Glass, J. R., & Tenenbaum, J. B. (2014). One-shot learning of generative speech concepts. In Proceedings of the 36th Annual Conference of the Cognitive Science Society (pp. 803–808).
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2012). Concept learning as motor program induction: A large-scale empirical study. In Proceedings of the 34th Annual Conference of the Cognitive Science Society.
Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
Lake, B. M., Zaremba, W., Fergus, R., & Gureckis, T. M. (2015). Deep neural networks predict category typicality ratings for images. In Proceedings of the 37th Annual Conference of the Cognitive Science Society.
Landau, B., Smith, L. B., & Jones, S. S. (1988). The importance of shape in early lexical learning. Cognitive Development, 3(3), 299–321.
Langley, P., Bradshaw, G., Simon, H. A., & Zytkow, J. M. (1987). Scientific Discovery: Computational Explorations of the Creative Processes. MIT Press.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2323.
Lerer, A., Gross, S., & Fergus, R. (2016). Learning physical intuition of block towers by example. arXiv preprint.
Levy, R. P., Reali, F., & Griffiths, T. L. (2009). Modeling the effects of memory on human online sentence processing with particle filters. In Advances in Neural Information Processing Systems (pp. 937–944).
Liao, Q., Leibo, J. Z., & Poggio, T. (2015). How important is weight symmetry in backpropagation? arXiv preprint arXiv:1510.05067.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461.
Lillicrap, T. P., Cownden, D., Tweed, D. B., & Akerman, C. J. (2014). Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247.
Lloyd, J., Duvenaud, D., Grosse, R., Tenenbaum, J., & Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Proceedings of the National Conference on Artificial Intelligence (Vol. 2, pp. 1242–1250).
Lombrozo, T. (2009). Explanation and categorization: How "why?" informs "what?". Cognition, 110(2), 248–253.
Lopez-Paz, D., Bottou, L., Schölkopf, B., & Vapnik, V. (2016). Unifying distillation and privileged information. In International Conference on Learning Representations (ICLR).
Lopez-Paz, D., Muandet, K., Schölkopf, B., & Tolstikhin, I. (2015). Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Luong, M.-T., Le, Q. V., Sutskever, I., Vinyals, O., & Kaiser, L. (2015). Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
Lupyan, G., & Bergen, B. (2016). How language programs the mind. Topics in Cognitive Science, 8(2), 408–424. Retrieved from http://doi.wiley.com/10.1111/tops.12155
Lupyan, G., & Clark, A. (2015). Words and the world: Predictive coding and the language-perception-cognition interface. Current Directions in Psychological Science, 24(4), 279–284.
Macindoe, O. (2013). Sidekick agents for sequential planning problems. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Magid, R. W., Sheskin, M., & Schulz, L. E. (2015). Imagination and the generation of new ideas. Cognitive Development, 34, 99–110.
Mansinghka, V., Selsam, D., & Perov, Y. (2014). Venture: A higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099.
Marcus, G. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37, 243–282.
Marcus, G. (2001). The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press.
Markman, A. B., & Makin, V. S. (1998). Referential communication and category acquisition. Journal of Experimental Psychology: General, 127(4), 331–354.
Markman, A. B., & Ross, B. H. (2003). Category use and category learning. Psychological Bulletin, 129(4), 592–613.
Markman, E. M. (1989). Categorization and Naming in Children. Cambridge, MA: MIT Press.
Marr, D. C. (1982). Vision. San Francisco, CA: W.H. Freeman and Company.
Marr, D. C., & Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B, 200(1140), 269–294.
McClelland, J. L. (1988). Parallel distributed processing: Implications for cognition and development (Tech. Rep.). DTIC Document.
McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T. T., Seidenberg, M. S., & Smith, L. B. (2010). Letting structure emerge: connectionist and dynamical systems approaches to cognition. Trends in Cognitive Sciences, 14(8), 348–356.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419–457.
McClelland, J. L., Rumelhart, D. E., & the PDP Research Group. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume II. Cambridge, MA: MIT Press.
Mikolov, T., Joulin, A., & Baroni, M. (2016). A roadmap towards machine intelligence. arXiv preprint.
Mikolov, T., Sutskever, I., & Chen, K. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
Miller, E. G., Matsakis, N. E., & Viola, P. A. (2000). Learning from one example through shared densities on transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Miller, G. A., & Johnson-Laird, P. N. (1976). Language and Perception. Cambridge, MA: Belknap Press.
Minsky, M. L. (1974). A framework for representing knowledge. MIT-AI Laboratory Memo 306.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
Mitchell, T. M., Keller, R. R., & Kedar-Cabelli, S. T. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47–80.
Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning (pp. 1791–1799).
Mnih, V., Heess, N., Graves, A., & Kavukcuoglu, K. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems 27 (pp. 1–9).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Mohamed, S., & Rezende, D. J. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (pp. 2125–2133).
Moreno-Bote, R., Knill, D. C., & Pouget, A. (2011). Bayesian sampling in visual perception. Proceedings of the National Academy of Sciences, 108, 12491–12496.
Murphy, G. L. (1988). Comprehending complex concepts. Cognitive Science, 12(4), 529–562.
Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92(3), 289–316.
Murphy, G. L., & Ross, B. H. (1994). Predictions from uncertain categorizations. Cognitive Psychology, 27, 148–193.
Neisser, U. (1966). Cognitive Psychology. New York: Appleton-Century-Crofts.
Newell, A., & Simon, H. A. (1961). GPS, a program that simulates human thought. Defense Technical Information Center.
Newell, A., & Simon, H. A. (1972). Human Problem Solving. Prentice-Hall.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.
O'Donnell, T. J. (2015). Productivity and Reuse in Language: A Theory of Linguistic Computation and Storage. Cambridge, MA: MIT Press.
Osherson, D. N., & Smith, E. E. (1981). On the adequacy of prototype theory as a theory of concepts. Cognition, 9(1), 35–58.
Parisotto, E., Ba, J. L., & Salakhutdinov, R. (2016). Actor-mimic: Deep multitask and transfer reinforcement learning. In International Conference on Learning Representations (ICLR).
Pecevski, D., Buesing, L., & Maass, W. (2011). Probabilistic inference in general graphical models through sampling in stochastic networks of spiking neurons. PLoS Computational Biology, 7, e1002294.
Peterson, J. C., Abbott, J. T., & Griffiths, T. L. (2016). Adapting deep network features to capture psychological representations. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.
Piantadosi, S. T. (2011). Learning and the language of thought. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Pinker, S. (2007). The Stuff of Thought. Penguin.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.
Power, J. M., Thompson, L. T., Moyer, J. R., & Disterhoft, J. F. (1997). Enhanced synaptic transmission in CA1 hippocampus after eyeblink conditioning. Journal of Neurophysiology, 78, 1184–1187.
Premack, D., & Premack, A. J. (1997). Infants attribute value to the goal-directed actions of self-propelled objects. Journal of Cognitive Neuroscience, 9(6). doi: 10.1162/jocn.1997.9.6.848
Reed, S., & de Freitas, N. (2016). Neural programmer-interpreters. In International Conference on Learning Representations (ICLR).
Rehder, B. (2003). A causal-model theory of conceptual representation and categorization. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29(6), 1141–1159.
Rehder, B., & Hastie, R. (2001). Causal knowledge and categories: The effects of causal beliefs on categorization, induction, and similarity. Journal of Experimental Psychology: General, 130(3), 323–360.
Rehling, J. A. (2001). Letter Spirit (Part Two): Modeling creativity in a visual domain. Unpublished doctoral dissertation, Indiana University.
Rezende, D. J., Mohamed, S., Danihelka, I., Gregor, K., & Wierstra, D. (2016). One-shot generalization in deep generative models. In International Conference on Machine Learning (ICML).
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML).
Rips, L. J. (1975). Inductive judgments about natural categories. Journal of Verbal Learning and Verbal Behavior, 14(6), 665–681.
Rips, L. J., & Hespos, S. J. (2015). Divisions of the physical world: Concepts of objects and substances. Psychological Bulletin, 141, 786–811.
Rogers, T. T., & McClelland, J. L. (2004). Semantic Cognition. Cambridge, MA: MIT Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rougier, N. P., Noelle, D. C., Braver, T. S., Cohen, J. D., & O'Reilly, R. C. (2005). Prefrontal cortex and flexible cognitive control: Rules without symbols. Proceedings of the National Academy of Sciences (PNAS), 102(20), 7338–7343.
Rumelhart, D. E., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(9), 533–536.
Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition (pp. 216–271).
Cam bridge, MA: MIT Press. Rumelhart, D. E., McClelland, J. L., & the PDP Research Group. (1986). Par al lel Distribute d Pr o c essing: Explor ations in the micr ostructur e of c o gnition. V olume I. Cam bridge, MA: MIT Press. Russak ovsky , O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., . . . F ei-F ei, L. (2015). ImageNet lar ge sc ale visual r e c o gnition chal lenge (T ec h. Rep.). Russell, S., & Norvig, P . (2003). Artificial Intel ligenc e: A Mo dern Appr o ach . Upp er Saddle River, NJ: Pren tice Hall. Rusu, A. A., Rabino witz, N. C., Desjardins, G., So y er, H., Kirkpatric k, J., Kavuk cuoglu, K., . . . Hadsell, R. (2016). Progressive Neural Netw orks. arXiv pr eprint . Retrieved from http:// Salakh utdinov, R., T enenbaum, J., & T orralba, A. (2012). One-shot learning with a hierarchical nonparametric Bay esian mo del. JMLR Workshop on Unsup ervise d and T r ansfer L e arning , 27 , 195–207. Salakh utdinov, R., T enenbaum, J. B., & T orralba, A. (2013). Learning with Hierarchical-Deep Mo dels. IEEE T r ansactions on Pattern Analysis and Machine Intel ligenc e , 35 (8), 1958–71. Salakh utdinov, R., T orralba, A., & T enenbaum, J. (2011). Learning to Share Visual Appearance for Multiclass Ob ject Detection. In Computer Vision and Pattern R e c o gnition (CVPR). San b orn, A. N., Mansinghk a, V. K., & Griffiths, T. L. (2013). Reconciling intuitiv e ph ysics and newtonian mec hanics for colliding ob jects. Psycholo gic al R eview , 120 (2), 411. Scellier, B., & Bengio, Y. (2016). T ow ards a biologically plausible bac kprop. arXiv pr eprint arXiv:1602.05179 . Sc hank, R. C. (1972). Conceptual dep endency: A theory of natural language understanding. Co gnitive Psycholo gy , 3 , 552–631. Sc haul, T., Quan, J., An tonoglou, I., & Silver, D. (2016). Prioritized Exp erience Replay. In International Confer enc e on L e arning R epr esentations (ICLR). Retrieved from http://arxiv .org/abs/1511.05952 Sc hlottmann, A., Cole, K., W atts, R., & White, M. 
(2013). Domain-sp ecific p erceptual causalit y in c hildren dep ends on the spatio-temp oral configuration, not motion onset. F r ontiers in Psycholo gy , 4 . doi: 10.3389/fpsyg.2013.00365 Sc hlottmann, A., Ray , E. D., Mitchell, A., & Demetriou, N. (2006). Perceiv ed physical and so cial causalit y in animated motions: Sp on taneous rep orts and ratings. A cta Psycholo gic a , 123 , 112–143. doi: 10.1016/j.actpsy .2006.05.006 Sc hmidhuber, J. (2015). Deep learning in neural net w orks: An o verview. Neur al Networks , 61 , 85–117. Sc holl, B. J., & Gao, T. (2013). Perceiving Animacy and Inten tionalit y: Visual Pro cessing or 55 Higher-Lev el Judgment? So cial p er c eption: Dete ction and interpr etation of animacy, agency, and intention . Sc hultz, W., Da y an, P ., & Montague, P. R. (1997). A neural substrate of prediction and reward. Scienc e , 275 , 1593–1599. Sc hulz, L. (2012). The origins of inquiry: Inductive inference and exploration in early childhoo d. T r ends in Co gnitive Scienc es , 16 (7), 382–9. Sc hulz, L. E., Gopnik, A., & Glymour, C. (2007). Presc ho ol children learn ab out causal structure from conditional interv entions. Developmental Scienc e , 10 , 322–332. doi: 10.1111/j.1467 -7687.2007.00587.x Sermanet, P ., Eigen, D., Zhang, X., Mathieu, M., F ergus, R., & LeCun, Y. (2014). Ov erF eat: In tegrated Recognition, Lo calization and Detection using Conv olutional Netw orks. In Inter- national Confer enc e on L e arning R epr esentations (ICLR). Shafto, P ., Go o dman, N. D., & Griffiths, T. L. (2014). A rational accoun t of p edagogical reasoning: T eaching b y , and learning from, examples. Co gnitive Psycholo gy , 71 , 55–89. Sh ultz, T. R. (2003). Computational developmental psycholo gy . MIT Press. Siegler, R. S., & Chen, Z. (1998). Dev elopmental differences in rule learning: A microgenetic analysis. Co gnitive Psycholo gy , 36 (3), 273–310. Silv er, D. (2016). Personal comm unication. Silv er, D., Huang, A., Maddison, C. 
J., Guez, A., Sifre, L., Driessche, G. V. D., . . . Hassabis, D. (2016). Mastering the game of Go with deep neural netw orks and tree search. Natur e , 529 (7585), 484–489. Smith, L. B., Jones, S. S., Landau, B., Gershk off-Sto w e, L., & Samuelson, L. (2002). Ob ject name learning pro vides on-the-job training for attention. Psycholo gic al Scienc e , 13 (1), 13–19. Solomon, K., Medin, D., & Lynch, E. (1999). Concepts do more than categorize. T r ends in Co gnitive Scienc es , 3 (3), 99–105. Sp elk e, E. S. (1990). Principles of Ob ject P erception. Co gnitive Scienc e , 14 (1), 29–56. Sp elk e, E. S. (2003). Core kno wledge. Attention and p erformanc e , 20 . Sp elk e, E. S., Gutheil, G., & V an de W alle, G. (1995). The developmen t of ob ject p erception. In Visual c o gnition: A n invitation to c o gnitive scienc e, vol. 2 (2nd e d.). an invitation to c o gnitive scienc e (pp. 297–330). Sp elk e, E. S., & Kinzler, K. D. (2007). Core kno wledge. Developmental Scienc e , 10 (1), 89–96. Sriv asta v a, N., & Salakhutdino v, R. (2013). Discriminativ e T ransfer Learning with T ree-based Priors. In A dvanc es in Neur al Information Pr o c essing Systems 26. Stadie, B. C., Levine, S., & Abb eel, P . (2016). Incentivizing Exploration In Reinforcement Learning With Deep Predictiv e Mo dels. arXiv pr eprint . Retriev ed from 1507.00814 Stahl, A. E., & F eigenson, L. (2015). Observing the unexp ected enhances infants’ learning and exploration. Scienc e , 348 (6230), 91–94. Stern b erg, R. J., & Davidson, J. E. (1995). The natur e of insight . The MIT Press. Stuhlm ¨ uller, A., T aylor, J., & Go o dman, N. D. (2013). Learning sto c hastic in v erses. In A dvanc es in Neur al Information Pr o c essing Systems (pp. 3048–3056). Sukh baatar, S., Szlam, A., W eston, J., & F ergus, R. (2015). End-T o-End Memory Netw orks. In A dvanc es in Neur al Information Pr o c essing Systems 29. Retriev ed from abs/1503.08895 Sutton, R. S. (1990). 
In tegrated architectures for learning, planning, and reacting based on ap- 56 pro ximating dynamic programming. In Pr o c e e dings of the Seventh International Confer enc e on Machine L e arning (pp. 216–224). Szegedy , C., Liu, W., Jia, Y., Sermanet, P ., Reed, S., Anguelov, D., . . . Rabino vic h, A. (2014). Going Deep er with Conv olutions. arXiv pr eprint . Retrieved from 1409.4842 T aub er, S., & Steyvers, M. (2011). Using inv erse planning and theory of mind for so cial goal inference. In Pr o c e e dings of the 33r d annual c onfer enc e of the c o gnitive scienc e so ciety (pp. 2480–2485). T ´ egl´ as, E., V ul, E., Girotto, V., Gonzalez, M., T enenbaum, J. B., & Bonatti, L. L. (2011). Pure reasoning in 12-mon th-old infants as probabilistic inference. Scienc e , 332 (6033), 1054–9. T enenbaum, J. B., Kemp, C., Griffiths, T. L., & Go o dman, N. D. (2011). How to Grow a Mind: Statistics, Structure, and Abstraction. Scienc e , 331 (6022), 1279–85. Tian, Y., & Zhu, Y. (2016). Better Computer Go Play er with Neural Net work and Long-term Prediction. In International Confer enc e on L e arning R epr esentations (ICLR). Retriev ed from T omasello, M. (2010). Origins of human c ommunic ation . MIT press. T orralba, A., Murphy , K. P ., & F reeman, W. T. (2007). Sharing visual features for m ulticlass and m ultiview ob ject detection. IEEE T r ansactions on Pattern Analysis and Machine Intel ligenc e , 29 (5), 854–869. T remoulet, P. D., & F eldman, J. (2000). P erception of animacy from the motion of a single ob ject. Per c eption , 29 , 943–951. Tsividis, P ., Gershman, S. J., T enenbaum, J. B., & Sch ulz, L. (2013). Information Selection in Noisy En vironments with Large Action Spaces. In Pr o c e e dings of the 36th Annual Confer enc e of the Co gnitive Scienc e So ciety (pp. 1622–1627). Tsividis, P ., T enenbaum, J. B., & Sch ulz, L. E. (2015). Constrain ts on h yp othesis selection in causal learning. 
Pr o c e e dings of the 37th Annual Co gnitive Scienc e So ciety . T uring, A. M. (1950). Computing Machine and In telligence. MIND , LIX , 433–460. Retrieved from http://mind.oxfordjournals.org/content/LIX/236/433 doi: http://dx.doi.org/ 10.1093 \ %2Fmind \ %2FLIX.236.433 Tv ersky , B., & Hemenw ay , K. (1984). Ob jects, Parts, and Categories. Journal of Exp erimental Psycholo gy: Gener al , 113 (2), 169–191. Ullman, S., Harari, D., & Dorfman, N. (2012). F rom simple innate biases to complex visual concepts. Pr o c e e dings of the National A c ademy of Scienc es , 109 (44), 18215–18220. Ullman, T. D., Go o dman, N. D., & T enen baum, J. B. (2012). Theory learning as sto c hastic search in the language of though t. Co gnitive Development , 27 (4), 455–480. v an den Hengel, A., Russell, C., Dick, A., Bastian, J., Pooley , D., Fleming, L., & Agapito, L. (2015). Part-based mo delling of comp ound scenes from images. In Computer Vision and Pattern R e c o gnition (CVPR) (pp. 878–886). v an Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q- learning. In Thirtieth Confer enc e on Artificial Intel ligenc e (AAAI). Vin yals, O., Blundell, C., Lillicrap, T., Ka vuk cuoglu, K., & Wierstra, D. (2016). Matching Netw orks for One Shot Learning. arXiv pr eprint . Retrieved from Vin yals, O., T oshev, A., Bengio, S., & Erhan, D. (2014). Sho w and T ell: A Neural Image Caption Generator. In International Confer enc e on Machine L e arning (ICML). V ul, E., Go o dman, N., Griffiths, T. L., & T enenbaum, J. B. (2014). One and Done? Optimal 57 Decisions F rom V ery F ew Samples. Co gnitive Scienc e . W ang, Z., Schaul, T., Hessel, M., v an Hasselt, H., Lanctot, M., & de F reitas, N. (2016). Duel- ing netw ork architectures for deep reinforcement learning. arXiv pr eprint . Retrieved from W ard, T. B. (1994). Structured imagination: The role of category structure in exemplar generation. Co gnitive Psycholo gy , 27 , 1–40. W atkins, C. 
J., & Day an, P . (1992). Q-learning. Machine L e arning , 8 , 279–292. W ellman, H. M., & Gelman, S. A. (1992). Cognitive developmen t: F oundational theories of core domains. Annual R eview of Psycholo gy , 43 , 337–75. W ellman, H. M., & Gelman, S. A. (1998). Knowledge acquisition in foundational domains. In The handb o ok of child psycholo gy (pp. 523–573). Retrieved from http://doi.apa.org/psycinfo/ 2005-01927-010 W eng, C., Y u, D., W atanab e, S., & Juang, B.-H. F. (2014). Recurrent deep neural netw orks for robust speech recognition. ICASSP, IEEE International Confer enc e on A c oustics, Sp e e ch and Signal Pr o c essing - Pr o c e e dings (2), 5532–5536. W eston, J., Chopra, S., & Bordes, A. (2015). Memory Netw orks. In International Confer enc e on L e arning R epr esentations (ICLR). Williams, J. J., & Lombrozo, T. (2010). The role of explanation in disco very and generalization: Evidence from category learning. Co gnitive Scienc e , 34 (5), 776–806. Winograd, T. (1972). Understanding natural language. Co gnitive Psycholo gy , 3 , 1–191. Winston, P. H. (1975). Learning structural descriptions from examples. In P. H. Winston (Ed.), The psycholo gy of c omputer vision. New Y ork: McGraw-Hill. Xu, F., & T enen baum, J. B. (2007). W ord learning as Bay esian inference. Psycholo gic al R eview , 114 (2), 245–272. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakh utdino v, R., . . . Bengio, Y. (2015). Show, A ttend and T ell: Neural Image Caption Generation with Visual Atten tion. In International Confer enc e on Machine L e arning (ICML). Retriev ed from .03044 Y amins, D. L. K., Hong, H., Cadieu, C. F., Solomon, E. a., Seib ert, D., & DiCarlo, J. J. (2014). P erformance-optimized hierarc hical models predict neural resp onses in higher visual cortex. Pr o c e e dings of the National A c ademy of Scienc es , 111 (23), 8619–24. Yildirim, I., Kulk arni, T. D., F reiwald, W. A., & T e. (2015). 
Efficien t analysis-by-syn thesis in vision: A computational framew ork, b eha vioral tests, and comparison with neural representations. In Pr o c e e dings of the 37th Annual Confer enc e of the Co gnitive Scienc e So ciety. Y osinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). Ho w transferable are features in deep neural net works? In A dvanc es in Neur al Information Pr o c essing Systems (NIPS). Zeiler, M. D., & F ergus, R. (2014). Visualizing and Understanding Conv olutional Netw orks. In Eur op e an Confer enc e on Computer Vision (ECCV). 58
