MineRL: A Large-Scale Dataset of Minecraft Demonstrations

The sample inefficiency of standard deep reinforcement learning methods precludes their application to many real-world problems. Methods which leverage human demonstrations require fewer samples but have been researched less. As demonstrated in the c…

Authors: William H. Guss, Brandon Houghton

William H. Guss∗†, Brandon Houghton∗, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso and Ruslan Salakhutdinov
Carnegie Mellon University, Pittsburgh, PA 15289, USA
{wguss, bhoughton, ntopin, pkwang, ccodel, mmv, rsalakhu}@cs.cmu.edu
∗ Equal contribution. † Contact Author.

Abstract

The sample inefficiency of standard deep reinforcement learning methods precludes their application to many real-world problems. Methods which leverage human demonstrations require fewer samples but have been researched less. As demonstrated in the computer vision and natural language processing communities, large-scale datasets have the capacity to facilitate research by serving as an experimental and benchmarking platform for new methods. However, existing datasets compatible with reinforcement learning simulators do not have sufficient scale, structure, and quality to enable the further development and evaluation of methods focused on using human examples. Therefore, we introduce a comprehensive, large-scale, simulator-paired dataset of human demonstrations: MineRL. The dataset consists of over 60 million automatically annotated state-action pairs across a variety of related tasks in Minecraft, a dynamic, 3D, open-world environment. We present a novel data collection scheme which allows for the ongoing introduction of new tasks and the gathering of complete state information suitable for a variety of methods. We demonstrate the hierarchality, diversity, and scale of the MineRL dataset. Further, we show the difficulty of the Minecraft domain along with the potential of MineRL in developing techniques to solve key research challenges within it.

1 Introduction

As deep reinforcement learning (DRL) methods are applied to increasingly difficult problems, the number of samples used for training increases. For example, Atari 2600 games [Bellemare et al., 2013] have been used to evaluate DQN [Mnih et al., 2015], A3C [Mnih et al., 2016], and Rainbow DQN, which each require from 44 to over 200 million frames (200 to over 900 hours) to achieve human-level performance [Hessel et al., 2018]. More complex domains demand even more: OpenAI Five utilizes 11,000+ years of Dota 2 gameplay [OpenAI, 2018], AlphaGoZero uses 4.9 million games of self-play in Go [Silver et al., 2017], and AlphaStar uses 200 years of Starcraft II gameplay [DeepMind, 2018].

Figure 1: A diagram of the MineRL data collection platform. Our system renders demonstrations from packet-level data, so the game state and rendering parameters can be changed.

This inherent sample inefficiency precludes the application of standard DRL methods to real-world problems without leveraging data augmentation techniques [Tobin et al., 2017], [Andrychowicz et al., 2018], domain alignment methods [Wang et al., 2018], or carefully designing real-world environments to allow for the required number of trials [Levine et al., 2018]. Recently, techniques leveraging trajectory examples, such as imitation learning and Bayesian reinforcement learning methods, have been successfully applied to older benchmarks and real-world problems where samples from the environment are costly. However, these techniques are still not sufficiently sample efficient for a large class of complex real-world domains.

As noted by [Kurin et al., 2017], several subfields of machine learning have been catalyzed by the introduction of datasets and efficient large-scale data collection schemes, such as Switchboard [Godfrey et al., 1992] and ImageNet [Deng et al., 2009]. Though the reinforcement learning community has created an extensive range of benchmark simulators, there is currently a lack of large-scale labeled datasets of human demonstrations for domains with a broad range of structural constraints and tasks.

Figure 2: A subset of the Minecraft item hierarchy (totaling 371 unique items). Each node is a unique Minecraft item, block, or non-player character, and a directed edge between two nodes denotes that one is a prerequisite for another. Each item presents its own unique set of challenges, so coverage of the full hierarchy by one player takes several hundred hours.

Therefore, we introduce MineRL, a large-scale dataset of over 60 million state-action pairs of human demonstrations across a range of related tasks in Minecraft. To capture the diversity of gameplay and player interactions in Minecraft, MineRL includes six tasks with a variety of research challenges including open-world multi-agent interactions, long-term planning, vision, control, and navigation, as well as explicit and implicit subtask hierarchies. We provide implementations of these tasks as sequential decision-making environments in an existing Minecraft simulator. Additionally, we introduce a novel platform and methodology for the continued collection of human demonstrations in Minecraft. As users play on our publicly available game server, we record packet-level information, which allows perfect reconstruction of each player's view and actions. This platform enables the addition of new tasks to the MineRL dataset and automatic annotation to complement current and future methods applied to Minecraft.

Demo videos and more details about the dataset can be found at http://minerl.io.
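As a quick sanity check of the Atari sample counts quoted at the start of this section, the frame totals can be converted to hours of play. The sketch below assumes a rate of 60 frames per second for Atari 2600 emulation; that rate is not stated in the text, and the helper name `frames_to_hours` is purely illustrative.

```python
# Assumption (not stated in the text): Atari 2600 emulation runs at 60 frames per second.
ATARI_FPS = 60
SECONDS_PER_HOUR = 3600


def frames_to_hours(frames: int, fps: int = ATARI_FPS) -> float:
    """Convert a number of environment frames into hours of real-time play."""
    return frames / fps / SECONDS_PER_HOUR


low = frames_to_hours(44_000_000)    # lower end of the quoted range
high = frames_to_hours(200_000_000)  # upper end of the quoted range
print(f"{low:.0f} hours to {high:.0f} hours")
```

Under that assumption, 44 million and 200 million frames correspond to roughly 204 and 926 hours, in line with the "200 to over 900 hours" quoted above.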
2 Environment: Minecraft

2.1 Description

Minecraft is a compelling domain for the development of reinforcement and imitation learning based methods because of the unique challenges it presents. It is a 3D, first-person, open-world game centered around the gathering of resources and creation of structures and items. It can be played in a single-player mode or a multi-player mode, where all players exist in and affect the same world. Games are played across many sessions, for tens of hours total, per player. Notably, the procedurally generated world is composed of discrete blocks which allow modification; over the course of gameplay, players change their surroundings by gathering resources (such as wood from trees) and constructing structures (such as shelter and storage).

As an open-world game, Minecraft has no single definable objective. Instead, players develop their own subgoals which form a multitude of natural hierarchies. Though these hierarchies can be exploited, their size and complexity contribute to Minecraft's inherent difficulty. One such hierarchy is that of item collection: for a large number of objectives in Minecraft, players must create specific tools, materials, and items which require the collection of a strict set of requisite items. The aggregate of these dependencies forms a large-scale task hierarchy (see Figure 2).

In addition to obtaining items, implicit hierarchies emerge through other aspects of gameplay. For example, players (1) construct structures to provide safety for themselves and their stored resources from naturally occurring enemies and (2) explore the world in search of natural resources, often engaging in combat with non-player characters. Both of these gameplay elements have long time horizons and exhibit flexible hierarchies due to situation-dependent requirements (such as farming a certain resource necessary to survive, enabling exploration to then gather another resource, and so on).

2.2 Existing Interest

With the development of Malmo [Johnson et al., 2016], a simulator for Minecraft, the environment has garnered great research interest: [Shu et al., 2017], [Tessler et al., 2017], and [Oh et al., 2016] have leveraged Minecraft's massive hierarchality and expressive power as a simulator to make great strides in language-grounded, interpretable multi-task option-extraction, hierarchical lifelong learning, and active perception. However, much of the existing research utilizes toy tasks in Minecraft, often restricted to 2D movement, discrete positions, or artificially confined maps unrepresentative of the intrinsic complexity that human players typically face. These restrictions reflect the difficulty of the domain as well as the inability of current approaches to cope with fully embodied human state- and action-spaces and the complexity exhibited in optimal human policies. This inability is further evidenced by the large body of work developed on Minecraft-like domains which specifically captures restricted subsets of the features of Minecraft [Salge et al., 2014], [Andreas et al., 2017], [Liu et al., 2017].

Bridging the gap between these restricted Minecraft environments and the full domain encountered by humans is a driving force behind the development of MineRL. To do this, MineRL-v0 captures core aspects of Minecraft that have motivated its use as a research domain, including its hierarchality and its large family of intrinsic subtasks. At the same time, MineRL-v0 provides the human priors and rich, automatically generated meta-data necessary to enable current and future research to tackle the full Minecraft domain.
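The item-collection hierarchy described in Section 2.1 is a directed acyclic graph of prerequisites, so one way to reason about it is to compute the transitive prerequisites of a target item and a valid order in which to obtain them. The sketch below is illustrative only: the `PREREQS` table is a hypothetical, heavily simplified subset of the 371-item hierarchy, and the function names are not part of MineRL. It relies on `graphlib`, which is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical, heavily simplified subset of the item hierarchy.
# Each item maps to the set of items that must be obtained first.
PREREQS = {
    "log": set(),
    "planks": {"log"},
    "stick": {"planks"},
    "crafting_table": {"planks"},
    "wooden_pickaxe": {"planks", "stick", "crafting_table"},
    "cobblestone": {"wooden_pickaxe"},
    "stone_pickaxe": {"cobblestone", "stick", "crafting_table"},
    "furnace": {"cobblestone", "crafting_table"},
    "iron_ore": {"stone_pickaxe"},
    "iron_ingot": {"iron_ore", "furnace"},
    "iron_pickaxe": {"iron_ingot", "stick", "crafting_table"},
}


def required_items(target, prereqs=PREREQS):
    """Return every item needed, directly or transitively, to obtain `target` (inclusive)."""
    needed, stack = set(), [target]
    while stack:
        item = stack.pop()
        if item not in needed:
            needed.add(item)
            stack.extend(prereqs[item])
    return needed


def acquisition_order(target, prereqs=PREREQS):
    """Return one valid order in which the required items can be obtained."""
    needed = required_items(target, prereqs)
    subgraph = {item: prereqs[item] for item in needed}
    # TopologicalSorter expects node -> predecessors and yields predecessors first.
    return list(TopologicalSorter(subgraph).static_order())


order = acquisition_order("iron_pickaxe")
print(order)
```

Even in this toy graph, an iron pickaxe depends on ten other items, which gives a sense of why tasks deep in the full hierarchy have such long horizons.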
3 Methods: MineRL Data Collection Platform

Classification and natural language datasets have benefited greatly from the existence of data collection platforms like Mechanical Turk, but, in contrast, the collection of gameplay data usually requires the implementation of a new platform and user acquisition scheme for each game. To that end, we introduce the first end-to-end platform for the collection of player trajectories in Minecraft, enabling the construction of the MineRL-v0 dataset.

As shown in Figure 1, our platform consists of (1) a public game server and website, where we obtain permission to record trajectories of Minecraft players in natural gameplay; (2) a custom Minecraft client plugin, which records all packet-level communication between the client and the server, so we can re-simulate and re-render human demonstrations with modifications to the game state and graphics; and (3) a data processing pipeline, which enables us to produce automatically annotated datasets of task demonstrations.

Data Acquisition. Minecraft players find the MineRL server on standard Minecraft server lists. Players first use our webpage to provide IRB consent for having their gameplay anonymously recorded. They then download a plugin for their Minecraft client which records and streams users' client-server game packets to the MineRL data repository. When playing on our server, users select a stand-alone task to complete and receive in-game currency proportional to the amount of reward obtained. For the Survival game mode (where there is no known reward function), players receive rewards only for the duration of gameplay so as not to impose an artificial reward function. We implement each of these stand-alone tasks in Malmo.

Data Pipeline. Our data pipeline enables the continued expansion of the structured information accompanying MineRL dataset releases; it allows us to resimulate, modify, and augment recorded trajectories into several algorithmically consumable formats. The pipeline serves as an extension to the core Minecraft game code and synchronously resends each recorded packet from the MineRL data repository to a Minecraft client using our custom API for automatic annotation and game-state modification. This API allows us to add annotations based on any aspect of the game state accessible from existing Minecraft simulators.

Extensibility. Our aim is to use our platform to provide an exhaustive and broad set of multi-task datasets (beyond MineRL-v0) paired with RL environments, spanning natural language, embodied reasoning, hierarchical planning, and multi-agent cooperation. The modular design of the server allows us to obtain data for a growing number of stand-alone tasks. Furthermore, the in-game economy and server community create consistent engagement from the user base, allowing us to collect data at a growing rate without incurring additional costs. The modularity, simulator compatibility, and configurability of the data pipeline also allow new datasets to be created to complement new techniques leveraging human demonstrations. For example, it is possible to conduct large-scale generalization studies by repeatedly re-rendering the data with different constraints: altered lighting, camera positions (embodied and non-embodied), and other video rendering conditions; the injection of artificial noise in observations, rewards, and actions; and game hierarchy rearrangement (swapping the function and semantics of game items).

4 Results: MineRL-v0

In this section, we introduce and analyze the MineRL-v0 dataset. We first give details about the dataset including its size, form, and packaging.
Then we indicate the wide applicability of this initial release by giving a detailed account of the included task families, followed by an analysis of the data quality, coverage, and hierarchality. To frame the usefulness of the MineRL-v0 dataset, in Section 5, we demonstrate the difficulty of our tasks with respect to out-of-the-box methods and the performance increase achieved through basic imitation learning techniques using MineRL-v0.

Figure 3: Images of various stages of the six stand-alone tasks (Survival gameplay not shown).

4.1 Dataset Details

Size. The MineRL-v0 dataset consists of 500+ hours of recorded human demonstrations over six different tasks from the data collection platform. The released data comprises four different versions of the dataset rendered with varied resolutions (64 × 64 and 192 × 256) and textures (default Minecraft and simplified). Each version individually totals over 60 million state-action pairs, with a size of 130 GB and 734 GB for the low and medium resolution datasets respectively.

Form. Each trajectory is a contiguous set of state-action pairs sampled every Minecraft game tick (at 20 game ticks per second). Each state comprises an RGB video frame of the player's point-of-view and a comprehensive set of features from the game-state at that tick: player inventory, item collection events, distances to objectives, player attributes (health, level, achievements), and details about the current GUI the player has open. The action recorded at each tick consists of: all of the keyboard presses on the client, the change in view pitch and yaw (caused by mouse movement), all player GUI click and interaction events, chat messages sent, and agglomerative actions such as item crafting.

Additional Annotations. Human trajectories are accompanied by a large set of automatically generated annotations.
For all the stand-alone tasks, we record metrics which indicate the quality of the demonstration, such as timestamped rewards, number of no-ops, number of deaths, and total score. Additionally, the trajectory meta-data includes timestamped markers for hierarchical labelings, e.g. when a house-like structure is built or certain objectives such as chopping down a tree are met.

Figure 4: Normalized histograms of the lengths of human demonstration on various MineRL tasks. The red E denotes the upper threshold for expert play on each task.

Packaging. Each version of the dataset is packaged as a Zip archive with one folder per task family and one sub-folder per demonstration. In each trajectory folder, the states and actions are stored as an H.264 compressed MP4 video of the player's POV with a max bit rate of 18Mb/s and a JSON file containing all of the non-visual features of game-state as well as the player's actions corresponding to every frame of the video. Additionally, for specific task configurations (simplifications of action and state space) we provide Numpy .npz files composed of state-action-reward tuples in vector form, promoting the accessibility of the dataset. The packaged data and accompanying documentation can be downloaded from http://minerl.io.

4.2 Tasks

The initial MineRL-v0 dataset consists of six stand-alone tasks chosen to represent difficult aspects of Minecraft that reflect challenges widely researched in the domain: hierarchality, long-term planning, and complex orienteering. Throughout all tasks, the agent has access to the same set of actions and observations as a human player, as outlined in Section 4.1. All tasks have a time limit, which is part of the observation. Details for each task follow below.

Navigation. In the Navigate task, the agent must move to a random goal location over procedurally generated, non-convex terrain with variable material type and geometry.
This is a subtask for many tasks throughout Minecraft. In addition to standard observations, the agent has access to a “compass” observation, which points to a set location, 64 blocks (meters) from the start location. The goal has a small random horizontal offset from this location and may be slightly below surface level, so the agent must find the final goal by searching based on visual features. There are two variants of the provided reward function: sparse (+1 upon reaching the goal, at which point the episode terminates), and dense (reward proportional to distance moved towards the goal).

Tree Chopping. The Treechop task replicates obtaining wood for producing further items. Wood is a key resource in Minecraft since it is a prerequisite for all tools (as seen by the placement of sticks in Figure 2 and Figure 6). The agent begins in a forest biome (near many trees) with an iron axe for cutting the trees. The agent is given +1 reward for obtaining each unit of wood, and the episode terminates once the agent obtains 64 units.

Figure 5: Plots of the XY positions of players in Treechop, Navigate, ObtainIronPickaxe, and ObtainDiamond overlaid so each player's individual, random initial location is (0, 0).

Obtain Item. We include four related tasks which require the agent to obtain an item further in the item hierarchy: ObtainIronPickaxe, ObtainDiamond, ObtainCookedMeat, and ObtainBed. The agent always begins in a random location without any items; this matches the starting conditions for human players in Minecraft. Each task variant corresponds to a different, frequently used item: iron pickaxe, diamond, cooked meat (four variants, one per animal source), and bed (three variants, one per dye color needed). Iron pickaxes are tools required for obtaining key materials. Diamonds are central to high-level Minecraft play, and a large portion of gameplay centers around their discovery.
Cooked meat is used to replenish stamina, and a bed is required for sleeping. Together, these items represent what a player would need to obtain to survive and access further areas of the game. The agent is given +1 reward for obtaining the required item, at which point the episode terminates.

Survival. In addition to data on specific, designed tasks, we provide data in Survival, the standard open-ended game mode used by most players. Starting from a random location without any items, players formulate their own high-level goals and obtain items to complete these goals. Data from this task can be used for learning the intricate reward functions followed by humans in open play and the corresponding policies. This data can also be used to train agents attempting to complete the other, structured tasks, or further for extracting policy sketches as in [Andreas et al., 2017].

4.3 Analysis

Human Performance

A majority of the human demonstrations in the dataset fall squarely within expert-level play. Figure 4 shows the distribution over players of time required to complete each stand-alone task. The red region in each histogram denotes the range of times which correspond to play at an expert level, computed as the average time required for task completion by players with at least five years of Minecraft experience. The large number of expert samples and rich labelings of demonstration performance enable application of many standard imitation learning techniques which assume optimality of the base policy. In addition, the beginner and intermediate level trajectories allow for the further development of techniques which leverage imperfect demonstrations.

Coverage

MineRL-v0 has near complete coverage of Minecraft. Within the Survival game mode, a large majority of the 371 subtasks for obtaining different items have been demonstrated by players hundreds to tens of thousands of times.
Further, some of these subtasks take hours to complete and require a long sequence of mining, building, exploring, and combatting enemies. As a result of the large number of task-level annotations, the dataset can be used for large-scale option extraction and skill acquisition, enabling the extension of the work of [Shu et al., 2017] and [Andreas et al., 2017]. Moreover, the rich label hierarchy of the Obtain tasks can be utilized in constructing metrics for the interpretability and quality of extracted options.

In addition to item coverage, the MineRL data collection platform is structured to promote a broad representation of game conditions. The current dataset consists of a diverse set of demonstrations extracted from 1,002 unique player sessions. In the Survival game mode, the recorded trajectories collectively cover 24,393,057 square meters of game content, where a square meter corresponds to one Minecraft block. For all other tasks, each demonstration occurs in a randomly initialized game world, so we collect a large number of unique, disparate trajectories for each task. In Figure 5, we show the top-down position of players over the course of completing each task where the starting state is at (0, 0). Not only does each player act in a different game world, but each player also explores a large region during each task.

Hierarchality

As exemplified by the item graph shown in Figure 2, Minecraft is deeply hierarchical, and the MineRL data collection platform is designed to capture these hierarchies both explicitly and implicitly. As a primary example, the Obtain stand-alone tasks isolate difficult yet overlapping core paths in the item hierarchy. Due to the subtask labelings provided in MineRL-v0, we can inspect and quantify the extent to which these tasks overlap.
A direct measure of hierarchality emerges through item precedence frequency graphs: graphs where nodes correspond to items obtained in a task and directed edges correspond to the number of times players obtained the source-node item immediately before the target-node item. These graphs provide a statistical view of the meta-policies of humans and the extent to which their subpolicies transfer between tasks. Figure 6 shows precedence frequency graphs constructed from MineRL trajectories on the ObtainDiamond, ObtainCookedMeat, and ObtainIronPickaxe tasks. Inspection reveals that policies for obtaining a diamond consist of subpolicies which obtain wood, torches, and iron ore. All of these are also required for the ObtainIronPickaxe task, but only some of them are used within the ObtainCookedMeat task. The effects of these overlapping subpolicies can be seen in Figure 5: players move similarly in tasks with overlapping hierarchies (such as ObtainIronPickaxe and ObtainDiamond) and move differently in tasks with less overlap.

Figure 6: Item precedence frequency graphs for ObtainDiamond (left), ObtainCookedMeat (middle), and ObtainIronPickaxe (right). The thickness of each line indicates the number of times a player collected item A then subsequently item B.

Moreover, these graphs paint a distributional picture of human meta-policies within a task: despite there being necessary graph traversal modes (e.g. wood → stone-pickaxe), depending on the situation, players adapt their strategies by acquiring items typically found later in the item precedence graph through longer paths when earlier items are unavailable. This, in turn, enables the use of MineRL-v0 in developing distributional hierarchical reinforcement learning methods.
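To make the definition of an item precedence frequency graph concrete, the sketch below counts, across a set of demonstrations, how often one item was obtained immediately before another. It is a minimal illustration rather than the authors' pipeline: the input format (one time-ordered list of obtained items per demonstration), the function name, and the example sequences are all assumptions made up for this example.

```python
from collections import Counter


def precedence_frequencies(trajectories):
    """Count how often item A was obtained immediately before item B.

    `trajectories` is an iterable of per-demonstration item sequences,
    each ordered by the time the item was obtained (assumed format).
    Returns a Counter mapping (source_item, target_item) -> edge weight.
    """
    edges = Counter()
    for items in trajectories:
        # Pair each obtained item with the one obtained right after it.
        for source, target in zip(items, items[1:]):
            edges[(source, target)] += 1
    return edges


# Made-up item-collection sequences, for illustration only.
demos = [
    ["log", "planks", "stick", "wooden_pickaxe", "cobblestone"],
    ["log", "planks", "crafting_table", "stick", "wooden_pickaxe"],
    ["log", "planks", "stick", "wooden_pickaxe", "cobblestone", "stone_pickaxe"],
]
graph = precedence_frequencies(demos)
for (source, target), count in graph.most_common(3):
    print(f"{source} -> {target}: {count}")
```

The edge weights play the role of the line thickness in Figure 6: heavily weighted edges mark transitions that nearly every player follows, while lightly weighted ones reflect situational detours.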
5 Experiments

5.1 Experiment Configuration

To showcase the difficulty of Minecraft, we evaluate the performance of three reinforcement learning methods and one behavioral cloning method on the easiest of our tasks (Treechop and Navigate (Sparse)), as well as a simplified task with additional, shaped rewards, Navigate (Dense). Specifically, we evaluate (1) Dueling Double Deep Q-networks (DQN) [Mnih et al., 2015], an off-policy, Q-learning based method; (2) Pretrain DQN (PreDQN), DQN with additional pretraining steps and the replay buffer initialized with expert demonstrations from MineRL-v0; (3) Advantage Actor Critic (A2C) [Mnih et al., 2016], an on-policy, policy gradient method; and (4) Behavioral Cloning (BC), a method using standard classification techniques to learn a policy from demonstrations. To ensure reproducibility and an accurate evaluation of these methods, we build atop the OpenAI baseline implementations [Dhariwal et al., 2017].

Observations are converted to grayscale and resized to 64x64. Due to the thousands of action combinations in Minecraft and the limitations of the baseline algorithms, we simplify the action space to be 10 discrete actions. However, behavioral cloning does not have such limitations, and performs similarly without the action space simplifications. To use human demonstrations with Pretrained DQN and Behavioral Cloning, we approximate each action with one of our 10 action primitives.

           Treechop         Navigate (S)     Navigate (D)
DQN        3.73 ± 0.61      0.00 ± 0.00      55.59 ± 11.38
A2C        2.61 ± 0.50      0.00 ± 0.00      -0.97 ± 3.23
BC         43.9 ± 31.46     4.23 ± 4.15      5.57 ± 6.00
PreDQN     4.16 ± 0.82      6.00 ± 4.65      94.96 ± 13.42
Human      64.00 ± 0.00     100.00 ± 0.00    164.00 ± 0.00
Random     3.81 ± 0.57      1.00 ± 1.95      -4.37 ± 5.10

Table 1: Results in Treechop, Navigate (S)parse, and Navigate (D)ense, over the best 100 contiguous episodes. ± denotes standard deviation. Note: humans achieve the maximum score for all tasks shown.
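The "best 100 contiguous episodes" statistic reported in Table 1 can be computed with a sliding window over the per-episode rewards of a training run. The following is a minimal sketch, not the authors' evaluation code: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def best_window(rewards, window=100):
    """Return (mean, std, start) for the contiguous window with the highest mean reward.

    `rewards` is a list of per-episode total rewards in training order.
    The standard deviation is the population form (an assumption).
    """
    if len(rewards) < window:
        raise ValueError("need at least `window` episodes")
    # Maintain a running window sum so each shift costs O(1).
    total = sum(rewards[:window])
    best_total, best_start = total, 0
    for start in range(1, len(rewards) - window + 1):
        total += rewards[start + window - 1] - rewards[start - 1]
        if total > best_total:
            best_total, best_start = total, start
    best = rewards[best_start:best_start + window]
    return fmean(best), pstdev(best), best_start


# Toy run: performance peaks in the middle of training.
toy_rewards = [0] * 50 + [1] * 100 + [0] * 50
mean, std, start = best_window(toy_rewards, window=100)
print(mean, std, start)
```

Because the window is chosen after the fact, this statistic reports peak rather than final performance, which is worth keeping in mind when comparing against the human and random baselines.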
We train each reinforcement learning method for 1500 episodes (approximately 12 million frames). To train Behavioral Cloning, we use expert trajectories from each respective task family and train until policy performance reaches its maximum.

5.2 Evaluation and Discussion

We compare algorithms by the highest average reward obtained over a 100-episode window during training. We also report the performance of random policies and 50th percentile human performance. The results are summarized in Table 1.

In all tasks, the learned agents perform significantly worse than human performance. Treechop exhibits the largest difference: humans achieve a score of 64, but reinforcement agents achieve scores of less than 4. This suggests that our tasks are quite difficult, especially given that the Obtain tasks build upon the Treechop task by requiring the completion of several additional subgoals (≥ 3). We hypothesize that a large source of difficulty comes from the environment's inherent long horizon credit assignment problems. For example, it is hard for agents to learn to navigate through water because it takes many transitions before the agent dies by drowning.

In light of these difficulties, our data is useful in improving performance and sample efficiency: in all tasks, methods that leverage human data perform better. As seen in Figure 7, the agent pretrained on expert demonstrations achieved higher reward per episode and attained high performance using fewer samples. Expert demonstrations are particularly helpful in environments where random exploration is unlikely to yield any reward, like Navigate (Sparse).

6 Related Work

A number of domains have been previously solved through imitation learning and a dataset of human demonstrations. These include the Atari domain using the Atari Grand Challenge dataset [Kurin et al., 2017] and the Super Tux Kart domain using an on-demand dataset [Ross et al., 2011].
Unlike Minecraft, these are simple domains: they have shallow dependency hierarchies and are not open-world. Due to the small action- and state-spaces, these domains have been solved with imitation learning using relatively few samples (9.7 million frames across five games in [Kurin et al., 2017] and 20 thousand frames in [Ross et al., 2011]). In contrast, we present 60 million automatically annotated state-action pairs and do not achieve human performance.

Figure 7: Performance graphs over time with DQN and pretrained DQN on Navigate (Dense).

Existing datasets for challenging, unsolved domains are primarily for real-world tasks where a lack of simulators limits the pace of development. The KITTI dataset [Geiger et al., 2013], for example, is a dataset of 3 hours of 3D information on real-world traffic. Similarly, Dex-Net [Mahler et al., 2019] is a dataset of five million grasps with corresponding 3D pointclouds for robotic manipulation. Unlike these datasets, MineRL is directly compatible with a simulator, Malmo, thereby allowing training in the same domain as the data was gathered and comparison to methods not based on imitation learning. Additionally, the scale of MineRL is larger relative to the domain difficulty than the KITTI and Dex-Net datasets.

The only complex, unsolved domain with an existing simulator and large-scale dataset is StarCraft II. However, StarCraft II is not open-world and so cannot be used to evaluate methods designed for embodied tasks in 3D environments. The largest dataset is currently StarData [Lin et al., 2017]. Unlike MineRL, it consists of unlabeled, extracted trajectories of standard gameplay. In contrast, MineRL includes a growing number of related tasks which represent different components of the overall Minecraft task hierarchy.
In addition, MineRL includes rich, automatically generated annotations, including subtask completion and player skill level, as well as an API to extend these labels. Together, these properties allow the use and evaluation of methods which exploit hierarchical structures.

7 Conclusion and Future Work

MineRL-v0 currently features 60 million state-action pairs of procedurally annotated human demonstrations in an open-world, simulator-paired environment. It currently contains data for six tasks, none of which can be fully solved with standard deep reinforcement learning methods. Our platform allows for the ongoing collection of demonstrations for both existing and new tasks. Thus, we host MineRL-v0 at a community accessible website, http://minerl.io, and will gather feedback on adding new annotations and tasks to MineRL.

As we expand MineRL, we expect it to be increasingly useful for a range of methods including inverse reinforcement learning, hierarchical learning, and life-long learning. We hope MineRL will become a central resource for sequential decision-making research, bolstering many branches of AI toward the common goal of developing methods capable of solving a wider range of real-world environments.

Acknowledgements

We would like to thank Greg Yang, Devendra Chaplot, Lucy Cheung, Stephanie Milani, Miranda Chen, Yiwen Yuan, Cheri Guss, Steve Shalongo, Jim Guss, Sauce, and Bridget Hickey for their insightful conversations and support.

References

[Andreas et al., 2017] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th ICML-Volume 70, pages 166–175. JMLR.org, 2017.
[Andrychowicz et al., 2018] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
[Bellemare et al., 2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 47:253–279, 2013.
[DeepMind, 2018] DeepMind. Alphastar: Mastering the real-time strategy game starcraft ii, 2018.
[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
[Dhariwal et al., 2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines, 2017.
[Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 32(11):1231–1237, 2013.
[Godfrey et al., 1992] John J Godfrey, Edward C Holliman, and Jane McDaniel. Switchboard: Telephone speech corpus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on, volume 1, pages 517–520. IEEE, 1992.
[Hessel et al., 2018] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[Johnson et al., 2016] Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The malmo platform for artificial intelligence experimentation. In IJCAI, pages 4246–4247, 2016.
[Kurin et al., 2017] Vitaly Kurin, Sebastian Nowozin, Katja Hofmann, Lucas Beyer, and Bastian Leibe. The atari grand challenge dataset. arXiv preprint arXiv:1705.10998, 2017.
[Levine et al., 2018] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR, 37(4-5):421–436, 2018.
[Lin et al., 2017] Zeming Lin, Jonas Gehring, Vasil Khalidov, and Gabriel Synnaeve. Stardata: A starcraft ai research dataset. In Thirteenth AIDE Conference, 2017.
[Liu et al., 2017] Jerry Liu, Fisher Yu, and Thomas Funkhouser. Interactive 3d modeling with a generative adversarial network. In 2017 IC3DV, pages 126–134. IEEE, 2017.
[Mahler et al., 2019] Jeffrey Mahler, Matthew Matl, Vishal Satish, Michael Danielczuk, Bill DeRose, Stephen McKinley, and Ken Goldberg. Learning ambidextrous robot grasping policies. Science Robotics, 4(26):eaau4984, 2019.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937, 2016.
[Oh et al., 2016] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in minecraft. arXiv preprint arXiv:1605.09128, 2016.
[OpenAI, 2018] OpenAI. Openai five, Sep 2018.
[Ross et al., 2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th ICIAS, pages 627–635, 2011.
[Salge et al., 2014] Christoph Salge, Cornelius Glackin, and Daniel Polani. Changing the environment based on empowerment as intrinsic motivation. Entropy, 16(5):2789–2819, 2014.
[Shu et al., 2017] Tianmin Shu, Caiming Xiong, and Richard Socher. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning. arXiv preprint arXiv:1712.07294, 2017.
[Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
[Tessler et al., 2017] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in minecraft. In Thirty-First AAAI, 2017.
[Tobin et al., 2017] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23–30. IEEE, 2017.
[Wang et al., 2018] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
