Multimodal Hierarchical Dirichlet Process-based Active Perception

Journal of Machine Learning Research 1 (****) **-**  Submitted 4/00; Published 10/00

Tadahiro Taniguchi  TANIGUCHI@CI.RITSUMEI.AC.JP
Toshiaki Takano  TAKANO@EM.CI.RITSUMEI.AC.JP
Department of Human and Computer Intelligence, Ritsumeikan University, Nojihigashi 1-1-1, Kusatsu, Shiga 525-8577 Japan.

Ryo Yoshino  YOSHINO@EM.CI.RITSUMEI.AC.JP
Graduate School of Information Science and Engineering, Ritsumeikan University, Nojihigashi 1-1-1, Kusatsu, Shiga 525-8577 Japan.

Editor: ****

Abstract

In this paper, we propose an active perception method for recognizing object categories based on the multimodal hierarchical Dirichlet process (MHDP). The MHDP enables a robot to form object categories using multimodal information, e.g., visual, auditory, and haptic information, which can be observed by performing actions on an object. However, performing many actions on a target object requires a long time. In a real-time scenario, i.e., when the time is limited, the robot has to determine the set of actions that is most effective for recognizing a target object. We propose an MHDP-based active perception method that uses the information gain (IG) maximization criterion and the lazy greedy algorithm. We show that the IG maximization criterion is optimal in the sense that the criterion is equivalent to a minimization of the expected Kullback–Leibler divergence between a final recognition state and the recognition state after the next set of actions. However, a straightforward calculation of IG is practically impossible. Therefore, we derive an efficient Monte Carlo approximation method for IG by making use of a property of the MHDP. We also show that the IG has submodular and non-decreasing properties as a set function because of the structure of the graphical model of the MHDP.
Therefore, the IG maximization problem is reduced to a submodular maximization problem. This means that greedy and lazy greedy algorithms are effective and have a theoretical justification for their performance. We conducted an experiment using an upper-torso humanoid robot and a second one using synthetic data. The experimental results show that the method enables the robot to select a set of actions that allow it to recognize target objects quickly and accurately. The results support our theoretical outcomes.

Keywords: Active Perception, Cognitive Robotics, Topic Model, Multimodal Machine Learning, Submodular Maximization

1. Introduction

Active perception is a fundamental component of our cognitive skills. Human infants autonomously and spontaneously perform actions on an object to determine its nature. The sensory information that we can obtain usually depends on the actions performed on the target object. For example, when a person finds a box placed in front of him/her, he/she cannot perceive its weight without holding the box, and he/she cannot determine its sound without hitting or shaking it. In other words, we can obtain sensory information about an object by selecting and executing actions to manipulate it. Adequate action selection is important for recognizing objects quickly and accurately. This example about a human also holds for a robot. An autonomous robot that moves and helps people in a living environment should also select adequate actions to recognize target objects.

Figure 1: Overview of active perception for multimodal object category recognition.
For example, when a person asks an autonomous robot to bring an empty plastic bottle, the robot has to examine many objects by applying several actions (Fig. 1). This type of active perception is important because our object categories are formed on the basis of multimodal information, i.e., not only visual information, but also auditory, haptic, and other information. Therefore, a computational model of active perception should be consistently based on a computational model for multimodal object categorization and recognition.

This paper considers the active perception problem for multimodal object recognition. Specifically, we adopt the multimodal hierarchical Dirichlet process (MHDP) proposed by Nakamura et al. (2011b) as a representative computational model for multimodal object categorization, and we develop an active perception method based on the MHDP. The MHDP is a sophisticated, fully Bayesian probabilistic model for multimodal object categorization. It is a multimodal extension of the hierarchical Dirichlet process (HDP) Teh et al. (2006), which is a nonparametric Bayesian extension of latent Dirichlet allocation (LDA) Blei et al. (2003), which in turn was originally proposed for document-word clustering. Nakamura et al. (2011b) showed that the MHDP enables a robot to form object categories using multimodal information, i.e., visual, auditory, and haptic information, in an unsupervised manner. Because of the nature of Bayesian nonparametrics, the MHDP can estimate the number of object categories as well. In spite of the wide range of studies on active perception and multimodal categorization for robots, active perception methods, i.e., action selection methods for perception in multimodal categorization, have not been sufficiently explored from a theoretical viewpoint (see Section 2).
This paper describes a new MHDP-based active perception method for multimodal object recognition based on object categories formed by a robot itself. We found that an active perception method with good theoretical properties can be derived by taking the MHDP as a robot's multimodal categorization method. In this study, we define the active perception problem in the context of unsupervised multimodal object categorization as follows.

• Which set of actions should a robot take to recognize a target object as accurately as possible under the constraint that the number of actions is restricted?

Our MHDP-based active perception method uses an information gain (IG) maximization criterion, Monte Carlo approximation, and the lazy greedy algorithm. In this paper, we show that the MHDP provides the following three advantages for deriving an efficient active perception method.

1. The IG maximization criterion is optimal in the sense that a selected set of actions minimizes the expected Kullback–Leibler (KL) divergence between the final posterior distribution estimated using the information regarding all modalities and the posterior distribution of the category estimated using the selected set of actions.

2. An efficient Monte Carlo approximation method for IG can be derived.

3. The IG has submodular and non-decreasing properties as a set function. Therefore, the greedy and lazy greedy algorithms are guaranteed to be near-optimal strategies.

Although these desirable properties are due to the theoretical characteristics of the MHDP, they have never been pointed out in previous studies.
The main contributions of this paper are that we present the above three properties of the MHDP clearly, develop an MHDP-based active perception method, and show its effectiveness through experiments using an upper-torso humanoid robot and synthetic data. The proposed active perception method can be used for general purposes, i.e., not only for robots but also for other target domains to which the MHDP can be applied. In addition, the proposed method can be easily extended to multimodal latent Dirichlet allocation (MLDA), which is a multimodal extension of latent Dirichlet allocation (LDA) Nakamura et al. (2009); Blei et al. (2003), and to other multimodal categorization methods with similar graphical models. However, in this paper, we focus on the MHDP and the robot active perception scenario, and we explain our method on the basis of this task.

The remainder of this paper is organized as follows. Section 2 describes the background and work related to our study. Section 3 briefly introduces the MHDP, proposed by Nakamura et al. (2011b), which enables a robot to obtain object categories by fusing multimodal sensor information in an unsupervised manner. Section 4 describes our proposed action selection method. Section 5 discusses the effectiveness of the action selection method through experiments using an upper-torso humanoid robot. Section 6 describes a supplemental experiment using synthetic data. Section 7 concludes this paper.

2. Background and Related Work

In this section, we describe the background of and work related to this paper.

2.1 Multimodal Categorization

The human capability for object categorization is a fundamental topic in cognitive science Barsalou (1999). In the field of robotics, the adaptive formation of object categories that considers a robot's embodiment, i.e., its sensory-motor system, is gathering attention as a way to solve the symbol grounding problem Harnad (1990); Taniguchi et al. (2015).
Recently, various computational models and machine learning methods for multimodal object categorization have been proposed in artificial intelligence, cognitive robotics, and related research fields Celikkanat et al. (2014); Sinapov and Stoytchev (2011); Natale et al. (2004); Araki et al. (2012); Ando et al. (2013); Nakamura et al. (2007, 2009, 2011b,a, 2014); Griffith et al. (2012); Iwahashi et al. (2010); Roy and Pentland (2002); Sinapov et al. (2014).

Figure 2: Graphical representation of the HDP Sudderth et al. (2005) and the MHDP Nakamura et al. (2011a).

For example, Sinapov and Stoytchev (2011) proposed a graph-based multimodal categorization method that allows a robot to recognize a new object by its similarity to a set of familiar objects. They also built a robotic system that categorizes 100 objects from multimodal information in a supervised manner Sinapov et al. (2014). Celikkanat et al. (2014) modeled the context in terms of a set of concepts that allow many-to-many relationships between objects and contexts using latent Dirichlet allocation. Of these, a series of statistical multimodal categorization methods for autonomous robots has been proposed by extending LDA, i.e., a topic model Araki et al. (2012); Ando et al. (2013); Nakamura et al. (2007, 2009, 2011b,a, 2014). All these methods are Bayesian generative models, and the MHDP is a representative method of this series Nakamura et al. (2011a). The MHDP is an extension of the HDP, which was proposed by Teh et al. (2006), and the HDP is a nonparametric Bayesian extension of LDA Blei et al. (2003). A graphical model of the HDP is shown in Fig. 2(a).
Concretely, the graphical model of the MHDP has multiple types of emissions that correspond to the various sensor data obtained through the various modality inputs, as shown in Fig. 2(b). In the HDP, observation data are usually represented as a bag-of-words (BoW). In contrast, the observation data in the MHDP use bag-of-features (BoF) representations for multimodal information. The latent variables $t_{jn}$ are regarded as indicators of topics in the HDP, which correspond to object categories in the MHDP. Nakamura et al. (2011b) showed that the MHDP enables a robot to categorize a large number of objects in a home environment into categories that are similar to human categorization results.

To obtain multimodal information, a robot has to perform actions and interact with a target object in various ways, e.g., grasping, shaking, or rotating the object. If the number of actions and types of sensor information increase, multimodal categorization and recognition can require a longer time. In most practical cases, the execution of an action by a robot takes longer than it does for a human for mechanical and safety reasons. In many cases, one action can take longer than 30 seconds, although that depends on the particular robotic system. When the recognition time is constrained and/or quick recognition is required, it becomes important for a robot to select a small number of actions that are effective for accurate recognition. Action selection for recognition is often called active perception. However, an active perception method for the MHDP has not been proposed. This paper aims to provide an active perception method for the MHDP.

2.2 Active Perception

Generally, active perception is one of the most important cognitive capabilities of humans.
From an engineering viewpoint, active perception encompasses many specific tasks, e.g., localization, mapping, navigation, object recognition, object segmentation, and self–other differentiation.

Historically, active vision, i.e., active visual perception, has been studied as an important engineering problem in computer vision. Roy et al. (2004) presented a comprehensive survey of active three-dimensional object recognition. For example, Borotshnig et al. (2000) proposed an active vision method in a parametric eigenspace to improve the visual classification results. Denzler et al. (2002) proposed an information-theoretic action selection method to gather information that conveys the true state of a system through an active camera; they used the mutual information (MI) as a criterion for action selection. Krainin et al. (2011) developed an active perception method in which a mobile robot manipulates an object to build a three-dimensional surface model of it. Their method uses the IG criterion to determine when and how the robot should grasp the object.

Modeling and/or recognizing a single object as well as modeling a scene and/or segmenting objects are also important tasks in the context of robotics. Eidenberger et al. (2010) proposed an active perception planning method for scene modeling in a realistic environment. Hoof et al. (2012) proposed an active scene exploration method that enables an autonomous robot to efficiently segment a scene into its constituent objects by interacting with the objects in an unstructured environment; they used IG as a criterion for action selection. InfoMax control for acoustic exploration was proposed by Rebguns et al. (2011).

Localization, mapping, and navigation are also targets of active perception. Velez et al. (2012) presented an online planning algorithm that enables a mobile robot to generate plans that maximize the expected performance of object detection. Burgard et al.
(1997) proposed an active perception method for localization, in which action selection is performed by maximizing the weighted sum of the expected entropy and the expected costs; to reduce the computational cost, they consider only a subset of the next locations. Roy et al. (1999) proposed a coastal navigation method in which a robot generates trajectories to its goal by minimizing the positional uncertainty at the goal. Stachniss et al. (2005) proposed an information-gain-based exploration method for mapping and localization. Correa et al. proposed an active perception method for a mobile robot with a visual sensor mounted on a pan-tilt mechanism to reduce localization uncertainty; they used the IG criterion, which was estimated using a particle filter.

In addition, various studies on active perception by robots have been conducted Gouko et al. (2013); Saegusa et al. (2011); Ji and Carin (2006); Tuci et al. (2010); Natale et al. (2004); Schneider et al. (2009); Sushkov and Sammut (2012); Hogman et al. (2013); Ivaldi et al. (2014); Fishel and Loeb (2012); Pape et al. (2012). In spite of the large number of contributions on active perception, few theories of active perception for multimodal object category recognition have been proposed. In particular, an MHDP-based active perception method has not yet been proposed, although the MHDP-based categorization method and its series of extensions have obtained many successful results.

In machine learning, active learning is a well-defined term. Active learning algorithms select an unobserved input datum and ask a user (labeler) to provide a training signal (label) in order to reduce uncertainty as quickly as possible Cohn et al. (1996); Settles (2012); Muslea et al. (2006). These algorithms usually assume a supervised learning problem. This problem is related to, but fundamentally different from, the problem considered in this paper.
2.3 Active Perception for Multimodal Categorization

Sinapov et al. (2014) investigated multimodal categorization and active perception by making a robot perform 10 different behaviors, obtain visual, auditory, and haptic information, explore 100 different objects, and classify them into 20 object categories. In addition, they proposed an active behavior selection method based on confusion matrices. They reported that the method was able to reduce the exploration time by half by dynamically selecting the next exploratory behavior. However, their multimodal categorization is performed in a supervised manner, and their theory of active perception is still heuristic; the method does not have theoretical guarantees of performance.

IG-based active perception is popular, as shown above, but the theoretical justification for using IG in each task is often missing in robotics papers. Moreover, in many cases, the IG cannot be evaluated directly, reliably, or accurately. When one takes an IG criterion-based approach, how to estimate the IG is an important problem. In this study, we focus on MHDP-based active perception and develop an efficient near-optimal method with a firm theoretical justification.

3. Multimodal Hierarchical Dirichlet Process for Statistical Multimodal Categorization

We assume that a robot forms object categories from multimodal sensory data using the MHDP. In this section, we briefly introduce the MHDP, on which our proposed active perception method is based Nakamura et al. (2011a). The MHDP assumes that an observation node in its graphical model corresponds to an action and its corresponding modality. Nakamura et al. (2011b) employed three observation nodes in their graphical model, i.e., haptic, visual, and auditory information nodes. Three actions, i.e., grasping, looking around, and shaking, correspond to these modalities, respectively.
However, the MHDP can be easily extended to a model with additional types of sensory inputs, and autonomous robots will no doubt also gain more types of actions for perception. To model more general cases, an MHDP with $M$ actions is described in this paper. A graphical model of the MHDP that is more general than the one in Fig. 2 is illustrated in Fig. 3. The index $m \in \mathcal{M}$ $(\#(\mathcal{M}) = M)$ in Fig. 3 represents the type of information that corresponds to an action–modality perception pair, e.g., hitting an object to obtain its sound, grasping an object to test its shape and hardness, or looking at all of an object by rotating it. The observation $x^m_{jn} \in X^m$ is the $m$-th modality's $n$-th feature for the $j$-th target object. The observation $x^m_{jn}$ is assumed to be drawn from a categorical distribution whose parameter is $\theta^m_k$, where $k$ is the index of a latent topic. The parameter $\theta^m_k$ is assumed to be drawn from a Dirichlet prior distribution whose parameter is $\alpha^m_0$. The MHDP assumes that a robot obtains each modality's sensory information as a BoF representation.

Figure 3: Graphical representation of an MHDP with M modalities corresponding to actions for perception.

Similarly to the generative process of the original HDP Teh et al. (2006), the generative process of the MHDP can be described as a Chinese restaurant franchise (CRF). The learning and recognition algorithms are both derived using Gibbs sampling. In the learning process, the MHDP estimates a latent variable $t^m_{jn}$ for each feature of the $j$-th object and a topic index $k_{jt}$ for each latent variable $t$. The combination of a latent variable and a topic index corresponds to a topic in LDA Blei et al. (2003).
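As a concrete illustration of these generative assumptions, the sketch below draws a per-topic emission distribution $\theta^m_k$ from a symmetric Dirichlet prior and then emits BoF features from the resulting categorical distribution. The dimension `d_m`, the prior value, and the helper names are illustrative assumptions, not taken from the paper's implementation.

```python
import random

def sample_dirichlet(alpha, rng):
    # Dirichlet sample via normalized Gamma draws.
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(theta, rng):
    # Inverse-CDF sampling from a categorical distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(theta):
        acc += p
        if r < acc:
            return i
    return len(theta) - 1

rng = random.Random(0)
d_m = 5                         # hypothetical dimension of modality m's BoF
alpha0 = [0.1] * d_m            # symmetric Dirichlet prior alpha^m_0
theta = sample_dirichlet(alpha0, rng)        # topic's emission distribution
features = [sample_categorical(theta, rng) for _ in range(20)]  # BoF draws
```

A small Dirichlet concentration such as 0.1 yields sparse emission distributions, which is the usual regime for topic models.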
Using the estimated latent variables, the categorical distribution parameter $\theta^m_k$ and the topic proportion $\pi_j$ of the $j$-th object are drawn from the posterior distribution. The selection procedure for the latent variable $t^m_{jn}$ is as follows. The prior probability that $x^m_{jn}$ selects $t$ is

\[
P(t^m_{jn}=t \mid \lambda)=
\begin{cases}
\dfrac{\sum_m w^m N^m_{jt}}{\lambda+\sum_m w^m N^m_j-1}, & (t=1,\dots,T_j),\\[6pt]
\dfrac{\lambda}{\lambda+\sum_m w^m N^m_j-1}, & (t=T_j+1),
\end{cases}
\]

where $w^m$ is a weight for the $m$-th modality, $N^m_{jt}$ is the number of $m$-th modality observations that are allocated to $t$ in the $j$-th object, and $\lambda$ is a hyperparameter. In the Chinese restaurant process, if the number of observed features $N_{jt}=\sum_m w^m N^m_{jt}$ that are allocated to $t$ increases, the probability that a new observation is allocated to the latent variable $t$ increases. Using the prior distribution, the posterior probability that the observation $x^m_{jn}$ is allocated to the latent variable $t$ becomes

\[
P(t^m_{jn}=t \mid X^m,\lambda) \propto P(x^m_{jn}\mid X^m_{k=k_{jt}})\,P(t^m_{jn}=t\mid\lambda)=
\begin{cases}
P(x^m_{jn}\mid X^m_{k=k_{jt}})\,\dfrac{\sum_m w^m N^m_{jt}}{\lambda+\sum_m w^m N^m_j-1}, & (t=1,\dots,T_j),\\[6pt]
P(x^m_{jn}\mid X^m_{k=k_{jt}})\,\dfrac{\lambda}{\lambda+\sum_m w^m N^m_j-1}, & (t=T_j+1),
\end{cases}
\]

where $N^m_j$ is the number of the $m$-th modality's observations for the $j$-th object. The observations that correspond to the $m$-th modality and have the $k$-th topic in any object are represented by $X^m_k$. In the Gibbs sampling procedure, a latent variable for each observation is drawn from this posterior probability distribution. If $t=T_j+1$, the observation is allocated to a new latent variable. The dish selection procedure is as follows.
The prior probability that the $k$-th topic is allocated to the $t$-th latent variable becomes

\[
P(k_{jt}=k \mid \gamma)=
\begin{cases}
\dfrac{M_k}{\gamma+M-1}, & (k=1,\dots,K),\\[6pt]
\dfrac{\gamma}{\gamma+M-1}, & (k=K+1),
\end{cases}
\]

where $K$ is the number of topic types and $M_k$ is the number of latent variables on which the $k$-th topic is placed. Therefore, the posterior probability that the $k$-th topic is allocated to the $t$-th latent variable becomes

\[
P(k_{jt}=k \mid X,\gamma) \propto P(X_{jt}\mid X_k)\,P(k_{jt}=k\mid\gamma)=
\begin{cases}
P(X_{jt}\mid X_k)\,\dfrac{M_k}{\gamma+M-1}, & (k=1,\dots,K),\\[6pt]
P(X_{jt}\mid X_k)\,\dfrac{\gamma}{\gamma+M-1}, & (k=K+1).
\end{cases}
\]

A topic index for the latent variable $t$ of the $j$-th object is drawn using this posterior probability, where $\gamma$ is a hyperparameter. If $k=K+1$, a new topic is placed on the latent variable. By sampling $t^m_{jn}$ and $k_{jt}$, the Gibbs sampler performs probabilistic object clustering:

\begin{align}
t^m_{jn} &\sim P(t^m_{jn} \mid X^{-mjn}, \lambda), \tag{1}\\
k_{jt} &\sim P(k_{jt} \mid X^{-jt}, \gamma), \tag{2}
\end{align}

where $X^{-mjn}=X^m_j\setminus\{x^m_{jn}\}$ and $X^{-jt}=X_t\setminus X_{jt}$. By sampling $t^m_{jn}$ for each observation in every object using (1) and sampling $k_{jt}$ for each latent variable $t$ in every object using (2), all of the latent variables in the MHDP can be inferred. If $t^m_{jn}$ and $k_{jt}$ are given, the probability that the $j$-th object is included in the $k$-th category becomes

\[
P(k \mid X_j)=\sum_{t=1}^{T_j}\delta_k(k_{jt})\,\frac{\sum_m w^m N^m_{jt}}{\sum_m w^m N^m_j}, \tag{3}
\]

where $X_j=\cup_m X^m_j$, $w^m$ is the weight for the $m$-th modality, and $\delta_a(x)$ is a delta function. When a robot attempts to recognize a new object after the learning phase, the probability that the feature $x^m_{jn}$ is generated from the $k$-th topic becomes

\[
P(x^m_{jn}\mid X^m_k)=\frac{w^m N^m_{k x^m_{jn}}+\alpha^m_0}{w^m N^m_k+d^m\alpha^m_0},
\]

where $d^m$ denotes the dimension of the $m$-th modality input. The topic $k_t$ allocated to $t$ for a new object is sampled from

\[
k_t \sim P(k_{jt}=k\mid X,\gamma) \propto P(X_{jt}\mid X_k)\,\frac{\gamma}{\gamma+M-1}.
\]
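One step of this CRF sampler, the allocation of a single feature to an existing or a new table, can be sketched as follows. The interface is hypothetical: `likelihoods[t]` stands in for $P(x^m_{jn} \mid X^m_{k=k_{jt}})$ with the last entry being the likelihood under a new table, `N_jt` holds the modality-weighted counts $\sum_m w^m N^m_{jt}$, and the shared denominator $\lambda+\sum_m w^m N^m_j-1$ cancels under normalization.

```python
def table_posterior(likelihoods, N_jt, lam):
    """Normalized posterior over table assignments for one feature.
    Existing table t has probability proportional to likelihood * count;
    a new table is opened with probability proportional to likelihood * lambda.
    A sketch of the CRF step, not the paper's exact implementation."""
    probs = [likelihoods[t] * N_jt[t] for t in range(len(N_jt))]
    probs.append(likelihoods[-1] * lam)   # open a new table w.p. proportional to lambda
    z = sum(probs)
    return [p / z for p in probs]

# Two existing tables with weighted counts 4 and 1, hyperparameter lambda = 1.
post = table_posterior(likelihoods=[0.3, 0.1, 0.2], N_jt=[4.0, 1.0], lam=1.0)
```

A sample drawn from `post` would then be the Gibbs update (1); an analogous routine with counts $M_k$ and hyperparameter $\gamma$ gives the dish update (2).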
These sampling procedures play an important role in the Monte Carlo approximation of our proposed method (see Section 4.2). For a more detailed explanation of the MHDP, please refer to Nakamura et al. (2011b). Basically, a robot can autonomously learn object categories and recognize new objects using the multimodal categorization procedure described above. The performance and effectiveness of the method were evaluated in that paper.

4. Active Perception Method

In this section, we describe our active perception method based on the MHDP.

4.1 Basic Formulation

A robot will usually have already executed several actions and obtained information from several modalities when it attempts to select the next action set for recognizing a target object. For example, visual information can usually be obtained by looking at the front face of the $j$-th object from a distance before interacting with the object physically. We assume that a robot has already obtained the information corresponding to a subset of modalities $m^o_j \subset \mathcal{M}$. When a robot faces a new object and has not obtained any information, $m^o_j=\emptyset$.

The purpose of object recognition in multimodal categorization is different from that of conventional supervised learning-based pattern recognition problems. In supervised learning, the recognition result is evaluated by checking whether the output is the same as the truth label. However, in unsupervised learning, there are basically no truth labels. Therefore, the performance of active perception should be measured in a different manner.

The action set the robot selects is described as $A=\{a_1,a_2,\dots,a_{\#(A)}\}\in 2^{\mathcal{M}\setminus m^o_j}$, where $2^{\mathcal{M}\setminus m^o_j}$ is the family of subsets of $\mathcal{M}\setminus m^o_j$, i.e., $A\subset\mathcal{M}\setminus m^o_j$ and $a_i\in\mathcal{M}\setminus m^o_j$.
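When the action set is small, the candidate action sets with at most $L$ elements can simply be enumerated; the sketch below does so with hypothetical action names standing in for $\mathcal{M}$ and the already-observed subset $m^o_j$.

```python
from itertools import combinations

# Hypothetical action-modality set M and already-observed subset m_o.
M = {"look", "grasp", "shake", "hit"}
m_o = {"look"}
L = 2

# All candidate sets A with A subset of M \ m_o and 1 <= #(A) <= L.
remaining = sorted(M - m_o)
feasible = [set(c) for r in range(1, L + 1)
            for c in combinations(remaining, r)]
# Here: C(3,1) + C(3,2) = 6 candidate sets.
```

This brute-force enumeration grows combinatorially in the number of unobserved modalities, which is exactly the problem the submodular formulation in Section 4.3 addresses.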
We consider an effective action set for active perception to be one that largely reduces the distance between the final recognition state, i.e., the state after the information from all modalities $\mathcal{M}$ has been obtained, and the recognition state after the robot executes the selected action set $A$. The recognition state is represented by the posterior distribution $P(z_j \mid X^{m^o_j\cup A}_j)$. Here, $z_j=\{\{k_{jt}\}_{1\le t\le T_j},\{t^m_{jn}\}_{m\in\mathcal{M},1\le n\le N^m_j}\}$ is a latent variable representing the $j$-th object's topic information, where $X^A_j=\cup_{m\in A}X^m_j$ and $X^m_j=\{x^m_{j1},\dots,x^m_{jn},\dots,x^m_{jN^m_j}\}$. The probability $P(z_j\mid X^{m^o_j\cup A}_j)$ represents the posterior distribution related to the object category after taking the actions $m^o_j$ and $A$. The final recognition state, i.e., the posterior distribution over the latent variables after obtaining the information from all modalities $\mathcal{M}$, becomes $P(z_j\mid X^{\mathcal{M}}_j)$.

The purpose of active perception is to select a set of actions that can estimate this posterior distribution most accurately. When $L$ actions can be executed, if we employ the KL divergence as the metric of the difference between two probability distributions,

\[
\underset{A\in\mathcal{F}^{m^o_j}_L}{\text{minimize}}\ \mathrm{KL}\!\left(P(z_j\mid X^{\mathcal{M}}_j),\,P(z_j\mid X^{m^o_j\cup A}_j)\right) \tag{4}
\]

is a reasonable evaluation criterion for realizing effective active perception, where $\mathcal{F}^{m^o_j}_L=\{A\mid A\subset\mathcal{M}\setminus m^o_j,\ \#(A)\le L\}$ is the feasible set of actions. However, neither the true $X^{\mathcal{M}}_j$ nor $X^{m^o_j\cup A}_j$ can be observed before taking $A$ on the $j$-th target object, and hence they cannot be used at the moment of action selection. Therefore, a rational alternative evaluation criterion is the expected value of the KL divergence at the moment of action selection:

\[
\underset{A\in\mathcal{F}^{m^o_j}_L}{\text{minimize}}\ \mathbb{E}_{X^{\mathcal{M}\setminus m^o_j}_j\mid X^{m^o_j}_j}\!\left[\mathrm{KL}\!\left(P(z_j\mid X^{\mathcal{M}}_j),\,P(z_j\mid X^{m^o_j\cup A}_j)\right)\right]. \tag{5}
\]
Here, we propose to use the IG maximization criterion to select the next action set for active perception:

\begin{align}
A^*_j &= \underset{A\in\mathcal{F}^{m^o_j}_L}{\text{argmax}}\ \mathrm{IG}(z_j;X^A_j\mid X^{m^o_j}_j) \tag{6}\\
&= \underset{A\in\mathcal{F}^{m^o_j}_L}{\text{argmax}}\ \mathbb{E}_{X^A_j\mid X^{m^o_j}_j}\!\left[\mathrm{KL}\!\left(P(z_j\mid X^{m^o_j\cup A}_j),\,P(z_j\mid X^{m^o_j}_j)\right)\right], \tag{7}
\end{align}

where $\mathrm{IG}(X;Y\mid Z)$ is the IG of $Y$ for $X$, which is calculated on the basis of the probability distribution commonly conditioned on $Z$ as follows:

\[
\mathrm{IG}(X;Y\mid Z)=\mathrm{KL}\!\left(P(X,Y\mid Z),\,P(X\mid Z)P(Y\mid Z)\right).
\]

By definition, the expected KL divergence is the same as $\mathrm{IG}(X;Y)$. The definition of IG and its relation to the KL divergence are as follows:

\[
\mathrm{IG}(X;Y)=H(X)-H(X\mid Y)=\mathrm{KL}\!\left(P(X,Y),\,P(X)P(Y)\right)=\mathbb{E}_Y\!\left[\mathrm{KL}\!\left(P(X\mid Y),\,P(X)\right)\right].
\]

The optimality of the proposed criterion (6) is supported by Theorem 1.

Theorem 1  The set of next actions $A\in\mathcal{F}^{m^o_j}_L$ that maximizes $\mathrm{IG}(z_j;X^A_j\mid X^{m^o_j}_j)$ minimizes the expected KL divergence between the posterior distributions over $z_j$ after all modality information has been observed and after $A$ has been executed:

\[
\underset{A\in\mathcal{F}^{m^o_j}_L}{\text{argmin}}\ \mathbb{E}_{X^{\mathcal{M}\setminus m^o_j}_j\mid X^{m^o_j}_j}\!\left[\mathrm{KL}\!\left(P(z_j\mid X^{\mathcal{M}}_j),\,P(z_j\mid X^{m^o_j\cup A}_j)\right)\right]=\underset{A\in\mathcal{F}^{m^o_j}_L}{\text{argmax}}\ \mathrm{IG}(z_j;X^A_j\mid X^{m^o_j}_j).
\]

Proof  See Appendix A.

This theorem is essentially a consequence of well-known characteristics of IG (see Russo and Roy (2015); MacKay (2003), for example). It means that maximizing IG is the optimal policy for active perception in an MHDP-based multimodal object category recognition task. As a special case, when only a single action is permitted, the following corollary holds.

Corollary 2  The next action $m\in\mathcal{M}\setminus m^o_j$ that maximizes $\mathrm{IG}(z_j;X^m_j\mid X^{m^o_j}_j)$ minimizes the expected KL divergence between the posterior distributions over $z_j$ after all modality information has been observed and after the action has been executed:
\[
\underset{m\in\mathcal{M}\setminus m^o_j}{\text{argmin}}\ \mathbb{E}_{X^{\mathcal{M}\setminus m^o_j}_j\mid X^{m^o_j}_j}\!\left[\mathrm{KL}\!\left(P(z_j\mid X^{\mathcal{M}}_j),\,P(z_j\mid X^{\{m\}\cup m^o_j}_j)\right)\right]=\underset{m\in\mathcal{M}\setminus m^o_j}{\text{argmax}}\ \mathrm{IG}(z_j;X^m_j\mid X^{m^o_j}_j). \tag{8}
\]

Proof  By substituting $\{m\}$ for $A$ in Theorem 1, we obtain the corollary.

Using IG, the active perception strategy for the next single action is simply described as follows:

\[
m^*_j=\underset{m\in\mathcal{M}\setminus m^o_j}{\text{argmax}}\ \mathrm{IG}(z_j;X^m_j\mid X^{m^o_j}_j). \tag{9}
\]

This means that the robot should select the action $m^*_j$ that obtains the $X^{m^*_j}_j$ maximizing the IG for the recognition result $z_j$ under the condition that the robot has already observed $X^{m^o_j}_j$. However, we still have the following two problems.

1. The calculation of $\mathrm{IG}(z_j;X^A_j\mid X^{m^o_j}_j)$ cannot be performed in a straightforward manner.

2. The argmax operation in (6) is a combinatorial optimization problem and incurs a heavy computational cost when $\#(\mathcal{M}\setminus m^o_j)$ and $L$ become large.

Based on some properties of the MHDP, we can obtain reasonable solutions to these two problems.

4.2 Monte Carlo Approximation of IG

Equations (6) and (9) provide a robot with an appropriate criterion for selecting actions to efficiently recognize a target object. However, at first glance, the IG looks difficult to calculate. First, the expectation $\mathbb{E}_{X^A_j\mid X^{m^o_j}_j}[\,\cdot\,]$ requires a sum over all possible $X^A_j$, and the number of possible $X^A_j$ increases exponentially with the number of elements in the BoF. Second, the calculation of $P(z_j\mid X^{A\cup m^o_j}_j)$ for each possible observation $X^A_j$ requires the same computational cost as recognition in the multimodal categorization itself. Therefore, a straightforward calculation for solving (9) is computationally impossible in a practical sense.
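Before turning to the approximation, the identity behind Theorem 1, $\mathrm{IG}(X;Y)=H(X)-H(X\mid Y)=\mathbb{E}_Y[\mathrm{KL}(P(X\mid Y),P(X))]$, can be checked numerically on a toy discrete joint; the joint distribution below is arbitrary, chosen only for illustration.

```python
import math

def entropy(p):
    # Shannon entropy in nats, skipping zero-probability cells.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# An arbitrary 2x2 joint P(x, y): rows index x, columns index y.
joint = [[0.3, 0.1],
         [0.2, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Route 1: IG(X;Y) = H(X) - H(X|Y).
h_x_given_y = sum(py[y] * entropy([joint[x][y] / py[y] for x in range(2)])
                  for y in range(2))
ig = entropy(px) - h_x_given_y

# Route 2: E_Y[ KL(P(X|Y), P(X)) ].
ekl = sum(py[y] * sum((joint[x][y] / py[y])
                      * math.log((joint[x][y] / py[y]) / px[x])
                      for x in range(2))
          for y in range(2))
assert abs(ig - ekl) < 1e-12  # the two routes agree
```

The same equality, conditioned throughout on the observed data, is what licenses swapping the expected-KL objective (5) for the IG objective (6).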
However, by exploiting a characteristic property of the MHDP, an efficient Monte Carlo approximation can be derived. First, we write the IG as the expectation of a logarithmic term:

$$\mathrm{IG}(z_j; X_j^m \mid X_j^{m_j^o}) = \sum_{z_j, X_j^m} P(z_j, X_j^m \mid X_j^{m_j^o}) \log \frac{P(z_j, X_j^m \mid X_j^{m_j^o})}{P(z_j \mid X_j^{m_j^o})\, P(X_j^m \mid X_j^{m_j^o})} = \mathbb{E}_{z_j, X_j^m \mid X_j^{m_j^o}}\!\left[ \log \frac{P(z_j, X_j^m \mid X_j^{m_j^o})}{P(z_j \mid X_j^{m_j^o})\, P(X_j^m \mid X_j^{m_j^o})} \right]. \qquad (10)$$

An analytic evaluation of (10) is also practically impossible. Therefore, we adopt a Monte Carlo method. Equation (10) suggests that an efficient Monte Carlo approximation can be performed if we can sample $(z_j^{[k]}, X_j^{m[k]}) \sim P(z_j, X_j^m \mid X_j^{m_j^o})$ for $k \in \{1, \ldots, K\}$. Fortunately, the MHDP provides a sampling procedure for $z_j^{[k]} \sim P(z_j \mid X_j^{m_j^o})$ and $X_j^{m[k]} \sim P(X_j^m \mid z_j^{[k]})$ in its original paper (Nakamura et al., 2011a). In the context of multimodal categorization by a robot, $X_j^{m[k]} \sim P(X_j^m \mid z_j^{[k]})$ is a prediction of an unobserved modality's sensation from the observed modalities' sensations, i.e., cross-modal inference. The sampling process of $(z_j^{[k]}, X_j^{m[k]})$ can be regarded as a mental simulation by the robot that predicts the unobserved modality's sensation, leading to a categorization result based on the predicted sensation and the observed information:

$$(10) \approx \frac{1}{K} \sum_k \log \frac{P(z_j^{[k]}, X_j^{m[k]} \mid X_j^{m_j^o})}{P(z_j^{[k]} \mid X_j^{m_j^o})\, P(X_j^{m[k]} \mid X_j^{m_j^o})} = \frac{1}{K} \sum_k \log \underbrace{\frac{P(X_j^{m[k]} \mid z_j^{[k]}, X_j^{m_j^o})}{P(X_j^{m[k]} \mid X_j^{m_j^o})}}_{\ast}. \qquad (11)$$

In (11), the numerator $P(X_j^{m[k]} \mid z_j^{[k]}, X_j^{m_j^o})$ can be calculated easily because all of the parent nodes of $X_j^{m[k]}$ are given in the graphical model shown in Fig. 3. However, the denominator $P(X_j^{m[k]} \mid X_j^{m_j^o})$ cannot be evaluated in a straightforward way.
Again, a Monte Carlo method can be adopted:

$$P(X_j^{m[k]} \mid X_j^{m_j^o}) = \sum_{z_j} P(X_j^{m[k]} \mid z_j, X_j^{m_j^o})\, P(z_j \mid X_j^{m_j^o}) = \mathbb{E}_{z_j \mid X_j^{m_j^o}}\!\left[ P(X_j^{m[k]} \mid z_j, X_j^{m_j^o}) \right] \approx \frac{1}{K'} \sum_{k'} P(X_j^{m[k]} \mid z_j^{[k']}, X_j^{m_j^o}), \qquad (12)$$

where $K'$ is the number of samples for the second Monte Carlo approximation. Fortunately, in this approximation (12), we can efficiently reuse the samples drawn in the previous Monte Carlo approximation. By substituting (12) into (11), we finally obtain the approximate IG for the criterion of active perception, i.e., our proposed method:

$$\mathrm{IG}(z_j; X_j^m \mid X_j^{m_j^o}) \approx \frac{1}{K} \sum_k \log \frac{P(X_j^{m[k]} \mid z_j^{[k]}, X_j^{m_j^o})}{\frac{1}{K} \sum_{k'} P(X_j^{m[k]} \mid z_j^{[k']}, X_j^{m_j^o})}.$$

Note that the computational cost of evaluating the IG becomes $O(K^2)$. In summary, a robot can approximately estimate the IG for unobserved modality information by generating virtual observations based on observed data and evaluating their likelihood.

4.3 Sequential Decision Making as a Submodular Maximization

If a robot wants to select $L$ actions $A_j = \{a_1, a_2, \ldots, a_L\}$ ($a_i \in M \setminus m_j^o$), it has to solve (6), a combinatorial optimization problem. The number of combinations of $L$ actions is $\binom{\#(M \setminus m_j^o)}{L}$, which grows dramatically as the number of possible actions $\#(M \setminus m_j^o)$ and $L$ increase. For example, Sinapov et al. (2014) gave a robot 10 different behaviors in their experiment on robotic multimodal categorization. Future autonomous robots will have more available actions for interacting with a target object and will be able to obtain additional types of modality information through these interactions. Hence, it is important to develop an efficient solution to the combinatorial optimization problem. Here again, the MHDP has advantages for solving this problem.
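As a concrete illustration of the nested Monte Carlo estimator derived in Section 4.2, the following sketch uses a toy stand-in for the MHDP posterior: a categorical latent $z$ with posterior weights `p_z` (playing the role of $P(z_j \mid X_j^{m_j^o})$) and a per-category multinomial `phi` over a small BoF vocabulary for the unobserved modality. All numbers are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

p_z = np.array([0.5, 0.3, 0.2])            # toy posterior over latent categories
phi = np.array([[0.7, 0.2, 0.1],           # phi[z]: BoF word distribution of the
                [0.1, 0.8, 0.1],           # unobserved modality given category z
                [0.2, 0.2, 0.6]])
N, K = 20, 1000                            # BoF count and number of MC samples

# Draw (z^[k], X^{m[k]}) ~ P(z, X^m | observed), i.e., cross-modal inference.
zs = rng.choice(len(p_z), size=K, p=p_z)
xs = np.array([rng.multinomial(N, phi[z]) for z in zs])

# log P(X^[k] | z = c), up to the multinomial coefficient; the coefficient
# cancels in the ratio of Eq. (11) since numerator and denominator share X^[k].
ll = xs @ np.log(phi).T                    # shape (K, number of categories)
num = ll[np.arange(K), zs]                 # log P(X^[k] | z^[k])

# Denominator via the second Monte Carlo approximation (12), reusing the same
# samples: log (1/K) sum_{k'} P(X^[k] | z^[k']), computed stably in log space.
denom = np.array([np.logaddexp.reduce(ll[k, zs]) - np.log(K)
                  for k in range(K)])

ig_est = float((num - denom).mean())       # approximate IG(z; X^m | observed)
assert ig_est > 0.0                        # the modality is informative here
```

The double loop over samples is what makes the evaluation $O(K^2)$, as noted above; the inner sum is vectorized here, but the number of likelihood terms is still $K \times K$.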
Theorem 3  The evaluation criterion for multimodal active perception, $\mathrm{IG}(z_j; X_j^A \mid X_j^{m_j^o})$, is a submodular and non-decreasing function with regard to $A$.

Proof  As shown in the graphical model of the MHDP in Fig. 3, the observations for each modality $X_j^m$ are conditionally independent given the set of latent variables $z_j = \{\{k_{jt}\}_{1 \le t \le T_j}, \{t_{jn}^m\}_{m \in M,\, 1 \le n \le N_j^m}\}$. This satisfies the conditions of the theorem by Krause et al. (2005). Therefore, $\mathrm{IG}(z_j; X_j^A \mid X_j^{m_j^o})$ is a submodular and non-decreasing function with regard to $A$.

Submodularity is a property similar to the convexity of a real-valued function in a vector space. If a set function $F: 2^V \to \mathbb{R}$, where $V$ is a finite set, satisfies $F(A \cup \{x\}) - F(A) \ge F(A' \cup \{x\}) - F(A')$ for all $A \subseteq A' \subseteq V$ and $x \in V \setminus A'$, then $F$ is called a submodular function. The IG is not always a submodular function. However, Krause et al. proved that $\mathrm{IG}(U; A)$ is submodular and non-decreasing with regard to $A \subseteq S$ if all of the elements of $S$ are conditionally independent given $U$. With this theorem, Krause et al. (2005) solved the sensor allocation problem efficiently.

Theorem 3 means that problem (6) reduces to a submodular maximization problem. The greedy algorithm is known to be an efficient strategy for submodular maximization: Nemhauser et al. (1978) proved that, if the evaluation function $F(\cdot)$ is submodular, non-decreasing, and satisfies $F(\emptyset) = 0$, the greedy algorithm selects a set whose value is at least a constant factor $(1 - 1/e)$ of that of the optimal set.
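Krause et al.'s condition can be verified exactly on a tiny model: a binary latent variable $U$ and three noisy binary sensors that are conditionally independent given $U$, so that $\mathrm{IG}(U; X_A)$ is monotone and submodular in the sensor set $A$. The prior and noise rates below are arbitrary illustrative values.

```python
import itertools
import numpy as np

pU = np.array([0.6, 0.4])   # prior over the binary latent U
flip = [0.1, 0.2, 0.3]      # P(sensor i disagrees with U), one rate per sensor

def p_x_given_u(i, x, u):
    return 1 - flip[i] if x == u else flip[i]

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def ig(A):
    """IG(U; X_A) = H(U) - H(U | X_A), by exact enumeration of outcomes."""
    if not A:
        return 0.0
    h_cond = 0.0
    for xs in itertools.product([0, 1], repeat=len(A)):
        joint = np.array([pU[u] * np.prod([p_x_given_u(i, x, u)
                                           for i, x in zip(A, xs)])
                          for u in (0, 1)])          # P(U = u, X_A = xs)
        px = joint.sum()
        h_cond += px * entropy(joint / px)
    return entropy(pU) - h_cond

V = (0, 1, 2)
subsets = [tuple(s) for r in range(4) for s in itertools.combinations(V, r)]
for A in subsets:
    for B in subsets:
        if set(A) <= set(B):
            assert ig(B) >= ig(A) - 1e-12            # monotone non-decreasing
            for x in V:
                if x not in B:
                    gain_A = ig(tuple(sorted(set(A) | {x}))) - ig(A)
                    gain_B = ig(tuple(sorted(set(B) | {x}))) - ig(B)
                    assert gain_A >= gain_B - 1e-12  # diminishing returns
```

The diminishing-returns inequality checked in the inner loop is exactly the submodularity condition stated above, here exercised over every nested pair of sensor subsets.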
If the evaluation function is a submodular set function, a greedy algorithm is practically sufficient for selecting subsets in many cases. In short, a greedy algorithm gives a near-optimal solution. However, the greedy algorithm is still inefficient because it requires an evaluation of all choices at each step of the sequential decision-making process. Minoux (1978) proposed a lazy greedy algorithm that makes the greedy algorithm more efficient for submodular evaluation functions. The lazy greedy algorithm reduces the number of evaluations by exploiting the characteristics of a submodular function.

In this paper, we propose the use of the lazy greedy algorithm for selecting $L$ actions to recognize a target object, on the basis of the submodular property of the IG. The final greedy and lazy greedy algorithms for MHDP-based active perception, i.e., our proposed methods, are shown in Algorithms 1 and 2, respectively.

The main contribution of the lazy greedy algorithm is to reduce the computational cost of active perception. The majority of the computational cost originates from the number of times a robot evaluates $\mathrm{IG}_m$ to determine action sequences. When a robot has to choose $L$ actions, the brute-force algorithm that directly evaluates all alternatives $A \in \mathcal{F}_L^{m_j^o}$ using (6) requires $\binom{\#(M \setminus m_j^o)}{L}$ evaluations of $\mathrm{IG}(z_j; X_j^A \mid X_j^{m_j^o})$. In contrast, the greedy algorithm requires $\#(M \setminus m_j^o) + (\#(M \setminus m_j^o) - 1) + \cdots + (\#(M \setminus m_j^o) - L + 1)$ evaluations of $\mathrm{IG}(z_j; X_j^m \mid X_j^{m_j^o})$, i.e., $O(ML)$. The lazy greedy algorithm incurs the same computational cost as the greedy algorithm only in the worst case; in practice, the number of re-evaluations in the lazy greedy algorithm is quite small. Therefore, the computational cost of the lazy greedy algorithm grows almost in proportion to $L$, i.e., almost linearly. The memory requirement of the proposed method is also quite small.
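A generic lazy greedy selector in the spirit of Minoux (1978) can be sketched with a priority queue of cached marginal gains; stale cached gains are upper bounds under submodularity, so only the current top candidate ever needs re-evaluation. This is a simplified sketch, not the paper's Algorithm 2: `gain(item, selected)` stands in for the Monte Carlo IG estimate, and the weighted-coverage objective below is an illustrative assumption.

```python
import heapq

def lazy_greedy(items, gain, L):
    """Select L items maximizing a monotone submodular objective lazily."""
    selected = []
    # Min-heap of (-cached_gain, item, round_when_cached); negation makes it a
    # max-heap over gains, and ties break deterministically on the item.
    heap = [(-gain(i, selected), i, 0) for i in items]
    heapq.heapify(heap)
    for step in range(1, L + 1):
        while True:
            neg_g, item, stamp = heapq.heappop(heap)
            if stamp == step:            # cached gain is current: take it
                selected.append(item)
                break
            # Re-evaluate only the current top; submodularity guarantees the
            # stale cached value was an upper bound, so no candidate is missed.
            heapq.heappush(heap, (-gain(item, selected), item, step))
    return selected

# Toy objective: each candidate "action" covers a set of elements, and the
# marginal gain is the number of newly covered elements.
covers = {"a": {1, 2}, "b": {2, 3, 4}, "c": {4}, "d": {5}}

def gain(item, selected):
    covered = set().union(*(covers[s] for s in selected)) if selected else set()
    return len(covers[item] - covered)

print(lazy_greedy(list(covers), gain, 2))  # ['b', 'a']
```

After `b` is selected, `a`'s cached gain (2) is stale; the lazy step re-evaluates it to 1 and still finds it best, so the other candidates are never re-scored.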
Algorithm 1  Greedy algorithm.
Require: The MHDP is trained using a training data set. The $j$-th object is found. $m_j^o$ is initialized, and $X_j^{m_j^o}$ is observed.
for $l = 1$ to $L$ do
    for all $m \in M \setminus m_j^o$ do
        for $k = 1$ to $K$ do
            Draw $(z_j^{[k]}, X_j^{m[k]}) \sim P(z_j, X_j^m \mid X_j^{m_j^o})$
        end for
        $\mathrm{IG}_m \leftarrow \frac{1}{K} \sum_k \log \frac{P(X_j^{m[k]} \mid z_j^{[k]}, X_j^{m_j^o})}{\frac{1}{K} \sum_{k'} P(X_j^{m[k]} \mid z_j^{[k']}, X_j^{m_j^o})}$
    end for
    $m^\ast \leftarrow \operatorname{argmax}_m \mathrm{IG}_m$
    Execute the $m^\ast$-th action on the $j$-th target object and obtain $X_j^{m^\ast}$.
    $m_j^o \leftarrow m_j^o \cup \{m^\ast\}$
end for

Both the greedy and lazy greedy algorithms require memory only for $\mathrm{IG}_m$ for each modality and for the $K$ samples of the Monte Carlo approximation. These requirements are negligibly small compared with those of the MHDP itself.

5. Experiment 1: Humanoid Robot

An experiment using an upper-torso humanoid robot was conducted to verify the proposed active perception method in a real-world environment.

5.1 Conditions

In this experiment, RIC-Torso, developed by the RT Corporation, was used (see Fig. 4). RIC-Torso is an upper-torso humanoid robot that has two robot hands. We prepared an experimental environment similar to the one in the original MHDP paper (Nakamura et al., 2011a).

5.1.1 Visual Information ($m_v$)

Visual information was obtained from an Xtion PRO LIVE sensor mounted on the head of the robot. The camera was regarded as the eyes of the robot. The robot captured 74 images of a target object while it rotated on a turntable (see Fig. 4). Each image was resized to 320 × 240. Scale-invariant feature transform (SIFT) feature vectors were extracted from each captured image (Lowe, 2004), yielding a certain number of 128-dimensional feature vectors per image. Note that the SIFT features did not consider hue information.
All of the obtained feature vectors were transformed into BoF representations using k-means clustering. These BoF representations were used as the observation data for the visual modality of the MHDP. The index of this modality was defined as $m_v$.

Algorithm 2  Lazy greedy algorithm.
Require: The MHDP is trained using a training data set. The $j$-th object is found. $m_j^o$ is initialized, and $X_j^{m_j^o}$ is observed.
for all $m \in M \setminus m_j^o$ do
    for $k = 1$ to $K$ do
        Draw $(z_j^{[k]}, X_j^{m[k]}) \sim P(z_j, X_j^m \mid X_j^{m_j^o})$
    end for
    $\mathrm{IG}_m \leftarrow \frac{1}{K} \sum_k \log \frac{P(X_j^{m[k]} \mid z_j^{[k]}, X_j^{m_j^o})}{\frac{1}{K} \sum_{k'} P(X_j^{m[k]} \mid z_j^{[k']}, X_j^{m_j^o})}$
end for
$m^\ast \leftarrow \operatorname{argmax}_m \mathrm{IG}_m$
Execute the $m^\ast$-th action on the $j$-th target object and obtain $X_j^{m^\ast}$.
$m_j^o \leftarrow m_j^o \cup \{m^\ast\}$
Prepare a stack $S$ for the modality indices and initialize it.
for all $m \in M \setminus m_j^o$ do
    push$(S, (m, \mathrm{IG}_m))$
end for
for $l = 1$ to $L - 1$ do
    repeat
        $S \leftarrow$ descending_sort$(S)$  // w.r.t. $\mathrm{IG}_m$
        $(m_1, \mathrm{IG}_{m_1}) \leftarrow$ pop$(S)$, $(m_2, \mathrm{IG}_{m_2}) \leftarrow$ pop$(S)$
        // Re-evaluate $\mathrm{IG}_{m_1}$ as follows.
        for $k = 1$ to $K$ do
            Draw $(z_j^{[k]}, X_j^{m_1[k]}) \sim P(z_j, X_j^{m_1} \mid X_j^{m_j^o})$
        end for
        $\mathrm{IG}_{m_1} \leftarrow \frac{1}{K} \sum_k \log \frac{P(X_j^{m_1[k]} \mid z_j^{[k]}, X_j^{m_j^o})}{\frac{1}{K} \sum_{k'} P(X_j^{m_1[k]} \mid z_j^{[k']}, X_j^{m_j^o})}$
        push$(S, (m_2, \mathrm{IG}_{m_2}))$, push$(S, (m_1, \mathrm{IG}_{m_1}))$
    until $\mathrm{IG}_{m_1} \ge \mathrm{IG}_{m_2}$
    $m^\ast \leftarrow m_1$
    pop$(S)$
    Execute the $m^\ast$-th action on the $j$-th target object and obtain $X_j^{m^\ast}$.
    $m_j^o \leftarrow m_j^o \cup \{m^\ast\}$
end for

Figure 4: Robot used in the experiment. (Panels: visual image (looking around), haptic information (grasping), auditory input (hitting), and auditory input (shaking).)

5.1.2 Auditory Information ($m_{as}$ and $m_{ah}$)

Auditory information was obtained from a multi-powered shotgun microphone (RODE NTG-2).
The microphone was regarded as the ear of the robot. In this experiment, two types of auditory information were acquired: one generated by hitting the object and the other generated by shaking it. The two sounds were regarded as different auditory information and hence as different modality observations in the MHDP model. The two actions, i.e., hitting and shaking, were manually programmed for the robot. When the robot began to execute an action, it also started recording the object's sound (see Fig. 4). The sound was recorded until two seconds after the robot finished the action. The recorded auditory data were temporally divided into frames, and each frame was transformed into 13-dimensional Mel-frequency cepstral coefficients (MFCCs). The MFCC feature vectors were transformed into BoF representations using k-means clustering in the same way as the visual information. The indices of these modalities were defined as $m_{as}$ and $m_{ah}$ for "shake" and "hit," respectively.

5.1.3 Haptic Information ($m_h$)

Haptic information was obtained by grasping a target object with the robot's hand. When the robot attempted to obtain haptic information from an object placed in front of it, it moved its hand to the object and gradually closed its hand until a certain amount of counterforce was detected (see Fig. 4). The joint angle of the hand was measured when the hand touched the target object and when the hand stopped. These two variables and the difference between the two angles were used as a three-dimensional feature vector. When obtaining haptic information, the robot grasped the target object 10 times and obtained 10 feature vectors. The feature vectors were transformed into BoF representations using k-means clustering in the same way as for the other information types. The index of the haptic modality was defined as $m_h$.
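The feature-to-BoF pipeline shared by all modalities can be sketched as follows: cluster the training feature vectors with k-means, then represent each object as a histogram of its features' nearest cluster indices. This is a minimal stand-in (random Gaussian vectors playing the role of SIFT/MFCC features, and a hand-rolled k-means); the actual implementation and its k-means variant are not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means; returns the cluster centers (the dictionary)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def bof(features, centers):
    """Quantize feature vectors against the dictionary and histogram them."""
    labels = np.argmin(((features[:, None] - centers[None]) ** 2).sum(-1),
                       axis=1)
    return np.bincount(labels, minlength=len(centers))

train = rng.normal(size=(300, 13))   # stand-in for 13-dim MFCC-like features
dictionary = kmeans(train, k=25)     # a 25-word BoF dictionary, as for m_as
hist = bof(rng.normal(size=(80, 13)), dictionary)
assert hist.sum() == 80 and len(hist) == 25
```

In the experiment the resulting histograms were additionally resampled to fixed counts per modality (e.g., 80 for $m_{as}$), as described in the next subsection.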
5.1.4 Multimodal Information as BoF Representations

In summary, the robot could obtain multimodal information from four modalities for perception. The set of modalities was $M = \{m_v, m_{as}, m_{ah}, m_h\}$. The dimensions of the BoFs were set to 25, 25, 25, and 5 for $m_v$, $m_{as}$, $m_{ah}$, and $m_h$, respectively. The dimension of each BoF corresponds to the number of clusters used for k-means clustering. The numbers of clusters, i.e., the sizes of the dictionaries, were determined empirically on the basis of a preliminary experiment on multimodal categorization. All of the training datasets were used to train the dictionaries. The histograms of the feature vectors, i.e., the BoFs, were resampled so that their counts became $N_j^{m_v} = 100$, $N_j^{m_{as}} = 80$, $N_j^{m_{ah}} = 130$, and $N_j^{m_h} = 30$. The weight of each modality $w_m$ was set to 1. The formation of multimodal object categories itself is outside the scope of this paper; therefore, these constants were determined empirically so that the robot could form object categories similar to those formed by human participants. The number of samples $K$ in the Monte Carlo approximation for estimating the IG was set to $K = 5000$.

5.1.5 Target Objects

As the target objects, 17 types of commodities were prepared for the experiment, as shown in Fig. 5. Each index on the right-hand side of the figure indicates the index of each object. The hardness of the balls, the striking sounds of the cups, and the sounds made while shaking the bottles differed depending on the object categories. Therefore, ground-truth categorization could not be achieved using visual information alone.

5.2 Procedure

The experimental procedure was as follows. First, the robot formed object categories through multimodal categorization in an unsupervised manner.
An experimenter placed each object in front of the robot one by one. The robot looked at the object to obtain visual features, grasped it to obtain haptic features, shook it to obtain the auditory shaking features, and hit it to obtain the auditory striking features. After obtaining the multimodal information of the objects as a training data set, the MHDP was trained using a Gibbs sampler. The results of the multimodal categorization are shown in Fig. 5; the category that has the highest posterior probability for each object is shown in white. These results show that the robot can form multimodal object categories using the MHDP, as described in Nakamura et al. (2011a). After the robot had formed the object categories, we fixed the latent variables for the training data set.

Second, the experimental procedure for active perception was conducted. An experimenter placed an object in front of the robot. The robot observed the object using its camera, obtained visual information, and set $m_j^o = \{m_v\}$. The robot then determined its next set of actions for recognizing the target object using its active perception strategy.

5.3 Results

5.3.1 Selecting the Next Action

First, we describe the results for the first single action selected after obtaining visual information. In this experiment, the robot had three choices for its next action: $m_{as}$, $m_{ah}$, and $m_h$. To evaluate the results of active perception, we used $\mathrm{KL}\!\left( P(k \mid X_j^M),\ P(k \mid X_j^{A \cup m_j^o}) \right)$, i.e., the distance between the posterior distribution over the object categories $k$ in the final recognition state and that in the next recognition state, as the evaluation criterion in place of $\mathrm{KL}\!\left( P(z_j \mid X_j^M),\ P(z_j \mid X_j^{A \cup m_j^o}) \right)$, which is the original evaluation criterion in (4), because the computational cost of evaluating the latter is too high.
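The evaluation measure is an ordinary KL divergence between two categorical posteriors. The sketch below uses made-up posterior values (they are not the experiment's numbers) to show the intended behavior: an informative action should pull the intermediate posterior closer to the final one.

```python
import numpy as np

def kl(p, q):
    """KL( p, q ) for categorical distributions, skipping zero-mass entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# Hypothetical posteriors over three categories (illustrative numbers only).
p_final = [0.05, 0.90, 0.05]        # P(k | X^M): all modalities observed
p_vision = [0.40, 0.35, 0.25]       # P(k | X^{m_v}): vision only
p_vision_hit = [0.08, 0.85, 0.07]   # after vision plus one informative action

# The selected action shrinks the distance to the final recognition state.
assert kl(p_final, p_vision) > kl(p_final, p_vision_hit) > 0.0
```

Note that the KL argument order follows the paper's notation $\mathrm{KL}(P, Q)$ with the final-state posterior first.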
Figure 5: (Left) target objects used in the experiment and (right) categorization results obtained in the experiment. Seven categories were formed: hard ball (polystyrene), soft ball (vinyl), plastic cup, steel can, metal cup, plastic bottle containing bells, and empty plastic bottle. The 17 object IDs were grouped as {1}, {2, 3, 4}, {5, 6, 7}, {8}, {9, 10, 11}, {12, 13, 14}, and {15, 16, 17}.

Figure 6: (Top) KL divergence between the final recognition state and the posterior probability estimated after obtaining only visual information, (middle) estimated $\mathrm{IG}_m$ for each object based on visual information, and (bottom) KL divergence between the final recognition state and the posterior probability estimated after obtaining visual information and performing each selected action. Our theory of multimodal active perception suggests that the action with the highest information gain (middle) tends to lead the initial recognition state (whose KL divergence from the final recognition state is shown at the top) to the recognition state whose KL divergence from the final recognition state (bottom) is the smallest. These figures suggest that the probabilistic relationships were satisfied as a whole.

Table 1: Number of successfully recognized objects.

    v only: 8/17 | v+IG.min: 11/17 | v+IG.mid: 15/17 | v+IG.max: 16/17 | Full information: 17/17

Fig.
6 (top) shows the KL divergence between the posterior probabilities of the category after obtaining the information from all modalities and after obtaining only visual information. For some objects, e.g., objects 6 and 7, the figure shows that visual information was sufficient for the robot to recognize them. For many objects, however, visual information alone could not lead the recognition state to the final state, which could be reached using the information of all modalities. Fig. 6 (middle) shows $\mathrm{IG}_m$ calculated using the visual information for each action, and Fig. 6 (bottom) shows the KL divergence between the final recognition state and the posterior probability estimated after obtaining visual information and the information of each selected action. We observe that an action with a higher value of $\mathrm{IG}_m$ tended to reduce the KL divergence further, as Theorem 1 suggests. Fig. 7 shows the average KL divergence from the final recognition state after executing an action selected by the $\mathrm{IG}_m$ criterion. Here, IG.min, IG.mid, and IG.max denote the actions with the minimum, middle, and maximum values of $\mathrm{IG}_m$, respectively. These results show that IG.max clearly reduced the uncertainty about the target objects.

The precision of category recognition after an action execution is summarized in Table 1. Basically, a category recognition result is obtained as the posterior distribution (3) in the MHDP; for illustrative purposes in Table 1, the category with the highest posterior probability is considered to be the recognition result. Obtaining information by executing IG.max almost always increased the recognition performance. Examples of changes in the posterior distribution are shown in Figs. 8 and 9 for objects 8 ("metal cup") and 12 ("plastic bottle containing bells"), respectively. The robot could not clearly recognize the category of object 8 after obtaining visual information. The estimated $\mathrm{IG}_m$ values in Fig.
6 show that $m_{ah}$ was IG.max for the 8th object. Fig. 8 shows that $m_{ah}$ reduced the uncertainty and allowed the robot to correctly recognize the object as category 6, a metal cup. This means that the robot noticed that the target object was a metal cup by hitting it and listening to its metallic sound. The metal cup did not make a sound when the robot shook it; therefore, the IG for $m_{as}$ was small. As Fig. 9 shows, the robot first recognized the 12th object as a plastic bottle containing bells with high probability and as an empty plastic bottle with a low probability. Fig. 6 shows that the $\mathrm{IG}_m$ criterion suggested $m_{ah}$ as the first alternative and $m_{as}$ as the second. Fig. 9 shows that $m_{as}$ and $m_{ah}$ could determine that the target object was not an empty plastic bottle, but $m_h$ could not. As humans, we would expect to differentiate an empty bottle from a bottle containing bells by shaking or hitting the bottle, and to differentiate a metal cup from a plastic cup by hitting it. The proposed active perception method constructively reproduced this behavior in a robotic system using an unsupervised multimodal machine learning approach.

5.3.2 Selecting the Next Set of Multiple Actions

We evaluated the greedy and lazy greedy algorithms for sequential decision making in active perception. The KL divergence from the final state, averaged over all target objects at each step, is shown in Fig. 10. Under each condition, the KL divergence gradually decreased and reached almost zero; however, the rate of decrease differed notably.
As the theory of submodular optimization suggests, the greedy algorithm was shown to be a better solution than the average case and only slightly worse than the best case (Nemhauser et al., 1978).

Figure 7: Reduction in the KL divergence achieved by executing an action selected on the basis of the $\mathrm{IG}_m$ maximization criterion. The KL divergences between the recognition state after executing the second action and the final recognition state are calculated for all objects and shown as box plots. An action with more information brings the recognition state closer to the final recognition state.

Figure 8: Posterior probability of the category for object 8 after executing each action. These results show that the action with the highest information gain, i.e., $m_{ah}$, allowed the robot to efficiently estimate that the true object category was "metal cup."

Figure 9: Posterior probability of the category for object 12 after executing each action. These results show that the actions with the highest and second-highest information gain, i.e., $m_{ah}$ and $m_{as}$, allowed the robot to efficiently estimate that the true object category was "plastic bottle containing bells."

Figure 10: KL divergence from the final state at each step of each sequential action selection procedure. Note that the line of the lazy greedy algorithm overlaps that of the greedy algorithm.
The best and worst cases were selected after all possible sequences of actions had been performed. The "average" is the average of the KL divergences obtained over all possible action sequences. The results of the lazy greedy algorithm were almost the same as those of the greedy algorithm, as Minoux (1978) suggested.

The sequential behavior of $\mathrm{IG}_m$ was observed to determine whether it was consistent with our theory. For example, the changes in $\mathrm{IG}_m$ at each step as the robot sequentially selected its actions for object 10 using the greedy algorithm are shown in Fig. 11. Theorem 3 shows that the IG is a submodular function; this predicts that $\mathrm{IG}_m$ decreases monotonically as new actions are executed during active perception. When the robot had obtained only visual information ("v only" in Fig. 11), all values of $\mathrm{IG}_m$ were still large. After $m_{ah}$ was executed on the basis of the greedy algorithm, $\mathrm{IG}_{m_{ah}}$ became zero; at the same time, $\mathrm{IG}_{m_{as}}$ and $\mathrm{IG}_{m_h}$ decreased. In the same way, all values of $\mathrm{IG}_m$ gradually and monotonically decreased.

Fig. 12 shows the time series of the posterior probability of the category for object 10 during sequential active perception. Using only visual information, the robot misclassified the target object as a plastic bottle containing bells (category 3). The action sequence in reverse order did not allow the robot to recognize the object as a steel can at its first step; instead, it changed the recognition state to an empty plastic bottle. After the second action, i.e., grasping ($m_h$), the robot recognized the object as a steel can. In contrast, the greedy algorithm could determine that the target object was in category 4, i.e., a steel can, with its first action.

The effect of the number of samples $K$ in the Monte Carlo approximation was also observed. Fig. 13 shows the relation between $K$ and the standard deviation of the estimated $\mathrm{IG}_m$ for the 15th object for each action after obtaining a visual image.
This figure shows that the estimation error gradually decreases as $K$ increases. Roughly speaking, $K \ge 1000$ seems to be required for an appropriate estimate of $\mathrm{IG}_m$ in our experimental setting. Evaluating $\mathrm{IG}_m$ required less than one second, which is far shorter than the time required for action execution by the robot. This means that our method can be used in a real-time manner.

Figure 11: $\mathrm{IG}_m$ at each step for object 10 when the greedy algorithm is used.

Figure 12: Time series of the posterior probability of the category for object 10 during sequential action selection based on (top) the greedy algorithm, i.e., $m_{ah} \to m_h \to m_{as}$, and (bottom) its reverse order, i.e., $m_{as} \to m_h \to m_{ah}$.

Figure 13: Standard deviation of the estimated information gain $\mathrm{IG}_m$ for the 15th object. For each $K$, 100 values of the estimated information gain $\mathrm{IG}_m$ were obtained, and their standard deviation is shown.

These empirical results show that the proposed active perception method allowed a robot to sequentially select appropriate actions to recognize an object in a real-world environment and in a real-time manner. The theoretical results were thus supported even in the real-world environment.

6. Experiment 2: Synthetic Data

In Experiment 1, the numbers of classes, actions, and modalities, as well as the size of the dataset, were limited. In addition, it was difficult to control the experimental settings so as to check some interesting theoretical properties of our proposed method.
Therefore, we performed a supplemental 23 T A N I G U C H I , T A K A N O , , & Y O S H I N O b b                                          2EMHFW,' &DWHJRU\ Figure 14: Categorization results for the posterior probability distributions for each object. experiment, Experiment 2, using synthetic data comprising 21 object types, 63 objects, and 20 actions, i.e., modalities. First, we checked the v alidity of our acti ve perception method when the number of types of actions increases. Second, we checked how the method worked when two classes were assigned to the same object. Although the MHDP can categorize an object into two or more categories in a probabilistic manner , each object was classified into a single category in the pre vious experiment. 6.1 Conditions A synthetic dataset was generated using the generati ve model that the MHDP assumes (see Fig. 3). W e prepared 21 virtual object classes, and three objects were generated from each object class, i.e., we obtained 63 objects in total. Among the object classes, 14 object classes are “pure, ” and se ven object classes are “mixed. ” For each pure object class, a multinomial distribution was drawn from the Dirichlet distribution corresponding to each modality . W e set the number of modalities M = 20 . The hyperparameters of the Dirichlet distributions of the modalities were set to α m 0 = 0 . 4( m − 1) for m > 1 . For m = 1 , we set α 1 0 = 10 . For each mixed object class, a multinomial distribution for each modality was prepared by mixing the distributions of the two pure object classes. Specifically , the multinomial distribution for the i -th mix ed object was obtained by av eraging those of the (2 i − 1) -th and the 2 i -th object classes. The observations for each modality of each object were drawn from the multinomial distrib utions corresponding to the object’ s class. The count of the BoFs for each modality was set to 20. 
Finally, 42 pure virtual objects and 21 mixed virtual objects were generated.

The experiment was performed in almost the same way as Experiment 1. First, multimodal categorization was performed on the 63 virtual objects, and 14 categories were successfully formed in an unsupervised manner. The posterior distributions over the object categories are shown in Fig. 14. Generally speaking, the mixed objects were categorized into two or more classes. After categorization, a virtual robot was asked to recognize all of the target objects using the proposed active perception method.

Figure 15: KL divergence from the final state at each step of each sequential action selection procedure (greedy, lazy greedy, and random).

6.2 Results

We compared the greedy, lazy greedy, and random algorithms for the sequential decision-making process of active perception. The random algorithm is a baseline that selects the next action randomly from the remaining actions that have not yet been taken; in other words, it corresponds to a robot that does not employ any active perception algorithm. The KL divergence from the final state, averaged over all target objects at each step, is shown in Fig. 15. Under each condition, the KL divergence gradually decreased and reached almost zero; however, the rate of decrease differed. The greedy and lazy greedy algorithms were clearly shown to be better solutions on average than the random algorithm. In contrast with Experiment 1, the best and worst cases could not practically be calculated because of the prohibitive computational cost. Interestingly, the lazy greedy algorithm achieved almost the same performance as the greedy algorithm, as the theory suggests, while its laziness reduced the computational cost in practice.
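The computational savings can be tallied directly. The sketch below assumes 19 candidate actions remaining after the initial observation (20 modalities with one observed first) and compares the total number of IG evaluations needed by the greedy strategy over a full action sequence with the number needed by brute-force subset enumeration.

```python
from math import comb

A = 19  # candidate actions after the initial observation (assumed setting)

# Greedy: A + (A - 1) + ... + 1 evaluations over the full sequence, O(A^2).
greedy_evals = A * (A + 1) // 2

# Brute force: every non-empty action subset must be scored, O(2^A).
brute_force_evals = sum(comb(A, L) for L in range(1, A + 1))

assert greedy_evals == 190            # quadratic in the number of actions
assert brute_force_evals == 2**A - 1  # exponential: 524,287 evaluations
```

The lazy greedy variant re-evaluates only a fraction of the greedy count in practice, which is the gap reported between the two methods.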
The number of times the robot evaluated IG_m to determine the action sequences for all executable counts of actions L = 1, 2, ..., M was summarized for each method. The lazy greedy algorithm required 71.7 evaluations per target object on average (SD = 5.2), whereas the greedy algorithm required 190. Theoretically, both the greedy and lazy greedy algorithms require O(M^2) evaluations; in practice, the number of re-evaluations needed by the lazy greedy algorithm is quite small. In contrast, the brute-force algorithm requires O(2^M) evaluations, i.e., far more evaluations of IG.

Next, we investigated the case in which two classes were assigned to the same object. The target dataset contained "mixed" objects. The results imply that our method works well even when two classes are assigned to the same object, because our theory is derived entirely from the probabilistic generative model, i.e., the MHDP. We show a typical result. Fig. 16 shows the time series of the posterior probability of the category for object 51, one of the mixed objects, during sequential active perception. The greedy and lazy greedy algorithms quickly and "correctly" categorized the target object into two categories. Our formulation represents the categorization result as a posterior distribution; therefore, this type of probabilistic case is treated naturally.

Figure 16: Time series of the posterior probability of the category for object 51 during sequential action selection based on (top) the greedy algorithm, (middle) the lazy greedy algorithm, and (bottom) the random selection procedure.

7. Conclusion

In this paper, we described an MHDP-based active perception method for robotic multimodal object category recognition. We formulated a new active perception method on the basis of the MHDP (Nakamura et al., 2011a). First, we proposed an action selection method based on the IG criterion and proved that IG is an optimal criterion for active perception from the viewpoint of reducing the expected KL divergence between the final and current recognition states. Second, we derived a Monte Carlo approximation method for evaluating IG efficiently, which makes the action selection method executable. Third, we proved that IG has a submodular property, which reduces the sequential active perception problem to a submodular maximization problem. Given these theoretical results, we proposed to use the lazy greedy algorithm for selecting a set of actions for active perception. It is important to note that all three theoretical contributions mentioned above were naturally derived from the characteristics of the MHDP; they are clearly a result of its theoretical soundness. In this sense, our theorems reveal an advantage of the MHDP that several other heuristic multimodal object categorization methods do not have.

To evaluate the proposed methods empirically, we conducted experiments using an upper-torso humanoid robot and a synthetic dataset. Our results showed that the method enables the robot to actively select actions and recognize target objects quickly and accurately. One of the most interesting points of this paper is that not only object categories but also an action selection policy for object recognition can be formed in an unsupervised manner.
From the viewpoint of cognitive developmental robotics, providing an unsupervised learning model that bridges the development of perceptual and action systems is meaningful, shedding new light on the computational understanding of cognitive development (Cangelosi and Schlesinger, 2015; Asada et al., 2009). It is believed that the coupling of action and perception is important for an embodied cognitive system (Pfeifer and Scheier, 2001). The advantage of this paper compared with related work is that our action selection method for multimodal category recognition has a clear theoretical basis and is tightly connected to the computational model for multimodal object categorization, i.e., the MHDP. This gives our active perception method a theoretical guarantee of its performance.

Our directions for future research are as follows. In addition to active perception, active "learning" for multimodal categorization is also an important research topic. It takes far longer for a robot to gather the multimodal information needed to form multimodal object categories from a massive number of daily objects than it does to recognize a new object. If a robot can notice that an object is obviously a sample of a learned category, the robot need not obtain knowledge about object categories from that object. In contrast, if a target object appears to be completely new to the robot, the robot should carefully interact with the object to obtain multimodal information from it. Such a scenario will be achieved by developing an active "learning" method for multimodal categorization, and it is likely that such a method can be obtained by extending our proposed active perception method.

In addition, the MHDP model treated in this paper assumes that an action for perception is related to only one modality, e.g., grasping corresponds only to m_h.
However, in reality, when we interact with an object through a specific action, e.g., grasping, shaking, or hitting, we obtain rich information related to various modalities. For example, when we shake a box to obtain auditory information, we also unwittingly obtain haptic information and information about its weight. The tight linkage between modality information and an action is a type of approximation taken in this research. Extending our model and the MHDP so that they can treat actions related to various modalities is also a task for our future work.

Appendix A. Proof of the Optimality of the Proposed Active Perception Strategy

In this appendix, we show that the proposed active perception strategy, which maximizes the expected KL divergence between the current state and the posterior distribution of z_j after a selected set of actions, minimizes the expected KL divergence between the next and final states.

\begin{align*}
& \operatorname*{argmin}_{A \in \mathcal{F}^{m^o_j}_L} \mathbb{E}_{X^{M \setminus m^o_j}_j \mid X^{m^o_j}_j}\!\left[ \mathrm{KL}\!\left( P(z_j \mid X^M_j),\; P(z_j \mid X^{A \cup m^o_j}_j) \right) \right] \\
&\quad = \operatorname*{argmin}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^{M \setminus m^o_j}_j} \sum_{z_j} \left[ P(X^{M \setminus m^o_j}_j \mid X^{m^o_j}_j)\, P(z_j \mid X^M_j) \log \frac{P(z_j \mid X^M_j)}{P(z_j \mid X^{m^o_j}_j, X^A_j)} \right] \tag{13}
\end{align*}

The numerator inside the log function does not depend on A; therefore, the term related to the numerator can be dropped. In addition, by negating the remaining term, we obtain

\begin{align*}
(13) &= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^{M \setminus m^o_j}_j} \sum_{z_j} \left[ P(X^{M \setminus m^o_j}_j \mid X^{m^o_j}_j)\, P(z_j \mid X^M_j) \log P(z_j \mid X^{m^o_j}_j, X^A_j) \right] \\
&= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^{M \setminus m^o_j}_j} \sum_{z_j} \left[ P(z_j, X^{M \setminus m^o_j}_j \mid X^{m^o_j}_j) \log P(z_j \mid X^{m^o_j}_j, X^A_j) \right]. \tag{14}
\end{align*}

By marginalizing X^{M \setminus (m^o_j \cup A)}_j out of (14), we obtain

\begin{align*}
(14) &= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^A_j} \sum_{z_j} P(z_j, X^A_j \mid X^{m^o_j}_j) \log P(z_j \mid X^{m^o_j}_j, X^A_j) \\
&= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \Bigg[ \sum_{X^A_j} \sum_{z_j} P(z_j, X^A_j \mid X^{m^o_j}_j) \log P(z_j \mid X^{m^o_j}_j, X^A_j) \underbrace{- \sum_{z_j} P(z_j \mid X^{m^o_j}_j) \log P(z_j \mid X^{m^o_j}_j)}_{\text{constant w.r.t. } A} \Bigg] \\
&= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^A_j} \sum_{z_j} \Big[ P(X^A_j \mid X^{m^o_j}_j)\, P(z_j \mid X^{m^o_j}_j, X^A_j) \log P(z_j \mid X^{m^o_j}_j, X^A_j) \\
&\qquad\qquad\qquad\qquad - P(X^A_j \mid X^{m^o_j}_j)\, P(z_j \mid X^{m^o_j}_j, X^A_j) \log P(z_j \mid X^{m^o_j}_j) \Big] \\
&= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \sum_{X^A_j} P(X^A_j \mid X^{m^o_j}_j)\, \mathrm{KL}\!\left( P(z_j \mid X^{m^o_j}_j, X^A_j),\; P(z_j \mid X^{m^o_j}_j) \right) \\
&= \operatorname*{argmax}_{A \in \mathcal{F}^{m^o_j}_L} \mathbb{E}_{X^A_j \mid X^{m^o_j}_j}\!\left[ \mathrm{KL}\!\left( P(z_j \mid X^{A \cup m^o_j}_j),\; P(z_j \mid X^{m^o_j}_j) \right) \right].
\end{align*}

Here, the subtracted entropy term is constant with respect to A, so introducing it does not change the argmax; the second line of the expansion then follows because marginalizing X^A_j out of P(X^A_j \mid X^{m^o_j}_j)\, P(z_j \mid X^{m^o_j}_j, X^A_j) yields P(z_j \mid X^{m^o_j}_j).

Acknowledgment

The authors would like to thank undergraduate student Takuya Takeshita and graduate student Hajime Fukuda of Ritsumeikan University, who helped us develop the experimental instruments for obtaining our preliminary results. This research was partially supported by a Grant-in-Aid for Young Scientists (B) 2012-2014 (24700233) funded by the Ministry of Education, Culture, Sports, Science, and Technology, Japan; the Tateishi Science and Technology Foundation; and JST CREST.

References

Yoshiki Ando, Tomoaki Nakamura, Takaya Araki, and Takayuki Nagai. Formation of hierarchical object concept using hierarchical latent Dirichlet allocation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2272–2279, 2013.

Takaya Araki, Tomoaki Nakamura, Takayuki Nagai, Shogo Nagasaka, Tadahiro Taniguchi, and Naoto Iwahashi. Online learning of concepts and words using multimodal LDA and hierarchical
Pitman-Yor language model. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1623–1630, 2012.

Minoru Asada, Koh Hosoda, Yasuo Kuniyoshi, Hiroshi Ishiguro, Toshio Inui, Yuichiro Yoshikawa, Masaki Ogino, and Chisato Yoshida. Cognitive developmental robotics: A survey. IEEE Transactions on Autonomous Mental Development, 1(1):12–34, 2009.

Lawrence W. Barsalou. Perceptual symbol systems. Behavioral and Brain Sciences, 22(04):1–16, 1999. ISSN 0140-525X. doi: 10.1017/S0140525X99002149.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

H. Borotschnig, L. Paletta, M. Prantl, and A. Pinz. Appearance-based active object recognition. Image and Vision Computing, 18:715–727, 2000.

Wolfram Burgard, Dieter Fox, and Sebastian Thrun. Active mobile robot localization. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1346–1352, 1997.

Angelo Cangelosi and Matthew Schlesinger. Developmental Robotics. The MIT Press, 2015.

Hande Celikkanat, Guner Orhan, Nicolas Pugeault, Frank Guerin, Erol Sahin, and Sinan Kalkan. Learning and using context on a humanoid robot using latent Dirichlet allocation. In Joint IEEE International Conferences on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 201–207, 2014.

David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

Joachim Denzler and Christopher M. Brown. Information theoretic sensor data selection for active object recognition and state estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):1–13, 2002.

Sumantra Dutta Roy, Santanu Chaudhury, and Subhashis Banerjee. Active recognition through next view planning: a survey. Pattern Recognition, 37(3):429–446, 2004.

R. Eidenberger and J. Scharinger.
Active perception and scene modeling by planning with probabilistic 6D object poses. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1036–1043, 2010.

Jeremy A. Fishel and Gerald E. Loeb. Bayesian exploration for intelligent identification of textures. Frontiers in Neurorobotics, 6:1–20, 2012. ISSN 1662-5218.

Manabu Gouko, Yuichi Kobayashi, and Chyon Hae Kim. Online exploratory behavior acquisition of mobile robot based on reinforcement learning. In Recent Trends in Applied Artificial Intelligence, pages 272–281, 2013.

Shane Griffith, Jivko Sinapov, Vladimir Sukhoy, and Alexander Stoytchev. A behavior-grounded approach to forming object categories: Separating containers from noncontainers. IEEE Transactions on Autonomous Mental Development, 4(1):54–69, 2012.

Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1):335–346, 1990.

Virgile Hogman, Mats Bjorkman, and Danica Kragic. Interactive object classification using sensorimotor contingencies. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2799–2805, 2013.

Serena Ivaldi, Sao Mai Nguyen, Natalia Lyubova, Alain Droniou, Vincent Padois, David Filliat, Pierre-Yves Oudeyer, and Olivier Sigaud. Object learning through active exploration. IEEE Transactions on Autonomous Mental Development, 6(1):56–72, 2014.

Naoto Iwahashi, Komei Sugiura, Ryo Taguchi, Takayuki Nagai, and Tadahiro Taniguchi. Robots that learn to communicate: A developmental approach to personally and physically situated human-robot conversations. In Dialog with Robots: Papers from the AAAI Fall Symposium, pages 38–43, 2010.

Shihao Ji and Lawrence Carin. Cost-sensitive feature acquisition and classification. Pattern Recognition, 40(5):1474–1485, 2006.

Michael Krainin, Brian Curless, and Dieter Fox.
Autonomous generation of complete 3D object models using next best view manipulation planning. In IEEE International Conference on Robotics and Automation, pages 5031–5037, 2011.

Andreas Krause and Carlos E. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 2005.

David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234–243. Springer, 1978.

Ion Muslea, Steven Minton, and Craig A. Knoblock. Active learning with multiple views. Journal of Artificial Intelligence Research, 27(1):203–233, 2006.

Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Multimodal object categorization by a robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2415–2420, 2007.

Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Grounding of word meanings in multimodal concepts using LDA. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3943–3948, 2009.

Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Multimodal categorization by hierarchical Dirichlet process. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1520–1525, 2011a.

Tomoaki Nakamura, Takayuki Nagai, and Naoto Iwahashi. Bag of multimodal LDA models for concept formation. In IEEE International Conference on Robotics and Automation, pages 6233–6238, 2011b.
Tomoaki Nakamura, Takayuki Nagai, Kotaro Funakoshi, Shogo Nagasaka, Tadahiro Taniguchi, and Naoto Iwahashi. Mutual learning of an object concept and language model based on MLDA and NPYLM. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'14), pages 600–607, 2014.

Lorenzo Natale, Giorgio Metta, and Giulio Sandini. Learning haptic representation of objects. In International Conference of Intelligent Manipulation and Grasping, 2004.

George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265–294, 1978.

Leo Pape, Calogero M. Oddo, Marco Controzzi, Christian Cipriani, Alexander Förster, Maria C. Carrozza, and Jürgen Schmidhuber. Learning tactile skills through curious exploration. Frontiers in Neurorobotics, 6:1–16, 2012.

Rolf Pfeifer and Christian Scheier. Understanding Intelligence. A Bradford Book, 2001. ISBN 9780262661256.

Antons Rebguns, Daniel Ford, and Ian Fasel. InfoMax control for acoustic exploration of objects by a mobile robot. In AAAI-11 Workshop on Lifelong Learning, pages 22–28, 2011. ISBN 9781577355311.

Deb K. Roy and Alex P. Pentland. Learning words from sights and sounds: a computational model. Cognitive Science, 26(1):113–146, 2002.

Nicholas Roy and Sebastian Thrun. Coastal navigation with mobile robots. In Advances in Neural Information Processing Systems 12, 1999.

Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling, 2015. arXiv:1403.5341v2.

Ryo Saegusa, Lorenzo Natale, Giorgio Metta, and Giulio Sandini. Cognitive robotics: active perception of the self and others. In the 4th International Conference on Human System Interactions (HSI), pages 419–426, 2011.

Alexander Schneider, Jurgen Sturm, Cyrill Stachniss, Marco Reisert, Hans Burkhardt, and Wolfram Burgard. Object identification with tactile sensors using bag-of-features.
In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 243–248, 2009.

Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.

Jivko Sinapov and Alexander Stoytchev. Object category recognition by a humanoid robot using behavior-grounded relational learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 184–190, 2011.

Jivko Sinapov, Connor Schenck, Kerrick Staley, Vladimir Sukhoy, and Alexander Stoytchev. Grounding semantic categories in behavioral interactions: Experiments with 100 objects. Robotics and Autonomous Systems, 62(5):632–645, 2014.

C. Stachniss, G. Grisetti, and W. Burgard. Information gain-based exploration using Rao-Blackwellized particle filters. In Robotics: Science and Systems (RSS), 2005.

E. B. Sudderth, A. Torralba, W. Freeman, and A. S. Willsky. Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems, volume 18, pages 1297–1304, 2005.

Oleg O. Sushkov and Claude Sammut. Active robot learning of object properties. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2621–2628. IEEE, 2012.

Tadahiro Taniguchi, Takayuki Nagai, Tomoaki Nakamura, Naoto Iwahashi, Tetsuya Ogata, and Hideki Asoh. Symbol emergence in robotics: A survey, 2015. arXiv:1509.08973.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

Elio Tuci, Gianluca Massera, and Stefano Nolfi. Active categorical perception of object shapes in a simulated anthropomorphic robotic arm. IEEE Transactions on Evolutionary Computation, 14(6):885–899, 2010.

Herke van Hoof, Oliver Kroemer, Heni Ben Amor, and Jan Peters.
Maximally informative interaction learning for scene exploration. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5152–5158, 2012.

Javier Velez, Garrett Hemann, Albert S. Huang, Ingmar Posner, and Nicholas Roy. Modelling observation correlations for active exploration and robust object detection. Journal of Artificial Intelligence Research, 44:423–453, 2012.
