IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XXX, NO. XXX, XXXXX 2010
A stochastic model of human visual
attention with a dynamic Bayesian network
Akisato Kimura, Senior Member, IEEE, Derek Pang, Student Member, IEEE, Tatsuto
Takeuchi, Kouji Miyazato, Kunio Kashino, Senior Member, IEEE, and Junji Yamato, Senior
Member, IEEE.
Abstract
Recent studies in the field of human vision science suggest that human responses to the stimuli on a visual display
are non-deterministic. People may attend to different locations on the same visual input at the same time. Based on this
knowledge, we propose a new stochastic model of visual attention by introducing a dynamic Bayesian network to predict
the likelihood of where humans typically focus on a video scene. The proposed model is composed of a dynamic Bayesian
network with four layers. Our model provides a framework that simulates and combines the visual saliency response and
the cognitive state of a person to estimate the most probable attended regions. Sample-based inference with a Markov
chain Monte-Carlo based particle filter and stream processing with multi-core processors enable us to estimate human
visual attention in near real time. Experimental results demonstrate that our model performs significantly better in
predicting human visual attention than previous deterministic models.
Index Terms
Human visual attention, saliency, dynamic Bayesian network, state space model, hidden Markov model, Markov chain
Monte-Carlo, particle filter, stream processing.
• The authors are with NTT Communication Science Laboratories, NTT Corporation, 3-1 Morinosato Wakamiya, Atsugi, Kanagawa, 243-0198
Japan. E-mail: akisato@ieee.org
• D. Pang is with the Department of Electrical Engineering, Stanford University, Packard 240, 350 Serra Mall, Stanford, CA 94305, USA. He
contributed to this work during his internship at NTT Communication Science Laboratories.
• K. Miyazato was with the Department of Information and Communication Systems Engineering, Okinawa National College of Technology, 905
Henoko, Nago, Okinawa, 905-2192 Japan. He contributed to this work during his internship at NTT Communication Science Laboratories.
• Parts of the material in this paper have been presented at the IEEE International Conference on Multimedia and Expo (ICME2008), Hannover,
Germany, June 2008, and the IEEE International Conference on Multimedia and Expo (ICME2009), Cancun, Mexico, June-July 2009.
• Manuscript received March 31, 2010.
October 22, 2018
DRAFT
arXiv:1004.0085v1 [cs.CV] 1 Apr 2010
Fig. 1. An example of a saliency map using the Koch-Ullman model.
1 INTRODUCTION
Developing sophisticated object detection and recognition algorithms has been a long-standing challenge
in computer and robot vision research. Such algorithms are required in most applications of computational
vision, including robotics [1], medical imaging [2], intelligent cars [3], surveillance [4], image
segmentation [5], [6] and content-based image retrieval [7]. One of the major challenges in designing
generic object detection and recognition systems is to construct methods that are fast and capable of
operating on standard computer platforms without any prior knowledge. To that end, a pre-selection
mechanism is essential so that subsequent processing can focus only on relevant data. One
promising approach to achieving this mechanism is visual attention: it selects regions in a visual scene that
are most likely to contain objects of interest. The field of visual attention is currently the focus of much
research for both biological and artificial systems.
Attention is generally controlled by one or a combination of two mechanisms: 1) a top-down
control that voluntarily chooses the focus of attention in a cognitive and task-dependent manner, and 2)
a bottom-up control that reflexively directs the visual focus based on the observed saliency attributes.
The first biologically plausible model for explaining the human attention system was proposed by Koch
and Ullman [8], and it follows the latter approach. The basic concept underlying this model is the feature
integration theory developed by Treisman and Gelade [9], which has been one of the most influential
theories of human visual attention. According to the feature integration theory, in the first step of visual
processing, several primary visual features are processed and represented with separate feature maps.
These maps are later integrated into a saliency map that can be accessed in order to direct attention to the most
conspicuous areas. In the example shown in Fig. 1, the red car on the right of the frame is
salient, and people therefore direct their attention to this area. The Koch-Ullman model has
attracted the attention of many researchers, especially after the development of an implementation model
…(Full text truncated)…