Attention Guidance through Video Script: A Case Study of Object Focusing on 360° VR Video Tours

Aention Guidance through Video Script: A Case Study of Obje ct Focusing on 360 º VR Video T ours Paulo Vitor Santana da Silva AKCIT Federal University of Goiás Goiânia, Goiás, Brazil paulosantana@discente.ufg.br Arthur Ricardo Sousa Vitória AKCIT Federal University of Goiás Goiânia, Goiás, Brazil arthurvitoria@discente.ufg.br Diogo Fernandes Costa Silva AKCIT Federal University of Goiás Goiânia, Goiás, Brazil diogo_fernandes@egresso.ufg.br Arlindo Rodrigues Galvão Filho AKCIT Federal University of Goiás Goiânia, Goiás, Brazil arlindogalvao@ufg.br Figure 1: Guiding users’ attention on 360 º VR vide o tours through textual scripts using deep learning and computational vision Abstract Within the expansive domain of virtual reality (VR), 360 º VR videos immerse viewers in a spherical environment, allo wing them to ex- plore and interact with the virtual world from all angles. While this video representation oers unparalleled levels of immersion, it often lacks eective metho ds to guide viewers’ attention toward specic elements within the virtual environment. This pap er com- bines the models Grounding Dino and Segment Anything (SAM) to guide attention by object focusing based on video scripts. As a case study , this work conducts the experiments on a 360 º video tour on Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distribute d for prot or commercial advantage and that copies bear this notice and the full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honor e d. Abstracting with credit is permitted. T o copy otherwise, or republish, to post on servers or to redistribute to lists, r equires prior specic permission and /or a fee. Request permissions from permissions@acm.org. SVR 2024, September 30-October 3, 2024, Manaus, Brazil © 2024 Copyright held by the owner/author( s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-0979-1/24/09 https://doi.org/10.1145/3691573.3691618 the University of Reading. The experiment results show that video scripts can improve the user experience in 360 º VR Videos T our by helping in the task of directing the user’s attention. Ke y words 360 º Videos, Attention Guidance, Deep Learning A CM Reference Format: Paulo Vitor Santana da Silva, Arthur Ricardo Sousa Vitória, Diogo Fernandes Costa Silva, and Arlindo Rodrigues Galvão Filho. 2024. Attention Guidance through Video Script: A Case Study of Object Focusing on 360 º VR Video T ours. In Symposium on Virtual and Augmented Reality (SVR 2024), September 30-October 3, 2024, Manaus, Brazil . ACM, New Y ork, NY, USA, 5 pages. https://doi.org/10.1145/3691573.3691618 1 Introduction Attention guidance involves the interpretation of pertinent informa- tion and directing focus towar ds specic aspects of the visual scene. Due to its immersiv e viewing experience and omnidirectional vie w , 360 º videos have gained popularity in various areas [ 13 , 14 ] such SVR 2024, September 30-October 3, 2024, Manaus, Brazil Paulo Vitor S. Silva, Arthur Ricardo S. Vitória, Diogo F. Costa Silva, and Arlindo R. Galvão Filho Figure 2: Dierent moments of the scene showing the museum on the video tour . It is dened in the video script as “Look at the sculpture of a person on the right side” at moment (1) and “Look at the sculpture of a centaur on the left side” at moment (2). (a) The original frame of the video. 
To control the narrative, several techniques can be used to captivate the audience's focus on particular elements within a frame [15]. Depth-of-field control is a widely used technique in which the focal length of the camera is adjusted to highlight important elements while blurring the rest of the scene [2]. A similar approach can be applied to virtual reality environments [5], in which visual blur effects improve the experience of participants.

Several works address how to effectively guide users' attention through visual features [19, 21]. Wallgrün et al. [23] assess the efficacy of three distinct visual guiding mechanisms (arrow, butterfly, and radar) in educational VR tour applications of real-world sites through a user study. All mechanisms showed an improvement over the no-guidance condition, with the arrow being the most preferred among the 33 participants. Danieau et al. [2] provide an overview of visual resources to guide users through narrative 360° videos, comparing fade to black, desaturation, blur, and deformation; the intensity of each effect is gradually increased to direct the user's attention toward the area of interest. The effectiveness of the fade-to-black and desaturation effects was assessed through a user study, showing promising results yet highlighting the difficulty of making a user unconsciously move their head. Woodworth et al. [24] explore nine distinct visual features to guide or restore users' attention. Their work encompasses a guidance task, in which subjects gaze at different regions of interest in a randomized order, and a restoration task, in which gaze sequences are interrupted by distraction events and focus must be regained, providing an extensive direct comparison of attention guidance and restoration cues.

Hillaire et al. [6] proposed two techniques to enhance user experience during first-person navigation in a virtual environment: simulating realistic camera motion similar to human eye movement during walking, and implementing a depth-of-field blur effect to mimic human perception, in which sharp objects are perceived only within a certain range of distances around the focal point. The results showed that participants globally preferred these effects when they were dynamically adapted to the focus point in the virtual environment.

With the rapid democratization of immersive technologies, numerous design challenges and considerations must be taken into account to improve how users experience virtual reality. In this context, this work proposes an approach that, through a video script and deep learning, automatically recognizes important elements in a scene and then guides the user's attention to them through a vignette effect, using VR tours as a case study.
This work is organized as follows. Section 2 describes the materials and methods, detailing the techniques used to generate the regions of interest and their segmentation, followed by how the vignette visual feature is applied to guide the user's attention. Section 3 presents the experimental results obtained on a 360° video tour of the University of Reading. Finally, Section 4 discusses the results and their implications for how these mechanisms can enhance attention guidance in virtual reality environments.

2 Material and Methods
The 360° video used consists of a video tour through the campus of the University of Reading, featuring both indoor and outdoor environments, as shown in Figure 3. During the video, the user is moved to different locations and can freely look around the various university environments. Moreover, in some frames there may be more than one region of interest to which the user's attention should be directed.

Figure 3: Different frames from the 360° video showcasing distinct environments: (a) an external area, (b) a biology laboratory, (c) an external building, and (d) the gym.

Given a video script containing the target object descriptions associated with time intervals of the video, every n frames the object description and the frame are sent as input to Grounding DINO, which detects the object in the scene and returns its bounding-box coordinates, as shown in Figure 4-b). These coordinates are then sent, jointly with the input frame, to SAM, so that the segmentation mask of the target object can be computed, as shown in Figure 4-c). With the bounding-box coordinates and the mask of the region of interest, a vignette effect is applied to the scene, indicating where the user must pay attention at that moment, as shown in Figure 4-d).

Figure 4: General workflow for a single frame in a 360° VR tour. a) A given input video description, along with a selected 360° input frame t, is passed as input to Grounding DINO. b) Grounding DINO selects the area (bounding box) with the highest confidence. c) The output bounding box and the image are then used as input to SAM for object segmentation, which outputs a segmentation mask. d) The segmentation mask and bounding box are used to create a vignette effect that indicates where the user must pay attention.
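To make this workflow concrete, the following is a minimal, illustrative sketch of the per-frame loop; it is not the authors' implementation. The callables detect(frame, text) and segment(frame, box) are hypothetical wrappers around Grounding DINO and SAM (possible realizations are sketched in Sections 2.1 and 2.2), and the soft-darkening vignette is only a rough stand-in for the effect shown in Figures 2 and 5.

```python
# Illustrative per-frame pipeline (not the authors' code).
# Assumptions: `detect(frame, text)` returns the highest-confidence box
# (x0, y0, x1, y1) from Grounding DINO; `segment(frame, box)` returns an
# HxW boolean mask from SAM.
import cv2
import numpy as np


def find_script_entry(script, t):
    """Return the target-object description active at time t (seconds), if any."""
    for start, end, description in script:
        if start <= t < end:
            return description
    return None


def apply_vignette(frame, mask, strength=0.6, feather=51):
    """Darken the scene outside the target mask with a soft falloff."""
    soft = cv2.GaussianBlur(mask.astype(np.float32), (feather, feather), 0)[:, :, None]
    dark = frame.astype(np.float32) * (1.0 - strength)
    return (soft * frame + (1.0 - soft) * dark).astype(np.uint8)


def process_video(path, script, detect, segment, every_n=30):
    """Run detection/segmentation every `every_n` frames and apply the vignette."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames, idx, mask = [], 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            description = find_script_entry(script, idx / fps)
            # Reusing the last mask between detection frames is an assumption here.
            mask = segment(frame, detect(frame, description)) if description else None
        if mask is not None:
            frame = apply_vignette(frame, mask)
        frames.append(frame)
        idx += 1
    cap.release()
    return frames


# Example script entries in the spirit of Figure 2 (timestamps are illustrative).
script = [
    (30.0, 45.0, "the sculpture of a person on the right side"),
    (50.0, 65.0, "the sculpture of a centaur on the left side"),
]
```

The example script mirrors the entries quoted in Figure 2; the timestamps are illustrative, since the paper does not list the exact intervals.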
2.1 Object Detection
Grounding DINO [12] is an object detection model based on the Transformer-based detector DINO. Its architecture is composed of three main components: a feature enhancer, a language-guided query selection module, and a cross-modality decoder, as shown in Figure 6. For every image-text pair, features are extracted using an image backbone and a text backbone, respectively. These two sets of features are then fed into the feature enhancer module, which fuses them across modalities. The feature enhancer comprises several layers: deformable self-attention is employed to enrich the image features, while standard self-attention is used to enhance the text features. Once the cross-modality text and image features are obtained, the language-guided query selection module picks cross-modality queries from the image features. These queries are then passed to the cross-modality decoder. In each decoder layer, every cross-modality query goes through a self-attention layer, an image cross-attention layer to combine image features, a text cross-attention layer to combine text features, and a fully connected layer. The queries produced by the final decoder layer are used to predict object boxes and derive the corresponding phrases.

The model's pre-training strategy involves three key data types: detection data, grounding data, and caption data. Inspired by GLIP [10], the object detection task is reframed as a phrase grounding task by integrating category names into text prompts sourced from datasets such as COCO [11], Objects365 [20], and OpenImages (OI), with category names dynamically sampled during training to vary the text input. For grounding data, the GoldG and RefC datasets pre-processed by MDETR [7] are utilized, encompassing images from Flickr30k Entities [16], Visual Genome [9], RefCOCO, RefCOCO+, and RefCOCOg, enabling direct training of Grounding DINO. To enhance model performance, semantically enriched caption data is incorporated through pseudo-labeling, in which a proficiently trained model generates these labels, aiding the understanding of novel categories.

Figure 6: Grounding DINO architecture [12].
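A sketch of how the detection step could be invoked is shown below. It assumes the inference helpers shipped with the official GroundingDINO repository (load_model, load_image, predict), whose exact paths and signatures may vary across versions; the config and checkpoint paths are placeholders, and the thresholds are those reported in Section 3.

```python
# Illustrative detection step (not the authors' code). Assumes the inference
# helpers from the official GroundingDINO repository; paths are placeholders.
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("frame_t.jpg")  # one frame of the 360° video

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="the sculpture of a centaur on the left side",  # script entry
    box_threshold=0.30,   # thresholds as reported in Section 3
    text_threshold=0.25,
)

# Keep only the highest-confidence detection, as done in the paper, and convert
# the normalized cxcywh box to pixel-space xyxy coordinates.
best = int(logits.argmax())
h, w = image_source.shape[:2]
scaled = boxes[best].unsqueeze(0) * torch.tensor([w, h, w, h])
box_xyxy = box_convert(scaled, in_fmt="cxcywh", out_fmt="xyxy")[0].numpy()
```

The resulting box_xyxy is the pixel-space bounding box that is passed, together with the frame, to SAM in the next step.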
2.2 Object Segmentation
The Segment Anything Model (SAM) [8] is a versatile segmentation model designed for open-world applications, capable of isolating any object within an image given appropriate prompts such as points, bounding boxes, or text. Trained on an extensive dataset comprising over 11 million images and 1.1 billion masks, SAM exhibits robust performance even in zero-shot scenarios. However, while its capabilities are significant, the model relies on point or box prompts for accurate object identification, as arbitrary text inputs may not be enough for effective segmentation. Its architecture is composed of three components: an image encoder, a prompt encoder, and a mask decoder, as shown in Figure 7.

The core of the image encoder is a masked autoencoder, which leverages a vision transformer for scalability. The authors use a ViT-H/16, a large-scale vision transformer designed to handle a 16 × 16 patch size, featuring 14 × 14 windowed attention and four equally spaced global attention blocks. The output of this encoder is a feature embedding downscaled to 16× smaller than the original image; this downsizing is important for efficient processing while preserving essential image characteristics. The model operates on input images with a resolution of 1024 × 1024 × 3 and transforms them into a dense embedding of size 64 × 64 × 256.

The prompt encoder incorporates two types of prompts: sparse (points, boxes, and text) and dense (masks). Sparse prompts guide the model through representations of points and boxes via learned embeddings, while text prompts are handled with CLIP (Contrastive Language-Image Pre-Training) [18] without modifications. Dense prompts, represented by masks, maintain spatial correspondence with the images. These masks are downscaled by a factor of 4 before input, followed by additional downscaling within the model; GELU activation and layer normalization are applied in each layer, and the resulting mask embedding is added to the image embedding. When no mask prompt is provided, a learned "no mask" embedding is applied to each image embedding location.

The mask decoder is influenced by transformer segmentation models and is modified to align with the transformer decoder framework. The adaptation involves incorporating a learned output token embedding into the prompt embeddings prior to decoder processing; this token carries information useful for segmentation, similar to the function of class tokens in vision transformers for image classification. Within each decoder layer, four primary operations occur: self-attention on the tokens, cross-attention from tokens to the image embedding, an update of each token via a point-wise multi-layer perceptron, and cross-attention from the image embedding to the tokens. This last step integrates the prompt information into the image embedding. During cross-attention, the image embedding is treated as a set of 256-dimensional vectors, enhancing the model's segmentation capabilities.

Figure 7: Segment Anything architecture [8].
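Correspondingly, a minimal sketch of the box-prompted SAM call is given below, assuming the public segment-anything package and the ViT-H checkpoint used in Section 3; the checkpoint filename is a placeholder.

```python
# Illustrative segmentation step (not the authors' code). Assumes the public
# segment-anything package; the checkpoint filename is a placeholder.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # ViT-H, zero-shot
predictor = SamPredictor(sam)


def segment(frame_rgb: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Return an HxW boolean mask for the object inside the detected box."""
    predictor.set_image(frame_rgb)          # frame as an RGB uint8 array
    masks, scores, _ = predictor.predict(
        box=box_xyxy,                       # pixel-space (x0, y0, x1, y1)
        multimask_output=False,             # single best mask for the box prompt
    )
    return masks[0]
```

This function matches the segment(frame, box) callable assumed in the pipeline sketch of Section 2.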
3 Results
Object detection with Grounding DINO was performed zero-shot with a box threshold of 0.3 and a text threshold of 0.25, as recommended by the authors. Although the model may return several bounding boxes, only the one with the highest confidence is considered. For object segmentation, the ViT-H checkpoint of SAM was used, also following a zero-shot approach.

Figure 5: Different moments of the scene showing the cafe-lounge on the video tour. The video script defines "Look at the cafe lounge" at moment (1) and "Look at the cars between the trees" at moment (2). (a) The original frame of the video. (b) The object described in the script, detected and segmented. (c) The target object with the vignette effect applied.

Figures 2 and 5 show two moments (1 and 2) from scenes at the University of Reading, presenting the museum and the cafe-lounge, respectively. According to the video script, at moment (1) in Figure 5, attention should be on "the cafe-lounge," and at moment (2), on "the cars between the trees." Similarly, for Figure 2, attention should be on "the sculpture of a person on the right side" at moment (1), and on "the sculpture of a centaur on the left side" at moment (2). Object detection and segmentation were successfully performed for both scenes, as shown in Figures 5 (1-b), 5 (2-b), and 2 (1-b). The vignette effect effectively directed attention to the respective target elements, as seen in Figures 5 (1-c), 5 (2-c), and 2 (1-c). However, in Figure 2 (2-b), SAM failed to correctly segment the sculpture of the centaur, resulting in the vignette effect partially obscuring the target element. This error stems from the zero-shot use of SAM during inference; fine-tuning the model on a dataset containing images related to the elements presented in this case study would improve the effectiveness of the segmentation phase.

4 Conclusion
To direct users' attention to specific regions of interest outlined by an input script during a 360° video tour, this study proposes a method that combines Grounding DINO for object detection and SAM for object segmentation of scenes. Additionally, a vignette effect is applied to direct users to where their focus is required according to the video script. The experimental results demonstrate that the integration of deep learning methods for interpreting video scripts, coupled with computer vision techniques, yields significant enhancements in user experience within 360° VR video tours. Future work will focus on improving the object segmentation step by fine-tuning SAM on a dataset containing images related to the elements presented in this case study. In addition, a user study based on interviews will be conducted to assess the effectiveness of the proposed method in directing users' attention.

Acknowledgments
This work has been fully/partially funded by the project Research and Development of Algorithms for Construction of Digital Human Technological Components, supported by the Advanced Knowledge Center in Immersive Technologies (AKCIT), with financial resources from the PPI IoT/Manufatura 4.0 / PPI HardwareBR of the MCTI, grant number 057/2023, signed with EMBRAPII.

References
[1] Haram Choi and Sanghun Nam. 2022. A Study on Attention Attracting Elements of 360-Degree Videos Based on VR Eye-Tracking System. Multimodal Technologies and Interaction 6, 7 (2022). https://doi.org/10.3390/mti6070054
[2] Fabien Danieau, Antoine Guillo, and Renaud Doré. 2017. Attention guidance for immersive video content in head-mounted displays. In 2017 IEEE Virtual Reality (VR). IEEE, 205–206.
[3] Esther Guervós, Jaime Jesús Ruiz Alonso, Pablo Pérez García, Juan Alberto Muñoz, César Díaz Martín, and Narciso García Santos. 2019. Using 360 VR video to improve the learning experience in veterinary medicine university degree. (2019).
[4] Romain Christian Herault, Alisa Lincke, Marcelo Milrad, Elin-Sofie Forsgärde, and Carina Elmqvist. 2018. Using 360-degrees interactive videos in patient trauma treatment education: design, development and evaluation aspects. Smart Learning Environments 5, 1 (2018), 26.
[5] Sébastien Hillaire, Anatole Lécuyer, Rémi Cozot, and Géry Casiez. 2007. Depth-of-field blur effects for first-person navigation in virtual environments. In Proceedings of the 2007 ACM Symposium on Virtual Reality Software and Technology. 203–206.
[6] Sébastien Hillaire, Anatole Lécuyer, Rémi Cozot, and Géry Casiez. 2008. Using an Eye-Tracking System to Improve Camera Motions and Depth-of-Field Blur Effects in Virtual Environments. In 2008 IEEE Virtual Reality Conference. 47–50. https://doi.org/10.1109/VR.2008.4480749
[7] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR: Modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1780–1790.
[8] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. 2023. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4015–4026.
[9] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (2017), 32–73.
[10] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded Language-Image Pre-training. arXiv:2112.03857 [cs.CV]
[11] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs.CV]
[12] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. 2023. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023).
[13] Xing Liu, Qingyang Xiao, Vijay Gopalakrishnan, Bo Han, Feng Qian, and Matteo Varvello. 2017. 360 innovations for panoramic video streaming. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks. 50–56.
[14] Andrew MacQuarrie and Anthony Steed. 2017. Cinematic virtual reality: Evaluating the effect of display type on the viewing experience for panoramic video. In 2017 IEEE Virtual Reality (VR). IEEE, 45–54.
[15] Carlos Marañes, Diego Gutierrez, and Ana Serrano. 2020. Exploring the impact of 360 movie cuts in users' attention. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 73–82.
[16] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision. 2641–2649.
[17] Yeshwanth Pulijala, Minhua Ma, and Ashraf Ayoub. 2017. VR surgery: Interactive virtual reality application for training oral and maxillofacial surgeons using Oculus Rift and Leap Motion. Serious Games and Edutainment Applications: Volume II (2017), 187–202.
[18] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[19] Anastasia Schmitz, Andrew MacQuarrie, Simon Julier, Nicola Binetti, and Anthony Steed. 2020. Directing versus attracting attention: Exploring the effectiveness of central and peripheral cues in panoramic videos. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 63–72.
[20] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. 2019. Objects365: A Large-Scale, High-Quality Dataset for Object Detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 8429–8438. https://doi.org/10.1109/ICCV.2019.00852
[21] Marco Speicher, Christoph Rosenberg, Donald Degraen, Florian Daiber, and Antonio Krüger. 2019. Exploring visual guidance in 360-degree videos. In Proceedings of the 2019 ACM International Conference on Interactive Experiences for TV and Online Video. 1–12.
[22] Arthur van Hoff. 2017. Virtual reality and the future of immersive entertainment. In Proceedings of the 2017 ACM International Conference on Interactive Experiences for TV and Online Video. 129–129.
[23] Jan Oliver Wallgrün, Mahda M. Bagher, Pejman Sajjadi, and Alexander Klippel. 2020. A comparison of visual attention guiding approaches for 360 image-based VR tours. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 83–91.
[24] Jason W. Woodworth, Andrew Yoshimura, Nicholas G. Lipari, and Christoph W. Borst. 2023. Design and Evaluation of Visual Cues for Restoring and Guiding Visual Attention in Eye-Tracked VR. In 2023 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW). IEEE, 442–450.
