Scene2Hap: Generating Scene-Wide Haptics for VR from Scene Context with Multimodal LLMs

Scene2Hap: Generating Scene-Wide Haptics for VR from Scene Context with Multimodal LLMs
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Haptic feedback contributes to immersive virtual reality (VR) experiences. However, designing such feedback at scale for all objects within a VR scene remains time-consuming. We present Scene2Hap, an LLM-centered system that automatically designs object-level vibrotactile feedback for entire VR scenes based on the objects’ semantic attributes and physical context. Scene2Hap employs a multimodal large language model to estimate each object’s semantics and physical context, including its material properties and vibration behavior, from multimodal information in the VR scene. These estimated attributes are then used to generate or retrieve audio signals, subsequently converted into plausible vibrotactile signals. For more realistic spatial haptic rendering, Scene2Hap estimates vibration propagation and attenuation from vibration sources to neighboring objects, considering the estimated material properties and spatial relationships of virtual objects in the scene. Three user studies confirm that Scene2Hap successfully estimates the vibration-related semantics and physical context of VR scenes and produces realistic vibrotactile signals.


💡 Research Summary

Scene2Hap addresses the long‑standing bottleneck of providing realistic, scene‑wide haptic feedback in virtual reality by marrying multimodal large language models (LLMs) with a lightweight physics‑inspired rendering engine. The system operates in two major phases.

In the first phase, called LLM‑Based Haptic Inference, the pipeline automatically extracts multimodal data from a given VR scene. Global context is captured through the scene name and a set of screenshots taken from multiple viewpoints. For each object, isolated images, context images (with a pink outline to aid visual disambiguation), the developer‑provided object name, its dominant surface dimensions, and its relative height above the lowest object are collected. These inputs are fed to a multimodal LLM via a chain of four prompt‑engineered components: Scene Analyzer, Object Analyzer, Material Property Estimator, and Vibration Describer. The Scene Analyzer establishes the overall environment category; the Object Analyzer identifies the object’s function and whether it is a potential vibration source; the Material Property Estimator predicts material class (metal, wood, plastic, etc.) together with quantitative parameters such as density and elastic modulus; finally, the Vibration Describer either retrieves a matching audio clip from a curated library or generates one with a text‑to‑audio model. The result is a structured description for every object: (i) semantic vibration behavior, (ii) material properties, and (iii) an associated audio waveform.

The second phase, Physics‑Inspired Haptic Rendering, converts those descriptions into real‑time vibrotactile output. Rather than solving full finite‑element models for every object, the authors adopt a plate‑based analytical model, which is sufficient for the majority of everyday surfaces (tables, countertops, screens). Using the LLM‑estimated density and elasticity, an attenuation ratio is computed for each material. Objects are linked in a contact graph that encodes spatial adjacency and relative orientation. When a vibration source is activated, its audio signal is low‑pass filtered (cut‑off 250 Hz) to produce a base vibration waveform. The waveform is then propagated through the graph: each neighboring object receives a version attenuated according to distance and material‑specific damping. During interaction, the user’s touch point (detected via standard VR controllers) determines which object’s local material properties are applied, allowing the system to modulate amplitude and frequency on the fly. The final signal is sent to handheld vibrotactile actuators, delivering a context‑aware, physically plausible haptic sensation.

Three user studies validate the approach. Study 1 measures the fidelity of LLM‑inferred semantics and material parameters against ground‑truth physical measurements, achieving an average accuracy of 85 %. Study 2 compares a baseline system that assigns static vibrations to objects with Scene2Hap’s propagation‑aware rendering; participants report significantly higher material realism and spatial awareness, with a 1.3× increase in subjective immersion scores. Study 3 evaluates the full end‑to‑end pipeline in complex scenes (kitchen, office, laboratory). Out of 30 participants, 27 describe the haptic experience as “natural and consistent,” and performance metrics show reduced design time for haptic authoring.

Key contributions are: (1) a novel multimodal LLM pipeline that extracts both semantic and physical attributes of virtual objects in context; (2) a physics‑inspired rendering module that uses those attributes to simulate vibration propagation and attenuation in real time; (3) extensive empirical evidence that the combined system improves perceived realism and reduces manual haptic design effort. By demonstrating that LLMs can serve as a bridge between high‑level scene understanding and low‑level physical modeling, Scene2Hap opens a pathway toward making rich, adaptive haptics a default component of future VR and mixed‑reality applications.


Comments & Academic Discussion

Loading comments...

Leave a Comment