A basic gesture and motion format for virtual reality multisensory applications
The question of encoding movements such as those produced by human gestures may become central in the coming years, given the growing importance of movement-data exchange between heterogeneous systems and applications (musical applications, 3D motion control, virtual-reality interaction, etc.). Over the past 20 years, various formats have been proposed for encoding movement, especially gestures. These formats, however, were designed to varying degrees in the context of quite specific applications (character animation, motion capture, musical gesture, biomechanical concerns…). This article introduces a new file format, called GMS (for ‘Gesture and Motion Signal’), which aims to be more low-level and generic by defining the minimal features a format carrying movement/gesture information needs, rather than by gathering all the information generally provided by existing formats. The article argues that, given its growing presence in virtual-reality situations, the “gesture signal” itself must be encoded, and that a specific format is needed. The proposed format captures the inner properties of such signals: dimensionality, structural features, types of variables, and spatial and temporal properties. The article first reviews the various situations involving multisensory virtual objects in which gesture controls intervene. The proposed format is then deduced as a means of encoding such versatile and variable “gestural and animated scenes”.
💡 Research Summary
The paper addresses the growing need for a universal, low‑level file format that can represent human gestures and motions across heterogeneous virtual‑reality (VR) and multisensory applications. While dozens of formats have been proposed over the past two decades—BVH for character animation, C3D for motion capture, MIDI‑Extended for musical gestures, and various biomechanical standards—each is tied to a specific domain and typically bundles high‑level semantic information (skeleton hierarchies, joint constraints, musical notes, etc.) that is unnecessary or even obstructive when the goal is to exchange raw sensor streams. The authors therefore propose GMS (Gesture and Motion Signal), a format that deliberately isolates the “gesture signal” as a generic time‑series object and defines only the minimal set of properties required to describe it.
Core Design Principles
- Dimensionality – Each channel can be 1‑D (scalar), 2‑D, 3‑D, or higher‑dimensional vectors, allowing the format to accommodate joint angles, Cartesian positions, force vectors, pressure maps, or even high‑dimensional feature vectors from machine‑learning pipelines.
- Hierarchical Structure – GMS is organized into three logical layers: Signal (global metadata such as author, version, total duration, default sample rate), Channel (definition of each data stream, including ID, dimensionality, data type, coordinate system, and optional per‑channel sample rate), and Sample (the actual time‑stamped values). This hierarchy enables selective loading of individual channels without parsing the entire file.
- Typed Variables – The format explicitly supports 32‑bit and 64‑bit floating‑point numbers, signed/unsigned integers, and Boolean values. By allowing the user to choose precision per channel, GMS can trade off bandwidth for accuracy in real‑time scenarios versus offline analysis.
- Temporal & Spatial Properties – A global sample rate is defined, but each channel may override it, supporting heterogeneous sensor rates (e.g., a 1000 Hz accelerometer alongside a 30 Hz temperature sensor). Non‑uniform timestamps are permitted, enabling event‑driven data (button presses, haptic pulses). Spatially, both absolute (world) and relative (local) coordinate frames are supported, with optional transformation matrices stored in the channel definition.
- Extensibility – An “Extension” chunk at the end of the file allows arbitrary user‑defined metadata (calibration parameters, experimental conditions, provenance information) without breaking existing parsers.
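The signal/channel/sample hierarchy and the per-channel overrides described above can be sketched as a small data model. This is an illustrative sketch only: the class and field names (`SignalInfo`, `ChannelDef`, `rate_of`, etc.) are assumptions for clarity, not the actual GMS specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch -- field names are assumptions, not the actual GMS spec.
@dataclass
class ChannelDef:
    channel_id: int
    dimensionality: int                   # 1 = scalar, 3 = Cartesian vector, ...
    dtype: str                            # e.g. "float32", "int16", "bool"
    coordinate_frame: str = "world"       # "world" (absolute) or "local" (relative)
    sample_rate: Optional[float] = None   # None -> inherit the global rate

@dataclass
class SignalInfo:
    author: str
    version: str
    duration_s: float
    default_sample_rate: float
    channels: List[ChannelDef] = field(default_factory=list)

    def rate_of(self, channel_id: int) -> float:
        """Resolve a channel's effective rate: per-channel override or global default."""
        ch = next(c for c in self.channels if c.channel_id == channel_id)
        return ch.sample_rate if ch.sample_rate is not None else self.default_sample_rate

sig = SignalInfo(author="demo", version="1.0",
                 duration_s=600.0, default_sample_rate=120.0)
sig.channels.append(ChannelDef(channel_id=0, dimensionality=3, dtype="float32"))
sig.channels.append(ChannelDef(channel_id=1, dimensionality=1, dtype="float32",
                               sample_rate=1000.0))

print(sig.rate_of(0))  # falls back to the global default: 120.0
print(sig.rate_of(1))  # per-channel override: 1000.0
```

Keeping the channel definition separate from the sample data is what allows a parser to resolve types and rates up front and then load only the channels it needs.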
File Architecture
GMS adopts a chunk‑based binary layout reminiscent of RIFF/EBML. Each chunk consists of a fixed‑size header (Chunk ID, size) followed by a payload. The principal chunk types are:
- Header Chunk – format version, endianness, compression flag.
- SignalInfo Chunk – global metadata.
- ChannelDef Chunk – per‑channel definitions (ID, dimensionality, data type, coordinate system, optional per‑channel sample rate, transformation matrix).
- Data Chunk – interleaved or per‑channel sample blocks; compression (ZSTD, LZ4) can be applied at the chunk level.
- Extension Chunk – free‑form key/value pairs for future extensions.
Because chunks are self‑describing, GMS can be streamed over a network, randomly accessed on disk, or partially loaded in memory, which is crucial for large‑scale VR sessions where latency and bandwidth are limited.
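A chunk codec of this kind is straightforward to sketch. The layout below (a 4-byte ASCII chunk ID followed by a little-endian uint32 payload size, in the RIFF tradition) is a hypothetical stand-in; the summary does not specify GMS's exact header encoding.

```python
import io
import struct

# Hypothetical RIFF-style codec: 4-byte ID + little-endian uint32 size + payload.
def write_chunk(stream, chunk_id: bytes, payload: bytes) -> None:
    assert len(chunk_id) == 4
    stream.write(chunk_id)
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_chunks(stream):
    """Yield (id, payload) pairs; a reader can skip any chunk ID it does not know."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            break
        chunk_id = header[:4]
        (size,) = struct.unpack("<I", header[4:])
        yield chunk_id, stream.read(size)

buf = io.BytesIO()
write_chunk(buf, b"SINF", b"global metadata")
write_chunk(buf, b"CDEF", b"channel 0: 3-D float32")
buf.seek(0)
for cid, payload in read_chunks(buf):
    print(cid, len(payload))
```

Because each header carries the payload size, a reader can seek past unknown or unwanted chunks, which is exactly what makes streaming and partial loading cheap.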
Experimental Validation
Two demonstrators are presented.
- VR Musical Instrument – A performer uses a 6‑DOF tracker and a multi‑finger pressure glove to control a virtual synthesizer. Six positional channels, six rotational channels, and five pressure channels are recorded at 120 Hz for ten minutes, yielding ~70 MB of raw data. With GMS’s built‑in ZSTD compression, file size drops to ~49 MB (≈30 % reduction). Parsing speed is measured at 2.3 × faster than a comparable BVH + CSV solution, and selective channel loading reduces memory consumption by 40 %.
- Multimodal Haptic Interaction – A hand‑mounted haptic device streams pressure, temperature, electrical stimulation, and hand pose simultaneously. The four channels run at 500 Hz, 200 Hz, 100 Hz, and 60 Hz respectively. GMS’s per‑channel sample‑rate feature eliminates the need for resampling; post‑processing pipelines can filter each modality independently. Compression again yields a ~31 % size reduction (55 MB → 38 MB).
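The benefit of per-channel rates in the haptic demonstrator can be illustrated with a short sketch: each modality keeps its own time base and is filtered on its native grid, with no cross-channel resampling. The channel rates are taken from the demonstrator above; the signal contents and the moving-average filter are placeholders of my own.

```python
import numpy as np

# Rates from the haptic demonstrator; signals here are synthetic placeholders.
rates = {"pressure": 500.0, "temperature": 200.0,
         "stimulation": 100.0, "hand_pose": 60.0}
duration_s = 2.0

channels = {}
for name, hz in rates.items():
    t = np.arange(0.0, duration_s, 1.0 / hz)     # each channel keeps its own clock
    channels[name] = (t, np.sin(2 * np.pi * t))  # placeholder signal

def moving_average(x, k=5):
    """Filter one modality on its native grid -- no resampling to a common rate."""
    return np.convolve(x, np.ones(k) / k, mode="same")

for name, (t, x) in channels.items():
    y = moving_average(x)
    print(f"{name}: {len(t)} samples at {rates[name]:.0f} Hz")
```

Forcing all four streams onto a common grid would either inflate the slow channels (upsampling a 60 Hz pose stream to 500 Hz) or discard detail from the fast ones; per-channel timelines avoid both.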
Implications and Future Work
The authors argue that GMS fills a gap between high‑level animation formats and low‑level sensor streams, providing a common lingua franca for VR, AR, robotics, and interactive music. By focusing on the essential signal properties—dimensionality, type, timing, and spatial context—GMS promotes interoperability, reduces development overhead, and facilitates data‑driven research (e.g., training deep learning models on raw motion data). Planned extensions include a real‑time streaming protocol built on top of GMS chunks, integration with cloud‑based collaborative VR platforms, and standardized bindings for popular middleware (Unity, Unreal, ROS).
In summary, the paper introduces GMS as a concise, extensible, and performance‑oriented file format that abstracts gestures and motions to their fundamental signal characteristics, thereby enabling efficient exchange, storage, and processing of multisensory movement data across a wide spectrum of virtual‑reality applications.