Vibro-Sense: Robust Vibration-based Impulse Response Localization and Trajectory Tracking for Robotic Hands


Rich contact perception is crucial for robotic manipulation, yet traditional tactile skins remain expensive and complex to integrate. This paper presents a scalable alternative: high-accuracy whole-body touch localization via vibro-acoustic sensing. By equipping a robotic hand with seven low-cost piezoelectric microphones and leveraging an Audio Spectrogram Transformer, we decode the vibrational signatures generated during physical interaction. Extensive evaluation across stationary and dynamic tasks reveals a localization error of under 5 mm in static conditions. Furthermore, our analysis highlights the distinct influence of material properties: stiff materials (e.g., metal) excel in impulse response localization due to sharp, high-bandwidth responses, whereas textured materials (e.g., wood) provide superior friction-based features for trajectory tracking. The system demonstrates robustness to the robot’s own motion, maintaining effective tracking even during active operation. Our primary contribution is demonstrating that complex physical contact dynamics can be effectively decoded from simple vibrational signals, offering a viable pathway to widespread, affordable contact perception in robotics. To accelerate research, we provide our full datasets, models, and experimental setups as open-source resources.


💡 Research Summary

Vibro‑Sense tackles the long‑standing challenge of providing rich tactile perception for robotic manipulators without resorting to expensive, fragile tactile skins. The authors equip a 19‑DOF Seed Robotics RH8D hand with only seven low‑cost Harley‑Bentton CM‑1000 piezoelectric contact microphones, strategically placed around the hand’s structure. Raw voltage signals are sampled at 50 kHz, down‑sampled to 20 kHz, and a 200 ms window centered on the contact event is extracted. A Short‑Time Fourier Transform (STFT) with a window size of 128 converts each channel into a time‑frequency spectrogram; the first 100 ms of each recording serve to estimate and subtract steady‑state background noise.
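The per-channel preprocessing described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name, the use of polyphase resampling, and simple spectral subtraction of the noise floor are my choices; only the sampling rates, the 200 ms window, the 100 ms noise segment, and the STFT window size of 128 come from the paper.

```python
import numpy as np
from scipy import signal

FS_RAW, FS = 50_000, 20_000   # raw and down-sampled rates (Hz), per the paper
NFFT = 128                    # STFT window size reported in the paper

def preprocess_channel(raw, event_sample):
    """Sketch of one microphone channel's pipeline: down-sample,
    extract a 200 ms window centred on the contact event, STFT,
    and subtract the noise floor estimated from the first 100 ms
    of the recording. Names are illustrative, not from the paper."""
    x = signal.resample_poly(raw, FS, FS_RAW)          # 50 kHz -> 20 kHz
    event = event_sample * FS // FS_RAW                # re-index event sample
    half = int(0.100 * FS)                             # 100 ms on each side
    chunk = x[max(event - half, 0): event + half]      # 200 ms analysis window

    _, _, Z = signal.stft(chunk, fs=FS, nperseg=NFFT)
    spec = np.abs(Z)                                   # magnitude spectrogram

    # Noise floor: mean magnitude spectrum of the recording's first
    # 100 ms, assumed to be contact-free (a simplifying assumption).
    _, _, Zn = signal.stft(x[:half], fs=FS, nperseg=NFFT)
    noise = np.abs(Zn).mean(axis=1, keepdims=True)
    return np.clip(spec - noise, 0.0, None)            # spectral subtraction
```

Running this for each of the seven channels and stacking the results would yield the multi-channel spectrogram input for the transformer.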

The processed spectrograms are fed into an Audio Spectrogram Transformer (AST), a vision‑transformer‑style architecture that tokenizes spectrogram patches and learns global temporal‑frequency relationships through multi‑head self‑attention. Two distinct tasks are investigated:

  1. Impulse‑Response Localization – A UR5e arm drives a solenoid actuator equipped with interchangeable cylindrical indenters (soft plastic, hard plastic, wood, metal) to deliver controlled pokes on the hand’s front, back, left, and right sides. Over 65 000 labeled samples are collected under both idle and powered‑on conditions (the latter includes fan noise). The AST regresses the 3‑D contact coordinates. Results show that stiff materials such as metal, which generate broadband high‑frequency vibrations, achieve sub‑3 mm mean localization error, while softer or textured materials yield slightly larger errors but remain under 5 mm. This demonstrates that material stiffness directly influences the richness of the vibrational spectrum and thus the localization precision.

  2. Trajectory Tracking – The UR5e draws a variety of patterns (selected from the open‑source Quick Draw dataset) on the hand’s forearm using the same four indenters. Two datasets are built: (a) a “static hand” set with 160 k strokes (hand fixed, fan noise present) and (b) a “dynamic hand” set with 80 k strokes where the hand moves to random poses while drawing. Continuous recordings are segmented into 200 ms chunks and processed identically to the impulse task. The AST now predicts the instantaneous contact point for each chunk, effectively reconstructing the drawing trajectory. Even when the hand’s own actuators generate significant vibration, the model maintains an average positional error of 6 mm (static) and 9 mm (dynamic), confirming robustness to self‑induced noise and motion.
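Both tasks above reduce to regressing a 3-D contact point from a multi-channel spectrogram, with trajectory tracking applying the same regressor to successive 200 ms chunks. A minimal AST-style sketch in PyTorch, assuming the seven channel spectrograms are stacked as input channels; the architecture details (patch size, embedding width, depth, mean pooling, omitted positional embeddings) are my simplifications, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SpectrogramRegressor(nn.Module):
    """Illustrative AST-style regressor: spectrogram patches are
    linearly embedded via a strided convolution, processed by a
    transformer encoder, and pooled into a 3-D contact coordinate."""
    def __init__(self, channels=7, patch=16, dim=192, depth=4, heads=4):
        super().__init__()
        # Patch tokenization: non-overlapping patches -> embedding vectors
        self.patchify = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 3)                   # (x, y, z) contact point

    def forward(self, spec):                            # spec: (B, 7, freq, time)
        tokens = self.patchify(spec).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.head(self.encoder(tokens).mean(dim=1))       # (B, 3)

def reconstruct_trajectory(model, chunks):
    """Trajectory tracking as framed in the summary: predict one
    contact point per 200 ms chunk and string the points together."""
    with torch.no_grad():
        return torch.stack([model(c.unsqueeze(0)).squeeze(0) for c in chunks])
```

For the impulse task the model is trained on single windows; for trajectory tracking, `reconstruct_trajectory` concatenates per-chunk predictions into the drawn path.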

Key contributions include: (i) a cost‑effective, scalable vibro‑acoustic sensing pipeline that leverages a sparse microphone array and a powerful transformer model; (ii) a systematic analysis of how material properties (stiffness vs. texture) affect impulse versus friction‑based signal components; (iii) validation of the approach in dynamic scenarios where the robot is simultaneously moving and sensing; and (iv) the release of all code, hardware CAD files, and the two large‑scale datasets (≈ 65 k impulse samples, ≈ 240 k trajectory samples) to the community.

Limitations are acknowledged: the seven‑microphone configuration may not capture fine‑grained 3‑D deformation fields on highly complex geometries; high‑frequency environmental noise (e.g., from fast‑spinning motors) can degrade performance without additional filtering; and the study focuses only on impact and sliding contacts, leaving compression, shear, and multi‑point contacts for future work. Prospective research directions involve optimizing microphone placement, fusing vibro‑acoustic data with vision and force‑torque sensing, implementing online adaptation to cope with wear‑induced sensor drift, and deploying the system in real‑world manipulation pipelines such as assembly or packaging. Overall, Vibro‑Sense demonstrates that simple vibrational cues, when processed with modern deep learning, can deliver whole‑hand tactile awareness at a fraction of the cost of conventional tactile skins.

