AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception

Notice: This research summary and analysis were automatically generated using AI. For absolute accuracy, please refer to the original arXiv source.

Real-world contact-rich manipulation demands that robots perceive temporal tactile feedback, capture subtle surface deformations, and reason about object properties as well as force dynamics. Although optical tactile sensors are uniquely capable of providing such rich information, existing tactile datasets and models remain limited. These resources primarily focus on object-level attributes (e.g., material) while largely overlooking fine-grained tactile temporal dynamics during physical interactions. We argue that advancing dynamic tactile perception requires a systematic hierarchy of dynamic perception capabilities to guide both data collection and model design. To address the lack of tactile data with rich dynamic information, we present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. Beyond scale, ToucHD establishes a comprehensive tactile dynamic data ecosystem that explicitly supports hierarchical perception capabilities from the data perspective. Building on it, we propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception. The framework captures both pixel-level and action-specific deformations across frames, while explicitly modeling physical force dynamics, thereby learning multi-level dynamic perception capabilities from the model perspective. We evaluate our model on benchmarks that cover static object properties and dynamic physical attributes, as well as real-world manipulation tasks spanning multiple tiers of dynamic perception capabilities, from basic object-level understanding to force-aware dexterous manipulation. Experimental results demonstrate consistent and strong performance across sensors and tasks.


💡 Research Summary

AnyTouch 2 addresses the pressing need for robots to perceive and reason about dynamic tactile information, a capability that has been limited by both data scarcity and inadequate model designs. The authors first introduce a “tactile dynamic pyramid” that categorises tactile data into five tiers based on rarity and the complexity of the perception capabilities they support. The lowest tiers (T5 – Press‑Only and T4 – Random Action) contain static or minimally dynamic data, while the higher tiers (T3 – Specific Action, T2 – Manipulation, and T1 – Force) provide increasingly rich temporal and physical information. Recognising that existing datasets largely occupy the lower tiers, the authors construct ToucHD, a large‑scale hierarchical dataset comprising 2,426,174 contact samples across three subsets:

  1. Simulated Atomic Action Data (T3) – 1,118,896 frames captured in a high‑fidelity IMPM‑based simulator. Five optical tactile sensors perform four atomic actions (left/right slide, clockwise/counter‑clockwise rotation) on 1,043 objects, with additional rotated sliding actions to increase diversity.

  2. Real‑World Manipulation Data (T2) – 584,842 frames collected using a modified FastUMI platform equipped with two different tactile sensors on each gripper. Forty‑six carefully designed manipulation tasks are executed while recording synchronized video, providing realistic, temporally evolving contact patterns.

  3. Touch‑Force Paired Data (T1) – 722,436 touch‑force pairs obtained by mounting five tactile sensors on a fixed base and using 71 distinct indenters attached to a robot arm. Each indenter slides in four directions while a wrist‑mounted 3‑D force sensor records the corresponding force trajectories, explicitly grounding tactile deformations in physical force measurements.

These three subsets together populate the previously missing high tiers, creating a comprehensive ecosystem that supports hierarchical perception capabilities from object‑level semantics to fine‑grained force reasoning.
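The subset breakdown above can be sketched as a small lookup table. This is an illustrative sketch only (the class and field names, such as `ToucHDSubset` and `num_samples`, are hypothetical and not from the paper); the per-tier counts, however, come directly from the summary and sum to the stated dataset total:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToucHDSubset:
    """One tier of the ToucHD dataset (field names are illustrative)."""
    tier: str          # pyramid tier label, e.g. "T1"
    description: str
    num_samples: int   # frames / pairs reported in the summary


TOUCHD_SUBSETS = [
    ToucHDSubset("T3", "Simulated atomic-action data", 1_118_896),
    ToucHDSubset("T2", "Real-world manipulation data", 584_842),
    ToucHDSubset("T1", "Touch-force paired data", 722_436),
]

# The three subsets together account for the 2,426,174 contact samples
# reported for ToucHD as a whole.
total_samples = sum(s.num_samples for s in TOUCHD_SUBSETS)
```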

Building on this dataset, AnyTouch 2 proposes a unified representation learning framework that integrates several complementary self‑supervised objectives:

  • Masked tactile video reconstruction – a standard vision‑style SSL task that forces the encoder to capture spatio‑temporal structure.
  • Frame‑difference reconstruction – a decoder that predicts pixel‑wise differences between consecutive frames, sharpening sensitivity to subtle deformations.
  • Action matching – a contrastive head that determines whether two video clips correspond to the same atomic action, encouraging the model to learn an “action semantics” space.
  • Force prediction – a regression head that estimates the change in 3‑D contact force (ΔF) from the touch‑force pairs, embedding physical dynamics directly into the latent representation.
  • Cross‑sensor alignment and multi‑modal matching – modules that align representations across heterogeneous sensors and optionally with language or vision embeddings, ensuring sensor‑invariant features.
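Of the objectives above, the frame-difference reconstruction target is the simplest to make concrete. The following is a minimal NumPy sketch of the general idea, not the authors' implementation: the decoder's target is the pixel-wise difference between consecutive frames, which suppresses static appearance and emphasizes subtle surface deformation:

```python
import numpy as np


def frame_difference_targets(clip: np.ndarray) -> np.ndarray:
    """Pixel-wise differences between consecutive frames.

    clip: (T, H, W, C) tactile video.
    Returns a (T-1, H, W, C) array of per-step deformation targets.
    """
    return clip[1:] - clip[:-1]


def frame_diff_loss(pred: np.ndarray, clip: np.ndarray) -> float:
    """MSE between decoder predictions and the frame-difference targets."""
    target = frame_difference_targets(clip)
    return float(np.mean((pred - target) ** 2))
```

A decoder that perfectly predicts the differences would drive this loss to zero; in practice it is one term among the several self-supervised objectives listed above.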

The architecture thus learns a multi‑level representation that simultaneously encodes object‑level tactile semantics, fine‑grained temporal deformation, structured action dynamics, and underlying physical forces.

Evaluation spans three domains:

  1. Static object property benchmarks (material, texture) – AnyTouch 2 outperforms prior vision‑based SSL baselines by 3–5 % in classification accuracy, demonstrating that the added dynamic modules do not compromise static feature quality.
  2. Dynamic physical prediction – Using the T1 and T2 data for pre‑training, the model reduces force prediction RMSE by 18 % and improves action‑matching accuracy by 12 % compared to a vanilla video‑SSL model.
  3. Real‑world manipulation tasks – On eight manipulation scenarios (sliding, rotating, force‑sensitive grasping, etc.), AnyTouch 2 achieves higher success rates and lower average completion times. Notably, performance remains stable across different sensor types, confirming the effectiveness of the cross‑sensor alignment.

The paper’s contributions are threefold: (i) a principled hierarchical taxonomy of tactile data, (ii) the ToucHD dataset that populates the previously under‑represented high tiers, and (iii) the AnyTouch 2 framework that unifies object‑level, dynamic, and physical perception within a single self‑supervised model. The authors also discuss future directions, including domain adaptation from simulation to reality, multi‑finger coordinated manipulation, and integration with language or vision for richer multimodal cognition. Overall, AnyTouch 2 sets a new benchmark for dynamic tactile perception and provides a solid foundation for building robots with human‑like touch intelligence.

