Exp-Force: Experience-Conditioned Pre-Grasp Force Selection with Vision-Language Models
Accurate pre-contact grasp force selection is critical for safe and reliable robotic manipulation. Adaptive controllers regulate force after contact but still require a reasonable initial estimate: starting a grasp with too little force demands reactive adjustment, while starting with too much force risks damaging fragile objects. This trade-off is particularly challenging for compliant grippers, whose contact mechanics are difficult to model analytically. We propose Exp-Force, an experience-conditioned framework that predicts the minimum feasible grasping force from a single RGB image. The method retrieves a small set of relevant prior grasping experiences and conditions a vision-language model on these examples for in-context inference, without analytic contact models or manually designed heuristics. On 129 object instances, Exp-Force achieves a best-case MAE of 0.43 N, reducing error by 72% over zero-shot inference. In real-world tests on 30 unseen objects, it improves the appropriate-force selection rate from 63% to 87%. These results demonstrate that Exp-Force enables reliable and generalizable pre-grasp force selection by leveraging prior interaction experiences. http://expforcesubmission.github.io/Exp-Force-Website/
💡 Research Summary
Exp‑Force tackles the long‑standing challenge of selecting an appropriate pre‑contact grasping force for robotic manipulators, especially those equipped with compliant fingers whose contact mechanics are difficult to model analytically. While adaptive force controllers can adjust grip after contact, they still require a reasonable initial guess; too little force leads to slip, too much can damage delicate items. The authors propose an experience‑conditioned framework that predicts the minimum feasible grasping force directly from a single RGB image captured by a wrist‑mounted camera, without any explicit physics equations or hand‑crafted heuristics.

The pipeline consists of three stages:

1. An object description generator (a vision‑language model, VLM) receives a structured prompt that includes task context (gripper geometry, material, a scale reference image) and the query image, and outputs a textual description of the object's physical attributes (size, shape, surface texture, apparent rigidity).
2. Both the image and the generated description are embedded into a shared multimodal space using a pretrained embedding model (Qwen3‑VL‑Embedding‑8B). Cosine similarity between the query embedding and the embeddings of a modest experience pool (129 objects, each with name, mass, description, image, and ground‑truth minimum force) yields the top‑k most similar past grasps (k=6 in the best configuration).
3. A predictor VLM (tested with GPT‑5.2, Gemini‑3‑Flash, and Gemini‑3‑Pro) is prompted with the same task context, the retrieved examples, a force‑prediction instruction, and the query image, and generates an in‑context estimate of the scalar force F̂.

When k=0, the system reduces to zero‑shot inference, allowing a direct comparison that isolates the benefit of experience conditioning. Offline evaluation on the 129‑object dataset shows a mean absolute error (MAE) of 0.43 N, a 72% reduction relative to zero‑shot VLM predictions.
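The retrieval stage above amounts to a cosine-similarity top-k search over pooled embeddings. A minimal NumPy sketch, using random toy vectors in place of actual Qwen3-VL-Embedding-8B outputs (the function name and 8-dimensional embeddings are illustrative, not the paper's implementation):

```python
import numpy as np

def top_k_experiences(query_emb, pool_embs, k=6):
    """Return indices of the k most similar past grasps by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    P = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = P @ q                      # cosine similarity via normalized dot product
    return np.argsort(-sims)[:k], sims  # descending order, keep top k

# Toy pool: 129 experiences with 8-dim embeddings (placeholder for real VLM embeddings).
rng = np.random.default_rng(0)
pool = rng.normal(size=(129, 8))
query = pool[42] + 0.01 * rng.normal(size=8)  # a query nearly identical to experience 42

idx, sims = top_k_experiences(query, pool, k=6)
print(idx[0])  # the nearest stored experience
```

Because both sides are L2-normalized, the dot product equals cosine similarity, so the whole pool can be scored with one matrix-vector product before selecting the top k.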
The error reduction is consistent across six semantic categories (cuboid, cylinder, bottle, odd shapes, fragile‑light, fragile‑heavy). Real‑world experiments on a Franka Emika Panda robot equipped with FORTE fin‑ray tactile fingers and a parallel‑finger gripper tested 30 previously unseen objects. The appropriate‑force selection rate rose from 63% (zero‑shot) to 87% with Exp‑Force, with fewer slip events and less object damage, particularly for thin, slippery bottles and fragile containers.

The authors highlight five contributions: (1) the first use of retrieval‑augmented VLMs for grasp‑force estimation, (2) demonstration that a tiny experience pool (≈1 hour of robot data) suffices for accurate predictions, (3) evidence that only a handful of in‑context examples are needed, confirming the sample efficiency of the approach, (4) validation on a compliant gripper platform, and (5) public release of the dataset and code. Limitations include reliance on RGB images and known object mass, the absence of direct friction or compliance measurements, sensitivity to prompt design and model bias, and evaluation limited to a parallel‑finger compliant gripper. Future work is suggested in three directions: integrating tactile or force‑sensor data to refine VLM outputs, extending the method to diverse gripper morphologies (multi‑fingered hands, suction cups), and developing online mechanisms for continuously expanding and curating the experience pool.

In sum, Exp‑Force shows that large vision‑language models, when grounded in a few relevant past interactions, can implicitly capture complex contact physics and provide data‑efficient, generalizable pre‑grasp force estimates, opening a promising new pathway for combining foundation models with low‑level robot control.
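The in-context prediction stage of the pipeline can be illustrated as simple prompt assembly: task context, retrieved experiences, a force-prediction instruction, and the query image. All field names and prompt wording below are hypothetical, since the paper's exact template is not reproduced here:

```python
def build_force_prompt(task_context, examples, query_image_ref):
    """Assemble a text prompt for the predictor VLM (illustrative template only)."""
    parts = [task_context]
    for ex in examples:  # the k retrieved experiences; k=0 reduces to zero-shot
        parts.append(
            f"Example: {ex['name']}, mass {ex['mass_g']} g, "
            f"description: {ex['description']}, "
            f"minimum grasp force: {ex['force_n']} N"
        )
    parts.append("Predict the minimum feasible grasping force (in N) "
                 "for the object in the query image.")
    parts.append(f"Query image: {query_image_ref}")
    return "\n".join(parts)

prompt = build_force_prompt(
    "Gripper: compliant fin-ray fingers; a scale reference image is attached.",
    [{"name": "glass jar", "mass_g": 210,
      "description": "rigid, smooth cylinder", "force_n": 1.8}],
    "<image>",
)
print(prompt)
```

Setting `examples=[]` yields the zero-shot baseline prompt, which mirrors how the paper isolates the benefit of experience conditioning by varying k alone.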