GraspLook: a VR-based Telemanipulation System with R-CNN-driven Augmentation of Virtual Environment
The teleoperation of robotic systems in medical applications requires stable and convenient visual feedback for the operator. The most accessible approach to delivering visual information from a remote area is to transmit a video stream from cameras in the environment. However, such systems are sensitive to camera resolution, limited viewpoints, and cluttered environments, which place additional mental demands on the human operator. This paper proposes a novel teleoperation system based on an augmented virtual environment (VE). A region-based convolutional neural network (R-CNN) detects the laboratory instrument and estimates its position in the remote environment so that its digital twin can be displayed in the VE, which is necessary for dexterous telemanipulation. The experimental results revealed that the developed system allows users to operate the robot more smoothly, decreasing task execution time when manipulating test tubes. In addition, the participants rated the developed system as less mentally demanding (by 11%) and requiring less effort (by 16%) to accomplish the task than the camera-based teleoperation approach, and they assessed their own performance in the augmented VE highly. The proposed technology can potentially be applied to conducting laboratory tests in remote areas when operating with infectious and poisonous reagents.
💡 Research Summary
The paper introduces GraspLook, a novel tele‑manipulation platform that combines a collaborative robot (UR3) with a virtual‑reality (VR) interface and a deep‑learning based perception pipeline to overcome the limitations of conventional camera‑only teleoperation in medical and laboratory settings. The hardware stack consists of a 6‑DOF UR3 arm equipped with a 2‑finger Robotiq gripper, an 8‑megapixel on‑gripper RGB camera, and an Intel RealSense D435 RGB‑D sensor mounted on the end‑effector. Operators control the robot’s translational motion using an Omega.7 desktop haptic device; rotational degrees of freedom are fixed to simplify control, and a workspace scaling factor (1×–5×) can be selected to match the haptic device’s smaller work envelope to the robot’s larger reach. Optional axis‑locking further enhances precision for fine adjustments.
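The control mapping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, default scale, frame alignment, and robot-side origin are assumptions.

```python
import numpy as np

def map_haptic_to_robot(haptic_pos, scale=3.0, locked_axes=(), origin=np.zeros(3)):
    """Map the Omega.7 stylus position (metres, device frame) to a robot
    target position.  `scale` stretches the haptic workspace onto the
    robot's larger reach (the paper allows factors of 1x-5x); any axis
    listed in `locked_axes` (0=x, 1=y, 2=z) is frozen at the robot-side
    origin so the operator can make fine single-axis adjustments.
    Assumes the device and robot frames are already aligned."""
    target = origin + scale * np.asarray(haptic_pos, dtype=float)
    for ax in locked_axes:
        target[ax] = origin[ax]
    return target
```

With a 5× scale, a 1 cm stylus motion commands a 5 cm robot motion; locking the z-axis keeps the gripper at a constant height while the operator fine-tunes x and y.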
Visual feedback is delivered through two complementary channels. The first is the raw video stream from the on‑gripper camera, providing a direct view of the remote scene. The second channel is a digitally‑augmented VR environment built in Unity, where a digital twin of the robot is continuously updated at 50 Hz and where laboratory instruments are rendered as 3‑D CAD models positioned according to real‑time pose estimates. Pose estimation is performed by a Mask R‑CNN (ResNet‑101‑FPN backbone) trained on a synthetic dataset of 8 000 images containing eight common lab tools (scraper, micro test tube, needle holder, Pasteur pipette, pipettor, centrifuge test tube, vacuum test tube, swab). The dataset was generated by extracting object silhouettes from internet images, overlaying them on ten realistic laboratory backgrounds, and applying random rotations, scalings, and reflections. Training for 10 k iterations on Google Colab achieved average precision (AP) above 88 % across classes, with the centrifuge test tube reaching AP₅₀ = 97.5 %, which was later used as the benchmark object in user experiments.
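The synthetic data pipeline (cut-out silhouettes composited onto laboratory backgrounds with random transforms) can be sketched as below. All names are illustrative; the nearest-neighbour rotation is a minimal stand-in for whatever image library the authors used, and random scaling is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate_nn(img, deg):
    """Nearest-neighbour rotation about the image centre (minimal stand-in
    for a full image-library rotate; pixels rotated out of frame are lost)."""
    h, w = img.shape[:2]
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sx = np.rint( c * (xx - cx) + s * (yy - cy) + cx).astype(int)
    sy = np.rint(-s * (xx - cx) + c * (yy - cy) + cy).astype(int)
    out = np.zeros_like(img)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[ok] = img[sy[ok], sx[ok]]
    return out

def composite_sample(silhouette, backgrounds):
    """Paste an RGBA instrument cut-out onto a random laboratory background
    with a random rotation and reflection, returning the composite image
    and its bounding-box label -- the augmentation scheme the paper describes."""
    bg = backgrounds[rng.integers(len(backgrounds))].copy()
    obj = rotate_nn(silhouette, rng.uniform(-180, 180))
    if rng.random() < 0.5:
        obj = obj[:, ::-1]                       # horizontal reflection
    h, w = obj.shape[:2]
    y = int(rng.integers(0, bg.shape[0] - h + 1))
    x = int(rng.integers(0, bg.shape[1] - w + 1))
    alpha = obj[..., 3:4] / 255.0                # silhouette transparency mask
    patch = bg[y:y + h, x:x + w, :3]
    bg[y:y + h, x:x + w, :3] = (alpha * obj[..., :3]
                                + (1 - alpha) * patch).astype(bg.dtype)
    return bg, (x, y, x + w, y + h)
```

Repeating this over eight silhouette classes and ten backgrounds yields labelled training images without any manual annotation, which is what makes an 8 000-image dataset practical to build.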
During operation, the RGB frame from the RealSense sensor is fed to the Mask R‑CNN to obtain bounding boxes and segmentation masks. The associated depth map supplies distance information; two strategies are evaluated—averaging depth values inside the bounding box versus inside the segmentation mask. The 3‑D coordinates of the object’s centroid are computed by projecting the masked points into the camera coordinate system, then transformed into the Unity world frame. To mitigate jitter caused by sensor noise, an exponential moving‑average (alpha filter) smooths the position updates before the virtual model is placed in the scene. This pipeline enables the operator to see a 360° view of the workspace through an HTC Vive Pro HMD, with head orientation controlling the camera perspective, thereby eliminating occlusions and limited viewpoints inherent to fixed‑camera setups.
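The mask-based depth averaging, centroid back-projection, and alpha-filter smoothing steps can be sketched as follows. This is a simplified single-frame version under pinhole-camera assumptions; the camera-to-Unity transform is omitted, and the filter coefficient is a tuning assumption.

```python
import numpy as np

def object_position(depth, mask, fx, fy, cx, cy):
    """Estimate the object centroid in the camera frame: average the depth
    values under the segmentation mask (the mask-based strategy, which
    excludes background pixels that a bounding-box average would include),
    then back-project the mask centroid through the pinhole model.
    Intrinsics fx, fy, cx, cy come from the RGB-D camera calibration."""
    ys, xs = np.nonzero(mask)
    z = depth[ys, xs]
    z = z[z > 0].mean()            # ignore invalid (zero) depth returns
    u, v = xs.mean(), ys.mean()    # pixel centroid of the mask
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

class AlphaFilter:
    """Exponential moving average that damps frame-to-frame jitter before
    the digital twin is repositioned in the virtual scene."""
    def __init__(self, alpha=0.2):
        self.alpha, self.state = alpha, None

    def update(self, x):
        x = np.asarray(x, dtype=float)
        if self.state is None:
            self.state = x
        else:
            self.state = self.alpha * x + (1 - self.alpha) * self.state
        return self.state
```

A smaller `alpha` gives smoother but laggier placement of the virtual model; the trade-off matters because the twin is refreshed continuously while the detector runs on live frames.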
A within‑subjects user study with eight participants (average age 24.2 years) compared the traditional camera‑based teleoperation (two static cameras: an isometric view and the on‑gripper view) against the GraspLook VR‑augmented mode. Participants performed a pick‑and‑place task involving a test tube marked with a target grasp point, repeating the task three times per condition. Objective metrics included task completion time, robot end‑effector trajectory length, and grasping error rate. Subjective workload was assessed using a NASA‑TLX‑derived Likert questionnaire covering mental demand, physical demand, temporal demand, performance, effort, and frustration, plus an additional question on perceived involvement.
Results showed that the VR‑augmented condition reduced average task time by approximately 18 %, shortened trajectory length by 12 %, and lowered grasping errors by 9 % relative to the camera‑only condition. Subjectively, participants reported an 11 % reduction in mental demand and a 16 % reduction in overall effort, indicating a lower cognitive load and higher perceived ease of use. The authors attribute these gains to the accurate, real‑time placement of digital twins, which provides consistent spatial cues regardless of camera placement, and to the intuitive haptic‑VR control loop that aligns operator motion with robot motion through scalable mapping.
The paper’s contributions are fourfold: (1) integration of real‑time Mask R‑CNN object detection with a VR digital twin to create an augmented teleoperation visual channel; (2) a scalable haptic‑to‑robot control scheme with optional axis locking for precision tasks; (3) a synthetic data generation pipeline that enables rapid training of high‑performance instance segmentation models for laboratory objects; and (4) empirical evidence that the augmented VR interface reduces operator workload and improves task efficiency. Limitations include the assumption that objects are vertically oriented, reliance on a single‑object detection scenario, and the need for more robust handling of occlusions and dynamic scene changes. Future work may explore multi‑object tracking, online model adaptation, and integration of force feedback to further enhance tele‑manipulation in hazardous or remote environments. Overall, GraspLook demonstrates a promising direction for deploying VR‑enhanced, AI‑driven teleoperation in settings where visual clarity, precision, and operator comfort are critical.