A High-Fidelity Robotic Manipulator Teleoperation Framework for Human-Centered Augmented Reality Evaluation
Validating Augmented Reality (AR) tracking and interaction models requires precise, repeatable ground-truth motion, yet human users cannot reliably reproduce consistent motion due to biomechanical variability. Robotic manipulators are promising proxies for human motion if they can faithfully mimic human movements. In this work, we design and implement ARBot, a real-time teleoperation platform that captures natural human motion and accurately replays it on a robotic manipulator. ARBot includes two capture modes: stable wrist motion capture via a custom CV-and-IMU pipeline, and natural 6-DOF control via a mobile application. We design a proactively safe QP controller that ensures smooth, jitter-free execution of the robotic manipulator, enabling it to function as a high-fidelity record-and-replay physical proxy. We open-source ARBot and release a benchmark dataset of 132 human and synthetic trajectories captured using ARBot to support controllable and scalable AR evaluation.
💡 Research Summary
The paper introduces ARBot, a real‑time teleoperation framework that captures natural human hand motions and faithfully reproduces them on a robotic manipulator, thereby providing precise, repeatable physical ground truth for evaluating augmented reality (AR) tracking and interaction pipelines. The authors identify a critical gap in current AR evaluation methodologies: human testers introduce biomechanical variability, fatigue, and tremor, making it difficult to isolate algorithmic performance from user‑induced noise. Existing software‑only testbeds such as ILLIXR, ExpAR, and ARCADE lack a deterministic physical input, while prior robotic telepresence systems focus on embodiment rather than measurement fidelity.
ARBot addresses this by integrating two complementary motion‑capture interfaces. The first, ARPose, is an Android application that leverages ARCore’s visual‑inertial odometry (VIO) to turn a smartphone into a 6‑DOF pose sensor. Users hold the phone and move it naturally; the device streams timestamped position and orientation data to the robot controller. The second, CV + IMU, fuses a depth camera (Intel RealSense) with a wearable inertial measurement unit running at 200 Hz. The depth stream supplies hand position, while the high‑rate IMU preserves rotational fidelity during rapid motions, preventing the drop‑outs typical of vision‑only systems. This interface also supports an autopilot mode that drives the robot through predefined geometric trajectories (circles, squares) to generate synthetic benchmark data.
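The autopilot mode described above amounts to sampling waypoints along predefined shapes and feeding them to the robot controller. A minimal sketch of such trajectory generation is shown below; the function names, frame (Z‑up, meters), center, and sampling density are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def circle_waypoints(radius=0.10, center=(0.4, 0.0, 0.3), n=200):
    """Sample n end-effector positions along a circle in a horizontal
    plane (hypothetical Z-up frame, units in meters)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    x = center[0] + radius * np.cos(t)
    y = center[1] + radius * np.sin(t)
    z = np.full(n, center[2])
    return np.stack([x, y, z], axis=1)  # shape (n, 3)

def square_waypoints(side=0.15, center=(0.4, 0.0, 0.3), per_edge=50):
    """Sample positions along a square, traced edge by edge."""
    h = side / 2.0
    corners = np.array([[-h, -h], [h, -h], [h, h], [-h, h], [-h, -h]])
    pts = []
    for a, b in zip(corners[:-1], corners[1:]):
        s = np.linspace(0.0, 1.0, per_edge, endpoint=False)[:, None]
        pts.append(a + s * (b - a))  # linear interpolation along one edge
    xy = np.concatenate(pts) + np.array(center[:2])
    z = np.full(len(xy), center[2])
    return np.column_stack([xy, z])  # shape (4 * per_edge, 3)
```

Streaming these waypoints at the control rate yields the deterministic synthetic benchmark trajectories; because the generator is noise-free, any deviation measured at the end effector is attributable to the controller rather than the input.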
To translate noisy human intent into safe, smooth robot motion, the authors design a three‑stage control pipeline. First, an inverse‑kinematics (IK) solver based on Newton‑Raphson with Damped Least‑Squares (DLS) computes a target joint configuration (q_{target}) for the desired end‑effector pose. DLS regularizes the Jacobian pseudo‑inverse, mitigating singularities and limiting excessive joint velocities. Second, a proactive safety filter formulates a quadratic‑programming (QP) problem (solved with OSQP) that minimizes the deviation between the desired joint velocity (\dot q_{need}) and the actual command (\dot q) while enforcing hard bounds on joint velocity and acceleration. This guarantees that even abrupt user movements cannot violate hardware limits. Third, the safe velocity output is integrated over the control timestep to produce a position command (q_{cmd}=q_{current}+ \dot q \Delta t). The authors deliberately choose a position‑centric architecture because it naturally handles packet loss: if a command packet is dropped, the robot simply stops at the last known safe pose, avoiding the “zero‑order hold” drift seen in velocity‑controlled teleoperation.
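The three stages above can be sketched compactly. One caveat on the middle stage: the paper solves the safety QP with OSQP, but when the only constraints are box bounds on velocity and acceleration, the minimizer of \(\lVert \dot q - \dot q_{need} \rVert^2\) has a closed form, namely projection (clipping) onto the feasible box; the sketch below uses that simplification. Damping value, limits, and timestep are illustrative assumptions.

```python
import numpy as np

def dls_ik_step(J, pose_err, damping=0.05):
    """One Newton-Raphson step with Damped Least-Squares:
    dq = J^T (J J^T + lambda^2 I)^-1 e.
    The damping term keeps the update bounded near singularities."""
    JJt = J @ J.T
    return J.T @ np.linalg.solve(JJt + damping**2 * np.eye(JJt.shape[0]), pose_err)

def safety_filter(qd_need, qd_prev, v_max, a_max, dt):
    """Box-constrained QP: min ||qd - qd_need||^2 subject to
    |qd| <= v_max and |qd - qd_prev| <= a_max * dt.
    With box bounds only, the optimum is a projection (clip)."""
    lo = np.maximum(-v_max, qd_prev - a_max * dt)
    hi = np.minimum(v_max, qd_prev + a_max * dt)
    return np.clip(qd_need, lo, hi)

def position_command(q_current, qd_safe, dt):
    """Stage three: integrate the safe velocity into a position
    command, q_cmd = q_current + qd * dt."""
    return q_current + qd_safe * dt
```

The position-centric final stage is what makes dropped packets benign: a missing command leaves the robot holding the last integrated pose instead of extrapolating a stale velocity.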
Implementation details include a ROS 2 Humble distributed stack, a coordinate‑frame homogenization layer that converts ARCore’s Y‑up right‑handed system to the robot’s Z‑up convention via a fixed quaternion, and a low‑latency network stack. High‑frequency IMU data are transmitted using a compact 22‑byte binary protocol over serial, while ARPose data use a persistent WebSocket connection with a “drop‑oldest” buffer to always process the freshest packet. The overall loop runs at 200 Hz, achieving an end‑to‑end latency of 19.5 ms for the ARPose interface and 38.5 ms for CV + IMU, with median tracking error of 5 mm for both modalities (95th‑percentile error 26.8 mm for ARPose, 39.7 mm for CV + IMU).
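Two of these implementation details are easy to illustrate. The frame homogenization is a single fixed rotation; the sketch below assumes a +90° rotation about X (equivalent to the fixed quaternion (w, x, y, z) = (√2/2, √2/2, 0, 0)), which sends ARCore's +Y (up) to the robot's +Z, though the actual axis pairing depends on the paper's mounting convention. The "drop-oldest" buffer is a one-slot queue in which a new packet evicts any unread one.

```python
from collections import deque
import numpy as np

# Fixed rotation taking ARCore's Y-up right-handed frame to a Z-up
# robot frame: +90 degrees about X maps +Y -> +Z and +Z -> -Y.
# (Illustrative assumption; the paper specifies this as a fixed quaternion.)
R_ARCORE_TO_ROBOT = np.array([
    [1.0, 0.0,  0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0,  0.0],
])

def arcore_to_robot(p_arcore):
    """Re-express an ARCore position in the robot's Z-up frame."""
    return R_ARCORE_TO_ROBOT @ np.asarray(p_arcore, dtype=float)

# "Drop-oldest" buffer: maxlen=1 means appending a fresh pose packet
# silently discards the stale one, so the consumer always reads the newest.
pose_buffer = deque(maxlen=1)
```

The same rotation applied to incoming orientations (by quaternion composition rather than matrix multiply) completes the homogenization, after which all downstream stages operate purely in the robot's Z-up convention.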
The system was validated through an IRB‑approved user study involving 11 participants. Results show that ARBot can mimic human motions with a median absolute trajectory error of 5 mm and a latency of 19.5 ms, while inter‑trial variability among humans is roughly ten times larger than the robot’s repeatability. Preference data indicate that 7 participants favored the handheld ARPose method, whereas 4 preferred the hands‑free CV + IMU approach, highlighting the complementary nature of the two capture modes.
Beyond the experimental validation, the authors release a benchmark dataset comprising 132 recorded 6‑DOF trajectories (including circles, squares, spirals, and other shape‑tracing motions) captured with both interfaces, as well as the full ARBot software stack under an open‑source license. This contribution enables the AR research community to conduct reproducible, human‑centered evaluations, generate synthetic ground‑truth motion for algorithmic benchmarking, and employ the robot as a deterministic physical oracle for visual‑analytics tools such as ARGUS. By treating the manipulator as a scientific instrument rather than a telepresence device, ARBot establishes a new standard for rigorous, repeatable AR system assessment.