📝 Original Info
- Title: AgentBay: A Hybrid Interaction Sandbox for Seamless Human-AI Intervention in Agentic Systems
- ArXiv ID: 2512.04367
- Date: 2025-12-04
- Authors: Yun Piao, Hongbo Min, Hang Su, et al. (Alibaba Cloud Computing)
📝 Abstract
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux, Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a hybrid control interface: an AI agent can interact programmatically via mainstream interfaces (MCP, Open Source SDK), while a human operator can, at any moment, seamlessly take over full manual control. This seamless intervention is enabled by the Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP, ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user experience that remains resilient even in weak network environments. It achieves this by dynamically blending command-based and video-based streaming, adapting its encoding strategy based on network conditions and the current controller (AI or human). Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark of complex tasks, the AgentBay (Agent + Human) model improved the success rate by more than 48%. Furthermore, ASP reduces bandwidth consumption by up to 50% compared to standard RDP and end-to-end latency by around 5%, especially under poor network conditions. We posit that AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised autonomous systems.
📄 Full Content
AgentBay: A Hybrid Interaction Sandbox for Seamless
Human-AI Intervention in Agentic Systems
Yun Piao, Hongbo Min, Hang Su, Leilei Zhang, Lei Wang, Yue Yin, Xiao Wu, Zhejing Xu,
Liwei Qu, Hang Li, Xinxin Zeng, Wei Tian, Fei Yu, Xiaowei Li, Jiayi Jiang, Tongxu Liu,
Hao Tian, Yufei Que, Xiaobing Tu, Bing Suo, Yuebing Li, Xiangting Chen, Zeen Zhao, Jiaming Tang,
Wei Huang, Xuguang Li, Jing Zhao, Jin Li, Jie Shen, Jinkui Ren, Xiantao Zhang
Alibaba Cloud Computing
ABSTRACT
The rapid advancement of Large Language Models (LLMs) is catalyzing a shift towards autonomous AI
Agents capable of executing complex, multi-step tasks. However, these agents remain brittle when faced
with real-world exceptions, making Human-in-the-Loop (HITL) supervision essential for mission-critical
applications. In this paper, we present AgentBay, a novel sandbox service designed from the ground up for
hybrid interaction. AgentBay provides secure, isolated execution environments spanning Windows, Linux,
Android, Web Browsers, and Code interpreters. Its core contribution is a unified session accessible via a
hybrid control interface: An AI agent can interact programmatically via mainstream interfaces (MCP, Open
Source SDK), while a human operator can, at any moment, seamlessly take over full manual control.
This seamless intervention is enabled by the Adaptive Streaming Protocol (ASP). Unlike traditional VNC/RDP,
ASP is specifically engineered for this hybrid use case, delivering an ultra-low-latency, smoother user
experience that remains resilient even in weak network environments. It achieves this by dynamically
blending command-based and video-based streaming, adapting its encoding strategy based on network
conditions and the current controller (AI or human).
Our evaluation demonstrates strong results in security, performance, and task completion rates. In a benchmark
of complex tasks, the AgentBay (Agent + Human) model improved the success rate by more than 48%.
Furthermore, ASP reduces bandwidth consumption by up to 50% compared to standard RDP
and end-to-end latency by around 5%, especially under poor network conditions. We posit that
AgentBay provides a foundational primitive for building the next generation of reliable, human-supervised
autonomous systems.
1 Introduction
The proliferation of Large Language Models (LLMs) has given rise to a new paradigm of autonomous agents [1].
Systems like Auto-GPT [2] and frameworks like LangChain [3] empower agents to reason, plan, and execute tasks
across digital environments. However, their deployment is hindered by two key challenges: brittleness and the need for
secure human-in-the-loop (HITL) collaboration. Brittleness occurs when agents fail on unforeseen exceptions such as modal
pop-ups or CAPTCHAs [21]. The collaboration need arises when agents must securely handle private data. For example,
an agent might autonomously navigate a website, but upon reaching the login page, it must pause and request human
intervention to securely enter credentials (username and password), as the agent itself is not provisioned with this
sensitive information. In both scenarios—unexpected failure or planned intervention—a seamless, low-friction handoff
to a human operator is critical.
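
The login handoff described above can be pictured as a small state machine over a single shared session. The following is a minimal sketch under stated assumptions, not AgentBay's SDK: the names `HybridSession`, `Controller`, `act`, `hand_off`, and `run_login_flow` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Controller(Enum):
    AGENT = "agent"
    HUMAN = "human"


@dataclass
class HybridSession:
    """Toy model of one sandbox session shared by an agent and a human (hypothetical)."""
    controller: Controller = Controller.AGENT
    log: list = field(default_factory=list)

    def act(self, who: Controller, action: str) -> bool:
        """Apply an action only if `who` currently holds control."""
        if who is not self.controller:
            self.log.append(f"rejected {who.value}: {action}")
            return False
        self.log.append(f"{who.value}: {action}")
        return True

    def hand_off(self, to: Controller, reason: str) -> None:
        """Transfer control and record why the handoff happened."""
        self.log.append(f"handoff {self.controller.value} -> {to.value} ({reason})")
        self.controller = to


def run_login_flow(session: HybridSession) -> None:
    """Agent navigates, pauses at login, human enters credentials, agent resumes."""
    session.act(Controller.AGENT, "navigate to site")
    # The agent is not provisioned with credentials, so it yields control here.
    session.hand_off(Controller.HUMAN, "login page reached")
    session.act(Controller.HUMAN, "enter credentials")
    session.hand_off(Controller.AGENT, "login complete")
    session.act(Controller.AGENT, "continue task")
```

In this sketch the credentials never pass through agent-side code: the agent only learns that the handoff finished. Rejecting actions from whoever does not hold control is one simple arbitration choice; the paper does not state how AgentBay arbitrates between controllers.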
This necessity for supervision highlights the critical importance of sandboxed environments that also support fluid human
intervention. Leading agent sandboxes (e.g., E2B [7], Daytona [14]) recognize this need, and they typically address
it by integrating general-purpose remote interaction protocols like VNC [8] or RDP [9] as their human intervention
mechanism. While this approach provides a vital function, these protocols were not specifically designed for the rapid,
low-friction handoff required in hybrid AI systems. They can introduce noticeable latency and may be less resilient
under variable network conditions, which can diminish the fluidity of the human-agent collaboration.
To address this specific challenge, we present a hybrid interaction sandbox infrastructure designed from the ground
up to optimize this interaction (based on the production service AgentBay). AgentBay provides a single, isolated
execution sandbox that can be controlled simultaneously via a programmatic API and open-source SDK for AI agents
and via a high-performance graphical streaming interface for humans.
The core of our system is the Adaptive Streaming Protocol (ASP). When a human takes over, it instantly switches to an
ultra-low-latency, smooth, and resilient graphical stream explicitly optimized for interaction fluency, even on variable
networks.
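
To make the idea of adapting the stream to the current controller and to network conditions concrete, here is a toy selection policy. It is a sketch under loud assumptions and not ASP's actual algorithm: `Mode`, `NetworkSample`, `choose_stream`, the `motion` input, and every threshold are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    COMMAND = "command"  # send drawing/UI update commands
    VIDEO = "video"      # send encoded video frames


@dataclass(frozen=True)
class NetworkSample:
    bandwidth_kbps: float
    rtt_ms: float


def choose_stream(controller: str, net: NetworkSample, motion: float) -> tuple[Mode, int]:
    """Return (mode, target_fps) for the next interval.

    `motion` is the fraction of the screen that changed, in [0, 1].
    All thresholds below are hypothetical, not taken from the paper.
    """
    if controller == "agent":
        # Assumption: an agent does not need a smooth picture, so a
        # low-rate command stream is enough while it is in control.
        return Mode.COMMAND, 5
    # Human in control: prefer video for busy screens on a good link.
    if motion > 0.3 and net.bandwidth_kbps >= 2000:
        return Mode.VIDEO, (60 if net.rtt_ms < 50 else 30)
    # Weak link or mostly static screen: fall back to command updates.
    return Mode.COMMAND, (30 if net.bandwidth_kbps >= 500 else 15)
```

A real protocol would likely blend both modes per screen region and smooth its decisions over time. This sketch makes one hard choice per interval, which keeps the two inputs named in the text (controller and network conditions) easy to see.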
We make the following contributions:
• Hybrid Interaction Architecture: We present the design of a hybrid interaction sandbox system that supports diverse
operating-system, mobile, browser, and code environments, each a secure, isolated execution environment.
• Adaptive Streaming Protocol (ASP): We detail our novel streaming protocol that enables human control, specifically
…(Full text truncated)…