Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In this paper, we report our experience with "TuringHotel", a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where human and artificial agents engage in time-bounded discussions and, interestingly, act as both judges and respondents. This community is instantiated on the novel platform UNaIVERSE (https://unaiverse.io) as a "World" that defines the roles and interaction dynamics, facilitated by the platform's built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible from both mobile devices and laptops, which was a key component of the experience reported in this paper. Results of our experimentation, involving 17 human participants and 19 LLMs, revealed that current models are still sometimes mistaken for humans. Interestingly, several unexpected mistakes suggest that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of the artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest in supporting ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.


💡 Research Summary

The paper introduces “TuringHotel,” a novel, symmetric, and distributed implementation of the classic Turing Test that moves the interaction from a one‑on‑one setting to a multi‑agent community. The authors build the experiment on UNaIVERSE, a peer‑to‑peer (P2P) platform that creates “Worlds” – isolated overlay networks where both human participants and artificial agents (large language models, LLMs) can join, interact, and be governed by a finite‑state automaton (FSA) that defines permissible actions and roles. Communication occurs over an authenticated P2P network; a root server only issues short‑lived tokens, while all message exchange is direct between agents, guaranteeing privacy and preventing third‑party eavesdropping.
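The paper does not publish UNaIVERSE's programming interface, but the idea of a World governed by a finite-state automaton that constrains which actions each phase permits can be sketched as follows. This is a minimal illustration with hypothetical names (`Phase`, `World`, the action strings), not the platform's actual API:

```python
from enum import Enum, auto

class Phase(Enum):
    LOBBY = auto()      # agents join the room
    CHAT = auto()       # free, time-bounded group conversation
    JUDGMENT = auto()   # every agent labels the other interlocutors
    CLOSED = auto()

# Legal transitions of the world's finite-state automaton.
TRANSITIONS = {
    Phase.LOBBY: {Phase.CHAT},
    Phase.CHAT: {Phase.JUDGMENT},
    Phase.JUDGMENT: {Phase.CLOSED},
    Phase.CLOSED: set(),
}

# Actions permitted in each phase, enforced identically for humans and LLMs.
PERMITTED_ACTIONS = {
    Phase.LOBBY: {"join"},
    Phase.CHAT: {"send_message"},
    Phase.JUDGMENT: {"submit_verdict"},
    Phase.CLOSED: set(),
}

class World:
    def __init__(self):
        self.phase = Phase.LOBBY

    def advance(self, next_phase):
        """Move the world forward; illegal jumps are rejected."""
        if next_phase not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {next_phase}")
        self.phase = next_phase

    def check(self, action):
        """Return True if the given action is allowed in the current phase."""
        return action in PERMITTED_ACTIONS[self.phase]
```

The point of the automaton is that role enforcement is structural rather than trusted: an agent in the chat phase simply has no legal way to cast a verdict early, and vice versa.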

In a TuringHotel “room,” four agents (a mixture of humans and LLMs) converse freely for three minutes. After the time expires, a manager agent asks every participant – both humans and AIs – to identify which of the interlocutors were human. This creates a dual role: every participant is simultaneously a judge and a subject. The experiment involved 17 human volunteers and 19 LLM instances (different model families and prompting configurations). The key findings are:

  1. Partial indistinguishability – LLMs were mistaken for humans in roughly 30% of the judgments, indicating that current models can produce language that is sometimes indistinguishable from human output in a realistic, multi‑turn, multi‑party dialogue.
  2. Human fingerprints persist – Humans correctly identified bots about 70% of the time. Errors made by humans (mis‑labeling a bot as human or vice‑versa) reveal cognitive biases and the difficulty of the task when participants must keep track of several interlocutors simultaneously.
  3. Dynamic detection – Many LLMs performed well in the early phases of the conversation but revealed inconsistencies, factual hallucinations, or overly formal style later on, which human judges used as cues.
  4. Symmetric evaluation – Unlike traditional Turing setups where a single human judge interrogates a machine, TuringHotel’s symmetric design lets every participant evaluate others, providing richer data on how humans adapt their detection strategies over repeated exposure.
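The symmetric judgment round described above — every participant labels every other interlocutor, and accuracies like the ~30%/~70% figures are aggregated over those verdicts — can be sketched with a small scoring helper. The agent identifiers and the toy verdicts below are illustrative, not data from the paper:

```python
def judgment_accuracy(truth, verdicts):
    """truth: agent id -> True if the agent is human.
    verdicts: judge id -> {subject id: True if judged human}.
    Judges never label themselves. Returns judge id -> fraction correct."""
    scores = {}
    for judge, labels in verdicts.items():
        correct = sum(1 for subject, guess in labels.items()
                      if guess == truth[subject])
        scores[judge] = correct / len(labels)
    return scores

# Toy four-agent room: two humans (h1, h2) and two LLMs (b1, b2).
truth = {"h1": True, "h2": True, "b1": False, "b2": False}
verdicts = {
    "h1": {"h2": True, "b1": False, "b2": True},   # mistakes b2 for a human
    "b1": {"h1": True, "h2": False, "b2": False},  # mistakes h2 for a bot
}
scores = judgment_accuracy(truth, verdicts)
```

Because both humans and LLMs appear as keys of `verdicts`, the same scoring applies to both roles, which is exactly what makes the design symmetric.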

The authors compare TuringHotel with existing online Turing‑style platforms such as turingtest.live, Human or Not, and with multi‑agent benchmarks like LM Arena and AvalonBench. Those systems either enforce a strict ping‑pong turn structure, limit interaction length, or focus on AI‑vs‑AI competition. TuringHotel distinguishes itself by (i) supporting genuine group conversation dynamics (interruptions, overlapping topics, informal turn‑taking), (ii) being built on a decentralized P2P infrastructure that gives participants control over data ownership and privacy, and (iii) enabling longitudinal participation – users can re‑enter the world repeatedly, allowing continuous monitoring of model evolution and human adaptation.

The paper also discusses the broader implications of a decentralized experimental paradigm. By exposing the experimental design, data collection procedures, and aggregated results as open and auditable, the authors argue that such a framework can serve public‑interest monitoring of AI capabilities, support national digital‑sovereignty goals, and provide a reusable infrastructure for other social‑AI research (e.g., AI‑as‑a‑service testing, social deduction games, collective decision‑making studies).

Limitations acknowledged include the modest scale of the study, the sensitivity of results to prompt engineering (different system prompts dramatically affect LLM behavior), and the need for systematic exploration of parameters such as room size, conversation duration, and topic diversity. Future work is proposed to expand participant demographics across languages and cultures, to standardize prompting strategies for each model, and to develop statistical models that quantify how specific conversational cues influence human detection accuracy. Moreover, the authors envision a continuous monitoring pipeline where updated LLM releases are automatically deployed into the UNaIVERSE world, and longitudinal data are analyzed to track the trajectory of “human‑likeness” over time.

In summary, TuringHotel demonstrates that a decentralized, multi‑agent, symmetric Turing Test is feasible and yields nuanced insights that single‑judge, dyadic tests miss. It opens a path toward open, auditable, and scalable evaluation of large language models in realistic social settings, while preserving participant privacy through peer‑to‑peer networking.

