Assessing Task-based Chatbots: Snapshot and Curated Datasets for Dialogflow
In recent years, chatbots have gained widespread adoption thanks to their ability to assist users at any time and across diverse domains. However, the lack of large-scale curated datasets limits research on their quality and reliability. This paper presents TOFU-D, a snapshot of 1,788 Dialogflow chatbots from GitHub, and COD, a curated subset of TOFU-D comprising 185 validated chatbots. The two datasets capture a wide range of domains, languages, and implementation patterns, offering a sound basis for empirical studies on chatbot quality and security. A preliminary assessment using the Botium testing framework and the Bandit static analyzer revealed gaps in test coverage and frequent security vulnerabilities in several chatbots, highlighting the need for systematic, multi-platform research on chatbot quality and security.
💡 Research Summary
The paper addresses a critical gap in chatbot research: the scarcity of large, curated datasets for task‑oriented conversational agents, especially on commercial platforms. The authors introduce two datasets focused on Google Dialogflow, a widely used commercial framework. The first, TOFU‑D (Dialogflow Task‑Based Chatbots From GitHub), is a snapshot of 1,788 unique Dialogflow chatbots collected from GitHub as of September 16, 2025. The second, COD (Collection of Dialogflow chatbots), is a curated subset of 185 chatbots selected from TOFU‑D based on criteria that ensure functional relevance and research utility: each must handle at least one intent and one entity, implement a webhook service, support English, run on Dialogflow v2, and originate from a starred repository.
To build TOFU‑D, the authors queried the GitHub API for repositories containing the keyword “Dialogflow” together with “chatbot” or “agent”. This yielded 12,883 repositories; after discarding empty ones and those lacking the mandatory Dialogflow agent configuration file, 2,650 agent files remained across 1,710 repositories. Further filtering removed CX‑only bots, malformed agents, and duplicates. Duplicates were identified by comparing intent, entity, and webhook intent lists, supported languages, and backend code similarity (using difflib with a 95% threshold). The most recent, most popular version of each duplicate group was retained, resulting in the final 1,788 unique bots.
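The paper's duplicate check can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' pipeline: the dictionary schema (`intents`, `entities`, `webhook_intents`, `languages`, `backend_code`) is hypothetical, but the similarity test mirrors the stated use of `difflib` with a 95% cutoff.

```python
import difflib

SIMILARITY_THRESHOLD = 0.95  # the paper's 95% backend-code similarity cutoff


def backend_similar(code_a: str, code_b: str) -> bool:
    """True when two backend sources meet or exceed the similarity threshold."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio() >= SIMILARITY_THRESHOLD


def is_duplicate(bot_a: dict, bot_b: dict) -> bool:
    """Two bots count as duplicates when their agent structure matches
    (intents, entities, webhook intents, languages) and their backend
    code is near-identical. The dict schema here is hypothetical."""
    structural_keys = ("intents", "entities", "webhook_intents", "languages")
    if any(set(bot_a[k]) != set(bot_b[k]) for k in structural_keys):
        return False
    return backend_similar(bot_a["backend_code"], bot_b["backend_code"])
```

Within each group of duplicates found this way, the paper keeps only the most recent, most popular version.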
COD was derived by applying three relevance filters: (1) minimal dialog complexity (≥1 intent and ≥1 entity), (2) functional complexity (presence of a webhook), and (3) utility (English language, Dialogflow v2, and starred repository). This reduced the pool to 191 candidates; six were eliminated after deployment validation via the Dialogflow REST API, leaving 185 validated bots. For topic labeling, the authors prompted GPT‑4o with each bot’s repository title, description, intents, entities, and README, asking it to assign a Google Play category; manual spot‑checks confirmed labeling accuracy.
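The three relevance filters translate directly into a predicate over per-bot metadata. The field names below are hypothetical (the paper does not publish its internal schema), but the logic follows the stated criteria exactly.

```python
def passes_cod_filters(bot: dict) -> bool:
    """Apply the paper's three COD relevance filters to a bot's metadata.
    Field names are illustrative, not taken from the paper's pipeline."""
    # (1) minimal dialog complexity: at least one intent and one entity
    has_dialog = len(bot["intents"]) >= 1 and len(bot["entities"]) >= 1
    # (2) functional complexity: a webhook service is implemented
    has_webhook = bot["uses_webhook"]
    # (3) utility: English support, Dialogflow v2, starred repository
    is_useful = (
        "en" in bot["languages"]
        and bot["dialogflow_version"] == "v2"
        and bot["repo_stars"] >= 1
    )
    return has_dialog and has_webhook and is_useful
```

The 191 bots passing these filters were then deployed through the Dialogflow REST API, which eliminated six more and left the 185 validated bots.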
Statistical analysis shows that TOFU‑D exhibits a wide spread in dialog complexity (1–316 intents, 0–140 entities) and includes about 15% trivial bots that merely return static answers. COD, by design, excludes such low‑complexity bots and displays higher functional richness: 62% of COD bots use webhook intents for more than half of their intents, and integration with Google Assistant rises from 32% in TOFU‑D to 51% in COD, while cloud‑function usage climbs from 16% to 29%. Programming language distribution highlights JavaScript as the dominant backend language (native support in Dialogflow), followed by Python, Java, and TypeScript. Language support is heavily skewed toward English (87% of TOFU‑D bots), with 159 bots supporting multiple languages (up to 12).
The authors conduct an early validation experiment on COD. Ten bots are randomly selected and test cases are generated with Botium, a multi‑platform chatbot testing framework. Findings reveal systematic gaps: 100% lack fallback‑behavior tests, 30% miss greeting tests, another 30% omit pre‑condition checks, 30% have no entity tests, and 70% only partially cover entities. Additionally, 10% of generated tests are broken (missing responses). These results echo similar challenges observed in prior Rasa‑based studies, underscoring the cross‑platform difficulty of automatic test generation.
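For readers unfamiliar with Botium, it expresses test cases as BotiumScript conversation files that alternate user turns (`#me`) and expected bot turns (`#bot`). A fallback-behavior test, the category missing from all ten sampled bots, could look like the sketch below; the utterance and expected response are invented for illustration and do not come from the dataset.

```
TC_FALLBACK - out-of-scope input triggers the fallback intent

#me
flibbertigibbet quux

#bot
Sorry, I didn't get that. Can you rephrase?
```

A bot without a meaningful fallback intent fails such a test by replying with an unrelated canned answer or no answer at all.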
For security assessment, the authors run Bandit, a static analysis tool for Python, on the 69 COD bots that include Python backends, and compare results with 193 BRASA‑TO (Rasa) bots. Six vulnerability categories dominate both datasets: missing timeouts on external API calls, weak pseudo‑random number generators, improper exception handling (e.g., empty catch blocks), potential SQL injection due to unsanitized inputs, exposure of sensitive information, and excessive permissions. Dialogflow bots exhibit additional platform‑specific issues such as misconfigured webhook endpoints. The higher incidence of vulnerabilities in COD is attributed to its greater functional complexity.
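Several of the dominant categories correspond to well-known Bandit checks, e.g. B311 (weak pseudo-random generators), B110 (try/except/pass), and B113 (requests without a timeout). The sketch below shows secure counterparts to these patterns; it is an illustration of the vulnerability classes, not code from the analyzed bots.

```python
import secrets


# B311: random.random() is a weak PRNG for anything security-sensitive;
# the secrets module draws from the OS CSPRNG instead.
def make_session_token() -> str:
    return secrets.token_hex(16)  # 32 hex characters


# B110: an empty except block silently swallows failures; narrow the
# exception type and return an explicit fallback instead of `pass`.
def safe_parse_int(value: str, default: int = 0) -> int:
    try:
        return int(value)
    except ValueError:
        return default


# B113: external API calls without a timeout can hang a webhook forever.
# Illustrative only; `requests` is a third-party dependency common in
# bot backends:
#     resp = requests.get(api_url, timeout=5)
```

Running `bandit -r <backend-dir>` over a bot's Python backend reports occurrences of these patterns by test ID and severity.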
Overall, the paper makes three major contributions: (1) the release of TOFU‑D, a comprehensive snapshot of real‑world Dialogflow bots, (2) the curated COD dataset that balances diversity, complexity, and usability for empirical research, and (3) a demonstration of how these resources can surface concrete quality and security challenges using existing tools. The datasets are publicly available and the extraction pipeline is fully automated, enabling replication and extension to other task‑oriented platforms. The authors argue that the availability of such large, heterogeneous corpora will foster more robust, scalable, and generalizable research on chatbot testing, robustness, and security, moving the field beyond the current reliance on a handful of small, often outdated examples.