A Multimodal Conversational Agent for Tabular Data Analysis

Reading time: 5 minutes

📝 Original Info

  • Title: A Multimodal Conversational Agent for Tabular Data Analysis
  • ArXiv ID: 2511.18405
  • Date: 2025-11-23
  • Authors: Mohammad Nour Al Awad, Sergey Ivanov, Olga Tikhonova, Ivan Khodnenko

📝 Abstract

Large language models (LLMs) can reshape information processing by handling data analysis, visualization, and interpretation in an interactive, context-aware dialogue with users, including voice interaction, while maintaining high performance. In this article, we present Talk2Data, a multimodal LLM-driven conversational agent for intuitive data exploration. The system lets users query datasets with voice or text instructions and receive answers as plots, tables, statistics, or spoken explanations. Built on LLMs, the proposed design combines the OpenAI Whisper automatic speech recognition (ASR) system, a Qwen-coder code-generation model, custom sandboxed execution tools, and the Coqui text-to-speech (TTS) library within an agentic orchestration loop. Unlike text-only analysis tools, it adapts responses across modalities and supports multi-turn dialogues grounded in dataset context. In an evaluation of 48 tasks on three datasets, our prototype achieved 95.8% accuracy with model-only generation time under 1.7 seconds (excluding ASR and execution time). A comparison across five LLM sizes (1.5B-32B) revealed accuracy-latency-cost trade-offs, with a 7B model providing the best balance for interactive use. By routing between user conversation and code execution confined to a transparent sandbox, while grounding prompts in schema-level context, the Talk2Data agent reliably retrieves actionable insights from tables and makes computations verifiable. Beyond the Talk2Data agent itself, we discuss implications for human-data interaction, trust in LLM-driven analytics, and future extensions toward large-scale multimodal assistants.

💡 Deep Analysis

Figure 1
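The abstract outlines an agentic orchestration loop: Whisper transcribes the user's speech, an LLM routes the query to either direct conversation or code generation, generated code runs in a sandbox, and Coqui TTS narrates the result. The sketch below is a minimal Python illustration of that loop, not the authors' implementation: `llm_generate` and `run_in_sandbox` are hypothetical placeholders standing in for the Qwen-coder model and the paper's custom sandboxed tools, and the model and file names are arbitrary.

```python
# Minimal sketch of a Talk2Data-style orchestration loop (illustrative only).
# Assumptions: `llm_generate` and `run_in_sandbox` are hypothetical placeholders
# for a Qwen-coder endpoint and the paper's sandboxed executor.

import whisper              # OpenAI Whisper ASR
from TTS.api import TTS     # Coqui TTS

asr = whisper.load_model("base")
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a Qwen-coder model (local or served)."""
    raise NotImplementedError

def run_in_sandbox(code: str, df) -> str:
    """Placeholder for the paper's sandboxed execution tool."""
    raise NotImplementedError

def handle_turn(audio_path: str, schema: str, history: list[str], df) -> str:
    # 1. Speech -> text
    question = asr.transcribe(audio_path)["text"]

    # 2. Route: does the query need code, or a direct conversational reply?
    route = llm_generate(
        f"Schema:\n{schema}\nHistory:\n{history}\nQuestion: {question}\n"
        "Answer with exactly one word: CODE or CHAT."
    ).strip().upper()

    if route == "CODE":
        code = llm_generate(
            f"Schema:\n{schema}\nWrite pandas code that answers: {question}"
        )
        answer = run_in_sandbox(code, df)   # plots/tables/stats come back here
    else:
        answer = llm_generate(f"History:\n{history}\nReply briefly to: {question}")

    # 3. Narrate the result and remember the turn
    tts.tts_to_file(text=answer, file_path="reply.wav")
    history.append(f"Q: {question}\nA: {answer}")
    return answer
```

Keeping the route decision and the generated code as explicit strings is what makes each computation inspectable, which is the transparency property the abstract emphasizes.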

📄 Full Content

A Multimodal Conversational Agent for Tabular Data Analysis

Mohammad Nour Al Awad (ITMO University, Saint Petersburg, Russia, MohammadNourAlAwad@itmo.ru), Sergey Ivanov (ITMO University, Saint Petersburg, Russia, svivanov@itmo.ru), Olga Tikhonova (ITMO University, Saint Petersburg, Russia, tikhonova ob@itmo.ru), Ivan Khodnenko (ITMO University, Saint Petersburg, Russia, ivan.khodnenko@itmo.ru)

© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses.

Index Terms—Conversational AI, Multimodal Interaction, Data Analysis, Tabular Data, Human-Data Interaction, Data Visualization, AI Agent

I. INTRODUCTION

Interacting with data often requires programming skills or statistical expertise, creating barriers for managers, analysts, and other non-technical users [1], [2]. Natural language interfaces (NLIs) aim to improve this information-seeking process by allowing users to query data conversationally [3], [4]. At the same time, voice interfaces are becoming increasingly common in daily life, yet existing voice assistants remain limited: they can answer factual questions or control devices, but they lack the analytical capabilities needed for meaningful data exploration.

LLMs now provide a powerful foundation for code generation and complex reasoning [5]–[7]. Systems such as OpenAI's Code Interpreter [8] demonstrate this potential, but they typically support only text-based input/output. Another problem is that features such as multimodal responses and transparent execution of generated code (capabilities that are increasingly important for reliable human–data interaction in the information-seeking process) are less present in current voice assistant designs.
We present Talk2Data, a multimodal conversational agent that enables users to effectively seek, retrieve, and analyze data stored in tabular datasets through either voice or text instructions, and to receive answers and insights as plots, tables, or spoken explanations. For example, users can ask “What’s the average delay for United flights?” or “Plot a histogram of age,” and the agent dynamically determines whether to generate Python code or provide a direct natural-language response. Code is executed in a secure sandbox, results are narrated through text-to-speech (TTS), and multi-turn dialogue is supported via conversational memory. This design lets the agent adapt its responses to user intent, offering both analytical depth and accessibility.

At the core of the agent is an orchestration module that reasons about each query’s intent and selects the appropriate response path. Dataset metadata and conversation history are injected into structured prompts, enabling grounded, context-aware behavior. By integrating these components, the design demonstrates how LLMs can support natural, multimodal workflows for data analysis. Our contributions are as follows:

  • Multimodal information-seeking agent. An end-to-end system that unifies voice/text input with visual, tabular, and spoken outputs for exploratory analysis, supporting multi-turn, back-and-forth dialogue for information seeking over tabular corpora.
  • Orchestration with transparent code execution. A router that adaptively selects narration vs. code generation within one dialog using grounded prompts.
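Two mechanisms described above carry most of the weight: schema-level grounding (injecting dataset metadata into the prompt) and constrained execution of the generated pandas code. The snippet below is a minimal sketch of both under stated assumptions: the function names and column names are hypothetical, the example query and generated code are illustrative, and the restricted-namespace `exec` shown here only gestures at the isolation that a production sandbox such as the authors' custom tool would need.

```python
# Illustrative sketch of (a) schema-level prompt grounding and (b) constrained
# execution of model-generated pandas code. Not the paper's implementation:
# helper names and column names are hypothetical.

import io
import contextlib
import pandas as pd

def schema_context(df: pd.DataFrame, n_sample: int = 3) -> str:
    """Summarise columns, dtypes, and a few sample rows for the prompt."""
    cols = ", ".join(f"{c} ({t})" for c, t in df.dtypes.astype(str).items())
    return (
        f"Columns: {cols}\n"
        f"Rows: {len(df)}\n"
        f"Sample:\n{df.head(n_sample).to_string(index=False)}"
    )

def execute_generated_code(code: str, df: pd.DataFrame) -> str:
    """Run generated code with a restricted namespace (df, pd, a few builtins)."""
    namespace = {"df": df.copy(), "pd": pd}
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__builtins__": {"print": print, "len": len}}, namespace)
    except Exception as err:                  # surface errors back to the agent
        return f"Execution error: {err}"
    return buffer.getvalue() or str(namespace.get("result", ""))

# Example: a query like "What's the average delay for United flights?" might
# yield code such as the snippet below (column names are hypothetical).
if __name__ == "__main__":
    flights = pd.DataFrame(
        {"carrier": ["UA", "UA", "DL"], "dep_delay": [12.0, 8.0, 3.0]}
    )
    generated = 'result = df[df["carrier"] == "UA"]["dep_delay"].mean()\nprint(result)'
    print(schema_context(flights))
    print(execute_generated_code(generated, flights))
```

With this split, the model never touches the data directly: it only sees the schema summary, and every answer can be traced back to a concrete, re-runnable snippet, in line with the paper's emphasis on verifiable computation.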

📸 Image Gallery

1.png 2.png 3.png use.png

Reference

This content is AI-processed based on open access ArXiv data.
