Sales Research Agent and Sales Research Bench
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Enterprises increasingly need AI systems that can answer sales-leader questions over live, customized CRM data, but most available models do not expose transparent, repeatable evidence of quality. This paper describes the Sales Research Agent in Microsoft Dynamics 365 Sales, an AI-first application that connects to live CRM and related data, reasons over complex schemas, and produces decision-ready insights through text and chart outputs. To make quality observable, we introduce the Sales Research Bench, a purpose-built benchmark that scores systems on eight customer-weighted dimensions, including text and chart groundedness, relevance, explainability, schema accuracy, and chart quality. In a 200-question run on a customized enterprise schema on October 19, 2025, the Sales Research Agent outperformed Claude Sonnet 4.5 by 13 points and ChatGPT-5 by 24.1 points on the 100-point composite score, giving customers a repeatable way to compare AI solutions.


💡 Research Summary

The paper introduces the Sales Research Agent, an AI‑first application embedded in Microsoft Dynamics 365 Sales, and the Sales Research Bench, a purpose‑built benchmark designed to evaluate the quality of AI‑driven sales research on live, customized CRM data. The authors argue that while enterprises increasingly demand AI systems capable of answering complex, business‑language questions over real‑time sales data, existing models lack transparent, repeatable evidence of quality. To address this gap, Microsoft developed a multi‑agent, multi‑model architecture that can dynamically select the most appropriate model for each sub‑task, parse business‑language prompts into sub‑questions, generate SQL or Python code, and iteratively correct and validate that code using a lightweight model for simple errors and a more powerful reasoning model for deeper issues. This “self‑correction and validation” loop, together with built‑in schema intelligence, enables the agent to handle highly customized enterprise schemas that may contain thousands of tables and hundreds of columns, automatically identifying relevant tables, columns, and joins.
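The self-correction and validation loop described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: the model names, the error classifier, and the `run`/`fix` callables are all hypothetical stand-ins for the agent's internal components.

```python
# Hedged sketch of the self-correction and validation loop: execute generated
# SQL/Python, route simple errors to a lightweight model and deeper issues to
# a reasoning model, and retry. All names below are illustrative assumptions.

LIGHTWEIGHT_MODEL = "small-fixer"   # assumed: cheap model for simple errors
REASONING_MODEL = "deep-reasoner"   # assumed: stronger model for deep issues

def classify_error(error_msg: str) -> str:
    """Route an execution error: surface-level issues go to the lightweight
    model; anything else (logic, schema misuse) to the reasoning model."""
    simple_markers = ("SyntaxError", "NameError", "column not found")
    return "simple" if any(m in error_msg for m in simple_markers) else "deep"

def self_correct(code: str, run, fix, max_rounds: int = 3):
    """Iteratively execute generated code; on failure, pick a model by error
    class and ask it for a corrected version.

    `run(code)` returns (ok, result_or_error); `fix(model, code, error)`
    returns repaired code. Both are caller-supplied stand-ins here.
    """
    for _ in range(max_rounds):
        ok, result_or_error = run(code)
        if ok:
            return code, result_or_error
        model = (LIGHTWEIGHT_MODEL
                 if classify_error(result_or_error) == "simple"
                 else REASONING_MODEL)
        code = fix(model, code, result_or_error)
    raise RuntimeError("could not repair generated code")
```

The key design point the paper highlights is the two-tier routing: cheap repairs stay on the small model, and only errors the classifier deems deep escalate to the more expensive reasoning model.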

The Sales Research Bench evaluates AI agents on 200 realistic sales‑leader questions using a customized enterprise dataset that mirrors the complexity of real‑world deployments (including custom tables, renamed fields, and nuanced business logic such as pipeline coverage calculations). The benchmark defines eight quality dimensions weighted according to direct customer input: Text Groundedness (25 %), Chart Groundedness (25 %), Text Relevance (13 %), Explainability (12 %), Schema Accuracy (10 %), Chart Relevance (5 %), Chart Fit (5 %), and Chart Clarity (5 %). Scores for each dimension are produced by LLM judges: Azure Foundry's out‑of‑the‑box evaluators handle Text Groundedness and Text Relevance, while the remaining six dimensions are judged by OpenAI's GPT‑4.1 using detailed scoring rubrics whose scores range from 20 to 100. A weighted average yields a composite score on a 0‑100 scale.
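The composite scoring can be reproduced directly from the weights above. The sketch below uses the customer-derived weights reported in the paper; the per-dimension scores in the example are illustrative, not results from the benchmark.

```python
# Composite score of the Sales Research Bench: a weighted average of eight
# per-dimension scores (each on a 0-100 scale), using the customer-derived
# weights reported in the paper. Example scores below are made up.

WEIGHTS = {
    "text_groundedness": 0.25,
    "chart_groundedness": 0.25,
    "text_relevance": 0.13,
    "explainability": 0.12,
    "schema_accuracy": 0.10,
    "chart_relevance": 0.05,
    "chart_fit": 0.05,
    "chart_clarity": 0.05,
}

def composite_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores, yielding a 0-100 composite."""
    assert set(dimension_scores) == set(WEIGHTS), "all eight dimensions required"
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())

# Illustrative example: a system scoring 80 on every dimension composites to 80.
example = {d: 80.0 for d in WEIGHTS}
print(round(composite_score(example), 1))
```

Because the weights sum to 1.0, a uniform per-dimension score passes through unchanged, which makes the weighting easy to sanity-check.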

For the comparative study, the Sales Research Agent, OpenAI’s ChatGPT‑5 (run in Auto mode with a Pro license), and Anthropic’s Claude Sonnet 4.5 (Max license) were each given identical access to the same dataset. The agent accessed the native Dataverse store directly, whereas ChatGPT and Claude accessed a mirrored Azure SQL copy via the MCP SQL connector, preserving primary/foreign keys and relationships. All three systems were instructed to produce both textual narratives and visualizations, and their outputs were evaluated using the same rubrics.

Results show the Sales Research Agent achieving a composite score of 78.2, outperforming Claude Sonnet 4.5 (65.2) by 13 points and ChatGPT‑5 (54.1) by 24.1 points. The agent led in every dimension, with the largest margins in chart‑related metrics (Chart Groundedness, Chart Fit, Chart Clarity, Chart Relevance), reflecting its dedicated orchestration layer for visualization generation and data mapping. Text Groundedness and Schema Accuracy showed smaller but still positive differentials, indicating room for further improvement in handling highly customized schemas. Claude Sonnet 4.5 outperformed ChatGPT‑5 across all dimensions, suggesting that differences in model capability and prompting matter even among general‑purpose systems.

The authors emphasize that the Sales Research Bench is intended as an ongoing evaluation framework. Microsoft plans to release the full evaluation package, allowing customers to independently verify results or benchmark alternative agents. Future work includes expanding the benchmark to cover additional business functions such as customer service, finance, and marketing, and continuously updating the dataset and question set to reflect evolving enterprise needs. By establishing a rigorous, transparent, and customer‑aligned evaluation methodology, the paper argues that enterprises can gain trustworthy evidence of AI quality, fostering greater adoption of AI‑driven decision support in sales and beyond.
