ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing


As power systems decarbonise and digitalise, high penetrations of distributed energy resources and flexible tariffs make electric power marketing (EPM) a key interface between regulation, system operation and sustainable-energy deployment. Many utilities still rely on human agents and rule- or intent-based chatbots with fragmented knowledge bases that struggle with long, cross-scenario dialogues and fall short of requirements for compliant, verifiable and DR-ready interactions. Meanwhile, frontier large language models (LLMs) show strong conversational ability but are evaluated on generic benchmarks that underweight sector-specific terminology, regulatory reasoning and multi-turn process stability. To address this gap, we present ElectriQ, a large-scale benchmark and evaluation framework for LLMs in EPM. ElectriQ contains over 550k dialogues across six service domains and 24 sub-scenarios and defines a unified protocol that combines human ratings, automatic metrics and two compliance stress tests: Statutory Citation Correctness and Long-Dialogue Consistency. Building on ElectriQ, we propose SEEK-RAG, a retrieval-augmented method that injects policy and domain knowledge during finetuning and inference. Experiments on 13 LLMs show that domain-aligned 7B models with SEEK-RAG match or surpass much larger models while reducing computational cost, providing an auditable, regulation-aware basis for deploying LLM-based EPM assistants that support demand-side management, renewable integration and resilient grid operation.


💡 Research Summary

The paper introduces ElectriQ, a comprehensive benchmark designed to evaluate large language models (LLMs) for electric power marketing (EPM), a critical interface linking regulation, system operation, and sustainable‑energy deployment. As power systems become more decentralized with high penetrations of distributed photovoltaics, storage, and electric‑vehicle charging, utilities increasingly rely on customer‑facing services such as tariff consultation, demand‑response enrollment, DER interconnection assistance, and outage communication. Existing human agents and rule‑based chatbots suffer from fragmented knowledge bases and poor performance in long, cross‑scenario dialogues, while frontier LLMs (e.g., GPT‑4, Claude‑3) are typically assessed on generic benchmarks that ignore sector‑specific terminology, regulatory reasoning, and multi‑turn stability.

ElectriQ fills this gap by assembling over 550 000 multi‑turn dialogues drawn from six core service domains and 24 sub‑scenarios. Data sources include real‑world 95598 hotline call logs, work‑order tickets, internal knowledge bases, and official regulatory and tariff documents. Each dialogue is annotated with user intent, regional context, and version information, enabling fine‑grained analysis of policy‑aligned responses.
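To make the annotation layers concrete, one dialogue record might look like the following hypothetical JSON sketch. All field names and values here are illustrative assumptions, not the benchmark's released schema:

```python
import json

# Hypothetical record layout for one ElectriQ dialogue.
# Field names and values are illustrative, NOT the benchmark's actual schema.
record = {
    "dialogue_id": "epm-000001",
    "domain": "tariff_consultation",       # one of the six service domains
    "sub_scenario": "time_of_use_switch",  # one of the 24 sub-scenarios
    "region": "example-province",          # regional-context annotation
    "policy_version": "2023-07",           # version information
    "turns": [
        {"role": "user", "intent": "ask_tariff_change",
         "text": "How do I switch to a time-of-use tariff?"},
        {"role": "assistant",
         "text": "You can apply via the 95598 hotline or the online portal."},
    ],
}

# Round-trip through JSON to confirm the record is serializable as-is.
restored = json.loads(json.dumps(record, ensure_ascii=False))
print(restored["sub_scenario"])
```

Keeping intent, region, and policy version as top-level fields is what enables the fine-grained, policy-aligned analysis the paper describes.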

The evaluation framework combines four subjective dimensions—Professionalism (regulatory and procedural correctness), Clarity & Organization (plain‑language explanations and logical structure), Actionability & Completeness (step‑by‑step guidance, required forms, deadlines, costs), and Empathy & Helpfulness (tone, escalation handling)—scored on a 1‑5 scale with detailed rubrics. Automatic metrics (BLEURT, CoSMSiC, BLEU, ROUGE) are reported alongside human scores to capture semantic consistency, coverage, and lexical overlap.
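As a rough illustration of the lexical-overlap family of metrics reported above, a unigram-recall score in the spirit of ROUGE-1 can be computed in a few lines. This is a simplified sketch with naive whitespace tokenization, not the benchmark's official evaluation script:

```python
from collections import Counter

def unigram_recall(reference: str, candidate: str) -> float:
    """ROUGE-1-style recall: fraction of reference unigrams covered by the
    candidate, with clipped counts and simple whitespace tokenization."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(c, cand_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

# Toy reference/candidate pair (invented for illustration).
ref = "submit the interconnection application form within 30 days"
cand = "you must submit the application form within 30 days"
print(unigram_recall(ref, cand))  # 7 of 8 reference unigrams covered
```

Such surface metrics are cheap but blind to paraphrase, which is why the framework pairs them with learned semantic metrics and human rubric scores.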

Crucially, the authors introduce two compliance‑focused stress tests: Statutory Citation Correctness (SCC) and Long‑Dialogue Consistency (LDC). SCC verifies that model outputs contain verifiable legal citations (document title, version, clause) and that numerical values match the cited clauses. LDC evaluates the model’s ability to maintain coherent state over 8‑12 turns, updating references and commitments appropriately when policy versions or regional settings change. Both tests are binary‑scored, macro‑averaged across scenarios, and double‑annotated for reliability.
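A minimal SCC-style check could be sketched as below. The regex patterns, pass criteria, and example strings are assumptions for illustration, not the paper's actual implementation:

```python
import re

def scc_check(answer: str, cited_clause_text: str) -> bool:
    """Binary SCC-style check (illustrative): the answer must name a quoted
    document title, a version, and a clause, and every number it quotes must
    also appear in the text of the cited clause."""
    has_title = bool(re.search(r'"[^"]+"|《[^》]+》', answer))
    has_version = bool(re.search(r"\b(19|20)\d{2}\b|v\d+(\.\d+)?", answer))
    has_clause = bool(re.search(r"(Article|Clause|Section)\s+\d+", answer, re.I))
    numbers = re.findall(r"\d+(?:\.\d+)?", answer)
    clause_numbers = set(re.findall(r"\d+(?:\.\d+)?", cited_clause_text))
    numbers_match = all(n in clause_numbers for n in numbers)
    return has_title and has_version and has_clause and numbers_match

# Invented example answer and clause text.
answer = ('Per "Residential Tariff Rules" (2023 edition), Article 12, '
          'the late fee is 0.5 percent per day.')
clause = "Article 12 (2023): a late fee of 0.5 percent per day applies."
print(scc_check(answer, clause))
```

Per the paper's protocol, such binary outcomes would then be macro-averaged across scenarios and double-annotated for reliability.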

To improve LLM performance on this domain, the paper proposes SEEK‑RAG, a retrieval‑augmented generation approach that injects policy and regulatory knowledge during fine‑tuning and inference. By indexing the full corpus of tariff rules, interconnection standards, and DR program guidelines, SEEK‑RAG retrieves the most relevant clauses for each user query and concatenates them to the model’s prompt, ensuring that generated answers are grounded in up‑to‑date regulations.
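The retrieve-and-concatenate step described above can be sketched with a simple word-overlap retriever. The scoring function and prompt template are placeholders; SEEK-RAG's actual retriever and prompting are presumably more sophisticated:

```python
def retrieve(query: str, clauses: list[str], k: int = 2) -> list[str]:
    """Rank policy clauses by word overlap with the query and keep the
    top-k. A toy stand-in for a real retriever (e.g. BM25 or dense search)."""
    q_words = set(query.lower().split())
    scored = sorted(clauses,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, clauses: list[str]) -> str:
    """Concatenate retrieved clauses ahead of the user query so the model's
    answer is grounded in the cited regulations."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(clauses))
    return f"Relevant regulations:\n{context}\n\nUser question: {query}"

# Invented mini-corpus of policy clauses.
corpus = [
    "Article 5: interconnection applications require a single-line diagram.",
    "Article 9: demand-response enrollment closes 10 days before each event.",
    "Article 12: time-of-use tariffs apply peak rates from 18:00 to 22:00.",
]
query = "when do peak time-of-use rates apply"
top = retrieve(query, corpus)
print(build_prompt(query, top))
```

Grounding the prompt in retrieved clauses is also what makes the SCC stress test passable: the exact titles, versions, and numbers the model must cite are present in its context.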

Experiments cover 13 mainstream LLMs, ranging from open-source 7B models (LLaMA-3, Mixtral) to proprietary giants (GPT-4, Claude-3, Gemini 1.5). Results show that a domain-aligned 7B model equipped with SEEK-RAG matches or exceeds the performance of much larger models on both human-rated dimensions and the SCC/LDC stress tests. Moreover, the smaller models achieve 30-50% lower computational cost and energy consumption, highlighting a path toward cost-effective, energy-efficient deployment in utility environments.

The contributions are threefold: (1) the release of ElectriQ, a large‑scale, publicly available benchmark with detailed protocols and evaluation scripts; (2) the definition of multidimensional, regulation‑aware evaluation criteria, including novel compliance stress tests; and (3) the demonstration that retrieval‑augmented fine‑tuning can endow compact LLMs with professional, auditable performance suitable for real‑world EPM tasks. The authors argue that ElectriQ and SEEK‑RAG together provide utilities and regulators with a reproducible, audit‑ready framework to assess and safely integrate LLM‑based assistants, thereby enhancing demand‑side management, renewable integration, and grid resilience while maintaining regulatory compliance.

