AnalyticsGPT: An LLM Workflow for Scientometric Question Answering
This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the “science of science.” Compared to traditional paper-based scientific question answering, the task poses unique challenges in the planning phase: namely, the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g., impact factors. Beyond their exceptional capacity for traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition, planning, and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the retrieved data into presentable, well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.
💡 Research Summary
AnalyticsGPT presents a novel, end‑to‑end workflow that leverages large language models (LLMs) to answer scientometric questions—queries that combine academic entities (authors, institutions, journals, topics) with quantitative metrics such as citation counts, impact factors, or field‑weighted citation impact. Unlike conventional scientific QA that retrieves information from papers, scientometric QA (SQA) requires precise named‑entity recognition (NER) of scholarly entities, multi‑faceted filtering, aggregation, and often comparative or superlative reasoning. The authors argue that a pure LLM generation approach would suffer from outdated knowledge, hallucinations, and an inability to construct complex database queries, especially when the underlying data resides in a proprietary research performance analytics platform.
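To make the task's structure concrete, a parsed scientometric question can be thought of as entities plus a metric plus an operation. The sketch below is our own illustrative data structure, not the authors' actual schema; all field names are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a parsed scientometric question.
# Field names are illustrative, not taken from the paper.
@dataclass
class ScientometricQuery:
    entities: dict[str, list[str]]   # e.g. {"institution": ["University A"]}
    metric: str                      # e.g. "fwci" (field-weighted citation impact)
    operation: str                   # e.g. "rank", "compare", or "aggregate"
    filters: dict[str, object] = field(default_factory=dict)

# "Which of these two universities had the higher FWCI in 2020?"
q = ScientometricQuery(
    entities={"institution": ["University A", "University B"]},
    metric="fwci",
    operation="compare",
    filters={"year": 2020},
)
```

A comparative query like this already requires NER (two institutions), a metric lookup, a temporal filter, and superlative reasoning, which is exactly the combination the authors argue a single-shot LLM generation step handles poorly.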
The system is organized into five sequential modules, each with a clearly defined role:
- High‑Level Planning Module (HLPM) – The LLM first performs NER on the user’s natural‑language query, tagging supported entity types, and then produces a high‑level, step‑by‑step outline of the solution. The identified entity names are resolved to unique identifiers (IDs) via a GraphQL‑based service that aggregates vector‑search endpoints.
- Detailed Planning Module (DPM) – Building on the HLPM outline, a second LLM call generates a low‑level execution plan. For each step the plan specifies (i) which tool to call, (ii) the sub‑task, (iii) dependencies on earlier steps, and (iv) concrete parameter values. The authors expose only three high‑level tools to the planner: (a) entity‑name‑to‑ID resolution, (b) article search, and (c) faceted aggregation of article‑derived entities. Prompt engineering includes method contracts and few‑shot examples to reduce syntax errors.
- Action Module (AM) – The plan is executed. Independent steps are run concurrently to exploit parallelism (especially for “union” queries that involve multiple entities). Dependent steps first generate an intermediate LLM tool call to infer missing parameters from previous results. All database queries are assembled by a rule‑based engine that translates the high‑level parameters into syntactically correct API calls, thereby preventing crashes caused by malformed queries.
- Writing Module (WM) – After data retrieval, a single LLM call composes the final answer. The prompt explicitly forbids the model from filling gaps with internal knowledge; instead it must cite retrieved data inline, use markdown tables, headings, and bullet lists, and provide a concise high‑level analysis. This design mitigates hallucination and ensures traceability.
- Visualization Module (VM) – Post‑processing decides whether the textual answer would benefit from visual aids. If so, the LLM generates Python plotting code, executes it in a sandbox, and iteratively fixes any errors until a plot image is produced. The visual is then embedded alongside the textual response.
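The DPM/AM interplay described above (steps with dependencies, independent steps run in parallel) can be sketched as a small dependency-aware executor. This is a minimal illustration under assumed plan and tool formats; the step schema and tool names below are ours, not the paper's.

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of dependency-aware plan execution. A plan is assumed to be
# a list of steps, each with an id, a tool name, parameters, and dependency
# ids; this format is illustrative, not the authors' actual schema.
def execute_plan(plan, tools):
    results = {}
    remaining = {step["id"]: step for step in plan}
    with ThreadPoolExecutor() as pool:
        while remaining:
            # All steps whose dependencies are satisfied run concurrently.
            ready = [s for s in remaining.values()
                     if all(d in results for d in s["deps"])]
            futures = {s["id"]: pool.submit(tools[s["tool"]], s["params"],
                                            {d: results[d] for d in s["deps"]})
                       for s in ready}
            for sid, fut in futures.items():
                results[sid] = fut.result()
                del remaining[sid]
    return results

# Toy tools mirroring the paper's tool set: resolve an entity name to an ID,
# then search articles using the resolved ID from the previous step.
tools = {
    "resolve_id": lambda params, deps: {"id": 42},
    "search_articles": lambda params, deps: {"count": 10, "inputs": deps},
}
plan = [
    {"id": "s1", "tool": "resolve_id",
     "params": {"name": "Some University"}, "deps": []},
    {"id": "s2", "tool": "search_articles",
     "params": {"year": 2020}, "deps": ["s1"]},
]
out = execute_plan(plan, tools)
```

In a “union” query over several entities, multiple `resolve_id` steps would have empty dependency lists and thus execute in the same parallel batch, which is where the speedup the authors report would come from.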
The retrieval backend is a proprietary research analytics platform accessed through a set of API endpoints; the system is built on LangChain, but the authors stress that the workflow is implementation‑agnostic and could be instantiated with other data sources.
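The implementation-agnostic claim can be made concrete by viewing the five modules as plain function composition, where each stage could wrap any LLM client or data backend. The function names and prompts below are our own shorthand, not the authors' API.

```python
# Sketch of the five-module workflow as function composition. `llm` and
# `backend` are stand-ins for any chat-model client and any retrieval
# platform; names and prompt wording are illustrative assumptions.
def analytics_gpt(question, llm, backend):
    outline = llm(f"Tag entities and outline solution steps for: {question}")  # HLPM
    plan = llm(f"Expand into parameterized tool calls: {outline}")             # DPM
    data = backend(plan)                                                       # AM
    answer = llm(f"Write a grounded answer using only this data: {data}")      # WM
    return answer  # the VM would optionally attach a generated plot here

# Stub LLM and backend, just to show the control flow end to end.
reply = analytics_gpt(
    "Which journal had the highest impact factor in oncology?",
    llm=lambda prompt: f"[LLM output for: {prompt[:40]}...]",
    backend=lambda plan: {"rows": []},
)
```

Because each stage only consumes the previous stage's output, swapping the proprietary platform for another data source only changes `backend`, which is the sense in which the workflow is implementation-agnostic.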
For evaluation, the authors compare AnalyticsGPT against a naïve RAG baseline that performs a single tool‑selection step followed by answer generation, lacking the HLPM/DPM planning stages and the rule‑based query builder. They employ two expert SMEs who grade responses using a five‑point rubric covering robustness, content coverage, and claim validity, and they also implement an “LLM jury” consisting of multiple LLM judges that provide confidence‑weighted scores. Results show that AnalyticsGPT consistently outperforms the baseline across all metrics, particularly on complex, multi‑intent or comparative queries where the baseline frequently omits required filters or generates syntactically invalid queries. The LLM‑as‑judge scores correlate strongly with human judgments, suggesting that automated evaluation can be reliable for future large‑scale testing.
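The “LLM jury” with confidence-weighted scores admits a simple aggregation rule. The paper does not spell out its exact weighting scheme, so the following is one plausible form we are assuming: each judge returns a 1–5 rubric score and a self-reported confidence in [0, 1], and the jury score is the confidence-weighted mean.

```python
# Hypothetical confidence-weighted aggregation for an LLM jury; the exact
# scheme in the paper is unspecified, so this is an assumed illustration.
def jury_score(verdicts):
    """verdicts: list of (score, confidence) pairs, one per LLM judge."""
    total_weight = sum(conf for _, conf in verdicts)
    if total_weight == 0:
        return None  # no judge expressed any confidence
    return sum(score * conf for score, conf in verdicts) / total_weight

# Three judges grade the same response on a five-point rubric.
verdicts = [(5, 0.9), (4, 0.6), (3, 0.5)]
s = jury_score(verdicts)  # (5*0.9 + 4*0.6 + 3*0.5) / 2.0 = 4.2
```

Weighting by confidence lets a hesitant judge contribute less than a certain one, which is one way such a jury could achieve the strong correlation with human SME grades that the authors report.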
Key contributions include: (1) defining the under‑explored SQA task and aligning it with real‑world product integration needs (bridging two internal products to avoid user context switching); (2) introducing a modular planning architecture that enables LLMs to generate executable, parameter‑rich tool calls reliably; (3) demonstrating that rule‑based query assembly combined with parallel execution yields both robustness and speed; and (4) providing a systematic human‑plus‑LLM evaluation framework.
In conclusion, AnalyticsGPT illustrates how LLM‑driven agents, when coupled with careful planning, tool abstraction, and verification steps, can effectively tackle niche, data‑intensive question answering tasks. Future work may extend entity coverage, incorporate multimodal sources (e.g., full‑text PDFs), and explore continual learning from user feedback to refine planning heuristics.