Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset



Preprint. Under review.

Dany Haddad*†, Dan Bareket*†, Joseph Chee Chang*†, Jay DeYoung*†, Jena D. Hwang*†, Uri Katz*†, Mark Polak*†, Sangho Suh*†, Harshit Surana*†, Aryeh Tiktinsky*†, Shriya Atmakuri*, Jonathan Bragg*, Mike D'Arcy*, Sergey Feldman*, Amal Hassan-Ali*, Rubén Lozano*, Bodhisattwa Prasad Majumder*, Charles McGrady*, Amanpreet Singh*, Brooke Vlahos*, Yoav Goldberg*†, Doug Downey*†

* Allen Institute for AI    † Core contributors

Abstract

AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.
1 Introduction

AI-powered assistants for scientific question answering and literature discovery are increasingly deployed in both academic and commercial settings (Xu & Peng, 2025), rapidly becoming part of day-to-day scientific workflows for tasks ranging from paper discovery to literature reviews and experimental planning (Liao et al., 2024). Commercial systems typically combine retrieval over large scholarly corpora with LLM-based synthesis, and include general-purpose AI search engines (e.g., Perplexity AI, 2024; You.com, 2024); deep research agents (e.g., Google, 2024; OpenAI, 2025; Anthropic, 2025); and science-focused platforms (e.g., Elicit, 2024; Skarlinski et al., 2024).

Despite rapid adoption, we still lack a clear picture of what researchers actually do with these systems: Do they use them like search engines? As writing assistants? As collaborators? Or as something else entirely? Existing studies typically report aggregate statistics derived from proprietary logs (Chatterji et al., 2025; Phang et al., 2025; Handa et al., 2025). To our knowledge, no publicly available large-scale dataset of real-world user interactions with deployed AI-powered scientific research tools exists.

We address this gap by releasing and analyzing the Asta Interaction Dataset, a large-scale anonymized user behavior log from Asta (Singh et al., 2025; Allen Institute for Artificial Intelligence, 2025). Asta exemplifies the emerging class of AI research assistants, and integrates with the academic search engine Semantic Scholar to provide two AI-powered interfaces: PF (PaperFinder, a paper search interface) and SQA (ScholarQA, a scientific question-answering interface). Our dataset includes over 200,000 queries and associated clickstream logs, forming the first public dataset of real-world user interactions with a deployed AI-powered scientific research tool.
We address two interrelated research questions. RQ1: How do researchers formulate information needs when interacting with LLM-based retrieval and synthesis systems, how do these behaviors differ from traditional search, and how do they evolve with experience? RQ2: How do users consume and navigate AI-generated research reports, and what does this reveal about how design choices shape engagement?

Across both interfaces, we observe a shift from search-like behavior toward collaborative use: users issue longer, task-oriented queries and delegate higher-level research activities such as drafting and gap identification. We structure these analyses within a new taxonomy of query intent, phrasing, and criteria.

Contributions: (1) We publicly release a large-scale dataset of over 200,000 real-world user queries and interaction logs from deployed AI-powered research tools. (2) We provide an analysis of query patterns and user engagement, and how usage behaviors evolve over time. (3) We introduce a multidimensional query taxonomy (intent, phrasing, criteria) tailored to AI research assistants.

2 System description

Interfaces. Asta exposes two tools: PF (a paper-finding interface that returns a ranked list of papers with lightweight synthesis) and SQA (a scientific question-answering interface that produces a structured report with citations). As a comparative baseline, we also analyze queries performed on Semantic Scholar (S2; https://semanticscholar.org), a traditional academic search website.

Retrieval and synthesis. Given a natural-language query, the system retrieves candidate papers from a scholarly corpus, re-ranks them, and generates outputs that ground claims in retrieved papers via inline citations. PF (main interface panel shown in Figure 1a) presents a chat-like interface returning a ranked list of papers linked to their Semantic Scholar page, each with a brief generated summary of its relevance to the query.
Users can click to see evidence from the paper supporting its relevance. SQA (interface shown in Figure 1b) is a single-turn literature summary tool producing a multi-section report: each section has a title, a one-sentence TL;DR (visible when collapsed), an expandable body with inline citations, and feedback controls. Citations open evidence cards with a link to the paper page and excerpts that support the surrounding claim. We refer to the response generated by each query as a report: for SQA, this is the multi-section literature summary; for PF, this is the ranked list of papers with generated summaries.

Data collected and released. We release the Asta Interaction Dataset (AID), a dataset of anonymized, opt-in user interactions from SQA and PF containing 258,935 queries and 432,059 clickstream interactions (February–August 2025). To limit personally identifiable information (PII) risks, we release only hashed report identifiers and drop queries with LLM-detected PII (less than 1%). Our analysis uses internal pseudonymous user identifiers to compute cohorts and retention; these identifiers are not included in the released dataset to reduce re-identification risk. The full dataset schema is in Appendix F.

3 Analysis design and methodology

Analysis pipeline. We analyze user behavior through a pipeline combining preprocessing, LLM-based query labeling, and statistical modeling. Our preprocessing filters bots, identifies sessions, and removes PII. We label 30,000 single-turn queries across multiple aspects (intent, phrasing, criteria, field of study) using GPT-4.1 with structured decoding. All statistical tests conducted are two-sided t-tests (α = 0.05). We also fit binomial logistic regression models predicting click-through, controlling for false discovery with the Benjamini-Hochberg procedure over all estimated p-values. Full details are in Appendix A.
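The false-discovery control step above can be illustrated with a minimal sketch of the standard Benjamini-Hochberg procedure. This is an illustrative implementation of the textbook procedure, not the paper's actual analysis code; the function name and p-values are our own.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean list marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.30]))
# -> [True, True, False, False, False]
```

Note the "step-up" character of the procedure: a p-value may exceed its own threshold yet still be rejected if a larger-ranked p-value passes.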
Figure 1: Screenshots of the two Asta interfaces used in our study (Singh et al., 2025; Allen Institute for Artificial Intelligence, 2025). (a) The PF interface shows a ranked list of papers with per-item actions and filters. (b) The SQA interface shows a report with collapsible sections and inline citations.

Query taxonomy. Traditional IR taxonomies are less appropriate for AI research assistants like Asta, because users issue complex natural language queries requiring multi-step reasoning and constraint satisfaction rather than keyword queries. Therefore, we introduce a new taxonomy with non-mutually exclusive labels along three query aspects: intent, phrasing style, and criteria. We build our taxonomy through an iterative human-and-LLM process: starting from manual inspection, we have an LLM (Gemini-2.5-pro) propose additional labels, then manually consolidate until convergence. See Appendix G for full definitions. We identify 16 query intents (e.g., Broad Topic Exploration, Causal and Relational Inquiry; Table 1), 7 phrasing styles (e.g., Keyword-style Query, Natural Language Question; Table 2), 6 criteria types (e.g., Temporal Constraints, Methodology-Specific Criteria; Table 3), and 28 fields of study (e.g., Biology, Electrical Engineering, Law and Legal Studies; see Table 31 for the full list).

User actions and success metrics. We study four primary actions in our clickstream dataset: S2 link clicks (navigating from the report to a paper page on semanticscholar.org), section expansions (revealing the section contents in SQA), evidence clicks (viewing inline citation support), and feedback (thumbs up/down). From these we derive: click-through rate (CTR), the fraction of reports with at least one link click; churn rate, the fraction of users with no subsequent query; and return rate, the fraction of users who return after their initial visit.
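As a concrete illustration of these derived metrics, the following minimal sketch computes CTR and return rate from a simplified event log. The field names ("user", "session", "report", "action") are hypothetical stand-ins, not the released dataset schema.

```python
def click_through_rate(events):
    """Fraction of reports with at least one S2 link click."""
    reports = {e["report"] for e in events}
    clicked = {e["report"] for e in events if e["action"] == "s2_link_click"}
    return len(clicked) / len(reports)

def return_rate(events):
    """Fraction of users who come back for more than one distinct session."""
    sessions_by_user = {}
    for e in events:
        sessions_by_user.setdefault(e["user"], set()).add(e["session"])
    returned = sum(1 for s in sessions_by_user.values() if len(s) > 1)
    return returned / len(sessions_by_user)

log = [
    {"user": "u1", "session": 1, "report": "r1", "action": "query"},
    {"user": "u1", "session": 1, "report": "r1", "action": "s2_link_click"},
    {"user": "u1", "session": 2, "report": "r2", "action": "query"},
    {"user": "u2", "session": 1, "report": "r3", "action": "query"},
]
print(round(click_through_rate(log), 3))  # 1 of 3 reports clicked -> 0.333
print(return_rate(log))                   # 1 of 2 users returned -> 0.5
```

Churn rate follows the same pattern (users with no subsequent query after a given point), so it is omitted here for brevity.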
Following prior work establishing implicit behaviors as reliable satisfaction indicators (Joachims et al., 2005; Fox et al., 2005; Kim et al., 2014), we use click-through rate (CTR) as our primary success surrogate, as S2 link clicks strongly predict user returns (Appendix D). Explicit thumbs feedback is too sparse (fewer than 2% of reports) and less predictive of return than link clicks.

User experience stages. To study how behavior evolves with experience, we define three progression stages based on each user's cumulative query count at the time of each query: the single-query stage (a user's first query), the inexperienced stage (queries 2 through 10), and the experienced stage (queries beyond the 10th). These are not separate groups; we track the same users over time as they progress through these stages, enabling us to observe within-user behavioral changes as familiarity with Asta grows. About 40% of users initiate at least 2 queries (the second query usually coming within a few hours of the first), while fewer than 10% of users initiate 10 or more queries.

4 Results

Our analysis reveals that users of both PF and SQA initiate queries differently than on traditional search engines and engage with results as persistent artifacts.

Table 1: Representative examples of query intents, illustrating the diversity of user information needs.

Intent | Example Query
Broad Topic Exploration | GLP-1 and diabetes
Specific Factual Retrieval | What are the four core concepts of Rotter's theory?
Concept Def. & Exploration | Summarize the concept of "Technoimagination" [...] by Vilém Flusser
Comparative Analysis | Describe the trade-offs between HBr and Cl2 plasma gases for reactive ion etching of polysilicon
Causal & Relational Inquiry | Relation between nighttime digital device use and sleep quality [...]
Method. & Proced. Guidelines | How often should I collect mosquitoes for dengue surveillance?
Tool & Resource Discussion | Are there any tools to count the quality or semantic content of citations?
Research Gap Analysis | Survey on the limitations of classical NLP Evaluation Metrics
Citation & Evidence Finding | Can you assist me get the source: Malnutrition refers to deficiencies... (WHO, 2023)
Specific Paper Retrieval | Anderson and Moore's paper on the stability of the Kalman filter
Ideation | Give me a cost-efficient way to build Rapid Antigen Tests using low-cost expression systems
Application Inquiry | ETA prediction with GPS data from cargo
Data Interpretation Support | Why do TarM knockout strains show higher IL-1b responses compared to JE2 WT? Discuss my results
Content Gen. & Experiment | Improve this Materials and Methods section for a journal paper [...]
Academic Document Drafting | Write a full Materials and Methods section suitable for submission to a peer-reviewed journal in plant science or environmental science. [...]

Table 2: Representative examples of query phrasing styles observed across PF and SQA.

Phrasing Style | Example Query
Keyword-style | piracetam efficacy
Natural Language Q. | How are emerging digital technologies reshaping sustainable development outcomes?
Explicit Instruction | Write an essay as a doctor of literary discourse analysis [...]
Complex Context Narrative | When training large language models, it's essential to develop benchmarks that assess true problem-solving ability, not just factual recall. [detailed discussion about knowing Berlin's train stations vs. planning a novel journey] [C]onduct a literature review on efforts to curate evaluations that explicitly test synthesis rather than memorization.
Multi-part Query | 9. Identify and summarise key international instruments (e.g., UN Charter, Budapest Convention, UNSC Resolutions). Discuss their relevance and application [...]
Boolean/Logical Ops | ("Family Centered" OR "Family Centred") AND (challeng* OR implantat*)
Citation/Format Specification | Use Harvard-style citation to apply in-text citations cite with author's name and publication date, page number to Write "1,000-word written reflection [...]

Query intents. Table 1 illustrates the diversity of query intents with representative examples. User queries span a spectrum from traditional retrieval tasks (finding specific papers, locating citations, or exploring broad topics) to tasks that go well beyond search. Users ask for methodological guidance, request help interpreting their own experimental results, seek to identify research gaps, and even delegate content generation such as drafting full manuscript sections. This range suggests users approach Asta not merely as a search tool but as a research assistant capable of supporting the full research workflow. Figure 3b presents quantitative distributions of these intents.

Phrasing styles. Users employ a wide range of phrasing styles when querying Asta (Table 2). While Keyword-style queries remain the most common phrasing on both tools (Figure 3a), a substantial share of queries adopt styles that would be ineffective on traditional search engines. Natural Language Questions account for a large portion of queries, indicating that users expect the system to parse full sentences and act on directives. More notably, Complex Contextual Narratives, where users paste entire draft paragraphs as context before posing a question, and Multi-part Queries that specify structured sub-tasks reflect phrasing behaviors shaped by general-purpose LLMs.

Table 3: Representative examples of search constraints (criteria) specified in user queries across PF and SQA.

Search Constraint | Example Query
Methodology-Specific | federated learning where the authors had coined the new terms for defining client types like homogenous clients, selfish clients...
Publication Type/Quality | Please find scientific references only from highly reputable and scopus-indexed international journals...
Temporal Constraints | Pharmacological Activities of Ganoderma applanatum. Use journals from 2020-2025
Metadata-Based | Papers from SIGCHI that have use cases with an "adaptive task challenge"
Cit./Impact-Based | What is the most cited paper of Orit Hazzan?
Data/Resource Availability | search for documents that provide data of the amount of comic books are purchased among different classes in the United States.

Search constraints. Beyond phrasing, many queries include explicit search constraints that go beyond what traditional search interfaces support (Table 3 shows users applying facets via text descriptions of their search criteria that traditional search engines cannot handle). As shown in Table 7, Methodology-Specific criteria are by far the most common constraint on both Asta and S2 (42% and 29% of queries, respectively), reflecting users' desire to filter results by experimental design or analytical approach. Publication Quality filters (11% on Asta vs. 3% on S2) and Temporal Constraints (5% vs. 2%) are also more prevalent on Asta. Rarer but notable are Citation/Impact-Based criteria (< 1% on Asta), such as requests for highly cited papers or journals above a given impact factor, and Data/Resource Availability constraints, where users seek studies with publicly available datasets or code.

4.1 RQ1: How do users query LLM-based information retrieval and synthesis systems, and how does it differ from traditional search?

We analyze session duration, query complexity, and query types across tools and over time. Users initiate much longer and more complex queries than on the S2 baseline (Table 4). Query complexity on S2 itself has also increased over time, possibly due to broader AI adoption raising user expectations.

Session duration and usage patterns.
Users typically initiate 1-2 queries per session, with a median session duration of 4 minutes for PF and 8 minutes for SQA. (Note that the median response times for PF and SQA are 34 seconds and 129 seconds, respectively.) The median number of sessions per user and the median number of queries per user is 2 for both PF and SQA. Like many IR tools, there is a wide spread of usage levels, with both session duration and number of sessions per user having a long tail. The shorter duration of PF sessions is to be expected, since one of the main actions performed on PF (clicking on an S2 link) navigates the user away from PF (although in a new tab), whereas most of the content presented by SQA is consumed in situ. Across both tools, usage is dominated by repeat users: twice as many of our users have initiated multiple queries as have initiated only a single query, and more than half of those repeat users have initiated more than 15 queries.

Temporal changes in query behavior. As previously noted, S2 queries' complexity has also increased over the years. Comparing S2 queries between 2022 and 2025, we find that the fraction of queries with at least 1 constraint rose from 7% to 10%, queries with at least 1 relation grew from 65% to 78%, and average query length increased from 4.8 to over 6 words. This suggests users increasingly expect search systems to handle more complex queries, likely shaped by exposure to AI-powered tools.

Query intent and phrasing distribution. While query length and structural complexity capture surface-level properties, they do not reveal what users are trying to accomplish. We therefore examine the distribution of query intents, phrasing styles, fields of study, and criteria specified in the query (with labels identified using an LLM). Figure 3 shows the query distribution across these various aspects and reveals that relatively simple queries are the most common: Broad Topic Exploration and Keyword-style queries
dominate the query distribution. Fewer than half of all queries specify some explicit search criteria constraint (see Table 7 for the full list of rates), the most common being criteria regarding paper methodology (see Table 3). In contrast, queries to S2 are skewed more heavily toward simpler queries: nearly all (98%) of queries to S2 are keyword-style queries. The query intents that are relatively more common in S2 are also simpler and reflect the informational and navigational queries of Broder (2002) from traditional IR systems: broad queries and specific paper retrieval queries. We also observe variation in query intents across scientific fields. For example, computer science queries are the most likely to be Ideation queries while history queries are the least likely (see Figure 8 in the appendix for the full breakdown).

Table 4: Query complexity and length comparison between PF, SQA, and S2 (mean values with 95% CI). Since S2 is a traditional search engine, we expect its queries to be simpler than those performed on Asta (PF and SQA). Query length is listed separately in the last row as it is not a notion of complexity.

Metric | PF | SQA | S2
Constraints | 0.60 ± 0.05 | 0.82 ± 0.08 | 0.15 ± 0.02
Entities | 4.00 ± 0.2 | 5.14 ± 0.42 | 2.25 ± 0.05
Relations | 2.17 ± 0.08 | 2.68 ± 0.18 | 1.20 ± 0.04
Length | 17.04 ± 2.51 | 36.96 ± 9.02 | 5.35 ± 0.18

Figure 2: Query length distribution for PF and SQA showing heavy-tailed behavior.

Figure 3: (a) Query phrasing styles and (b) intents (% of queries). Keyword queries dominate the distribution. Asta users initiate more natural language queries than S2 users. Most queries are broad exploration or concept explanation; S2 is skewed towards broad and specific paper retrieval queries. See Table 6 for the complete intent distribution, and Table 7 for the criteria distribution.
Moreover, the prevalence of Ideation queries on Asta compared to S2 (Figure 3b) suggests that users are asking AI-powered tools to take over tasks they would previously have done themselves: not just retrieving relevant papers, but directly generating ideas and solutions.

Table 5: Query labels by experience stage, tracking the same user population over time (percentages with 95% confidence intervals).

Query Label | single-query | inexperienced | experienced
Broad Topic Exploration | 61.23 ± 1.38 | 55.63 ± 0.98 | 53.48 ± 0.99
Causal and Relational Inquiry | 15.78 ± 1.03 | 15.97 ± 0.72 | 17.06 ± 0.75
Citation & Evidence Finding | 6.25 ± 0.69 | 8.65 ± 0.55 | 9.65 ± 0.59
Methodology-Specific Criteria | 41.84 ± 1.4 | 45.50 ± 0.98 | 47.17 ± 0.99

User expectations. The distribution of query intents shows that users expect Asta to perform diverse tasks beyond search, including experimental design and identifying unexplored research directions. Yet despite these expectations, keyword-style queries and broad topic exploration dominate, a pattern that persists even among experienced users (Section 4.1), suggesting functional fixedness. Multi-part queries or Complex Contextual Narrative queries often include long passages copy-pasted from drafts followed by a question, taking advantage of LLMs' capability to handle long texts. Beyond typical queries, users employ diverse strategies (template-filling, explicit prompting, collaborative writing workflows, and refinding queries) that reveal expectations shaped by general-purpose LLMs (see Table 10 in the appendix for examples). These patterns suggest users expect Asta to function as a collaborative research partner with capabilities similar to general-purpose chatbots. Users also increasingly use abstract concepts rather than specific jargon (e.g., "why some language models behave unpredictably when trained further" vs. "BERT fine-tuning instability"); 66% of PF queries include abstract concepts vs.
38% in S2 (Appendix B).

Learning effects. Tracking the same users across the three experience stages defined in Section 3, we find that users learn to issue more targeted queries over time. As shown in Table 5, Broad Topic Exploration drops from 61.2% in the single-query stage to 53.5% in the experienced stage, while other query intents become more common. This suggests that as users gain more experience with the system, they learn to initiate more complex and challenging queries beyond broad topic exploration.

Near-duplicate queries and report revisitation. Report revisitation is common: 50.5% of SQA users and 42.1% of PF users revisit previous reports, substantially more than near-duplicate query submission (18.8% and 14.8%, respectively). This suggests users treat Asta results as persistent artifacts, reference material that they can consult multiple times, rather than ephemeral search results. Near-duplicate queries occur on shorter timescales (median < 16 minutes) than revisits (median 4-6 hours), with most showing slight refinements such as format instructions or language preferences (see Tables 11 and 12 in the appendix).

4.2 RQ2: How do users engage with the presented content?

Our analysis of engagement patterns reveals that users learn to extract additional value from these systems as they gain experience: SQA users discover that they can verify claims by examining the cited works (the rate of clicking on inline evidence increases by 27% between a user's first query and their 4th), while experienced PF users increasingly consume information directly from the result list without clicking through to papers (link clicks drop by 24% over the same period), as the interface provides sufficient context.

SQA-specific actions. SQA presents reports as expandable sections with TL;DR summaries.
Analysis reveals diverse, non-linear reading behaviors: users skip the introduction 43% of the time, and over half of reports (52.4%) involve non-consecutive section expansions. While sequential reading dominates, users frequently skip sections, navigate backwards, and return to the introduction from later sections (see Appendix C for detailed transition analysis and Figure 7 for the Sankey diagram). These patterns suggest that the section-oriented display helps users efficiently identify and access just the portions of information most relevant to their needs.

Churn due to latency and errors. Since AI-powered tools have higher latency and error rates than traditional search, we examined how response time and errors affect user retention. PF typically returns results within 30 seconds, while SQA takes around 2 minutes to generate full reports. Users tolerate this difference: SQA churn remains stable (close to 11%) for response times under 5 minutes, whereas PF churn increases by 10% (relative) if responses exceed 1 minute, suggesting that users expect PF to behave like traditional search but accept longer waits for SQA's synthesized reports. In contrast, catastrophic errors severely impact retention: first-time users who encounter an error have only a 10% chance of returning, compared to 53% for users with an initial successful experience.

5 Related work

Understanding user behavior has long been central to information retrieval (IR) research. Classic web search work proposes query intent taxonomies, including Broder's influential three-way split into informational, navigational, and transactional intents (Broder, 2002) and subsequent refinements (Rose & Levinson, 2004; Jansen et al., 2008; Jansen & Booth, 2010; Cambazoglu et al., 2021). These taxonomies have primarily been developed for keyword-based web search over general-purpose corpora.
We similarly develop and apply a taxonomy of user intents, but focus on scientific research tasks and LLM-based assistants.

Recent work has begun to characterize how people query and adapt to LLM-based information access systems. Liu et al. (2025) examine "functional fixedness" in LLM-enabled chat search, showing how users' prior experience with search engines, virtual assistants, and LLMs constrains their initial prompt styles and adaptation strategies, and proposing a typology of user intents in chat search. Shah et al. (2025) introduce an LLM-plus-human-in-the-loop pipeline to generate and apply user intent taxonomies from large-scale Bing search and chat logs, uncovering distinct intent distributions between traditional search and AI-driven chat and demonstrating how such taxonomies support log analysis at scale. Complementing these log-based perspectives, Wang et al. (2024) develop a taxonomy of seven high-level user intents for general LLM interactions and, through a survey of 411 users, reveal heterogeneous usage patterns, satisfaction levels, and concerns across intents. Kim et al. (2024) focus specifically on conversational search, deriving a taxonomy of 18 follow-up query patterns and using an LLM-powered classifier on real-world logs to relate different follow-up behaviors to user satisfaction signals. Together, these studies show that LLM-based systems elicit rich behaviors that are strongly mediated by users' mental models and prior tool use, but they primarily investigate general-purpose search or everyday LLM services rather than domain-specific scientific research tools.

Another line of work examines how LLM-powered search affects engagement with information and the diversity of content that users ultimately see. Spatharioti et al. (2025) compare LLM-powered search with more traditional search conditions in decision-making tasks, measuring differences in speed, accuracy, and overreliance. Kaiser et al.
(2025) conduct a large-scale study of user behavior and preferences during practical search tasks, contrasting generative-AI search with traditional search engines and documenting differences in how people explore and prefer results across interfaces.

While some work exists characterizing usage patterns with LLM-based systems, these studies release only final analyses rather than underlying interaction data. Anthropic (Tamkin et al., 2024) and OpenAI (Chatterji et al., 2025) released limited descriptions of how people use their chat products, including LLM-derived taxonomies of user intents. Yang et al. (2025) present a study of AI agent adoption analyzing usage from millions of Perplexity users. OpenRouter's State of AI report analyzes task distributions, model preferences, and retention patterns (Aubakirova et al., 2026). Other white papers address the economic impacts of AI tools (Handa et al., 2025; Appel et al., 2025). None of these publicly release interaction data. The LMSYS (Zhao et al., 2024b), WildChat (Zhao et al., 2024a), and OpenAssistant (Köpf et al., 2023) datasets do release user conversation text, but include only basic metadata and are broad-domain rather than specific to systems targeting researchers.

6 Discussion

Our analysis reveals that users pose longer, more complex queries to AI-powered tools than to traditional search-powered tools. Users treat results as persistent artifacts, and experienced users engage more deeply with content than other users. However, these patterns must be interpreted in light of differential query success across query types, which we discuss under "Limitations and generalizability" below.

Implications for tool and interface design. The behavioral patterns we observe suggest several directions for the design of future AI-powered scientific research tools.

Query formulation support.
The observation that users often discover unmet requirements only after seeing initial results (evidenced by near-duplicate queries that add format instructions or language preferences) suggests value in clarifying user intent before executing long-running queries. Users' feedback often resembles follow-up instructions (e.g., "add a historical component"), indicating expectations for iterative refinement through conversation. The high rate of report revisitation (42-50% of users) and users' treatment of results as persistent artifacts suggest that generated content may need mechanisms for staying current as new literature appears.

Content navigation and consumption. The non-linear reading patterns we observe in SQA, including frequent skipping of introductions (43% of reports) and revisiting of specific sections, motivate designs that foreground section-level navigation, TL;DR-style summaries, and user control over the ordering and granularity of generated content, rather than assuming strictly sequential consumption.

Reliability and latency tolerance. Users tolerate higher latency for report-generation tools (SQA) than for search-oriented tools (PF), but are highly sensitive to catastrophic errors during their initial query. This error sensitivity suggests that graceful degradation and clear error recovery paths may be especially important for first impressions.

Addressing underserved query types. Users expect Asta to function as a general-purpose agentic system. Content generation queries, temporal constraints, data resource requests, and citation format specifications all show lower satisfaction (see "Limitations and generalizability" below for the underlying odds ratios), and the behavioral patterns underlying these failures are striking. Users submit template-filling queries, explicit prompting instructions, and collaborative writing workflows (Table 10), all of which presuppose a general-purpose conversational agent rather than a task-specific retrieval tool.
At the same time, Complex Contextual Narrative queries, where users provide extensive context by pasting from drafts or describing their research situation in detail, are among the most successful query types on SQA (OR = 1.47), suggesting that when the interface allows users to supply rich context, the system can effectively leverage it.

Limitations and generalizability. Our behavioral findings are drawn from a single representative system, Asta, and may be skewed toward query types that this system happens to address well. To help quantify this potential bias, we estimate which queries Asta handles more effectively today by fitting logistic regression models predicting click-through rate (CTR) from query attributes for each tool, with user features as controls (see Appendix D for a justification of using CTR as a success surrogate). For PF, Citation/Evidence Finding (OR = 1.17) and Broad Topic Exploration (OR = 1.12) queries have higher click odds, while Content Generation and Expansion (OR = 0.47), Data Resource Availability (OR = 0.61), and Temporal Constraint (OR = 0.82) queries have substantially lower click-through odds. For SQA, Concept Definition and Explanation (OR = 1.29) and Complex Contextual Narrative (OR = 1.47) queries have higher click odds, while Citation Format Specification (OR = 0.62) queries have lower click odds. See Section E.1 for the full analysis, odds ratio tables, and coefficient visualizations.

7 Conclusions and Future Work

We have presented the Asta Interaction Dataset, interaction logs from PF and SQA, two LLM-based research assistants deployed within Asta. While both tools follow standard design patterns for this class of system, our findings may not generalize to tools with substantially different retrieval scope, interaction modality, or optimization objectives. We view our released dataset, taxonomy, and behavioral analysis as a starting point for cross-system comparisons and more targeted experiments.
Future work will examine follow-up queries and user journeys over time, tracking how users refine their queries within and across sessions and how their mental models of system capabilities evolve with experience. We also plan to investigate cross-tool usage patterns, characterizing how users move between PF and SQA within research workflows and what triggers transitions between search-oriented and report-oriented tools.

References

Allen Institute for Artificial Intelligence. Introducing ai2 paper finder. https://allenai.org/blog/paper-finder, 2025.

Anthropic. Claude takes research to new places. https://www.anthropic.com/news/research, April 2025.

Ruth Appel, Peter McCrory, Alex Tamkin, Miles McCain, Tyler Neylon, and Michael Stern. Anthropic economic index report: Uneven geographic and enterprise ai adoption, 2025.

Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. State of ai: An empirical 100 trillion token study with openrouter, 2026. URL https://arxiv.org/abs/2601.10088.

Andrei Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, September 2002. ISSN 0163-5840. doi: 10.1145/792550.792552. URL https://doi.org/10.1145/792550.792552.

B. Barla Cambazoglu, Leila Tavakoli, Falk Scholer, Mark Sanderson, and Bruce Croft. An intent taxonomy for questions asked in web search. In Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, CHIIR '21, pp. 85–94, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380553. doi: 10.1145/3406522.3446027. URL https://doi.org/10.1145/3406522.3446027.

Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use chatgpt. Working Paper 34255, National Bureau of Economic Research, September 2025. URL http://www.nber.org/papers/w34255.

Elicit. Elicit: The ai research assistant. https://elicit.com, 2024.
Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan Dumais, and Thomas White. Evaluating implicit measures to improve web search. ACM Trans. Inf. Syst., 23(2):147–168, April 2005. ISSN 1046-8188. doi: 10.1145/1059981.1059982. URL https://doi.org/10.1145/1059981.1059982.

Google. Try deep research and our new experimental model in gemini, your ai assistant. https://blog.google/products/gemini/google-gemini-deep-research/, December 2024.

Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, Jared Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, Kevin K. Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli. Which economic tasks are performed with ai? evidence from millions of claude conversations, 2025. URL 04761.

Bernard J. Jansen and Danielle Booth. Classifying web queries by topic and user intent. In CHI '10 Extended Abstracts on Human Factors in Computing Systems, CHI EA '10, pp. 4285–4290, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781605589305. doi: 10.1145/1753846.1754140. URL https://doi.org/10.1145/1753846.1754140.

Bernard J. Jansen, Danielle L. Booth, and Amanda Spink. Determining the informational, navigational, and transactional intent of web queries. Information Processing & Management, 44(3):1251–1266, 2008. ISSN 0306-4573. doi: 10.1016/j.ipm.2007.07.015. URL https://www.sciencedirect.com/science/article/pii/S030645730700163X.

Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pp. 154–161, New York, NY, USA, 2005. Association for Computing Machinery. ISBN 1595930345. doi: 10.1145/1076034.1076063. URL https://doi.org/10.1145/1076034.1076063.
Carolin Kaiser, Jakob Kaiser, Rene Schallner, and Sabrina Schneider. A new era of online search? a large-scale study of user behavior and personal preferences during practical search tasks with generative ai versus traditional search engines. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA '25, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400713958. doi: 10.1145/3706599.3720123. URL https://doi.org/10.1145/3706599.3720123.

Hyunwoo Kim, Yoonseo Choi, Taehyun Yang, Honggu Lee, Chaneon Park, Yongju Lee, Jin Young Kim, and Juho Kim. Using llms to investigate correlations of conversational follow-up queries with user satisfaction, 2024.

Youngho Kim, Ahmed Hassan, Ryen W. White, and Imed Zitouni. Modeling dwell time to predict click-level satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14, pp. 193–202, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450323512. doi: 10.1145/2556195.2556220. URL https://doi.org/10.1145/2556195.2556220.

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations – democratizing large language model alignment, 2023.

Zhehui Liao, Maria Antoniak, Inyoung Cheong, Evie Yu-Yen Cheng, Ai-Heng Lee, Kyle Lo, Joseph Chee Chang, and Amy X. Zhang. Llms as research tools: A large scale survey of researchers' usage and perceptions, 2024.

Jiqun Liu, Jamshed Karimnazarov, and Ryen W. White. Trapped by expectations: Functional fixedness in llm-enabled chat search, 2025.

OpenAI. Introducing deep research.
https://openai.com/index/introducing-deep-research/, February 2025.

Perplexity AI. Perplexity: Ask anything. https://www.perplexity.ai, 2024.

Jason Phang, Michael Lampe, Lama Ahmad, Sandhini Agarwal, Cathy Mengying Fang, Auren R. Liu, Valdemar Danry, Eunhae Lee, Samantha W. T. Chan, Pat Pataranutaporn, and Pattie Maes. Investigating affective use and emotional well-being on chatgpt, 2025.

Daniel E. Rose and Danny Levinson. Understanding user goals in web search. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pp. 13–19, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 158113844X. doi: 10.1145/988672.988675. URL https://doi.org/10.1145/988672.988675.

Chirag Shah, Ryen White, Reid Andersen, Georg Buscher, Scott Counts, Sarkar Das, Ali Montazer, Sathish Manivannan, Jennifer Neville, Nagu Rangan, Tara Safavi, Siddharth Suri, Mengting Wan, Leijie Wang, and Longqi Yang. Using large language models to generate, validate, and apply user intent taxonomies. ACM Trans. Web, 19(3), August 2025. ISSN 1559-1131. doi: 10.1145/3732294. URL https://doi.org/10.1145/3732294.

Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, and Sergey Feldman. Ai2 scholar qa: Organized literature synthesis with attribution, 2025.

Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740.

Sofia Eleni Spatharioti, David Rothschild, Daniel G Goldstein, and Jake M Hofman. Effects of llm-based search on decision making: Speed, accuracy, and overreliance.
In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI '25, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400713941. doi: 10.1145/3706598.3714082. URL https://doi.org/10.1145/3706598.3714082.

Alex Tamkin, Miles McCain, Kunal Handa, Esin Durmus, Liane Lovitt, Ankur Rathi, Saffron Huang, Alfred Mountfield, Jerry Hong, Stuart Ritchie, Michael Stern, Brian Clarke, Landon Goldberg, Theodore R. Sumers, Jared Mueller, William McEachen, Wes Mitchell, Shan Carter, Jack Clark, Jared Kaplan, and Deep Ganguli. Clio: Privacy-preserving insights into real-world ai use, 2024.

Jiayin Wang, Weizhi Ma, Peijie Sun, Min Zhang, and Jian-Yun Nie. Understanding user experience in large language model interactions, 2024. URL 2401.08329.

Renjun Xu and Jingwen Peng. A comprehensive survey of deep research: Systems, methodologies, and applications, 2025.

Jeremy Yang, Noah Yonack, Kate Zyskowski, Denis Yarats, Johnny Ho, and Jerry Ma. The adoption and usage of ai agents: Early evidence from perplexity, 2025. URL https://arxiv.org/abs/2512.07828.

You.com. You.com: The ai search engine you control. https://you.com, 2024.

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild, 2024a. URL 01470.

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild, 2024b. URL 01470.

Appendix A Methods Details

A.1 Query Analysis Pipeline

Our preprocessing pipeline includes bot and canned query filtering, session identification, action debouncing, and PII removal. We limit analysis to a stable deployment period and focus on single-query behavior. Session boundaries are set based on a 45-minute UI action timeout period.
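The 45-minute inactivity rule can be sketched with pandas; the `user_id` and `ts` column names here are illustrative (the public release drops user identifiers, so this reflects the internal preprocessing step rather than the released files):

```python
import pandas as pd

def assign_sessions(events: pd.DataFrame, timeout_min: int = 45) -> pd.DataFrame:
    """Assign session IDs: a new session starts whenever the gap between
    consecutive UI actions by the same user exceeds `timeout_min` minutes."""
    events = events.sort_values(["user_id", "ts"]).copy()
    gap = events.groupby("user_id")["ts"].diff()
    new_session = gap.isna() | (gap > pd.Timedelta(minutes=timeout_min))
    events["session_id"] = new_session.cumsum()
    return events

log = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "ts": pd.to_datetime([
        "2025-01-01 09:00", "2025-01-01 09:20",  # same session (20-minute gap)
        "2025-01-01 11:00",                      # new session (100-minute gap)
        "2025-01-01 09:05",
    ]),
})
print(assign_sessions(log)["session_id"].tolist())  # → [1, 1, 2, 3]
```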
Action debouncing is performed for page revisits: we only count a page as revisited if the return occurred more than 5 minutes after the initial visit. To preserve privacy, we remove any queries flagged by an LLM as possibly containing PII. Since we leave follow-up query behavior to future work, we focus primarily on user behavior with respect to a single query at a time. Specifically, we restrict PF analysis to the first query in each PF conversation, as neither SQA nor S2 supported chat-style follow-up queries during the data collection period. We have removed any notion of user ID from our data release.

We use LLMs throughout our analysis pipeline: for query labeling (intent, phrasing, criteria, field of study), complexity assessment, duplicate detection, feedback classification, and response quality evaluation. All prompts are provided in Appendix H, with label definitions and few-shot examples in Appendix G. Using the taxonomy, we label a random subset of 30,000 queries. For each query aspect, we prompt GPT-4.1 with the query text, possible labels, descriptions, and in-context examples, using structured decoding to ensure valid outputs. Labels are non-mutually exclusive; a query might receive both Broad Topic Exploration and Methodological Guidance for intent.

A.2 Comparison Between Tools

Since PF is naturally a multi-turn chat experience whereas S2 and SQA support neither multi-turn conversations nor contextualized follow-up queries, we limit our analysis in this study to the first query initiated in a PF conversation. This may skew the query distribution. Follow-up queries are left to future work.

A.3 Statistical Analysis

Our study explores many aspects of user behavior; we use frequentist hypothesis testing to identify statistically significant findings and focus on those with the largest effect sizes.
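As a concrete, self-contained sketch of this style of testing (using a two-proportion z-test as the rate analogue of a two-sided comparison, and a Wilson interval for a rate; the counts below are made up for illustration):

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(successes: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (e.g., a click-through rate)."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

def two_proportion_z(s1: int, n1: int, s2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in rates between two groups."""
    p_pool = (s1 + s2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (s1 / n1 - s2 / n2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

lo, hi = wilson_ci(150, 1000)                  # a 15% rate observed on 1,000 queries
z, p = two_proportion_z(150, 1000, 110, 1000)  # 15% vs 11% across two segments
print(f"95% Wilson CI: [{lo:.3f}, {hi:.3f}]; z = {z:.2f}, p = {p:.4f}")
```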
All reported differences between groups (e.g., rates across tools, user segments, or query types) are statistically significant at α = 0.05 using two-sided t-tests. We report 95% confidence intervals (Wilson CIs for rates, bootstrap for unbounded values), showing the larger bound as a single ± value.

To identify query characteristics associated with successful result engagement, we fit binomial logistic regression models predicting click-through rate (CTR) for each tool. In this context, a click is a binary outcome indicating whether a user clicked on at least one Semantic Scholar paper result following their query. Query-level covariates were derived from LLM-based multilabel classifications of each query, expanded into binary indicators (including those from the query taxonomy). To control for heterogeneity in usage patterns with the tools between users, we included user history statistics as covariates: the number of previous queries issued by the user, cumulative prior clicks, and an empirical Bayes-smoothed estimate of the user's historical click rate. Separate models were fit for SQA and PF (since each tool exposes a different interface) using maximum likelihood estimation with a logit link (n = 30,000 for each model). To control the false discovery rate when testing the significance of individual coefficients, we applied the Benjamini-Hochberg procedure across all p-values from both models.

Appendix B Abstractiveness Analysis

The transition to natural language queries, together with users' expectations from modern AI systems, increased the use of abstract intents, meaning that users shifted from relying on jargon terms to expressing their information needs more abstractly. For example, a query like "BERT fine-tuning instability" increasingly appears in forms such as "why some language models behave unpredictably when trained further", where the specific technical term is replaced by an abstract description of the underlying intent.
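Given per-query annotations of this kind (a query length and a count of abstract concepts, both hypothetical numbers here, not taken from the dataset), the summary statistics used in this appendix reduce to a share and a Pearson correlation; a minimal standard-library sketch:

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-query annotations: query length in tokens and the number
# of abstract (non-jargon) concepts an LLM labeled in each query.
query_len = [3, 5, 12, 20, 8, 30]
n_abstract = [0, 0, 1, 2, 1, 3]

# Share of queries with at least one abstract concept, and the
# length/abstractness correlation.
share_with_abstract = sum(n > 0 for n in n_abstract) / len(n_abstract)
print(round(share_with_abstract, 2), round(pearson(query_len, n_abstract), 2))
```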
We measured the abstractiveness of queries by classifying abstract concepts versus jargon terms using an LLM. We found that 38% of S2 queries include at least one abstract concept, compared with 66% in PF. The median number of abstract concepts per query is 0 in S2 and 1 in PF, reinforcing the same trend. We also observe a positive correlation between query length and the number of abstract concepts it contains (Pearson r = 0.519). Since PF queries tend to be longer, this suggests that users now express their intents through more elaborate, abstract descriptions, while S2 queries were shorter and more densely packed with jargon. This substantial gap indicates that query complexity increased not only due to the higher rate of entities and relations, but also because modern queries deviate more from the scientific jargon anchored in the documents themselves.

Appendix C SQA Section Navigation Analysis

Figure 4 shows the distribution of section expansions. Users often start from the first section (index 0), and the last section expanded in a report is typically between index 1 and 4. Given that the first section is almost always an introduction to the topic, it is notable to see users frequently starting with the second section instead. We do still see the expected position bias towards the first sections of the report, with the total number of expansions on each section index over all reports decreasing with index number after index 1; the tendency to skip the introduction appears to be strong enough that index 1 is the most commonly expanded section.

Figure 4: Section expansion distribution (on a log scale) showing which sections users expand first in SQA responses. Section index 1 has the largest number of expansions. Users tend to start on section 0 or 1 and end on a section between 2–4.
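Statistics like these can be recomputed from the released section-expansion table (columns thread_id, section_expand_ts, and section_id; see Appendix F). A minimal sketch with toy rows in place of the parquet file:

```python
import pandas as pd

# Toy rows standing in for section_expansions_anonymized.parquet
# (real columns: thread_id, section_expand_ts, section_id).
exp = pd.DataFrame({
    "thread_id": ["a", "a", "a", "b", "b"],
    "section_expand_ts": pd.to_datetime([
        "2025-01-01 10:00", "2025-01-01 10:02", "2025-01-01 10:05",
        "2025-01-01 11:00", "2025-01-01 11:01",
    ]),
    "section_id": [1, 2, 0, 0, 3],
})

# First and last section expanded in each report, in chronological order.
ordered = exp.sort_values(["thread_id", "section_expand_ts"])
first_expanded = ordered.groupby("thread_id")["section_id"].first()
last_expanded = ordered.groupby("thread_id")["section_id"].last()
print(first_expanded.to_dict(), last_expanded.to_dict())
```

Aggregating `first_expanded` and `last_expanded` over all reports gives the distributions plotted in Figure 4.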
To better understand user navigation behavior and the importance of sequential generation of the report content, we examined user navigation between sections. Figure 5 shows the section transition counts, specifically how many times a user expanded section j after having just expanded section i. Notably, the upper off-diagonal is bright, indicating sequential expansions, but users also exhibit the behavior of closing and reopening a section (presumably to read the TL;DR, which is only visible when the section is collapsed). There is also notable section-skipping behavior, and a bright lower off-diagonal indicating sequential expansion backwards. Also notable is that users often return to section 0 (the introduction section) regardless of the section they are currently on. Overall, users primarily move sequentially through the report, with some skipping behavior and backwards navigation.

Appendix D CTR Validation as Success Metric

While thumbs up/down feedback would directly indicate satisfaction, it is sparse (less than 2% of reports; see Figure 11) and available only for self-selecting users. We therefore use link clicks as our success proxy: they occur on at least 15% of PF reports and correlate more strongly with repeat usage (users who click links are more likely to return than those who provide thumbs up feedback; see Figure 13 for the full analysis). We also considered using PF evidence clicks but chose link clicks to maintain parity between tools.

LLM-assessed quality of the report generated by SQA has a substantial correlation with success metrics. Reports assessed as high quality have an average CTR of 5.5% compared to 3.8% for reports assessed as low quality, a relative increase of 44%. Similarly, the return rate of reports assessed as high quality is 62.3% versus 54.7% for low-quality reports.
The relative size of these differences validates the importance of response quality, though the absolute change suggests that quality, at least as estimated by an LLM judge, is only one factor affecting user behavior.

Figure 5: Section transition heatmap showing in-order vs out-of-order reading patterns in SQA. Sequential expansion is the dominant behavior, but there is notable backward traversal behavior as well as return-to-0 (the introduction) behavior. Users also close and then reopen a section (presumably to read the TL;DR, which is only visible when a section is collapsed).

Appendix E Additional figures and tables

E.1 Query characteristics and success metrics

To understand which types of queries current systems handle well and where they struggle, we fit logistic regression models predicting click-through rate (CTR) from query intent, phrasing style, search criteria, and field of study, with user features as controls (see Appendix D for CTR validation). Tables 15 and 16 report the resulting odds ratios (OR > 1 indicates higher click odds). We focus on query aspects that are independent of field of study (intent, phrasing, and criteria) because these have direct implications for system design, whereas field effects likely reflect properties of the user population or corpus coverage (see Table 17 for statistically significant field coefficients). Since PF and SQA are designed for different purposes, we analyze each tool separately (see Tables 13 and 14 for example queries).

PF struggles with Content Generation and Expansion queries (which it is not designed for) and queries with Temporal Constraints or Data Resource requirements such as publicly available datasets (which may be unsatisfiable). On the flip side, PF performs well on queries it was designed for: Citation/Evidence Finding and Broad Topic Exploration queries both have higher click odds. SQA shows an analogous pattern.
Concept Definition and Explanation queries have higher click odds, as SQA was designed with them in mind. Complex Contextual Narrative queries also have high click odds, suggesting users achieve success with complex queries that would typically fail using traditional IR tools. Citation Format Specification queries have lower click odds because SQA uses a single fixed citation format rather than adapting to user-specified styles.

Figure 6: Distribution of coarse-grained fields of study across PF, SQA, and S2.

Figure 7: Reading order shown through section flow Sankey diagram, illustrating user navigation patterns through SQA response sections.

Appendix F Dataset Schema

The released dataset comprises six parquet files. Each query generates a report, identified by its thread_id. All files can be joined on thread_id.

F.1 optin_queries_anonymized.parquet

User queries submitted to SQA and PF.

Figure 8: The fraction of queries from each field of study that have a given intent on Asta. The distribution reflects the research tasks common in the given fields.

Figure 9: Action engagement trends by query index showing how users perform different actions as they gain experience with the system. Shaded regions represent 95% confidence intervals. PF reports tend to receive less click engagement over time compared to SQA reports, which grow in engagement; the likely reason is that PF results can be consumed passively without interacting with the web content at all, whereas most of the content generated by SQA can only be accessed after clicking on the web page.

Column     Type       Description
query      string     The user's query text
thread_id  string     Hashed report identifier
query_ts   timestamp  Query submission time
tool       string     Tool used: sqa or pf

Figure 10: Distribution of text feedback categories reflecting user expectations and experiences.
The comprehensiveness of the response is the most common complaint (Lack of Depth and Citation & Referencing Issues).

Figure 11: Fraction of reports on which the given action has been performed across PF and SQA, showing how users engage with different features. Section expansion actions are the most common, likely because users must click to expand the section text in the SQA response. Note that link clicks are much more common than thumbs up/down feedback; below we also show that link clicks are strongly associated with user satisfaction.

Figure 12: Survival analysis showing the Kaplan-Meier curve associated with various actions that users performed during their first visit to Asta. Users who did not perform any action take longer to return to Asta than those who do.

Figure 13: Action return rates showing the probability of users returning after different action types. (a) Return rates of first-time users: users who give a thumbs up are likely to return, while users who perform no action are the least likely to return. (b) Return rates for users after they have initiated at least one query: link clicks are at least as good an indicator of return as thumbs up, while page select, an action that suggests frustration, is associated with non-return. Actions of first-time users are likely more affected by a novelty effect.

Figure 14: Estimated coefficients for the linear models predicting clicks and return on PF and SQA. (a) Coefficients for the PF model predicting S2 clicks (left) and user return (right). (b) Coefficients for the SQA model predicting S2 clicks (left) and user return (right). Coefficients are shown with 95% confidence intervals. Only coefficients significant after Benjamini-Hochberg correction are displayed.

F.2 section_expansions_anonymized.parquet

Records of users expanding sections in SQA reports (sections are collapsed by default).
Column             Type       Description
thread_id          string     Hashed report identifier
section_expand_ts  timestamp  Expansion time
section_id         int        Index of expanded section (0-indexed)

F.3 s2_link_clicks_anonymized.parquet

Clicks on Semantic Scholar paper links within either tool.

Column            Type       Description
thread_id         string     Hashed report identifier
s2_link_click_ts  timestamp  Click time
corpus_id         int        Semantic Scholar corpus ID
tool              string     Tool: sqa or pf

F.4 report_section_titles_anonymized.parquet

Section titles from SQA generated reports.

Column         Type    Description
thread_id      string  Hashed report identifier
section_idx    int     Section index (0-indexed)
section_title  string  Title of the section

F.5 report_corpus_ids_anonymized.parquet

Papers cited in SQA report sections.

Column     Type    Description
thread_id  string  Hashed report identifier
corpus_id  int     Semantic Scholar corpus ID

F.6 pf_shown_results_anonymized.parquet

Papers shown in PF search results.

Column           Type       Description
thread_id        string     Hashed report identifier
query_ts         timestamp  Query submission time
result_position  int        Position in results list (0-indexed)
corpus_id        int        Semantic Scholar corpus ID

Appendix G Label definitions and few-shot examples

This section provides comprehensive definitions for all query intents, search criteria, and phrasing styles identified in our taxonomy for both ScholarQA (SQA) and Paper Finder (PF). Each category includes a description and representative examples from actual user queries. Query intent taxonomies are presented in Tables 21–23 (SQA) and Tables 24–26 (PF). Search criteria taxonomies are presented in Table 27 (SQA) and Table 28 (PF). Query phrasing taxonomies are presented in Table 29 (SQA) and Table 30 (PF). The 28 fields of study used for classification are listed in Table 31.

Appendix H LLM Prompts

This section documents the prompts used for LLM-based classification and analysis tasks in this paper.
Each prompt is accompanied by the Pydantic schema used for structured output decoding. H.1 Query Complexity Analysis This prompt extracts structural components (clauses, constraints, entities, and r elations) from user queries to measur e query complexity . A n a l y z e t h e f o l l o w i n g q u e r y a n d d e c o m p o s e i t i n t o i t s s t r u c t u r a l c o m p o n e n t s : Q u e r y : " { q u e r y } " P l e a s e i d e n t i f y : 1 . * * C l a u s e s * * : I n d e p e n d e n t c l a u s e s o r m a i n s t a t e m e n t s i n t h e q u e r y ( e . g . , s e p a r a t e q u e s t i o n s , c o m m a n d s , o r s t a t e m e n t s ) 2 . * * C o n s t r a i n t s * * : S p e c i f i c c o n s t r a i n t s o r c r i t e r i a m e n t i o n e d ( e . g . , d a t e r a n g e s , p u b l i c a t i o n v e n u e s , a u t h o r n a m e s , m e t h o d o l o g i c a l r e q u i r e m e n t s ) 3 . * * E n t i t i e s * * : N a m e d e n t i t i e s o r c o n c e p t s r e f e r e n c e d ( e . g . , s p e c i f i c t h e o r i e s , m e t h o d s , d i s e a s e s , t e c h n o l o g i e s , p e o p l e , p l a c e s ) 4 . * * R e l a t i o n s * * : R e l a t i o n s h i p s b e t w e e n e n t i t i e s t h a t a r e m e n t i o n e d o r i m p l i e d ( e . g . , " X i n f l u e n c e s Y " , " A i s a t y p e o f B " , " C c a u s e s D " ) F o r e a c h c a t e g o r y , p r o v i d e a l i s t o f d i s t i n c t i t e m s . I f a c a t e g o r y d o e s n ' t a p p l y t o t h e q u e r y , r e t u r n a n e m p t y l i s t . Output Schema (Pydantic): c l a s s Q u e r y C o m p l e x i t y A n a l y s i s ( B a s e M o d e l ) : c l a u s e s : L i s t [ s t r ] = F i e l d ( d e s c r i p t i o n = " L i s t o f i n d e p e n d e n t c l a u s e s i n t h e q u e r y " ) c o n s t r a i n t s : L i s t [ s t r ] = F i e l d ( 21 Preprint. Under review . 
```python
        description="List of constraints or criteria specified")
    entities: List[str] = Field(
        description="List of named entities referenced in the query")
    relations: List[str] = Field(
        description="List of relations between entities")
```

Preprint. Under review.

Table 6: Query intent distribution (% of queries) with random examples.

| Tool | Intent | % | Example |
|---|---|---|---|
| SQA | Broad Topic Expl. | 51.6 | What are the types of approaches to employer branding? |
| SQA | Concept Def. & Expl. | 28.2 | Digital Banking BI Maturity Models |
| SQA | Causal & Rel. Inquiry | 19.1 | why evaluation as a construct of metacognition is very important for writing competence to emerge? |
| SQA | Specific Factual Retr. | 12.6 | what do say paper of sharma2019 and kamga 2016 about resistace in chemotherapy in AML IN NOTCH pathway |
| SQA | Method. & Proc. Guid. | 9.1 | design non-stationnary wavelets trough [...] bezout's equation |
| SQA | Comparative Analysis | 7.3 | conduct an expansive analysis on the methods for and efficiency of using acoustic analysis along with machine learning for forest fire detection [...] |
| SQA | Acad. Doc. Drafting | 6.2 | Write a [...] section for a master's thesis titled "The Role of Digital Twin Technology in Improving Construction Project Delivery" |
| SQA | Cit./Evid. Finding | 5.7 | [Chinese: I want to prove that higher MFN brainwave amplitude may be due to higher cognitive load or cognitive conflict] |
| SQA | Focused Acad. Synth. | 5.6 | SIR Model Study (papers from 2020 to 2024). |
| SQA | Res. Gap & Limit. Anal. | 5.2 | are there studies that look into robustness to reward noise on domains other than math for RLVR? |
| SQA | App. Inquiry | 4.2 | how can a new leader on a unit reduce staff burnout |
| SQA | Tool & Res. Discovery | 2.3 | Video to urdf format |
| SQA | Complex Cross-Paper Synth. | 2.1 | start with Gestational diabetes mellitus and reach to impaired insulin signalling in placenta and mediated role of inostols |
| SQA | Prob. Solving & Ideation | 1.7 | What research is done into generating scientific ideas, specifically predicting the next paper based on the previous citations using LLMs |
| SQA | Data Interp. & Anal. | 1.1 | FTO was electrochemically etched using HCl (2M) and zinc acetate (0.03M) at different etching time... the etching rate decrease exponentially with time... why? |
| SQA | Content Gen. & Exp. | 1.1 | [Arabic: rewrite and restructure methodology section for doctoral dissertation on IT and event management] |
| SQA | Specific Paper Retr. | 0.7 | Jayasinghe, A 2014, "Broadlands power project will kill Kitulgala's white water rafting" |
| PF | Broad Topic Expl. | 65.0 | Enzymatic electrolysis |
| PF | Concept Def. & Expl. | 23.5 | iot framework |
| PF | Causal & Rel. Inquiry | 12.0 | [lit review for underrepresented learners in rural Nigerian communities] |
| PF | Method. & Proc. Guid. | 8.9 | [PhD proposal guidance for autonomous satellite control to avoid space debris using AI] |
| PF | Cit./Evid. Finding | 7.7 | Could you help me find a few CNS-level articles that discuss the knockout of DRP1 in cells and then its overexpression? |
| PF | App. Inquiry | 5.0 | [Chinese: femtosecond laser + sapphire substrate, manipulating nanosheets] |
| PF | Focused Acad. Synth. | 4.5 | Influence of Contaminated Ammonium Nitrate on Detonation Behaviour of Bulk Emulsion Explosives and Numerical Analysis of Detonation-Induced Damage Zone |
| PF | Specific Paper Retr. | 4.1 | https://onlinelibrary.wiley.com/doi/10.1002/pds.5880 |
| PF | Comparative Analysis | 3.8 | [German: digital service platforms + automated assessments, functional & non-functional requirements] |
| PF | Tool & Res. Discovery | 2.9 | Custom query languages for graphs that [...] |
| PF | Specific Factual Retr. | 2.8 | [Chinese: metastatic TNBC first-line chemotherapy median progression-free survival (mPFS) is only 5-6 months] |
| PF | Complex Cross-Paper Synth. | 2.5 | can you find papers that criticize Kilian and Vigfusson 2011, 2013 work and confirm the presence of asymmetries or nonlinearities in the oil-macroeconomy relationship |
| PF | Res. Gap & Limit. Anal. | 2.3 | [systematic review abstract on conversational agent interventions for physical and psychological symptom management] |
| PF | Prob. Solving & Ideation | 1.1 | [Portuguese: write a thesis proposal on spatial geometry with a didactic intervention in high school] |
| PF | Data Interp. & Anal. | 0.6 | [...] all models consistently struggled to differentiate the Moderate Risk group [...] individuals in moderate states exhibit overlapping characteristics with both low and high-risk groups |
| PF | Content Gen. & Exp. | 0.6 | [Portuguese: write and compile thesis on drumstick leaves bioavailability] |

Table 7: Query criteria distribution (% of queries). Methodology-related criteria are most common on Asta.

| Criteria | Asta | S2 |
|---|---|---|
| Methodology | 42 | 29 |
| Pub. Quality | 11 | 3 |
| Metadata | 9 | 12 |
| Temporal | 5 | 2 |
| Data Avail. | 1 | 0.4 |
| Cit. Impact | 0.8 | < 0.1 |

Table 8: Common query characteristics observed in Asta.

| Characteristic | Description |
|---|---|
| Keyword-style Query (phrasing style) | Short, often fragmentary queries resembling search engine keywords or subject headings. No verbs or complete sentences; typically a list of nouns or concepts separated by spaces or simple punctuation. |
| Methodology-Specific Criteria (query criteria) | Queries where the user requires specific methods, approaches, analytical techniques, or experimental designs to be present in the search results. This can include demands for computational models, experimental paradigms, or meta-analyses. |

Table 9: Example user feedback by category. Some users seem to expect the feedback submission to provide a multiturn chat experience.

| Category | Example Feedback |
|---|---|
| Response Quality | develop more the introduction and add more citations (references) |
| Response Quality | this is leaving out some key details from Magalhaes et al 2013 |
| Response Quality | Out of date |
| Query Refinement | Its very good but I need to add a historical component |
| Feature Requests | would be a great place to a diagram |
| Feature Requests | amazing! Coudl you create a PDF |
| Formatting Preferences | too dense. I wanted bullet points |

Table 10: Examples of interesting usage patterns showing how users probe the capabilities of LLM-powered research tools. Users go beyond basic IR tasks and expect Asta to work as a collaborative research assistant.

| Pattern | Tool | Example Query |
|---|---|---|
| Template Filling | PF | fill this tabel with 10 jurnal bellow:... [table template and citations] |
| Template Filling | SQA | for sacubitril find all: "IUPAC Name: CAS Number: Molecular Formula:..." [15+ fields] |
| Prompting | SQA | You are an expert research assistant specializing in computational geosciences and machine learning. |
| Prompting | PF | Find papers...The model **must** be capable of... |
| Persona Adoption | SQA | Think of yourself as experienced professor...Please write me a phd proposal...devour Turnitin detection bots |
| Collaborative Writing | SQA | I'm working on my paper... [LaTeX section] add papers from TSE, TOSEM, ICSE |
| Collaborative Writing | PF | [Chinese paragraph] Help me find references...tell me which sentences cite which |
| Research Lineage | PF | What are latest advances in research fields of these three papers? [3 DOIs] |
| Refinding | PF | ...paper using BERT that says we cant just look at top-k...which paper says this |
| Refinding | PF | hey whats the name of the paper that did a study on how people use llms by allowing the public to use their tokens on paid llms... |

Table 11: Comparison of duplicate query submission versus report revisiting behavior.

| Metric | SQA | PF |
|---|---|---|
| **User prevalence** | | |
| Users with duplicate queries | 18.8% | 14.8% |
| Users who revisited reports | 50.5% | 42.1% |
| **Temporal patterns** | | |
| Median time between duplicates | 15.9 min | 5.8 min |
| Median time between revisits | 3.8 hours | 5.9 hours |
| **Short-term occurrence (< 1 hour)** | | |
| Duplicates within 1 hour | 66.7% | 72.9% |
| Revisits within 1 hour | 30.5% | 23.5% |
| **Medium-term occurrence (< 1 day)** | | |
| Duplicates within 1 day | 81.2% | 83.5% |
| Revisits within 1 day | 72.2% | 71.2% |
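Table 11's duplicate-submission gaps can be derived from raw interaction logs by grouping submissions per user and normalized query text, then taking the median gap between consecutive matches. The sketch below is illustrative only: the tuple-based log layout, the function name, and the normalization rule (lowercasing plus whitespace collapse) are assumptions, not the released dataset's schema or the paper's matching criterion.

```python
from datetime import datetime
from statistics import median

def duplicate_gaps_minutes(events):
    """Median gap in minutes between successive submissions of the same
    normalized query text by the same user.

    `events`: iterable of (user_id, iso_timestamp, query_text) tuples,
    a hypothetical flattening of an interaction log.
    """
    last_seen = {}  # (user, normalized query) -> datetime of last submission
    gaps = []
    for user, ts, query in sorted(events, key=lambda e: e[1]):
        key = (user, " ".join(query.lower().split()))
        t = datetime.fromisoformat(ts)
        if key in last_seen:
            gaps.append((t - last_seen[key]).total_seconds() / 60)
        last_seen[key] = t
    return median(gaps) if gaps else None
```

A stricter exact-string match, or a looser semantic match, would shift the duplicate counts; the table itself does not specify which rule the authors used.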
Table 12: Representative examples of duplicate queries showing exact duplicates and incremental refinements. Time gaps indicate minutes between submissions.

| Tool | Gap | First Query | Second Query |
|---|---|---|---|
| SQA | 4.0 min | History of experiential learning of science | History of experiential learning of science |
| SQA | 1.4 min | postbiotics food | postbiotics in food industry |
| SQA | 6.4 min | [Long query requesting literature review] | [Same query + "write this in future tense"] |
| PF | 4.1 min | Item-based collaborative filtering recommendation algorithm | Item-based collaborative filtering recommendation algorithm |
| PF | 0.9 min | find me papers that experiment with dropping entire LLM blocks | find me papers that have to do with dropping entire blocks from decoder transformer LLMs |
| PF | 3.4 min | [Chinese query about gated fusion attention] | [Same + "respond in Chinese"] |

Table 13: Example PF queries by type. PF performs well on Citation & Evidence Finding and Broad Topic Exploration queries but struggles with queries having Temporal Constraints, Data/Resource requirements, or Content Generation requests.

| Query Type | Example Query |
|---|---|
| Citation & Evidence Finding | Can you find for me explicitly mentioned in a paper when an anti-reflection coating was used to improve the temporal contrast of a high power laser due to it having to be at s polarization. |
| Broad Topic Exploration | find me articles on AI implementation in SMEs. |
| Temporal Constraints | What are the drawbacks of Intelligent tutoring systems from before they used LLMs |
| Data/Resource Availability | dataset with image and sun azimuth and sun elevation |
| Content Generation and Expansion | Create an article with the title "Material Strength Design on Excavator Arm Model" with references of 40 articles or journals |

Table 14: Example SQA queries by type. SQA excels at Concept Definition and Complex Contextual Narrative queries but struggles with Citation Format Specification requests.

| Query Type | Example Query |
|---|---|
| Concept Definition and Explanation | explain what multimodal and multisensor are and the differents betwen them |
| Complex Contextual Narrative | Recent evidence highlights the critical role of growth hormone secretagogue receptor signaling in hippocampal synaptic physiology, mediated through dopamine receptor activity... [describes detailed molecular mechanisms] |
| Citation Format Specification | Use APA style 7 for in-text citations, citing authors' names, publication dates, and page numbers. Write a theoretical framework of 17,000 words about schooling of rural girls... |

Table 15: Odds ratios for S2 link click on PF (95% CI; user features included as controls but not shown). Content Gen. & Exp. queries have approximately half the odds of a click (OR = 0.47), while Cit./Evid. Finding queries have 17% higher odds (OR = 1.17). Field of study coefficients are shown in Table 17. Only statistically significant effects are presented.

| Predictor | Group | OR |
|---|---|---|
| Cit./Evid. Finding | I | 1.17 ± 0.15 |
| Broad Topic Expl. | I | 1.12 ± 0.08 |
| Temporal Const. | C | 0.82 ± 0.12 |
| Data Res. Avail. | C | 0.61 ± 0.18 |
| Content Gen. & Exp. | I | 0.47 ± 0.24 |

Group: I = intent, C = criteria. OR = odds ratio.

Table 16: Odds ratios for S2 link click on SQA (95% CI; user features included as controls but not shown). Fewer predictors reach significance compared to PF. Complex Ctx. Narr. queries have 47% higher odds of a click (OR = 1.47), while Cit. Format Spec. queries have 38% lower odds (OR = 0.62).

| Predictor | Group | OR |
|---|---|---|
| Complex Ctx. Narr. | P | 1.47 ± 0.39 |
| Concept Def. & Expl. | I | 1.29 ± 0.17 |
| Cit. Format Spec. | P | 0.62 ± 0.25 |

Group: I = intent, P = phrasing. OR = odds ratio.

Table 17: Field of study odds ratios for S2 link click on PF (95% CI).
| Field of Study | OR |
|---|---|
| Law & Legal Studies | 1.25 ± 0.22 |
| Sociology | 1.24 ± 0.12 |
| Agricultural Sciences | 1.23 ± 0.16 |
| Electrical Engineering | 1.20 ± 0.13 |
| Clinical Medicine | 1.20 ± 0.12 |
| Business & Management | 1.20 ± 0.11 |
| Chemistry | 1.17 ± 0.13 |

Table 18: User return rate by LLM quality assessment of SQA reports. Users who received reports assessed as high-quality by an LLM are significantly more likely to return (p < 0.001, r = 0.052).

| LLM Assessment | n | Return Rate | 95% CI |
|---|---|---|---|
| Negative | 1,275 | 54.7% | [52.0%, 57.5%] |
| Positive | 8,725 | 62.3% | [61.3%, 63.3%] |

Table 19: S2 link click rate by LLM quality assessment of SQA reports. Users who received reports assessed as high-quality by an LLM are more likely to click on citation links (p = 0.01, r = 0.026).

| LLM Assessment | n | Click Rate | 95% CI |
|---|---|---|---|
| Negative | 1,373 | 3.9% | [3.0%, 5.0%] |
| Positive | 8,631 | 5.5% | [5.1%, 6.0%] |

Table 20: Confusion matrix comparing user feedback (thumbs up/down) with LLM quality assessment. The LLM achieves 73.9% accuracy, with higher agreement on positive assessments (F1 = 0.83) than negative ones (F1 = 0.47), reflecting the challenge of identifying subtle quality issues.

| User Feedback | LLM Positive | LLM Negative |
|---|---|---|
| Thumbs Up | 86 | 14 |
| Thumbs Down | 22 | 16 |

| Intent | Description | Examples |
|---|---|---|
| Broad Topic Exploration | The user wants a general overview, literature review, or summary of a broad academic field or topic without highly specific constraints. | “Conduct a literature review on the use of deep learning in focussed ultrasound stimulation” • “E-commerce” • “Theories of internalization” |
| Specific Factual Retrieval | The user is asking for a precise, factual piece of information, a specific data point, a statistic, or a historical detail. This intent is characterized by its specificity and aims for a direct, verifiable answer. | “what is the best wavelength to check the hemolysis in plasma samples?” • “What was the first study examine sexual double standards? When was the term coined?” • “find user statistics for UniProt” |
| Concept Definition and Explanation | The user seeks to understand the meaning, core principles, and fundamental aspects of a specific academic theory, concept, model, or term. | “What are reasoning LLMs?” • “overview of the Scholarly Primitive Theory by John unsworth” • “What is ‘Administration’ in Public Financial Administration” |
| Comparative Analysis | The user wants to understand the similarities, differences, advantages, and disadvantages between two or more concepts, theories, models, or methods. | “employer brand vs employer branding” • “What are the differences and similarity between information foraging theory and exploratory search?” • “Compare the concept of world order between the theory of realism, liberalism, constructivism...” |
| Methodological and Procedural Guidance | The user is looking for instructions, protocols, frameworks, or best practices for carrying out a specific research task, experiment, analysis, or procedure. | “what can I do to parse effectively PDFs for RAG?” • “how to extract dna from blood for sepsis molecular diagnosis” • “What are the best research questions related to the following research Gap?...” |

Table 21: Complete ScholarQA query intent taxonomy (part 1 of 3).

| Intent | Description | Examples |
|---|---|---|
| Tool and Resource Discovery | The user is searching for specific academic resources such as datasets, software tools, questionnaires, evaluation benchmarks, or code repositories relevant to their research. | “Question answering datasets” • “find the deepfake dataset which contains some real-world perturbations.” • “What is the best questionnaire to measure mathematics anxiety?” |
| Research Gap and Limitation Analysis | The user aims to identify the limitations, unsolved problems, or unexplored areas (gaps) within a specific field of research, often to justify a new study or find a novel research direction. | “Systematic reviews on AWE tools in EFL argumentative writing. What is the gap?” • “Limitations: What limitations are present in the theoretical literature on spatial poverty?” • “Review the SOTA and research gaps of organic neuromorphic computing in AI...” |
| Causal and Relational Inquiry | The user wants to understand the relationship, impact, effect, or influence of one variable or concept on another. These queries often explore cause-and-effect connections or correlations. | “Does sleep consistency benefit overall health” • “how do CEO power influence sustainability performance and financial performance of companies?” • “drivers of public support for cost-intensive policies” |
| Focused Academic Synthesis | The user requests a structured and constrained synthesis of academic literature, such as a systematic review, a meta-analysis, or a review limited to specific journals, timeframes, or methodologies. | “Future of the Labor Market and Higher Education ‘Only use articles from Q1 and Q2 journals.’” • “Write a systematic review on soybean production challenges and opportunities in Ethiopia” • “Generate a comprehensive, citation-rich literature review (2014–2025)...” |
| Academic Document Drafting | The user explicitly asks for the generation of a specific academic document or a section of one, such as a proposal, a chapter, an abstract, or a detailed review, often providing a structured template or detailed instructions. | “using APA 7 style with in text citation, write a literature review of about 2000 words...” • “make me a study justification for Impacted canines and its association with dental anomalies...” • “write a comprehensive theoretical framework for attention and sustained attention” |

Table 22: Complete ScholarQA query intent taxonomy (part 2 of 3).
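The headline numbers in Table 20's caption can be recomputed directly from its four confusion-matrix cells, and the same 2×2 also illustrates the odds-ratio quantity that Tables 15–17 report. Note the paper's odds ratios come from a regression with user-feature controls; the raw cross-product odds ratio below is just the textbook formula, shown for intuition.

```python
# Cells of Table 20: user feedback (rows) vs. LLM quality assessment (columns).
tp = 86  # thumbs up,   LLM positive
fn = 14  # thumbs up,   LLM negative
fp = 22  # thumbs down, LLM positive
tn = 16  # thumbs down, LLM negative

accuracy = (tp + tn) / (tp + fn + fp + tn)  # 102 / 138, the 73.9% in the caption

def f1(correct, n_predicted, n_actual):
    """F1 score from true-positive count, predicted count, and actual count."""
    precision = correct / n_predicted
    recall = correct / n_actual
    return 2 * precision * recall / (precision + recall)

f1_positive = f1(tp, tp + fp, tp + fn)  # agreement on positive assessments, ~0.83
f1_negative = f1(tn, tn + fn, tn + fp)  # agreement on negative assessments, ~0.47

# Raw 2x2 odds ratio (cross-product form). Tables 15-17 report the analogous
# quantity estimated by a logistic regression with controls, not this formula.
odds_ratio = (tp * tn) / (fn * fp)
```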
| Intent | Description | Examples |
|---|---|---|
| Ideation | The user presents a complex problem, a scenario, or a research objective and asks for potential solutions, innovative ideas, hypotheses, or new research directions. | “I want to do ‘quantization in MoE model inference’, find some related papers and recommend some ideas on this” • “how can the anthropomorphization of artificial intelligence influence teaching practices...” • “Has any research been done to more fully articulate the collaboration/cooperation required...” |
| Application Inquiry | User seeks research focused on solving a specific, practical problem or demonstrating the application of a theory or technology in a real-world context. | “The Effect of a Program Based on Using Zoom Application on Developing Level Two Students’ Business Vocabulary...” • “Golf shot/swing analysis and feedback for self training” • “research paper on benefit of construction machineries application” |
| Data Interpretation and Analysis Support | The user needs help with interpreting or analyzing specific data, results, or observations. This can include understanding statistical outputs, analyzing images/spectra, or making sense of experimental findings. | “analyze UV-Visible peak of HDPE at 281 nm” • “What is the clinical and biological implication and meaning of the blood and nail Selenium levels...” • “analyzed uv visible spectrum of Fe3O4 nanopartticles at 262 nm” |
| Content Generation and Expansion | User provides a piece of existing text (such as an outline, abstract, or draft) and explicitly requests for it to be expanded, rewritten, or to have components like references generated for it. | “My research topic is ‘The Effects of Virtual Reality Environments on Adult English Language Acquisition.’ Personalize the following questions...” • [Portuguese: improve this analysis; the figure shows two pronounced peaks...] • “The method of chemical coprecipitation allows to obtain barium hexaferrite... select literary sources...” |
| Complex Cross-Paper Synthesis | The user requires complex reasoning that involves synthesizing information across multiple research papers to understand relationships, trace evolution of concepts, identify emergent patterns, or connect insights from different disciplines. | “How has the conditional Pesin entropy formula evolved in the literature on coupled dynamical systems...” • “Has any research been done to more fully articulate the collaboration/cooperation required...” • “topological structure of knowledge networks or knowledge maps...” |
| Citation & Evidence Finding | User provides a specific claim, statement, or fact and requests academic sources that can be used as a citation to support it. | “Please search for literature to prove that BAK is mainly distributed on the outer membrane of mitochondria” • “it was shown that hydroxyl groups... can form intramolecular hydrogen bonds... find me a citation for that” • [Indonesian: video and television media are learning media... find the sources] |
| Specific Paper Retrieval | User attempts to locate a single, known academic paper, often using a DOI, a partial citation, or specific author and title information. | “10.1097/PRS.0000000000008882” • “To improve the adaptability of control systems, researchers have investigated... Amiri et al.... find this paper” • “Smith J, et al. The regulation of sperm motility... [citation]” |

Table 23: Complete ScholarQA query intent taxonomy (part 3 of 3). Taxonomy was built through iterative human review and LLM-assisted analysis using Gemini-2.5-pro on a sample of 1000 queries.

| Intent | Description | Examples |
|---|---|---|
| Broad Topic Exploration | The user wants a general overview, literature review, or summary of a broad academic field or topic without highly specific constraints. | “Do a literature review on artificial intelligence in management accounting” • “papers about the technological advances, debates and challenges in humanitarian aid” • “Show some exploratory qualitative studies on ELT” |
| Specific Paper Retrieval | User attempts to locate a single, known academic paper, often using a DOI, a partial citation, or specific author and title information. | “10.1097/PRS.0000000000008882” • “To improve the adaptability of control systems... Amiri et al.... find this paper” • “Smith J, et al. The regulation of sperm motility... [citation]” |
| Complex Cross-Paper Synthesis | The user requires complex reasoning that involves synthesizing information across multiple research papers to understand relationships, trace evolution of concepts, identify emergent patterns, or connect insights from different disciplines. | “How has the conditional Pesin entropy formula evolved in the literature on coupled dynamical systems...” • “Has any research been done to more fully articulate the collaboration/cooperation required...” • “topological structure of knowledge networks...” |
| Citation & Evidence Finding | User provides a specific claim, statement, or fact and requests academic sources that can be used as a citation to support it. | “Please search for literature to prove that BAK is mainly distributed on the outer membrane of mitochondria” • “it was shown that hydroxyl groups... find me a citation for that” • [Indonesian: video and television media... find the sources] |
| Methodological and Procedural Guidance | The user is looking for instructions, protocols, frameworks, or best practices for carrying out a specific research task, experiment, analysis, or procedure. | “How to screen the downstream pathways of ERK” • “How is alignment consistency ensured from part identification to measurement feedback...” • “what is the importance of covering many different spatial configurations...” |

Table 24: Complete Paper Finder query intent taxonomy (part 1 of 3).
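The captions for Tables 23–26 note that the taxonomies were built through iterative human review and LLM-assisted analysis over a 1000-query sample. The paper does not publish its prompts, but one pass of such an analysis, asking an LLM to assign a query to a single intent label, might be sketched as below. The label list is abbreviated and the prompt wording, constant, and function name are invented for illustration; they are not the authors' artifacts.

```python
# Hypothetical sketch of an LLM-assisted intent-labeling prompt.
# The real pipeline's prompts and label handling are not published.
INTENTS = [
    "Broad Topic Exploration",
    "Specific Paper Retrieval",
    "Citation & Evidence Finding",
    "Methodological and Procedural Guidance",
]

def build_intent_prompt(query: str) -> str:
    """Build a single-label classification prompt for one user query."""
    labels = "\n".join(f"- {name}" for name in INTENTS)
    return (
        "Assign exactly one intent label to the user query below.\n"
        f"Labels:\n{labels}\n\n"
        f"Query: {query}\n"
        "Answer with the label name only."
    )
```

In an iterative setup, disagreements between the model's labels and human review would feed back into revised label definitions, which matches the "iterative human review" the captions describe.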
31 Preprint. Under review . Intent Description Examples Concept Definition and Explanation The user seeks to understand the mean- ing, core principles, and fundamental aspects of a specific academic theory , concept, model, or term. “ppf ” “Definition of climate” “What is artificial intelligence?” Comparative Analysis The user wants to understand the sim- ilarities, differ ences, advantages, and disadvantages between two or more concepts, theories, models, or meth- ods. “find paper that comparing the Amer- ican native or indigenous or village chicken egg and commercial chicken egg” “Copare the papers of learning analytics through UT AUT and T AM” “Comparison of the efficiency of in-vitro digestibility to in-vivo digestibility” T ool and Resour ce Dis- covery The user is searching for specific aca- demic resour ces such as datasets, soft- ware tools, questionnair es, evaluation benchmarks, or code repositories r ele- vant to their resear ch. “Papers introducing a dataset of an un- scripted dialogue between 2 speakers...” “i need Resear ch papers and open datasets on sports injuries... sensor data such as fitbit...” “Document Processing LLM Bench- marks” Causal and Relational Inquiry The user wants to understand the re- lationship, impact, effect, or influence of one variable or concept on another . These queries often explore cause-and- effect connections or corr elations. “How does fear of missing out influence the effect of mindfulness on academic procrastination...” “The Impact of Corporate Governance Practices on Financial Inclusion...” “what are sex differ ences in psychiatric comorbidities among individuals with migraine?” Application Inquiry User seeks r esearch focused on solving a specific, practical problem or demon- strating the application of a theory or technology in a real-world context. 
“The Effect of a Program Based on Us- ing Zoom Application on Developing Level T wo Students’ Business V ocabu- lary ...” “Golf shot/swing analysis and feedback for self training” “resear ch paper on benefit of construc- tion machineries application” T able 25: Complete Paper Finder query intent taxonomy (part 2 of 3). 32 Preprint. Under review . Intent Description Examples Focused Academic Synthesis The user requests a structured and constrained synthesis of academic lit- erature, such as a systematic review , a meta-analysis, or a review limited to specific journals, timeframes, or methodologies. “emerging trends in communications sectors that Ofcom deals with” “recent advances in nuclear fusion power ” “Must read papers for LLM” Content Generation and Expansion User provides a piece of existing text (such as an outline, abstract, or draft) and explicitly requests for it to be ex- panded, rewritten, or to have compo- nents like refer ences generated for it. “My resear ch topic is ‘The Effects of V irtual Reality Environments on Adult English Language Acquisition.’ Per- sonalize the following questions...” “melhore essa analise...” “The method of chemical coprecipita- tion... select literary sources...” Data Interpretation and Analysis Support The user needs help with interpreting or analyzing specific data, results, or observations. This can include under- standing statistical outputs, analyzing images/spectra, or making sense of ex- perimental findings. “analyze UV -V isible peak of HDPE at 281 nm” “What is the clinical and biological im- plication and meaning of the blood and nail Selenium levels...” “analyzed uv visible spectrum of Fe3O4 nanopartticles at 262 nm” Ideation The user presents a complex problem, a scenario, or a r esearch objective and asks for potential solutions, innovative ideas, hypotheses, or new resear ch di- rections. 
“I want to do ‘quantization in MoE model inference’, find some related pa- pers and recommend some ideas on this” “how can the anthr opomorphization of artificial intelligence influence teaching practices...” “Has any research been done to more fully articulate the collaboration/coop- eration requir ed...” Research Gap and Limitation Analysis The user aims to identify the limi- tations, unsolved problems, or unex- plored areas (gaps) within a specific field of resear ch, often to justify a new study or find a novel research direc- tion. “Systematic reviews on A WE tools in EFL argumentative writing. What is the gap?” “Limitations: What limitations are present in the theor etical literature on spatial poverty?” “Review the SOT A and resear ch gaps of organic neur omorphic computing in AI...” Specific Factual Re- trieval The user is asking for a precise, factual piece of information, a specific data point, a statistic, or a historical detail. This intent is characterized by its speci- ficity and aims for a direct, verifiable answer . “what is the best wavelength to check the hemolysis in plasma samples?” “What was the first study examine sex- ual double standards? When was the term coined?” “find user statistics for UniProt” T able 26: Complete Paper Finder query intent taxonomy (part 3 of 3). T axonomy was built through iterative human review and LLM-assisted analysis using Gemini-2.5-pro on a sample of 1000 queries. 33 Preprint. Under review . Category Description Examples Methodology-Specific Criteria Queries where the user requir es spe- cific methods, approaches, analyti- cal techniques, or experimental de- signs to be present in the search re- sults. This can include demands for computational models, experimental paradigms, or meta-analyses. 
“Fe dopped ZnO and its composite with zeolite, explain crystallite size by Scher- rer and Williamson-Hall models and compare” “meta analysis on behavioral epigenetics dogs and cats...” “compare las membranas poliméricas de inclusión respecto a cada uno de los métodos convencionales...” Publication T ype- /Quality Filters Queries that require certain types of lit- erature based on format, peer review status, or journal ranking, such as pa- pers, theses, reviews, or guidelines, with or without requirements for high- impact or reputable sour ces. “Review papers on SOT A and research gaps in manifold learning in AI.” “provide several examples from original resear ch articles published post-2020, excluding review articles” “find articles suggesting a resear ch gap on the influence of OFW parentage...” T emporal Constraints Queries that include requirements re- lated to the publication date or pe- riod of the information (e.g., recent lit- erature, post-2020, historical periods). This includes searching within specific timeframes or for recent advances. “Find policies and events in Eur opean countries that demonstrate ‘strategic autonomy’ between 2016 and 2025.” “provide several examples from origi- nal research articles published post- 2020...” “Latest r esearch on social loneliness and perfectionism” Metadata-Based Crite- ria Queries specifying attributes of the publications or their creators, such as language, authorship, institutional af- filiation, location/country , or publica- tion venue (journal, conference, etc.). “in welchen Studien wurde das R paket treeclim verwendet und was wurde damit gemacht?” “Shahinda Rezk” “search for papers published in NeurIPS about reinfor cement learning” Citation and Impact- Based Criteria Queries that specify requir ements re- lated to citation counts, impact factors, highly-cited papers, or seminal works in a field. This can include searching for influential or foundational studies. 
“find the most highly cited papers on deep learning optimization” “seminal works on contract-net proto- cols” “SOT A metric on the canonical bench- mark(s) and the delta or change vs. prior best benchmark score” Data/Resource A vail- ability Criteria Queries that ask for studies or papers with available datasets, code, experi- mental materials, or other r esources for repr oducibility or re-use. This category includes explicit requests for supple- mentary information or links to data/- code. “the code or data availability weblink or vendor for the method” “I want TB expression system papers with downloadable protocols” “review papers listing open datasets for computational linguistics” T able 27: Complete ScholarQA search criteria taxonomy . T axonomy was built through iterative human review and LLM-assisted analysis using GPT -4.1 on a sample of 1000 queries. 34 Preprint. Under review . Category Description Examples Methodology-Specific Criteria Queries where the user requir es spe- cific methods, approaches, analyti- cal techniques, or experimental de- signs to be present in the search re- sults. This can include demands for computational models, experimental paradigms, or meta-analyses. “papers which use Nobel laur eates’ pub- lication data to study interdisciplinary resear ch” “Equilibrium analysis enables the assess- ment of whether observed outcomes re- flect a stable strategic configuration...” “Find systematic reviews and/or clinical practice guidelines published in the last 10 years...” T emporal Constraints Queries that include requirements re- lated to the publication date or pe- riod of the information (e.g., recent lit- erature, post-2020, historical periods). This includes searching within specific timeframes or for recent advances. 
“Find systematic reviews and/or clini- cal practice guidelines published in the last 10 years (2014–2025)...” “papers that have been published be- tween 2023 and 2025” “papers about recent advances in com- munications sectors that Ofcom deals with” Metadata-Based Crite- ria Queries specifying attributes of the publications or their creators, such as language, authorship, institutional af- filiation, location/country , or publica- tion venue (journal, conference, etc.). “Language: English or Portuguese.” “jurnal tentang T eori stimulus respons Neal E.Miller & John Dollard” “In Sri Lanka, when a government em- ployee is promoted...” Publication T ype- /Quality Filters Queries that require certain types of lit- erature based on format, peer review status, or journal ranking, such as pa- pers, theses, reviews, or guidelines, with or without requirements for high- impact or reputable sour ces. “Find systematic reviews and/or clini- cal practice guidelines published in the last 10 years...” “any paper or thesis about nostalgia and self construal” “por favor me das una tesis de maestría de control constitucional” Citation and Impact- Based Criteria Queries that specify requir ements re- lated to citation counts, impact factors, highly-cited papers, or seminal works in a field. This can include searching for influential or foundational studies. “papers which use Nobel laur eates’ pub- lication data to study interdisciplinary resear ch” “Find papers that refer to the relation- ships between degradation ability of PROT AC...” “Find papers that explore mechanisms that stimulate agent cooperation in MARL.” Data/Resource A vail- ability Criteria Queries that ask for studies or papers with available datasets, code, experi- mental materials, or other r esources for repr oducibility or re-use. This category includes explicit requests for supple- mentary information or links to data/- code. 
“the code or data availability weblink or vendor for the method” “I want TB expression system papers with downloadable protocols” “review papers listing open datasets for computational linguistics” T able 28: Complete Paper Finder sear ch criteria taxonomy . T axonomy was built through iterative human review and LLM-assisted analysis using GPT -4.1 on a sample of 1000 queries. 35 Preprint. Under review . Category Description Examples Keyword-style Query Short, often fragmentary , queries re- sembling search engine keywords or subject headings. No verbs or com- plete sentences; typically a list of nouns or concepts separated by spaces or sim- ple punctuation. “sludge production MBBR” “obsessive-compulsive personality” “hydrogen pr oduction methods” Natural Language Question Fully formed, grammatically complete questions using natural language. “What is the role of agile human re- source management in strategic en- trepr eneurship? The moderating role of corporate culture...” “How can more effective research on Neuromarketing be conducted?” “What are the most common ways (re- cently) to correct ocr err ors” Explicit Instruc- tion/Imperative Direct commands or r equests instruct- ing the system to perform an action (e.g., ‘find...’, ‘review ...’, ‘compare...’). May be simple or include detailed steps, but typically begins with an im- perative verb. “Review papers on SOT A and research gaps in manifold learning in AI.” “Find policies and events in European countries that demonstrate ‘strategic autonomy’ between 2016 and 2025.” “Compare las membranas poliméricas de inclusión respecto a cada uno de los métodos convencionales...” Complex Contextual Narrative Lengthy , detailed queries that provide substantial background context, moti- vations, definitions, or examples before asking a main question or issuing a command. These queries r esemble a mini-narrative, sometimes including an abstract, data, citations, or technical context. 
  Examples:
  - “In this experiment, various culture systems for the GF677 rootstock were compared... Is there an explanation in the scientific literature for why leaf number and leaf area are greater...”
  - “I have four papers and I want to combine them all to be a doctoral dissertation...”
  - “The performance of T-junction micromixers, as analysed in CFD models...”

Multi-part/Multi-step Query
  Queries composed of multiple distinct sub-questions, tasks, or steps, often divided by letters, numbers, or separate sentences. The sub-queries are related but require separate or structured responses.
  Examples:
  - “Review papers on SOTA and research gaps in manifold learning in AI. For each of the top 9 methods extract: a) Key architectural or algorithmic ideas... b) Reported SOTA metric... c) the code or data availability...”
  - “Describe what this cognitive ability encompasses... Please use academic papers... for each part of your response.”
  - “Find me the research papers about interictal epileptiform discharges (IEDs)... Does STD in excitatory neurons...”

Boolean or Logical Operators
  Queries using explicit logical or Boolean operators or patterns (AND, OR, NOT, +, -, parentheses, slashes, etc.) to combine multiple search terms, constrain results, or designate alternatives. Sometimes appears as ‘Topic A/Topic B’, ‘x, y AND z’.
  Examples:
  - “adulteration detection in milk AND spectroscopy”
  - “machine learning OR deep learning applications in medical imaging”
  - “neural network NOT convolutional”

Citation/Format Specification
  Queries where the format or style of the answer is explicitly specified, such as requiring specific citation styles (APA, etc.), in-text citations, references, or structured bibliographies.
  Examples:
  - “use APA style citation apply in-text citation and write what the meaning of ‘innovation’”
  - “Please use academic papers and books when formulating your response. Ensure to provide references...”
  - “Include appropriate references.”

Table 29: Complete ScholarQA query phrasing taxonomy. Taxonomy was built through iterative human review and LLM-assisted analysis using GPT-4.1 on a sample of 1,000 queries.

Keyword-style Query
  Short, often fragmentary queries resembling search-engine keywords or subject headings. No verbs or complete sentences; typically a list of nouns or concepts separated by spaces or simple punctuation.
  Examples:
  - “algorithmic trading”
  - “ppf”
  - “bayesian optimization applications”

Natural Language Question
  Fully formed, grammatically complete questions using natural language.
  Examples:
  - “How does fear of missing out influence the effect of mindfulness on academic procrastination among students?”
  - “Do a literature review on artificial intelligence in management accounting”
  - “What cognitive strategies and techniques can improve focus and concentration for artists...”

Explicit Instruction/Imperative
  Direct commands or requests instructing the system to perform an action (e.g., ‘find...’, ‘review...’, ‘compare...’). May be simple or include detailed steps, but typically begins with an imperative verb.
  Examples:
  - “Find papers that refer to the relationships between degradation ability of PROTAC and potency or affinity...”
  - “Give papers on finetuning an llm for writing research papers”
  - “Find systematic reviews and/or clinical practice guidelines published in the last 10 years...”

Complex Contextual Narrative
  Lengthy, detailed queries that provide substantial background context, motivations, definitions, or examples before asking a main question or issuing a command. These queries resemble a mini-narrative, sometimes including an abstract, data, citations, or technical context.
  Examples:
  - “As restored ecosystems evolve towards the composition and structure characteristic of native vegetation...”
  - “Equilibrium analysis enables the assessment of whether observed outcomes reflect a stable strategic configuration...”
  - “Papers about innovation and hurdles within industries that prevent it...”

Boolean or Logical Operators
  Queries using explicit logical or Boolean operators or patterns (AND, OR, NOT, +, -, parentheses, slashes, etc.) to combine multiple search terms, constrain results, or designate alternatives. Sometimes appears as ‘Topic A/Topic B’, ‘x, y AND z’.
  Examples:
  - “Find systematic reviews and/or clinical practice guidelines published in the last 10 years (2014–2025)...”
  - “Papers introducing a dataset of an unscripted dialogue between 2 speakers...”
  - “Stretching exercises are commonly integrated into physical education programs...”

Multi-part/Multi-step Query
  Queries composed of multiple distinct sub-questions, tasks, or steps, often divided by letters, numbers, or separate sentences. The sub-queries are related but require separate or structured responses.
  Examples:
  - “Review papers on SOTA and research gaps in manifold learning in AI. For each of the top 9 methods extract: a) Key architectural or algorithmic ideas... b) Reported SOTA metric... c) the code or data availability...”
  - “Describe what this cognitive ability encompasses... Please use academic papers... for each part of your response.”
  - “Find me the research papers about interictal epileptiform discharges (IEDs)... Does STD in excitatory neurons...”

Citation/Format Specification
  Queries where the format or style of the answer is explicitly specified, such as requiring specific citation styles (APA, etc.), in-text citations, references, or structured bibliographies.
  Examples:
  - “use APA style citation apply in-text citation and write what the meaning of ‘innovation’”
  - “Please use academic papers and books when formulating your response. Ensure to provide references...”
  - “Include appropriate references.”

Table 30: Complete Paper Finder query phrasing taxonomy. Taxonomy was built through iterative human review and LLM-assisted analysis using GPT-4.1 on a sample of 1,000 queries.

Agricultural Sciences, Anthropology, Arts and Design, Biology, Biomedical Sciences, Business and Management, Chemistry, Civil Engineering, Clinical Medicine, Computer Science, Earth Sciences, Economics, Education and Pedagogy, Electrical Engineering, Environmental Studies, History, Law and Legal Studies, Linguistics, Literature, Mathematics, Mechanical Engineering, Philosophy, Physics, Political Science, Psychology, Public Health, Sociology, Statistics

Table 31: The 28 fields of study used for query classification.

H.2 User Feedback Classification

This prompt classifies user feedback into predefined categories to understand the types of issues and suggestions users report.

    You are classifying user feedback for a research assistant tool into
    predefined categories. A single feedback can belong to MULTIPLE categories
    if applicable.

    Available Categories:
    {categories_description}

    Feedback to classify: "{feedback_text}"

    Which categories apply to this feedback? Return a list of category names.
    If no categories apply, return an empty list.
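As a concrete illustration of how the H.2 template might be driven programmatically, the sketch below fills the prompt and filters a (mocked) model reply against a category set. This is a minimal sketch under stated assumptions: the category names, `build_prompt`, and `parse_categories` are hypothetical placeholders, not part of the released dataset or its pipeline.

```python
import json

# Hypothetical category set for illustration only; the real set is supplied
# to the prompt via {categories_description}.
CATEGORIES = {"citation_error", "missing_papers", "ui_issue", "feature_request"}

PROMPT_TEMPLATE = (
    "You are classifying user feedback for a research assistant tool into "
    "predefined categories. A single feedback can belong to MULTIPLE "
    "categories if applicable.\n\n"
    "Available Categories:\n{categories_description}\n\n"
    'Feedback to classify: "{feedback_text}"\n\n'
    "Which categories apply to this feedback? Return a list of category "
    "names. If no categories apply, return an empty list."
)

def build_prompt(feedback_text: str) -> str:
    """Fill the H.2 template with the category list and the raw feedback."""
    return PROMPT_TEMPLATE.format(
        categories_description="\n".join(sorted(CATEGORIES)),
        feedback_text=feedback_text,
    )

def parse_categories(model_reply: str) -> list[str]:
    """Parse the model's JSON list, dropping any unknown category names."""
    return [c for c in json.loads(model_reply) if c in CATEGORIES]

prompt = build_prompt("The answer cited a paper that does not exist.")
print(parse_categories('["citation_error", "not_a_category"]'))  # -> ['citation_error']
```

Filtering the reply against the known label set guards against the multi-label model inventing category names outside the provided list.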
Output Schema (Pydantic):

    class FeedbackClassification(BaseModel):
        category_names: List[str] = Field(
            description="List of applicable category names for the feedback (multi-label)"
        )

H.3 Duplicate Query Detection

This prompt identifies duplicate queries from a user's query history to understand query reuse patterns.

    You are analyzing a user's search queries to identify duplicates.
    Two queries should be considered duplicates if they are essentially the
    same despite minor differences such as:
    - Typos or spelling variations
    - Word order changes
    - Minor spacing or punctuation differences
    - Slight phrasing or word choice changes that don't alter the core question

    Here are the user's queries (numbered):
    {numbered_queries}

    Your task:
    1. Group together queries that are duplicates (ones essentially asking the same thing)
    2. Return the EXACT query text (not the numbers) in each group
    3. Queries that are unique (no duplicates) should be in groups by themselves
    4. Every query must appear exactly once in your output

    Return your answer as JSON with a single key `groups` which is a list of
    lists of duplicate queries.

    Example:
    If queries were:
    1. What is machine learning?
    2. what is machine learning
    3. Define neural networks
    4. What are neural networks?

    You would return groups like:
    {"groups": [["What is machine learning?", "what is machine learning"],
                ["Define neural networks", "What are neural networks?"]]}

Output Schema (Pydantic):

    class DuplicateQueryGroups(BaseModel):
        groups: List[List[str]] = Field(
            description="List of duplicate groups. Each group contains the exact query texts that are duplicates of each other. Single queries (no duplicates) should be in groups of size 1. ALL input queries must appear exactly once."
        )

H.4 Response Quality Assessment

This prompt assesses the quality of AI-generated responses using an LLM-as-judge approach.

    You are evaluating the quality of an AI-generated response to a user's query.

    User Query: {query}

    AI Response: {formatted_report}

    Based on the relevance, accuracy, completeness, and usefulness of this
    response, would you give it a thumbs up (good quality) or thumbs down
    (poor quality)?

    Consider:
    - Does the response adequately address the user's query?
    - Is the information relevant and well-organized?
    - Would a typical user find this response helpful?
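Judge replies shaped like the FeedbackAssessment schema (an `assessment` of 'thumbs_up' or 'thumbs_down' plus brief `reasoning`) can be validated and aggregated before analysis. The sketch below is illustrative only; the function names are hypothetical and not taken from the paper's pipeline.

```python
import json

# Accepted values for the judge's 'assessment' field, per the H.4 schema.
VALID_ASSESSMENTS = {"thumbs_up", "thumbs_down"}

def parse_assessment(model_reply: str) -> dict:
    """Parse one judge reply, rejecting unknown assessment values."""
    out = json.loads(model_reply)
    if out.get("assessment") not in VALID_ASSESSMENTS:
        raise ValueError(f"unexpected assessment: {out.get('assessment')!r}")
    return out

def thumbs_up_rate(model_replies: list[str]) -> float:
    """Fraction of judged responses rated thumbs_up."""
    parsed = [parse_assessment(r) for r in model_replies]
    return sum(p["assessment"] == "thumbs_up" for p in parsed) / len(parsed)

replies = [
    '{"assessment": "thumbs_up", "reasoning": "Directly answers the query."}',
    '{"assessment": "thumbs_down", "reasoning": "Citations are off-topic."}',
]
print(thumbs_up_rate(replies))  # -> 0.5
```

Validating the enum up front keeps malformed judge outputs from silently skewing an aggregate quality rate.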
Output Schema (Pydantic):

    class FeedbackAssessment(BaseModel):
        assessment: str = Field(
            description="Assessment of the response quality: 'thumbs_up' or 'thumbs_down'"
        )
        reasoning: str = Field(
            description="Brief reasoning for the assessment (1-2 sentences)"
        )

H.5 Query Intent Classification

This prompt classifies user queries into intent categories to understand the underlying purpose of each query.

    Classify the following academic query into one or more intent categories
    from the provided list. A query can belong to multiple categories if it
    has multiple purposes.

    Query: "{query}"

    Available intent categories:
    {intent_list}

    Please respond with a JSON object containing a list of applicable intent
    names. Only include intent names that clearly apply to this query.

Output Schema (Pydantic):

    class MultilabelQueryIntentResult(BaseModel):
        intent_names: List[str] = Field(
            description="List of applicable intent names for the query"
        )

H.6 Query Phrasing Classification

This prompt classifies queries based on their phrasing patterns and style (e.g., keyword-style, natural language question, imperative command).

    Classify the following academic query based on its phrasing patterns and
    style. A query can exhibit multiple phrasing styles - select ALL
    categories that apply.

    Query: "{query}"

    Available phrasing categories:
    {category_list}

    Please respond with a JSON object containing a list of ALL applicable
    phrasing category names. Only include categories that clearly match the
    phrasing style of this query.

Output Schema (Pydantic):

    class QueryPhrasingResult(BaseModel):
        phrasing_types: List[str] = Field(
            description="List of applicable phrasing category names for the query"
        )

H.7 Field of Study Classification

This prompt classifies queries into academic fields of study.

    Classify the following academic query into one or more fields of study
    from the provided list. A query can belong to multiple fields if it spans
    multiple disciplines.

    Query: "{query}"

    Available fields of study:
    {fields_list}

    Please respond with a JSON object containing a list of applicable field
    names. Only include field names that clearly apply to this query. Focus
    on the primary discipline involved and only include more than one field
    if the query is truly interdisciplinary and not covered by a single
    field. Use the exact field names from the provided list.
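Because the H.7 prompt instructs the model to use the exact field names from Table 31, a post-processing step can drop anything outside that controlled vocabulary. A minimal sketch, assuming the 28-field list from Table 31 (the helper `parse_fields` is hypothetical, not from the paper's codebase):

```python
import json

# The 28 fields of study from Table 31, used as a controlled vocabulary.
FIELDS = {
    "Agricultural Sciences", "Anthropology", "Arts and Design",
    "Biomedical Sciences", "Biology", "Business and Management",
    "Chemistry", "Civil Engineering", "Clinical Medicine",
    "Computer Science", "Earth Sciences", "Economics",
    "Education and Pedagogy", "Electrical Engineering",
    "Environmental Studies", "History", "Law and Legal Studies",
    "Linguistics", "Literature", "Mathematics", "Mechanical Engineering",
    "Philosophy", "Physics", "Political Science", "Psychology",
    "Public Health", "Sociology", "Statistics",
}

def parse_fields(model_reply: str) -> list[str]:
    """Keep only exact matches against the controlled vocabulary."""
    out = json.loads(model_reply)
    return [f for f in out.get("field_names", []) if f in FIELDS]

reply = '{"field_names": ["Computer Science", "Informatics"]}'
print(parse_fields(reply))  # -> ['Computer Science']
```

Exact-match filtering is deliberately strict: near-miss labels like "Informatics" are dropped rather than fuzzily mapped, keeping the downstream field distribution tied to the fixed taxonomy.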
Output Schema (Pydantic):

    class MultilabelFieldClassificationResult(BaseModel):
        field_names: List[str] = Field(
            description="List of applicable field of study names for the query"
        )

H.8 Search Criteria Classification

This prompt classifies queries based on the types of search criteria specified (e.g., temporal constraints, methodology requirements, publication type filters).

    Classify the following academic query based on the types of search
    criteria being used. A query can combine multiple search criteria types -
    select ALL categories that apply.

    Query: "{query}"

    Available search criteria categories:
    {category_list}

    Please respond with a JSON object containing a list of ALL applicable
    criteria category names. Only include categories that clearly match the
    search criteria used in this query.

Output Schema (Pydantic):

    class SearchCriteriaResult(BaseModel):
        criteria_types: List[str] = Field(
            description="List of applicable criteria category names for the query"
        )
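The H.3 duplicate-detection prompt imposes a structural invariant: every input query must appear exactly once across the returned groups. That invariant is mechanically checkable, as in the sketch below (a hypothetical validator, not code from the paper's pipeline).

```python
import json
from collections import Counter

def validate_groups(queries: list[str], model_reply: str) -> list[list[str]]:
    """Check that the model's duplicate groups cover each query exactly once."""
    groups = json.loads(model_reply)["groups"]
    flat = Counter(q for group in groups for q in group)
    expected = Counter(queries)
    if flat != expected:
        # Counter subtraction keeps only positive counts on each side.
        raise ValueError(
            f"coverage mismatch: missing={expected - flat}, extra={flat - expected}"
        )
    return groups

queries = [
    "What is machine learning?",
    "what is machine learning",
    "Define neural networks",
]
reply = json.dumps({"groups": [
    ["What is machine learning?", "what is machine learning"],
    ["Define neural networks"],
]})
print(len(validate_groups(queries, reply)))  # -> 2
```

Using a Counter (rather than a set) also catches the case where the model lists the same query text in two different groups.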
