AIANO: Enhancing Information Retrieval with AI-Augmented Annotation
The rise of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has rapidly increased the need for high-quality, curated information retrieval datasets. These datasets, however, are currently created with off-the-shelf annotation tools that make the annotation process complex and inefficient. To streamline this process, we developed a specialized annotation tool, AIANO. By adopting an AI-augmented annotation workflow that tightly integrates human expertise with LLM assistance, AIANO enables annotators to leverage AI suggestions while retaining full control over annotation decisions. In a within-subject user study ($n = 15$), participants created question-answering datasets using both a baseline tool and AIANO. AIANO nearly doubled annotation speed compared to the baseline while being easier to use and improving retrieval accuracy. These results demonstrate that AIANO’s AI-augmented approach accelerates and enhances dataset creation for information retrieval tasks, advancing annotation capabilities in retrieval-intensive domains.
💡 Research Summary
The paper introduces AIANO (Artificial Intelligence Augmented Annotation), a purpose‑built annotation platform designed to streamline the creation of information‑retrieval (IR) datasets such as question‑answer pairs for Retrieval‑Augmented Generation (RAG) systems. The authors argue that existing off‑the‑shelf tools (e.g., Label Studio) are ill‑suited for IR tasks because they lack integrated search, multi‑document handling, and AI assistance, making the annotation process cumbersome, time‑consuming, and error‑prone.
AIANO’s architecture revolves around configurable “blocks” that encapsulate input and output schemas (defined in JSON) and operate in one of three collaboration modes:
- Plain Mode – pure human authoring, no AI involvement.
- AI Solo Mode – the LLM receives a fixed system prompt and generates content automatically; the annotator can accept or edit the output.
- Human‑AI Collaborative Mode – the LLM synthesizes multiple sources (e.g., highlighted passages, metadata, other block outputs) to propose a candidate answer; the annotator retains full control to accept, modify, or reject.
These blocks can be freely combined, allowing users to construct complex pipelines (e.g., a Question block in Plain Mode feeding an Answer block in Collaborative Mode). The platform supports any LLM that follows the OpenAI API specification, including commercial services (OpenAI, Anthropic) and locally deployed vLLM models, enabling cost‑effective high‑throughput inference.
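The block-and-mode design described above can be sketched in code. The following is a minimal, hypothetical illustration of the Question-in-Plain-Mode feeding Answer-in-Collaborative-Mode pipeline; the field names (`mode`, `input_schema`, `inputs`, and so on) are assumptions for illustration, not AIANO's actual configuration schema.

```python
# Hypothetical sketch of an AIANO block pipeline. The paper states that
# blocks carry JSON input/output schemas and one of three collaboration
# modes; the exact key names used here are illustrative assumptions.
question_block = {
    "name": "question",
    "mode": "plain",                       # pure human authoring, no AI
    "input_schema": {"type": "object", "properties": {}},
    "output_schema": {"type": "object",
                      "properties": {"question": {"type": "string"}}},
}

answer_block = {
    "name": "answer",
    "mode": "collaborative",               # Human-AI Collaborative Mode
    # A collaborative block may synthesize highlighted passages, metadata,
    # and the outputs of other blocks; here it reads the question block.
    "inputs": ["question.question", "highlighted_passages"],
    "output_schema": {"type": "object",
                      "properties": {"answer": {"type": "string"}}},
}

pipeline = [question_block, answer_block]

def validate(block):
    """Minimal structural check: every block needs a name, a known
    collaboration mode, and an output schema."""
    assert block["name"]
    assert block["mode"] in {"plain", "ai_solo", "collaborative"}
    assert "output_schema" in block

for block in pipeline:
    validate(block)
```

Because the platform targets any OpenAI-compatible endpoint, a collaborative block's LLM call would in practice go through the same client interface whether the backend is a commercial service or a locally hosted vLLM server.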
The workflow is split into three phases:
- Project Creation – define project metadata, input/output schemas, annotation levels (e.g., importance tags), and assemble blocks.
- Project Configuration – connect blocks to an LLM provider, upload documents (JSON with at least document‑ID and subject‑ID), and set up any custom prompts.
- Annotation – users search the corpus, highlight relevant spans, trigger AI‑generated answers, review/edit them, and finally export the dataset with full provenance.
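The document-upload step in the configuration phase requires JSON records carrying at least a document ID and a subject ID. A minimal sketch of such a record and its validation, assuming hypothetical key names (`document_id`, `subject_id`, `text`), might look like:

```python
import json

# Hypothetical upload record; the summary only states that uploads are
# JSON with at least a document ID and a subject ID, so the exact key
# names here are assumptions.
record = {
    "document_id": "doc-0001",
    "subject_id": "subj-42",
    "text": "Full text of the document to be searched and highlighted.",
}

def check_record(rec):
    """Reject uploads missing either of the two required identifiers."""
    missing = {"document_id", "subject_id"} - rec.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return rec

payload = json.dumps(check_record(record))
```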
The UI is divided into three panels: a document browser with full‑text search (left), a central highlighting/annotation canvas, and a right‑hand block panel. All actions are automatically versioned and stored in a PostgreSQL backend. Export formats include plain JSON (question‑answer‑passage triplets) and a proprietary “.aiano” bundle that captures the entire project configuration for reproducibility.
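The plain-JSON export described above consists of question-answer-passage triplets with provenance. One exported entry could plausibly look like the following; the key names and provenance fields are illustrative assumptions, since the summary specifies only the triplet structure and that full provenance is captured.

```python
import json

# Hypothetical shape of one exported question-answer-passage triplet.
# Field names are assumed for illustration, not AIANO's actual format.
triplet = {
    "question": "When was the treaty signed?",
    "answer": "It was signed in 1648.",
    "passage": {
        "document_id": "doc-0001",
        "text": "...the treaty was signed in 1648...",
        "char_span": [112, 148],           # highlighted span (assumed format)
    },
    "provenance": {
        "annotator": "user-07",
        "mode": "collaborative",           # which block mode produced the answer
        "version": 3,                      # versioning from the PostgreSQL backend
    },
}

exported = json.dumps([triplet], ensure_ascii=False, indent=2)
```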
To evaluate AIANO, the authors conducted a within‑subject user study with 15 participants from diverse backgrounds (graduate students, researchers, software developers, medical doctors, regulatory specialists). Each participant completed four IR tasks (two single‑document, two multi‑document) using both AIANO and Label Studio, with the order counterbalanced and a short break between tools. The tasks involved locating relevant documents, highlighting evidence passages, and writing answers to pre‑defined questions. AIANO was configured with a Question block (Plain Mode) and an Answer block (Human‑AI Collaborative Mode) powered by Meta’s Llama 70B model; Label Studio lacked any AI assistance or integrated search.
Metrics collected:
- Subjective – NASA‑TLX workload dimensions (mental, physical, temporal demand, effort, frustration, performance) and an 8‑item Likert‑scale usability questionnaire (intuitiveness, annotation process, likelihood of reuse, ease of use, navigation, speed, recommendation, overall satisfaction).
- Objective – task completion time, and IR performance measured as precision, recall, and F1 by comparing highlighted documents against a gold‑standard relevance set.
Statistical analysis used paired t‑tests for normally distributed variables and Wilcoxon signed‑rank tests otherwise, with significance set at p < 0.05.
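The objective retrieval metrics above are computed by comparing each participant's set of highlighted documents against the gold-standard relevance set. A minimal sketch of that computation:

```python
def prf1(highlighted: set, gold: set):
    """Precision, recall, and F1 of an annotator's highlighted document
    set against a gold-standard relevance set."""
    tp = len(highlighted & gold)                      # true positives
    precision = tp / len(highlighted) if highlighted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: the annotator highlighted 3 documents, 2 of which are
# in the 4-document gold set.
p, r, f = prf1({"d1", "d2", "d5"}, {"d1", "d2", "d3", "d4"})
# p = 2/3, r = 1/2, f1 = 4/7
```

In practice the paired t-test and Wilcoxon signed-rank comparisons would then be run over the per-participant scores (e.g., via `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`), with the test chosen by a normality check as the authors describe.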
Key Findings
- Speed – Median task time dropped from 10 minutes with Label Studio to 6 minutes with AIANO, a roughly 40 % reduction in task time.
- Workload – Overall NASA‑TLX score was significantly lower for AIANO (22.5 vs. 34.17). Notable reductions were observed in mental demand (15 vs. 35, p = 0.005), physical demand (5 vs. 20, p = 0.008), effort (15 vs. 35, p = 0.003), and frustration (10 vs. 50, p = 0.001), with the AIANO score listed first in each pair. Temporal demand did not differ significantly.
- Usability – AIANO received uniformly high ratings (4–5 on a 5‑point scale) across all eight dimensions, whereas Label Studio scores were lower and more variable. The composite usability score was 4.25 for AIANO versus 2.375 for Label Studio (p < 0.001). Participants also rated AI‑assisted features (search, answer generation) as “extremely useful” (5/5).
- Retrieval Performance – AIANO‑generated datasets achieved higher precision (0.889 vs. 0.867), recall (0.883 vs. 0.783), and F1 (0.860 vs. 0.787), representing an average 8.2 % improvement, driven primarily by a 12.8 % boost in recall.
- Qualitative Feedback – Users praised AIANO’s integrated full‑text search and AI‑generated answer suggestions for reducing cognitive load and speeding up the workflow. In contrast, Label Studio users reported sluggish document switching, difficulty copying text, and navigation challenges.
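The reported relative improvements can be reproduced directly from the per-metric scores above:

```python
# Reproducing the reported relative improvements from the per-metric
# scores (AIANO vs. Label Studio), as percentages.
label_studio = {"precision": 0.867, "recall": 0.783, "f1": 0.787}
aiano = {"precision": 0.889, "recall": 0.883, "f1": 0.860}

gains_pct = {k: (aiano[k] - label_studio[k]) / label_studio[k] * 100
             for k in label_studio}
avg_gain = sum(gains_pct.values()) / len(gains_pct)
# gains_pct["recall"] ≈ 12.8 and avg_gain ≈ 8.2, matching the reported figures
```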
Discussion
The authors interpret these results as evidence that tightly coupling LLM assistance with a task‑specific annotation pipeline can simultaneously accelerate dataset creation and improve data quality. The reduction in frustration is highlighted as especially important because high frustration correlates with annotator burnout and lower annotation fidelity. The study also demonstrates that AI assistance does not merely “speed up” work at the expense of accuracy; rather, it helps annotators locate more relevant documents (higher recall) while maintaining or slightly improving precision.
Limitations include the narrow domain (German general‑knowledge short texts) and reliance on a single LLM (Meta Llama 70B). Generalization to other languages, domains (e.g., biomedical, legal), or smaller LLMs remains to be validated. Moreover, AIANO’s AI suggestions are still “assistive” rather than fully automated, so performance may degrade if the underlying LLM produces low‑quality outputs.
Future work is proposed in three directions: (1) extending the platform to multilingual, multi‑domain settings; (2) integrating automatic quality‑control metrics (e.g., uncertainty estimation) to better triage when AI assistance should be trusted; and (3) exploring lightweight, open‑source LLM back‑ends to reduce inference cost while preserving the collaborative workflow.
Conclusion
AIANO represents a significant step forward for IR dataset annotation by providing a modular, AI‑augmented environment that reduces annotation time by nearly half, lowers cognitive workload, improves user satisfaction, and yields higher‑quality retrieval datasets. The study validates the hypothesis that strategic human‑AI collaboration can overcome the bottlenecks of traditional annotation pipelines, offering a scalable solution for the growing demand for high‑quality IR benchmarks in the era of RAG and large language models.