Efficient Human-in-the-Loop Active Learning: A Novel Framework for Data Labeling in AI Systems
Modern AI algorithms require labeled data, yet in the real world the majority of data are unlabeled, and labeling them is costly. This is particularly true in areas that demand special skills, such as the reading of radiology images by physicians. To use experts' time for data labeling as efficiently as possible, one promising approach is human-in-the-loop active learning. In this work, we propose a novel active learning framework with significant potential for application in modern AI systems. Unlike traditional active learning methods, which focus only on determining which data point should be labeled, our framework also introduces an innovative perspective on incorporating different query schemes. We propose a model that integrates the information from different types of queries; based on this model, our active learning framework can automatically determine how the next question is queried. We further develop a data-driven exploration-and-exploitation scheme within our active learning method, which can be embedded in numerous active learning algorithms. Through simulations on five real-world datasets, including a highly complex real-image task, our proposed active learning framework exhibits higher accuracy and lower loss than competing methods.
💡 Research Summary
The paper addresses the practical bottleneck that modern AI systems require large amounts of labeled data while most real‑world data remain unlabeled, and labeling can be expensive, especially in domains that need expert knowledge such as medical imaging. Traditional human‑in‑the‑loop active learning (AL) methods focus solely on selecting which instances to query, assuming a single “class” question (“What is the label of this point?”). The authors propose a novel, model‑agnostic AL framework that simultaneously decides what to ask and which data points to ask about, thereby extending the query space beyond the conventional single‑label query.
Three query types are defined:
- Class – the standard full‑label query that returns the exact class of a single instance.
- All – a binary query of the form “Are all of the m instances in set S from class c?”; a “yes” answer provides full information for all m instances, while a “no” answer indicates that at least one instance is not of class c.
- Any – a binary query of the form “Is any of the m instances in set S from class c?”; a “yes” answer confirms the presence of at least one instance of class c, and a “no” answer confirms that none belong to class c.
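The three query types above map directly onto simple predicates over a labeled set. The sketch below (hypothetical helper names; not code from the paper) shows how an oracle with access to the true labels would answer each query:

```python
from typing import Sequence

def answer_class(labels: Sequence[int], i: int) -> int:
    """Class query: return the exact label of instance i."""
    return labels[i]

def answer_all(labels: Sequence[int], S: Sequence[int], c: int) -> bool:
    """'All' query: are all instances in S from class c?"""
    return all(labels[i] == c for i in S)

def answer_any(labels: Sequence[int], S: Sequence[int], c: int) -> bool:
    """'Any' query: is any instance in S from class c?"""
    return any(labels[i] == c for i in S)
```

A "yes" from `answer_all` pins down every label in `S`, while a "no" from `answer_any` rules out class `c` for every member of `S`; both cases convey information about `m` instances from a single binary answer.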
These queries can dramatically reduce annotation cost because a single answer can convey information about multiple samples, and the binary nature often makes the task easier for human experts.
The core of the framework is a probabilistic classifier (p(x;\theta)) that outputs class‑probability vectors. For any query (q) and answer (a), a cross‑entropy loss (\ell_k(q,a;\theta) = -\log \Pr(a|q;\theta)) is defined, which unifies full and partial information under a single objective. The loss for an “All” or “Any” query decomposes into the sum of losses of the individual instances involved, preserving consistency with standard supervised learning.
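As a concrete illustration of the loss (\ell(q,a;\theta) = -\log \Pr(a|q;\theta)), the sketch below computes it from a matrix `P` of per-instance class probabilities, assuming instances are conditionally independent under the model (the function names are ours, not the paper's). Under that assumption, the loss for a "yes" answer to an "All" query is exactly the sum of the individual Class-query losses, matching the decomposition described above:

```python
import numpy as np

def loss_class(P: np.ndarray, i: int, c: int) -> float:
    """Class query: -log p_i(c)."""
    return float(-np.log(P[i, c]))

def loss_all(P: np.ndarray, S: list, c: int, answer: bool) -> float:
    """'All' query: under independence, Pr(yes) = prod_{i in S} p_i(c)."""
    p_yes = np.prod(P[S, c])
    return float(-np.log(p_yes) if answer else -np.log(1.0 - p_yes))

def loss_any(P: np.ndarray, S: list, c: int, answer: bool) -> float:
    """'Any' query: under independence, Pr(no) = prod_{i in S} (1 - p_i(c))."""
    p_no = np.prod(1.0 - P[S, c])
    return float(-np.log(1.0 - p_no) if answer else -np.log(p_no))
```

For example, with `P = [[0.9, 0.1], [0.8, 0.2]]`, `loss_all(P, [0, 1], 0, True)` equals `loss_class(P, 0, 0) + loss_class(P, 1, 0)`, so the partial-information queries stay consistent with standard supervised cross-entropy.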
To decide which query to issue, the authors introduce an information‑gain function (G(P|R)) that measures the distance between a pre‑query probability matrix (P) and a post‑query matrix (R). By choosing (G) as KL‑divergence, total variation, or any suitable divergence, the framework recovers classic entropy‑ or variance‑based AL criteria as special cases for the Class query. For "All" and "Any", the set of admissible post‑query probability matrices is defined by the logical constraints of the query (e.g., all rows equal to a one‑hot vector for a "yes" answer to All). The expected gain for a candidate query is then the answer‑probability‑weighted average of the per‑answer gains, (\mathbb{E}[G(q)] = \sum_{a} \Pr(a|q;\theta)\, G(P|R_{q,a})), where (R_{q,a}) denotes the post‑query matrix implied by answer (a).
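To see how this recovers a classic criterion, consider a Class query on instance (i) with (G) chosen as KL‑divergence: answer (c) turns row (i) into a one‑hot vector, so (G(P|R) = -\log p_i(c)), and averaging over answers with weights (p_i(c)) yields the Shannon entropy of row (i), i.e., standard entropy sampling. A minimal sketch of this special case (our own illustrative function, not the paper's code):

```python
import numpy as np

def expected_gain_class(P: np.ndarray, i: int, eps: float = 1e-12) -> float:
    """Expected KL information gain of a Class query on instance i.

    For answer c the post-query row is one-hot at c, so
    KL(R_i || P_i) = -log p_i(c); weighting by Pr(c) = p_i(c)
    gives the Shannon entropy of row i (entropy sampling).
    """
    p = np.clip(P[i], eps, 1.0)  # guard against log(0)
    return float(np.sum(p * -np.log(p)))
```

An instance with a uniform predictive row therefore maximizes the expected gain, which is exactly the behavior of entropy-based uncertainty sampling; "All" and "Any" queries generalize this by averaging the gain over their two admissible post-query matrices.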