Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback

Reading time: 5 minutes

📝 Original Info

  • Title: Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback
  • ArXiv ID: 2602.16183
  • Date: 2026-02-18
  • Authors:
    - Subham Pokhriyal (Indian Institute of Technology Ropar, Rupnagar, India) – subham.22csz0002@iitrpr.ac.in
    - Shweta Jain (Indian Institute of Technology Ropar, Rupnagar, India) – shwetajain@iitrpr.ac.in
    - Vaneet Aggarwal (Purdue University, West Lafayette, USA) – vaneet@purdue.edu

📝 Abstract

We study the *Submodular Welfare Problem* (SWP), where items are partitioned among agents with monotone submodular utilities to maximize the total welfare under *bandit feedback*. Classical SWP assumes full value-oracle access, achieving $(1-1/e)$ approximations via continuous-greedy algorithms. We extend this to a *multi-agent combinatorial bandit* framework (MA-CMAB), where actions are partitions under full-bandit feedback with non-communicating agents. Unlike prior single-agent or separable multi-agent CMAB models, our setting couples agents through shared allocation constraints. We propose an explore-then-commit strategy with randomized assignments, achieving $\tilde{\mathcal{O}}(T^{2/3})$ regret against a $(1-1/e)$ benchmark, the first such guarantee for the partition-based Submodular Welfare Problem under bandit feedback.

💡 Deep Analysis

📄 Full Content

In combinatorial multi-armed bandit (CMAB) problems, a learner selects a subset of base arms at each round and receives stochastic feedback. The challenge stems from the structured yet exponentially large action space, which makes exploration and optimization nontrivial. Two types of feedback are typically studied in CMAB. With semi-bandit feedback, the learner observes the individual reward contribution of each chosen arm in the selected subset; this allows efficient estimation and often leads to faster learning. With full-bandit feedback, however, only the total reward of the chosen set is observed, without any information about individual arm contributions. This makes learning significantly harder, as marginal values must be inferred indirectly, often requiring deliberate and suboptimal exploration.

(A preliminary version of this work was accepted for publication as an extended abstract in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2026.)
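To make the feedback distinction concrete, the following minimal sketch (not from the paper) contrasts the two regimes. The additive per-arm rewards, Gaussian noise, and all variable names are illustrative assumptions only; the paper's rewards are submodular set functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, chosen = 5, [0, 2, 3]        # hypothetical base arms and selected subset
means = rng.uniform(size=n_arms)     # unknown per-arm mean rewards (toy model)

# Semi-bandit feedback: one noisy observation per chosen arm.
semi_bandit = {a: means[a] + rng.normal(scale=0.1) for a in chosen}

# Full-bandit feedback: a single noisy scalar for the whole subset, so
# individual arm contributions must be inferred indirectly.
full_bandit = sum(means[a] for a in chosen) + rng.normal(scale=0.1)

print(semi_bandit)   # per-arm signals available under semi-bandit feedback
print(full_bandit)   # the only signal available under full-bandit feedback
```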

Despite its challenges, full-bandit feedback more accurately reflects many real-world scenarios where per-arm signals are inaccessible due to privacy concerns, measurement limits, or system constraints. Applications such as recommender systems, data summarization, fair allocation, and revenue and influence maximization frequently operate in such regimes, where only aggregate outcomes are observable (Fourati et al., 2023; Nie et al., 2023; Fourati et al., 2024a; Pokhriyal et al., 2025). This motivates the need for learning algorithms that can operate effectively in the full-bandit setting.

Our work addresses a multi-agent variant of this problem, grounded in the well-studied Submodular Welfare Problem (SWP). In the offline SWP, the objective is to divide a set of items among multiple agents (partitions) so as to maximize the total (utilitarian) welfare, where each agent's utility is a monotone submodular function. This problem arises in domains such as economics, fair division, and resource allocation. In the value-oracle model, the problem admits a 1/2-approximation via a greedy algorithm (Fisher et al., 1978; Lehmann et al., 2001), while more advanced techniques (such as the continuous-greedy algorithm with pipage or randomized rounding) achieve the optimal $(1-1/e)$ approximation (Vondrák, 2008). The SWP reduces to submodular maximization under a partition matroid constraint.
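As a concrete illustration, here is a minimal sketch of the value-oracle greedy 1/2-approximation, which assigns each item to the agent with the largest marginal gain. It assumes exact oracle access to each utility; the toy coverage utilities and all names are hypothetical, not the paper's code.

```python
from typing import Callable, List, Set

def greedy_swp(items: List[int],
               utilities: List[Callable[[Set[int]], float]]) -> List[Set[int]]:
    """Greedy 1/2-approximation for the offline SWP (Fisher et al., 1978;
    Lehmann et al., 2001), assuming an exact value oracle for each agent's
    monotone submodular utility."""
    bundles: List[Set[int]] = [set() for _ in utilities]
    for item in items:
        # Give the item to the agent with the largest marginal gain.
        gains = [w(S | {item}) - w(S) for w, S in zip(utilities, bundles)]
        bundles[gains.index(max(gains))].add(item)
    return bundles

# Toy coverage utilities: agent i values the items in its coverage set.
coverage = [{0, 1, 2}, {2, 3}, {3, 4}]
utilities = [lambda S, c=c: float(len(S & c)) for c in coverage]
print(greedy_swp(items=[0, 1, 2, 3, 4], utilities=utilities))
# -> [{0, 1, 2}, {3}, {4}], total welfare 5
```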

Online submodular optimization under bandit feedback has gained considerable attention in recent literature, particularly in single-agent CMAB settings. For non-monotone objectives, Fourati et al. (2023) adapt the RANDOMIZED-SUM algorithm of Buchbinder et al. (2015), establishing sublinear $1/2$-regret. In the monotone case under cardinality constraints, Nie et al. (2023) build on the classical GREEDY algorithm of Nemhauser et al. (1978), while Fourati et al. (2024a) extend the STOCHASTIC-GREEDY approach of Mirzasoleiman et al. (2015); both yield sublinear $(1-1/e)$-regret.

However, all these works are limited to a single-agent setting. Extensions to multi-agent scenarios have been proposed in federated and decentralized optimization contexts (Konečný et al., 2016; McMahan et al., 2017; Li et al., 2020; Hosseinalipour et al., 2021; Elgabli et al., 2022; Fourati et al., 2024b), but these typically assume continuous updates and communication between agents. In contrast, we focus on discrete, non-communicating, multi-agent settings where the global utility depends on structured item allocations and feedback is limited to the total reward. Practical applications of this framework include partitioned recommendation, distributed sensing, influence maximization across user groups, and equitable allocation of indivisible goods.

In this work we formalize a connection between the offline SWP (Vondrák, 2008) and the online multi-agent CMAB framework, with a particular focus on settings involving full-bandit feedback. In the online variant, the learner repeatedly selects a feasible partition of items among agents, receives aggregate reward feedback, and aims to minimize regret with respect to a benchmark offline solution. Our formulation departs from prior work in four key aspects: (i) all agents are present and active in every round (no partial participation); (ii) agents are non-communicating, and each agent $i$ has a distinct monotone submodular utility $w_i(\cdot)$; (iii) the learner selects a partition of the item set, assigning disjoint subsets to agents, rather than a single best subset; and (iv) the optimization objective is the total welfare $\sum_i w_i(\cdot)$, which couples feasibility across agents.
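The paper's algorithm is not reproduced here, but the following sketch shows the generic explore-then-commit pattern in this interaction protocol. The brute-force enumeration of all partitions and the toy modular (hence submodular) utilities are loud simplifying assumptions; the actual MA-CMAB algorithm instead uses randomized assignments to sidestep the exponential partition space.

```python
import itertools
from typing import Callable, Tuple

import numpy as np

def etc_partition_bandit(n_items: int, n_agents: int, T: int, m: int,
                         pull: Callable[[Tuple[int, ...]], float]) -> float:
    """Generic explore-then-commit for a partition bandit under full-bandit
    feedback: sample every feasible partition m times, then commit to the
    empirically best one for the remaining rounds."""
    # WARNING: enumerating n_agents**n_items partitions is illustration only.
    partitions = list(itertools.product(range(n_agents), repeat=n_items))
    # Exploration phase: estimate each partition's mean welfare.
    estimates = {p: np.mean([pull(p) for _ in range(m)]) for p in partitions}
    best = max(estimates, key=estimates.get)
    # Commit phase: play the empirical best for the remaining budget.
    return sum(pull(best) for _ in range(T - m * len(partitions)))

rng = np.random.default_rng(1)
weights = rng.uniform(size=(2, 3))   # toy modular utilities (hypothetical)
def pull(p):                         # p[i] = agent that receives item i
    return sum(weights[a][i] for i, a in enumerate(p)) + rng.normal(scale=0.05)

print(etc_partition_bandit(n_items=3, n_agents=2, T=2000, m=20, pull=pull))
```

Balancing the exploration budget against the commit horizon is what produces the $\tilde{\mathcal{O}}(T^{2/3})$ rate typical of explore-then-commit analyses.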

To the best of our knowledge, this is the first work to cast discrete partitioning for submodular welfare as a multi-agent combinatorial multi-armed bandit problem under full-bandit feedback.

Reference

This content is AI-processed based on open access ArXiv data.
