Original Info
- Title:
- ArXiv ID: 2512.21794
- Date:
- Authors: Qiushi Han, David Simchi-Levi, Renfei Tan, Zishuo Zhao (alphabetical order)
Abstract
We study the sequential mechanism design problem in which a principal seeks to elicit truthful reports from multiple rational agents while starting with no prior knowledge of agents' beliefs. We introduce the Distributionally Robust Adaptive Mechanism (DRAM), a general framework combining insights from both mechanism design and online learning to jointly address truthfulness and cost-optimality. Throughout the sequential game, the mechanism estimates agents' beliefs and iteratively updates a distributionally robust linear program with shrinking ambiguity sets to reduce payments while preserving truthfulness. Our mechanism guarantees truthful reporting with high probability while achieving Õ(N√T) cumulative regret, and we establish a matching lower bound showing that no truthful adaptive mechanism can asymptotically do better. The framework generalizes to plug-in estimators (DRAM+), supporting structured priors and delayed feedback. To our knowledge, this is the first adaptive mechanism in this general setting that maintains truthfulness and achieves optimal regret when incentive constraints are unknown and must be learned.

Full Content
In parallel, the theory of online learning studies algorithms that learn and make decisions in unfamiliar environments, aiming to approach the performance of oracles that have full knowledge from the start. The principal typically begins with no knowledge of the environment, and information is acquired through repeated data collection and carefully designed statistical methods. A common assumption is that the environment is unknown but stationary. For example, in the classical multi-armed bandit model [Lattimore and Szepesvári, 2020], each arm's reward follows a stochastic distribution, and the best arm can be discovered via repeated sampling. An alternative is to assume the worst-case scenario from the environment, i.e., fully adversarial feedback. These algorithms have wide applications in recommendation, pricing, scheduling, and more [Lattimore and Szepesvári, 2020]. In application, however, they often interact with humans, who are neither stationary nor fully adversarial. In fact, a foundational assumption in economics is that humans are rational [Von Neumann and Morgenstern, 2007].

* Author ordering is alphabetical. All authors made valuable contributions to the writing, editing, and overall management of the project. Renfei Tan led the project and was the first to propose the idea of achieving cost-efficient adaptive mechanisms via sequentially accurate distributionally robust mechanisms. He is the main developer of the modeling, algorithm, and corresponding theorems, as well as the main writer of the paper. Zishuo Zhao contributed through discussions, proofreading, and comprehensive review. He proposed the main idea of distributionally robust mechanisms, with insights on the synergy between them and online learning. He also helped formulate the examples and wrote the literature review on peer prediction. Qiushi Han proposed the initial idea of the two-phased (warm-start and adaptive) approach to relax the common knowledge assumptions and contributed to the development of the main algorithm. He led the design and conduct of the numerical experiments in this work. David Simchi-Levi supervised the research and assisted in writing the paper. † Corresponding author.

Figure 1. Nature samples an unlabeled image with an unknown ground truth, which is then independently observed by multiple agents. Each agent's observation (type) is private to herself. The agents then report to the principal and receive rewards in the end. Lying or lazy behavior is possible, since the principal does not know the ground truth or the agents' observations. One objective is to incentivize truthful behavior via reward mechanisms based only on agents' reports.
Therefore, the strengths and weaknesses of the two fields complement each other: mechanism design can incentivize well-behaved reporting from rational agents, which learning guarantees require, while online learning can supply the knowledge that efficient mechanisms need. For this reason, the combination of mechanism design and online learning has received increasing attention, most notably in settings such as online contract design [Ho et al., 2014, Zhu et al., 2022] and online auctions [Blum et al., 2004, Cesa-Bianchi et al., 2014]. However, the design of general multi-agent adaptive mechanisms remains an under-explored problem.
In this work, we study the sequential mechanism design problem in which a principal designs, in each round, reward mechanisms for multiple rational agents, while starting with no prior knowledge of agents' beliefs. The principal's objective is three-fold: data quality, truthfulness, and cost-optimality. The principal wants to design a reward mechanism that obtains the highest-quality data from the task, while incentivizing truthful reports from agents, and does so in a cost-minimal way. As a motivating example, consider the image labeling task where the principal assigns raw images to multiple agents for labeling (Figure 1). In each round, each agent makes a private observation of the image as her type, then reports her type to the principal, and finally receives a payment.
The multi-agent mechanism design problem faces several major challenges. First, each agent's observation is private and unknown to the principal. Agents are rational and pursue utility, so they may lie or become lazy. Moreover, ground truth may be unavailable or expensive to obtain, making it hard to directly control report quality or infer agent skills. Second, classical mechanism designs that rely on common knowledge are inapplicable: even mechanisms that only focus on maintaining truthfulness often assume accurate knowledge of posteriors or correlation structure [Miller et al., 2005]. Finally, we cannot exploit problem-specific structure. For example, in auction design, a second-price auction incentivizes truthfulness without any common knowledge. For a general mechanism design problem, however, maintaining truthfulness under limited knowledge is already a substantial difficulty.
Our work draws insights from both the mechanism design and online learning literature. From a mechanism design perspective, we relax the common knowledge assumption and generalize the optimal mechanism design problem. From an online learning perspective, we relax the setting from always-honest agents to the more realistic rational agents and generalize the prediction with expert advice problem.
The necessity of truthfulness. We show that truthfulness is in a sense "necessary" for sequential decision making. Any decision-making process based on agents' rational reports can achieve the highest performance if and only if agents are truthful (up to permutations). This result builds on Blackwell's informativeness theorem [Blackwell, 1953] and is stronger than the revelation principle [Myerson, 1979]. Since mechanism design is itself a decision-making task, it implies that truthfulness is necessary for learning optimal mechanisms.
Distributionally robust mechanisms. We introduce and study a family of distributionally robust mechanisms, which preserve truthfulness and incur low cost even when the principal's knowledge is ambiguous. We study the relations between design parameters and achievable robustness, establish methods to tractably compute these mechanisms, and finally characterize the cost of robustness.
Optimal adaptive mechanism design. We design a general framework named Distributionally Robust Adaptive Mechanism (DRAM), which estimates agents' beliefs and iteratively updates a distributionally robust mechanism based on estimation accuracy. Our algorithm achieves an Õ(N√T) regret guarantee (up to logarithmic factors) while preserving truthfulness with high probability. We complement this result with a matching lower bound showing that no truthful adaptive mechanism can do better in the worst case. The theoretical results are validated by numerical simulations. This framework extends to any plug-in estimator (e.g., structured or regularized estimators for discrete distributions) and is compatible with delayed or batched feedback. To our knowledge, this is the first general adaptive mechanism that maintains truthfulness and achieves optimal regret when incentive constraints depend on unknown and learned information.
Our work is related to several strands of the mechanism design and online learning literature.
1.2.1 Online and Adaptive Mechanism Design. The combination of mechanism design and online learning is a fast-growing direction in algorithmic game theory [Roughgarden, 2010]. Prior work spans a variety of participation structures and problem settings. Some works study synchronous settings, where the same agents have multiple encounters with each other, as in repeated games [Hart and Mas-Colell, 2000, Papadimitriou et al., 2022, Satchidanandan and Dahleh, 2023]. Others consider asynchronous settings, where new agents may arrive and depart over time, a structure particularly relevant in online auctions and advertising platforms [Cesa-Bianchi et al., 2007, Choi et al., 2020, Hajiaghayi et al., 2004, Milgrom, 2019, Wang et al., 2017]. Under both settings, two domains have received the most attention: online contract design [Ho et al., 2014, Zhu et al., 2022] and online auctions [Blum et al., 2004, Cesa-Bianchi et al., 2014]. These works typically focus on learning an optimal mechanism, such as an optimal contract or an optimal reserve price, using tools from bandit learning.
A key observation is that preserving truthfulness is not a substantive difficulty in these existing models. In contract design, agents' choices naturally reflect their private information, and no truthful reporting constraint is involved. In online auctions, structural properties ensure incentive compatibility: for instance, in second-price auctions, truthful bidding remains a dominant strategy even when the reserve price is inaccurate. Even without any knowledge, the second-price auction guarantees agents' truthfulness. As a result, works such as [Cesa-Bianchi et al., 2014] can safely explore suboptimal mechanisms during learning without risking incentive distortion or data contamination. The focus is solely on learning an optimal mechanism, which makes these problems more or less reducible to a bandit problem [Cesa-Bianchi et al., 2014, Zhu et al., 2022].
In contrast, in a general mechanism design problem, preserving truthfulness becomes a significant difficulty. When the principal begins with ambiguous knowledge, an improperly constructed mechanism can immediately encourage agents to lie or exert low effort, thereby corrupting the collected data and undermining subsequent learning. Thus, unlike prior literature, maintaining truthfulness throughout the entire learning trajectory is not merely desirable but essential, and this requirement is one of the central challenges we must address.
1.2.2 Prediction with Expert Advice. We note that our setup is a stochastic, label-efficient variant of the classical prediction with expert advice problem. The prediction with expert advice problem is fundamental in online learning [Cesa-Bianchi and Lugosi, 2006]. In the standard framework, the agents (often called "experts") sequentially provide arbitrary and even adversarial signals, and the principal's objective is to implement an aggregation algorithm that achieves sublinear regret. A simplification is to assume agents behave stochastically (report signals according to a probability law), under which the aggregation regret can be significantly improved [Cesa-Bianchi et al., 2004]. The stochastic variant also has connections with other online learning problems such as online optimization [Agarwal et al., 2017, Cesa-Bianchi et al., 2007, Gaillard et al., 2014], with extensions to contextual or non-stationary settings [Besbes et al., 2016].
In practical settings, acquiring a true label might be expensive, and it may only be feasible to query the true label for a small portion of rounds. This is the label-efficient setting of prediction with expert advice [Cesa-Bianchi et al., 2005, Helmbold and Panizza, 1997]. Roughly speaking, the aggregation regret decreases as the inverse square root of the number of queries. There also exist adaptive algorithms that achieve the same regret with far fewer queries in benign cases [Castro et al., 2023, Mitra and Gopalan, 2020].
Compared to the standard framework, our setup deals with rational experts who require proper incentives to behave well; in terms of the difficulty of response aggregation, this lies between the adversarial and the stochastic settings. The assumption of rationality brings additional considerations in the design of incentives, which is the main concern of our work. Also, different from the standard or label-efficient settings, in our model the true signal is never revealed (or only revealed for a constant number of rounds), making it harder to distinguish poorly performing experts.
1.2.3 Information Elicitation and Peer Prediction. The field of information elicitation studies the mechanism design task of incentivizing honest feedback from untrusted but rational participants, generally by designing scoring rules [Li et al., 2022] as rewards or penalties for participants. In particular, peer prediction [Miller et al., 2005] studies scenarios in which ground truth is unavailable for direct verification of collected reports, with applications in dataset acquisition and evaluation [Chen et al., 2020, Zheng et al., 2024], crowdsourcing [Dasgupta and Ghosh, 2013], and recent blockchain-based decentralized ecosystems [Wang et al., 2023, Zhao et al., 2024]. The general paradigm of peer prediction mechanisms is to ask multiple participants the same question and reward them according to comparisons among their reports. While peer prediction mechanisms provide elegant results on truthful Nash equilibria without requiring ground-truth information, most existing mechanisms rely on strong, unrealistic assumptions of known prior and observation matrices, leaving a gap to practical usage in real-world systems.
For practical usage, researchers have developed a series of works with relaxed assumptions or stronger incentive guarantees. In particular, [Kong, 2024] develops a prior-free multi-task peer prediction mechanism with dominant-strategy incentive compatibility, though it lacks the permutation-proof property, which is impossible for any prior-free peer prediction mechanism [Kong and Schoenebeck, 2019]. Besides, [Shnayder et al., 2016] provides an informed truthful mechanism ensuring that the truthful equilibrium achieves the highest utilities among all Nash equilibria, and [Zhang et al., 2025] develops a mechanism with a stochastic dominance property ensuring incentive compatibility even under non-linear utilities.
In our setting, we address the gap between existing prior-dependent designs and reality by acquiring the prior distribution through online learning, with a multi-round adaptive mechanism that learns the distributional information during the process. We also explicitly consider the robustness property that ensures incentive guarantees under inaccurate knowledge, thus making the peer prediction framework applicable in realistic applications.
We consider the sequential mechanism design problem where a principal seeks to elicit truthful reports from rational agents. The principal sequentially assigns T prediction tasks to a group of N rational agents. Each task has a true label y_t ∈ Y, i.i.d. sampled from an unknown and stationary distribution P_Y(·). Unless stated otherwise, this true label is not revealed to anyone, neither the principal nor the agents. We let Y be finite to avoid mathematical complications, and assume each true label y appears with uniformly bounded probability p_min ≤ P_Y(y) ≤ p_max. In each round, each agent i = 1, ..., N independently studies the task, acquiring her own observation X_it ∈ Y at a constant cost c. X_it is generated according to the agent's skill P_i(x | y), a stationary conditional probability law. We assume P_i is non-degenerate, i.e., there exist two labels y, y′ with P_i(· | y) ≠ P_i(· | y′).
In other words, observation should bring new information by stochastically distinguishing at least two of the labels. Each agent might know her own skill distribution P_i, but has no information about anyone else's, and the principal initially knows none of them. Aside from observing, agents also have an outside option of lazily reporting a label without making an observation (which does not incur the cost c).
After studying the task, agents independently produce their public reports Z_it ∈ Y to the principal. We assume agents are risk-neutral and myopic. Being risk-neutral means agents aim to maximize their expected reward, conditional on the public and private information they have. Being myopic means agents only care about the immediate reward in the current round, not future rewards. Under these settings, we have rational agents who do not necessarily report their observations. Instead, they would lie (report Z_it ≠ X_it) or be lazy (report Z_it without acquiring the observation X_it) whenever they expect an advantage in doing so. We denote the observation and report profiles of all agents in a round by X_t and Z_t. Note that P_Y and the P_i together define a joint law P_X over X ∈ Y^N, and we later show that learning this P_X is crucial for optimal mechanisms.
Collecting the reports, the principal rewards each agent i via a reward mechanism r_it(Z_1, ..., Z_t). The reports can then be used for downstream decision-making tasks, such as aggregation. Note that the reward mechanism is non-anticipating, meaning it can depend only on past and current, but not future, report profiles.
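To make the interaction protocol concrete, the following sketch simulates one round of the game under truthful play. It is illustrative only: the helper name sample_round and the way skills are passed as row-stochastic matrices are our assumptions, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_round(P_Y, skills, reward_fn):
    """Simulate one round: nature draws y_t, agents observe privately, report, and get paid.

    P_Y       : length-K prior over labels.
    skills    : list of K-by-K row-stochastic matrices, skills[i][y, x] = P_i(X_i = x | y).
    reward_fn : maps the report profile Z (length N) to a length-N payment vector.
    """
    K = len(P_Y)
    y = rng.choice(K, p=P_Y)                                         # hidden true label y_t
    X = [rng.choice(K, p=skills[i][y]) for i in range(len(skills))]  # private observations X_it
    Z = list(X)                                                      # truthful reports; deviations would go here
    payments = reward_fn(Z)                                          # principal sees only the reports Z_t
    return y, X, Z, payments
```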
The principal aims to design the online reward mechanism (r_it), i ∈ [N], t ∈ [T], with three objectives:
• Truthfulness (a.k.a. incentive compatibility): given that all other agents act honestly, an agent maximizes her own expected utility when she works, obtains an observation, and then reports honestly (Z_it = X_it).
• Data quality: the reward mechanism should incentivize the highest-quality reports, so that downstream decision-making tasks may achieve the optimal objective.
• Cost-optimality: while maintaining truthfulness and data quality, the principal minimizes its total expected payment to agents.
We now compare our setup with typical modeling assumptions in the online learning and mechanism design literature. Online learning mainly targets minimizing cumulative decision error, while treating all reports as truthful. Mechanism design, by contrast, centers on strategic incentives, but usually presumes agents' type distributions are known or even common knowledge. These assumptions ease analysis, yet rarely hold in practice. Our model pursues both goals at once and relaxes the assumptions from both fields.
Remark 1. The proposed model can be further generalized to match the classical model in the mechanism design literature. Here, agents' observations are their own types for the round. Assume that in each round agents' types X are sampled from a stationary joint distribution. Agents then report their types (not necessarily truthfully) Z to the principal and receive rewards. All of our analysis and algorithms apply to this generalized setting. In fact, our analysis does not depend on how the types are generated; it makes no use of P_Y and P_i and focuses exclusively on the joint law P_X.
Remark 2. We note that each agent's utility is linear in only her own reward (u_i = r_i). In the most general setting of mechanism design, an agent's utility is a function of all agents' types and the resource allocation from the principal: u_i : Y^N × R → ℝ, where R is the space of the principal's resource allocation decisions. For example, in contracts, utility depends on the agent's own type x_i and the principal's payment r_i (u_i = f(x_i, r_i)). In auctions, utility depends on whether or not the agent gets the item (with probability q), her valuation of the item (type x_i), and her requested payment r_i (u_i = q · x_i − r_i). Our analysis may potentially be generalized to the case where such a utility function is exactly known by the principal.
We conclude this section by revealing the importance of truthfulness. After all, the principal's top objectives in outsourcing tasks are to improve data quality and lower costs. Truthfulness, as a mechanism design objective, might not be of interest if the mechanism that reaches the highest quality or the lowest cost promotes dishonest behaviors. From the revelation principle [Myerson, 1979], we know that truthfulness is "free", in the sense that we lose nothing by focusing only on mechanisms with incentive-compatible Nash equilibria. For the same reason, it suffices to consider the setting where the true label y, observation X, and report Z all belong to the same space Y. The following proposition actually proves a stronger result, showing that truthfulness is not only free, but in fact almost necessary for maximal quality.

Proposition 2.1 is derived from Blackwell's informativeness theorem [Blackwell, 1953]. Each round, the true label, observations, reports, and decisions form a Markov chain: y_t → X_t → Z_t → A_t. An intuition is that optimal decision-making requires maximal information from upstream. Due to the data processing inequality [Cover, 1999], information never increases going downstream; therefore, the best approach is to preserve as much information as possible at each link. Truthful reporting preserves full information at the link X_t → Z_t, which allows for optimal subsequent decisions. Any lies from agents erode information. Lazy behavior also produces less information than observing. Aside from truthfulness, an alternative case that preserves full information is when agents use a permutation reporting strategy. However, such a case is unrealistic in practical settings, as the principal would need to know each agent's permutation rule to reverse the encoding and uncover the true observation. Therefore, this proposition essentially shows that eliciting truthfulness is the practical way to achieve maximal performance.
Proposition 2.1 also shows that truthfulness is necessary not only for data quality but also for cost-optimality. This is because the design of cost-optimal mechanisms is itself a decision-making task, so Proposition 2.1 applies.
In our work, a central modeling relaxation is that we do not assume the prior distribution of labels P_Y or the agents' skills P_i are known to the agents or the principal. The principal attempts to maintain truthfulness with unknown or inaccurately estimated knowledge. In this section, we focus on distributionally robust mechanisms, which aim to incentivize truthful behavior under knowledge ambiguity.
We begin with the analysis of optimal mechanism design with known P_Y and P_i within a single round. When no true labels are available, we apply the principle of peer prediction, which uses other agents' reports to verify a focal agent's report. The delicacy lies in the careful design of the reward mechanism to ensure that truthfulness is a Nash equilibrium.
We start with the two-agent mechanism. With a focal agent i and a reference agent j, the optimal two-agent mechanism design problem can be formulated as a linear program. The objective is to minimize the expected reward paid to agents, and the constraints are the desired properties of the mechanism (we hide the subscript t for simplicity):

min_{r_i}  E[r_i(X_i, X_j)]
s.t.  E[r_i(x, X_j) | X_i = x] ≥ c      for all x ∈ Y            (individual rationality)
      E[r_i(z, X_j) | X_i = x] ≤ 0      for all x, z ∈ Y, z ≠ x  (truthfulness)
      E[r_i(z, X_j)] ≤ 0                for all z ∈ Y            (no free lunch)          (1)

The expectation is taken under the joint probability law P_X induced by P_Y, P_i, and P_j; in fact, P_X uniquely determines the mechanism design problem. The first constraint is the individual rationality property: when the agent obtains an observation and then reports honestly, her expected reward covers the observation cost, so her net utility is non-negative. The second constraint states the truthfulness property: the agent receives a non-positive expected reward when she lies. The final constraint implements the no-free-lunch property: the agent cannot get a positive utility when she is lazy. We introduce the individual rationality and no-free-lunch constraints to prevent an arbitrary decrease of the objective via an affine transformation of the reward mechanism. (For risk-neutral agents, affine transformations of rewards do not affect utility ordering and strategic behavior.)
Example 3.1 (Image Labeling). Suppose there are two types of images, Y = {Cat, Tiger}, abbreviated C and T respectively. We further assume that the prior distribution of the image types is balanced, i.e., P_Y(C) = P_Y(T) = 0.5. For each image with an unknown true label y ∈ Y, the principal would like two agents 1, 2 to individually observe it and truthfully report their observations X_1, X_2 to label the image. Assume that both agents are 90% accurate: P_i(X_i = y | y) = 0.9 for every y ∈ Y and i = 1, 2.
The principal designs the reward mechanism as follows: both agents receive reward 1 if their reports Z_1, Z_2 agree with each other, and receive −1 otherwise, i.e., r_{1,2}(Z_1, Z_2) = 1{Z_1 = Z_2} − 1{Z_1 ≠ Z_2},
and assume that observation incurs cost c = 0.1. Now we assume that agent 2 observes and reports honestly, and analyze the incentive of agent 1.
Suppose agent 1 observes Cat. Bayes' formula gives that P(X_2 = C | X_1 = C) = 0.82 and P(X_2 = T | X_1 = C) = 0.18. On the other hand, if agent 1 does not exert the effort to observe, then her Bayesian belief on agent 2's observation (and report) is P(X_2 = C) = P(X_2 = T) = 0.5. She can then work out the expected reward under the truthful, lying, and lazy strategies: truthful (0.54) > lazy (0) > lying (−0.74). Hence truthful behavior is incentivized. In fact, this simple mechanism is a feasible solution to Eq. (1). Intuitively, after the focal agent's observation, her posterior probability of the other agent observing the same label is higher than that of observing a different label. Therefore, it is preferable to report whatever you observe in the first place. Such a mechanism is called peer prediction [Miller et al., 2005], the name originating from the fact that rational agents always try to predict their peers' observations before acting.
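As a quick numerical check of Example 3.1, the snippet below reproduces the posterior 0.82 and the expected rewards 0.54, −0.74, and 0 of the truthful, lying, and lazy strategies under the ±1 agreement mechanism. It is a sketch of the example's arithmetic, not code from the paper.

```python
import numpy as np

a, c = 0.9, 0.1                              # observation accuracy and observation cost
prior = np.array([0.5, 0.5])                 # uniform prior over {Cat, Tiger}
skill = np.array([[a, 1 - a],
                  [1 - a, a]])               # skill[y, x] = P(observe x | true label y)

# Joint law of (X1, X2): sum_y P(y) P(x1 | y) P(x2 | y)
joint = np.einsum('y,yx,yz->xz', prior, skill, skill)
belief = joint / joint.sum(axis=1, keepdims=True)   # belief[x1, x2] = P(X2 = x2 | X1 = x1)
print(belief[0, 0])                                  # 0.82, the posterior of agreement

R = np.array([[1.0, -1.0], [-1.0, 1.0]])             # +1 on agreement, -1 on disagreement
truthful = belief[0] @ R[0] - c                      # observe Cat, report Cat:    0.54
lying    = belief[0] @ R[1] - c                      # observe Cat, report Tiger: -0.74
lazy     = joint.sum(axis=0) @ R[0]                  # blind report, no cost:       0.0
print(truthful, lying, lazy)
```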
Define the belief matrix B, where B_{xx′} = P(X_j = x′ | X_i = x), and the reward matrix R, where R_{xx′} = r_i(x, x′). We also let d be a column vector representation of j's observation distribution: d_x = P(X_j = x). Then we can reformulate (1) into the following equivalent problem:

min_R  Σ_x P(X_i = x) (B R^⊺)_{xx}
s.t.  (B R^⊺)_{xx} ≥ c     for all x ∈ Y
      (B R^⊺)_{xz} ≤ 0     for all x, z ∈ Y, z ≠ x
      R d ≤ 0                                                      (2)

Note that the second constraint only enforces that pure lying strategies incur a non-positive reward; nevertheless, this is sufficient, since any mixed strategy is a convex combination of pure strategies and its corresponding reward is the same convex combination of their rewards. The final constraint requires all entries of Rd to be non-positive, thus making sure any reporting strategy without observing incurs a non-positive reward. Theorem 3.2 (Optimal cost of a two-agent peer-prediction mechanism). Suppose the belief matrix B is invertible, and there does not exist x ∈ Y such that P(X_j = x) = 1. Then the linear program (1) (equivalently, its matrix form (2)) is feasible. Moreover, the minimum achievable expected payment equals the individual rationality threshold c; that is, the optimal value of (1) (and of (2)) equals c.
In addition, at optimality, the first constraint in (1) is binding.
The tight result on the objective function is in the spirit of the classical Crémer-McLean mechanism [Crémer and McLean, 1988], which can extract full surplus from the agents when type distributions are common knowledge. The conditions are satisfied for "almost all" B and d.
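Since (2) is a finite linear program, an off-the-shelf LP solver recovers a cost-c mechanism once B, d, and the focal agent's marginal are known. The sketch below uses scipy.optimize.linprog and follows the reconstruction of (2) given above; it is illustrative, not the authors' implementation, and the helper name optimal_mechanism is ours.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_mechanism(B, d, p_focal, c):
    """Solve the matrix-form LP (2); variables are the reward matrix R (K x K), flattened row-major."""
    K = B.shape[0]
    # Objective: expected truthful payment  sum_x p_focal[x] * (B R^T)_{xx}
    cost = np.zeros(K * K)
    for x in range(K):
        cost[x * K:(x + 1) * K] = p_focal[x] * B[x]
    A_ub, b_ub = [], []
    for x in range(K):                       # individual rationality: (B R^T)_{xx} >= c
        row = np.zeros(K * K); row[x * K:(x + 1) * K] = -B[x]
        A_ub.append(row); b_ub.append(-c)
    for x in range(K):                       # truthfulness: (B R^T)_{xz} <= 0 for z != x
        for z in range(K):
            if z != x:
                row = np.zeros(K * K); row[z * K:(z + 1) * K] = B[x]
                A_ub.append(row); b_ub.append(0.0)
    for z in range(K):                       # no free lunch: (R d)_z <= 0
        row = np.zeros(K * K); row[z * K:(z + 1) * K] = d
        A_ub.append(row); b_ub.append(0.0)
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (K * K), method="highs")
    return res.x.reshape(K, K), res.fun

B = np.array([[0.82, 0.18], [0.18, 0.82]])
R, payment = optimal_mechanism(B, d=np.array([0.5, 0.5]), p_focal=np.array([0.5, 0.5]), c=0.1)
print(payment)   # 0.1 = c, as Theorem 3.2 predicts; the symmetric +/-5/32 mechanism is one optimum
```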
Example 3.3 (Optimal Mechanism in Image Labeling). Continuing the image labeling example from Example 3.1, we show that the optimal mechanism pays both agents the observation cost c = 0.1 in expectation, as a demonstration of Theorem 3.2. The optimal mechanism can be acquired by solving Eq. (2). Here, we first compute the belief matrix and agent 2's observation distribution:

B = [0.82  0.18; 0.18  0.82],    d = (0.5, 0.5)^⊺.
Solving the linear program gives the following mechanism: both agents receive reward 5/32 if their reports agree, and receive −5/32 otherwise. We now show that the mechanism satisfies all constraints and is cost-optimal. First, no-free-lunch is satisfied: when an agent reports without observation, no matter what strategy she follows, her expected reward is always 0.5 × 5/32 − 0.5 × 5/32 = 0 (since the other agent observes both labels with equal probability). When she lies, her expected reward is (BR^⊺)_{xz} = −0.1 for z ≠ x. When she is truthful, she gets Σ_{x,x′} P(X_i = x) B_{xx′} R_{xx′} = 0.1 in reward, exactly equal to her observation cost. The optimal mechanism extracts full surplus from the agents.
Remark 3. From Lemma 1 of [Radanovic and Faltings, 2013], it is known that for any mechanism with more than two agents with a truthful Bayesian Nash Equilibrium, it is possible to construct a 2-agent mechanism (with one focal and one reference agent) that achieves a truthful Bayesian Nash Equilibrium with the same expected payment. This lemma narrows our attention to reward mechanisms with only two agents. In application, the reference agent can be randomly picked to avoid collusion.
Now we move on to the scenario where the agents and the principal only have inaccurate knowledge of P_X. Suppose they only know that the true distribution of the agents' observations X belongs to some ambiguity set of candidate laws for P_X. From the principal's perspective, the challenge is to design a distributionally robust mechanism such that, for any possible realization within the ambiguity set, the truthfulness constraint is still met. Its objective now becomes minimizing the expected payment in the worst case.
The notion of (distributionally) robust mechanisms has been studied in [Bergemann and Morris, 2005, Koçyiğit et al., 2020]. We focus on a specific family that is cheap enough and guarantees truthfulness under ambiguity, but we don't pursue exact worst-case optimality. This greatly reduces computational complexity, and turns out to be sufficient for subsequent adaptive mechanism design.
We begin with a sensitivity analysis of (1) with respect to shifts in the probability law. Assume the principal obtains a design by solving (1) according to an erroneous probability law P, while the true law is P*. According to Theorem 3.2, at optimality we have a binding constraint in (1). Therefore, any slight deviation from P could lead to a violation of the constraints. To hedge against violations, following the idea of [Zhao et al., 2024], the principal can add safety margins to the constraints. Instead of only requiring the expected reward of truthful behavior to be at least c, the principal can require it to be no less than c + δ, where δ > 0 is the margin width. In this case, even if the expected reward of truthful behavior decreases under P*, as long as the decrease is no more than δ, the individual rationality property is still preserved. Under this idea, we look at a variant of the mechanism design problem with margin δ:

min_R  M := max_{x,x′} |R_{xx′}|
s.t.  (B R^⊺)_{xx} ≥ c + δ     for all x ∈ Y
      (B R^⊺)_{xz} ≤ −δ        for all x, z ∈ Y, z ≠ x
      (R d)_z ≤ −δ             for all z ∈ Y                        (3)
There are two notable features of this problem variant. First, the margin δ added to the constraints protects the principal against inaccurate knowledge at an additional cost of at least δ, since the expected payment under each possible observation is at least c + δ. This is a lower bound on the cost of robustness. Increasing δ means the mechanism from (3) is robust to a higher degree of inaccuracy, but it also costs more. Pursuing optimality, the principal wants to find the lowest δ that is just enough to guarantee the constraints are still satisfied under P*. It is therefore crucial to understand the connection between the degree of misspecification and the minimal required margin δ. Second, the objective function changes from minimizing the expected payment to minimizing the worst-case payment. The reason for this change is to limit the sensitivity of the expected payment to worst-case deviations of the probability law from P* to P. Following the "compactness" criterion discussed in [Zhao et al., 2024], the outcome incurring the highest absolute payment has the highest sensitivity to probability deviation, hence a large δ/M ratio ensures high robustness to such deviations. Therefore, for a fixed δ, we would like to make M as small as possible.
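The margin-δ variant (3) only changes the right-hand sides and the objective of the LP above. The sketch below follows the reconstruction of (3) given earlier (margin δ applied to all three constraint families, objective equal to the worst-case payment M = max |R_{xx′}|, linearized via an auxiliary variable); it is a sketch under those assumptions, not a definitive implementation.

```python
import numpy as np
from scipy.optimize import linprog

def robust_mechanism(B, d, c, delta):
    """Margin-delta LP (3): minimize the worst-case payment M = max |R_{xx'}|."""
    K = B.shape[0]
    n = K * K                                    # variables: R flattened row-major, then M
    cost = np.zeros(n + 1); cost[-1] = 1.0       # minimize M
    A_ub, b_ub = [], []
    for x in range(K):                           # IR with margin: (B R^T)_{xx} >= c + delta
        row = np.zeros(n + 1); row[x * K:(x + 1) * K] = -B[x]
        A_ub.append(row); b_ub.append(-(c + delta))
    for x in range(K):                           # lying with margin: (B R^T)_{xz} <= -delta
        for z in range(K):
            if z != x:
                row = np.zeros(n + 1); row[z * K:(z + 1) * K] = B[x]
                A_ub.append(row); b_ub.append(-delta)
    for z in range(K):                           # no free lunch with margin: (R d)_z <= -delta
        row = np.zeros(n + 1); row[z * K:(z + 1) * K] = d
        A_ub.append(row); b_ub.append(-delta)
    for idx in range(n):                         # linearize |R_idx| <= M
        up = np.zeros(n + 1); up[idx], up[-1] = 1.0, -1.0
        lo = np.zeros(n + 1); lo[idx], lo[-1] = -1.0, -1.0
        A_ub += [up, lo]; b_ub += [0.0, 0.0]
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n + [(0, None)], method="highs")
    return res.x[:n].reshape(K, K), res.x[-1]    # robust mechanism and its worst-case payment M
```

A larger delta buys more robustness but inflates the worst-case payment, mirroring the trade-off discussed around Theorem 3.5.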
Theorem 3.4 (Robustness to distributional misspecification). Let P be the distribution used in designing the margin-δ mechanism r_i in (3), and let P* be the true distribution. Let P(x_j | x_i) be the induced prior or posterior distribution of the reference agent's observation x_j, conditional on the focal agent's observation x_i ∈ Y ∪ {∅}.
Denote the total variation distance by TV(P, P*). If

TV(P(· | x_i), P*(· | x_i)) ≤ δ / (2M)   for every x_i ∈ Y ∪ {∅},        (4)

where M = max |r_i|, then the mechanism r_i produced under P remains feasible for the original problem (1) when evaluated under P*.
Theorem 3.4 establishes the robustness of the margin-δ mechanism under inaccurate distributional knowledge. This suggests it is possible to design a mechanism that guarantees truthfulness for all distributions in an ambiguity set centered at a distribution estimate P: {P′ : P′ satisfies (4)}. However, notice that the objective M itself is influenced by the margin δ we choose and the distribution P used. In fact, when we increase δ, the minimal achievable M also increases, hence the increment in the provided robustness (i.e., δ/M) may diminish. Therefore, one cannot increase δ indefinitely hoping for unlimited robustness. In the following, we first provide an upper bound on the objective of (3), and then provide upper and lower bounds on the amount of robustness that (3) can provide. Theorem 3.5 (Bounds on payments of the robust mechanism). Suppose the belief matrix B is invertible, and there does not exist x ∈ Y such that P(X_j = x) = 1. Let (M*, r_i*) be an optimal solution of (3) with design distribution P and margin δ. Then we have
• Worst-case payment: M* satisfies the bound (5), where γ = max_x P(X_j = x) < 1.
• Expected payment: there exists a solution (M, r_i) that satisfies the bound (5), while ensuring the expected payment of the truthful equilibrium under P is c + δ, the lowest possible.
The essential insight from Theorem 3.5 is that (3) is a linear program. Hence, if we view δ as a perturbation of the constraints, the shift in the objective should also relate linearly to the perturbation. In other words, the sensitivity of both the worst-case and the expected payment to the perturbation δ is O(δ). Rearranging terms in (5) readily gives the following bounds.
Corollary 3.6 (Bounds on achievable robustness). Let the (possibly inaccurate) design distribution be P. Then for any margin parameter δ > 0, there exists a mechanism r_i feasible for (3) whose robustness δ/(2M) admits a lower bound in terms of ‖B⁻¹‖, γ, and |Y|, where M = max |r_i| and γ = max_x P(X_j = x) < 1. Moreover, we have δ/(2M) ≤ 1/2 for all δ and every corresponding feasible mechanism r_i, meaning no mechanism can achieve robustness more than 1/2.

Corollary 3.6 primarily provides the minimum robustness achieved by solving (3). Notice that increasing δ provides more robustness, but comes at increased cost. Also, the marginal robustness from increasing δ diminishes: as δ → ∞, the lower bound is at most (1 − γ) / (‖B⁻¹‖((1 + γ)|Y| + 2)). When δ → 0, the robustness provided scales linearly with δ. In addition, from the lower bound on worst-case payments, we have an upper bound of 1/2 on the maximum robustness that our scheme can provide. This result does not contradict the impossibility result (Theorem 1) shown in [Radanovic and Faltings, 2013], which proves that no mechanism can guarantee truthfulness for all distributions.
Our theorem shows that while an all-around, strictly dominantly truthful mechanism does not exist, it is still possible to design a mechanism that covers distributions relatively close to a design distribution P. For distributions that are too far away, the impossibility result still holds. In other words, our scheme is enough to cover inaccuracies that are not too extreme. For that reason, Corollary 3.6 is already sufficient for our purposes of adaptive mechanism design. If the principal starts with a distribution estimate that is not too far off (such that the total variation distance condition is satisfied within the robustness floor), then it is possible to maintain truthfulness and refine the estimate at the same time. The principal first applies a mechanism that provides abundant robustness for its initial ambiguity set. As more data is obtained and the estimate becomes more accurate, it shrinks the ambiguity set and selects a smaller δ, eventually converging to δ = 0 and the optimal mechanism design.

Theorem 3.7 (Cost of robustness). Suppose we have an ambiguity set of distributions within total variation distance η of the design distribution P, i.e., with design distribution P and ambiguity parameter η. Provided η is below an instance-dependent threshold, there exists a mechanism r_i such that agent i acts truthfully whenever her belief belongs to this ambiguity set.
Moreover, if the actual distribution P* also belongs to this set, then the principal's expected payment for guaranteeing such truthful behavior is at most c + c · C₁η / (1 − C₂η) for instance-dependent constants C₁, C₂, where the second part is the additional cost of robustness.
Theorem 3.7 is essentially a combination of Theorems 3.4 and 3.5. Notice that the additional cost takes the form c · C₁η / (1 − C₂η). This means that when η → 0, the additional cost of robustness is roughly O(η), establishing a linear relation between robustness and additional cost.
Example 3.8 (Distributionally Robust Mechanism in Image Labeling). We follow the same setting as in Examples 3.1 and 3.3. Now we compare the simple mechanism that pays ±1 depending on agreement and the optimal mechanism that pays ±5/32. Although the simple mechanism is suboptimal, it is robust to misspecification of the agents' skills.
For example, suppose that the agents' true observation accuracy is 0.8 instead of 0.9. With the same procedure as in Example 3.1, one can show that the simple mechanism still guarantees truthfulness (truthful (0.26) > lazy (0) > lying (−0.46)). On the other hand, the previously optimal mechanism breaks down (truthful reporting gives 9/160 < c = 0.1, so agents have no incentive to participate). In fact, one can show that as long as both agents' accuracies are the same and stay within the range [0.66, 1], the simple mechanism always guarantees truthfulness. The lower bound (10 + √10)/20 ≈ 0.66 is the point at which the truthful strategy's expected reward falls to 0.1. This property holds even if the agents know the actual skill level while the principal does not. In short, the additional payment in the simple mechanism serves as insurance against ambiguity.
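The robustness range [0.66, 1] quoted above can be verified by sweeping the common accuracy and testing the three strategies under the ±1 agreement mechanism; the short sketch below is illustrative and uses the binary-symmetric setup of Example 3.1.

```python
import numpy as np

c = 0.1
for a in np.linspace(0.5, 1.0, 501):             # common accuracy of both agents
    post_same = a**2 + (1 - a)**2                # P(X2 agrees with X1 | X1), uniform prior
    truthful = (2 * post_same - 1) - c           # +/-1 agreement reward, minus observation cost
    lying = -(2 * post_same - 1) - c
    lazy = 0.0                                   # blind report faces a uniform reference marginal
    if truthful >= max(lying, lazy):
        print(f"simple mechanism stays truthful for accuracy >= {a:.3f}")  # ~0.66 = (10+sqrt(10))/20
        break
```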
In this section, we study the problem of adaptive mechanism design where the principal has no initial knowledge. Define the principal's (empirical cumulative) regret after T rounds as Σ_{t=1}^{T} Σ_{i=1}^{N} (r_it − r*_it), where r*_it is the optimal payment (E[r*_it] = c, as shown by Theorem 3.2). Starting from oblivion, the principal aims to minimize regret while ensuring agents' truthfulness.
We present our algorithm, the "Distributionally Robust Adaptive Mechanism" (DRAM), in Algorithm 1. The algorithm maintains truthfulness and reduces cost by designing a sequence of distributionally robust mechanisms with a shrinking ambiguity parameter η. The ambiguity parameter tracks the accuracy of the principal's estimation at each round.
ALGORITHM 1: Distributionally Robust Adaptive Mechanism (DRAM)
Warm-start phase. For the pre-computed warm-start length, reward agents by fact-checking against the ground truth y_t and collect their reports.
Adaptive phase. For each epoch k = 1, 2, ... (covering rounds t = τ_{k−1} + 1, ..., τ_k):
Estimate the reference distribution p̂_ik(x_j | x_i) for each agent with all past reports.
Let the ambiguity parameter η_k = √( log((N + 1) · 2 τ_k (log T) / β) / (2 p τ_{k−1}) ). For each agent i, set her safety margin δ_ik large enough that the robustness guaranteed by Corollary 3.6 covers η_k. Here B^(k) is the matrix representation of p̂_ik(x_j | x_i) (see Section 3), and γ^(k) = max_x P̂(X_j = x). Compute the mechanism r_ik for each agent by solving Eq. (3) with parameters p̂_ik and δ_ik. Deploy the mechanism r_ik for rounds t = τ_{k−1} + 1, ..., τ_k.
end
In DRAM, the entire horizon is divided into two phases: a warm-start phase and an adaptive phase. As suggested by Theorem 3.7, our scheme fails when the ambiguity level is above a certain threshold η̄. Therefore, the warm-start phase is designed to reduce ambiguity below that threshold. In this phase, the principal uses the ground truth y_t for verification. Then the principal moves to the adaptive phase, which is split into epochs. At the beginning of each epoch, the principal uses the empirical distribution for estimation. As the principal obtains more data, the estimation becomes more accurate. This allows it to design a mechanism with decreasing η and therefore reduce the additional cost of robustness. The ambiguity parameter η shrinks at a proper rate, so as to make sure truthfulness is preserved with high probability in every round.
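A high-level sketch of the DRAM loop just described. The oracle object and its methods (fact_check, estimate_reference, ambiguity, margin_for, robust_mechanism, deploy) are placeholders of our own naming for the estimation and optimization oracles; the exact constants in η_k and δ_ik come from the paper's formulas and are not reproduced here.

```python
def dram(warm_start_rounds, epoch_ends, agents, oracle):
    """Skeleton of DRAM (illustrative; all oracle methods are placeholders).

    epoch_ends : increasing round indices tau_1 < tau_2 < ... (doubling in Algorithm 1).
    oracle     : stands in for the estimation oracle, the LP (3) solver, and the environment.
    """
    reports = []
    # Warm-start phase: ground-truth fact checking drives ambiguity below the threshold.
    for t in range(warm_start_rounds):
        reports.append(oracle.fact_check(t))

    prev_end = warm_start_rounds
    for k, tau_k in enumerate(epoch_ends, start=1):
        # Estimation oracle: reference distribution p_hat(x_j | x_i) from all past reports.
        p_hat = {i: oracle.estimate_reference(reports, i) for i in agents}
        eta_k = oracle.ambiguity(k, n_samples=prev_end)          # shrinks as data accumulates
        mechanisms = {}
        for i in agents:
            delta_ik = oracle.margin_for(p_hat[i], eta_k)        # margin chosen via Corollary 3.6
            mechanisms[i] = oracle.robust_mechanism(p_hat[i], delta_ik)   # solve LP (3)
        # Optimization happens once per epoch; the same mechanisms are reused all epoch long.
        reports.extend(oracle.deploy(mechanisms, rounds=range(prev_end, tau_k)))
        prev_end = tau_k
```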
During the entire algorithm, the principal uses agents' past reports to estimate P*(x_j | x_i), the posterior distribution of the reference agent's observation x_j conditional on the focal agent's observation x_i. Note that the principal only has the agents' reports Z but not the true observations X. This means truthfulness must be guaranteed at all times for estimation fidelity, which provides further evidence of the necessity of truthfulness, in addition to Proposition 2.1.
Inputs. Of all the input variables, the failure tolerance can be chosen arbitrarily, and the rest depend on the agents. The ambiguity threshold for all players is defined as η̄ = min_i η̄_i, where the agent-specific threshold η̄_i is computed according to Theorem 3.7.
Warm-start phase. The main objective of the warm-start phase is to reduce the principal's ambiguity below the threshold suggested by Theorem 3.7, so that distributionally robust mechanisms can be applied. There are multiple approaches to reducing ambiguity; here the principal learns P*(x_j | x_i) by collecting truthful reports from agents. We incentivize truthfulness in this phase by ground-truth verification. Suppose the principal can obtain the ground truth y_t from an external expert. With the ground truth available, the principal can compare reports with it and then reward according to a fact-checking mechanism r_it(Z_it, y_t). This phase lasts O(log log T) tasks, so the cost is controlled even when ground truth is expensive.

Lemma 4.1 (Fact checking under diagonal dominance). Recall the assumption that each true label y appears with uniformly bounded probability p_min ≤ P_Y(y) ≤ p_max. If for all y ∈ Y and x ≠ y, agent i's skill P_i satisfies the diagonal dominance property, then the simple fact-checking rule r_it(Z_it, y_t) = 1{Z_it = y_t} guarantees agent i's truthfulness.
The diagonal dominance condition essentially assumes that agents are more likely to obtain the correct observation than to make a mistake. Therefore, any lying behavior decreases the probability of the report being correct; thus, the agent is incentivized to tell the truth. We note that it is generally impossible to design a fact-checking rule that guarantees truthfulness for arbitrary distributions P_Y and P_i. It is shown in [Lambert, 2011] that if an agent's observation has overlaps, i.e., the agent can have the same observation under two different labels, then there always exists an adversarial prior under which a fact-checking mechanism fails.
Adaptive phase. After the ambiguity is lower than the threshold, the principal moves on to the adaptive phase. This phase divides the entire time horizon into epochs, with each epoch double the size of the previous one. In total, we have O(log T) epochs.
At the beginning of each epoch, the principal calls two oracles: i) an offline estimation oracle for the reference distribution P*(x_j | x_i) (in Algorithm 1 it is the empirical distribution estimator), and ii) an optimization oracle that computes the distributionally robust mechanisms by solving Eq. (3). Then the principal uses the same produced mechanism throughout the entire epoch, and no further computation is needed. This indicates that DRAM is computationally efficient, with O(N log T) total calls to the two oracles.
At the same time, DRAM is also statistically efficient, as the following regret guarantee shows.
Theorem 4.2 (Regret upper bound of DRAM). Consider the sequential mechanism design problem with N agents and T rounds. With probability at least 1 − β, Algorithm 1 simultaneously achieves:
• truthfulness is guaranteed for all N agents in all T rounds;
• the expected total regret of the algorithm is at most of order N √(T log(N log T / β)).
Theorem 4.2 recovers the O(√T) terms typically seen in the bandits and online learning literature [Lattimore and Szepesvári, 2020]. In DRAM, we use the classical doubling trick [Cesa-Bianchi and Lugosi, 2006] from the online learning literature. This trick does not require exact knowledge of the number of tasks, although we do need to know the magnitude of log T to compute the epoch-wise ambiguity parameter η_k. In fact, the number of oracle calls can be further reduced to O(N log log T) when T is known: a geometric epoch schedule (similar to [Cesa-Bianchi et al., 2014]) maintains the same regret guarantee while requiring even fewer oracle calls, O(log log T).

Corollary 4.3. Suppose the horizon T is known. With probability at least 1 − β, the principal simultaneously guarantees truthfulness across all rounds for all agents, and the expected total regret maintains the same upper bound of order N √(T log(N log T / β)) with only O(log log T) epochs.
We now show that DRAM is statistically optimal up to logarithmic factors. In particular, we prove a matching lower bound demonstrating that any policy which guarantees truthfulness with high probability must incur regret of order at least Ω(N√T).
Theorem 4.4. Consider the sequential mechanism design problem with N agents and T rounds. Fix any failure tolerance β ∈ (0, 1/4). For any (possibly randomized) non-anticipating reward policy that guarantees truthfulness across all agents and rounds with probability at least 1 − β, there exists a type distribution P_X ∈ Δ(Y^N) under which, with probability at least 1 − β, the total regret is at least of order N√T.
The proof proceeds by constructing a pair of statistically indistinguishable problem instances whose corresponding cost-optimal truthful mechanisms are incompatible. Specifically, any reward mechanism that is both truthful and near-optimal under one instance must either violate truthfulness or incur strictly larger payments under the other. This incompatibility allows us to reduce adaptive mechanism design to a hypothesis testing problem, and we invoke Le Cam’s two-point method to derive the lower bound.
The result, together with its proof (see Appendix A), reveals that the regret bottleneck of adaptive mechanism design is the difficulty of learning players' conditional beliefs, namely the posterior distributions P*(x_j | x_i) that govern incentives. Because the lower bound is derived via a two-point argument, it does not explicitly depend on the alphabet size K = |Y|. However, since estimating a discrete distribution over Y incurs a minimax risk of order √(K/T) [Han et al., 2015], we conjecture that the regret bound achieved by DRAM is also optimal in its dependence on K, up to logarithmic factors.
An important observation about DRAM is that the estimation oracle and the optimization oracle are decoupled. They are connected via the ambiguity parameter η_k, which measures the distance between the estimated distribution and the actual one. This means that DRAM is flexible with respect to estimators, as long as the estimate satisfies the requirement in Eq. (4). Therefore, the empirical estimator can be swapped with any other distribution estimator that may better exploit and reflect the underlying structure of the agents' skills. Based on this, we propose the algorithm DRAM+, which works with general distribution estimators. Definition 4.5 (General Discrete Distribution Estimator). Let q be a discrete distribution on the space Y. Given t samples independently and identically drawn from q, the general distribution estimator provides an estimate q̂_t such that, with probability at least 1 − β,

TV(q, q̂_t) ≤ ε_β(t).        (11)
The estimation guarantee ε_β(t) should monotonically decrease with t, and increase as the failure probability β decreases. Such a bound is commonly seen in the probably approximately correct (PAC) framework of statistical learning, where a good estimator achieves a smaller gap with more samples and higher confidence.
Now we introduce DRAM+ (Algorithm 2), which modifies DRAM to work with the general discrete distribution estimators of Definition 4.5. In DRAM+, we do not restrict how the epoch schedule is designed. Generally, one should aim for a geometric epoch schedule, as this typically results in the best possible bounds and only O(log T) epochs. Moreover, the ambiguity parameters now follow the guarantee ε_β(t), in order to ensure truthfulness holds with high probability.
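DRAM+ only needs an estimator exposing the PAC-style guarantee of Definition 4.5: an estimate q̂_t plus a function ε_β(t) bounding the total variation error. A minimal sketch of such an interface, with the empirical estimator as one instance, is given below; the Hoeffding-style constant in eps is a generic placeholder, not the paper's exact bound.

```python
import math
from collections import Counter
from typing import Dict, Protocol, Sequence

class DiscreteEstimator(Protocol):
    """Interface required by DRAM+ (Definition 4.5): an estimate and a TV error bound."""
    def fit(self, samples: Sequence[int], support_size: int) -> Dict[int, float]: ...
    def eps(self, t: int, beta: float, support_size: int) -> float: ...

class EmpiricalEstimator:
    """Empirical frequencies with a generic TV concentration bound (placeholder constants)."""
    def fit(self, samples, support_size):
        counts = Counter(samples)
        t = max(len(samples), 1)
        return {x: counts.get(x, 0) / t for x in range(support_size)}

    def eps(self, t, beta, support_size):
        # With prob. >= 1 - beta, TV(q, q_hat_t) <= sqrt((K log 2 + log(1/beta)) / (2 t)).
        return math.sqrt((support_size * math.log(2) + math.log(1 / beta)) / (2 * max(t, 1)))
```

Any object satisfying this interface can replace the empirical estimator in the epoch loop; only eps feeds into the ambiguity parameter η_k.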
ALGORITHM 2: Distributionally Robust Adaptive Mechanism+ (DRAM+)
Input: ambiguity threshold η̄; failure tolerance β; lower bound on observation frequency 0 < p < min_{i, x ∈ Y} P(X_i = x); distribution estimator E.
Compute the warm-start phase length as the smallest n such that ε_{β/(NK(K+1))}(pn/2) < η̄. For each agent i, assign a corresponding reference agent j.
Warm-start phase. Follow the same procedure as in Algorithm 1.
Adaptive phase. For each epoch k = 1, 2, ...:
Estimate the reference distribution with the general distribution estimator E for each x_i ∈ Y.
Let the ambiguity parameter η_k = ε_{β/(NK(K+1))}(p τ_{k−1}/2).
Compute the safety margin δ_ik and deploy the mechanism r_ik in the same way as in Algorithm 1.
end

Theorem 4.6 (Regret upper bound of DRAM+). Consider the sequential mechanism design problem with N agents and T rounds. With probability at least 1 − β − N(K + 1) · Σ_k exp(−p τ_{k−1}/8), Algorithm 2 simultaneously achieves:
• truthfulness is guaranteed for all N agents in all T rounds;
• the expected total regret of the algorithm is at most of order N · Σ_k (τ_k − τ_{k−1}) η_k, i.e., the per-round cost of robustness O(η_k) accumulated over the epoch schedule.
Compared to Theorem 4.2, an additional overhead N(K + 1) · Σ_k exp(−p τ_{k−1}/8) appears in the high-probability guarantee. This is due to the failure event in which a player does not observe a certain label x_i enough times, leaving insufficient data to recover P*(· | x_i). This failure event is universal, and without a closed-form description of the estimation gap, we cannot merge this probability into β. Nonetheless, the term decays exponentially and, with an appropriately chosen schedule (such as τ_k − τ_{k−1} = 2^{k−1} n), it is dominated by β.
The central interpretation of DRAM+ and Theorem 4.6 is that any estimation guarantees for discrete distribution can be immediately translated to mechanism regret guarantees. This suggests that a principled reduction from online mechanism design to offline learning may indeed be possible, similar to the reduction from contextual bandits to offline estimation in [Simchi-Levi and Xu, 2022].
Compared to the classical multi-armed bandit, the adaptive mechanism design problem is, in some sense, both simpler and harder. On the one hand, the core challenge in bandit problems lies in the exploration-exploitation trade-off, since the arm-sampling policy in previous rounds affects observed data distributions. From that perspective, the mechanism design problem is simpler than bandits, since the underlying distributions remain unaffected by the principal’s mechanism decision as long as agents behave truthfully. On the other hand, when participants are rational, incentivizing truthfulness in an optimal way is nontrivial. Agents’ incentives and skills are unknown, and any deviation can cause unpredictable dynamics. In contrast, there are no incentives involved in bandits, and each arm always gives truthful feedback.
We collect some interesting observations from Algorithms 1 and 2 and their corresponding guarantees (Theorems 4.2 and 4.6). These observations further demonstrate the generality of our results.
Robustness to fluctuation/non-stationarity of agent performance. We assume each agent's skill (i.e., the law P_i(x_i | y)) is consistent throughout the sequential tasks, an assumption not necessarily true in practice. Agents may under- or over-perform in certain rounds compared to others, resulting in skill fluctuation and non-stationarity. In DRAM, we apply a distributionally robust mechanism in each round. This robustness holds not only against estimation inaccuracy, but also against inaccuracy from other sources. This means that as long as the actual reference distribution stays within the ambiguity set defined by p̂_ik and η_k, agents are still incentivized to stay truthful. Indeed, the principal could even widen or narrow the ambiguity set by adjusting η_k, trading off robustness against cost.
Robustness to adversaries. For the same reason, the distributionally robust mechanism also provides robustness to adversarial behavior from agents. When an agent intentionally lies in a small portion of rounds, it only slightly biases the estimation. As long as the bias does not exceed the ambiguity margin designed for each epoch, the mechanism does not break down. In addition, the assignment procedure for reference agents may provide additional defense. An adversary would disrupt at most 2T out of the NT agent-task interactions in total (acting as one focal agent and one reference agent), possibly spread across different agents and tasks.
Flexibility of the reference agent assignment procedure. In DRAM, each agent is assigned one corresponding reference agent j, to which her reports are compared. Any assignment procedure (deterministic or randomized) can be used for this process, and some provide robustness against possible adversarial agents. As an example, suppose we use cyclic matching, where agent i + 1 is assigned to agent i as reference for i < N, and agent 1 is assigned to agent N. Under cyclic matching, any adversary disrupts at most two agents, and the majority is unaffected. Furthermore, at the beginning of each epoch, we can rerun the procedure and assign new reference agents. Such replacement generates little extra computational cost, since the principal needs to update the estimation and regenerate the mechanism anyway.
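Cyclic matching and its per-epoch re-randomization are a few lines of code; the helper names below are ours.

```python
import random

def cyclic_matching(n_agents: int) -> dict:
    """Agent i+1 serves as the reference of agent i; agent 1 serves as the reference of agent N."""
    return {i: (i % n_agents) + 1 for i in range(1, n_agents + 1)}

def rerandomized_matching(n_agents: int, rng: random.Random) -> dict:
    """Relabel agents uniformly at random before applying the cyclic rule (refresh each epoch)."""
    perm = list(range(1, n_agents + 1))
    rng.shuffle(perm)
    cyc = cyclic_matching(n_agents)
    return {perm[i - 1]: perm[cyc[i] - 1] for i in range(1, n_agents + 1)}
```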
Compatibility with delayed/batched feedback. In practical settings, feedback to the principal might not be immediately available and may come in batches [Chapelle and Li, 2011, McMahan et al., 2013]. The delayed/batched feedback setting has been studied in multiple online learning and decision-making problems [Gao et al., 2019, Joulani et al., 2013]. In DRAM, the mechanisms are computed at the beginning of each epoch and stay the same throughout. This means DRAM naturally handles delayed and batched feedback, since report data is only required for computation at the beginning of each epoch. In particular, Corollary 4.3 suggests that O(log log T) epochs are already sufficient for the O(√T) bound up to logarithmic terms. Nevertheless, such small epoch counts rely on a carefully designed epoch schedule. For example, DRAM uses a geometric epoch schedule, under which it adapts quickly early on and slows down when sufficient data has been gathered. Deviation from the O(√T) bound may appear when the principal faces a different epoch schedule constraint [Perchet et al., 2016].
In this section, we perform numerical experiments to verify and demonstrate our proposed algorithm.
Environment. We consider a sequential labeling game (as in Figure 1 and Example 3.1) with N = 3 agents and K = 3 labels with a uniform prior P_Y(y) = 1/3. Each agent i has a diagonally dominant skill distribution P_i(· | y) that is symmetric across labels: P_i(x | y) = α_i if x = y, and (1 − α_i)/(K − 1) otherwise.
Thus for K = 3, α_0 = 0.68, α_1 = 0.70, and α_2 = 0.72, with the remaining probability mass spread uniformly over the K − 1 incorrect labels. We use horizon T = 10^6 and observation cost c = 0.3. During warm-start, the principal acquires the ground-truth label y_t from an external expert at cost C_lab = 3.0 per round. We run 1000 independent episodes.
Algorithm setup. We implement the exact DRAM algorithm (Algorithm 1) in this simulation. To match the theoretical parameterization, we compute the true belief matrices and the agent-wise robustness thresholds η̄_i from Theorem 3.7, then set η̄_true = min_i η̄_i, η̄_used = min(0.9 η̄_true, 1/√2), and β_used = 0.99 β.
Given β = 10⁻³, we plug (η̄_used, β_used) into the warm-start length formula in Algorithm 1 to obtain n. For this setting, n is on the order of 10⁵, so the warm-start phase occupies only a moderate fraction of the horizon. In the warm-start phase, we use the simple fact-checking mechanism: an agent is rewarded 1 if her report agrees with the ground-truth label, and 0 otherwise.
Truthfulness checks. We verify truthfulness via a retrospective approach. We set all participating agents to be always truthful. At the beginning of every epoch, we perform a truthfulness check using the true joint distribution P*(X_i, X_j).
Fig. 2. Minimum reward gap between truthful reporting and other pure strategies across 1000 runs of the sequential labeling game. A negative gap means the IC constraints are violated. In this simulation, the minimum-gap distribution is well separated from 0, meaning that truthful reporting dominates other strategies by a considerable margin and DRAM guarantees truthfulness with robustness to spare.
We compute the truthful expected utility U^truth_j = E_{(X_j,X_k)∼P*}[r_jk(X_j, X_k)] − c
and compare it against two families of deviations:
• Lazy strategies: the best constant report z ∈ Y, made without observing and hence without paying the observation cost, with utility U^lazy_j(z) = E_{X_k∼P*}[r_jk(z, X_k)].
• Misreporting strategies: all deterministic mappings from observation to report, σ : Y → Y, excluding the identity, with utility U^mis_j(σ) = E_{(X_j,X_k)∼P*}[r_jk(σ(X_j), X_k)] − c. We define the IC gap for agent j in epoch m as Gap_{j,m} = U^truth_j − max{ max_{z∈Y} U^lazy_j(z), max_{σ≠id} U^mis_j(σ) },
and, for each episode, record the minimum Gap_{j,m} across all agents and epochs. If any Gap_{j,m} is negative, we count the episode as an IC violation; a sketch of this check is given below. Figure 2 shows the histogram of per-episode minimum IC gaps.
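A sketch of the check (hypothetical names: R[z_j, x_k] is agent j's reward matrix in the current epoch, P_joint the true joint distribution P*(X_j, X_k)):

import itertools
import numpy as np

def ic_gap(R, P_joint, cost):
    """Truthful utility minus the best lazy or deterministic misreporting utility."""
    K = R.shape[0]
    u_truth = float(np.sum(P_joint * R)) - cost          # report z_j = x_j
    P_k = P_joint.sum(axis=0)                            # marginal of the reference report
    u_lazy = max(float(P_k @ R[z]) for z in range(K))    # constant report, no observation cost
    u_mis = -np.inf
    for sigma in itertools.product(range(K), repeat=K):  # all deterministic maps Y -> Y
        if sigma == tuple(range(K)):
            continue                                     # skip the identity (truthful) map
        u = sum(P_joint[x_j, x_k] * R[sigma[x_j], x_k]
                for x_j in range(K) for x_k in range(K)) - cost
        u_mis = max(u_mis, u)
    return u_truth - max(u_lazy, u_mis)                  # negative => IC violation

Recording the minimum of ic_gap(...) over agents and epochs in each episode reproduces the statistic plotted in Figure 2.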
Regret checks. We collect the cumulative regret at each round within each episode of the game. In addition to the regret notion defined in Section 4, we include the warm-start verification cost in the regret formula. Specifically, we plot the cumulative regret up to time t:
and report the mean and standard deviation across 1000 episodes. Figure 3 shows the resulting regret trajectory.
Results. DRAM passes the truthfulness checks: across the 1000 episodes, we observe no truthfulness violations. The global minimum gap is approximately 0.0743 > 0, and the distribution of per-episode minimum gaps is well separated from zero. This indicates that, in a setting that exactly matches our assumptions and with theoretically chosen parameters (η, δ), DRAM indeed implements a truthful mechanism in practice. DRAM also consistently achieves the O(√T) regret, as shown in Figure 3. In this simulation we have 5 epochs in total; within each epoch the mechanism stays unchanged, so the cumulative regret curve is piecewise linear. In summary, this experiment demonstrates the efficiency and robustness of the vanilla DRAM algorithm and validates the correctness of Theorem 4.2. In fact, the existence of an extra IC gap suggests that further refinements are possible, as the current parameters are chosen for the theoretical proofs rather than optimized for practical implementation.
In this paper, we designed an adaptive mechanism for the sequential mechanism design problem. The studied problem assumes rational feedback, in contrast to the prediction-with-expert-advice problem from online learning, and relaxes the common knowledge assumption of the peer prediction problem from mechanism design. Drawing insights from both fields, our proposed mechanism ensures truthful behavior with high probability while achieving near-optimal payment regret. It also remains robust and adaptable in changing environments.
Looking forward, our work motivates interesting questions. In Section 3, the mechanism design problem is formulated as a linear optimization problem, with truthfulness encoded as constraints.
A key idea of our algorithm is to solve a distributionally robust variant of this problem while gradually learning the relevant constraints over time. This principle seems to be broadly applicable: since many decision-making problems can be cast as optimization tasks, the same approach might extend naturally to online, adaptive, or sequential variants of other real-world decision-making problems beyond mechanism design.
First, we prove that such a strategy cannot have overlapped labels, meaning there does not exist z ∈ Y such that both σ(z | x_1) and σ(z | x_2) are greater than 0 for some x_1 ≠ x_2. Because ๐ ๐ can be arbitrary, any strategy σ′ is a garbling of σ. Consider the truthful strategy σ′(z | x) = 1{z = x}; we would have
Since Σ_{z∈Y} σ(z | x) = 1, for each z where σ(z | x) > 0 we must have Λ(x | z) = 1, otherwise the weighted average would fall short of 1. Now, assume we had such an overlapped label z with corresponding x_1 and x_2; it would mean that Λ(x_1 | z) = Λ(x_2 | z) = 1. But Λ(· | z) is a probability distribution, so it cannot assign probability 1 to two different outcomes simultaneously, forming a contradiction; hence overlapped labels cannot exist.
Since σ cannot have overlapped labels, by counting we know that each observation must be mapped to one and only one label, i.e., σ is a permutation strategy.
Suboptimality of laziness. Finally, we show that the lazy option (reporting directly according to a prior belief q, without observing) is strictly worse than observing and playing a permutation strategy, regardless of the q used. The lazy option induces the information structure Q_L(z | y) = q(z), while a permutation strategy π induces Q_π(z | y) = p_i(π^{−1}(z) | y). However, Q_L is not a garbling of Q_π, since the row-stochastic matrix corresponding to Q_L has rank 1 while, by non-degeneracy, that of Q_π has rank at least 2. Therefore Lemma A.1 implies that the lazy option is strictly dominated by observation with a permutation strategy. □
A.2 Proof of Theorem 3.2
Feasibility. Suppose B is invertible. Then for an arbitrary matrix M there exists R = (B^{−1}M)^⊤ such that BR^⊤ = M. Notice that the entry M_{xy} is exactly agent 1's expected reward given she observes label x and reports label y. Hence, for our purposes, we can construct an M whose diagonal entries are at least c and whose off-diagonal entries are at most c; the corresponding R then satisfies the first two constraints. (In fact, if B is invertible, the first two constraints of (2) can be satisfied for arbitrary right-hand-side values.) Now consider the third constraint Rd ≤ 0. Notice that d^⊤ = Σ_x P(X_2 = x) · B_{x:}. Therefore, letting BR^⊤ = M, we have (Rd)_y = Σ_x P(X_2 = x) M_{xy} for every y.
Therefore, to satisfy all three constraints, we need to find a matrix M such that the linear combination of its rows with coefficients {P(X_2 = x)}_{x∈Y} yields a vector with non-positive entries. Let γ = max_x P(X_2 = x); we know that γ < 1. Then, letting all diagonal values of M equal c and all off-diagonal values equal −cγ/(1 − γ), we have, for all x′ ∈ Y, Σ_x P(X_2 = x) M_{x x′} = P(X_2 = x′) · c − (1 − P(X_2 = x′)) · cγ/(1 − γ) ≤ cγ − cγ = 0.
Hence such a matrix M exists, and the corresponding R is a feasible solution.
Optimality. Notice that the objective is essentially Σ_x P(X_2 = x) M_{xx}. Since we constructed M with all diagonal values equal to c, the objective value is c and the first constraint is binding. A smaller objective is not possible, as it would require M_{xx} < c for some x, violating the first constraint. □
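A small numerical sketch of this construction (the matrix B and the marginal below are illustrative, not from the paper):

import numpy as np

c = 0.3
B = np.array([[0.70, 0.20, 0.10],        # an invertible row-stochastic B
              [0.15, 0.70, 0.15],
              [0.10, 0.20, 0.70]])
p2 = np.array([0.3, 0.4, 0.3])           # P(X_2 = x): the coefficient distribution (illustrative)
gamma = p2.max()

K = B.shape[0]
M = np.full((K, K), -c * gamma / (1.0 - gamma))   # off-diagonal entries
np.fill_diagonal(M, c)                            # diagonal entries equal to c

R = (np.linalg.inv(B) @ M).T             # so that B @ R.T == M
d = p2 @ B                               # d^T = sum_x P(X_2 = x) B_{x:}

assert np.allclose(B @ R.T, M)
assert np.all(np.diag(M) >= c - 1e-12)              # truthful reporting pays at least c
assert np.all(M[~np.eye(K, dtype=bool)] <= c)       # misreporting pays at most c
assert np.all(R @ d <= 1e-12)                       # lazy constraint R d <= 0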
A.3 Proof of Theorem 3.4
It is more convenient to use the matrix notation (see (2)). The condition essentially says that ‖B − B*‖_∞ ≤ δ/ρ and ‖d − d*‖₁ ≤ δ/ρ, where ρ bounds the magnitude of the rewards and ‖·‖_∞ is the matrix norm induced by the vector ℓ_∞-norm (it is essentially the maximum absolute row sum of the matrix). Therefore, for every pair (x, y) we have |((B − B*)R^⊤)_{xy}| ≤ ‖(B − B*)_{x:}‖₁ · ‖R_{y:}‖_∞ ≤ δ; here it is crucial to notice that ‖(B − B*)_{x:}‖₁ ≤ ‖B − B*‖_∞ for every row x. Similarly, we can show that ‖R(d − d*)‖_∞ ≤ δ. Therefore, the constraints in (3) shift by at most δ, which means the δ-margin mechanism satisfies (1). □
A.4 Proof of Theorem 3.5
Worst-case payment. We still use the matrix formulation of the problem (see (2)). Under this notation, the problem becomes a linear program, which we call LP(p, c, δ), since it is a linear programming problem with distribution p, cost c, and margin δ. Notice that if (ρ, R) is a feasible solution to LP(p, c, 0) and (ρ′, R′) is a feasible solution to LP(p, 0, 1), then (ρ + δρ′, R + δR′) is a feasible solution to LP(p, c, δ). Therefore, we can construct upper bounds on LP(p, c, δ) by constructing upper bounds on LP(p, c, 0) and LP(p, 0, 1) separately. We apply the same strategy as in the proof of Theorem 3.2, namely, to consider the intermediate solution M = BR^⊤, from which the mechanism is recovered as R = (B^{−1}M)^⊤. With this reformulation (see Section A.2 for details), the problem can be restated in terms of M.
The lower bound is apparent: if the maximal absolute payment dropped below c + δ, the constraint M_{xx} ≥ c + δ would be violated.
Upper bound on LP(p, c, 0). Following the same construction as in Appendix A.2, we let M have all diagonal values equal to c and all off-diagonal values equal to −cγ/(1 − γ), where γ = max_x P(X_2 = x). Then M satisfies the three matrix constraints. With this M, the norm ‖M‖₂ is easy to compute, since M is a linear combination of the identity matrix I and the all-ones matrix J, whose eigenvalues are known. In the end we can take ρ ≤ ‖B^{−1}‖₂ · c(γ|Y| + 1)/(1 − γ).
Upper bound on LP(p, 0, 1). Similarly, construct M′ with diagonal entries 1 and off-diagonal entries −(1 + γ)/(1 − γ). This M′ satisfies all three matrix constraints, and a similar argument gives us
Combining the two upper bounds, (ρ + δρ′, M + δM′) is a feasible solution, and we obtain the stated upper bound on LP(p, c, δ).
To ensure agent i stays truthful, from Theorem 3.4 we need to find a margin δ such that δ/(2ρ*) ≥ ε. We consider the best case δ/(2ρ*) = ε; combining this with Theorem 3.5 gives the required margin, and under this margin we have a robust mechanism that guarantees truthfulness. Note that the second part of Theorem 3.5 says that there exists a mechanism attaining the above bound while keeping the expected payment at the truthful equilibrium under p equal to c + δ. So if the actual distribution P* lies in the required ambiguity set, this expected payment shifts by at most an additional δ, making the final expected payment at most c + 2δ. Combining this with the upper bound gives the final result.
Therefore, we first focus on accurately estimating this distribution from the agents' reports. Throughout this part, agents' reports are assumed to be truthful, i.e., z_j = x_j, so intuitively the principal should faithfully recover P* given enough data.
We begin with a lemma giving a concentration bound for the empirical estimator of a discrete distribution. Let P be a discrete distribution on a sample space Y, from which we obtain t i.i.d. samples. Let q_t be the empirical distribution, where q_t(y) = t_y/t and t_y is the number of times label y appears among the t samples. We also define K = |Y|.
Lemma A.2 (Concentration inequality of the empirical distribution [Weissman et al., 2003]). For all ε > 0, we have
P(‖P − q_t‖₁ ≥ ε) ≤ (2^K − 2) exp(−t φ(π_P) ε²/4),
where φ(x) = log((1 − x)/x)/(1 − 2x) with φ(1/2) = 2, and π_P = max_{A⊆Y} min(P(A), 1 − P(A)).
Lemma A.2 gives a concentration inequality for the empirical distribution. We now apply it to derive a concentration bound for estimating a conditional distribution with the empirical conditional estimator p̂_t(x_j | x_k) = t_{x_j|x_k}/t_{x_k}, which is what we end up using in Algorithm 1.
Lemma A.3 (Concentration property of the empirical conditional distribution). Suppose the principal has received t rounds of reports from agents j and k, and assume the agents are always truthful. Let p̂(x_j | x_k) be the empirical conditional distribution defined in Algorithm 1. Define the ambiguity set with ambiguity level ε as
where μ = min_{x∈Y} P(X_k = x). Then, with probability at least 1 − δ, the true distribution P* belongs to this ambiguity set.
We note that the required t grows on the order of O(log(1/δ)/ε²) for arbitrary δ and small ε. Even when ε is large, there is still a threshold t > O(log(1/δ)) that must be met. This is because there are two ways for the event P* ∈ P_ε to fail: the first is that the estimate of some conditional distribution is ε-away from the true one, and the second is that some symbol x_k never appears in agent k's reports. To ensure the second case happens with probability at most δ, we need t to be large enough; a numerical sketch is given below.
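A numerical sketch of this remark (using the bound of Lemma A.2 with the crude simplification φ(π_P) ≥ 2, and a union-bound argument for symbol coverage; function names are hypothetical):

import math

def samples_for_l1_accuracy(K, eps, delta):
    """Smallest t with (2**K - 2) * exp(-t * eps**2 / 2) <= delta."""
    return math.ceil((2.0 / eps**2) * math.log((2**K - 2) / delta))

def samples_for_coverage(K, mu, delta):
    """Smallest t with K * (1 - mu)**t <= delta, where mu = min_x P(X_k = x)."""
    return math.ceil(math.log(K / delta) / (-math.log(1.0 - mu)))

# Example: K = 3 labels, accuracy eps = 0.1, mu = 0.2, confidence delta = 1e-3.
t_needed = max(samples_for_l1_accuracy(3, 0.1, 1e-3),
               samples_for_coverage(3, 0.2, 1e-3))

The first requirement scales as log(1/δ)/ε², the second only as log(1/δ), matching the two failure modes above.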
Proof. We first consider one label x_k ∈ Y. Within t rounds, the number of times x_k appears in agent k's reports follows a binomial distribution. Let μ = min_{x∈Y} P(X_k = x).
We then have
Applying a union bound across all K + 1 symbols would give us
Inverting this inequality shows that, once t exceeds the resulting threshold, the event in the lemma holds with probability at least 1 − δ.
Notice that we have
where the first inequality holds since −log(1 − x) ≥ x, and the second holds since 1 − exp(−x) ≥ x/(1 + x). Thus a sufficient bound is
□
Warm-starting. The ambiguity threshold η is the smallest value, across agents, of the maximum ambiguity that a distributionally robust mechanism can tolerate. For mathematical convenience we further cap the parameter so that η < 1/√2, which yields a cleaner bound in the subsequent derivations. The fact-checking mechanism ensures truthfulness because of Lemma 4.1. The length of this phase is O(log(N log T)), which results in a lower-order total regret even though each warm-start round incurs constant regret.
Bounding the regret. From Theorem 3.7, we know that the mechanism r_jk obtained by solving the distributionally robust program guarantees agent j's truthfulness when P*_j ∈ P_{ε_m}(p̂_jk). Also, during epoch m, the expected regret of a single round is c · C₁ε_m/(1 − C₂ε_m), where C₁ and C₂ are the constants defined in Eq. (7). Therefore, the total expected regret for agent j across all T rounds is
In the last inequality we used the formula for the sum of a geometric sequence. Finally, for one agent in one epoch, the scheme ensures truthfulness with probability at least 1 − δ/(N log T); hence, applying a union bound across all N agents and at most log T epochs, truthfulness holds with probability at least 1 − δ. Also, note that the regret in the warm-start phase is at most proportional to its length and is therefore dominated by the regret from the adaptive phase. Thus the total regret is N times the single agent's regret, and is therefore O(N√(T log(N(log T)/δ))).
A.8 Proof of Corollary 4.3
The updated upper bound follows under the epoch schedule T_m − T_{m−1} = T^{1−2^{−(m−1)}} τ. All steps are identical to the proof of Theorem 4.2, except for the final step, where we sum the regret across epochs:
We note that this epoch schedule is sub-geometric, yet it grows faster in its first few steps than the geometric schedule T_m − T_{m−1} = 2^{m−1} τ; it therefore needs only O(log log T) epochs to reach the horizon T, logarithmically fewer than the geometric schedule. A sketch comparing the two schedules is given below.
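A quick sketch comparing the two schedules (τ = 10 and T = 10^6 here are illustrative choices):

def count_epochs(T, tau, increment):
    """Count epochs, where increment(m) is the length of epoch m (1-indexed)."""
    t, m = 0, 0
    while t < T:
        m += 1
        t += increment(m)
    return m

T, tau = 10**6, 10.0
geometric     = count_epochs(T, tau, lambda m: (2 ** (m - 1)) * tau)               # T_m - T_{m-1} = 2^{m-1} tau
sub_geometric = count_epochs(T, tau, lambda m: (T ** (1 - 2.0 ** (-(m - 1)))) * tau)
print(geometric, sub_geometric)          # 17 vs. 4 epochs for these numbers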
(Here the 1 + 2δ is a relaxation of the bound that the average of the two quantities is below 1 + δ.) Notice that M_{xy} is exactly agent 1's expected reward given she observes label x and reports label y. Therefore, for any M_0 that is cheap and satisfies the constraints under P_0, its performance under P_1 is M_1 = B_1 B_0^{−1} M_0. However, each entry of M_1 is a mixture of the form ((1 − 2δ)·m + 4δ·m′)/(1 + 2δ) of two entries m, m′ taken from the same column of M_0.
Therefore, the first entry of M_1 falls below the threshold required by the first constraint, leading to a violation of the truthfulness constraint. A similar procedure proves the statement with P_0 and P_1 reversed: for any cheap and feasible mechanism under P_0, the resulting first entry of the corresponding M_0 is bounded in the same way, again violating the truthfulness constraint.
Hence we prove the competition property in both directions.
Similarity. The first inequality holds since log(1 + x) ≤ x, and the second holds since δ ∈ (0, 1/4). □
Now we consider the mechanism design problem. Suppose we have a policy π that maps the i.i.d. collected data H_t to a reward mechanism r_{t+1}. Consider the good events G_i, i ∈ {0, 1}:
G_i = {r_{t+1} satisfies the IC constraints under P_i and pays less than 1 + δ in expectation}.
Notice that G_0 and G_1 are disjoint, since we have the competition property from Lemma A.4. Given such a policy, we can construct a test ψ that distinguishes P_0 from P_1: the test outputs 0 if r_{t+1} ∈ G_0, outputs 1 if r_{t+1} ∈ G_1, and outputs arbitrarily if neither holds. By the Bretagnolle–Huber inequality, the minimax error of any test between P_0 and P_1 is lower bounded as
inf_ψ ( P_0(ψ(H_t) = 1) + P_1(ψ(H_t) = 0) ) ≥ (1/2) exp(−KL(P_0^{⊗t} ‖ P_1^{⊗t})) ≥ (1/2) exp(−8tδ²),
where the second inequality is due to the tensorization property of the divergence and the similarity property from Lemma A.4. Therefore, our specific test ψ must also obey this bound, and thus any policy π must fail to ensure both the IC constraints and cheapness with probability at least exp(−8tδ²)/4 under one of the hard instances. In other words, for any policy that guarantees truthfulness and cheapness with worst-case probability at least 1 − δ_0 (writing δ_0 for the target failure probability), the cheapness threshold must exceed the corresponding value:
, and the one-agent, single-round expected regret must exceed the same lower bound (since any mechanism with expected regret smaller than this bound would fail the test with probability greater than δ_0). Therefore, summing across all periods gives us:
Reg(T, δ_0) ≥ Ω(√(T log(1/δ_0))).
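The inversion behind these last two steps can be sketched as follows (a routine reconstruction using δ for the instance gap and δ_0 for the target failure probability; the constants are not necessarily the paper's exact ones):
(1/4) exp(−8tδ²) ≤ δ_0  ⟺  δ ≥ √( log(1/(4δ_0)) / (8t) ),
so after t rounds of data the single-round expected regret must be of order √(log(1/δ_0)/t), and
Reg(T, δ_0) ≳ Σ_{t=1}^{T} √( log(1/δ_0)/t ) = Ω(√( T log(1/δ_0) )).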
From Lemma 1 of [Radanovic and Faltings, 2013], it is known that for any mechanism with more than two agents, its truthful Bayesian Nash equilibrium corresponds to a 2-agent mechanism with the same expected payment at its truthful Bayesian Nash equilibrium. Therefore, more agents do not bring additional benefit over the 2-agent case, and we can apply the same lower bound to the N-agent case; hence the N-agent mechanism design problem obeys the same regret lower bound.
The proof roughly follows the same procedure as that of Theorem 4.2, with a few modifications. First, since we do not have an explicit formula for the PAC guarantee, we cannot invert it to obtain a closed-form bound on t for given ε and δ; this may lead to a relatively looser bound for certain estimators, although the tightest bound can always be derived specifically by following the proof of Theorem 4.2. Second, we use the estimator of the conditional distribution P(· | x_k) for each x_k ∈ Y. There are two scenarios in which the estimator could be off:
(1) Agent k does not observe x_k enough times (i.e., t_{x_k} is small).
(2) The estimate of P(· | x_k) is off.
The first scenario does not depend on the estimator used by the principal, since it is a tail event of a multinomial distribution. For the same reason, we cannot directly merge the two failure probabilities as is done in Lemma A.3, which results in the following lemma. Thus we know from Lemma A.5 that P*_j ∈ P_{ε_m}(p̂_jk) with probability at least 1 − δ/(NM) − (K + 1) exp(−μT_{m−1}/8), where M denotes the total number of epochs.
From Theorem 3.7, we know that the mechanism r_jk obtained by solving the distributionally robust program guarantees agent j's truthfulness when P*_j ∈ P_{ε_m}(p̂_jk). Also, during epoch m, the expected regret of a single round is c · C₁ε_m/(1 − C₂ε_m), where C₁ and C₂ are the constants defined in Eq. (7). Therefore, the total expected regret for agent j across all T rounds is
Σ_m ε_{δ/(NM(K+1))}(μT_{m−1}/2) · (T_m − T_{m−1})
Finally, for one agent in one epoch, the scheme ensures truthfulness with the probability given above. Hence, applying a union bound across all N agents and M epochs, truthfulness holds with probability at least 1 − δ − N(K + 1) Σ_m exp(−μT_{m−1}/8).
Also, note that the regret in the warm-start phase is at most proportional to its length and is therefore dominated by the regret from the adaptive phase. Thus the total regret is N times the single agent's regret, and therefore
- P*_j ∈ P_{ε_m}(p̂_jk) with probability at least 1 − δ/(N log T).