What assortments (subsets of items) should be offered, to collect data for estimating a choice model over $n$ total items? We propose a structured, non-adaptive experiment design requiring only $O(\log n)$ distinct assortments, each offered repeatedly, that consistently outperforms randomized and other heuristic designs across an extensive numerical benchmark that estimates multiple different choice models under a variety of (possibly mis-specified) ground truths. We then focus on Nested Logit choice models, which cluster items into "nests" of close substitutes. Whereas existing Nested Logit estimation procedures assume the nests to be known and fixed, we present a new algorithm to identify nests based on collected data, which when used in conjunction with our experiment design, guarantees correct identification of nests under any Nested Logit ground truth. Our experiment design was deployed to collect data from over 70 million users at Dream11, an Indian fantasy sports platform that offers different types of betting contests, with rich substitution patterns between them. We identify nests based on the collected data, which lead to better out-of-sample choice prediction than ex-ante clustering from contest features. Our identified nests are ex-post justifiable to Dream11 management.
Understanding agent choice is important in many applications: a retailer wants to know how customers choose between different brands of a product; a car dealer wants to know how buyers select from its available models; a policymaker wants to know how citizens substitute among transportation modes. The goal is to estimate a choice model, which specifies for any menu or "assortment" of options, the expected market share that each option would receive.
Choice models capture the phenomenon that offering a smaller assortment concentrates more market share on each remaining option. In the simplest Multinomial Logit (MNL) choice model, these market shares are assumed to all increase by the same percentage. For example, this is corroborated by the sales on Day 2 in Table 1: when milk and boba were not offered, the sales of apple juice and orange juice both increased by 20% relative to Day 1, when all drinks were offered. This suggests that on Day 2, some of the customers whose favorite drink would have been milk or boba chose apple juice or orange juice instead, following a 2-to-1 ratio that is consistent with the sales on Day 1. However, the MNL assumption is often violated, e.g. on Day 3 in Table 1: when one of the juices was not offered, the sales of the other juice increased disproportionately (doubling relative to Day 1 sales) compared to the 20% increases of milk and boba. This calls for richer choice models such as Nested Logit, which partitions the items into “nests” of close substitutes. Under this model, when a customer’s favorite drink (a juice, say) is not offered, they are more likely to switch to a drink in the same nest (the other juice) than to drinks in other nests (milk, boba). Note that fewer total sales were lost on Day 3, because at least one item from each nest was offered, justifying why it may be important to learn the grouping of items into nests.
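The MNL property described above can be made concrete with a small worked sketch. The weights below are hypothetical (not from the paper), chosen so the resulting numbers match the pattern in Table 1; `no_purchase` is an assumed outside option for customers who buy nothing.

```python
# Hypothetical MNL sketch: item i's market share in assortment S is
# w_i / sum_{j in S} w_j, so removing items boosts every remaining
# share by the SAME factor -- the Day 2 pattern in Table 1.

def mnl_shares(weights, assortment):
    """Market shares of the offered items under an MNL model."""
    total = sum(weights[i] for i in assortment)
    return {i: weights[i] / total for i in assortment}

weights = {
    "no_purchase": 1.0,                        # assumed outside option
    "milk": 0.13, "boba": 0.13,                # removed on Day 2
    "apple_juice": 0.2, "orange_juice": 0.1,   # 2-to-1 ratio, as on Day 1
}

day1 = mnl_shares(weights, weights)  # all drinks offered
day2 = mnl_shares(weights, ["no_purchase", "apple_juice", "orange_juice"])

# Both remaining drinks' shares rise by the same factor, 1.56 / 1.30 = 1.2:
for drink in ("apple_juice", "orange_juice"):
    print(drink, round(day2[drink] / day1[drink], 3))  # both print 1.2
```

Any departure from this uniform boost, as on Day 3, is evidence against MNL and in favor of a richer model such as Nested Logit.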
Experiment design problem. Some assortment variation is necessary for learning richer choice models, and this variation may need to be deliberately designed. In the example in Table 1, it is in fact impossible to discern whether milk and boba are close substitutes or not, because they were either both offered or both unavailable on each day. This motivates our primary research question.
How to deliberately design the assortments offered so that complex substitution patterns can be detected and rich choice models can be estimated?
This question is largely ignored in papers that estimate empirical choice models, because they use observational data in which the assortments offered were out of the researcher’s control (see Section 2.1). Meanwhile, papers that estimate choice models from synthetic data generally draw observations from randomized assortments (see Section 2.2), which are not the most efficient form of data collection.
In this paper, we introduce a combinatorial design (explained in Section 1.1) that deliberately arranges the items into a small number of experimental assortments, which are much more informative than randomized assortments. To demonstrate this, we replicate the numerical framework of Berbeglia et al. [2022], and show that by only replacing their randomized experimental assortments while keeping the ground truths and estimation methods fixed, estimation error is robustly decreased (see Section 1.3). That is, even though our experiment design is motivated by Nested Logit, it significantly improves choice model estimation regardless of whether the true and/or estimated models are Nested Logit.
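To build intuition for why $O(\log n)$ assortments can be enough, consider the classical bit-indicator construction sketched below. This is an illustrative assumption on our part, not necessarily the design of Section 1.1: with $2\lceil \log_2 n \rceil$ assortments, every pair of items is "separated" (each is offered somewhere without the other), avoiding the confound in Table 1 where two items were always both offered or both unavailable.

```python
# Hypothetical bit-indicator construction (NOT necessarily the paper's
# design): assortment b offers item i iff bit b of i's index is 1, and we
# also include each complement. This uses 2*ceil(log2 n) assortments, and
# any two distinct items differ in some bit, so each appears in some
# assortment that excludes the other.
import math

def bit_assortments(n):
    """Return 2*ceil(log2 n) assortments separating every pair of n items."""
    k = max(1, math.ceil(math.log2(n)))
    designs = []
    for b in range(k):
        designs.append({i for i in range(n) if (i >> b) & 1})       # bit b set
        designs.append({i for i in range(n) if not (i >> b) & 1})   # complement
    return designs

# For n = 16 items, this is only 8 assortments.
print(len(bit_assortments(16)))  # prints 8
```

Each such assortment would then be offered repeatedly to accumulate sales observations, in the same non-adaptive spirit as the design described above.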
For 20 days in Spring 2025, our experiment design was deployed across 70 million users at Dream11, an Indian fantasy sports platform (see Section 1.4). Dream11 offers different types of “contests” for users to join, and wants to understand how users choose between them.
Nest identification problem. After our experiments were deployed, the managers at Dream11 wanted to understand how the contests could be classified into nests of close substitutes. The contests are diverse in many ways, so a priori, it is difficult to subjectively classify them into nests (the way that one could classify apple juice and orange juice as “juices”). We instead want the classification into nests to be based on the data collected, which motivates our second research question.
How to automatically identify nests based on sales data, instead of relying on subjective classification?
Historically, papers on Nested Logit choice estimation (see Section 2.6) have assumed the nests are fixed, focusing instead on estimating the other model parameters for a combination of tractability and interpretability reasons [Train, 2009, §4]. Standard statistics packages (e.g. nlogit in Stata) also make this assumption. Papers that consider nest identification are surprisingly scant, as we discuss in Section 2.6.
Our second contribution is to propose a new algorithm for nest identification, based on the simple intuition from Table 1. To elaborate, for each item in each experimental assortment (Day 2, Day 3), we define its boost factor to be the ratio of its sales under that assortment to its sales under the full assortment of Day 1.
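The boost-factor computation can be sketched in a few lines, using hypothetical sales figures in the spirit of Table 1 (here assuming apple juice is the drink removed on Day 3):

```python
# Minimal boost-factor sketch with hypothetical Table-1-style sales.
# Boost factor = sales under the experimental assortment / sales on Day 1.

day1_sales = {"milk": 100, "boba": 50, "apple_juice": 200, "orange_juice": 100}
day3_sales = {"milk": 120, "boba": 60, "orange_juice": 200}  # apple_juice absent

boost = {item: day3_sales[item] / day1_sales[item] for item in day3_sales}
print(boost)  # orange_juice's boost (2.0) dwarfs milk's and boba's (1.2)
```

The disproportionate boost of orange juice when apple juice is removed is precisely the signal that the two juices belong to the same nest; items whose boost factors spike together when a common alternative is withheld are candidates for grouping.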