Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing

Reading time: 5 minutes
...

📝 Original Info

  • Title: Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing
  • ArXiv ID: 2602.16111
  • Date: 2026-02-18
  • Authors: Not provided (author information is not listed in the paper body)

📝 Abstract

Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable surrogate-based prevalence measurement framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using score bucketing as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket-level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment-control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.

📄 Full Content

Modern recommender systems and media platforms must balance user engagement against the need to manage exposure to certain content attributes. Teams often summarize such exposure objectives in terms of prevalence: the fraction of impressions associated with a given target category. At the same time, product decisions are largely driven by large-scale A/B experiments [8], where variants adjust ranking or filtering and are evaluated on both engagement and attribute-specific exposure metrics.

One way to measure prevalence is to sample content from traffic, label it with LLMs using expert-validated prompts, and apply an estimator. In our setting, we use PPSWOR (probability-proportional-to-size without replacement) sampling [7] and the Hansen-Hurwitz estimator [6] to obtain high-quality, unbiased measurements.
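
To make the reference pipeline concrete, the following is a minimal sketch of a Hansen-Hurwitz-style prevalence estimate under simplifying assumptions introduced for illustration: sampling with replacement, selection probability exactly proportional to impressions, and labels obtained only for the sampled items. The function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def hansen_hurwitz_prevalence(impressions, labels, n=1000):
    """Hansen-Hurwitz-style estimate of impression-weighted prevalence.

    Target: sum_i I_i * Z_i / sum_i I_i. Sampling items with replacement with
    probability p_i = I_i / sum_i I_i, the HH estimate of the numerator is
    (1/n) * sum_j I_j * Z_j / p_j = (sum_i I_i / n) * sum_j Z_j, so the
    prevalence estimate reduces to the sample mean of the labels Z_j.
    """
    impressions = np.asarray(impressions, dtype=float)
    labels = np.asarray(labels)
    p = impressions / impressions.sum()
    idx = rng.choice(len(impressions), size=n, p=p)  # PPS sample (with replacement)
    return float(labels[idx].mean())                 # Z_j would come from LLM labeling of the sampled items
```

In this sketch only the sampled items would be sent for LLM labeling, which is what keeps the reference measurement affordable; the production pipeline instead uses a without-replacement design, recapped later in the section.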

However, directly using LLM-based prevalence as a default metric for every experiment is impractical. Running a separate LLM job per arm and per segment is expensive, and quickly becomes infeasible on a platform with hundreds of concurrent experiments. Moreover, experimenters often care about relatively small but meaningful changes in prevalence. A single LLM measurement per arm tends to emphasize the absolute level, so many small treatment-control deltas appear statistically non-significant. Running LLM labeling per experiment, per segment, and per day to address this would further amplify cost and latency.

We address these challenges with an ML-score surrogate method that reuses a single LLM-based calibration of model scores across many experiments. We discretize model scores into buckets, estimate bucket-level prevalences from an offline calibration sample, and then combine these with the observed distribution of impressions over buckets in each experiment arm to obtain fast, log-based prevalence estimates. We integrate this method into a large-scale experimentation platform and augment it with a day-level aggregation that focuses on the treatment-control delta, improving sensitivity to small shifts.
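
To make the score-bucketing surrogate concrete, here is a minimal sketch under stated assumptions: a fixed set of B-1 score thresholds defining B buckets, bucket-level prevalences already calibrated offline against an LLM-labeled reference sample, and per-item scores and impression counts available from logs. The function and argument names are illustrative, not from the paper.

```python
import numpy as np

def surrogate_prevalence(scores, impressions, bucket_edges, bucket_prevalence):
    """Log-based prevalence estimate for one experiment arm (or segment).

    scores:            model score of each impressed item in the arm
    impressions:       impression count of each item in the arm
    bucket_edges:      B-1 interior score thresholds defining B buckets
    bucket_prevalence: calibrated P(category | bucket), length B, estimated
                       once offline from an LLM-labeled sample
    """
    scores = np.asarray(scores, dtype=float)
    impressions = np.asarray(impressions, dtype=float)
    bucket_prevalence = np.asarray(bucket_prevalence, dtype=float)

    buckets = np.digitize(scores, bucket_edges)        # bucket index per item, 0..B-1
    share = np.bincount(buckets, weights=impressions,  # impression share of each bucket in the arm
                        minlength=len(bucket_prevalence)) / impressions.sum()
    return float(share @ bucket_prevalence)            # sum_b share(b) * calibrated prevalence(b)
```

Because the offline calibration is reused across experiments, the treatment-control delta is simply the difference between two such estimates computed from each arm's impression logs.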

Our approach is closely related to surrogate-outcome methods, where the quantity of interest is not directly observed but inferred using an intermediate signal [1]. Such methods are commonly used when long-run outcomes are costly or slow to measure in experiments, and only short-run outcomes are readily observed. In our setting, direct daily LLM-based prevalence measurement is accurate but operationally impractical; the ML-score therefore serves as a surrogate for prevalence. A key advantage of our setting is that we can directly validate surrogate quality by comparing surrogate-based estimates to LLM-based prevalence estimates on the same experiments.

We consider the problem of estimating the prevalence of a category 𝑘 on a large-scale media platform. Let K = {Food-Recipe, Lawn and Garden, Gen-AI Generated, …} denote the set of content attributes considered in this study, and fix a particular 𝑘 ∈ K.

Our target quantity is the prevalence of category 𝑘 within a segment 𝑆, which can denote the full population or a specific subset such as an experiment arm (control, treatment), a user demographic (country, age), an app surface, or an intersection of these.

The engineering details of our production prevalence pipeline are described in Farooq et al. [5]. In this section, we briefly recap the core estimator and introduce notation that we will use throughout the rest of the paper, in particular when we describe the ML-score surrogate method and its calibration.

We consider a large population of content items, indexed by 𝑖 = 1, . . . , 𝑁 . For each item 𝑖, we observe:

• 𝐼 𝑖 : the total number of impressions of item 𝑖 over a given time window.
• 𝑍 𝑖,𝑘 ∈ {0, 1}: a label indicating whether item 𝑖 is truly in category 𝑘 (1) or not (0).

For a given segment 𝑆, let D (𝑆) ⊆ {1, . . . , 𝑁 } denote the set of items that receive impressions from 𝑆, and let 𝐼 𝑖 (𝑆) be the number of impressions of item 𝑖 from 𝑆 over the time window of interest. The total impressions of item 𝑖 satisfy 𝐼 𝑖 = Σ_𝑆 𝐼 𝑖 (𝑆). The prevalence of category 𝑘 in segment 𝑆 can be written as:

Prev 𝑘 (𝑆) = ( Σ_{𝑖 ∈ D (𝑆)} 𝐼 𝑖 (𝑆) · 𝑍 𝑖,𝑘 ) / ( Σ_{𝑖 ∈ D (𝑆)} 𝐼 𝑖 (𝑆) )    (1)

When 𝑆 is the entire population, D (𝑆) = {1, . . . , 𝑁 } and 𝐼 𝑖 (𝑆) = 𝐼 𝑖 , recovering the global prevalence definition.
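
If labels were available for every impressed item, the prevalence in (1) could be computed exactly from the logs; the short function below is a sketch of that computation (names are illustrative) and serves as the ground-truth target that the sampled and surrogate estimators approximate.

```python
import numpy as np

def exact_prevalence(segment_impressions, labels):
    """Impression-weighted prevalence of category k in segment S.

    segment_impressions: I_i(S) for each item i in D(S)
    labels:              Z_{i,k} in {0, 1} for the same items
    """
    w = np.asarray(segment_impressions, dtype=float)
    z = np.asarray(labels, dtype=float)
    return float((w * z).sum() / w.sum())
```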

Directly labeling all items in D (𝑆) (often billions of items across many segments) is infeasible, so we estimate these quantities from a sample. We adopt a PPSWOR design, in which items are sampled with probabilities proportional to a chosen size measure. Compared with simple random sampling, PPSWOR allows us to oversample items that are more informative (e.g., high-impression items), achieving similar statistical power with a much smaller sample size.

For category 𝑘 and segment 𝑆, each item 𝑖 ∈ D (𝑆) is assigned a sampling weight

𝑤 𝑖,𝑘 (𝑆) ∝ 𝑓 (𝐼 𝑖 (𝑆)),    (2)

and is selected into the sample with probability

Here 𝑓 is a function of impressions; it can be 𝐼 𝑖 (𝑆) itself, or a combination such as 𝐼 𝑖 (𝑆) multiplied by other factors (e.g., model score).
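
One common way to draw such a weighted without-replacement sample is order sampling with exponential keys: give each item a random key whose scale depends on its weight and keep the items with the smallest keys. The sketch below illustrates this for weights 𝑤 𝑖,𝑘 (𝑆) = 𝑓 (𝐼 𝑖 (𝑆)); it is an illustrative stand-in rather than the paper's production sampler, and the default choice 𝑓(𝑥) = 𝑥 is our assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

def ppswor_sample(impressions, n, f=lambda x: x):
    """Weighted sampling without replacement via exponential keys.

    Each item i receives key E_i / w_i with E_i ~ Exp(1) and w_i = f(I_i(S));
    keeping the n smallest keys yields a without-replacement sample in which
    high-weight (e.g., high-impression) items are more likely to be included.
    """
    weights = f(np.asarray(impressions, dtype=float))
    keys = rng.exponential(size=len(weights)) / weights
    return np.argsort(keys)[:n]  # indices of the sampled items, to be sent for LLM labeling
```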
