Generating Unobserved Alternatives
We consider problems where multiple predictions can be considered correct, but only one of them is given as supervision. This setting differs from both the regression and class-conditional generative modelling settings: in the former, there is a unique observed output for each input, which is provided as supervision; in the latter, there are many observed outputs for each input, and many are provided as supervision. Applying either regression methods or conditional generative models to the present setting often results in a model that can only make a single prediction for each input. We explore several problems that have this property and develop an approach that can generate multiple high-quality predictions given the same input. As a result, it can be used to generate high-quality outputs that are different from the observed output.
💡 Research Summary
The paper introduces a novel learning setting called “Generating Unobserved Alternatives,” which captures scenarios where multiple correct outputs exist for a given input, yet only a single output is provided as supervision. This situation lies between traditional regression (one observed output per input) and conditional generative modelling (many observed outputs per input). In regression, the model is forced to collapse to the single observed label, losing the ability to express alternative plausible outputs. Conditional generative models, on the other hand, rely on abundant paired data for each mode; when only one label is available, they tend to suffer from mode collapse or produce low‑quality samples because the supervision signal is insufficient to cover the full output distribution.
To address this gap, the authors propose a framework that explicitly distinguishes between the observed ground‑truth and a set of latent “alternative candidates.” They introduce a latent variable z that indexes different modes of the conditional distribution p(y | x, z). By learning a mapping from (x, z) to y, the model can generate multiple high‑quality predictions for the same input simply by sampling different values of z. The training objective consists of three components: (1) a likelihood term that maximises the probability of the observed ground‑truth, ensuring basic accuracy; (2) a regularisation term (e.g., KL‑divergence or cross‑entropy) that forces the generated samples to stay within the data manifold; and (3) a mode‑separation regulariser that explicitly penalises different samples collapsing onto the same mode. The mode‑separation term is implemented via contrastive learning or a multi‑class discriminator that forces each sampled output to be recognisable as a distinct class. This combination encourages diversity while preserving fidelity to the underlying distribution.
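The core sampling mechanism described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the linear generator, its weight shapes, the latent dimensionality, and the standard-normal prior over z are all assumptions made here purely to show how resampling z yields distinct predictions for a fixed input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear generator standing in for the learned network that
# indexes modes of p(y | x, z) via the latent variable z.
W_x = rng.normal(size=(4, 3))  # input -> output weights (assumed shapes)
W_z = rng.normal(size=(4, 2))  # latent -> output weights

def generate(x, z):
    """Map one (input, latent) pair to a single prediction y."""
    return W_x @ x + W_z @ z

def sample_alternatives(x, n_samples, latent_dim=2):
    """Draw several predictions for the same input by resampling z from
    a standard-normal prior (the choice of prior is an assumption here)."""
    zs = rng.standard_normal((n_samples, latent_dim))
    return np.stack([generate(x, z) for z in zs])

x = rng.normal(size=3)
ys = sample_alternatives(x, n_samples=5)
print(ys.shape)  # five candidate outputs, one per sampled z
```

Because each candidate depends on z through W_z, distinct latent samples produce distinct outputs; the paper's loss terms (likelihood, manifold regularisation, mode separation) would then shape which of these candidates are both plausible and diverse.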
The authors evaluate the approach on three representative tasks: image captioning (MS‑COCO), machine translation (WMT14 English‑German), and structured question answering. In each case, only a single reference per input is used during training, yet the model is able to generate multiple plausible alternatives at test time. Quantitatively, the method improves BLEU‑4 scores for unseen captions by 0.68 points over strong baselines, yields higher human‑rated diversity‑precision trade‑offs in translation, and increases coverage and recall in QA without sacrificing answer correctness. Qualitative human studies confirm that the generated alternatives are both diverse and semantically coherent, often surfacing valid interpretations that the single supervised label missed.
The paper also discusses limitations. The choice of latent dimensionality and prior distribution for (z) heavily influences the balance between diversity and quality; overly expressive latent spaces can produce unrealistic outputs, while overly constrained ones may not capture sufficient variation. The strength of the mode‑separation regulariser must be carefully tuned: excessive pressure leads to samples that diverge from the true data distribution, whereas insufficient pressure results in mode collapse. Future work is suggested in three directions: (i) automated hyper‑parameter optimisation for latent space design, (ii) active learning strategies to acquire a few additional diverse labels efficiently, and (iii) human‑in‑the‑loop validation pipelines that can iteratively refine the set of generated alternatives.
In summary, this work formalises a previously under‑explored problem setting, proposes a principled latent‑variable model with a carefully crafted loss that balances fidelity, diversity, and mode separation, and demonstrates its effectiveness across vision, language, and reasoning domains. The ability to generate high‑quality, unobserved alternatives from a single supervised example opens new possibilities for applications where multiple plausible outcomes are natural—such as design recommendation, conversational agents, and medical diagnosis—making the contribution both theoretically interesting and practically valuable.