Conjuring Semantic Similarity

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The semantic similarity between sample expressions measures the distance between their latent ‘meaning’. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or ‘conjure.’ We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.


💡 Research Summary

The paper introduces a novel way to measure semantic similarity between textual expressions by leveraging the visual content generated by text‑conditioned diffusion models. Instead of relying on textual co‑occurrence statistics or the distribution of language model continuations, the authors propose to “conjure” the meaning of a prompt: they examine the distribution of images that a diffusion model produces when conditioned on that prompt. The core idea is to compare the stochastic processes (reverse‑time SDEs) induced by two prompts and to quantify their difference using Jeffreys divergence, which is the symmetrized Kullback‑Leibler (KL) divergence between the two path measures.

Mathematically, a diffusion model learns a score function sθ(x, t | y) that approximates the gradient of the log‑density of the data conditioned on a textual prompt y. The reverse‑time SDE for a given prompt y can be written as
dx = μθ(x, t, y) dt + g(t) dw̄ₜ,
where μθ(x, t, y) = f(x, t) − g(t)² sθ(x, t | y). For two prompts y₁ and y₂ we obtain two SDEs with drifts μ₁ and μ₂. By Girsanov’s theorem, the KL divergence between the corresponding path measures reduces to an expectation of the squared L₂ norm of the drift difference, weighted by 1/g(t)². Symmetrizing yields the Jeffreys divergence, which (ignoring constants) becomes an expectation, over time t and the mixture of the two prompts’ marginal image distributions, of the squared norm ‖sθ(x, t | y₁) − sθ(x, t | y₂)‖².
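The Girsanov step described above can be written compactly as follows; this is a sketch using the standard path-measure KL formula, and the exact constant factors depend on the paper’s conventions:

```latex
D_{\mathrm{KL}}(\mathbb{P}_{y_1} \,\|\, \mathbb{P}_{y_2})
  = \mathbb{E}_{\mathbb{P}_{y_1}} \int_0^T
    \frac{\|\mu_\theta(x_t, t, y_1) - \mu_\theta(x_t, t, y_2)\|^2}{2\,g(t)^2}\, dt,
\qquad
D_{\mathrm{J}}(y_1, y_2)
  = D_{\mathrm{KL}}(\mathbb{P}_{y_1}\|\mathbb{P}_{y_2})
  + D_{\mathrm{KL}}(\mathbb{P}_{y_2}\|\mathbb{P}_{y_1}).
```

Since the drifts differ only through the score term, the drift difference is proportional to sθ(x, t | y₁) − sθ(x, t | y₂), which is why the symmetrized divergence reduces (up to convention-dependent weights in t) to an expectation of the squared score difference over the mixture of the two marginals.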

In practice the authors approximate this expectation with Monte‑Carlo sampling. They sample an initial noise vector from the prior π (a standard Gaussian), then run the diffusion model’s denoising process twice—once conditioned on y₁ and once on y₂—collecting the intermediate latent states xₜ at each timestep. At each step they compute the Euclidean distance between the two score predictions and accumulate the squared norm. Averaging over a small number of Monte‑Carlo runs (they use T = 10 timesteps for feasibility) yields a scalar similarity score d₍ours₎(y₁, y₂).
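The Monte-Carlo estimator can be sketched as follows. This is an illustration, not the authors’ code: `score` is a toy analytic stand-in for the text-conditioned score network (the exact score of a unit Gaussian whose mean plays the role of the “prompt”), and `jeffreys_distance` mirrors the shared-noise, two-trajectory accumulation described above.

```python
import numpy as np

def score(x, t, prompt_mean):
    # Toy stand-in for the text-conditioned score network s_theta(x, t | y):
    # the exact score of a unit Gaussian centered at prompt_mean, so different
    # "prompts" simply shift the mean of the target distribution.
    return -(x - prompt_mean)

def jeffreys_distance(mean1, mean2, n_samples=32, n_steps=10, dim=4, seed=0):
    """Monte-Carlo estimate of the (unnormalized) Jeffreys-style distance:
    the average, over denoising trajectories run under both prompts from a
    shared initial noise draw, of ||s(x, t | y1) - s(x, t | y2)||^2."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, 1e-3, n_steps)
    dt = ts[0] - ts[1] if n_steps > 1 else 1.0
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.standard_normal(dim)       # shared sample from the Gaussian prior
        x1, x2 = x0.copy(), x0.copy()
        for t in ts:
            # accumulate the squared score difference on both trajectories,
            # approximating the expectation over the mixture of marginals
            total += np.sum((score(x1, t, mean1) - score(x1, t, mean2)) ** 2)
            total += np.sum((score(x2, t, mean1) - score(x2, t, mean2)) ** 2)
            # Euler-Maruyama denoising step under each prompt, with shared noise
            noise = np.sqrt(dt) * rng.standard_normal(dim)
            x1 = x1 + score(x1, t, mean1) * dt + noise
            x2 = x2 + score(x2, t, mean2) * dt + noise
    return total / (2 * n_samples * n_steps)
```

For this toy Gaussian score the difference s(·|y₁) − s(·|y₂) is the constant mean₂ − mean₁, so the estimate equals dim · (mean₂ − mean₁)² exactly; with a real diffusion model the two score calls would be network evaluations on the intermediate latents, and the result would be a genuine Monte-Carlo estimate.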

The experimental setup uses Stable Diffusion v1.4 with classifier‑free guidance (scale = 7) and the LMS scheduler. Images are generated at 512 × 512 resolution, but distances are computed in the 64 × 64 latent space. The authors evaluate their metric against human‑annotated semantic similarity scores on standard benchmark datasets. The proposed distance correlates strongly with human judgments, achieving Pearson and Spearman coefficients comparable to or exceeding those of zero‑shot CLIP‑based baselines. Moreover, the method provides visual explanations: by visualizing the denoising trajectories for two prompts (e.g., “Snow Leopard” vs. “Bengal Tiger”), one can see how the model morphs characteristic visual attributes (spots ↔ stripes), offering an interpretable view of the semantic gap.

Key contributions include: (1) redefining semantic similarity as a visual grounding problem; (2) deriving a tractable, theoretically grounded distance based on Jeffreys divergence of diffusion SDEs; (3) demonstrating that this distance aligns well with human perception while also delivering visual interpretability; (4) opening a new avenue for evaluating text‑conditioned generative models beyond traditional quality‑or‑diversity metrics such as FID or CLIP score.

The paper acknowledges limitations. The approach is currently tied to text‑conditioned diffusion models; extending it to other generative families (e.g., VAE‑GANs, text‑to‑video models) would require analogous reverse‑process formulations. Multi‑modal prompts that evoke diverse image modes may be oversimplified when collapsed into a single distribution. Monte‑Carlo sampling is computationally expensive (≈2 seconds per step on an RTX 4090), so scaling to large corpora would need more efficient estimators or variance‑reduction techniques.

Future work could explore (a) applying the framework to other conditional generative architectures, (b) reducing computational cost via importance sampling or learned surrogate models, and (c) integrating multi‑modal representations to handle prompts that generate heterogeneous visual content. Overall, the study offers a compelling, theoretically sound, and practically useful perspective on semantic similarity, bridging the gap between textual meaning and visual imagination in modern AI systems.

