Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement
Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.
💡 Research Summary
The paper introduces GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages the generative, autoregressive capabilities of large language models (LLMs) to produce high‑quality text embeddings. Traditional LLM‑based embedding methods treat the model as a static encoder: a single forward pass extracts a fixed‑size vector, and contrastive learning is applied only once. This paradigm ignores the core strength of LLMs—their ability to generate and iteratively refine information through sequential token prediction.
GIRCSE departs from this by having the LLM generate a short sequence of “soft tokens” after the original input. Each soft token is not a discrete word but a probability distribution over the entire vocabulary, produced by the model’s language‑model head. The distribution is then linearly combined with the token‑embedding matrix to obtain a continuous embedding vector. Because the combination is differentiable, gradients can flow through the generation process, enabling end‑to‑end training.
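The mapping from a soft token to a continuous embedding can be sketched in a few lines. This is a minimal NumPy illustration of the idea described above, not the paper's implementation; the vocabulary size, hidden size, and function names are illustrative assumptions.

```python
import numpy as np

def soft_token_embedding(logits, E):
    """Turn LM-head logits into a differentiable soft-token embedding.

    logits : (V,) scores over the vocabulary from the language-model head
    E      : (V, d) token-embedding matrix
    returns: (d,) convex combination of all vocabulary embeddings
    """
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return p @ E                       # weighted sum, differentiable w.r.t. logits

# Toy sizes (illustrative, not from the paper)
rng = np.random.default_rng(0)
V, d = 100, 16
E = rng.standard_normal((V, d))
logits = rng.standard_normal(V)

d_k = soft_token_embedding(logits, E)
assert d_k.shape == (d,)
```

Because the output is a smooth function of the logits, gradients from the contrastive loss can flow back through every generation step, which is what makes end-to-end training possible.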
The workflow proceeds as follows: (1) The input sentence T is embedded via the model’s token embedding matrix E, yielding X. (2) For K steps (typically 5–20), the decoder predicts a soft token s_k conditioned on X and all previously generated soft tokens. The soft token is turned into an embedding d_k via a weighted sum of all vocabulary embeddings. (3) The sequence of d_k vectors is concatenated with X and fed back into the decoder, producing hidden states for both the original tokens and the generated ones. (4) The hidden states corresponding to the K generated tokens are pooled (mean pooling by default) to form the final sentence representation z.
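The four steps above can be traced with a toy stand-in for the decoder. In this sketch `toy_decoder` is a placeholder assumption (a fixed nonlinear map) where the real system uses the transformer LM; the shapes and control flow mirror the described workflow.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 100, 16, 5                     # toy vocab size, hidden size, K steps
E = rng.standard_normal((V, d))          # token-embedding matrix
W_out = rng.standard_normal((d, V))      # stand-in LM head
W_dec = rng.standard_normal((d, d)) / np.sqrt(d)

def toy_decoder(seq):
    """Placeholder for the LLM: one hidden state per sequence position."""
    return np.tanh(seq @ W_dec)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# (1) embed the input sentence's token ids via E
token_ids = [3, 14, 15]
X = E[token_ids]                         # (T, d)

# (2)-(3) autoregressively generate K soft tokens, feeding each back in
seq = X
for _ in range(K):
    h = toy_decoder(seq)                 # hidden states for the current sequence
    logits = h[-1] @ W_out               # predict the next token from the last state
    d_k = softmax(logits) @ E            # soft token -> continuous embedding
    seq = np.vstack([seq, d_k])          # append and condition the next step on it

# (4) pool hidden states of the K generated positions into the sentence vector z
h_all = toy_decoder(seq)
z = h_all[len(token_ids):].mean(axis=0)  # mean pooling over generated positions
assert z.shape == (d,)
```

The key structural point is that only the hidden states of the K generated positions are pooled; the original input tokens serve as conditioning context.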
The key training objective is the Iterative Contrastive Refinement (ICR) loss. Unlike standard contrastive learning that supervises only the final embedding, ICR applies a stepwise contrastive loss L_k at every generation step, encouraging each intermediate representation z_k to already align with positive examples and repel negatives. This prevents early steps from drifting into meaningless space and provides richer supervision. To ensure that later steps truly improve over earlier ones, a regularization term L_reg penalizes cases where the contrastive loss does not decrease monotonically across steps. The total loss is L_total = L_contrast + λ·L_reg.
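The ICR objective can be sketched as follows, under two stated assumptions: the per-step contrastive loss is an InfoNCE-style loss over cosine similarities, and the monotonicity regularizer is a hinge penalty on any step whose loss fails to drop below the previous step's. The exact functional forms in the paper may differ.

```python
import numpy as np

def info_nce(z, z_pos, z_negs, tau=0.05):
    """InfoNCE loss for one anchor z with one positive and a list of negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(z, z_pos)] + [cos(z, n) for n in z_negs]) / tau
    sims -= sims.max()                   # stabilize the log-sum-exp
    return -sims[0] + np.log(np.exp(sims).sum())

def icr_loss(zs, z_pos, z_negs, lam=0.1):
    """Stepwise contrastive losses plus an assumed hinge-style monotonicity penalty.

    zs : list of intermediate representations z_1 ... z_K, one per generation step
    """
    losses = [info_nce(z_k, z_pos, z_negs) for z_k in zs]
    l_contrast = sum(losses) / len(losses)
    # penalize steps whose loss does not decrease relative to the previous step
    l_reg = sum(max(0.0, losses[k] - losses[k - 1]) for k in range(1, len(losses)))
    return l_contrast + lam * l_reg

# Toy usage with random vectors (illustrative only)
rng = np.random.default_rng(0)
zs = [rng.standard_normal(8) for _ in range(3)]
z_pos = zs[-1] + 0.1 * rng.standard_normal(8)
z_negs = [rng.standard_normal(8) for _ in range(4)]
total = icr_loss(zs, z_pos, z_negs)
```

The design choice to supervise every step, rather than only z_K, is what gives each refinement token a training signal of its own.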
Experiments are conducted with Mistral‑7B and Qwen2.5‑7B as backbones, fine‑tuned via LoRA on a contrastive dataset that mixes supervised query‑document pairs and hard negatives (≈0.2 M samples). GIRCSE is compared against 18 strong baselines, including E5‑Mistral, BGE‑en‑icl, NV‑Embed, and the generative Inbedder. Results on the Massive Text Embedding Benchmark (MTEB) place GIRCSE consistently within the top‑5–6 models, achieving 2–4 percentage‑point absolute gains over the strongest encoder‑only competitors. On instruction‑following tasks, GIRCSE ranks in the top‑2–3, demonstrating that the generated refinement tokens capture task‑relevant nuances (e.g., emotion words like “frustrated” or “struggle”) that static embeddings miss.
A striking property is test‑time scaling: increasing the number of generated tokens at inference time steadily improves embedding quality, a behavior analogous to compute scaling in reasoning LLMs. The authors show that even with a modest increase (e.g., from 5 to 15 tokens) performance gains are near‑linear, while computational overhead is mitigated by KV‑caching, keeping FLOPs within ~1.0–1.1× of a standard single‑pass model.
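A back-of-envelope count makes the overhead claim concrete. With KV-caching, each generated token adds one new query attending over the cached prefix, rather than re-encoding the whole sequence; the numbers below (input length, token counts) are illustrative assumptions, not figures from the paper.

```python
def attention_score_count(T, K, cached=True):
    """Rough count of attention score computations for K generated tokens
    after a length-T input, with or without KV-caching."""
    if cached:
        # one new query per step, attending over all prior positions
        return sum(T + k for k in range(K))
    # naive: rerun full self-attention over the whole sequence at every step
    return sum((T + k) ** 2 for k in range(K))

T = 512                                          # illustrative input length
single_pass = T ** 2                             # baseline: one full forward pass
overhead_15 = 1 + attention_score_count(T, 15) / single_pass
ratio_5_to_15 = attention_score_count(T, 15) / attention_score_count(T, 5)
```

Under these toy numbers the cached overhead for 15 extra tokens stays within a few percent of a single-pass model, consistent in spirit with the ~1.0–1.1× FLOPs figure reported above.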
Limitations are acknowledged. Iterative generation incurs higher latency and memory usage, which may be problematic for real‑time services despite caching optimizations. The soft tokens are not human‑readable, making debugging and interpretability harder. Finally, the work focuses on contrastive supervision; integrating other signals such as ranking losses or clustering objectives remains an open direction.
In summary, GIRCSE pioneers a “generate‑then‑refine” paradigm for text embeddings, showing that LLMs can speak an internal “embedding language” that is continuously refined through contrastive feedback. The framework achieves state‑of‑the‑art results on broad benchmarks, introduces a novel test‑time scaling mechanism, and opens a promising research avenue for combining generative LLM capabilities with representation learning. Future work could explore larger models, multimodal extensions, and richer supervisory signals to further push the boundaries of embedding quality and applicability.