Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Reading time: 5 minutes
...

📝 Original Info

  • Title: Spanning the Visual Analogy Space with a Weight Basis of LoRAs
  • ArXiv ID: 2602.15727
  • Date: 2026-02-17
  • Authors: Not explicitly listed in the provided metadata; the work appears to have been conducted by researchers at the NVIDIA Research Lab (PAR).

📝 Abstract

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives; informally, it chooses a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb

💡 Deep Analysis

📄 Full Content

Text-based image editing models [5,6,49,60,67] have recently emerged as powerful tools for controllable image generation and manipulation, enabling users to modify images through textual descriptions. However, many visual transformations are inherently difficult to articulate precisely through text alone. For example, consider describing the transformation that converts a photo into the style of a specific painting, or conveying an exact target pose through text. Such limitations motivate alternative paradigms that can capture and apply complex visual transformations.

Visual analogy learning [23] offers a compelling solution to this challenge by enabling models to understand transformations through examples rather than explicit descriptions. In this paradigm, given a triplet of images {a, a′, b}, the goal is to generate an image b′ such that the visual relationship a : a′ :: b : b′ holds. That is, the transformation applied between a and a′ should be analogously applied to b to produce b′. This approach allows users to specify complex visual changes through demonstration, making it possible to capture nuanced transformations that would be difficult or impossible to describe textually.
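
Stated compactly, the task is to learn a conditional generator (the symbol $G_\theta$ below is introduced here only for exposition, not taken from the paper):

$$
\mathbf{b}' = G_\theta(\mathbf{a}, \mathbf{a}', \mathbf{b}) \quad \text{such that} \quad \mathbf{a} : \mathbf{a}' \;::\; \mathbf{b} : \mathbf{b}'.
$$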

Early learning-based approaches trained stand-alone analogy models directly from analogy data [4,32,44,57,58,61], but this led to limited task diversity and image quality, or required extensive compute. More recent work aims to leverage the rich prior of powerful text-to-image backbones by adapting them to the visual analogy task, using a single Low-Rank Adaptation (LoRA) module [17,34,50]. While effective, these methods face a fundamental limitation: they attempt to capture the diverse space of possible transformations within a single adapter. This constraint may limit the model’s ability to generalize across the rich variety of relationships that exist in images.

We hypothesize that specializing the model to each specific analogy task at inference time may improve performance and generalization. While this objective could theoretically be achieved via hypernetworks that generate task-specific LoRAs [50], these are notoriously difficult to train and often suffer from instability [39]. Instead, we draw inspiration from recent work demonstrating that LoRAs from fine-tuned models (e.g., for personalization tasks) can span a meaningful semantic basis, and that interpolating between these LoRAs can effectively cover new points in this semantic space [12]. Building on this insight, we explore a similar principle for visual analogy learning and propose LoRWeB, a two-component system: (1) a learnable basis of LoRA modules and (2) a lightweight encoder that dynamically combines LoRAs from this basis at inference time based on the input analogy pair. These components are jointly trained, enabling the model to compose appropriate transformations for novel analogies unseen during training.
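
To make the two components concrete, the following is a minimal PyTorch-style sketch of the basis-of-LoRAs idea, assuming a frozen linear layer from the text-to-image backbone; the class name, basis size, rank, and initialization are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRABasisLinear(nn.Module):
    """A frozen linear layer augmented with a learnable basis of LoRA modules.

    Effective weight: W_eff = W + sum_k c_k * (B_k @ A_k), where the coefficients
    c are predicted per analogy task by a lightweight encoder (see next sketch).
    """

    def __init__(self, base: nn.Linear, num_basis: int = 8, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # keep the pretrained backbone frozen
        out_f, in_f = base.out_features, base.in_features
        # Basis of rank-r factors: A_k maps in -> rank, B_k maps rank -> out.
        self.A = nn.Parameter(0.01 * torch.randn(num_basis, rank, in_f))
        self.B = nn.Parameter(torch.zeros(num_basis, out_f, rank))  # zero-init: no change at start

    def forward(self, x: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
        # coeffs: (num_basis,) mixing weights for one analogy task.
        delta = torch.einsum("k,kor,kri->oi", coeffs, self.B, self.A)
        return self.base(x) + F.linear(x, delta)
```

At inference time the frozen backbone is specialized to a new analogy simply by supplying a different `coeffs` vector, which is what allows a single set of learned primitives to cover transformations not seen during training.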

Existing methods typically encode analogy images using vision-language models such as CLIP [42] or SigLIP [63] and provide these encodings as context to the generative model. Such encodings supply the high-level semantic understanding needed to interpret the analogy task, but they can come at the cost of fine-grained visual detail. Recent advances have shown that diffusion models can extract remarkably accurate visual details through extended attention mechanisms [5,8]. Thus, we leverage this capability by providing the full analogy triplet directly to the diffusion model via an extended-attention mechanism, while reserving CLIP-based encodings specifically for LoRA selection. This approach allows LoRWeB to balance fine-detail consistency with the higher-level semantics required to understand the analogy task.
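
Continuing the sketch above, one plausible way to realize this split is a small MLP over CLIP embeddings of the exemplar pair that outputs the basis coefficients, while the triplet's own tokens enter the denoiser through extended attention. All names, dimensions, and the softmax gating below are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoeffEncoder(nn.Module):
    """Lightweight encoder: CLIP embeddings of (a, a') -> weights over the LoRA basis."""

    def __init__(self, clip_dim: int = 768, num_basis: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, 256), nn.GELU(), nn.Linear(256, num_basis)
        )

    def forward(self, clip_a: torch.Tensor, clip_a_prime: torch.Tensor) -> torch.Tensor:
        # For a single analogy, pass unbatched (clip_dim,) embeddings; the softmax keeps
        # the composed adapter inside the convex hull of the basis (a design assumption).
        return F.softmax(self.mlp(torch.cat([clip_a, clip_a_prime], dim=-1)), dim=-1)


def extended_attention(q, k, v, k_ctx, v_ctx):
    """Self-attention whose keys/values are extended with tokens from the analogy
    triplet, so fine-grained visual detail reaches the generated image directly."""
    k_all = torch.cat([k, k_ctx], dim=-2)  # concatenate along the token axis
    v_all = torch.cat([v, v_ctx], dim=-2)
    return F.scaled_dot_product_attention(q, k_all, v_all)
```

The split mirrors the design choice in the paragraph above: the CLIP pathway supplies only the coarse "which transformation is this" signal used to pick a point in the LoRA basis, while pixel-level fidelity is handled inside the diffusion model's attention.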

We evaluate LoRWeB against established baselines and demonstrate it achieves state-of-the-art results. Our contributions include: (1) a novel architecture that decomposes visual analogy learning into a basis of LoRAs with dynamic composition, and (2) a comprehensive evaluation showing improved generalization to unseen transformations compared to existing single-LoRA approaches.

Visual analogies. Visual analogies, also known as “Image Analogies” [23], “Visual Prompting” [4], or “Visual Relations” [17], refer to the task of learning a transformation from a pair of before-and-after exemplars and applying it analogously to new images. Early non-neural methods learned explicit per-pair filters for simpler tasks like style transfer [23]. Network-based methods later used image embedding spaces to represent analogies through simple vector arithmetic [44]. While these methods showed promise on datasets of simple, isolated objects, they struggled with the complexity of real-world images. Newer methods instead treat analogy learning as in-context learning, where the model is directly conditioned on the exemplar pair and a reference image, and is trained to synthesize the matching target [4,57,58,61].

More recently ...

Reference

This content is AI-processed based on open access ArXiv data.
