ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction
Interactive texture editing of 3D models offers rich opportunities for creating 3D assets, and freehand drawing is the most intuitive interaction style. However, existing methods primarily support sketch-based interactions for outlining, while coarse-grained scribble-based interaction remains underexplored. Moreover, current methods often struggle with the abstract nature of scribble instructions, which can lead to ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of a scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguity about the target semantic locations. Experimental results show that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.
💡 Research Summary
ScribbleSense introduces a novel interactive framework for editing textures on 3D models by leveraging multimodal large language models (MLLMs) together with state‑of‑the‑art image generation and segmentation tools. The authors first identify three major shortcomings of existing texture‑editing pipelines: (1) lack of prior knowledge about user behavior, leading to ambiguous interpretation of scribble inputs; (2) semantic spatial ambiguity when global text prompts are combined with local scribble cues; and (3) limited diversity in diffusion‑model training data, which hampers the generation of plausible local details. To overcome these issues, ScribbleSense proceeds in four tightly coupled stages.
- **Intent Prediction with Multiview Context** – Users draw free‑hand colored scribbles on the mesh surface. The scribble is split into color and coarse‑mask components. Four rendered views of the original textured mesh (elevation θ = 0°; azimuth φ = 0°, 90°, 180°, 270°) are concatenated with the scribble images and fed to an MLLM (e.g., GPT‑4V or InternVL). The model exploits both color cues and the surrounding visual context to infer the most likely semantic meaning of each scribble (e.g., “lava”, “moss”, “iron‑rich rock”). Multi‑view input mitigates misinterpretations that arise from a single perspective.
- **Global Prompt Generation and Image Synthesis** – The inferred semantics are passed back to the same MLLM, which automatically expands them into a detailed global scene description (e.g., “a stylized volcanic landscape with glowing lava flows cutting through rugged slopes”). These prompts are submitted to a recent Stable Diffusion model, producing several full‑scene images that embed the desired local texture within a coherent global context.
- **Local Texture Selection** – From the generated global images, the system extracts candidate texture patches that match the target semantics. An MLLM then evaluates each patch for color fidelity, semantic relevance, and stylistic consistency with the original mesh texture, selecting the most appropriate one. Because diffusion models are trained on whole‑object images, extracting patches from globally coherent generations sidesteps this limitation and reuses style cues already present in the original texture.
- **Geometry‑Guided Mask Refinement** – Because scribbles are inherently coarse, the initial mask is refined using geometric information from the 3D mesh. The Segment Anything Model (SAM) first produces a minimal segmentation around the scribble in the initial view. The mask is then projected onto subsequent views using mesh correspondence, and SAM is reapplied iteratively. This multi‑view refinement converges to a precise editing region that aligns with the user’s true intent.
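The multi-view refinement loop above depends on transferring a mask from one rendered view to the next. A minimal sketch of that projection step, assuming a precomputed per-pixel correspondence map (derived from the mesh, mapping each view-B pixel to the view-A pixel that sees the same surface point) and using plain NumPy in place of the renderer and SAM:

```python
import numpy as np

def project_mask(mask_a: np.ndarray, corr_b_to_a: np.ndarray) -> np.ndarray:
    """Transfer a binary mask from view A to view B via mesh correspondence.

    mask_a      : (H, W) bool mask in view A.
    corr_b_to_a : (H, W, 2) int map; corr_b_to_a[y, x] = (ya, xa) is the
                  view-A pixel showing the surface point seen at view-B
                  pixel (y, x), or (-1, -1) where that point is occluded
                  or outside view A.
    """
    h, w = corr_b_to_a.shape[:2]
    ya = corr_b_to_a[..., 0]
    xa = corr_b_to_a[..., 1]
    visible = (ya >= 0) & (xa >= 0)
    mask_b = np.zeros((h, w), dtype=bool)
    # Fancy indexing looks up the mask value at each corresponding pixel.
    mask_b[visible] = mask_a[ya[visible], xa[visible]]
    return mask_b
```

In the full loop, SAM would then be re-run on view B with this projected mask as a prompt, and the result projected onward to the next view.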
Finally, the refined mask and the selected local texture patch are fed to an inpainting model, which seamlessly blends the new texture into the original mesh, preserving continuity at the boundaries.
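The compositing idea can be illustrated with a simple feathered alpha blend; note this NumPy sketch is only a stand-in for the diffusion inpainting model the method actually uses, and the 3×3 box-blur feathering is an assumption for illustration:

```python
import numpy as np

def feather_mask(mask: np.ndarray, iters: int = 2) -> np.ndarray:
    """Soften a binary mask into an alpha map by repeated 3x3 box blurring."""
    alpha = mask.astype(float)
    for _ in range(iters):
        padded = np.pad(alpha, 1)  # zero-pad so edges blur toward 0
        alpha = sum(
            padded[1 + dy : 1 + dy + alpha.shape[0],
                   1 + dx : 1 + dx + alpha.shape[1]]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        ) / 9.0
    return alpha

def blend_patch(texture: np.ndarray, patch: np.ndarray,
                mask: np.ndarray) -> np.ndarray:
    """Composite the new texture patch into the original inside the mask."""
    alpha = feather_mask(mask)[..., None]  # broadcast over RGB channels
    return alpha * patch + (1.0 - alpha) * texture
```

The soft falloff at the mask boundary is what keeps the edit from producing a hard seam; the inpainting model achieves the same continuity generatively.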
Extensive experiments compare ScribbleSense against prior scribble‑based methods such as TEXTure, Diffusion Texture Painting, and SketchDream. Quantitative metrics (PSNR, FID) and user studies demonstrate superior color accuracy, semantic alignment, and overall visual quality. Ablation studies confirm that both the MLLM‑driven intent prediction and the geometry‑guided mask refinement contribute significantly to performance gains.
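For reference, the PSNR metric used in the quantitative comparison is standard; for images with values in [0, peak] it can be computed as:

```python
import numpy as np

def psnr(reference: np.ndarray, test: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((reference.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

FID, by contrast, compares feature statistics of image sets under an Inception network and is typically computed with an off-the-shelf implementation rather than by hand.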
In summary, ScribbleSense showcases how the combination of multimodal language understanding, powerful diffusion synthesis, and geometry‑aware segmentation can transform free‑hand scribble inputs into precise, high‑fidelity texture edits on 3D assets. The approach promises to streamline content creation workflows in virtual reality, gaming, and digital entertainment, offering an intuitive, annotation‑free editing experience that respects both user intent and existing texture style.
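The staged pipeline described above can be summarized as a simple orchestration loop. Every helper below is a hypothetical stub standing in for the real MLLM, diffusion, and SAM components; only the data flow between stages reflects the method:

```python
def predict_intent(scribble_color, views):
    # Stage 1 (stub): an MLLM infers semantics from color + multi-view context.
    return "lava"

def expand_prompt(semantic):
    # Stage 2 (stub): the MLLM expands the label into a global scene prompt.
    return f"a stylized volcanic landscape with glowing {semantic} flows"

def generate_and_select_patch(prompt):
    # Stages 2-3 (stub): diffusion generates global images; an MLLM scores
    # candidate patches and picks one. A placeholder identifier stands in.
    return {"prompt": prompt, "patch": "patch_0"}

def refine_mask(coarse_mask, n_views=4):
    # Stage 4 (stub): SAM + mesh-correspondence projection, iterated per view.
    return {"mask": coarse_mask, "views_refined": n_views}

def edit_texture(scribble_color, coarse_mask, views):
    """Orchestrate the four stages; inpainting would consume the outputs."""
    semantic = predict_intent(scribble_color, views)
    prompt = expand_prompt(semantic)
    patch = generate_and_select_patch(prompt)
    mask = refine_mask(coarse_mask)
    return {"semantic": semantic, "patch": patch, "mask": mask}
```

The key design point visible even in this skeleton is that the MLLM appears at three separate decision points (intent, prompt expansion, patch selection), while geometry enters only in the final mask stage.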