Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
📝 Abstract
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
📄 Content
parses identified assets, such as images and text, to semantically reconstruct the document into an editable form. This enables quick, accessible editing without the need to re-design documents from scratch.
Among common structured representation methods, Scalable Vector Graphics (SVGs) offer a flexible structure for representing multimedia documents by encoding image and text assets as discrete and editable elements. The format's hierarchical design allows for precise manipulation of individual components, facilitating straightforward editing and reordering.
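To make this concrete, the sketch below (illustrative only, not taken from the paper) builds a minimal slide-like SVG in which one image asset and one text asset are kept as discrete elements; the file name, coordinates, and text content are invented for the example. Because each asset's position, size, and content live in plain attributes, editing an element is a single attribute or string change rather than a re-rasterization.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)

def build_slide_svg():
    """Build a minimal slide-like SVG: one image asset and one text asset,
    each represented as a discrete, independently editable element."""
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": "1280", "height": "720", "viewBox": "0 0 1280 720"},
    )
    # Image asset: geometry is stored as plain attributes, so moving or
    # resizing it is a single attribute edit (href is a placeholder).
    ET.SubElement(
        svg,
        f"{{{SVG_NS}}}image",
        {"x": "80", "y": "160", "width": "520", "height": "400",
         "href": "figure1.png"},
    )
    # Text asset: the content stays as a string, so it remains searchable
    # and re-stylable instead of being baked into pixels.
    title = ET.SubElement(
        svg,
        f"{{{SVG_NS}}}text",
        {"x": "80", "y": "96", "font-size": "48",
         "font-family": "Helvetica"},
    )
    title.text = "Semantic Document Derendering"
    return svg

slide = build_slide_svg()
print(ET.tostring(slide, encoding="unicode"))
```

Reordering or deleting an asset amounts to moving or removing one child element of the SVG tree, which is exactly the kind of structured editing that a flat raster cannot support.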
Despite the advantages of SVG, most existing approaches that derender raster images into SVG format rely on low-level geometric primitives, such as curves and polygons (Ma et al. 2022; Rodriguez et al. 2023; Carlier et al. 2020; Reddy et al. 2021), which work well for simple icons and logos but fall short when applied to complex multimedia documents. These methods often produce unstructured representations that fail to capture the semantic layout of documents like slides, underscoring the need for SVG reconstruction techniques for multimedia documents that move beyond primitive-based approximations.
Overcoming these limitations requires a model that can interpret intricate visual inputs and generate structured code, a combination of capabilities that constitutes a core strength of modern Vision-Language Models (VLMs). Recent advancements in VLMs have showcased robust performance in code generation (Jiang et al. 2024; Zheng et al. 2023) and image-to-text tasks (Team et al. 2023; Achiam et al. 2023; Bai et al. 2023; Team et al. 2025), demonstrating powerful image understanding and object detection abilities that are well-suited for high-level SVG reconstruction.
Motivated by these successes, we present SliDer (Slide Derenderer), a novel VLM-based framework that converts raster multimedia documents into structured, editable SVG representations. We focus on slide-based documents, as their rich composition of text, images, and complex layouts makes them both a popular communication tool in many domains and a challenging benchmark. As illustrated in Figure 1, our method derenders a raster slide into an SVG representation that faithfully reconstructs the original raster slide upon rendering. A key feature of our approach is its ability to iteratively refine its own predictions at inference time, allowing it to correct initial errors and progressively improve reconstruction fidelity. Notably, the images and text contained in the raster slide are parsed into individual, editable assets, enabling independent modifications.
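The iterative refinement idea can be sketched schematically. In the sketch below, all names are hypothetical: the real system would query a VLM and render actual SVG, whereas here the refinement pass is stubbed as a function that nudges predicted element boxes toward the evidence, and the perceptual loss is stood in for by a simple coordinate error. The point is only the control flow, namely that inference repeats predict-render-compare until the reconstruction stabilizes.

```python
def reconstruction_error(pred_boxes, target_boxes):
    """Mean absolute coordinate error between predicted and target layouts,
    standing in for a perceptual loss (e.g. LPIPS) on rendered output."""
    total = sum(abs(p - t)
                for pb, tb in zip(pred_boxes, target_boxes)
                for p, t in zip(pb, tb))
    return total / (4 * len(target_boxes))

def refine_step(pred_boxes, target_boxes, rate=0.5):
    """Stub for one refinement pass: move each predicted box partway
    toward where the visual evidence (here, the target) says it belongs."""
    return [tuple(p + rate * (t - p) for p, t in zip(pb, tb))
            for pb, tb in zip(pred_boxes, target_boxes)]

def derender(initial_boxes, target_boxes, max_iters=10, tol=1.0):
    """Iterative inference loop: refine until the rendered reconstruction
    is close enough to the input raster, or an iteration budget runs out."""
    boxes = initial_boxes
    for _ in range(max_iters):
        if reconstruction_error(boxes, target_boxes) < tol:
            break
        boxes = refine_step(boxes, target_boxes)
    return boxes

# Two elements whose initial prediction is far off; refinement converges.
initial = [(0.0, 0.0, 100.0, 100.0), (300.0, 50.0, 200.0, 40.0)]
target = [(80.0, 160.0, 520.0, 400.0), (80.0, 40.0, 600.0, 60.0)]
final = derender(initial, target)
print(round(reconstruction_error(final, target), 3))
```

With the 0.5 step rate the error halves each pass, so a coarse initial layout converges within the iteration budget; this mirrors how early prediction errors get corrected rather than locked in.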
To develop our method and advance research in this domain, we also introduce Slide2SVG, a new dataset for slide derendering. Comprising approximately 38,000 samples collected from real-world scientific presentations, it spans a wide array of designs, content, and layouts, providing a robust foundation for future work in structured document reconstruction.
Using Slide2SVG, we evaluate SliDer with quantitative metrics and human judgments, focusing on the visual fidelity of its reconstructions. In pairwise tests, human evaluators chose Gemini-based SliDer over the strongest zero-shot VLM baseline, GPT-4o (Hurst et al. 2024), in 82.9% of cases and over LIVE (Ma et al. 2022), a leading raster vectorization method, in 91.8%. Perceptual metrics also support this preference: SliDer achieves an LPIPS of 0.069 compared to 0.118 and 0.169 for GPT-4o and LIVE, respectively, significantly reducing the perceptual distance between the original raster and the reconstruction.
The primary contributions of our work are as follows:
• We formulate the task of semantic document derendering, which involves extracting the overall layout of a multimedia document and parsing each individual asset into an editable format, eventually transforming the raster document into a structured, editable representation.
• We propose SliDer, a VLM-based framework that iteratively converts raster slides into structured SVG representations, faithfully reconstructing the original slides upon rendering.
• We introduce Slide2SVG, a novel dataset containing raster slides and their compact SVG representations, to address the shortage of image-to-SVG datasets for multimedia documents.
• We demonstrate through comprehensive quantitative and human evaluations on Slide2SVG, that SliDer consistently surpasses strong zero-shot VLM and raster vectorization baselines in reconstruction fidelity.
We briefly survey work on vision-language models, raster vectorization, and document datasets most relevant to our setting, and refer the reader to the Appendix for an extended overview.
Large Vision-Language Models (VLMs) have shown strong performance in image-to-text generation and visual reasoning (Li et al. 2025; Zhang et al. 2024; Hu et al. 2022; Xie et al. 2022; Hartsock and Rasool 2024; Lee et al. 2024), including visual document understanding (Li et al. 2024; Luo et al. 2022). Because they can both parse comp