From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe Images2Slides, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of 0.989 ± 0.057 (text: 0.985 ± 0.083, images: 1.000 ± 0.000), with mean text transcription error CER = 0.033 ± 0.149 and mean layout fidelity IoU = 0.364 ± 0.161 for text regions and 0.644 ± 0.131 for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.


💡 Research Summary

Images2Slides is an end-to-end system that converts a static infographic image (PNG/JPG) into an editable Google Slides slide. The core idea is to use a vision-language model (VLM) to extract a region-level layout in a strict JSON schema, map pixel coordinates to slide points, and then recreate the elements via the Google Slides batchUpdate API. The pipeline is modular:

(1) VLM analysis – a prompt forces the model to output JSON containing image dimensions and a list of regions, each with an ID, type (text or image), bounding box in pixel space, extracted text (for text regions), optional style hints, and confidence scores.

(2) Validation & deterministic post-processing – the raw JSON is parsed into typed objects, boxes are clamped to image bounds, whitespace is normalized, minimum size constraints are enforced, and a fallback reading order is computed if missing.

(3) Asset preparation – for every image region the system crops the region (adding a 10-pixel right/bottom padding to avoid clipping), hashes the bytes to enable content-addressed deduplication, and uploads the crop to an HTTPS-accessible location.

(4) Geometry mapping – the slide page size (in points) and the original image size (in pixels) define a scale factor s = min(WS/WI, HS/HI). Center offsets Δx and Δy are computed, and each pixel box (x, y, w, h) is transformed to slide coordinates (Δx + s·x, Δy + s·y, s·w, s·h).

(5) Slide reconstruction – a single batch request creates a new slide, inserts text boxes, populates them with the extracted text, applies style information, and places image elements using the uploaded URLs. Text regions receive a base font size derived from the VLM's estimate multiplied by s; because VLMs tend to underestimate small fonts, a piecewise-linear calibration boosts sizes below 14 pt, enforcing a minimum readable size of 8 pt. To avoid overflow, the system optionally expands the text box width proportionally to the font scaling factor while checking for collisions with neighboring regions. Deterministic object IDs (e.g., TXT_, IMG_) enable safe retries.

(6) Optional background synthesis – when the --synthesize-background flag is set, a VLM is asked to return a background sample box and a mode label (solid or tile). The sampled patch is either color-filled (solid) or tiled across the slide, uploaded as a single image, and placed beneath all other elements. If disabled, the background is omitted to avoid duplicating foreground content.
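The geometry mapping and font-size calibration steps can be sketched in a few lines of Python. The mapping follows the formula given above (uniform scale plus centering offsets); the calibration breakpoints are assumptions for illustration, since the paper only states that sizes below 14 pt are boosted and clamped to a minimum of 8 pt.

```python
def map_box(px_box, img_size, page_size):
    """Map a pixel-space box (x, y, w, h) into slide points.

    Uses the uniform scale s = min(W_S/W_I, H_S/H_I) and centers the
    scaled image on the page via offsets (dx, dy), as described above.
    """
    x, y, w, h = px_box
    img_w, img_h = img_size
    page_w, page_h = page_size
    s = min(page_w / img_w, page_h / img_h)
    dx = (page_w - s * img_w) / 2
    dy = (page_h - s * img_h) / 2
    return (dx + s * x, dy + s * y, s * w, s * h)


def calibrate_font(pt, boost_below=14.0, min_pt=8.0):
    """Piecewise-linear boost for small font estimates.

    Illustrative slope only: estimates at or above `boost_below` pass
    through unchanged; smaller estimates are lifted toward the readable
    minimum so that 0 pt maps to `min_pt` and `boost_below` maps to itself.
    """
    if pt >= boost_below:
        return pt
    boosted = min_pt + (pt / boost_below) * (boost_below - min_pt)
    return max(boosted, min_pt)
```

For example, a 1000×500 px infographic placed on a 720×405 pt page yields s = 0.72, so a box at pixel (100, 50) lands at roughly (72, 58.5) pt after centering.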

The authors evaluate the system on a controlled benchmark of 29 programmatically generated infographic slides. Ground‑truth slide layouts are known, rasterized, and then reconstructed. Results show an overall element recovery rate of 0.989 ± 0.057, with text recovery 0.985 ± 0.083 and image recovery 1.000. Text transcription error is low (CER = 0.033 ± 0.149, WER = 0.037 ± 0.167). Layout fidelity, measured by Intersection‑over‑Union, is 0.364 ± 0.161 for text regions and 0.644 ± 0.131 for image regions. The VLM inference takes on average 55 seconds per slide, while the Slides API batch update takes about 5.8 seconds, indicating that the pipeline is practical for batch processing.
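For readers unfamiliar with the transcription metric, CER is conventionally the character-level edit distance between the reconstructed text and the reference, normalized by reference length; a minimal sketch follows (the paper's exact normalization and WER tokenization are not specified here).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance / reference length."""
    return edit_distance(hyp, ref) / max(len(ref), 1)
```

Under this definition, a CER of 0.033 means roughly one character edit per thirty reference characters.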

Failure analysis highlights three main challenges: (1) under‑estimation of small font sizes, mitigated by the piecewise‑linear calibration; (2) non‑uniform or textured backgrounds, addressed by optional background synthesis but still a source of visual discrepancy; (3) overlapping or tightly packed regions, where the collision‑aware width expansion may still produce minor misalignments. The paper discusses these limitations and suggests future work on higher‑resolution OCR integration, chart‑specific derendering, and real‑time pipeline optimization.

By treating infographic reconstruction as a derendering problem targeting the Google Slides object model, Images2Slides offers a practical workflow for designers, marketers, and content creators to edit, localize, and repurpose existing infographic assets without manual redesign. The model‑agnostic JSON schema and deterministic processing make the system extensible to emerging VLMs and adaptable to diverse infographic styles.
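The region-level schema and the deterministic post-processing it enables might look like the following sketch; the field names and the minimum-size threshold are assumptions for illustration, not the paper's exact schema.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One VLM-extracted region (field names are illustrative)."""
    region_id: str          # deterministic ID, e.g. "TXT_1" or "IMG_2"
    kind: str               # "text" or "image"
    box: tuple              # (x, y, w, h) in pixel space
    text: str = ""          # extracted text for text regions
    confidence: float = 1.0


def postprocess(region: Region, img_w: int, img_h: int,
                min_size: int = 4) -> Region:
    """Deterministic cleanup: clamp the box to image bounds, enforce a
    minimum size, and normalize whitespace in the extracted text."""
    x, y, w, h = region.box
    x = max(0, min(x, img_w - 1))
    y = max(0, min(y, img_h - 1))
    w = max(min_size, min(w, img_w - x))
    h = max(min_size, min(h, img_h - y))
    text = " ".join(region.text.split())
    return Region(region.region_id, region.kind, (x, y, w, h),
                  text, region.confidence)
```

Because the cleanup is deterministic, the same raw VLM output always produces the same typed regions, which is what makes retries and backend swaps safe.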

