VecFusion: Vector Font Generation with Diffusion


We present VecFusion, a new neural architecture that can generate vector fonts with varying topological structures and precise control point positions. Our approach is a cascaded diffusion model which consists of a raster diffusion model followed by a vector diffusion model. The raster model generates low-resolution, rasterized fonts with auxiliary control point information, capturing the global style and shape of the font, while the vector model synthesizes vector fonts conditioned on the low-resolution raster fonts from the first stage. To synthesize long and complex curves, our vector diffusion model uses a transformer architecture and a novel vector representation that enables the modeling of diverse vector geometry and the precise prediction of control points. Our experiments show that, in contrast to previous generative models for vector graphics, our new cascaded vector diffusion model generates higher quality vector fonts, with complex structures and diverse styles.


💡 Research Summary

VecFusion introduces a novel two‑stage cascaded diffusion framework for generating high‑quality vector fonts. The first stage, called Raster‑DM, is a UNet‑based diffusion model that transforms Gaussian noise into a low‑resolution (64×64) raster image of a target glyph while simultaneously producing an auxiliary three‑channel control‑point field. This field encodes each control point’s 2‑D location, its ordering within a Bézier path, and multiplicity information using colored Gaussian blobs. Conditioning information consists of a character codepoint embedding and a font‑style embedding, which can be either a one‑hot ID or a feature map extracted from a few exemplar raster glyphs via a small CNN.
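To make the control‑point field concrete, here is a minimal sketch of how such a three‑channel map could be rendered with Gaussian blobs. The channel layout, blob width, and value encoding below are illustrative assumptions, not the paper's exact colour‑coding scheme:

```python
import numpy as np

def render_control_point_field(points, size=64, sigma=1.5):
    """Render control points as Gaussian blobs into a 3-channel field.

    `points` is a list of (x, y, order, multiplicity) tuples with x, y
    normalized to [0, 1]. Channel 0 carries presence, channel 1 an
    ordering cue within the Bezier path, and channel 2 a multiplicity
    cue -- a simplified stand-in for the paper's colour coding.
    """
    field = np.zeros((3, size, size), dtype=np.float32)
    ys, xs = np.mgrid[0:size, 0:size]
    for x, y, order, mult in points:
        cx, cy = x * (size - 1), y * (size - 1)
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        field[0] = np.maximum(field[0], blob)          # presence
        field[1] = np.maximum(field[1], blob * order)  # ordering cue
        field[2] = np.maximum(field[2], blob * mult)   # multiplicity cue
    return field

# Example: two control points of a hypothetical single path
f = render_control_point_field([(0.25, 0.5, 0.1, 1.0), (0.75, 0.5, 0.2, 1.0)])
print(f.shape)  # (3, 64, 64)
```

The 64×64 resolution matches the raster stage described above; at that scale the blobs make sub‑pixel control‑point locations recoverable from the peak positions.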

During training, Raster‑DM learns to predict the added noise at each diffusion step using the standard mean‑squared error loss on both the raster image and the control‑point field. The diffusion process uses 1,000 steps with a cosine noise schedule, and training took about five days on eight A100 GPUs.
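The forward (noising) side of this objective can be sketched as follows. This is a generic epsilon‑prediction setup using the standard cosine alpha‑bar schedule of Nichol and Dhariwal (2021); the constants and function names are assumptions, not VecFusion's actual code:

```python
import numpy as np

def cosine_alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level alpha_bar(t) for a cosine noise schedule
    (Nichol & Dhariwal, 2021)."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0.0)

def forward_diffuse(x0, t, rng, T=1000):
    """Sample x_t from q(x_t | x_0). The denoiser is trained to predict
    the returned noise from x_t with an MSE loss."""
    ab = cosine_alpha_bar(float(t), T)
    noise = rng.standard_normal(x0.shape).astype(np.float32)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise, noise

# x0 would stack the raster glyph and control-point field channel-wise
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 64, 64)).astype(np.float32)
x_t, eps = forward_diffuse(x0, t=500, rng=rng)
```

At t = 0 the sample is almost pure signal and near t = T it is almost pure noise, which is what lets the reverse process start from Gaussian noise as described above.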

The second stage, Vector‑DM, receives the raster image and control‑point field through cross‑attention and generates a structured vector representation. The target vector tensor y₀ is an M×D matrix where M is an upper bound on the total number of control points (e.g., 256) and D contains a discrete path‑membership index, continuous normalized coordinates, and a binary existence flag. To reduce permutation ambiguity, paths are lexicographically sorted by their coordinates before training. Vector‑DM employs a transformer‑based denoiser that leverages self‑attention to capture long‑range dependencies among control points and cross‑attention to incorporate the raster guidance. The diffusion loss is again the MSE between predicted and true noise, but the noise now perturbs spatial positions and path memberships rather than pixel intensities.
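A minimal sketch of packing a glyph into the M×D target tensor might look like the following. The column layout (path index, x, y, existence flag) and the sorting rule are simplified assumptions based on the description above, not the paper's exact representation:

```python
import numpy as np

def build_vector_target(paths, M=256):
    """Pack a glyph's Bezier control points into a fixed-size M x 4 tensor.

    `paths` is a list of paths, each a list of (x, y) control points with
    coordinates normalized to [0, 1]. Columns: path index (discrete),
    x, y (continuous), existence flag (binary).
    """
    y0 = np.zeros((M, 4), dtype=np.float32)
    row = 0
    # Lexicographic sort over path coordinates reduces permutation ambiguity
    for pid, path in enumerate(sorted(paths)):
        for (x, yc) in path:
            if row >= M:
                break
            y0[row] = (pid, x, yc, 1.0)  # existence flag = 1 for real points
            row += 1
    return y0  # rows past `row` stay zero: "inactive" control-point slots

tensor = build_vector_target([[(0.1, 0.2), (0.3, 0.4)], [(0.5, 0.5)]])
print(tensor.shape)  # (256, 4)
```

The fixed upper bound M is what allows a transformer denoiser to operate on a constant‑shape input while the existence flags carry the variable structure.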

A key contribution is the mixed discrete‑continuous representation that lets the model automatically decide how many paths and control points a glyph needs. Unlike prior VAE or autoregressive approaches that assume a fixed token sequence, VecFusion can dynamically activate or deactivate paths during diffusion, enabling it to handle glyphs with highly variable topologies such as those found in Latin, Hangul, Devanagari, and other scripts.
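To illustrate how a fixed‑size tensor yields variable topology, a hypothetical decoding step could threshold the existence flag and regroup surviving rows by their rounded path index. The threshold value and grouping logic here are illustrative assumptions:

```python
import numpy as np

def decode_active_points(y_pred, threshold=0.5):
    """Decode a denoised M x 4 tensor (path index, x, y, existence flag):
    keep rows whose existence flag survives thresholding, then group them
    by rounded path index. Rows below the threshold are inactive slots,
    so the same tensor shape can describe any number of paths."""
    paths = {}
    for row in y_pred:
        pid, x, y, exist = row
        if exist < threshold:
            continue  # inactive slot: no control point here
        paths.setdefault(int(round(pid)), []).append((float(x), float(y)))
    return paths

y_pred = np.array([
    [0.02, 0.10, 0.20, 0.90],  # path 0, active
    [0.10, 0.30, 0.40, 0.80],  # path 0, active
    [1.05, 0.50, 0.50, 0.95],  # path 1, active
    [1.90, 0.00, 0.00, 0.10],  # inactive slot
])
decoded = decode_active_points(y_pred)
```

Rounding the noisy discrete columns back to integers is one simple way to recover path membership; the paper's treatment of discrete values during diffusion may differ.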

Experiments were conducted on a large, multi‑script font dataset containing thousands of glyphs across dozens of styles. VecFusion was compared against state‑of‑the‑art vector generators (DeepVecFont‑v2, VAE‑based models) and against raster‑to‑vector pipelines. Quantitative metrics include Fréchet Inception Distance (FID), Learned Perceptual Image Patch Similarity (LPIPS), and mean absolute error (MAE) of predicted control‑point coordinates. VecFusion outperformed all baselines on every metric, demonstrating superior visual fidelity and more accurate geometry. Qualitative results show smooth Bézier curves for complex characters (e.g., “g”, “y”, “श”) and consistent style transfer when only a few exemplar glyphs are provided.

Ablation studies confirm the necessity of both stages: removing Raster‑DM or the control‑point field degrades vector quality dramatically, leading to unstable path counts and noisy control‑point placements. Reducing the control‑point field from three to one channel also harms performance, indicating that color‑coded ordering and multiplicity cues are essential.

Limitations include the reliance on a low‑resolution raster guide, which may miss fine details for highly intricate glyphs, and the current focus on cubic Bézier curves only. Future work could explore multi‑scale diffusion, graph‑neural‑network encodings of vector topology, and integration with differentiable rasterizers for real‑time interactive editing.

In summary, VecFusion demonstrates that a cascaded diffusion approach—first generating a raster scaffold with auxiliary geometric hints, then refining it into a precise vector description—can overcome the longstanding challenges of topological diversity and control‑point precision in automatic font generation. This opens new possibilities for font design automation, missing‑glyph completion, few‑shot style transfer, and font interpolation across diverse writing systems.

