FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction

In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end: they produce an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive "next token prediction" mechanism commonly used in large language models and propose a novel "next room prediction" paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS achieves competitive performance compared with diffusion models and Tell2Design on the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.


💡 Research Summary

The paper introduces FloorPlan‑DeepSeek (FPDS), a novel multimodal framework for generating architectural floor plans that aligns with the inherently iterative nature of real‑world design workflows. Traditional deep‑learning approaches to floor‑plan synthesis—most notably diffusion models, GANs, and VAEs—treat the task as a single‑shot image generation problem: a textual prompt or a rough sketch is fed into a network, which then outputs a complete pixel‑based layout in one pass. While effective for producing aesthetically pleasing results, such end‑to‑end pipelines clash with the way architects actually work. In practice, designers start with a vague concept, refine it through successive client meetings, add or remove rooms, and constantly adjust spatial relationships. A model that can only produce a full plan at once forces designers either to accept a suboptimal output or to restart the generation process from scratch whenever a minor change is needed.

To bridge this gap, the authors borrow the “next‑token prediction” paradigm that underpins large language models (LLMs) and adapt it to floor‑plan creation as a “next‑room prediction” task. The core idea is to represent a floor plan as an ordered sequence of vectorized rooms rather than as a raster image. Each room is encoded by a fixed‑dimensional vector containing: (i) a categorical type label (e.g., living‑room, kitchen, bedroom), (ii) the centroid coordinates (x, y), (iii) width and height, and (iv) a list of adjacent rooms (capturing door or corridor connections). By treating each room as a token, the model can be trained autoregressively: given the already generated prefix of rooms, it predicts the next room’s type and geometric parameters. This formulation yields several advantages. First, it mirrors the incremental decision‑making process of architects, allowing a user to request “add a 4 × 5 m bedroom adjacent to the kitchen” and receive an immediate update. Second, vector representations are memory‑efficient and enable explicit reasoning about spatial constraints (e.g., no overlap, minimum circulation width) that are cumbersome to enforce on pixel grids. Third, the sequential nature makes it straightforward to integrate with Transformer‑based architectures that have proven highly effective for language modeling.
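The room-as-token idea above can be made concrete with a minimal sketch. The field names (`room_type`, `cx`, `cy`, `w`, `h`, `adjacent`) and the flattening function are illustrative assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Room:
    """One 'token' in the floor-plan sequence (hypothetical schema)."""
    room_type: str            # categorical label, e.g. "kitchen"
    cx: float                 # centroid x-coordinate (metres)
    cy: float                 # centroid y-coordinate (metres)
    w: float                  # width (metres)
    h: float                  # height (metres)
    adjacent: List[int] = field(default_factory=list)  # indices of connected rooms

def plan_as_sequence(rooms: List[Room]) -> List[Tuple]:
    """Flatten a plan into the ordered token sequence an autoregressive
    decoder would consume: one fixed-shape tuple per room."""
    return [(r.room_type, r.cx, r.cy, r.w, r.h, tuple(r.adjacent)) for r in rooms]

# A two-room prefix; the model would be asked to predict room index 2 next.
plan = [
    Room("living_room", 3.0, 2.5, 6.0, 5.0),
    Room("kitchen", 7.5, 2.0, 3.0, 4.0, adjacent=[0]),
]
tokens = plan_as_sequence(plan)
```

Because each token is a handful of scalars rather than a pixel grid, a full plan is only a few dozen values, which is what makes the sequence-length argument later in the summary work.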

The multimodal aspect of FPDS is realized through a dual‑encoder front‑end. Textual specifications (e.g., “a two‑story house with three bedrooms and a large kitchen”) are embedded using a pretrained language model such as BERT or the CLIP text encoder. Optional sketch inputs—hand‑drawn outlines or rough bubble diagrams—are processed by a convolutional backbone (ResNet‑50 in the experiments) to extract a visual embedding. These two embeddings are concatenated and projected into a “condition vector” that supplies global context to the autoregressive decoder. To ensure that the textual and visual modalities stay semantically aligned, the authors employ a contrastive loss during training, encouraging matching text‑image pairs to have higher similarity than mismatched pairs.
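The contrastive alignment objective mentioned above is, in spirit, an InfoNCE-style loss: within a batch, the text embedding at index i should be most similar to the sketch embedding at the same index. A pure-Python sketch of that idea (the exact loss used in the paper is not specified here, so treat this as an assumption):

```python
import math
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(text_emb: List[List[float]],
                     image_emb: List[List[float]],
                     temperature: float = 0.1) -> float:
    """InfoNCE-style loss over a batch of (text, sketch) pairs: matching
    pairs (same index) should score higher than mismatched pairs."""
    n = len(text_emb)
    loss = 0.0
    for i in range(n):
        logits = [cosine(text_emb[i], image_emb[j]) / temperature for j in range(n)]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with target index i
    return loss / n
```

Swapping the image embeddings so that pairs no longer match should raise the loss, which is exactly the training signal that keeps the two modalities aligned.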

The decoder itself is a standard Transformer decoder stack. At each generation step it receives (a) the condition vector, (b) the positional embeddings of the already generated rooms, and (c) a special start‑of‑sequence token. The decoder outputs a probability distribution over room types (softmax) and a set of continuous values for the centroid and size (via a linear head). The loss function combines three components: (1) cross‑entropy for type classification, (2) L2 regression loss for geometry, and (3) a “layout consistency loss” that penalizes violations of architectural constraints such as room overlap, insufficient corridor width, and overall floor‑area ratios. This third term is crucial because, unlike language where any token sequence is syntactically valid, floor‑plan sequences must obey hard spatial rules. During inference, the model can generate multiple candidate plans using beam search or top‑k sampling; a lightweight post‑processing step then checks each candidate against a rule‑based validator and discards infeasible layouts.
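The room-overlap term of the layout consistency loss is straightforward to express on vector rooms, which is the point made above about pixel grids being cumbersome. A minimal sketch, assuming rooms are axis-aligned boxes given as (centroid_x, centroid_y, width, height); the other constraint terms (corridor width, floor-area ratio) would be added analogously:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), axis-aligned

def overlap_area(a: Box, b: Box) -> float:
    """Intersection area of two axis-aligned rooms."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    dx = min(ax1, bx1) - max(ax0, bx0)  # horizontal overlap extent
    dy = min(ay1, by1) - max(ay0, by0)  # vertical overlap extent
    return max(dx, 0.0) * max(dy, 0.0)

def overlap_penalty(rooms: List[Box]) -> float:
    """Sum of pairwise intersection areas; zero for a non-overlapping layout,
    so minimizing it pushes the decoder toward feasible plans."""
    return sum(
        overlap_area(rooms[i], rooms[j])
        for i in range(len(rooms))
        for j in range(i + 1, len(rooms))
    )
```

The same function doubles as one check inside the rule-based validator used at inference time: a candidate with a nonzero penalty is infeasible and can be discarded.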

The authors evaluate FPDS on two benchmark tasks: (1) text‑to‑floor‑plan generation and (2) sketch‑to‑floor‑plan generation, using publicly available datasets (RPLAN, FloorPlanDB) and a curated multimodal collection. Metrics include Intersection‑over‑Union (IoU) between generated and ground‑truth rasterizations, F1‑score for correctly predicted room types, and Layout Edit Distance (LED) which quantifies the number of edit operations needed to transform a generated plan into the reference. FPDS achieves an IoU improvement of roughly 4 % over a strong diffusion baseline, an F1‑score of 0.87 versus 0.81 for diffusion and 0.79 for the prior state‑of‑the‑art Tell2Design system, and a lower LED (0.31 versus 0.45 and 0.48 respectively). Importantly, the authors also simulate an interactive design scenario where a user requests incremental modifications (e.g., “enlarge the living room by 2 m”). FPDS responds in an average of 1.2 seconds per edit, roughly half the latency of pixel‑based models, demonstrating its suitability for real‑time co‑creative workflows.
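The IoU metric above requires rasterizing the vector plans first. A minimal sketch of that evaluation step (grid size and scale are arbitrary choices here, not the paper's settings):

```python
from typing import List, Tuple

def rasterize(rooms: List[Tuple[float, float, float, float]],
              size: int = 32, scale: float = 1.0) -> List[List[int]]:
    """Paint axis-aligned rooms (cx, cy, w, h) onto a binary occupancy grid."""
    grid = [[0] * size for _ in range(size)]
    for cx, cy, w, h in rooms:
        x0 = max(0, int((cx - w / 2) * scale))
        x1 = min(size, int((cx + w / 2) * scale))
        y0 = max(0, int((cy - h / 2) * scale))
        y1 = min(size, int((cy + h / 2) * scale))
        for y in range(y0, y1):
            for x in range(x0, x1):
                grid[y][x] = 1
    return grid

def iou(grid_a: List[List[int]], grid_b: List[List[int]]) -> float:
    """Intersection-over-Union of two binary occupancy grids."""
    inter = sum(a & b for ra, rb in zip(grid_a, grid_b) for a, b in zip(ra, rb))
    union = sum(a | b for ra, rb in zip(grid_a, grid_b) for a, b in zip(ra, rb))
    return inter / union if union else 1.0
```

A generated plan identical to the reference scores 1.0; shifting a room sideways reduces the intersection and hence the score.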

Key insights from the study include: (i) framing floor‑plan synthesis as a token‑level autoregressive problem enables natural “add‑room”, “remove‑room”, and “resize‑room” operations, mirroring the way architects iteratively refine designs; (ii) multimodal conditioning allows the system to handle ambiguous or incomplete specifications, a common situation early in the design process; (iii) vector‑based room representations make it straightforward to embed domain‑specific constraints directly into the loss function, improving the physical plausibility of generated layouts; and (iv) the approach scales well computationally because the sequence length (number of rooms) is typically far smaller than the pixel resolution of an image, leading to faster inference and lower memory consumption.
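Insight (i) can be illustrated with a toy example (not the paper's API): on a vector plan, each edit touches one list entry instead of regenerating a whole raster image, which is why incremental operations are cheap.

```python
import copy
from typing import Dict, List

Plan = List[Dict]  # each room: {"type": str, "w": float, "h": float, "adjacent": [int]}

def add_room(plan: Plan, room: Dict) -> Plan:
    """Append a new room token to the sequence."""
    return copy.deepcopy(plan) + [room]

def resize_room(plan: Plan, index: int, dw: float = 0.0, dh: float = 0.0) -> Plan:
    """Grow or shrink one room in place, e.g. 'enlarge the living room by 2 m'."""
    new = copy.deepcopy(plan)
    new[index]["w"] += dw
    new[index]["h"] += dh
    return new

def remove_room(plan: Plan, index: int) -> Plan:
    """Delete a room and repair the adjacency lists of the survivors."""
    new = copy.deepcopy(plan)
    del new[index]
    for r in new:
        r["adjacent"] = [j if j < index else j - 1
                         for j in r["adjacent"] if j != index]
    return new

plan = [
    {"type": "living_room", "w": 6.0, "h": 5.0, "adjacent": [1]},
    {"type": "kitchen", "w": 3.0, "h": 4.0, "adjacent": [0]},
]
plan2 = resize_room(plan, 0, dw=2.0)  # widen the living room by 2 m
```

In the actual system such an edit would be followed by re-running the constraint checks, since a resize can introduce new overlaps.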

Nevertheless, the paper acknowledges several limitations. The current implementation is confined to 2‑D floor plans; extending the method to 3‑D spatial reasoning (including structural elements, HVAC, and circulation) would require richer representations and possibly hierarchical decoding. The set of room types is limited to those present in the training datasets, so user‑defined or unconventional spaces (e.g., home office pods, maker‑spaces) would need additional labeling or fine‑tuning. Moreover, the post‑processing validator is rule‑based, meaning the model still relies on an external module to guarantee feasibility; a fully end‑to‑end differentiable constraint system remains an open research direction. Finally, the scarcity of large‑scale, high‑quality multimodal floor‑plan corpora hampers the ability to pre‑train the model on diverse architectural styles and building codes.

Future work outlined by the authors includes: (a) integrating building‑code constraints (e.g., setback, floor‑area ratio, fire‑escape requirements) directly into the layout consistency loss, thereby producing code‑compliant designs out of the box; (b) expanding the representation to hierarchical structures that capture rooms, zones, and whole‑building levels, enabling multi‑story generation; (c) leveraging massive multimodal pre‑training (e.g., CLIP‑style contrastive learning on millions of floor‑plan sketches and descriptions) to improve generalization across styles and cultural contexts; and (d) developing a user‑centric interface where architects can issue natural‑language commands or sketch gestures and receive immediate visual feedback, effectively turning the model into a co‑creative design assistant.

In conclusion, FloorPlan‑DeepSeek offers a compelling alternative to conventional image‑centric generative models by recasting floor‑plan creation as a sequential, token‑level prediction problem. Its multimodal conditioning, explicit spatial reasoning, and fast incremental generation make it well‑suited for the collaborative, iterative nature of architectural design. While still limited to 2‑D layouts and reliant on rule‑based post‑processing, the framework establishes a solid foundation for future research toward fully integrated, constraint‑aware, and user‑interactive AI design tools.

