PointT2I: LLM-based text-to-image generation via keypoints

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Text-to-image (T2I) generation models have made significant advances, producing high-quality images aligned with an input prompt. However, despite their ability to generate fine-grained images, they still struggle to generate accurate images when the input prompt contains complex concepts, especially human poses. In this paper, we propose PointT2I, a framework that uses a large language model (LLM) to generate images that accurately correspond to the human pose described in the prompt. PointT2I consists of three components: keypoint generation, image generation, and a feedback system. The keypoint generation stage uses an LLM to directly generate keypoints corresponding to a human pose, based solely on the input prompt, without external references. The image generation stage then produces images conditioned on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated content and the given prompt. Our framework is the first to leverage an LLM for keypoint-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.


💡 Research Summary

PointT2I introduces a novel framework that enables accurate human‑pose generation in text‑to‑image (T2I) synthesis without relying on external pose annotations or fine‑tuning of the diffusion model. The system consists of three sequential components: (1) keypoint generation, (2) image generation, and (3) a feedback loop, all driven by a frozen large language model (LLM).

In the keypoint generation stage, the LLM receives the natural‑language prompt and is instructed to output 3‑dimensional coordinates for the 17 major human body joints. By parsing pose‑related semantics—whether the prompt contains a pose name such as “boat pose” or a detailed description like “a V‑shaped pose with raised legs and a leaned‑back torso”—the LLM infers a plausible 3‑D skeleton. The method fixes the ground plane at z = 0, maintains realistic body proportions, and anchors contact points (feet, hands) to the ground, thereby producing human‑like poses without any pose‑specific training data.
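The constraints above (17 named joints, 3-D coordinates, ground plane at z = 0, contact points anchored to the ground) can be checked mechanically on the LLM's raw output before it is used downstream. The sketch below is an illustrative validator, assuming the 17 joints follow the common COCO layout and that the LLM is asked to return a JSON object keyed by joint name; the paper's exact schema is not specified in the summary.

```python
import json

# Assumed joint set: the 17-joint COCO layout (the summary says "17 major
# human body joints" but does not name them).
COCO_JOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear",
               "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
               "left_wrist", "right_wrist", "left_hip", "right_hip",
               "left_knee", "right_knee", "left_ankle", "right_ankle"]

def validate_skeleton(raw_json, tol=1e-6):
    """Check an LLM-returned skeleton against the constraints described
    in the summary: 17 named joints, 3-D coordinates, no joint below the
    ground plane z = 0, and at least one foot anchored to the ground."""
    skel = json.loads(raw_json)
    assert set(skel) == set(COCO_JOINTS), "expected the 17 COCO joints"
    coords = {j: tuple(map(float, skel[j])) for j in COCO_JOINTS}
    assert all(len(c) == 3 for c in coords.values()), "joints must be 3-D"
    assert all(c[2] >= -tol for c in coords.values()), "no joint below ground"
    foot_zs = [coords[j][2] for j in ("left_ankle", "right_ankle")]
    assert min(foot_zs) <= tol, "at least one foot must touch the ground"
    return coords
```

A failed check at this stage is exactly the kind of mismatch the paper's feedback system would route back to the keypoint generator.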

The generated 3‑D keypoints are then orthographically projected onto a 2‑D image plane. Eight candidate views are considered (uniformly spaced around the vertical axis), and the view with the highest variance in 2‑D coordinates is selected to minimize overlap and preserve spatial information. These 2‑D keypoints serve as structural conditioning inputs for a standard diffusion‑based T2I backbone (e.g., Stable Diffusion or SD‑XL) via a ControlNet‑style branch, while the original text prompt is fed to the backbone’s text encoder. This dual conditioning allows the diffusion model to respect both semantic content and precise pose geometry.
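The view-selection step above is a small geometric computation: rotate the skeleton around the vertical axis to each of the eight candidate azimuths, drop the depth axis (orthographic projection), and keep the projection whose 2-D coordinates spread out the most. A minimal sketch with NumPy, assuming z is the vertical axis (consistent with the ground plane at z = 0) and total per-axis variance as the spread measure:

```python
import numpy as np

def select_best_view(keypoints_3d, n_views=8):
    """Orthographically project 3-D keypoints from n_views azimuths
    uniformly spaced around the vertical (z) axis, and return the 2-D
    projection with the highest coordinate variance (least overlap)."""
    keypoints_3d = np.asarray(keypoints_3d, dtype=float)
    best_proj, best_var = None, -1.0
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        # Rotation about the vertical axis: x and y turn, z is unchanged.
        rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                        [np.sin(theta),  np.cos(theta), 0.0],
                        [0.0,            0.0,           1.0]])
        rotated = keypoints_3d @ rot.T
        proj = rotated[:, [0, 2]]      # keep x (horizontal) and z (height)
        var = proj.var(axis=0).sum()   # total spread of the 2-D points
        if var > best_var:
            best_var, best_proj = var, proj
    return best_proj
```

The resulting 2-D keypoints are what the ControlNet-style branch consumes as structural conditioning.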

A two‑module LLM‑based feedback system evaluates the consistency of the generated keypoints and the final image with the original prompt. The first module checks whether the keypoints faithfully represent the described pose; if a mismatch is detected, the keypoint generator is invoked again. The second module assesses the rendered image; when pose‑level inconsistencies are found, the image generator is prompted to regenerate. This iterative refinement loop automatically corrects errors that may arise from the LLM’s initial inference or from the diffusion model’s limited pose fidelity.
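The control flow of this two-module loop can be sketched independently of the models involved. In the sketch below all five callables are hypothetical stand-ins (the generators and the two LLM checkers), and `max_rounds` is an assumed retry budget; the summary does not state how many iterations the paper allows.

```python
def generate_with_feedback(prompt, gen_keypoints, gen_image,
                           check_keypoints, check_image, max_rounds=3):
    """Two-stage refinement loop: regenerate keypoints until the first
    LLM check accepts them, then regenerate the image until the second
    check accepts it, with at most max_rounds attempts per stage."""
    kps = None
    for _ in range(max_rounds):
        kps = gen_keypoints(prompt)
        if check_keypoints(prompt, kps):   # module 1: pose vs. prompt
            break
    img = None
    for _ in range(max_rounds):
        img = gen_image(prompt, kps)
        if check_image(prompt, img):       # module 2: image vs. prompt
            break
    return kps, img
```

Keeping the two checks separate matters: a keypoint error is cheaper to fix before any diffusion step runs, while an image-level failure can be retried without re-inferring the pose.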

Extensive experiments on a diverse set of 200 prompts—including common actions, yoga poses, and acrobatic configurations—demonstrate that PointT2I outperforms existing pose‑controlled methods such as ControlNet, HumanSD, and GLIGEN. Quantitative metrics (e.g., Percentage of Correct Keypoints) and human preference studies show superior pose accuracy and visual realism. Notably, the system remains robust across different prompt styles: it handles both succinct pose names and elaborate natural‑language descriptions with comparable performance.
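The Percentage of Correct Keypoints (PCK) metric mentioned above counts a predicted joint as correct when it lies within a threshold distance of the ground truth, normalized by some reference scale. A minimal sketch, using the ground-truth bounding-box diagonal as the reference scale; that normalization is one common convention and an assumption here, since the summary does not state the paper's exact variant:

```python
import numpy as np

def pck(pred, gt, threshold=0.5, ref_scale=None):
    """Fraction of predicted keypoints within threshold * ref_scale of
    the ground truth. ref_scale defaults to the diagonal of the
    ground-truth bounding box (an assumed normalization)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if ref_scale is None:
        span = gt.max(axis=0) - gt.min(axis=0)
        ref_scale = float(np.linalg.norm(span))
    dists = np.linalg.norm(pred - gt, axis=1)  # per-joint error
    return float((dists <= threshold * ref_scale).mean())
```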

The paper acknowledges limitations: the current pipeline does not model clothing, accessories, or complex scene elements, and the LLM‑produced coordinates may lack precise physical accuracy, requiring downstream smoothing. Future directions include integrating multimodal LLMs that jointly process text, images, and 3‑D information, and developing lightweight post‑processing networks for real‑time applications.

In summary, PointT2I establishes a “prompt → keypoints → image” paradigm that leverages the semantic understanding of LLMs to translate high‑level textual descriptions into concrete pose structures, unlocking the latent pose‑control capabilities of existing diffusion backbones without any additional training. This represents a significant step toward more controllable and semantically faithful text‑to‑image generation.

