Moonworks Lunara Aesthetic II: An Image Variation Dataset

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara’s signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds those of large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.


💡 Research Summary

Lunara Aesthetic II is a newly released, ethically sourced image‑variation dataset designed to enable controlled evaluation and training of contextual consistency in modern image generation and editing systems. The collection consists of 2,854 anchor‑linked variation pairs derived from 336 original photographs and artworks created by Moonworks. Each pair applies one or more contextual transformations—illumination‑time, weather‑atmosphere, scene‑composition, mood‑atmosphere, viewpoint‑camera, and color‑tone—while preserving the core identity of the source image. On average each original has 8.49 variations (median 9, max 15) and each variation carries about 2.18 contextual changes, allowing both single‑factor and multi‑factor edit analysis.
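The per-anchor statistics above are straightforward to recompute once the rows are loaded. The sketch below uses toy rows with assumed field names (`anchor_id`, `axes`); the released schema may differ.

```python
from collections import Counter
from statistics import mean, median

# Toy rows mimicking the anchor-linked layout described above.
# Field names ("anchor_id", "axes") are assumptions, not the released schema.
rows = [
    {"anchor_id": "a1", "axes": ["illumination-time"]},
    {"anchor_id": "a1", "axes": ["illumination-time", "mood-atmosphere"]},
    {"anchor_id": "a2", "axes": ["weather-atmosphere", "color-tone"]},
    {"anchor_id": "a2", "axes": ["viewpoint-camera"]},
    {"anchor_id": "a2", "axes": ["scene-composition", "illumination-time"]},
]

def variation_stats(rows):
    """Variations per original and contextual changes per variation."""
    per_anchor = Counter(r["anchor_id"] for r in rows)
    return {
        "mean_variations": mean(per_anchor.values()),        # paper: 8.49
        "median_variations": median(per_anchor.values()),    # paper: 9
        "max_variations": max(per_anchor.values()),          # paper: 15
        "mean_changes": mean(len(r["axes"]) for r in rows),  # paper: ~2.18
    }
```

On the full dataset the same computation would reproduce the reported mean of 8.49 variations per original and ~2.18 changes per variation.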

The dataset construction pipeline begins with prompt extraction using the QWEN3‑VL vision‑language model, followed by generation of 3,324 candidate variations with the Moonworks Lunara diffusion‑mixture model (≈10 B parameters). Automated visual difference detection and manual verification ensure that the labeled contextual changes accurately reflect the visual edits. After filtering, 2,854 high‑quality pairs remain, emphasizing identity preservation over raw visual diversity.
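The filtering step described above (keep only pairs whose labeled changes are visually confirmed) can be sketched as a simple set check. The function and field names here are hypothetical; the paper does not specify the implementation.

```python
def filter_verified_pairs(candidates, detected):
    """Keep candidate pairs whose labeled axes were all confirmed by
    automated visual-difference detection (and, per the paper, manual review).

    candidates: list of {"id": str, "labels": set[str]}  -- hypothetical schema
    detected:   dict mapping pair id -> set of visually detected axes
    """
    return [c for c in candidates if c["labels"] <= detected.get(c["id"], set())]

# Toy example: one pair passes, one is dropped because a labeled
# change was not detected in the image pair.
candidates = [
    {"id": "p1", "labels": {"illumination-time"}},
    {"id": "p2", "labels": {"illumination-time", "weather-atmosphere"}},
]
detected = {"p1": {"illumination-time"}, "p2": {"illumination-time"}}
```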

Statistical analysis shows original images have a median resolution of 3264 × 2448, while variations are down‑sampled to 896 × 1184 for practical use. Prompt lengths are comparable (≈14 words) to avoid confounding effects from verbosity. The dataset’s label distribution is heavily weighted toward illumination‑time (≈50 % of rows) with substantial co‑occurrence of other axes; conditional probability matrices reveal that illumination often co‑occurs with mood, composition, viewpoint, and weather changes (0.53‑0.59), indicating realistic editing scenarios where lighting adjustments accompany broader scene modifications.
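The conditional co-occurrence figures (e.g. 0.53–0.59 for illumination with other axes) follow from the per-row axis labels as P(axis_b | axis_a) = count(a and b) / count(a). A minimal sketch on toy label rows:

```python
from collections import Counter

def conditional_cooccurrence(rows):
    """P(axis_b present | axis_a present) for every ordered pair of axes
    that ever co-occurs. `rows` is one list of axis labels per variation."""
    axis_count = Counter()
    pair_count = Counter()
    for axes in rows:
        present = set(axes)
        axis_count.update(present)
        for a in present:
            for b in present:
                if a != b:
                    pair_count[(a, b)] += 1
    return {pair: n / axis_count[pair[0]] for pair, n in pair_count.items()}

# Toy rows: here illumination changes usually accompany mood changes.
rows = [
    ["illumination-time", "mood-atmosphere"],
    ["illumination-time", "mood-atmosphere"],
    ["illumination-time"],
    ["weather-atmosphere"],
]
```

Note the matrix is asymmetric: P(mood | illumination) and P(illumination | mood) generally differ, which is why the paper reports conditional rather than joint probabilities.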

Automated metrics evaluate each contextual axis on three dimensions: axis specificity (0.59‑0.69), prompt alignment (Cohen’s d ranging from –0.20 to –1.07), and normalized entropy (0.97‑1.00). Higher specificity values denote better isolation of axes, with illumination‑time being the most isolated. Larger negative Cohen’s d values indicate substantial token‑level prompt edits; illumination‑time shows the strongest shift (‑1.07). High entropy across axes confirms diverse lexical realizations rather than a few repetitive templates.
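Cohen's d and normalized entropy have standard definitions; the snippet below shows plausible implementations (pooled-standard-deviation d, Shannon entropy scaled by its maximum) applied to toy data rather than the actual prompts.

```python
import math
from collections import Counter

def cohens_d(x, y):
    """Effect size between two samples using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled

def normalized_entropy(tokens):
    """Shannon entropy of the token distribution, scaled to [0, 1] by the
    maximum possible entropy log2(#distinct tokens)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0
```

A large negative d (as for illumination-time, -1.07) means the edited prompts differ from the originals by roughly one pooled standard deviation at the token level; entropy near 1.0 means the edits are lexically diverse.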

Aesthetic quality is assessed using the LAION Aesthetics v2 predictor, a CLIP‑based model trained to approximate human judgments. Lunara‑II achieves a mean aesthetic score of 5.91, surpassing large‑scale web datasets such as CC3M (4.78), LAION‑2B‑Aesthetic (5.25), and WIT (5.08). Although its mean is lower than the earlier Lunara‑I (6.32), Lunara‑II maintains a high‑quality tail: 33.99 % of images exceed the 6.5 threshold, demonstrating that controlled contextual variation does not sacrifice perceptual appeal.
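Given per-image scores from any aesthetic predictor, the summary statistics quoted above (mean score and the share above 6.5) reduce to a few lines. A minimal sketch on toy scores:

```python
from statistics import mean

def aesthetic_summary(scores, threshold=6.5):
    """Mean aesthetic score and fraction of images above `threshold`
    (the paper reports mean 5.91 and 33.99% above 6.5 for Lunara-II)."""
    frac_above = sum(s > threshold for s in scores) / len(scores)
    return {"mean": round(mean(scores), 2), "frac_above": frac_above}
```

For example, `aesthetic_summary([5.0, 6.0, 7.0, 8.0])` yields a mean of 6.5 with half the images above the threshold.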

Human evaluation reports an identity stability rating of 4.65/5 and an attribute realization rate of 87.2 %, confirming that the variations retain the original subject while accurately applying the intended contextual changes. The dataset is released under the Apache 2.0 license, enabling unrestricted academic and commercial use.

In summary, Lunara Aesthetic II provides (1) a clear supervision signal via identity‑preserving variations, (2) structured multi‑axis labeling with quantified inter‑axis dependencies, (3) rigorous automated and human quality assessments, and (4) open licensing for broad adoption. It fills a gap left by scale‑oriented web collections, offering a benchmark for testing whether generative models truly learn transferable contextual concepts rather than memorizing pixel patterns. Future work can leverage this dataset for research on contextual disentanglement, multi‑factor edit robustness, and fine‑grained control in text‑to‑image and image‑to‑image pipelines.

