Is Nano Banana Pro a Low-Level Vision All-Rounder? A Comprehensive Evaluation on 14 Tasks and 40 Datasets
The rapid evolution of text-to-image generation models has revolutionized visual content creation. While commercial products like Nano Banana Pro have garnered significant attention, their potential as generalist solvers for traditional low-level vision challenges remains largely underexplored. In this study, we investigate the critical question: Is Nano Banana Pro a Low-Level Vision All-Rounder? We conduct a comprehensive zero-shot evaluation across 14 distinct low-level tasks spanning 40 diverse datasets. Using simple textual prompts without fine-tuning, we benchmark Nano Banana Pro against state-of-the-art specialist models. Our extensive analysis reveals a distinct performance dichotomy: while **Nano Banana Pro demonstrates superior subjective visual quality**, often hallucinating plausible high-frequency details beyond what specialist models recover, it lags behind on traditional reference-based quantitative metrics. We attribute this discrepancy to the inherent stochasticity of generative models, which struggle to maintain the strict pixel-level consistency that conventional metrics reward. This report identifies Nano Banana Pro as a capable zero-shot contender for low-level vision tasks, while highlighting that matching the high fidelity of domain specialists remains a significant hurdle.
💡 Research Summary
This paper investigates whether the commercial text-to-image generation system Nano Banana Pro can serve as a general-purpose solver for traditional low-level vision problems. The authors conduct a large-scale zero-shot benchmark covering fourteen distinct tasks spanning image restoration (dehazing, deraining, shadow removal, reflection removal), image enhancement (super-resolution, low-light, underwater, HDR), and image fusion (multi-focus and infrared-visible), across a total of forty curated datasets. For each task, a single short English prompt (e.g., "please clean the hazy image") is fed to the model without any fine-tuning, weight adaptation, or prompt engineering.
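The zero-shot protocol above amounts to a fixed task-to-prompt lookup. The sketch below illustrates that setup; the dehazing prompt is the one quoted in the paper, while the other prompt strings and the `build_request` helper are illustrative placeholders, not the authors' actual interface.

```python
# One fixed prompt per task, used verbatim for every image (no tuning).
TASK_PROMPTS = {
    "dehazing": "please clean the hazy image",            # quoted in the paper
    "deraining": "please remove the rain from the image",  # hypothetical wording
    "low_light": "please brighten the low-light image",    # hypothetical wording
    "super_resolution": "please sharpen and upscale the image",  # hypothetical wording
}

def build_request(task: str, image_path: str) -> dict:
    """Pair an input image with its task's fixed prompt."""
    return {"image": image_path, "prompt": TASK_PROMPTS[task]}
```

The same request structure would be sent to the model for every dataset of a given task, which is what makes the evaluation strictly zero-shot.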
Quantitative evaluation employs six widely used metrics: PSNR, SSIM, LPIPS, FID, NIMA, and BRISQUE. In parallel, a qualitative study involves five expert human raters performing a two‑alternative forced choice (2‑AFC) test and providing visual comparisons. The results reveal a clear dichotomy. On the perceptual side, Nano Banana Pro consistently achieves the highest NIMA scores and dominates the 2‑AFC outcomes, indicating that human observers find its outputs more aesthetically pleasing and natural than those of state‑of‑the‑art specialist networks. This advantage is especially pronounced in tasks with severe degradation, such as heavy haze, dense rain, or extreme low‑light, where the generative model can hallucinate plausible textures and sharp edges that specialist models often miss.
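Two of the quantities driving the dichotomy above are straightforward to state precisely: reference-based PSNR and the 2-AFC win rate. A minimal numpy sketch (my own implementation of the standard definitions, not the authors' evaluation code):

```python
import numpy as np

def psnr(reference: np.ndarray, result: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - result.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def afc_win_rate(votes_for_model: int, total_trials: int) -> float:
    """Fraction of 2-AFC trials in which raters preferred the model's output."""
    return votes_for_model / total_trials

# Toy illustration: a uniform error of 10 gray levels on 8-bit images.
ref = np.zeros((8, 8))
out = ref + 10.0
example_db = psnr(ref, out)   # roughly 28 dB
```

This also illustrates why generative outputs score poorly on PSNR: even a small global color shift raises the MSE on every pixel, so the log-ratio drops sharply regardless of how plausible the image looks.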
Conversely, on pixel‑level fidelity metrics (PSNR, SSIM, LPIPS), the same model lags behind dedicated convolutional or transformer‑based restorers. For example, on the RTTS dehazing benchmark the model records a PSNR of 22.16 dB versus 27.21 dB for the best specialist, and an SSIM of 0.683 versus 0.824. The authors attribute this gap to the stochastic nature of diffusion‑based generation: the model prioritizes semantic plausibility and visual coherence over exact pixel‑wise correspondence, leading to small color shifts or texture variations that penalize reference‑based scores.
The study also examines the variability introduced by the sampling process. Re‑running the same prompt on the same input five times yields modest fluctuations (≈0.3 dB in PSNR) and visible differences in texture, confirming that reproducibility is lower than that of deterministic models. However, a modest prompt‑tuning experiment—adding descriptors such as “more vivid” or “preserve natural colors”—shows that careful wording can improve PSNR/SSIM by up to 0.5 dB on certain tasks, suggesting that prompt engineering could bridge part of the quantitative gap.
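The five-run variability check is simple to reproduce in spirit. The readings below are made-up numbers chosen only to match the ~0.3 dB peak-to-peak spread reported above; the computation itself is the standard one:

```python
import numpy as np

# Illustrative PSNR readings (dB) from five re-runs of the same prompt on
# one input image. Values are invented to mirror the reported ~0.3 dB spread.
runs_db = [24.96, 25.24, 25.05, 25.18, 25.10]

spread = max(runs_db) - min(runs_db)   # peak-to-peak fluctuation across seeds
std_db = float(np.std(runs_db))        # per-run standard deviation
```

For a deterministic restorer both quantities would be exactly zero, which is the reproducibility gap the study highlights.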
From a computational perspective, Nano Banana Pro processes a 512 × 512 image in roughly 0.8 seconds on an RTX 3090, slower than specialist models but still practical when the cost of training or fine‑tuning a dedicated network is taken into account.
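The practicality claim can be made concrete with back-of-the-envelope arithmetic at the reported 0.8 s per 512 × 512 image (the dataset size below is an arbitrary example):

```python
def batch_runtime_minutes(num_images: int, secs_per_image: float = 0.8) -> float:
    """Wall-clock minutes to process a dataset at the reported per-image latency."""
    return num_images * secs_per_image / 60.0

# e.g. a 1,000-image benchmark split finishes in ~13 minutes,
# versus the hours or days typically needed to train a specialist network.
example_minutes = batch_runtime_minutes(1000)
```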
The authors argue that the conventional evaluation paradigm for low‑level vision—dominated by PSNR/SSIM—fails to capture the perceptual strengths of generative models. They propose a more balanced assessment that combines perceptual metrics (NIMA, BRISQUE, FID) with traditional fidelity scores, or even new composite indices that reflect both visual appeal and pixel accuracy.
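The paper proposes the idea of a composite index but not a specific formula; one natural sketch is to min-max normalize each metric onto [0, 1], flip the metrics where lower is better, and take a weighted blend. The ranges and the weight `alpha` below are illustrative assumptions, not values from the paper:

```python
def minmax(x: float, lo: float, hi: float, higher_better: bool = True) -> float:
    """Map a raw metric value onto [0, 1]; flip metrics where lower is better
    (e.g. LPIPS, FID, BRISQUE)."""
    z = (x - lo) / (hi - lo)
    return z if higher_better else 1.0 - z

def composite_index(fidelity_n: float, perceptual_n: float, alpha: float = 0.5) -> float:
    """Weighted blend of a normalized fidelity score (e.g. PSNR) and a
    normalized perceptual score (e.g. NIMA). alpha is an arbitrary weight."""
    return alpha * fidelity_n + (1.0 - alpha) * perceptual_n
```

Sweeping `alpha` makes the trade-off explicit: alpha near 1 recovers the conventional PSNR-dominated ranking, while alpha near 0 recovers the human-preference ranking in which the generative model leads.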
Limitations are acknowledged: occasional over‑sharpening, color distortion, and higher BRISQUE scores on some datasets indicate that the model can produce artifacts when the degradation is extreme. Moreover, the inference speed may be a bottleneck for real‑time applications.
Future work is outlined along three lines: (1) automated prompt optimization to reduce stochastic variance and improve fidelity; (2) hybrid training that incorporates perceptual loss functions into diffusion models, thereby aligning generative objectives with pixel‑level constraints; and (3) development of a unified evaluation framework that jointly measures perceptual quality and quantitative fidelity.
In conclusion, Nano Banana Pro emerges as a capable zero-shot contender for a wide spectrum of low-level vision tasks, delivering human-preferred visual quality without any task-specific adaptation. However, it does not yet match specialist models on traditional reference-based metrics, highlighting a fundamental trade-off between generative flexibility and strict pixel-wise accuracy. The paper underscores the need for new evaluation standards and hybrid methodological approaches to fully harness the potential of large-scale generative models in classic computer-vision pipelines.