Uni-ISP: Toward Unifying the Learning of ISPs from Multiple Mobile Cameras
Modern end-to-end image signal processors (ISPs) can learn complex mappings from RAW/XYZ data to sRGB (and vice versa), opening new possibilities in image processing. However, the growing diversity of camera models, particularly in mobile devices, makes developing an individual ISP per device unsustainable, given the limited versatility and adaptability of such processors across camera systems. In this paper, we introduce Uni-ISP, a novel pipeline that unifies ISP learning for diverse mobile cameras, delivering a highly accurate and adaptable processor. At the core of Uni-ISP are device-aware embeddings, learned jointly with the forward/inverse ISPs under a dedicated training scheme. By doing so, Uni-ISP not only improves the performance of forward and inverse ISPs but also unlocks applications previously inaccessible to conventional learned ISPs. To support this work, we construct a real-world 4K dataset, FiveCam, comprising more than 2,400 pairs of sRGB-RAW images captured synchronously by five smartphone cameras. Extensive experiments validate Uni-ISP's accuracy in learning forward and inverse ISPs (improvements of +2.4 dB and +1.5 dB PSNR, respectively), its versatility in enabling new applications, and its adaptability to new camera models.
💡 Research Summary
Uni‑ISP addresses the growing impracticality of training a separate learned image signal processor (ISP) for each mobile camera model. The authors propose a unified architecture that simultaneously learns forward (RAW/XYZ → sRGB) and inverse (sRGB → RAW/XYZ) ISP mappings for multiple smartphones. The key innovation is the use of device‑aware embeddings: for each camera a, a learnable vector Eₐ is introduced. These embeddings interact with the bottleneck features of a shared encoder‑decoder backbone via a cross‑attention module called DEIM (Device‑aware Embedding Interaction Module). This design lets the network capture common ISP characteristics in the shared weights while allowing per‑device nuances (color tone, noise characteristics, tone‑mapping behavior) to be expressed through the embeddings.
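The interaction described above can be sketched as standard cross-attention in which shared bottleneck features query a per-device embedding. This is a minimal NumPy illustration, not the authors' implementation: the projection matrices here are random stand-ins for learned weights, and treating Eₐ as a small set of token vectors of the same channel width is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deim_cross_attention(bottleneck, device_emb, d_k=16, seed=0):
    """DEIM-style interaction sketch: bottleneck features (N, C) query a
    per-device embedding E_a of shape (M, C); the attended device
    information is added back residually."""
    rng = np.random.default_rng(seed)
    C = bottleneck.shape[-1]
    # Projection matrices are learned in the real model; random here.
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)
    Q = bottleneck @ Wq        # queries from shared features
    K = device_emb @ Wk        # keys from the device embedding
    V = device_emb @ Wv        # values carrying device-specific style
    attn = softmax(Q @ K.T / np.sqrt(d_k))
    return bottleneck + attn @ V

feats = np.random.default_rng(1).standard_normal((64, 32))  # flattened bottleneck
emb_a = np.random.default_rng(2).standard_normal((4, 32))   # E_a as 4 tokens
out = deim_cross_attention(feats, emb_a)
```

The residual form matters: with a near-zero embedding contribution, the network falls back to the shared, device-agnostic behavior, which is what lets common ISP structure live in the backbone weights.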
The backbone consists of Local Feature Extraction Blocks (LFEB) that handle fine‑grained spatial processing (convolutions, half‑instance‑norm, channel and spatial attention) and Global Feature Manipulation Blocks (GFMB) that inject global exposure metadata (exposure time, ISO, f‑number) extracted from EXIF. This mirrors real camera pipelines, which combine global adjustments (exposure, white‑balance) with local tone‑mapping.
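A GFMB-like injection of EXIF metadata could take the shape below. This is a hedged sketch, not the paper's block: the FiLM-style per-channel scale/shift, the log normalization, and the millisecond rescaling of exposure time are all assumptions made for illustration.

```python
import numpy as np

def gfmb_modulation(features, exposure_time, iso, f_number, seed=0):
    """Sketch of a GFMB-style global adjustment: EXIF scalars are
    log-normalized and mapped (FiLM-style; an assumption) to one
    per-channel scale and shift applied uniformly over the image,
    mimicking global exposure/white-balance corrections."""
    # Log-compress the raw EXIF values; *1e3 puts exposure into ms.
    meta = np.log1p(np.array([exposure_time * 1e3, iso, f_number]))
    C = features.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((3, 2 * C)) * 0.01  # learned in practice
    scale_shift = meta @ W
    scale, shift = scale_shift[:C], scale_shift[C:]
    return features * (1.0 + scale) + shift

feats = np.ones((8, 8, 16))                          # H x W x C features
out = gfmb_modulation(feats, exposure_time=0.01, iso=400, f_number=1.8)
```

Because the modulation is spatially uniform, it can only express global adjustments; local tone-mapping remains the job of the convolutional LFEB path.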
To train such a model, the authors built a new dataset, FiveCam, comprising 2,464 synchronized 4K sRGB‑RAW pairs captured simultaneously by five different smartphones (including iPhone, Samsung Galaxy, Google Pixel, Xiaomi, etc.). Because perfect pixel‑level alignment across devices is impossible, they first align images using optical‑flow warping, which inevitably introduces frequency bias (blurred high‑frequency details). To counter this, they introduce a Frequency‑Bias‑Correction (FBC) loss that penalizes the loss of high‑frequency content in the warped ground‑truth, preserving texture fidelity.
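The intent of such a loss can be shown with a minimal sketch, assuming (since the summary does not give the exact formulation) that "frequency-bias correction" amounts to adding a penalty on high-frequency residuals on top of a plain L1 term; the box-blur high-pass here is a crude stand-in for whatever filter the authors use.

```python
import numpy as np

def high_pass(img, k=5):
    """Crude high-pass filter: image minus a k-by-k box-blurred copy."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    blurred = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            blurred += padded[dy:dy + h, dx:dx + w]
    return img - blurred / (k * k)

def fbc_loss(pred, warped_gt, lam=1.0):
    """Hypothetical FBC-style loss: a plain L1 term plus an extra
    penalty on high-frequency residuals, so the network is not rewarded
    for reproducing the blur introduced by optical-flow warping."""
    l1 = np.abs(pred - warped_gt).mean()
    hf = np.abs(high_pass(pred) - high_pass(warped_gt)).mean()
    return l1 + lam * hf

img = np.random.default_rng(0).random((16, 16))
loss = fbc_loss(img, img)   # identical images give zero loss
```

Reweighting the high-frequency band is what keeps the network from learning the warp-induced blur as if it were the target appearance.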
Training proceeds in two complementary stages. In the self‑camera stage, each camera’s own sRGB‑XYZ pairs are used to minimize L1 losses for both the inverse ISP (sRGB → XYZ) and forward ISP (XYZ → sRGB), ensuring fair comparison with prior single‑camera methods. In the cross‑camera stage, the forward ISP is trained to map an sRGB image from camera a to the sRGB style of camera b, using the same inverse ISP to obtain a device‑independent XYZ intermediate. This enables applications such as photographic appearance transfer, interpolation/extrapolation between camera styles, and zero‑shot forensic tasks.
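The cross-camera idea above reduces to a simple composition: invert camera a's rendering into the shared XYZ space, then render with camera b's embedding. In this toy sketch each device's ISP is deliberately oversimplified to a per-channel gain derived from its embedding, standing in for the shared network plus DEIM.

```python
import numpy as np

# Toy stand-ins for the learned ISPs: each device is reduced to a
# per-channel gain given by its embedding (an oversimplification).
def inverse_isp(srgb, emb):
    return srgb / emb                 # device sRGB -> shared XYZ

def forward_isp(xyz, emb):
    return xyz * emb                  # shared XYZ -> device sRGB

def cross_camera_transfer(srgb_a, emb_a, emb_b):
    """Cross-camera stage idea: undo camera a's rendering into a
    device-independent XYZ, then re-render with camera b's embedding."""
    xyz = inverse_isp(srgb_a, emb_a)
    return forward_isp(xyz, emb_b)

emb_a = np.array([1.0, 2.0, 0.5])     # hypothetical device gains
emb_b = np.array([2.0, 1.0, 1.0])
img_a = np.ones((2, 2, 3))
img_b = cross_camera_transfer(img_a, emb_a, emb_b)
```

The device-independent XYZ intermediate is what makes the composition well-defined: both stages share one backbone, and only the embedding fed to DEIM changes.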
Quantitatively, Uni‑ISP outperforms state‑of‑the‑art single‑camera learned ISPs by +2.4 dB PSNR on forward conversion and +1.5 dB PSNR on inverse conversion. Qualitatively, mixing embeddings (e.g., 0.5 × Samsung + 0.5 × Xiaomi) yields smooth style blends that are visually natural. For forensics, the model’s learned ISP behavior serves as a fingerprint: without any explicit training, it can identify the source camera of an image (zero‑shot source‑camera identification) and detect spliced regions by checking consistency of the inferred ISP across patches.
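Both the style-mixing and the zero-shot identification described above follow directly from having one embedding per device. The sketch below again uses a toy per-channel-gain forward ISP (an assumption, not the real network); in the actual system the XYZ input would come from the learned inverse ISP.

```python
import numpy as np

def forward_isp(xyz, emb):
    return xyz * emb   # toy per-channel-gain stand-in for the forward ISP

def mix_embeddings(emb_a, emb_b, alpha=0.5):
    """Style interpolation: a convex blend of two device embeddings
    (alpha outside [0, 1] would extrapolate between styles)."""
    return alpha * emb_a + (1.0 - alpha) * emb_b

def identify_source(srgb, xyz, embeddings):
    """Zero-shot source identification sketch: re-render the XYZ with
    every candidate embedding and return the device whose rendering
    best matches the query sRGB (lowest mean absolute error)."""
    errs = {name: np.abs(forward_isp(xyz, e) - srgb).mean()
            for name, e in embeddings.items()}
    return min(errs, key=errs.get)

embeddings = {"A": np.array([1.0, 2.0, 0.5]),
              "B": np.array([2.0, 1.0, 1.0])}
xyz = np.ones((4, 4, 3))
query = forward_isp(xyz, embeddings["B"])   # image rendered by device B
source = identify_source(query, xyz, embeddings)
```

Splicing detection follows the same logic patch-wise: if different patches of one image are best explained by different embeddings, the inferred ISP is inconsistent and the image is likely composited.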
Overall, Uni‑ISP demonstrates that a single, compact neural ISP can be made adaptable to many devices through lightweight per‑device embeddings, while still delivering superior image quality and unlocking new cross‑camera functionalities. The work suggests a path toward a universal ISP that can be deployed on mobile platforms, reducing the need for device‑specific pipelines and opening avenues for style‑controlled photography and forensic analysis without additional supervision.