In Video on Demand (VoD) scenarios, traditional codecs are the industry standard due to their high decoding efficiency. However, they suffer from severe quality degradation under low-bandwidth conditions. While emerging generative neural codecs offer significantly higher perceptual quality, their reliance on heavy frame-by-frame generation makes real-time playback on mobile devices impractical. We ask: is it possible to combine the blazing-fast speed of traditional standards with the superior visual fidelity of neural approaches? We present HybridPrompt, the first generative video coding system capable of real-time 1080p decoding at over 150 FPS on a commercial smartphone. Specifically, we employ a hybrid architecture that encodes keyframes with a generative model while relying on traditional codecs for the remaining frames. A major challenge is that the two paradigms have conflicting objectives: the "hallucinated" details from generative models often misalign with the rigid prediction mechanisms of traditional codecs, causing bitrate inefficiency. To address this, we show that the traditional decoding process is differentiable, enabling an end-to-end optimization loop. This allows us to use subsequent frames as additional supervision, forcing the generative model to synthesize keyframes that are not only perceptually high-fidelity but also efficient prediction references for the traditional codec. By integrating a two-stage generation strategy, our system outperforms pure neural baselines by orders of magnitude in speed while achieving an average LPIPS gain of 8% over traditional codecs at 200 kbps.
Video on Demand (VoD) services dominate global network traffic [12], yet delivering high-quality streams to mobile users in unstable network environments remains a fundamental challenge. In scenarios with fluctuating bandwidth, such as crowded subways or rural areas, traditional video coding standards like H.264 [1], H.265 [2], and VP9 [10] often fail to maintain acceptable visual quality. These standards divide video into independent reference pictures called keyframes (I-frames) and dependent predicted frames (P-/B-frames) that only record changes. Under strict bitrate constraints, the codec is forced to heavily compress the I-frames to save data. Since P- and B-frames rely entirely on I-frames for reconstruction, a low-quality, blurry I-frame causes errors to propagate through the entire video segment, resulting in a poor viewing experience [5].
Generative Semantic Models offer a potential solution by reconstructing high-fidelity images from compact descriptions, but they currently lack the decoding speed required for mobile playback. The core mechanism of approaches like Promptus [14] is “gradient-based inversion”: an offline encoding process that iteratively optimizes a tiny text-like code (prompt) that drives a model (e.g., Stable Diffusion [3]) to synthesize an image. When encoding a frame, the encoder adjusts this compact code until the generated output matches the source frame. This allows for extremely low bandwidth usage, since only the compact codes are transmitted. However, the trade-off is the computational cost of decoding: existing generative video solutions typically rerun this heavy generation process for every single frame. On a mobile device with limited power, this approach is prohibitively slow, often taking seconds to generate just one frame, whereas smooth video requires at least 30 FPS.
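To make the inversion mechanism concrete, the following is a minimal sketch of such a gradient-based encoding loop. The generator G, the code dimensionality, the step count, and the plain MSE objective are illustrative assumptions rather than the actual Promptus configuration.

```python
import torch
import torch.nn.functional as F

def invert_frame(G, target, code_dim=1024, steps=500, lr=1e-2):
    """Offline encoding: optimize a compact code so that G(code) matches one frame.

    G      -- a differentiable generator mapping a code to an image
              (a stand-in for a Stable-Diffusion-like model)
    target -- the source frame, shape (1, 3, H, W), values in [0, 1]
    Returns the optimized code; only this code needs to be transmitted.
    """
    code = torch.zeros(1, code_dim, requires_grad=True)
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = G(code)                    # synthesize a candidate frame
        loss = F.mse_loss(recon, target)   # match the source frame pixel-wise
        loss.backward()                    # gradients flow through the generator
        opt.step()
    return code.detach()                   # compact code replaces the raw I-frame
```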
Is it possible to combine the superior perceptual quality of generative models with the blazing-fast decoding speed of traditional codecs? We propose HybridPrompt, a novel hybrid video coding strategy designed to achieve the best of both worlds. Our key insight is to use the heavy generative model only for I-frames, while relying on the lightweight traditional codec for the motion-compensated P- and B-frames. This creates a “Bitrate Arbitrage”: because the generative model can compress an I-frame to a negligible size while maintaining high visual quality, we save a massive amount of data. We then reallocate this saved bitrate budget to the traditionally encoded P- and B-frames. As a result, the P- and B-frames receive enough data to maintain high clarity, and because they are decoded by the traditional hardware codec, the overall playback remains smooth and real-time.
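The resulting decode path can be sketched as below; the per-unit flags and the helpers generate_iframe (the on-device generative model) and hw_decode (the platform hardware H.265 decoder) are hypothetical names used only for illustration.

```python
def decode_segment(units, generate_iframe, hw_decode):
    """Hybrid playback loop: run the heavy generative model once per segment
    (for the I-frame) and the fast hardware codec for every other frame."""
    frames = []
    reference = None
    for unit in units:
        if unit.is_iframe:
            # Slow path, executed once per segment: synthesize the reference
            # picture from the transmitted compact code.
            reference = generate_iframe(unit.code)
        else:
            # Fast path: the hardware codec reconstructs the frame from motion
            # vectors and residuals predicted against the generated reference.
            reference = hw_decode(unit.payload, reference)
        frames.append(reference)
    return frames
```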
However, simply stitching these two technologies together creates a conflict: the AI’s “hallucinated” details confuse the traditional codec. Generative models are designed to make images look good to the human eye (perceptual quality), often inventing realistic textures such as grass or hair that may not pixel-match the original video. Traditional codecs like H.265, however, are mathematically rigid: they expect precise pixel alignment for prediction. If the generative model synthesizes a texture that is slightly offset, the traditional codec interprets this as a massive error and wastes data trying to “fix” it, negating our bitrate savings. To solve this, we leverage the fact that the traditional decoding process is essentially a sequence of differentiable operations. This enables us to extend the iterative inversion into an end-to-end optimization loop: we no longer optimize the latent code solely for I-frame reconstruction. Instead, we introduce additional supervision that forces the generated I-frame to align with the motion vectors and residuals of the traditional codec. This ensures high perceptual quality across all frames.
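A minimal sketch of how the inversion objective can be extended with this supervision is given below, assuming a differentiable surrogate codec_predict for the codec’s motion-compensated prediction path and an illustrative weight alpha; the exact loss terms in HybridPrompt may differ.

```python
import torch.nn.functional as F

def hybrid_inversion_loss(G, code, iframe, pframes, codec_predict, alpha=0.5):
    """Joint objective: the generated I-frame must both reconstruct the source
    keyframe and serve as a good prediction reference for subsequent frames.

    codec_predict(ref, target) -- differentiable stand-in for the traditional
    codec's motion-compensated prediction of `target` from reference `ref`.
    """
    gen_iframe = G(code)
    # Term 1: fidelity of the generated keyframe itself.
    loss = F.mse_loss(gen_iframe, iframe)
    # Term 2: subsequent frames supervise the keyframe through the
    # differentiable decoding path, penalizing large prediction residuals.
    for p in pframes:
        loss = loss + alpha * F.mse_loss(codec_predict(gen_iframe, p), p)
    return loss
```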
We implement HybridPrompt on a commercial smartphone and demonstrate that it is the first generative-based system to achieve real-time 1080p decoding. Our contributions are threefold. First, we propose a generic hybrid architecture that leverages generative I-frames to subsidize the bitrate of the traditionally coded P-/B-frames.

Generating a single frame takes seconds on an iPhone 16 Pro Max, which is orders of magnitude slower than the 33 ms requirement for real-time playback (30 FPS). This creates a dilemma: we need the visual quality of neural models but the decoding speed of traditional codecs.
Our first insight is that we can achieve the “best of both worlds” through a hybrid architecture. The core observation is that neural codecs are asymmetrically efficient for I-frames.
As shown in Table 2, a neural-generated I-frame (Promptus) achieves significantly better perceptual quality (LPIPS 0.463 vs. 0.51, lower is better) with a much smaller file size (8.8 KB vs. 11.2 KB) compared to a traditional codec (H.265). This suggests a strategy of “Bitrate Arbitrage”: use the neural codec only for I-frames to save a large share of the bitrate, and then reallocate this budget to the subsequent P-frames encoded by the traditional codec.
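As a back-of-the-envelope illustration of the reallocation at the 200 kbps operating point, using the Table 2 I-frame sizes; the one-second segment duration assumed below is for illustration only.

```python
# Bitrate arbitrage arithmetic (I-frame sizes from Table 2; the 1-second
# segment duration is an assumption made only for this example).
segment_budget_kb = 200 / 8          # 25 KB per 1-second segment at 200 kbps
iframe_h265_kb = 11.2                # traditional H.265 I-frame
iframe_neural_kb = 8.8               # generative (Promptus-style) I-frame code

p_budget_traditional = segment_budget_kb - iframe_h265_kb   # 13.8 KB left for P-frames
p_budget_hybrid = segment_budget_kb - iframe_neural_kb      # 16.2 KB left for P-frames

gain = (p_budget_hybrid - p_budget_traditional) / p_budget_traditional
print(f"P-frame budget: {p_budget_traditional:.1f} KB -> "
      f"{p_budget_hybrid:.1f} KB (+{gain:.0%})")
```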