End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach

Reading time: 5 minutes
...

📝 Original Info

  • Title: End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach
  • ArXiv ID: 2512.14185
  • Date: 2025-12-16
  • Authors: Emanuele Artioli, Farzad Tashtarian, Christian Timmerer

📝 Abstract

The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, inpainting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.

📄 Full Content

End-to-End Learning-based Video Streaming Enhancement Pipeline: A Generative AI Approach

Emanuele Artioli (emanuele.artioli@aau.at), Farzad Tashtarian (farzad.tashtarian@aau.at), and Christian Timmerer (christian.timmerer@aau.at), Alpen-Adria-Universitaet Klagenfurt, Kaernten, Austria

Abstract

The primary challenge of video streaming is to balance high video quality with smooth playback. Traditional codecs are well tuned for this trade-off, yet their inability to use context means they must encode the entire video data and transmit it to the client. This paper introduces ELVIS (End-to-end Learning-based VIdeo Streaming Enhancement Pipeline), an end-to-end architecture that combines server-side encoding optimizations with client-side generative in-painting to remove and reconstruct redundant video data. Its modular design allows ELVIS to integrate different codecs, in-painting models, and quality metrics, making it adaptable to future innovations. Our results show that current technologies achieve improvements of up to 11 VMAF points over baseline benchmarks, though challenges remain for real-time applications due to computational demands. ELVIS represents a foundational step toward incorporating generative AI into video streaming pipelines, enabling higher quality experiences without increased bandwidth requirements.

The financial support of the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development, and the Christian Doppler Research Association is gratefully acknowledged. Christian Doppler Laboratory ATHENA: https://athena.itec.aau.at/. This work is licensed under a Creative Commons Attribution 4.0 International License.

CCS Concepts

• Computing methodologies → Object identification; Artificial intelligence; Concurrent algorithms; Image compression; • Information systems → Online analytical processing; Multimedia streaming.

Keywords

HTTP adaptive streaming, Generative AI, End-to-end architecture, Quality of Experience

1 Introduction

With the increasing demand for high-quality video streaming and storage, video compression methods are becoming more crucial than ever. Traditional codecs have significantly reduced file sizes while preserving visual quality, with each new iteration improving upon its predecessor by about 50% [1]. However, further advancements are needed to meet the growing requirements of bandwidth-constrained environments and the ever-increasing resolution of video content. A promising avenue is represented by neural codecs [2, 3], i.e., compressing a video into the weights of a neural network, which is then prompted by the client to recreate frames. In addition to server-side compression efficiency, video enhancement techniques have been explored to leverage client-side computation, such as frame interpolation and super-resolution [4, 5, 6, 7]. All of the aforementioned techniques face a common limitation: they can only tackle low-level features of videos, such as edges and block-wise flow.

[Figure 1: Overview of the ELVIS pipeline. An ELVIS controller coordinates the server side (video encoding, frame extraction, complexity calculation, frame shrinking), the network (performance monitoring), and the client side (video decoding, frame stretching, frame in-painting, video rendering).]
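To make the neural-codec idea above concrete, here is a minimal sketch in the spirit of implicit-representation codecs: a tiny network is overfit to one clip so that the "bitstream" is its weights, and the client reconstructs a frame by querying its timestamp. This is an illustration only, not the paper's method; the architecture, resolution, and training budget are arbitrary assumptions.

```python
# Illustrative sketch (not the paper's method): a "neural codec" that overfits
# a small network to one video, so the transmitted bitstream is the weights.
# Frame timestamp in -> frame out; all hyperparameters here are assumptions.
import torch
import torch.nn as nn

class TinyNeuralCodec(nn.Module):
    def __init__(self, h=72, w=128):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Sequential(
            nn.Linear(1, 256), nn.GELU(),
            nn.Linear(256, 256), nn.GELU(),
            nn.Linear(256, h * w * 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, t):  # t: (B, 1) normalized frame timestamps in [0, 1]
        return self.net(t).view(-1, 3, self.h, self.w)

# "Encoding" = overfitting the weights to the video's frames.
video = torch.rand(30, 3, 72, 128)           # stand-in for 30 decoded frames
idx = torch.linspace(0, 1, 30).unsqueeze(1)  # normalized timestamps
model = TinyNeuralCodec()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                         # a real codec trains far longer
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(idx), video)
    loss.backward()
    opt.step()

# "Decoding" on the client = one forward pass per requested frame.
with torch.no_grad():
    frame_17 = model(idx[17:18])             # reconstructed frame 17
```

Practical neural codecs of this family use convolutional decoders and positional encodings rather than a plain MLP; the point here is only the weights-as-bitstream idea.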
With the advent of large generative models, AI is now able to learn and replicate high-level video features, objects, and up to a few seconds of a whole video [8]. Therefore, a new and yet-to-be-explored avenue for enhancing video compression is the integration of video in-painting techniques [9, 10, 11] that analyze the video as a whole, gather context about what it represents, and fill in missing or corrupted regions. Using the latest advances in machine learning, such as attention mechanisms [12], and training on increasingly large and curated datasets, state-of-the-art (SOTA) in-painting algorithms learn how objects typically appear and move in videos, giving them the ability to recreate far larger portions of content than previously possible [9, 10, 11].

This paper's contributions are twofold: (i) it presents ELVIS, an innovative method that implements video in-painting alongside encoding, aiming to enhance compression efficiency by eliminating parts of the video that are challenging to encode but can be regenerated by the client using in-painting algorithms, without significantly deteriorating the viewing experience. This approach allows the encoder to focus on portions that cannot be easily replicated at the client side, thereby increasing video quality without additional bandwidth requirements. The effectiveness of this method is evaluated using a variety of metrics to ensure that the reconstructed video meets the high standards required for practical deployment. Contribution (ii) is the release¹ of an end-to-end pipeline, outlined in Figure 1.
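To illustrate how the "complexity calculation" and "frame shrinking" stages of such a pipeline could work, the sketch below ranks fixed-size blocks of a frame by a cheap texture proxy (gradient energy) and blanks the lowest-scoring ones, which the client's in-painter would later restore. The block size, drop ratio, and complexity measure are illustrative assumptions, not the paper's exact choices.

```python
# Illustrative sketch: rank blocks by a simple complexity proxy and blank the
# easiest ones before encoding, leaving them for client-side in-painting.
import numpy as np

def block_complexity(frame: np.ndarray, block: int = 16) -> np.ndarray:
    """Mean gradient magnitude per block as a cheap complexity proxy."""
    gy, gx = np.gradient(frame.astype(np.float32))
    energy = np.abs(gx) + np.abs(gy)
    h, w = frame.shape
    tiles = energy[: h - h % block, : w - w % block]
    tiles = tiles.reshape(h // block, block, w // block, block)
    return tiles.mean(axis=(1, 3))

def shrink_frame(frame: np.ndarray, drop_ratio: float = 0.3, block: int = 16):
    """Zero out the lowest-complexity blocks; return frame + in-painting mask."""
    scores = block_complexity(frame, block)
    cutoff = np.quantile(scores, drop_ratio)
    mask = scores <= cutoff                    # True = left for the in-painter
    out = frame.copy()
    for by, bx in zip(*np.nonzero(mask)):
        out[by * block:(by + 1) * block, bx * block:(bx + 1) * block] = 0
    return out, mask

gray = (np.random.rand(144, 256) * 255).astype(np.uint8)  # stand-in frame
shrunk, mask = shrink_frame(gray)  # encode `shrunk`; ship `mask` as side info
```

A production system would likely derive complexity from the encoder itself (e.g., per-block rate cost) rather than raw gradients, but the selection logic is the same: spend bits only where the client cannot plausibly regenerate content.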
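Since the results are reported in VMAF points, it is worth noting how such scores are typically obtained: ffmpeg's libvmaf filter compares a distorted clip against its reference. A minimal example follows, assuming an ffmpeg build with libvmaf enabled, placeholder file names, and the JSON log layout of recent libvmaf versions.

```python
# Score a reconstruction with VMAF via ffmpeg's libvmaf filter.
# First input = distorted/reconstructed clip, second input = reference.
import json
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "reconstructed.mp4", "-i", "reference.mp4",
        "-lavfi", "libvmaf=log_path=vmaf.json:log_fmt=json",
        "-f", "null", "-",
    ],
    check=True,
)
with open("vmaf.json") as f:
    log = json.load(f)
print(log["pooled_metrics"]["vmaf"]["mean"])  # mean VMAF over all frames
```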

Reference

This content is AI-processed based on open access ArXiv data.
