ELLMPEG: An Edge-based Agentic LLM Video Processing Tool

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the Original ArXiv Source.

Large language models (LLMs), the foundation of generative AI systems like ChatGPT, are transforming many fields and applications, including multimedia, enabling more advanced content generation, analysis, and interaction. However, cloud-based LLM deployments face three key limitations: high computational and energy demands, privacy and reliability risks from remote processing, and recurring API costs. Recent advances in agentic AI, especially in structured reasoning and tool use, offer a better way to exploit open, locally deployed tools and LLMs. This paper presents ELLMPEG, an edge-enabled agentic LLM framework for the automated generation of video-processing commands. ELLMPEG integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at the edge, eliminating reliance on external cloud APIs. To evaluate ELLMPEG, we collect a dedicated prompt dataset comprising 480 diverse queries covering different categories of commands for FFmpeg and the Versatile Video Codec (VVC) encoder VVenC. We validate command-generation accuracy and evaluate four open-source LLMs on command validity, tokens generated per second, inference time, and energy efficiency. We also execute the generated commands to assess their runtime correctness and practical applicability. Experimental results show that Qwen2.5, when augmented with the ELLMPEG framework, achieves an average command-generation accuracy of 78% with zero recurring API cost, outperforming all other open-source models across both the FFmpeg and VVenC datasets.


💡 Research Summary

The paper “ELLMPEG: An Edge‑based Agentic LLM Video Processing Tool” addresses the growing gap between the impressive capabilities of large language models (LLMs) for multimedia tasks and the practical constraints of deploying such models for video‑processing command generation. Existing solutions, such as LLMPEG, rely on cloud‑based APIs (e.g., GPT‑4, ChatGPT) to translate natural‑language queries into FFmpeg commands. While these services provide high accuracy, they suffer from three fundamental drawbacks: (1) dependence on network connectivity, which limits use in bandwidth‑constrained or offline environments; (2) recurring API costs that scale with usage; and (3) limited adaptability to newly released tools (e.g., the Versatile Video Codec (VVC) encoder VVenC) whose documentation may not be represented in the training data of cloud models.

To overcome these limitations, the authors propose ELLMPEG, an edge‑deployable, agentic LLM framework that combines tool‑aware Retrieval‑Augmented Generation (RAG) with an iterative self‑reflection loop. The system is designed to run on modest open‑source LLMs (2–8 B parameters) on local hardware, thereby eliminating cloud latency, cost, and privacy concerns while still delivering accurate command generation.

System Architecture
ELLMPEG consists of three main phases:

  1. RAG Setup – The authors ingest the official documentation for FFmpeg and VVenC, chunk the texts into manageable pieces, and embed each chunk using a lightweight embedding model. Two separate vector stores (FAISS indexes) are created: one for FFmpeg (VS_f) and one for VVenC (VS_v). Each chunk is annotated with a tool tag, enabling the system to direct queries to the appropriate store and avoid cross‑tool noise.

  2. LLM Reasoning – When a user submits a natural‑language query, the system encodes the query into a vector, retrieves the top‑k most similar chunks from the relevant store, and constructs a prompt that includes both the retrieved context and the tool tag. The selected open‑source LLM (e.g., Qwen2.5‑7B) then generates a candidate command line.

  3. Self‑Reflection Loop – The generated command is immediately parsed and validated (syntax check, option verification, file‑path sanity). If errors are detected, a self‑critique module classifies the error type, re‑queries the vector store for additional or corrected context, and feeds a refined prompt back to the LLM. This loop iterates up to a configurable maximum (S_max) until a valid command is produced.
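The tool-tagged retrieval in Phases 1 and 2 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a toy bag-of-words similarity stands in for the real embedding model and FAISS indexes, and the documentation snippets are made up for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a real system would use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two per-tool stores, mirroring VS_f (FFmpeg) and VS_v (VVenC).
# The documentation snippets below are illustrative placeholders.
STORES = {
    "ffmpeg": ["use -c:v libx264 to encode H.264 video",
               "use -vf scale=W:H to resize a video"],
    "vvenc":  ["use --preset fast for faster VVC encoding",
               "use --qp N to set the quantization parameter"],
}
INDEX = {tag: [(doc, embed(doc)) for doc in docs] for tag, docs in STORES.items()}

def retrieve(query, tool_tag, k=1):
    """Route the query to the store matching its tool tag, avoiding cross-tool noise."""
    q = embed(query)
    ranked = sorted(INDEX[tool_tag], key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("resize my clip to 1280x720", "ffmpeg"))
```

The key design point carried over from the paper is the routing step: because each chunk carries a tool tag and lives in its own store, a VVenC query can never be answered with FFmpeg documentation, and vice versa.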

Dataset and Evaluation
The authors construct a dedicated benchmark of 480 queries (240 for FFmpeg, 240 for VVenC), covering a wide range of functionalities such as format conversion, codec selection, preset tuning, and filter application. Four open‑source LLMs are evaluated within the ELLMPEG framework: Qwen2.5‑7B, Llama 3.1‑8B, Gemma‑2‑9B, and Mistral‑7B. Evaluation metrics include:

  • Command‑generation accuracy (percentage of syntactically correct and executable commands).
  • Tokens per second (throughput).
  • Total inference latency (including RAG retrieval).
  • Energy consumption (measured in joules per query).
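The latter three metrics follow directly from raw measurements of token count, wall-clock time, and average power draw. A minimal sketch of the bookkeeping (the input numbers below are illustrative, not results from the paper):

```python
def efficiency_metrics(n_tokens, wall_seconds, avg_power_watts):
    """Derive throughput, latency, and energy per query from raw measurements."""
    return {
        "tokens_per_s": n_tokens / wall_seconds,
        "latency_s": wall_seconds,                          # includes RAG retrieval
        "joules_per_query": avg_power_watts * wall_seconds, # E = P * t
    }

m = efficiency_metrics(n_tokens=420, wall_seconds=2.0, avg_power_watts=30.0)
print(m)  # tokens_per_s=210.0, latency_s=2.0, joules_per_query=60.0
```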

Results show that Qwen2.5 achieves the highest accuracy at 78%, outperforming the other models by a margin of 6–12%. Moreover, Qwen2.5 exhibits the best token throughput (≈210 tokens/s) and the lowest energy per query, making it the most efficient choice for edge deployment. The self‑reflection mechanism reduces error rates by roughly 45% compared to a naïve RAG‑only baseline.

Runtime Validation
To assess practical applicability, the authors execute the generated commands on a standard Linux environment with FFmpeg 6.0 and VVenC 1.2.0 installed. Of the commands deemed correct by the accuracy metric, 92 % run without runtime failures, and the resulting video files meet the expected specifications (resolution, bitrate, codec). Remaining failures are attributed to minor issues such as missing input files or out‑of‑range parameter values, which could be mitigated with additional file‑system abstraction or parameter range checks.
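The pre-execution checks that feed the self-reflection loop (Phase 3 above) can be sketched as follows. Everything here is a hypothetical placeholder: the option whitelist covers only a handful of real FFmpeg flags for illustration, and the stub generator stands in for the actual LLM.

```python
import shlex

# Illustrative subset of real FFmpeg options; the paper's validator is not specified.
KNOWN_FFMPEG_OPTS = {"-i", "-c:v", "-b:v", "-vf", "-r", "-y"}

def validate(cmd):
    """Cheap local checks before execution: parse, tool name, option whitelist.

    Returns None when the command passes, or an error description otherwise.
    """
    try:
        parts = shlex.split(cmd)
    except ValueError as exc:
        return f"unparseable: {exc}"
    if not parts or parts[0] != "ffmpeg":
        return "wrong tool"
    unknown = [p for p in parts[1:] if p.startswith("-") and p not in KNOWN_FFMPEG_OPTS]
    return f"unknown options: {unknown}" if unknown else None

def self_reflect(query, generate, s_max=3):
    """Generate -> validate, feeding the error back, up to S_max iterations."""
    feedback = None
    for _ in range(s_max):
        cmd = generate(query, feedback)
        feedback = validate(cmd)
        if feedback is None:
            return cmd
    return None  # give up after S_max attempts

# Stub "LLM": the first draft uses an invalid flag; the critique steers it right.
def fake_llm(query, feedback):
    if feedback is None:
        return "ffmpeg -i in.mp4 --codec h264 out.mp4"
    return "ffmpeg -i in.mp4 -c:v libx264 out.mp4"

print(self_reflect("convert to H.264", fake_llm))
# -> ffmpeg -i in.mp4 -c:v libx264 out.mp4
```

The design choice worth noting is that validation is purely local and cheap, so failed drafts never leave the device; only the bounded retry count `S_max` limits how much extra inference a hard query can cost.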

Contributions and Future Work
The paper’s primary contributions are:

  1. A lightweight, tool‑aware RAG architecture that isolates documentation for each multimedia tool, improving retrieval relevance.
  2. An iterative self‑reflection loop tailored for small LLMs, enabling on‑device error correction without increasing model size.
  3. The release of a 480‑query benchmark for FFmpeg and VVenC command generation.
  4. A comprehensive empirical study of open‑source LLMs on correctness, speed, and energy efficiency in an edge context.

Future directions suggested include extending the framework to multimodal inputs (e.g., visual cues from video frames), supporting additional codecs such as AV1 and HEVC, integrating automatic fine‑tuning pipelines that continuously ingest new documentation, and optimizing deployment for ultra‑low‑power edge devices (e.g., Raspberry Pi, smartphones).

Overall Assessment
ELLMPEG demonstrates that with careful system design—combining domain‑specific retrieval, lightweight LLM reasoning, and a disciplined self‑correction mechanism—edge devices can reliably generate complex video‑processing commands without relying on costly cloud services. The work bridges a critical gap between the theoretical potential of LLMs in multimedia and their practical, privacy‑preserving, cost‑effective deployment, setting a solid foundation for future research in agentic AI for video engineering.

