Large language models (LLMs) and multimodal models have become powerful general-purpose reasoning systems. However, radio-frequency (RF) signals, which underpin wireless systems, are still not natively supported by these models. Existing LLM-based approaches for telecom focus mainly on text and structured data, while conventional RF deep-learning models are built separately for specific signal-processing tasks, leaving a clear gap between RF perception and high-level reasoning. To bridge this gap, we introduce RF-GPT, a radio-frequency language model (RFLM) that leverages the visual encoders of multimodal LLMs to process and understand RF spectrograms. In this framework, complex in-phase/quadrature (IQ) waveforms are mapped to time-frequency spectrograms and then passed to pretrained visual encoders. The resulting representations are injected as RF tokens into a decoder-only LLM, which generates RF-grounded answers, explanations, and structured outputs. To train RF-GPT, we perform supervised instruction fine-tuning of a pretrained multimodal LLM on a fully synthetic RF corpus. Standards-compliant waveform generators produce wideband scenes for six wireless technologies, from which we derive time-frequency spectrograms, exact configuration metadata, and dense captions. A text-only LLM then converts these captions into RF-grounded instruction-answer pairs, yielding roughly 12,000 RF scenes and 625,000 instruction examples without any manual labeling. Across benchmarks for wideband modulation classification, overlap analysis, wireless-technology recognition, WLAN user counting, and 5G NR information extraction, RF-GPT achieves strong multi-task performance, whereas general-purpose vision-language models (VLMs) without RF grounding largely fail.
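As a rough illustration of the front end described above, the sketch below converts a complex IQ capture into a log-magnitude time-frequency spectrogram image of the kind that could be fed to a pretrained visual encoder. The sample rate, STFT parameters, and the synthetic two-carrier test signal are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch (assumed, not the authors' code): turning a complex IQ capture
# into a log-magnitude time-frequency spectrogram image, the kind of input the
# paper describes passing to a pretrained visual encoder.
import numpy as np
from scipy.signal import stft

fs = 20e6                               # assumed sample rate (Hz)
t = np.arange(100_000) / fs             # 5 ms of IQ samples
rng = np.random.default_rng(0)
iq = (np.exp(2j * np.pi * 2e6 * t)                      # carrier at +2 MHz
      + 0.5 * np.exp(2j * np.pi * -5e6 * t)             # weaker carrier at -5 MHz
      + 0.1 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size)))

# Two-sided STFT keeps both positive and negative frequencies of the complex signal.
f, frames, Z = stft(iq, fs=fs, nperseg=256, noverlap=192, return_onesided=False)
spec_db = 20 * np.log10(np.abs(np.fft.fftshift(Z, axes=0)) + 1e-12)

# Normalize to [0, 255] so the spectrogram can be saved or resized as a grayscale
# image matching the visual encoder's expected input format.
spec_img = 255 * (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-12)
spec_img = spec_img.astype(np.uint8)
print(spec_img.shape)                   # (frequency bins, time frames)
```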
Large language models (LLMs) have significantly advanced natural language processing, enabling powerful capabilities in long-form generation, code synthesis, tool use, and multi-step reasoning. Multimodal systems now extend these capabilities beyond text: vision-language models (VLMs) such as GPT-4o [1], Gemini [2], LLaVA [3], Qwen-VL [4], and InternVL [5] integrate image and text inputs to support visual reasoning, captioning, and visual question answering, among other tasks. In the audio domain, large transformer-based models such as Whisper [6], Qwen2-Audio [7], and Minimax-Speech [8] demonstrate that massive unlabeled speech and sound corpora can be leveraged to learn robust representations for transcription, generation, and audio understanding. Despite these advances in text, vision, and audio, radio-frequency (RF) signals, which represent the physical layer of wireless communications, radar sensing, and integrated sensing and communications (ISAC), have not yet been integrated into these foundation-model frameworks.
Existing machine-learning-driven RF intelligence relies mainly on narrow, task-specific models for automatic modulation classification, channel estimation, beam selection, interference identification, and spectrum sensing [9], to name a few. These models are typically trained on small, heterogeneous datasets under constrained assumptions about channel models, hardware impairments, and traffic patterns, and they are evaluated with task-specific metrics. While such models can achieve high accuracy, they exhibit several limitations. First, each task requires its own architecture, dataset, and training pipeline, resulting in limited reuse across tasks. Second, building diverse, well-labeled RF datasets is expensive and usually requires expert annotation, making large-scale supervision difficult. Third, models trained under particular setups suffer degraded performance when deployed under different signal-to-noise ratio (SNR) ranges, channel conditions, or hardware. Finally, most RF models produce only labels or regression outputs, without explanations or a natural interface for human interaction. Within this context, a language-model-based approach changes the picture in two important ways. First, it allows multiple RF tasks to be handled by a single model through instructions, rather than through separate architectures. Second, it provides an interface for reasoning and interaction, where the model can describe what it observes, justify its predictions, and respond to follow-up questions. Instead of training a new neural network for each RF objective, tasks can be expressed as prompts executed over a shared representation.
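To make the "tasks as prompts" idea concrete, the sketch below phrases several RF objectives as instructions over the same spectrogram input. The task names, prompt wording, and chat-style message format are assumptions for illustration, not the paper's actual templates or interface.

```python
# Illustrative sketch (assumed): several RF tasks expressed purely as instructions
# over one shared spectrogram input, instead of separate task-specific networks.
# The task names, prompt wording, and message format below are hypothetical.
RF_TASK_PROMPTS = {
    "modulation": "List the modulation scheme of each burst visible in this spectrogram.",
    "technology": "Which wireless technologies (e.g., WLAN, 5G NR) are present in this capture?",
    "overlap": "Do any transmissions overlap in time and frequency? Describe the overlap.",
    "user_count": "How many distinct WLAN users are transmitting in this scene?",
}

def build_request(task: str, spectrogram_path: str) -> dict:
    """Assemble one multimodal request: a shared RF spectrogram plus a task instruction."""
    return {
        "role": "user",
        "content": [
            {"type": "image", "path": spectrogram_path},       # shared RF representation
            {"type": "text", "text": RF_TASK_PROMPTS[task]},   # task selected by prompt alone
        ],
    }

# The same spectrogram serves every task; only the instruction changes.
for task in RF_TASK_PROMPTS:
    print(task, "->", build_request(task, "scene_0001.png")["content"][1]["text"])
```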
Despite recent progress on wireless and RF foundation models, such as WFM [10], LWM [11], and WirelessGPT-like channel models [12], most existing approaches still rely on task-specific output heads or per-task fine-tuning to achieve competitive performance on downstream tasks. In practice, each new application (e.g., channel estimation, localization, sensing, or RF classification) requires its own prediction head, loss design, and fine-tuning pipeline on labeled data, significantly limiting the potential of an RF foundation model. Moreover, many of these models are relatively small and optimized for a narrow set of benchmarks, making it difficult to balance performance across heterogeneous tasks or to scale them according to established scaling laws [13].
In parallel, the 6G research roadmap envisions AI-native networks that integrate sensing, communication, computing, and control, with autonomous, intent-driven operation across the radio access and core networks [14]–[16]. Within this vision, LLMs have been proposed as unified interfaces for knowledge access, reasoning, tool orchestration, and policy optimization, enabling automated fault diagnosis, configuration generation, and closed-loop network management. Early work on LLM4Telecom has explored network and service management assistants, domain-specialized instruction-tuned models, and agentic workflows that connect LLMs to monitoring systems and to operations support system (OSS) and business support system (BSS) data [17]–[21]. Telecom-specific LLMs, such as TelecomGPT [22], integrate domain knowledge directly into the language model to improve its reasoning capabilities over alarms, key performance indicators (KPIs), logs, and configuration data.