LLM2Fx-Tools: Tool Calling For Music Post-Production

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.


💡 Research Summary

LLM2Fx‑Tools presents a novel multimodal framework that leverages large language models (LLMs) to automatically generate executable audio‑effects chains (Fx‑chains) for music post‑production. By integrating chain‑of‑thought (CoT) planning and tool‑calling, the system can understand both natural‑language instructions and audio inputs (dry and reference signals), select appropriate effect modules, determine their ordering, and predict precise parameter values.

Core Architecture
The pipeline combines a pretrained audio encoder (Fx‑Encoder++) with a transformer‑based audio‑language adapter that projects audio patch embeddings into the embedding space of Qwen‑3‑4B, a 4‑billion‑parameter LLM. Text instructions, separator tokens (“dry audio”, “reference audio”), and the audio embeddings are concatenated into a single token sequence. During autoregressive generation the model first emits a CoT narrative that decomposes the task into four sub‑steps (input analysis, effect selection, order planning, parameter planning). This CoT serves as contextual conditioning for the subsequent tool‑calling stage, where the model outputs a structured sequence of tool calls of the form (tool_name, {parameter: value}). Finally, a natural‑language response is produced to close the conversational loop.
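As a rough illustration of this input assembly, the sketch below interleaves instruction tokens, separator tokens, and adapter-projected audio embeddings into a single sequence. All names and the helper function here are illustrative assumptions, not the paper's actual API:

```python
# Minimal sketch of multimodal input assembly (hypothetical names).
# In the real system, dry/reference "patches" would be audio embeddings
# projected by the adapter into the LLM's embedding space.
def build_input_sequence(instruction_tokens, dry_patches, ref_patches):
    seq = list(instruction_tokens)
    seq += ["<dry_audio>"] + list(dry_patches)        # separator + dry-audio embeddings
    seq += ["<reference_audio>"] + list(ref_patches)  # separator + reference embeddings
    return seq
```

From this sequence, the model would then autoregressively generate the CoT narrative, the tool calls, and the closing natural-language response.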

Training Objectives
Two losses are combined: (1) standard cross‑entropy over the token sequence, and (2) a Number‑Token Loss (NTL) that measures the Wasserstein‑1 distance between the predicted token distribution and the ground‑truth numeric value, thereby encouraging numerically accurate parameter predictions. The total loss L_total = L_CE + λ·L_NTL balances linguistic fidelity and numeric precision.
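The NTL term can be sketched in plain Python. For a single digit position, the Wasserstein-1 distance between a predicted distribution over the ten digit tokens and a point mass at the ground-truth digit reduces to a probability-weighted absolute difference. The weight λ below is an assumed value, not one reported in this summary:

```python
# Sketch of the Number-Token Loss (NTL) idea for one digit position:
# Wasserstein-1 distance between the predicted distribution over digit
# tokens '0'..'9' and a point mass at the ground-truth digit.
def number_token_loss(probs, target_digit):
    """probs: list of 10 probabilities for digit tokens '0'..'9'."""
    return sum(p * abs(k - target_digit) for k, p in enumerate(probs))

def total_loss(ce_loss, ntl_loss, lam=0.3):  # lam (lambda) is an assumed weight
    return ce_loss + lam * ntl_loss
```

A perfectly confident correct prediction yields zero NTL, while mass placed on nearby digits is penalized less than mass on distant ones, which is what pushes the model toward numerically close parameter values.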

Multi‑Stage Training
Training proceeds in two phases. First, the adapter is pretrained on audio‑to‑Fx‑chain pairs while the LLM weights are frozen, using randomly sampled Fx‑chains to cover a broad parameter space. Second, the full model (adapter + LLM) is fine‑tuned with LoRA (rank 128, α 256) on the complete conversational dataset, which includes user instructions, CoT, tool calls, and responses.
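A schematic view of this two-phase schedule, with module names chosen for illustration (only the LoRA rank and alpha come from the summary above):

```python
# Sketch of the two-phase training schedule (module names are hypothetical).
# Phase 1: only the audio-language adapter trains; the LLM stays frozen.
# Phase 2: adapter and LLM train jointly, the LLM via LoRA updates.
def trainable_modules(phase):
    if phase == 1:
        return {"adapter"}
    if phase == 2:
        return {"adapter", "llm_lora"}
    raise ValueError("phase must be 1 or 2")

LORA_CONFIG = {"r": 128, "lora_alpha": 256}  # values stated in the summary
```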

Robustness Measures
To bridge the distribution gap between training (where both dry and wet audio are available) and real‑world inference (often only wet audio), the authors apply Fx‑Removal and Fx‑Normalization preprocessing to obtain pseudo‑dry audio, and introduce random dry‑audio masking during training. This enables the model to handle both reverse‑engineering and blind‑estimation scenarios.
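Random dry-audio masking can be sketched as follows; the masking probability and the mask-token mechanism are assumptions for illustration:

```python
import random

# Sketch of random dry-audio masking during training: with some probability
# (p_mask is an assumed value), the dry-audio embeddings are replaced by a
# mask token, so the model also learns the blind-estimation setting where
# only the processed (wet) audio is available.
def maybe_mask_dry(dry_embeddings, mask_token, p_mask=0.5, rng=random):
    if rng.random() < p_mask:
        return [mask_token] * len(dry_embeddings)
    return dry_embeddings
```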

Dataset – LP‑Fx
LP‑Fx is built on MedleyDB, comprising 2,119 curated raw tracks spanning nine genres and 80 instruments. Using the Pedalboard library and three custom modules, nine effect types (compressor, distortion, reverb, delay, limiter, gain, three‑band EQ, stereo widener, panner) with a total of 26 parameters are defined. The data generation pipeline consists of: (1) random sampling of musically plausible Fx‑chains to synthesize dry/wet audio pairs, (2) prompting an LLM to create instruction‑following dialogues grounded in those chains, (3) prompting another LLM to generate CoT that explicitly links the instruction to the effect transformations, and (4) filtering the resulting samples with an LLM‑as‑a‑judge to retain high‑quality entries. The final corpus contains 101 K conversational examples, each annotated with instruction, tool calls, CoT, and assistant response.
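Step (1) of this pipeline, random Fx-chain sampling, might look like the sketch below. The effect names follow the paper, but the parameter names and ranges here are invented for illustration; each sampled call uses the (tool_name, {parameter: value}) shape described earlier:

```python
import random

# Sketch of random Fx-chain sampling for dataset synthesis. Only a subset
# of the nine effect types is shown; parameter names/ranges are assumed.
EFFECTS = {
    "compressor": {"threshold_db": (-40.0, 0.0), "ratio": (1.0, 10.0)},
    "reverb": {"room_size": (0.0, 1.0), "wet_level": (0.0, 1.0)},
    "delay": {"delay_seconds": (0.05, 0.5), "feedback": (0.0, 0.6)},
    "gain": {"gain_db": (-12.0, 12.0)},
}

def sample_fx_chain(rng, max_len=3):
    """Sample an ordered chain of distinct effects with random parameters."""
    names = rng.sample(list(EFFECTS), k=rng.randint(1, max_len))
    return [(name, {p: round(rng.uniform(lo, hi), 2)
                    for p, (lo, hi) in EFFECTS[name].items()})
            for name in names]
```

In the actual pipeline, each sampled chain would be rendered (e.g. with the Pedalboard library) to produce the wet audio paired with its dry source.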

Experimental Results

  • Reverse Engineering: When both dry and reference audio are provided, LLM2Fx‑Tools outperforms state‑of‑the‑art regression and multitask baselines, achieving a 2.3 dB improvement in SDR and a 15 % reduction in parameter MAE.
  • Blind Estimation: With dry audio masked, the system still delivers competitive performance, demonstrating the effectiveness of the preprocessing and CoT‑guided reasoning.
  • Style Transfer: The inferred Fx‑chain from a reference track can be applied to new content, and subjective listening tests report a 78 % agreement that the transferred style matches the reference.
  • LLM‑as‑Judge Evaluation: Human judges and the LLM‑as‑judge agree on the quality of generated CoT and responses with an average rating of 3.6/4, confirming interpretability and conversational usefulness.

Contributions

  1. First structured tool‑calling approach for audio‑effects generation, enabling LLMs to invoke non‑differentiable plugins directly.
  2. CoT‑based planning that yields transparent, step‑by‑step reasoning for effect selection, ordering, and parameterization.
  3. Multimodal instruction‑following that integrates textual preferences with audio conditioning.
  4. Release of the LP‑Fx dataset, a large‑scale, high‑quality resource for future research on LLM‑driven audio processing.

Limitations & Future Work
The current system supports only nine effect types and 26 parameters, limiting its applicability to more complex production chains (e.g., multi‑band dynamics, spatialization). Real‑time deployment remains challenging due to inference latency and memory demands of a 4 B‑parameter LLM. Future directions include expanding the tool library, optimizing inference (e.g., quantization, distillation), and incorporating online user feedback via reinforcement learning to refine the Fx‑chain iteratively.

In summary, LLM2Fx‑Tools demonstrates that large language models, when equipped with chain‑of‑thought reasoning and explicit tool‑calling capabilities, can bridge the gap between high‑level artistic intent and low‑level signal processing, offering an interpretable, controllable, and extensible solution for modern music post‑production.

