Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production
Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
💡 Research Summary
This paper tackles the often‑overlooked problem of model selection for structured text classification when deploying to production systems. While large language models (LLMs) such as GPT‑4o and Claude Sonnet 4.5 have demonstrated impressive zero‑ and few‑shot capabilities, the authors argue that choosing a model solely on predictive performance ignores critical operational constraints such as latency, cost, throughput, and governance (reproducibility, auditability, privacy).
To investigate, the authors conduct a systematic empirical study on four canonical English benchmarks—IMDB, SST‑2, AG News, and DBPedia—comparing two contrasting paradigms: (1) fully fine‑tuned encoder‑only models from the BERT family (BERT, RoBERTa, DistilBERT) and (2) zero‑ and few‑shot prompt‑based inference with state‑of‑the‑art LLMs (GPT‑4o, Claude Sonnet 4.5). For each configuration they measure three core metrics: macro F1 (the unweighted mean of per‑class F1 scores), end‑to‑end inference latency (p50, p95, p99), and monetary cost per million tokens based on contemporary API pricing.
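The three metrics are standard, and the paper does not publish its measurement scripts inline, but they can be sketched with plain Python. The function names, latency samples, and prices below are illustrative placeholders, not the authors' code or actual API rates:

```python
# Sketch of the three core metrics: macro F1, latency percentiles, and
# per-request cost. All inputs and prices below are hypothetical examples.
from statistics import quantiles

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def latency_percentiles(samples_ms):
    """p50 / p95 / p99 from raw end-to-end latency samples (milliseconds)."""
    qs = quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def cost_per_request(in_tokens, out_tokens, price_in_per_m, price_out_per_m):
    """USD cost of one request, given per-million-token input/output prices."""
    return in_tokens * price_in_per_m / 1e6 + out_tokens * price_out_per_m / 1e6
```

Separating input- and output-token prices matters for the LLM side, since generative APIs typically bill output tokens at a higher rate than input tokens.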
Crucially, the evaluation is framed as a multi‑objective decision problem. The authors define three operational constraints—latency budget, throughput requirement, and budget ceiling—and augment them with governance variables. They map every model onto a three‑dimensional space (accuracy, latency, cost) and compute Pareto frontiers. To make the analysis actionable, they introduce a parameterized utility function U = α·F1 − β·Latency − γ·Cost, allowing practitioners to weight the three objectives according to their deployment regime (e.g., “accuracy‑first”, “latency‑critical”, or “cost‑sensitive”).
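The Pareto filtering and the utility function U = α·F1 − β·Latency − γ·Cost are simple enough to sketch directly. The model entries and weight values below are illustrative assumptions, not the paper's measured numbers:

```python
# Minimal sketch of the multi-objective framing: Pareto filtering over
# (F1, latency, cost) and the utility U = alpha*F1 - beta*Lat - gamma*Cost.
# Model tuples and weights are hypothetical examples.

def pareto_frontier(models):
    """models: name -> (f1, latency_ms, cost_usd). A model is kept if no
    other model is at least as good on all three objectives (higher F1,
    lower latency, lower cost) and strictly better on at least one."""
    front = {}
    for name, (f1, lat, cost) in models.items():
        dominated = any(
            of1 >= f1 and olat <= lat and ocost <= cost
            and (of1, olat, ocost) != (f1, lat, cost)
            for oname, (of1, olat, ocost) in models.items()
            if oname != name
        )
        if not dominated:
            front[name] = (f1, lat, cost)
    return front

def utility(model, alpha, beta, gamma):
    """Scalarized utility; weights encode the deployment regime
    (accuracy-first, latency-critical, or cost-sensitive)."""
    f1, lat, cost = model
    return alpha * f1 - beta * lat - gamma * cost
```

Sweeping (α, β, γ) over a grid recovers the regime analysis: an accuracy‑first regime sets β and γ near zero, while a cost‑sensitive regime inflates γ until only the cheapest non‑dominated models remain attractive.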
Results show that fine‑tuned BERT‑family encoders dominate across almost all scenarios. They achieve macro F1 scores typically above 0.90, latency in the 25‑35 ms range on commodity CPUs/GPUs, and inference costs that are one to two orders of magnitude lower than LLMs (≈ $0.01‑$0.10 per million tokens versus $10‑$30 for LLM APIs). LLMs, even when using few‑shot prompts, lag behind in accuracy (F1 0.78‑0.85), incur substantially higher latency (300‑600 ms due to token‑by‑token generation), and generate recurring usage‑based costs that quickly become prohibitive at scale. Moreover, LLM APIs suffer from variability: provider‑side model updates, pricing changes, and network conditions can alter behavior, undermining reproducibility and auditability—key concerns for regulated or high‑risk applications.
The paper does identify niches where LLMs retain value. In ultra‑low‑data regimes where rapid domain adaptation is needed, a well‑crafted prompt can be deployed without any fine‑tuning effort. For tasks that require generative output beyond a single label (e.g., sentiment with rationale, multi‑label explanations), LLMs’ generative flexibility is advantageous. The authors also explore hybrid architectures—using an encoder for the primary classification and an LLM for post‑processing or confidence calibration—which can achieve points near the Pareto frontier, balancing accuracy, latency, and cost.
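The confidence‑gated hybrid described above can be sketched in a few lines. `encoder_predict` and `llm_classify` are hypothetical stand‑ins for a fine‑tuned encoder head and a prompted LLM call; the threshold value is an assumption a practitioner would tune on held‑out data:

```python
# Sketch of a confidence-gated hybrid: the fast encoder handles the bulk
# of traffic, and only low-confidence inputs escalate to the LLM.
# Both callables and the threshold are illustrative placeholders.

def route(text, encoder_predict, llm_classify, threshold=0.9):
    """encoder_predict(text) -> (label, confidence in [0, 1]);
    llm_classify(text) -> label. Returns (label, which_model_answered)."""
    label, confidence = encoder_predict(text)
    if confidence >= threshold:
        return label, "encoder"
    return llm_classify(text), "llm"
```

Because the encoder resolves the high‑confidence majority of inputs at encoder cost and latency, the blended operating point lands near the Pareto frontier: LLM spend is incurred only on the ambiguous tail.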
Beyond empirical findings, the authors contribute a reproducible benchmark suite: code, data splits, prompt templates, latency measurement scripts, and cost‑calculation utilities are released publicly. This knowledge artifact is intended to serve as a living decision‑support resource that can be updated as new models, pricing structures, or hardware become available.
In conclusion, the study provides strong evidence that for standard fixed‑label text classification workloads, fine‑tuned encoder models are the most cost‑effective and operationally robust choice. LLMs should be viewed as complementary components—useful for rapid prototyping, generative extensions, or hybrid pipelines—rather than default replacements. The multi‑objective framework and released artifacts give practitioners a concrete methodology to align model selection with real‑world production constraints.