3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
📝 Original Info
- Title: 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
- ArXiv ID: 2510.18905
- Date: 2025-10-21
- Authors: Not available (the paper does not provide author information.)
📝 Abstract
AI inference scaling is often tuned through 1D heuristics (a fixed reasoning pass) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling parameter $k$. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy maximization remains favorable when accuracy is prioritized. Our results further show that smaller models, when combined with optimal inference scaling, can match or exceed the performance of larger models at a fraction of the cost. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational conditions.
💡 Deep Analysis
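To make the abstract's idea of constraint-aware selection of $k$ concrete, here is a minimal Python sketch. It is not the paper's implementation: the accuracy model (best-of-k coverage, 1 - (1 - p)^k), the linear cost and latency models, the closest-to-ideal-point knee heuristic, and every function and parameter name below are assumptions chosen for illustration only.

```python
import math

def candidates(p, unit_cost, base_latency, step_latency, k_max):
    """Toy model (assumption): accuracy follows best-of-k coverage,
    while cost and latency grow linearly with k."""
    rows = []
    for k in range(1, k_max + 1):
        acc = 1.0 - (1.0 - p) ** k
        cost = unit_cost * k
        lat = base_latency + step_latency * k
        rows.append((k, acc, cost, lat))
    return rows

def pareto_front(rows):
    """Keep rows that no other row dominates
    (higher accuracy, lower cost, lower latency)."""
    def dominates(a, b):
        no_worse = a[1] >= b[1] and a[2] <= b[2] and a[3] <= b[3]
        better = a[1] > b[1] or a[2] < b[2] or a[3] < b[3]
        return no_worse and better
    return [r for r in rows if not any(dominates(o, r) for o in rows)]

def knee_point(front):
    """One common knee heuristic (assumption): after min-max normalisation,
    pick the point closest to the ideal corner (acc=1, cost=0, latency=0)."""
    def norm(vals):
        lo, hi = min(vals), max(vals)
        span = hi - lo
        return [(v - lo) / span if span > 0 else 0.0 for v in vals]
    acc = norm([r[1] for r in front])
    cost = norm([r[2] for r in front])
    lat = norm([r[3] for r in front])
    dists = [math.sqrt((1.0 - a) ** 2 + c ** 2 + l ** 2)
             for a, c, l in zip(acc, cost, lat)]
    return front[dists.index(min(dists))]

def select_k(p, unit_cost, base_latency, step_latency, k_max,
             max_cost, max_latency):
    """Filter by the cost and latency caps, then return the knee of the
    feasible Pareto front as (k, accuracy, cost, latency), or None."""
    rows = candidates(p, unit_cost, base_latency, step_latency, k_max)
    feasible = [r for r in rows if r[2] <= max_cost and r[3] <= max_latency]
    if not feasible:
        return None
    return knee_point(pareto_front(feasible))

if __name__ == "__main__":
    print(select_k(p=0.3, unit_cost=1.0, base_latency=0.5,
                   step_latency=0.2, k_max=16,
                   max_cost=10.0, max_latency=3.0))
```

In this toy setting every feasible $k$ lies on the Pareto front, because accuracy, cost, and latency all rise together, so the knee heuristic does the real work: it stops increasing $k$ once the normalised accuracy gain no longer offsets the added cost and latency. A 1D rule (fixed $k$) or a 2D accuracy-versus-cost trade-off would not see the latency cap at all.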
📄 Full Content
Reference
This content is AI-processed based on open access ArXiv data.