CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Personalization has become crucial for adapting models to the diverse and evolving needs of users across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment within a single model, they struggle to achieve both real-time and high-quality personalization under the resource and privacy constraints of personal devices. To address this challenge, we propose CoSteer, a collaborative framework that enables tuning-free, real-time personalization via decoding-time adaptation. By leveraging logit differences between context-aware and context-agnostic local small models, CoSteer steers cloud-based large models, achieving effective personalization while preserving the large model’s capabilities. Personalization is handled locally; only the final token choices are sent to the cloud, which keeps the user context private and the system efficient. Through extensive experiments across a wide range of tasks, we demonstrate that CoSteer generates high-quality personalized content with low computational overhead. Our results highlight its robustness across models and environments, confirming its practical applicability in real-world scenarios.
💡 Research Summary
CoSteer tackles the pressing need for real‑time, high‑quality personalization on resource‑constrained edge devices while preserving user privacy. The authors observe that existing personalization approaches fall into two camps: (1) training‑time methods that fine‑tune large language models (LLMs) with user data, which are computationally heavy and expose private information, and (2) inference‑time techniques that inject user context via prompts or logit adjustments, but require transmitting the raw context to the cloud. Both suffer from a trade‑off between quality, latency, and privacy.
The proposed solution is a collaborative edge‑cloud framework that steers a cloud‑hosted LLM using “local delta steering.” A small language model (SLM) runs on the user’s device and produces two probability distributions over the vocabulary for the same prefix: (i) a base distribution π_base conditioned only on the user query and (ii) a personalized distribution π_pers conditioned on both the query and the private user context. The log‑probability difference Δlog π = log π_pers − log π_base serves as a steering signal that encodes the direction in which the generation should be biased to reflect the user’s preferences. Because the SLM and the LLM share the same token space, this delta can be applied directly to the LLM’s logits without ever sending the private context to the cloud.
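The on-device delta computation can be illustrated with a minimal numpy sketch. The logit vectors below are hypothetical stand-ins for two SLM forward passes over the same prefix (one with, one without the private context); a real system would obtain them from the local model.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a logit vector.
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

# Hypothetical SLM logits for the same prefix (toy 5-token vocabulary):
# one pass conditioned on the query only, one on query + private context.
logits_base = np.array([2.0, 1.0, 0.5, -1.0, 0.0])  # context-agnostic
logits_pers = np.array([1.0, 3.0, 0.5, -1.0, 0.0])  # context-aware

# Steering signal: delta log pi = log pi_pers - log pi_base,
# computed entirely on-device; only this delta leaves the SLM.
delta = log_softmax(logits_pers) - log_softmax(logits_base)
```

Here token 1 is favored by the personalized pass, so `delta[1]` is positive and biasing the cloud LLM's logits by `delta` shifts probability mass toward it.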
To integrate the steering signal into generation, the authors formulate decoding as an online learning problem. At each token step t they define a utility function
u_t(a) = α·(log π_t(a) − log π_base(a)) + β·Δlog π(a)
where π_t is the current target policy, π_base is the LLM’s base policy, and α, β balance the influence of the LLM’s own confidence and the SLM‑derived delta. They then maximize a regularized objective U_t(π) = u_t(π) − λ KL(π‖π_0) with π_0 = π_base, which keeps the updated policy close to the original LLM distribution.
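The utility and the KL-regularized objective can be written out in a few lines of numpy. This is a sketch under assumed toy values: the distributions, the steering signal `delta`, and the hyperparameters α, β, λ are all placeholders, not values from the paper.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions with full support.
    return float(np.sum(p * np.log(p / q)))

# Hypothetical distributions over a 5-token vocabulary.
pi_base = np.array([0.40, 0.30, 0.15, 0.10, 0.05])  # LLM base policy pi_0
pi_t    = np.array([0.35, 0.35, 0.15, 0.10, 0.05])  # current target policy
delta   = np.array([-1.0, 1.5, 0.0, -0.5, 0.0])     # SLM-derived steering signal

alpha, beta, lam = 1.0, 1.0, 1.0

# Per-token utility: u_t(a) = alpha*(log pi_t(a) - log pi_base(a)) + beta*delta(a)
u = alpha * (np.log(pi_t) - np.log(pi_base)) + beta * delta

# Regularized objective: U_t(pi) = <pi, u> - lambda * KL(pi || pi_base)
def objective(pi):
    return float(pi @ u) - lam * kl(pi, pi_base)
```

The KL term penalizes any candidate policy for drifting from π_base, so maximizing `objective` trades personalization gain against fidelity to the original LLM distribution.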
Using the Follow‑the‑Regularized‑Leader (FTRL) algorithm, the authors derive a closed‑form update for the token policy:
π_t(a) ∝ π_base(a) · exp(u_t(a) / λ)

That is, the LLM’s base distribution reweighted by the exponentiated utility: the SLM-derived delta tilts each decoding step toward the user’s preferences, while the KL regularizer keeps the updated policy anchored to π_base.
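This exponentiated-reweighting form can be checked numerically. The sketch below uses hypothetical values for the base policy and utilities; the function name `ftrl_update` is our own label, not from the paper.

```python
import numpy as np

def ftrl_update(pi_base, u, lam=1.0):
    """Closed-form maximizer of <pi, u> - lam * KL(pi || pi_base):
    reweight the base policy by exp(u / lam) and renormalize."""
    w = pi_base * np.exp(u / lam)
    return w / w.sum()

# Hypothetical base policy and per-token utilities (5-token vocabulary).
pi_base = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
u = np.array([-1.0, 1.5, 0.0, -0.5, 0.0])

pi_new = ftrl_update(pi_base, u, lam=1.0)
```

Because token 1 carries the largest utility, its probability mass grows after the update, and the resulting policy attains at least the objective value of π_base, as the closed form guarantees.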