Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation


Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user’s long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.


💡 Research Summary

Conv‑FinRe is a newly proposed benchmark that moves financial recommendation evaluation beyond the traditional behavior‑centric paradigm. Most existing recommender benchmarks treat user clicks, ratings, or purchases as the sole ground truth, implicitly assuming that these actions perfectly reflect user utility. In the domain of stock investment, this assumption breaks down because observed trades are heavily influenced by short‑term market noise, emotions, and temporary constraints, which may diverge from a user’s long‑term risk‑adjusted return objectives. Conv‑FinRe addresses this gap by introducing a multi‑view, conversational, and longitudinal evaluation framework.

The benchmark consists of three core components: (1) an onboarding interview that elicits a user’s static profile (financial background, investment goals, and initial risk tolerance); (2) step‑wise market context that provides daily and intraday price data for a curated set of ten S&P 500 stocks, stratified by beta into low, medium, and high volatility groups; and (3) a simulated advisory dialogue in which an LLM interacts with three “expert” advisors—one representing rational utility, one representing pure market momentum, and one representing risk‑sensitivity. At each decision step the LLM receives the current market state, the longitudinal interaction history, and the three expert recommendations, and must output a ranked list of the candidate stocks.
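As a rough sketch, one decision step in such a benchmark could be represented with a small data structure. All field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionStep:
    """One step of the longitudinal evaluation (hypothetical layout)."""
    day: int            # index within the 30-day horizon
    market_state: dict  # ticker -> recent price/return features
    history: list       # prior (day, model ranking) pairs
    expert_recs: dict   # rankings from the three expert advisors

@dataclass
class Episode:
    """A full participant trajectory: static profile plus daily steps."""
    profile: dict                       # onboarding interview answers
    steps: list = field(default_factory=list)

step = DecisionStep(
    day=1,
    market_state={"AAPL": {"ret_7d": 0.021}},
    history=[],
    expert_recs={"utility": ["AAPL"], "momentum": ["AAPL"], "risk": ["AAPL"]},
)
episode = Episode(profile={"risk_tolerance": "low"}, steps=[step])
```

At each step the model would see `profile`, `market_state`, `history`, and `expert_recs`, and emit a ranked list over the ten candidate stocks.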

Crucially, Conv‑FinRe defines four complementary reference rankings (views):

  • User Choice (y_user) – the empirical ordering derived from the participant’s actual selections.
  • Rational Utility (y_util) – an idealized ordering based on a calibrated utility function that balances expected return against volatility and downside risk.
  • Market Momentum (y_mom) – a profit‑oriented ordering that ranks stocks solely by recent cumulative returns.
  • Risk Sensitivity (y_safe) – a conservative ordering that isolates the user’s risk‑avoidance component by penalizing variance and drawdown according to inferred personal risk parameters.

These views are intentionally conflicting, allowing researchers to diagnose whether a model is over‑relying on market trends, merely mimicking noisy user behavior, or genuinely performing rational utility‑maximizing reasoning.
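Two of these views can be sketched directly from return series. The snippet below is a minimal, assumed construction of the momentum view (rank by cumulative return) and the risk-sensitivity view (penalize volatility and drawdown); the `lam` and `gamma` arguments stand in for the inferred per-user risk parameters described later:

```python
import numpy as np

def momentum_ranking(returns: dict) -> list:
    """y_mom: rank tickers by recent cumulative return, highest first."""
    cum = {t: float(np.prod(1 + np.asarray(r)) - 1) for t, r in returns.items()}
    return sorted(cum, key=cum.get, reverse=True)

def max_drawdown(r) -> float:
    """Largest peak-to-trough loss of the cumulative wealth path."""
    wealth = np.cumprod(1 + np.asarray(r))
    peak = np.maximum.accumulate(wealth)
    return float(np.max((peak - wealth) / peak))

def risk_ranking(returns: dict, lam: float = 1.0, gamma: float = 1.0) -> list:
    """y_safe: rank by a pure risk penalty on volatility and drawdown."""
    score = {t: -lam * float(np.std(r)) - gamma * max_drawdown(r)
             for t, r in returns.items()}
    return sorted(score, key=score.get, reverse=True)

rets = {"LOW": [0.001] * 7,
        "HIGH": [0.05, -0.04, 0.06, -0.05, 0.04, -0.03, 0.05]}
print(momentum_ranking(rets))  # the high-return ticker ranks first
print(risk_ranking(rets))      # the low-volatility ticker ranks first
```

On this toy input the two views disagree by construction, which is exactly the property the benchmark exploits for diagnosis.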

To construct the rational utility view without exposing the latent utility to the model, the authors employ inverse optimization. They assume that each user’s decisions follow a multinomial logit model with Gumbel‑distributed error terms, where the systematic component is a linear utility:

U(s) = μ̃_s − λ·σ̃_s − γ·D̃_s

Here μ̃_s, σ̃_s, and D̃_s denote the standardized mean return, volatility, and maximum drawdown over a 7‑day window, while λ and γ capture the user’s sensitivity to volatility and downside risk. By minimizing a regularized negative log‑likelihood over the entire interaction horizon, the benchmark estimates (λ, γ) for each participant. These parameters are then used to generate the y_util and y_safe rankings, but they remain hidden from the LLM, preserving the “latent‑utility” challenge.
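The inverse-optimization step can be sketched as follows. This is not the paper's actual optimizer (which is unspecified in the summary); it is a minimal grid-search fit of (λ, γ) under the stated multinomial-logit model, with the systematic utility μ̃ − λ·σ̃ − γ·drawdown:

```python
import numpy as np

def mnl_nll(lam, gamma, features, choices, reg=1e-3):
    """Regularized negative log-likelihood of a multinomial logit whose
    systematic utility is U = mu - lam*sigma - gamma*drawdown.
    features: (T, S, 3) standardized (mu, sigma, drawdown) per step/stock;
    choices:  length-T array of chosen stock indices."""
    mu, sigma, dd = features[..., 0], features[..., 1], features[..., 2]
    util = mu - lam * sigma - gamma * dd                    # (T, S)
    m = util.max(axis=1, keepdims=True)                     # log-sum-exp trick
    log_z = np.log(np.exp(util - m).sum(axis=1)) + m.ravel()
    ll = util[np.arange(len(choices)), choices] - log_z
    return -ll.sum() + reg * (lam**2 + gamma**2)

def fit_risk_params(features, choices, grid=np.linspace(0.0, 3.0, 31)):
    """Illustrative inverse optimization: pick (lam, gamma) on a grid
    that minimizes the regularized NLL of the observed choices."""
    best = min((mnl_nll(l, g, features, choices), l, g)
               for l in grid for g in grid)
    return best[1], best[2]

# Synthetic demo: choices generated from a known (lam, gamma).
rng = np.random.default_rng(0)
feats = rng.standard_normal((60, 10, 3))
true_util = feats[..., 0] - 1.0 * feats[..., 1] - 0.5 * feats[..., 2]
choices = true_util.argmax(axis=1)
lam_hat, gam_hat = fit_risk_params(feats, choices)
print(lam_hat, gam_hat)
```

The grid search keeps the sketch dependency-free; any continuous optimizer over the same objective would serve equally well.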

Data collection involved ten participants who completed a detailed questionnaire (demographics, financial capacity, investment experience, and risk attitudes) designed according to MiFID II and FINRA suitability guidelines. Participants then engaged in a 30‑day simulated trading environment where they could incrementally buy the ten curated stocks, receive portfolio‑level feedback (realized returns, volatility, drawdown), and have their actions logged. The authors released the simulation tool for reproducibility.

Instead of collecting free‑form dialogues, the authors transformed each participant’s static profile and longitudinal trajectory into a structured, reproducible conversation. The onboarding phase consists of a four‑turn exchange that verbalizes the questionnaire responses. The longitudinal phase presents, at each day, the market snapshot, the three expert recommendations, and a short dialogue turn between the user and the LLM, thereby mimicking a realistic advisory session while keeping the evaluation deterministic.
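The deterministic templating can be illustrated with a toy onboarding template. The wording and profile fields below are invented for illustration and are not the benchmark's actual prompts:

```python
def onboarding_turns(profile: dict) -> list:
    """Verbalize questionnaire answers as a fixed four-turn exchange,
    so the same profile always yields the same conversation."""
    return [
        f"Advisor: What is your investment goal?\nUser: {profile['goal']}.",
        f"Advisor: How much investing experience do you have?\nUser: {profile['experience']}.",
        f"Advisor: What is your financial capacity?\nUser: {profile['capacity']}.",
        f"Advisor: How much risk are you comfortable with?\nUser: {profile['risk']}.",
    ]

turns = onboarding_turns({"goal": "long-term growth", "experience": "beginner",
                          "capacity": "moderate", "risk": "low"})
```

Because the mapping from profile to dialogue is a pure function, every model sees byte-identical conversations, which is what makes the evaluation deterministic.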

The benchmark was used to evaluate several state‑of‑the‑art LLMs, including GPT‑4, Claude‑2, and Llama‑2‑70B. Performance was measured with NDCG, Kendall‑τ, and normalized correlation against each of the four views. Key findings:

  • Models that excel on the rational‑utility view (e.g., GPT‑4) often diverge substantially from the user‑choice view, indicating that they capture the theoretical optimal trade‑off but may be overly influenced by market momentum or insufficiently respect the user’s inferred risk tolerance.
  • Models that align closely with user choices (e.g., Llama‑2‑70B) tend to overfit short‑term noise, showing weaker performance on the utility and risk‑sensitivity views.
  • Claude‑2 displayed a pronounced bias toward the market‑momentum view, aggressively recommending high‑return stocks during bullish periods while neglecting the user’s risk parameters.
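The ranking metrics used above are standard and easy to reproduce. The sketch below implements NDCG (with graded relevance derived from the reference order, one common convention) and Kendall‑τ for comparing a model ranking against any of the four reference views:

```python
import math
from itertools import combinations

def ndcg(model_rank, ref_rank):
    """NDCG of a model ranking against a reference ordering: the item at
    reference position i gets graded relevance S-1-i (top item highest)."""
    s = len(ref_rank)
    rel = {item: s - 1 - i for i, item in enumerate(ref_rank)}
    dcg = sum(rel[item] / math.log2(pos + 2)
              for pos, item in enumerate(model_rank))
    idcg = sum((s - 1 - i) / math.log2(i + 2) for i in range(s))
    return dcg / idcg

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two orderings of the same items:
    (concordant pairs - discordant pairs) / total pairs."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    net = sum(1 if pos_b[x] < pos_b[y] else -1 for x, y in pairs)
    return net / len(pairs)

ref = ["A", "B", "C", "D"]
print(ndcg(ref, ref))                          # 1.0 for a perfect ranking
print(kendall_tau(ref, ["D", "C", "B", "A"]))  # -1.0 for a full reversal
```

Computing both metrics against y_user and y_util for the same model output makes the reported utility-versus-behavior tension directly measurable.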

These results expose a fundamental tension between “rational decision quality” and “behavioral alignment” in financial LLM advisors. Conv‑FinRe therefore provides a diagnostic lens that can reveal whether an LLM’s misalignment stems from over‑reliance on market signals, failure to infer risk preferences, or simply mimicking noisy historical actions.

The authors publicly release the dataset on Hugging Face and the full codebase on GitHub, ensuring reproducibility and enabling the community to extend the benchmark. Limitations include the modest size of the stock universe (10 stocks), the short 30‑day horizon, and the synthetic nature of the dialogue generation, which may not capture the full richness of natural conversational advice. Moreover, the utility function is linear and limited to return, variance, and drawdown, omitting other realistic considerations such as taxes, liquidity, or multi‑objective preferences.

Future work suggested by the authors includes scaling the benchmark to larger, more diverse asset universes and longer horizons, collecting authentic free‑form advisory conversations, incorporating non‑linear or multi‑objective utility models, and exploring multi‑agent negotiation scenarios where the LLM must balance conflicting advice from multiple expert personas.

In summary, Conv‑FinRe is the first conversational, longitudinal benchmark that evaluates LLM‑based financial recommendation against both observed user behavior and a rigorously inferred utility ground truth. By providing multi‑view evaluation, inverse‑optimization‑derived risk parameters, and a reproducible conversational simulation, it offers a powerful platform for developing and assessing truly utility‑grounded AI financial advisors.

