Meta-Learning Reinforcement Learning for Crypto-Return Prediction
Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a unified transformer-based architecture that combines meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles (actor, judge, and meta-judge) in a closed-loop architecture. This learning process requires no additional human supervision: the agent leverages multimodal market inputs and internal preference feedback, continuously refining both its trading policy and its evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto performs well on real-market technical indicators and outperforms other LLM-based baselines.
💡 Research Summary
The paper introduces Meta‑RL‑Crypto, a unified transformer‑based framework that merges meta‑learning with reinforcement learning (RL) to create a self‑improving cryptocurrency trading agent. Starting from an instruction‑tuned LLM (Llama‑7B), the system cycles the same model through three distinct roles: Actor, Judge, and Meta‑Judge. The Actor receives multimodal market inputs—on‑chain metrics (transaction counts, wallet activity, gas fees), off‑chain news articles, and sentiment scores—encoded as structured prompts, and generates K candidate next‑day forecasts for selected crypto assets (BTC, ETH, SOL). Each candidate is evaluated by the Judge using a multi‑objective reward vector comprising five channels: (1) Return‑Based Reward (realized net profit after fees), (2) Sharpe‑Based Reward (incremental risk‑adjusted return), (3) Drawdown Penalty, (4) Liquidity Bonus (based on on‑chain volume and estimated slippage), and (5) Sentiment Alignment (cosine similarity between the model’s textual rationale and a frozen sentiment‑LM derived sentiment vector). These channels are normalized to
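The five reward channels described above can be sketched as a single vector-valued function. This is a minimal illustrative sketch, not the paper's implementation: the function name, argument names, and the exact channel formulas (e.g. using the negated absolute drawdown as the penalty) are assumptions; only the five-channel structure and the cosine-similarity form of the sentiment-alignment term come from the summary.

```python
import numpy as np

def reward_vector(net_return, sharpe_delta, max_drawdown,
                  liquidity_score, rationale_vec, sentiment_vec):
    """Assemble the five-channel reward vector (illustrative sketch).

    net_return      -- (1) realized net profit after fees
    sharpe_delta    -- (2) incremental risk-adjusted return
    max_drawdown    -- (3) peak-to-trough loss, penalized below
    liquidity_score -- (4) bonus from on-chain volume / slippage estimate
    rationale_vec   -- embedding of the model's textual rationale
    sentiment_vec   -- frozen sentiment-LM sentiment vector
    """
    # (5) Sentiment alignment: cosine similarity between the rationale
    # embedding and the frozen sentiment-LM vector, as in the summary.
    cos_sim = float(
        np.dot(rationale_vec, sentiment_vec)
        / (np.linalg.norm(rationale_vec) * np.linalg.norm(sentiment_vec))
    )
    return np.array([
        net_return,            # (1) return-based reward
        sharpe_delta,          # (2) Sharpe-based reward
        -abs(max_drawdown),    # (3) drawdown penalty (assumed sign convention)
        liquidity_score,       # (4) liquidity bonus
        cos_sim,               # (5) sentiment alignment
    ])
```

In the full system, the Judge would score each of the Actor's K candidate forecasts with such a vector before the channels are normalized and aggregated.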