SkyNet: Belief-Aware Planning for Partially-Observable Stochastic Games
Authors: Adam Haile
MSOE Artificial Intelligence Research Paper

SkyNet: Belief-Aware Planning in Partially Observable Stochastic Games

Adam Haile
Dwight and Dian Diercks School of Advanced Computing
Milwaukee School of Engineering
1025 N Broadway St, Milwaukee, WI 53202
hailea@msoe.edu

Abstract

In 2019, Google DeepMind released MuZero, a model-based reinforcement learning method that achieves strong results in perfect-information games by combining learned dynamics models with Monte Carlo Tree Search (MCTS). However, comparatively little work has extended MuZero to partially observable, stochastic, multi-player environments, where agents must act under uncertainty about hidden state. Such settings arise not only in card games but in domains such as autonomous negotiation, financial trading, and multi-agent robotics. In the absence of explicit belief modeling, MuZero's latent encoding has no dedicated mechanism for representing uncertainty over unobserved variables. To address this, we introduce SkyNet (Belief-Aware MuZero), which adds ego-conditioned auxiliary heads for winner prediction and rank estimation to the standard MuZero architecture. These objectives encourage the latent state to retain information predictive of outcomes under partial observability, without requiring explicit belief-state tracking or changes to the search algorithm. We evaluate SkyNet on Skyjo, a partially observable, non-zero-sum, stochastic card game, using a decision-granularity environment, transformer-based encoding, and a curriculum of heuristic opponents with self-play. In 1000-game head-to-head evaluations at matched checkpoints, SkyNet achieves a 75.3% peak win rate against the baseline (+194 Elo, p < 10^-50). SkyNet also outperforms the baseline against heuristic opponents (0.720 vs. 0.466 win rate).
Critically, the belief-aware model initially underperforms the baseline but decisively surpasses it once training throughput is sufficient, suggesting that belief-aware auxiliary supervision improves learned representations under partial observability, but only given adequate data flow.

1 Introduction

Game-playing AI has long served as a benchmark for progress in machine learning. From Deep Blue to AlphaZero [2], AI systems have achieved superhuman performance in deterministic, perfect-information two-player games. MuZero [1] extended this line of work by learning a dynamics model entirely from experience, obviating the need for knowledge of game rules during planning. These advances were powered by Monte Carlo Tree Search (MCTS) [3], which uses learned policy and value estimates to guide search through a game tree.

However, many games and decision problems are partially observable, stochastic, and multi-player. These properties fundamentally challenge classical MuZero's assumptions. In such settings, an agent cannot observe the full game state, must reason about hidden information, and faces chance events whose outcomes are unknown until they occur. Card games represent an important class of such imperfect-information domains, as demonstrated by landmark results in poker [20, 21, 22].

Skyjo is a multi-player card game (2–8 players) in which each player manages a 3 × 4 grid of cards, many of which begin face-down. Players take turns drawing from a shared deck or discard pile, making decisions about which cards to reveal, keep, or replace. The game combines risk management, information gathering, and strategic timing: a player who reveals all their cards triggers the end of a round, but faces a score-doubling penalty if they do not hold the lowest score. Play continues across rounds until a cumulative score threshold is reached.
This combination of hidden information, stochastic draws, multi-player dynamics, and non-zero-sum scoring makes Skyjo a compelling testbed for advancing model-based reinforcement learning beyond the perfect-information regime. The partially observable Markov decision process (POMDP) framework [4] provides the theoretical foundation for reasoning in such settings, where optimal behavior requires maintaining beliefs over hidden state.

This paper makes the following contributions:

1. A complete, deterministic, seedable Skyjo environment with a decision-granularity action decomposition suitable for MuZero-style training.
2. Belief-Aware MuZero, an extension of MuZero that adds ego-conditioned winner and rank prediction heads to improve representation learning under partial observability.
3. An empirical comparison between classical MuZero and Belief-Aware MuZero on Skyjo, demonstrating statistically significant superiority of the belief-aware variant in head-to-head play.
4. An analysis of training dynamics, stability, and the role of data throughput in realizing the benefits of belief-aware auxiliary supervision.

2 Related Work

2.1 MuZero and Model-Based Reinforcement Learning

MuZero [1] learns a latent dynamics model purely from interaction data and achieved strong results in Go, Chess, Shogi, and Atari under perfect information and deterministic or near-deterministic transitions. The architecture comprises a representation network, a dynamics network, and a prediction network; MCTS [3] uses these to plan in latent space. Grimm et al. [14] formalized that models need only be value-equivalent rather than observation-reconstructive, which is relevant in card games where full hidden-state prediction is impossible. MuZero's original formulation does not address partial observability.
Extensions such as Stochastic MuZero [7] (chance events and card draws), Gumbel MuZero [8] (limited simulation budgets), and MuZero Reanalyse [9] (target staleness) address other limitations but not hidden information. EfficientZero [10] and self-predictive representation learning [15] show that auxiliary representation-shaping losses improve sample efficiency; the UNREAL framework [16] established that auxiliary predictive tasks help when extrinsic reward is sparse.

2.2 Partial Observability and Belief Modeling

In partially observable environments, optimal decision-making requires reasoning about the distribution over hidden states (the belief state) [4]. Belief-state and information-set methods are powerful but heavyweight: POMCP [5] uses particle filtering for belief updates; IS-MCTS [6] operates over information sets; ReBeL [11] and Student of Games [12] combine learned models with game-theoretic search over public belief states; BetaZero [19] runs MuZero-style iteration directly in belief space. These approaches substantially improve performance under hidden information (e.g., poker [20, 21, 22]) but add significant algorithmic and computational cost. A lighter alternative is to treat MuZero's recurrent hidden state as an implicit belief representation. Our work fills the gap between full belief-state planning and unmodified MuZero: we add auxiliary outcome-prediction heads that shape the latent state for partial observability without maintaining explicit belief distributions or changing the search algorithm.

3 Environment Design

3.1 Skyjo Game Rules

Skyjo is a card game for 2–8 players. The deck contains 150 cards with values ranging from −2 to 12, distributed as follows: five cards of value −2, ten cards each of value −1 and of each value 1 through 12, and fifteen cards of value 0. Each player receives 12 cards arranged in a 3 × 4 grid, all initially face-down (see Figure 1 for an example of a player's grid during play).
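As a concreteness check, the deck composition above pins down exactly 150 cards. A minimal sketch of building and verifying that composition (the function name is ours, not part of the paper's environment API):

```python
from collections import Counter

def build_skyjo_deck():
    """Build the 150-card Skyjo deck described in Section 3.1."""
    deck = [-2] * 5           # five cards of value -2
    deck += [-1] * 10         # ten cards of value -1
    deck += [0] * 15          # fifteen cards of value 0
    for v in range(1, 13):    # ten cards of each value 1 through 12
        deck += [v] * 10
    return deck

deck = build_skyjo_deck()
counts = Counter(deck)
assert len(deck) == 150
assert counts[-2] == 5 and counts[-1] == 10 and counts[0] == 15
assert all(counts[v] == 10 for v in range(1, 13))
```

A seeded shuffle of this list is all a deterministic, seedable environment needs for reproducible deals.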
At the start of a round, each player reveals two cards. The player with the highest sum of revealed cards takes the first turn. On each turn, a player must either:

• Draw the top card from the discard pile and replace any card in their grid (the replaced card is discarded face-up), or
• Draw from the deck, observe the card's value, then either:
  – Keep the drawn card and replace any grid card, or
  – Discard the drawn card and flip one face-down card face-up.

Figure 1: A player's 3 × 4 card grid during a game of Skyjo, with the deck shown at right. Cards range from −2 to 12, with lower total scores being more desirable. Image credit: 3rd Grade Thoughts (https://www.3rdgradethoughts.com/2019/01/board-game-review-skyjo.html).

When all three cards in a column are face-up and share the same value, that column is removed from the grid and no longer contributes to the player's score. When a player has all remaining cards face-up, the round ends: each other player gets one final turn. The round-ending player's score is doubled if they do not hold the strictly lowest score. Play continues across rounds until any player's cumulative score reaches 100; the player with the lowest cumulative score wins.

3.2 Decision-Granularity Action Decomposition

A naive approach to modeling Skyjo turns would compress an entire turn into a single macro-action. However, this is fundamentally flawed because the value of a card drawn from the deck is unknown at the time the draw decision is made, so the agent must observe the drawn card before deciding whether to keep or discard it. This is the same insight that motivates Stochastic MuZero's [7] separation of decision nodes from chance nodes, and Sampled MuZero's [17] decoupling of available actions from actions considered in search. We decompose each turn into a sequence of decision points:

• Phase A – Choose Source: The agent selects between the deck and the discard pile (|A| = 2).
• Phase B – Keep or Discard: If the agent drew from the deck, the card value is revealed. The agent then decides to keep or discard the drawn card (|A| = 2).
• Phase C – Choose Position: The agent selects a grid position for replacement or flipping (|A| = 12).

This yields a masked 16-action policy space where legal actions depend on the current decision phase. The decomposition ensures correct information flow: the agent's keep-or-discard decision is conditioned on the observed card value, and MCTS can branch meaningfully at each decision point.

3.3 Observation Model

The environment provides each agent with a partial observation consisting of:

• Board tokens: For each player, 12 tokens encoding position, owner, visibility flag, and card value (or a special UNKNOWN sentinel for face-down cards).
• Discard token: The top card value and discard pile size.
• Global token: Deck size, step count, current player, turn phase, round phase bucket (early/mid/late), and round index.
• Action history tokens: The last K = 16 public actions, each encoding the actor, action type, source, target position, and card values involved.
• Decision token: The current decision phase and the drawn card value (if applicable).

This token-based representation is designed for direct consumption by a transformer encoder and scales naturally with the number of players.

4 Methodology

4.1 Classical MuZero Baseline

Our baseline follows the standard MuZero architecture [1], which learns a latent dynamics model of the environment without access to its rules. Unlike model-free approaches that map observations directly to actions, MuZero constructs an internal model that can be used for lookahead planning via MCTS. The key insight, formalized by the value-equivalence principle [14], is that this internal model does not need to reconstruct observations faithfully; it only needs to support accurate value and policy predictions.
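The value-equivalent planning loop implied by this design can be sketched schematically: encode the observation once, then unroll entirely in latent space with no further environment access. The function names f_rep, f_dyn, and f_pred mirror the components described in this section; the toy linear "networks" below are placeholders of our own, not the paper's models.

```python
def f_rep(obs):
    """Toy representation network: observation -> latent state."""
    return [sum(obs) / len(obs)] * 4

def f_dyn(h, a):
    """Toy dynamics network: (latent, action) -> (next latent, reward)."""
    return [x + 0.1 * a for x in h], 0.0

def f_pred(h):
    """Toy prediction network: latent -> (policy logits over 16 actions, value)."""
    s = sum(h)
    return [s] * 16, s

def latent_unroll(obs, actions):
    """MuZero-style unroll: after the initial encoding, planning touches
    only f_dyn and f_pred, never the environment or raw observations."""
    h = f_rep(obs)
    outputs = []
    for a in actions:
        policy_logits, value = f_pred(h)
        outputs.append((policy_logits, value))
        h, reward = f_dyn(h, a)
    return outputs

traj = latent_unroll([1, 2, 3, 4], actions=[0, 5, 2])
assert len(traj) == 3 and len(traj[0][0]) == 16
```

The point of the sketch is structural: value equivalence means these three maps only need to make the policy and value outputs accurate, not to reconstruct the observation.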
The architecture decomposes into three learned components that together define a self-contained planning engine in latent space. Figure 2 illustrates the baseline architecture alongside the belief-aware extension described in Section 4.2.

Figure 2: Architecture comparison between baseline MuZero (left) and Belief-Aware MuZero / SkyNet (right). Both share the same representation, dynamics, and prediction networks. SkyNet adds an ego conditioning layer that injects player identity before prediction, and two auxiliary heads (winner and rank) that shape the latent representation via outcome-prediction objectives. Pink-shaded outputs indicate the additional belief heads.

4.1.1 Representation Network

The representation network h_0 = f_rep(o_t) maps a tokenized observation to a latent state vector. We use a transformer encoder operating over the concatenation of board tokens, discard tokens, global tokens, action history tokens, and decision tokens. Each token type is embedded through dedicated embedding layers. Board tokens combine embeddings for owner, position, visibility, and value. A learnable [CLS] token is prepended, and positional embeddings are added to the full sequence. The transformer encoder consists of 6 layers with 8 attention heads, GELU activations, and pre-norm architecture.
The [CLS] output is projected through a layer norm and a linear layer with Tanh activation to produce a latent vector h_0 ∈ R^512.

4.1.2 Dynamics Network

The dynamics network (h_{k+1}, r_k) = f_dyn(h_k, a_k) predicts the next latent state and reward given the current latent state and an action. The action is embedded via a learned embedding, concatenated with the current latent state, and processed through a two-hidden-layer MLP with GELU activations. A residual connection adds the MLP output to the input latent state, followed by layer normalization. A separate reward head predicts the reward as a categorical distribution over a discrete support.

4.1.3 Prediction Network

The prediction network (π_k, v_k) = f_pred(h_k) produces policy logits over the action space and a value estimate from the latent state. Both outputs are produced by separate MLPs. The value is represented as a categorical distribution over the support [−V_max, V_max] with V_max = 200. In this non-zero-sum multiplayer setting, the value head predicts the expected discounted return from the ego player's perspective (the player whose turn it is at the root of the current search). Training uses the same terminal reward (win/loss or score-based outcome) for that player; the value is not a zero-sum payoff and does not invert between players. MCTS uses this ego-conditioned value only for the acting player when comparing actions.

4.1.4 Training

Training follows the standard MuZero procedure [1]: MCTS is used during self-play to generate improved policy targets from visit count distributions, and n-step bootstrapped returns [18] provide value targets. The loss combines cross-entropy terms for the policy, value, and reward heads:

L = \sum_{k=0}^{K} \left( L^{(k)}_{\pi} + L^{(k)}_{v} \right) + \sum_{k=0}^{K-1} L^{(k)}_{r} + \lambda \lVert \theta \rVert^2    (1)

where K is the unroll length and λ controls weight decay regularization.
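The categorical value representation over [−V_max, V_max] can be illustrated with the standard "two-hot" projection used in MuZero-style agents, where a scalar target splits its probability mass between the two nearest integer support bins. This is a sketch of the general technique under our own naming; for simplicity it omits the invertible value-scaling transform that MuZero applies before projection.

```python
import math

def scalar_to_two_hot(x, v_max=200):
    """Project scalar x onto a categorical distribution over the integer
    support [-v_max, v_max], splitting mass between the two nearest bins."""
    x = max(-float(v_max), min(float(v_max), x))   # clamp to support
    probs = [0.0] * (2 * v_max + 1)
    lo = math.floor(x)
    frac = x - lo                                  # mass given to the upper bin
    lo_idx = lo + v_max                            # shift support to [0, 2*v_max]
    probs[lo_idx] = 1.0 - frac
    if frac > 0.0:
        probs[lo_idx + 1] = frac
    return probs

def two_hot_to_scalar(probs, v_max=200):
    """Expected value of the categorical distribution (inverse transform)."""
    return sum(p * (i - v_max) for i, p in enumerate(probs))

p = scalar_to_two_hot(3.7)
assert abs(two_hot_to_scalar(p) - 3.7) < 1e-9    # round trip is exact
```

Training then applies cross-entropy between predicted logits and this two-hot target, which is what makes the categorical value head a classification problem rather than a regression.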
4.2 Belief-Aware MuZero

The Belief-Aware MuZero variant extends the baseline with two key modifications: auxiliary prediction heads for winner and rank estimation, and ego-conditioned latent representations. This approach draws on the principle that auxiliary tasks improve representations when extrinsic reward is sparse [10, 16], while the ego conditioning mechanism is inspired by multi-player search methods that maintain player-specific value estimates [11, 12].

4.2.1 Ego Conditioning

In multi-player partially observable games, the value of a state depends critically on whose perspective it is evaluated from. We introduce ego conditioning by adding three learned embeddings to the latent state before prediction:

h_{\mathrm{cond}} = \mathrm{LayerNorm}(h + e_{\mathrm{ego}} + e_{\mathrm{current}} + e_{\mathrm{nplayers}})    (2)

where e_ego is the embedding of the ego player (the player whose value we are estimating), e_current is the embedding of the player currently acting, and e_nplayers encodes the total number of players. This conditioning allows a single network to produce player-specific predictions, enabling perspective augmentation during training, where each trajectory generates training examples from every player's viewpoint.

4.2.2 Winner and Rank Heads

Two auxiliary prediction heads are added:

• Winner head: Predicts P(player i finishes first) for each player, trained with cross-entropy loss against the observed game outcome.
• Rank head: Predicts the expected final rank distribution for each player, providing a denser training signal than the sparse binary win indicator.

Both heads operate on the ego-conditioned latent state and share the same MLP architecture as the policy and value heads. The total loss becomes:

L_{\mathrm{belief}} = L_{\mathrm{MuZero}} + \alpha \cdot L_{\mathrm{winner}} + \beta \cdot L_{\mathrm{rank}}    (3)

where α and β are loss weights that follow a ramping schedule to prevent auxiliary tasks from dominating early training.

4.2.3 Rationale

The auxiliary heads serve two purposes.
First, they encourage the latent state to retain information useful for outcome prediction under partial observability, plausibly including hidden-card and opponent-state structure. This acts as a form of self-supervised regularization on the latent space, analogous to the representation-shaping objectives in EfficientZero [10] and SPR [15]. Second, the winner head provides a direct estimate of win probability that can be used alongside the shaped value estimate during planning.

4.3 Training Pipeline

4.3.1 Self-Play with MCTS

Both models use MCTS [3] during self-play to generate training data. At each decision point, the agent runs N_sim simulations using the PUCT selection criterion with Dirichlet noise at the root for exploration. The visit count distribution provides the policy target, and the search value provides bootstrapped value targets. For the belief-aware model, MCTS incorporates ego conditioning: the ego player ID is propagated through dynamics unrolls so that value estimates remain perspective-consistent.

4.3.2 Curriculum and Opponent Pool

Training employs a curriculum of six hand-crafted heuristic opponents: greedy value replacement, information-first flip, column hunter, risk-aware unknown replacement, end-round aggro, and anti-discard. These bots exploit different strategic principles of Skyjo and provide diverse training signal during early iterations. As training progresses, a checkpoint opponent pool accumulates past versions of the agent. Self-play opponents are sampled from a mixture of the current policy (70%) and uniformly from the pool (30%), preventing the self-play collapse and oscillation that arise from training exclusively against the latest policy.

4.3.3 Replay Buffer and Sampling

Episodes are stored in a large replay buffer with phase-stratified sampling to ensure representation of all decision phases.
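Phase-stratified sampling can be sketched as follows: group buffered decision steps by their phase (A/B/C from Section 3.2) and draw from the phase groups in round-robin fashion so no phase dominates a batch. This is an illustration under our own naming, not the paper's replay buffer implementation.

```python
import random
from collections import defaultdict

def stratified_sample(buffer, batch_size, rng=None):
    """Sample a batch with roughly equal representation of each decision
    phase, regardless of how skewed the buffer's phase distribution is."""
    rng = rng or random.Random(0)
    by_phase = defaultdict(list)
    for step in buffer:
        by_phase[step["phase"]].append(step)
    phases = sorted(by_phase)
    batch = []
    for i in range(batch_size):
        phase = phases[i % len(phases)]   # round-robin over phases
        batch.append(rng.choice(by_phase[phase]))
    return batch

# Phase C steps dominate raw play (12 position choices per turn), but the
# sampled batch is balanced across A, B, and C.
buffer = [{"phase": p, "obs": i} for i, p in enumerate("AABBBCCCCCCC" * 10)]
batch = stratified_sample(buffer, 9)
assert sum(1 for s in batch if s["phase"] == "A") == 3
```

Within each phase, sampling here is uniform; prioritized weighting can be layered on top by replacing `rng.choice` with a weighted draw.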
Prioritized experience replay [23] principles guide sampling, with the buffer capacity set to accommodate thousands of episodes and a minimum warmup threshold ensuring sufficient data diversity before training begins.

4.3.4 Simulation Schedule

MCTS simulation counts follow a schedule that increases over training: 200 simulations per move for iterations 0–200, 400 for iterations 200–500, and 600 for iterations 500+. This balances computational cost against target quality, following the insight from Gumbel MuZero [8] that policy improvement quality depends critically on simulation budget.

4.3.5 Time Penalty

The training pipeline supports an optional per-step penalty λ_t added to rewards to discourage pathologically long games. In early experiments with this penalty enabled, both models would frequently collapse to policies that optimized only for the fastest possible game termination, a degenerate strategy that did not restabilize with continued training. The current models were trained without this penalty (λ_t = 0); the opponent pool and curriculum diversity proved sufficient to maintain healthy game lengths without requiring explicit time pressure.

5 Experimental Setup

5.1 Hyperparameters

Table 1 summarizes the key hyperparameters shared between models and those specific to each variant.

Table 1: Training hyperparameters.

Parameter                      | Baseline      | Belief-Aware
Iterations                     | 2000          | 2000
Self-play episodes/iter        | 32            | 32
Train steps/iter               | 96            | 96
Batch size                     | 64            | 64
Unroll steps / TD steps        | 8 / 10        | 8 / 10
Discount (γ)                   | 0.997         | 0.997
Learning rate                  | 3 × 10^-4     | 3 × 10^-4
Optimizer                      | AdamW         | AdamW
Transformer layers / heads     | 6 / 8         | 6 / 8
d_model / Latent dim           | 256 / 512     | 256 / 512
FF hidden dim                  | 1024          | 1024
MCTS sims (self-play / eval)   | 200–600 / 200 | 200–600 / 200
Dirichlet α / fraction         | 0.3 / 0.25    | 0.3 / 0.25
Winner loss weight (α)         | —             | 0.1 → 0.5
Rank loss weight (β)           | —             | 0.1 → 0.25
Action space                   | 16            | 16

5.2 Evaluation Protocol

Models are evaluated through two mechanisms:

1. Bot evaluation: Periodic evaluation against the curriculum of heuristic opponents, measuring win rate, mean score differential, and mean episode length.
2. Head-to-head comparison: Direct competition between baseline and belief-aware checkpoints at matched training iterations, with alternating seat assignments and 1000 games per comparison.

Win rate is determined by final Skyjo score (lowest score wins), computed regardless of whether the episode terminated naturally or was truncated. Truncation rate is tracked separately to ensure evaluation integrity. Evaluation uses greedy MCTS (temperature ≈ 0) with 200 simulations per move and no exploration noise. This protocol follows best practices from game evaluation frameworks such as OpenSpiel [24] and RLCard [25].

5.3 Comparison Fairness

The baseline and belief-aware models are compared under matched conditions. Both use the same training pipeline (self-play episodes per iteration, replay buffer, curriculum, opponent pool), the same core hyperparameters (learning rate, batch size, unroll steps, discount, MCTS simulation schedule), and the same evaluation protocol (200 sims per move, alternating seats, 1000 games per head-to-head). No additional hyperparameter tuning was performed for the belief-aware variant beyond the auxiliary loss weight ramp. Checkpoint selection uses the same iteration-matching protocol: head-to-head comparisons use the same training iteration for both agents. The belief-aware model adds only the ego conditioning embeddings and two auxiliary MLP heads (winner and rank) to the shared representation and dynamics; parameter counts are comparable. Thus the performance difference is attributable to the architectural change rather than capacity or tuning advantage.
6 Usage of ROSIE

Training MuZero-style agents on a complex partially observable game is computationally demanding: each iteration involves running MCTS-guided self-play episodes, updating the replay buffer, and performing gradient steps through a transformer-based representation network and unrolled dynamics. At this scale, compute access is a hard constraint on experimental velocity, not a convenience.

Training on MSOE's T4 teaching-cluster nodes required approximately 24 hours per 100 training iterations. Running the full 2000-iteration curriculum on such hardware alone would take roughly 480 hours (20 days) per model, serially, before any ablations or head-to-head evaluations could be conducted. MSOE's ROSIE supercomputer provides access to DGX H100 nodes, which achieved approximately 1.25× the throughput of the T4 nodes, reducing wall-clock time meaningfully. Attempting to train locally, whether on CPU or a consumer-grade GPU, produced iteration times at roughly 0.88× the speed of the teaching nodes, making sustained multi-thousand-iteration runs infeasible in practice.

Beyond raw throughput, ROSIE enabled a qualitatively different development workflow. Because every additional iteration translates directly into additional development cycles and the ability to test new hypotheses, the ability to run training at scale compressed the experimental feedback loop substantially. Critically, ROSIE made it possible to parallelize the research program: the baseline MuZero and SkyNet models were trained simultaneously on separate nodes, head-to-head evaluations between checkpoints were run concurrently with ongoing training, and inference-time ablation experiments were conducted in parallel without interrupting either training job. This parallelism was essential for completing the matched-checkpoint head-to-head comparisons (Table 3) and the inference-time ablation study (Table 4) within a practical project timeline.
7 Results

7.1 Training Dynamics

Both models exhibit healthy training dynamics: total loss decreases monotonically, gradient norms stabilize in the range of 1–3 after initial transients, and policy entropy decreases gradually without premature collapse. The belief-aware model shows additional loss components that converge meaningfully. The winner head loss decreases from approximately 1.16 to below 0.02 over training, indicating that the network learns to predict game outcomes with increasing accuracy. The rank head loss follows a similar trajectory.

7.2 Self-Play Evaluation

During training with the scaled pipeline (32 self-play episodes per iteration, opponent pool, simulation schedule), the belief-aware model consistently achieves higher evaluation win rates against the heuristic bot curriculum. Table 2 summarizes evaluation statistics computed over all evaluation checkpoints.

Figure 3: Training loss curves (total loss and policy loss) for baseline MuZero and Belief-Aware MuZero. Raw values shown in light color; 20-iteration rolling averages in bold. Both models converge, with the belief-aware model's total loss higher due to the additional auxiliary loss terms.

Figure 4: Convergence of the belief-aware model's auxiliary prediction heads. The winner head loss (left) and rank head loss (right) both decrease substantially over training, confirming that the network learns to predict game outcomes with increasing accuracy.

Table 2: Evaluation statistics over training (self-play vs. heuristic opponents). Bracketed ranges: µ ± 1.96σ across checkpoints.

Metric                       | Baseline          | Belief-Aware
Mean eval win rate (±1.96σ)  | 0.466 [0.32–0.61] | 0.720 [0.58–0.87]
Std eval win rate            | 0.073             | 0.074
Max eval win rate            | 0.600             | 0.825
Min eval win rate            | 0.313             | 0.525
Mean truncation rate         | 0.002             | 0.000
Mean episode length          | 87.5              | 87.7

The belief-aware model achieves a mean win rate of 0.720 compared to the baseline's 0.466. The near-identical episode lengths and negligible truncation rates confirm that this performance difference is not attributable to tempo manipulation or evaluation artifacts. The belief-aware model's minimum observed win rate exceeds the baseline's mean across checkpoints, further supporting the consistency of the improvement. These results corroborate the head-to-head findings reported in Section 7.3.

Figure 5: Evaluation win rate against the heuristic bot curriculum over training (baseline mean 0.511, belief-aware mean 0.639 over the plotted window). The belief-aware model consistently achieves higher win rates after an initial ramp-up period. The dashed line indicates the 50% random baseline.

7.3 Head-to-Head Results

Table 3 presents head-to-head results between the two architectures at various matched checkpoints, each evaluated over 1000 games.

Table 3: Head-to-head results (1000 games per comparison). Belief WR 95% CI: Wilson score interval.
Checkpoint | Belief Wins | Baseline Wins | Draws | Belief WR (95% CI) | ∆ Elo
Iter 125   | 360         | 640           | 0     | 36.0% [33.1–39.0]  | −99
Iter 250   | 422         | 578           | 0     | 42.2% [39.2–45.3]  | −55
Iter 500   | 668         | 332           | 0     | 66.8% [63.8–69.6]  | +120
Iter 750   | 742         | 252           | 6     | 74.2% [71.4–76.8]  | +184
Iter 1000  | 753         | 240           | 7     | 75.3% [72.5–77.9]  | +194

A clear crossover occurs between iterations 250 and 500: the belief-aware model initially underperforms the baseline (likely due to the additional parameter complexity and auxiliary loss overhead during early training), but surpasses it decisively once sufficient training data has been accumulated. The advantage peaks at iteration 1000 with a 75.3% win rate, corresponding to +194 Elo [26]. Figure 6 visualizes this trajectory.

Figure 6: Head-to-head win rates between SkyNet (Belief-Aware) and baseline MuZero at matched training checkpoints (1000 games each). The crossover between iterations 250 and 500 shows SkyNet overtaking the baseline after sufficient training.

7.4 Statistical Significance

For the peak at iteration 1000 (753 wins out of 1000 games), under the null hypothesis that both models are equally strong (p = 0.5):

z = \frac{0.753 - 0.5}{\sqrt{0.5 \cdot 0.5 / 1000}} = \frac{0.253}{0.0158} \approx 16.0    (4)

This yields p < 10^-50, providing overwhelming statistical evidence that the belief-aware model is stronger at this checkpoint. The 95% Wilson score interval for the iteration-1000 win rate is [72.5%, 77.9%], excluding 50%. At iteration 500, the win rate of 66.8% (668/1000) yields z ≈ 10.6 (p < 10^-25), with 95% Wilson interval [63.8%, 69.6%], confirming that the advantage is already decisive before reaching the peak.
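These statistics are straightforward to reproduce. A sketch of the one-proportion z-test, the Wilson score interval, and the standard logistic Elo conversion (function names are ours; draws are ignored in the Elo formula):

```python
import math

def z_score(wins, n, p0=0.5):
    """One-proportion z-test against null win probability p0."""
    p_hat = wins / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def wilson_interval(wins, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def elo_diff(win_rate):
    """Elo gap implied by an observed win rate (logistic model)."""
    return -400 * math.log10(1 / win_rate - 1)

assert round(z_score(753, 1000), 1) == 16.0          # Equation (4)
lo, hi = wilson_interval(753, 1000)
assert round(lo, 3) == 0.725 and round(hi, 3) == 0.779
assert round(elo_diff(0.753)) == 194                 # +194 Elo at iter 1000
```

The same functions recover the iteration-500 figures (z ≈ 10.6, interval [63.8%, 69.6%], +120 Elo).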
7.5 Inference-Time Ablation

Because full retraining ablations were not feasible within the project timeline, we use an inference-time ablation to isolate whether the trained policy actively relies on ego conditioning during planning. Each trained SkyNet checkpoint (iterations 500 and 1000) is evaluated in two modes: (i) full inference with ego conditioning and (ii) ablated inference with ego conditioning disabled. The model weights are identical; only the prediction path differs (bypassing the ego, current-player, and number-of-players embeddings).

Table 4: Inference-time ablation: full SkyNet vs. ego-conditioning-off (1000 games per checkpoint). WR 95% CI: Wilson score interval.

Checkpoint | Full Wins | Ablated Wins | Draws | Full WR (95% CI)  | ∆ Elo
Iter 500   | 690       | 305          | 5     | 69.0% [66.1–71.8] | +139
Iter 1000  | 811       | 181          | 8     | 81.1% [78.6–83.4] | +253

At iteration 500, the full model wins 69.0% of 1000 games against the ablated variant (z = 12.0, p < 10^-30). At iteration 1000, the effect strengthens: the full model wins 81.1% of 1000 games (z ≈ 19.7, p < 10^-50). These results suggest that ego-conditioned predictions contribute directly to decision quality during planning rather than serving only as a training-time representation-shaping signal. Combined with the baseline comparison (Section 7.3), this shows that the performance gain is not solely attributable to auxiliary-loss regularization; the model actively exploits ego conditioning at inference time. This is an inference-time ablation only and does not separate the individual contributions of training-time auxiliary supervision from ego conditioning during training. A full training ablation (ego-only, heads-only) remains future work.

7.6 Training Stability

Early experiments with limited self-play throughput (4 episodes per iteration) revealed severe instability in both models, with win rates oscillating between 0.25 and 0.875 across evaluation points.
The belief-aware model exhibited higher variance than the baseline under these conditions, suggesting that the additional model complexity amplifies sensitivity to data scarcity. Scaling self-play throughput to 32 episodes per iteration, introducing the opponent pool, and applying the simulation schedule eliminated this instability. Under the scaled pipeline, both models' evaluation win-rate standard deviations are approximately 0.074, indicating that the belief-aware model's advantage is not achieved at the cost of stability. Figure 7 shows that gradient norms and policy entropy evolve similarly for both models, confirming stable training dynamics.

Figure 7: Training diagnostics for both models (20-iteration rolling averages). Gradient norms (left) stabilize in a healthy range for both models. Policy entropy (right) decreases gradually without premature collapse, indicating progressive strategy refinement.

7.7 Latent Representation Analysis

To understand how the auxiliary heads shape the latent space, we train linear probes (ridge regression, 5-fold cross-validation) on frozen latent vectors from 44,179 decision steps across 150 games. We probe three representations: the baseline latent state, the belief-aware latent state before ego conditioning, and the belief-aware latent state after ego conditioning. Figure 8 summarizes how much hidden-state information is linearly decodable from each.

Figure 8: Linear probe R² on frozen latent vectors for seven game-state features. Features left of the dashed line are directly observable or semi-observable; features to the right are hidden. Negative R² indicates worse-than-mean prediction.
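The probing protocol described above can be sketched as a closed-form ridge regression evaluated with 5-fold cross-validated R². The snippet below is a NumPy-only illustration on random stand-in data, not the project's actual probe code; the helper name and hyperparameters are ours:

```python
import numpy as np

def ridge_r2_cv(X, y, alpha=1.0, k=5, seed=0):
    """k-fold cross-validated R^2 of a closed-form ridge probe on frozen features."""
    n, d = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xtr, ytr = X[train], y[train]
        # Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y
        w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(d), Xtr.T @ ytr)
        pred = X[fold] @ w
        ss_res = ((y[fold] - pred) ** 2).sum()
        ss_tot = ((y[fold] - y[fold].mean()) ** 2).sum()
        scores.append(1 - ss_res / ss_tot)  # negative = worse than predicting the mean
    return float(np.mean(scores))

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 64))   # stand-in for frozen latent vectors
feature = latents[:, :4].sum(axis=1)   # stand-in linearly-decodable game feature
print(ridge_r2_cv(latents, feature))   # close to 1.0 for a linear feature
```

A genuinely unrecoverable target (e.g., pure noise) produces R² near or below zero under the same probe, which is how the negative values in Figure 8 arise.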
The ego-conditioned belief representation is the only variant that achieves positive R² on both hidden-card features. All three representations encode observable features well: the number of face-down cards (R² = 0.75–0.85), deck size (R² ≈ 0.71–0.77), and visible card sum (R² = 0.36–0.54) are linearly recoverable. The ego-conditioned representation achieves the highest R² on face-down card count (0.845 vs. baseline 0.748) and visible sum (0.539 vs. 0.477), indicating that ego conditioning sharpens the encoding of player-specific observable state.

The more striking result involves genuinely hidden features. Neither the baseline nor the raw belief representation achieves positive R² on the sum of face-down card values or the opponent's hidden card sum. The ego-conditioned representation, however, is the only variant that crosses into positive territory on both: R² = 0.076 for the agent's own hidden card sum and R² = 0.168 for the opponent's hidden sum. While these values are modest, they demonstrate that the auxiliary heads encourage the latent state to retain linear traces of hidden information that the baseline discards entirely. Across all hidden features, the belief models produce consistently less negative R² than the baseline (e.g., true score advantage: −0.334 vs. −1.062), suggesting improved alignment with hidden-state structure even where full linear decodability is not achieved.

These findings support the interpretation advanced in Section 7.1 that the auxiliary heads act as representation-shaping regularizers, encouraging the latent space to retain outcome-relevant hidden information without requiring explicit belief-state reconstruction.

8 Discussion

8.1 Why Belief Modeling Helps

The belief-aware model's advantage likely stems from two mechanisms.
First, the auxiliary winner and rank heads impose an inductive bias on the shared representation: to predict game outcomes accurately, the latent state is encouraged to retain information relevant under partial observability, plausibly including hidden-card structure, deck composition, and opponent state. This acts as a form of self-supervised regularization that improves the quality of latent states used for both value estimation and MCTS planning. This is consistent with findings from EfficientZero [10] and the value-equivalence principle [14] that representations should be optimized for planning-relevant quantities rather than observation reconstruction.

Second, the ego conditioning mechanism allows the network to produce player-specific predictions from a shared representation, enabling effective perspective augmentation during training. Each self-play trajectory generates training signal from every player's viewpoint, effectively multiplying data efficiency by the number of players.

8.2 The Crossover Effect

The observation that the belief-aware model underperforms the baseline at early checkpoints (iterations 125–250) but dominates later (iterations 500+) is consistent with a well-known pattern in multi-task and auxiliary-loss learning: the additional loss terms compete with the primary MuZero objective for gradient bandwidth during early training when data is scarce. The ramping loss-weight schedule (α: 0.1 → 0.5, β: 0.1 → 0.25) mitigates but does not eliminate this effect. This finding has practical implications: practitioners should not evaluate belief-augmented architectures too early in training, as initial underperformance may be followed by substantial gains.

8.3 Non-Zero-Sum Dynamics

Skyjo is fundamentally non-zero-sum: players' scores are independent quantities, and one player's improvement does not necessarily correspond to another's decline.
This distinguishes it from zero-sum games like Chess and Go, where MuZero was originally validated. In non-zero-sum settings, self-play training is more prone to instability because value estimates cannot be cleanly inverted between players. The belief-aware model's explicit multi-player heads (winner prediction per player, rank estimation per player) provide a more natural interface for non-zero-sum value estimation than a single scalar value head, analogous to how ReBeL [11] and Student of Games [12] handle imperfect-information structure through explicit game-theoretic reasoning.

8.4 The Role of Training Scale

Perhaps the most important finding is that the belief-aware model's advantage only materializes under sufficient training throughput. With 4 self-play episodes per iteration, the additional model complexity produced more instability than benefit. With 32 episodes per iteration and an opponent pool, the same architecture produced consistent, statistically significant gains. This suggests that belief-aware auxiliary supervision requires adequate data flow to realize its potential, a practical constraint that mirrors findings in the broader model-based RL literature [27].

8.5 Limitations

Several limitations should be noted. First, our belief heads predict game outcomes (winner, rank) rather than maintaining explicit beliefs over hidden state variables (e.g., distributions over face-down card values). Explicit belief-state planning, as in BetaZero [19] or POMCP [5], could further improve planning quality by enabling informed chance-node expansion in MCTS. Accordingly, we interpret SkyNet as a belief-aware representation-shaping method, not an explicit belief-state planner. Second, our evaluation is limited to 2-player Skyjo; the model supports 2–8 players architecturally, but scaled multi-player experiments remain future work.
Third, comparison to human performance has not been conducted, though the models' ability to outperform hand-crafted heuristic opponents suggests competitive play quality.

9 Future Work

Several directions could extend this work: training the belief head to predict explicit distributions over face-down card values and using these beliefs to weight chance branches in MCTS following Stochastic MuZero [7]; evaluating 3–8 player Skyjo, where win-signal sparsity (~1/N) and opponent-modeling complexity increase; migrating to a fully asynchronous actor-learner pipeline [28] to further scale throughput; conducting controlled studies against human players; and applying the belief-aware framework to other imperfect-information games (e.g., Hanabi, Hearts) to assess generality.

10 Conclusion

We presented Belief-Aware MuZero, an extension of the MuZero framework that augments the standard architecture with ego-conditioned auxiliary heads for winner prediction and rank estimation. Applied to Skyjo, a partially observable, stochastic, multi-player card game with non-zero-sum dynamics, the belief-aware variant achieves a peak 75.3% win rate (+194 Elo) against the classical MuZero baseline in 1000-game head-to-head evaluation (p < 10⁻⁵⁰). The key insight is that belief-aware auxiliary supervision appears to improve the quality of learned representations under partial observability, but this benefit requires sufficient training throughput to materialize. Under low-data regimes, the additional model complexity increases instability, whereas under adequate data flow it produces consistent and substantial gains. These results suggest that MuZero can be effectively adapted to imperfect-information multi-player card games, and that even simple belief-aware auxiliary heads, applied without explicit hidden-state modeling, provide meaningful improvements.
This opens a path toward stronger game-playing agents in the broad class of partially observable, stochastic, multi-player domains that more closely resemble real-world decision-making challenges.

Skyjo Web Arena: https://skyjo.artzima.dev/
SkyNet Repository: https://github.com/DevArtech/skynet

References

1. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, Chess and Shogi by planning with a learned model. Nature, 588(7839), 604–609.
2. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2018). A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science, 362(6419), 1140–1144.
3. Kocsis, L. & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. European Conference on Machine Learning (ECML), 282–293.
4. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2), 99–134.
5. Silver, D. & Veness, J. (2010). Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems (NeurIPS), 23, 2164–2172.
6. Cowling, P. I., Powley, E. J., & Whitehouse, D. (2012). Information set Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2), 120–143.
7. Antonoglou, I., Schrittwieser, J., Ozair, S., Hubert, T., & Silver, D. (2022). Planning in stochastic environments with a learned model. International Conference on Learning Representations (ICLR).
8. Danihelka, I., Guez, A., Schrittwieser, J., & Silver, D. (2022). Policy improvement by planning with Gumbel. International Conference on Learning Representations (ICLR).
9.
Schrittwieser, J., Hubert, T., Manber, A., Hassabis, D., & Silver, D. (2021). Online and offline reinforcement learning by planning with a learned model. Advances in Neural Information Processing Systems (NeurIPS), 34.
10. Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari games with limited data. Advances in Neural Information Processing Systems (NeurIPS), 34.
11. Brown, N., Bakhtin, A., Lerer, A., & Gong, Q. (2020). Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems (NeurIPS), 33.
12. Schmid, M., Moravčík, M., Burch, N., Kadlec, R., Davidson, J., Waugh, K., Lisý, V., Bowling, M., Lanctot, M., & Munos, R. (2023). Student of Games: A unified learning algorithm for both perfect and imperfect information games. Science Advances, 9(46).
13. Whitehouse, D., Powley, E. J., & Cowling, P. I. (2011). Determinization and information set Monte Carlo tree search for the card game Dou Di Zhu. IEEE Conference on Computational Intelligence and Games (CIG), 87–94.
14. Grimm, C., Barreto, A., Singh, S., & Silver, D. (2020). The value equivalence principle for model-based reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 33.
15. Schwarzer, M., Anand, A., Goel, R., Hjelm, R. D., Courville, A., & Bachman, P. (2021). Data-efficient reinforcement learning with self-predictive representations. International Conference on Learning Representations (ICLR).
16. Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., & Kavukcuoglu, K. (2017). Reinforcement learning with unsupervised auxiliary tasks. International Conference on Learning Representations (ICLR).
17. Hubert, T., Schrittwieser, J., Antonoglou, I., Barekatain, M., Schmitt, S., & Silver, D. (2021). Learning and planning in complex action spaces. International Conference on Machine Learning (ICML).
18. Sutton, R.
S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
19. Moss, R. J., Corso, A., Caers, J., & Kochenderfer, M. J. (2024). BetaZero: Belief-state planning for long-horizon POMDPs using learned approximations. arXiv preprint arXiv:2306.00249.
20. Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., & Bowling, M. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), 508–513.
21. Brown, N. & Sandholm, T. (2017). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374), 418–424.
22. Brown, N. & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885–890.
23. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized experience replay. International Conference on Learning Representations (ICLR).
24. Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., Pérolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., Hennes, D., Morrill, D., Muller, P., Eber, T., Duran-Martin, G., De Vylder, B., Munos, R., Abramson, J., Vinyals, O., & Bowling, M. (2020). OpenSpiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453.
25. Zha, D., Lai, K.-H., Cao, Y., Huang, S., Wei, R., Guo, J., & Hu, X. (2020). RLCard: A platform for reinforcement learning in card games. International Joint Conference on Artificial Intelligence (IJCAI).
26. Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing.
27. Moerland, T. M., Broekens, J., Plaat, A., & Jonker, C. M. (2023). Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1), 1–118.
28. Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., & Silver, D. (2018). Distributed prioritized experience replay.
International Conference on Learning Representations (ICLR).