SHADOW: Seamless Handoff And Zero-Downtime Orchestrated Workload Migration for Stateful Microservices


Authors: Hai Dinh-Tuan

Hai Dinh-Tuan
Technische Universität Berlin, Berlin, Germany
hai.dinh-tuan@tu-berlin.de

Abstract—Migrating stateful microservices in Kubernetes requires careful state management because in-memory state is lost when a container restarts. For StatefulSet-managed workloads, the problem is compounded by identity constraints that prohibit two pods with the same ordinal from running simultaneously, forcing a sequential stop-recreate cycle with a median 38.5 s of service downtime. This paper presents SHADOW (Seamless Handoff And Zero-Downtime Orchestrated Workload Migration), a Kubernetes-native framework that implements the Message-based Stateful Microservice Migration (MS2M) approach as a Kubernetes Operator. SHADOW introduces the ShadowPod strategy, where a shadow pod is created from a CRIU checkpoint image on the target node while the source pod continues serving traffic, allowing concurrent operation during message replay. For StatefulSet workloads, an identity swap procedure with the ExchangeFence mechanism re-checkpoints the shadow pod, creates a StatefulSet-owned replacement, and drains both message queues to guarantee zero message loss during the handoff. An evaluation on a bare-metal Kubernetes cluster with 280 migration runs across four configurations and seven message rates (10–120 msg/s) shows that, compared to the sequential baseline on the same StatefulSet workload, the ShadowPod strategy reduces the restore phase by up to 92%, eliminates service downtime entirely, and reduces total migration time by up to 77%, with zero message loss across all 280 runs.

Index Terms—microservices, live migration, Kubernetes, container checkpointing, CRIU, stateful services, operator pattern
I. INTRODUCTION

The increasing adoption of microservice architectures in cloud-native and cloud/edge environments has introduced new challenges around service mobility. While stateless services can be rescheduled across cluster nodes, stateful microservices that maintain in-memory state for performance-critical operations require careful state management during migration. Traditional migration techniques, originally developed for virtual machines, treat service state as an opaque memory block and rely on pre-copy or post-copy mechanisms that introduce additional complexity and network overhead [1], [2].

The Message-based Stateful Microservice Migration (MS2M) framework [3] addresses this challenge by using the application's own messaging infrastructure to reconstruct state on the target node. Rather than copying raw memory, MS2M checkpoints the running container, restores it on a target node, and replays buffered messages to synchronize state, reducing downtime to the brief checkpoint creation phase. A subsequent extension [4] integrated MS2M into Kubernetes using Forensic Container Checkpointing (FCC) [5], introduced a threshold-based cutoff mechanism for bounded replay times, and adapted the procedure for StatefulSet workloads using a Sequential (stop-then-recreate) strategy.

However, the Kubernetes integration in [4] exhibited two performance bottlenecks. First, the restore phase for StatefulSet pods required a median of 38.5 seconds due to Kubernetes' identity constraints, which prohibit two pods with the same StatefulSet-managed hostname from running simultaneously. This forced a sequential stop-recreate cycle, during which the service was completely unavailable. Second, the checkpoint transfer phase consumed a median of 5.4 seconds for a 27 MB checkpoint, as the process involved building an OCI-compliant container image, pushing it to a container registry, and pulling it on the target node.
Additionally, the migration was orchestrated by an out-of-band agent (the Migration Manager) that communicated with the cluster via kubectl, operating outside Kubernetes' declarative resource model.

This paper presents SHADOW (Seamless Handoff And Zero-Downtime Orchestrated Workload Migration), a Kubernetes-native framework that addresses these limitations through three contributions:

1) Kubernetes Operator: a Custom Resource Definition (StatefulMigration) and an idempotent state machine reconciler that replaces the Migration Manager with a Kubernetes-native control loop.
2) ShadowPod Migration Strategy: a shadow pod is created on the target node from the checkpoint image while the source pod continues serving, allowing concurrent operation during message replay. This works for both Deployment- and StatefulSet-managed workloads.
3) Identity Swap via ExchangeFence: after ShadowPod migration on a StatefulSet, the shadow pod is orphaned outside the StatefulSet's ownership. SHADOW provides an identity swap procedure that re-checkpoints the shadow pod, creates a StatefulSet-owned replacement, and uses the ExchangeFence mechanism (unbinding, draining queues, and rebinding) to guarantee zero message loss during this handoff.

II. BACKGROUND AND PRIOR WORK

A. The MS2M Framework

The Message-based Stateful Microservice Migration (MS2M) framework [3] was designed for message-driven microservices that derive their state from processed message streams. Unlike traditional migration approaches that treat service state as an opaque memory block, MS2M uses a property of message-driven architectures: service state is the deterministic result of the messages a service has processed. This makes state reconstruction possible through message replay rather than low-level memory transfer.
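This property, that state is a deterministic fold over the ordered message stream, can be illustrated with a minimal sketch (the counter-style state and field names are illustrative, not part of MS2M itself):

```python
from functools import reduce

# Illustrative message-driven state: a fold (left reduce) over the
# ordered message stream. Replaying the same stream on another node
# reconstructs exactly the same state.
def apply_message(state, msg):
    return {
        "processed": state["processed"] + 1,
        "last_seq": msg["seq"],
    }

INITIAL = {"processed": 0, "last_seq": None}

def reconstruct(messages):
    return reduce(apply_message, messages, INITIAL)

stream = [{"seq": i} for i in range(1, 6)]
source_state = reconstruct(stream)    # state on the source node
replayed_state = reconstruct(stream)  # state rebuilt on the target
assert source_state == replayed_state == {"processed": 5, "last_seq": 5}
```

Because `reconstruct` is a pure function of the stream, any node holding the same messages can rebuild identical state, which is exactly what replay exploits.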
The framework defines a five-phase migration procedure coordinated by a Migration Manager: (1) Checkpoint Creation – the source container is briefly paused using CRIU [6] to create a process checkpoint while a secondary queue buffers incoming messages; (2) Checkpoint Transfer – the archive is sent to the target host while the source continues serving; (3) Service Restoration – the checkpoint is restored on the target; (4) Message Replay – the restored instance processes buffered messages from the secondary queue; and (5) Finalization – the target switches to the primary queue and the source is terminated.

A key assumption is that service state is fully determined by processed messages. During replay, source and target consume from separate queues, so no message is consumed twice. Applications must suppress side effects during replay (detectable via START_REPLAY / END_REPLAY control messages) or ensure idempotent downstream operations. The original proof-of-concept [3] demonstrated a 19.92% reduction in service downtime compared to traditional stop-and-copy migration.

B. Kubernetes Integration

The integration of MS2M into Kubernetes [4] required adapting the framework to the platform's declarative, desired-state resource model. This used Forensic Container Checkpointing (FCC) [5], an experimental Kubernetes feature (since v1.25) that extends the kubelet API to create CRIU checkpoints of running containers without requiring direct access to the container runtime. Checkpoint transfer was implemented by packaging the archive as an OCI-compliant container image and pushing it to a container registry, from which the target node pulls during pod creation.

That work identified two key challenges:

Unbounded Replay Time: When the incoming message rate approaches the service's processing capacity, the replay queue grows indefinitely, making migration time unpredictable.
A Threshold-Based Cutoff Mechanism was introduced to address this: the source service is terminated after a calculated duration T_cutoff ≤ T_replay^max · (µ_target / λ), where λ is the incoming message rate and µ_target the target's processing rate, bounding the number of messages to replay and guaranteeing migration completion within a specified time window.

StatefulSet Constraints: StatefulSet pods possess stable, unique network identities that Kubernetes enforces by prohibiting two pods with the same ordinal index from running simultaneously. This prevents the concurrent source-target operation that MS2M relies on, forcing a sequential stop-recreate cycle (the Sequential strategy) that dominated migration time at a median of 38.5 seconds. The prior work treated this as an inherent limitation of StatefulSet workloads.

C. Kubernetes Operators

A Kubernetes Operator [7] is a design pattern that combines a Custom Resource Definition (CRD) with a dedicated controller to manage application-specific operational knowledge. The controller implements a reconcile loop that continuously compares the desired state (expressed in the custom resource's spec) with the actual cluster state and takes corrective actions to converge. This declarative model provides automatic retry on failure, leader election for high availability, integration with Kubernetes Role-Based Access Control (RBAC) and audit logging, and native observability through resource status fields and Kubernetes events.

III. SYSTEM DESIGN

This section presents the architecture of SHADOW¹ and its two optimization strategies. Figure 1 provides an overview of the system components.

A. The SHADOW Operator

SHADOW implements the MS2M migration procedure as a Kubernetes Operator using the controller-runtime framework [8]. The operator manages a custom resource, StatefulMigration, whose specification declares the migration intent and whose status reflects the current progress.
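A migration intent might be declared as follows. This is a hedged sketch: the top-level field names and API group come from the StatefulMigration CRD, while the metadata values and the sub-field names under messageQueueConfig are illustrative.

```yaml
apiVersion: migration.ms2m.io/v1alpha1
kind: StatefulMigration
metadata:
  name: migrate-consumer-0        # illustrative name
spec:
  sourcePod: consumer-0
  targetNode: worker-2
  migrationStrategy: ShadowPod    # omit to auto-detect from owner references
  transferMode: Registry
  messageQueueConfig:             # sub-field names are illustrative
    host: rabbitmq.default.svc
    queue: consumer-queue
    exchange: messages
  replayCutoffSeconds: 120
```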
1) Custom Resource Definition: The StatefulMigration CRD (API group: migration.ms2m.io/v1alpha1) captures the migration intent through the following key specification fields:

• sourcePod: The name of the pod to migrate.
• targetNode: The destination worker node.
• migrationStrategy: Sequential or ShadowPod. Auto-detected from the pod's owner reference chain if not specified.
• transferMode: Registry (default) or Direct (node-to-node via ms2m-agent).
• messageQueueConfig: RabbitMQ connection details, queue name, and exchange name for the replay mechanism.
• replayCutoffSeconds: Maximum duration for the cutoff mechanism.

The resource status tracks the current phase, per-phase durations (phaseTimings), the checkpoint identifier, and metadata cached from the source pod (labels, container specs, owner references).

¹ The SHADOW implementation is publicly available at https://github.com/haidinhtuan/shadow

Fig. 1. Architecture of the SHADOW framework. The reconciler watches StatefulMigration custom resources and orchestrates migration through the API server: checkpoint creation via the kubelet API (1), checkpoint transfer between ms2m-agent instances (direct) or via registry (2), shadow pod creation on the target node (3), and traffic switchover during finalization (4).

2) State Machine Reconciler: The reconciler implements the MS2M five-phase procedure as an idempotent state machine with seven phases, illustrated in Figure 2: Pending, Checkpointing, Transferring, Restoring, Replaying, Finalizing, and Completed (or Failed).
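The dispatch structure of such a phase machine can be sketched as follows (a minimal sketch in Python rather than the Go controller-runtime implementation; handler names and the error convention are illustrative):

```python
# Minimal sketch of an idempotent phase-dispatch reconcile step.
# Each handler inspects actual state, performs at most one idempotent
# action, and returns the next phase; any handler error transitions
# the resource to the terminal Failed phase.

PHASES = ["Pending", "Checkpointing", "Transferring", "Restoring",
          "Replaying", "Finalizing", "Completed", "Failed"]

def reconcile(resource, handlers):
    """One reconciliation invocation: dispatch on the current phase."""
    phase = resource.get("phase", "Pending")
    if phase in ("Completed", "Failed"):
        return resource  # terminal phases: nothing to do
    try:
        resource["phase"] = handlers[phase](resource)
    except Exception:
        resource["phase"] = "Failed"  # any handler error is terminal
    return resource

# Illustrative handlers: each advances exactly one step.
handlers = {
    "Pending":       lambda r: "Checkpointing",
    "Checkpointing": lambda r: "Transferring",
    "Transferring":  lambda r: "Restoring",
    "Restoring":     lambda r: "Replaying",
    "Replaying":     lambda r: "Finalizing",
    "Finalizing":    lambda r: "Completed",
}

r = {"phase": "Pending"}
for _ in range(6):
    reconcile(r, handlers)
assert r["phase"] == "Completed"
```

Because each invocation performs one dispatch and each handler is safe to re-enter, the loop can be re-run after any crash or transient failure without corrupting migration state.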
Each reconciliation invocation dispatches to the handler for the current phase, which either advances to the next phase or re-queues for a later attempt. Phase handlers are idempotent: re-entering a phase after a transient failure does not corrupt migration state. The reconciler implements phase chaining (executing synchronously-completed phases within a single API call) and exponential polling backoff for long-running phases, reducing API server load. Wall-clock durations for each phase are recorded in status.phaseTimings for built-in instrumentation.

When migrationStrategy is not specified, the operator auto-detects it from the pod's owner reference chain (StatefulSet → Sequential, Deployment → ShadowPod), but this can be overridden explicitly.

Fig. 2. State machine of the StatefulMigration reconciler. The reconcile loop dispatches to the handler for the current phase, advancing on success or transitioning to Failed on error. Each handler is idempotent for safe retry.

B. ShadowPod Migration Strategy

The ShadowPod strategy removes the sequential restore bottleneck by allowing source and target pods to operate concurrently during the replay phase. Kubernetes Services route traffic based on label selectors, while StatefulSet identity constraints are enforced through owner references. A shadow pod that carries the same application labels but is not owned by the StatefulSet can therefore coexist with the original pod and receive Service traffic without violating identity constraints.

Fig. 3. Timeline comparison of the migration strategies. SS-Seq: service is unavailable during the 38.5 s restore phase. SS-Shadow/D-Reg: the source continues serving while the shadow pod restores and replays. SS-Swap: after replay, the shadow pod continues serving during the identity swap (re-checkpoint, ExchangeFence), and is replaced only once the StatefulSet-owned replacement is ready. All ShadowPod variants achieve zero downtime.

1) Core Procedure: Figure 3 illustrates the key differences between the three StatefulSet strategies. In Sequential, the source pod must be terminated before the target can be created, causing a service gap. In ShadowPod, the shadow pod runs alongside the source. In ShadowPod+Swap, an additional identity swap phase replaces the shadow pod with a StatefulSet-owned replacement.

The ShadowPod strategy modifies the Restoring and Finalizing phases of the migration procedure. During the Restoring phase, instead of scaling down a StatefulSet and recreating its pod, the operator creates a new pod with a -shadow name suffix on the target node. This shadow pod uses the checkpoint container image with imagePullPolicy: Never (since the image is already present in the target node's local image store). The shadow pod carries the same application labels as the source pod, so it can receive traffic via the Kubernetes Service immediately upon becoming ready. Both the source pod and the shadow pod run concurrently during this phase and throughout the subsequent replay phase.

Because the shadow pod is restored from a CRIU checkpoint (which includes the running HTTP server thread), its health endpoint is available immediately after restore. Kubernetes marks the shadow pod as Ready and adds it to the Service's endpoint list, so both pods serve traffic during replay. The source provides fully up-to-date state while the shadow pod's state converges as replay progresses.
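The decisive details of the shadow pod spec (same labels, no StatefulSet owner reference, checkpoint image with imagePullPolicy: Never) can be sketched as a manifest constructor. This is a hedged sketch: the helper name and example values are illustrative, and the actual operator builds the equivalent object through the Kubernetes Go client rather than raw dictionaries.

```python
# Sketch: build a shadow-pod manifest from a cached source-pod spec.
# Copying the labels keeps the pod behind the same Service selector;
# omitting ownerReferences keeps it outside StatefulSet identity
# enforcement, so it may run concurrently with the source.
def shadow_pod_manifest(source_pod, checkpoint_image, target_node):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": source_pod["metadata"]["name"] + "-shadow",
            "labels": dict(source_pod["metadata"]["labels"]),
            # deliberately NO ownerReferences: not StatefulSet-owned
        },
        "spec": {
            "nodeName": target_node,  # pin to the migration target
            "containers": [{
                "name": source_pod["spec"]["containers"][0]["name"],
                "image": checkpoint_image,
                # image was loaded locally during transfer; never pull
                "imagePullPolicy": "Never",
            }],
        },
    }

src = {"metadata": {"name": "consumer-0", "labels": {"app": "consumer"}},
       "spec": {"containers": [{"name": "consumer"}]}}
m = shadow_pod_manifest(src, "localhost/checkpoint:abc", "worker-2")
assert m["metadata"]["name"] == "consumer-0-shadow"
assert m["metadata"]["labels"] == {"app": "consumer"}
assert "ownerReferences" not in m["metadata"]
```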
2) Finalization by Workload Type: For Deployment-managed pods, finalization sends END_REPLAY to the shadow pod, deletes the source pod, patches the Deployment's nodeAffinity to the target node, and cleans up the replay queue. For StatefulSet-managed pods, the StatefulSet is scaled down by one replica instead of deleting the source directly (which would cause the controller to recreate it). The StatefulSet controller removes the highest-ordinal pod (the source), while the shadow pod continues serving traffic.

3) Trade-offs: For StatefulSets, the ShadowPod strategy results in the workload being served by a standalone shadow pod outside StatefulSet ownership. StatefulSet guarantees (ordered scaling, PVC attachments) do not apply. SHADOW therefore targets in-memory stateful services whose state is derived from message processing, not persistent disk storage. For Deployment-managed workloads, no such trade-off exists.

C. Identity Swap via ExchangeFence

The ShadowPod strategy for StatefulSets leaves the workload served by an orphaned shadow pod outside the StatefulSet's ownership. To restore full StatefulSet management, SHADOW provides an identity swap procedure during the Finalizing phase:

1) The shadow pod is re-checkpointed on the target node using a local CRIU dump.
2) The StatefulSet is scaled down by one replica, removing the original source pod.
3) A replacement pod with the original name (e.g., consumer-0) is created from the re-checkpoint image. The StatefulSet controller adopts this pod.
4) The shadow pod is terminated.

During step 4, traffic must be handed off from the shadow pod to the replacement pod without message loss. A naive cutoff (terminating the shadow and starting the replacement) risks duplicate processing or lost messages if the handoff is not atomic. SHADOW addresses this with the ExchangeFence mechanism:

1) A temporary buffer queue is bound to the message exchange.
2) The exchange is unbound from both the primary and swap queues, fencing new messages into the buffer.
3) Both the primary and swap queues are drained to zero.
4) The shadow pod is terminated, the primary queue is rebound to the exchange, and the buffer queue is drained into the replacement pod.

The ExchangeFence provides a consistent cut: no messages are lost because the buffer queue captures all messages during the fence window, and no messages are duplicated because both source queues are fully drained before the handoff completes. This procedure adds latency to the Finalizing phase (depending on message rate and queue depths) but restores full StatefulSet ownership of the migrated workload.

D. Direct Node-to-Node Transfer

The second optimization targets the container registry round-trip during checkpoint transfer. SHADOW provides two alternatives to Job-based registry transfer. First, when the ms2m-agent DaemonSet is present on the source node, the controller delegates the registry push to the agent process, avoiding the overhead of creating and scheduling a Kubernetes Job (agent-assisted registry transfer). Second, the controller implements a direct transfer mode through worker-to-worker communication that bypasses the registry entirely. All four evaluation configurations use registry-based transfer; the direct mode is not evaluated but is described here as a design contribution.

1) The ms2m-agent DaemonSet: The ms2m-agent is deployed as a Kubernetes DaemonSet on all worker nodes. The agent exposes an HTTP endpoint (port 9443) that accepts checkpoint archives and performs local image construction. The agent's pod specification includes hostPath volume mounts for /var/lib/kubelet/checkpoints (read-only, for accessing FCC checkpoint archives) and /var/lib/ms2m (read-write, for temporary image construction).
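The ExchangeFence steps of Section III-C can be modeled against a minimal in-memory broker. This is a hedged sketch: the toy exchange class and queue names are illustrative stand-ins for the RabbitMQ operations, not SHADOW's actual client code.

```python
from collections import defaultdict

class FanoutExchange:
    """Toy fanout exchange: delivers each published message to every bound queue."""
    def __init__(self):
        self.queues = defaultdict(list)   # queue name -> pending messages
        self.bound = set()                # queues currently bound

    def bind(self, q):
        self.bound.add(q)
        self.queues.setdefault(q, [])

    def unbind(self, q):
        self.bound.discard(q)

    def publish(self, msg):
        for q in self.bound:
            self.queues[q].append(msg)

    def drain(self, q):
        msgs, self.queues[q] = self.queues[q], []
        return msgs

ex = FanoutExchange()
ex.bind("primary"); ex.bind("swap")
ex.publish("m1")                          # in flight before the fence

# -- ExchangeFence --
ex.bind("buffer")                         # 1) temporary buffer queue bound
ex.unbind("primary"); ex.unbind("swap")   # 2) fence: new messages reach buffer only
ex.publish("m2")                          #    arrives during the fence window
backlog = ex.drain("primary") + ex.drain("swap")  # 3) drain both queues to zero
ex.bind("primary")                        # 4) rebind primary, hand buffer over
handoff = ex.drain("buffer")

assert backlog == ["m1", "m1"]            # pre-fence copies, one per queue
assert handoff == ["m2"]                  # fence-window message: exactly once
assert ex.queues["primary"] == []         # nothing stranded behind the fence
```

The assertions exhibit the consistent cut: a message published during the fence window appears only in the buffer, so the replacement pod receives it exactly once after both legacy queues have been drained.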
2) Transfer Procedure: When transferMode: Direct is specified, the Transferring phase operates as follows:

1) The operator creates a transfer Job scheduled on the source node.
2) The Job reads the checkpoint archive from the kubelet's checkpoint directory.
3) The Job streams the archive as a multipart HTTP POST to the target node's ms2m-agent at http://ms2m-agent.ms2m-system.svc:9443/checkpoint.
4) The target agent writes the archive to disk, constructs an OCI-compliant container image (a single uncompressed layer with CRI-O checkpoint annotations), and loads it into the local image store using skopeo copy.
5) The agent responds with the local image reference, which the operator records in the migration status.

The OCI image uses an uncompressed layer, as CPU compression cost outweighs bandwidth savings on cluster-local networks. This direct path eliminates the registry push/pull round-trip. The default Registry transfer mode from [4] remains available as a fallback.

3) CRIU Hostname Resolution: After CRIU restore, gethostname() returns the source pod's name (from the UTS (UNIX Time-Sharing) namespace snapshot), causing the control message protocol to listen on the wrong queue. This is resolved by reading /etc/hostname (bind-mounted by the runtime, reflecting the actual pod name) instead.

IV. EVALUATION

This section presents the experimental evaluation of SHADOW across four migration configurations, comparing the baseline Sequential approach with the ShadowPod strategy on both StatefulSet and Deployment workloads.

A. Experimental Setup

1) Infrastructure: The evaluation was conducted on a bare-metal Kubernetes cluster provisioned on a European cloud provider, consisting of three dedicated servers:

• Control plane: 1 node running the Kubernetes API server, scheduler, controller manager, and the MS2M operator.
• Workers: 2 nodes serving as alternating source and target for round-trip migrations.
Each server is equipped with 4 dedicated vCPUs, 8 GB RAM, and 80 GB SSD storage. All nodes run Ubuntu 22.04 with Kubernetes v1.32, CRI-O as the container runtime, and CRIU v4.0 compiled from source for checkpoint/restore operations. An in-cluster container registry (deployed in the registry namespace) is used for OCI checkpoint image transfer. RabbitMQ 3.13 is deployed as a single-instance StatefulSet for message brokering.

2) Workload: The evaluation workload consists of two microservices:

• Producer: A Go Deployment that publishes messages to a RabbitMQ fanout exchange at a configurable rate. Each message contains a sequence number and timestamp.
• Consumer: A single-replica workload (StatefulSet or Deployment, depending on configuration) implemented in Python. The consumer maintains an in-memory counter of processed messages and the last-seen sequence number as its stateful context. It exposes an HTTP health endpoint on port 8080 that reports the current processing state, serving both as a readiness probe and a downtime measurement target. The consumer implements the MS2M control message protocol (Section II-A) for queue switching during replay.

3) Configurations: Four migration configurations are evaluated, designed to isolate the effect of the ShadowPod strategy and identity swap:

1) SS-Seq (statefulset-sequential): Baseline from [4]. StatefulSet consumer, Sequential strategy, registry transfer.
2) SS-Shadow (statefulset-shadowpod): StatefulSet consumer, ShadowPod strategy, registry transfer. Isolates the ShadowPod effect.
3) SS-Swap (statefulset-shadowpod-swap): StatefulSet consumer, ShadowPod strategy with ExchangeFence identity swap (Section III-C), agent-assisted registry transfer.
4) D-Reg (deployment-registry): Deployment consumer, ShadowPod strategy, registry transfer.

The comparison between configurations 1 and 2 isolates the effect of the ShadowPod strategy on the same workload type (StatefulSet).
The comparison between configurations 2 and 3 isolates the effect of identity swap (and its associated agent-assisted transfer path) on the same workload type. The comparison between configurations 2 and 4 reveals the effect of workload type (StatefulSet vs. Deployment).

4) Parameters: Each configuration is evaluated at seven message rates: 10, 20, 40, 60, 80, 100, and 120 messages per second. Each rate–configuration combination is repeated 10 times, yielding 4 × 7 × 10 = 280 total migration runs. The replay cutoff is set to 120 seconds. Between runs, message queues are purged to prevent accumulation, checkpoint images are cleaned from worker nodes to prevent disk exhaustion, and the consumer pod is verified to be ready and processing messages before initiating the next migration.

5) Downtime Measurement: Service downtime is measured using an external HTTP probe pod that sends requests to the consumer's Service endpoint at 10 ms intervals. Downtime is the longest contiguous streak of failed probes during the migration window, with a 3 s gap threshold to avoid merging unrelated failures. Total migration time is the elapsed time from StatefulMigration resource creation to the Completed phase.

Fig. 4. Total migration time (median, n = 10) across message rates. At low rates (≤ 40 msg/s), ShadowPod reduces time by 73–76%. At high rates where the 120 s replay cutoff dominates, the reduction narrows to 22%. The dashed line marks the replay cutoff boundary.

B. Results

1) Total Migration Time: Table I presents the median total migration time (n = 10 per cell) for each configuration across all seven message rates. Figure 4 visualizes the trend.
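The downtime metric of Section IV-A5 can be computed from the probe log as follows. This is a hedged sketch: the exact streak accounting in the measurement pipeline is not specified in detail, so the interpretation here (failed probes separated by less than 3 s of recovery merge into one outage; longer recoveries split unrelated failures) is an assumption.

```python
# Sketch: compute service downtime from probe results.
# probes: list of (timestamp_seconds, ok) pairs at ~10 ms intervals.
# Failed probes separated by less than GAP_S of recovery are merged
# into one outage streak; downtime is the longest streak.
GAP_S = 3.0

def longest_downtime(probes):
    streaks, start, last_fail = [], None, None
    for t, ok in probes:
        if not ok:
            if start is None or t - last_fail >= GAP_S:
                start = t                  # open a new outage streak
            last_fail = t
            streaks.append(last_fail - start)
    return max(streaks, default=0.0)

# A ~31 s outage containing a 0.5 s recovery blip (merged), plus an
# unrelated 0.2 s failure 10 s later (not merged).
probes, t = [], 0.0
while t < 60.0:
    fail = (5.0 <= t < 20.0) or (20.5 <= t < 36.0) or (46.0 <= t < 46.2)
    probes.append((round(t, 2), not fail))
    t += 0.01

d = longest_downtime(probes)
assert 30.5 <= d <= 31.1   # main outage spans ~5.0 s to ~36.0 s
```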
At low rates (10 msg/s), ShadowPod configurations complete in 12.4–13.8 s, a 73–76% reduction from the Sequential baseline of 50.8 s, primarily by eliminating the 38.5 s median restore phase. At high rates (≥ 80 msg/s), the 120 s replay cutoff dominates, narrowing the ShadowPod advantage to 22%. At the intermediate rate of 60 msg/s, SS-Shadow completes in 36.0 s vs. 157.5 s for Sequential (77% reduction).

The SS-Swap configuration shows similar total times to SS-Shadow at low rates (15.8 s vs. 13.8 s at 10 msg/s), with the additional 2 s attributable to the identity swap finalization. At high rates, SS-Swap is slower than SS-Shadow (141.6 s vs. 129.0 s at 120 msg/s) due to the 17.5 s finalize phase, but still 14% faster than Sequential.

2) Phase-by-Phase Breakdown: Tables II and III present the per-phase durations at 10 and 60 msg/s respectively, illustrating the source of the performance improvement at both ends of the rate spectrum. Checkpointing is consistent at a median of 0.34 s across all configurations. Transferring takes 5.0–5.9 s (I/O-bound registry transfer). The Restoring phase shows the primary ShadowPod benefit: Sequential requires a median of 38.5 s (StatefulSet identity constraint), while ShadowPod requires only 2.3–3.2 s, a 92% reduction. The Replaying phase scales with message rate: at 60 msg/s, SS-Shadow replays in 27.7 s vs. 112.8 s for Sequential, because fewer messages accumulate during the shorter restore window. At ≥ 80 msg/s, all configurations hit the 120 s cutoff. Finalizing is near-instantaneous (< 0.1 s).

Fig. 5. Phase duration breakdown at 60 msg/s (medians, n = 10). The Sequential configuration is dominated by the 38.4 s restore phase. ShadowPod reduces restore to 2.5–2.9 s, shifting the bottleneck entirely to replay.
The SS-Swap configuration shows near-zero transfer time (0.19 s at 60 msg/s) due to the agent-assisted registry push (which avoids Kubernetes Job scheduling overhead), but a much longer Finalizing phase (14.73 s) due to the identity swap procedure: re-checkpointing, pod replacement, and ExchangeFence queue draining. Figure 5 visualizes the breakdown at 60 msg/s.

3) Service Downtime: Table IV reports the measured service downtime (median, n = 10) using the HTTP probe methodology described in Section IV-A5. The Sequential baseline exhibits a consistent median of 31.1 s of downtime across all rates (rate-independent, determined by the StatefulSet scale-down/up cycle). All three ShadowPod configurations achieve zero measured downtime across all rates and all 210 ShadowPod runs (70 per configuration).

4) Message Loss: Across all 280 migration runs, no messages were lost during migration. The message replay mechanism successfully synchronized state between source and target in every completed run across all four configurations and all seven message rates, confirming the zero-loss guarantee of the MS2M framework.

TABLE I. TOTAL MIGRATION TIME, MEDIAN (SECONDS), n = 10 PER CELL. VALUES IN PARENTHESES INDICATE THE INTERQUARTILE RANGE (Q1–Q3).

Rate | SS-Seq              | SS-Shadow           | SS-Swap             | D-Reg
10   | 50.8 (50.3–80.1)    | 13.8 (12.3–13.8)    | 15.8 (14.9–17.4)    | 12.4 (12.2–13.3)
20   | 59.2 (58.2–60.0)    | 14.3 (13.3–44.3)    | 17.6 (16.7–20.4)    | 14.0 (13.0–16.3)
40   | 81.1 (80.2–105.9)   | 19.5 (18.7–84.8)    | 23.1 (21.9–28.4)    | 20.3 (18.9–29.4)
60   | 157.5 (130.1–164.5) | 36.0 (29.1–100.5)   | 31.8 (30.0–32.8)    | 58.2 (27.9–108.3)
80   | 164.7 (163.9–165.2) | 117.2 (74.7–129.3)  | 58.3 (52.5–67.5)    | 126.0 (103.4–129.7)
100  | 165.0 (164.5–165.9) | 129.8 (128.9–130.9) | 139.9 (138.9–140.2) | 128.9 (128.4–130.0)
120  | 164.8 (164.3–165.5) | 129.0 (128.5–130.1) | 141.6 (141.1–142.0) | 129.2 (128.9–131.4)

TABLE II. PHASE DURATIONS (MEDIAN SECONDS, n = 10) AT 10 MSG/S

Phase         | SS-Seq | SS-Shadow | SS-Swap | D-Reg
Checkpointing | 0.33   | 0.33      | 0.37    | 0.34
Transferring  | 5.30   | 5.60      | 0.20    | 5.26
Restoring     | 38.49  | 3.23      | 2.61    | 2.28
Replaying     | 6.53   | 4.08      | 3.78    | 4.30
Finalizing    | 0.01   | 0.02      | 8.64    | 0.03
Total         | 50.8   | 13.8      | 15.8    | 12.4

TABLE III. PHASE DURATIONS (MEDIAN SECONDS, n = 10) AT 60 MSG/S

Phase         | SS-Seq | SS-Shadow | SS-Swap | D-Reg
Checkpointing | 0.34   | 0.34      | 0.39    | 0.33
Transferring  | 5.35   | 5.64      | 0.19    | 5.38
Restoring     | 38.41  | 2.89      | 3.02    | 2.50
Replaying     | 112.80 | 27.68     | 13.44   | 49.33
Finalizing    | 0.00   | 0.02      | 14.73   | 0.04
Total         | 157.5  | 36.0      | 31.8    | 58.2

C. Discussion

1) Interpretation: The comparison between configurations 1 and 2 isolates the effect of the ShadowPod strategy on the same workload type (StatefulSet). The 92% reduction in restore duration confirms that the StatefulSet identity constraint, which prior work treated as an inherent limitation, can be circumvented by decoupling traffic routing (label selectors) from pod ownership (controller references). Kubernetes Services and StatefulSet identity operate on independent mechanisms, and this separation is what makes zero-downtime migration possible without requiring a workload type change.

The comparison between configurations 2 and 3 isolates the effect of identity swap. SS-Swap's transfer phase is 25x faster than SS-Shadow (0.19 s vs. 5.64 s at 60 msg/s) because the agent-assisted registry push avoids Kubernetes Job scheduling overhead, but its finalize phase is 700x longer (14.73 s vs. 0.02 s) due to the re-checkpoint, pod replacement, and ExchangeFence procedure. The net effect is a modest difference in total migration time (31.8 s vs. 36.0 s at 60 msg/s for SS-Swap vs. SS-Shadow), but with the benefit of restoring full StatefulSet ownership of the migrated workload.

The comparison between configurations 2 and 4 (SS-Shadow vs. D-Reg) shows that the ShadowPod strategy applies to both StatefulSet and Deployment workloads, with comparable restore times (2.89 s vs.
2.50 s at 60 msg/s) and zero downtime in both cases. The replay duration differs more noticeably (27.7 s vs. 49.3 s at 60 msg/s), which reflects differences in the Deployment finalization path (affinity patching vs. StatefulSet scale-down) and their effect on how quickly the shadow pod begins draining the replay queue.

At high message rates (≥ 80 msg/s), the replay cutoff dominates total migration time across all configurations. This reveals a limitation of the MS2M replay mechanism: when µ_target < λ, the replay queue grows faster than it can be drained, and the cutoff fires with messages still pending. The ShadowPod strategy does reduce the replay queue length compared to Sequential (because the source continues processing during the shorter restore window), but cannot eliminate the rate-dependent bottleneck. Techniques such as batched message processing, increased consumer parallelism, or CRIU pre-dump for incremental checkpointing would address this regime.

Zero message loss across all 280 runs confirms the correctness of the concurrent source-shadow operation: the fanout exchange ensures that both pods receive identical message streams on separate queues, and the control message protocol guarantees an orderly handoff.

TABLE IV. SERVICE DOWNTIME (MEDIAN) BY CONFIGURATION AND MESSAGE RATE (n = 10 PER CELL). SS-SEQ VALUES IN SECONDS; SHADOWPOD VALUES IN MILLISECONDS.

Rate (msg/s) | SS-Seq | SS-Shadow | SS-Swap | D-Reg
10           | 31.2 s | 0 ms      | 0 ms    | 0 ms
20           | 31.2 s | 0 ms      | 0 ms    | 0 ms
40           | 31.2 s | 0 ms      | 0 ms    | 0 ms
60           | 31.2 s | 0 ms      | 0 ms    | 0 ms
80           | 31.1 s | 0 ms      | 0 ms    | 0 ms
100          | 31.1 s | 0 ms      | 0 ms    | 0 ms
120          | 30.9 s | 0 ms      | 0 ms    | 0 ms

2) Comparison with Prior Work: The prior evaluation [4] used GCE e2-medium VMs with a Java consumer. Despite different infrastructure, checkpoint (median 0.34 s vs. 0.4 s), transfer (median 5.4 s vs. 6 s), and Sequential restore (median 38.5 s vs.
39 s) durations are comparable, confirming these phases are I/O-bound and the restore bottleneck is intrinsic to StatefulSet identity constraints.

3) Trade-offs and Operator Overhead: The ShadowPod strategy for StatefulSets transitions the workload to a standalone shadow pod; the Sequential strategy remains available for workloads requiring StatefulSet guarantees. The direct transfer mechanism (not evaluated in this paper) can eliminate registry dependency at the cost of deploying the ms2m-agent DaemonSet with hostPath mounts; the agent-assisted registry push used in SS-Swap provides a partial benefit without bypassing the registry. The operator's reconcile loop overhead (API round-trips per phase transition) is negligible relative to phase durations and is justified by automatic retry, declarative lifecycle management, and native observability.

V. RELATED WORK

Container live migration techniques are surveyed by Soussi et al. [9]. Pre-copy transfers dirty pages iteratively, trading total time for lower downtime [10]; post-copy resumes immediately but degrades performance during page faults [2]. SHADOW avoids this trade-off by reconstructing state through message replay rather than memory transfer.

FCC [5] provides the kubelet API for CRIU checkpointing. KubeSPT [11] addresses stateful pod migration through T-Proxy (TCP connection preservation), lazy-restore of hot memory pages, and decoupled pod recreation, reporting 86–93% downtime reduction for memory-intensive workloads. However, KubeSPT focuses on raw memory state, not application-level consistency for message-driven services. SHADOW instead uses messaging infrastructure for state consistency and zero message loss, with the ShadowPod strategy achieving true zero-downtime by keeping the source serving throughout.

Other CRIU-based approaches include Guitart's [2] diskless iterative migration for HPC, UMS's [12] Frontman container for traffic management, and Ma et al.'s [1] filesystem-layer transfer for edge (56–80% time reduction). None support concurrent source-target operation, which is what makes SHADOW's zero-downtime guarantee possible.

Calagna et al. [13] present COAT for TCP connection preservation during edge migration and PAM for migration KPI prediction, but target Podman-based connection-oriented workloads rather than Kubernetes-native message-driven services. Laigner et al. [14] confirm that microservice state is typically derived from processed messages, which is the property SHADOW uses for state reconstruction.

The Operator pattern [7] has been widely adopted for managing complex application lifecycles in Kubernetes, including database operators (e.g., for PostgreSQL, MySQL) and stateful middleware. Both KubeSPT [11] and SHADOW use Custom Resource Definitions to drive migration as a Kubernetes-native workflow. SHADOW extends this pattern with an idempotent state machine reconciler that supports multiple migration strategies (Sequential, ShadowPod) and transfer modes (Registry, Direct), supporting declarative migration orchestration within Kubernetes' desired-state model.

VI. CONCLUSION

SHADOW shows that Kubernetes' StatefulSet identity constraint, previously treated as an unavoidable source of migration downtime, can be circumvented at the application level by separating traffic routing (label-based) from pod ownership (controller-based). The ShadowPod strategy uses this separation for concurrent source-target operation, eliminating service downtime entirely while preserving zero message loss. This result holds across both Deployment-managed and StatefulSet-managed workloads, confirming that the approach generalizes beyond a single workload controller.
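The rate-dependent replay limitation reported in the evaluation, where the cutoff fires with messages still pending whenever the target's consume rate µ_target falls below the arrival rate λ, can be made concrete with a minimal discrete-time queue model. This is an illustrative sketch under assumed rates, restore window, and cutoff values; it is not SHADOW's actual replay implementation:

```python
# Minimal discrete-time model of an MS2M-style replay phase (illustrative
# only; the rates, restore window, and cutoff below are assumptions).
# During the restore window, messages accumulate in the replay queue; once
# restored, the target drains the queue at mu_target while new messages keep
# arriving at lam. If mu_target <= lam, the backlog never empties and the
# cutoff fires with messages still pending.

def replay_backlog(lam, mu_target, restore_s, cutoff_s, dt=0.01):
    """Return (seconds until the queue drains or the cutoff fires,
    messages still pending at that moment)."""
    backlog = lam * restore_s  # messages buffered during the restore window
    t = 0.0
    while t < cutoff_s:
        backlog += (lam - mu_target) * dt  # net growth (or drain) per step
        if backlog <= 0:
            return t, 0  # queue drained before the cutoff
        t += dt
    return cutoff_s, int(round(backlog))  # cutoff fired with pending messages

# Low rate (mu > lam): the backlog drains well before the cutoff.
t_low, pending_low = replay_backlog(lam=10, mu_target=50, restore_s=3.0, cutoff_s=120)
# Overloaded regime (mu < lam): the cutoff fires with a growing backlog.
t_high, pending_high = replay_backlog(lam=120, mu_target=80, restore_s=3.0, cutoff_s=120)
print(t_low, pending_low, t_high, pending_high)
```

Under these assumed numbers, the first call drains the backlog in well under a second, while the second hits the cutoff with thousands of messages still pending, mirroring the qualitative behaviour of the ≥ 80 msg/s regime in the results above.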
For StatefulSet workloads that require full controller ownership after migration, the identity swap procedure with ExchangeFence re-checkpoints the shadow pod and creates a StatefulSet-adopted replacement, adding overhead but restoring ordered scaling and crash recovery guarantees.

Beyond the performance improvements (92% restore reduction, 73% total time reduction at low rates), SHADOW confirms the practical value of encoding migration procedures as a Kubernetes-native control loop. The idempotent state machine reconciler provides automatic failure recovery, declarative lifecycle management, and built-in observability; these are operational properties that are difficult to achieve with external orchestration scripts. The evaluation across 280 migration runs with zero failures confirms the reliability of this approach under varying message rates.

The remaining performance bottleneck is the replay phase at high message rates, where the cutoff mechanism bounds migration time but leaves the shadow pod with partially synchronized state. Future work can address this through CRIU pre-dump support for incremental checkpointing (reducing both checkpoint size and transfer time), batched message replay to increase effective consumer throughput, and extended support for multi-container pods.

REFERENCES

[1] L. Ma, S. Yi, N. Carter, and Q. Li, “Efficient Live Migration of Edge Services Leveraging Container Layered Storage,” IEEE Transactions on Mobile Computing, vol. 18, no. 9, pp. 2020–2033, 2018.
[2] J. Guitart, “Practicable Live Container Migrations in High Performance Computing Clouds: Diskless, Iterative, and Connection-Persistent,” Journal of Systems Architecture, vol. 152, p. 103157, 2024.
[3] H. Dinh-Tuan and F. Beierle, “MS2M: A Message-Based Approach for Live Stateful Microservices Migration,” in 2022 5th Conference on Cloud and Internet of Things (CIoT). IEEE, 2022, pp. 100–107.
[4] H. Dinh-Tuan and J. Jiang, “Optimizing Stateful Microservice Migration in Kubernetes with MS2M and Forensic Checkpointing,” in 2025 28th Conference on Innovation in Clouds, Internet and Networks (ICIN). IEEE, 2025, pp. 83–90.
[5] Kubernetes Contributors, “Forensic container checkpointing,” https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/, 2023, Kubernetes Documentation.
[6] CRIU Contributors, “CRIU – checkpoint/restore in userspace,” https://criu.org/, 2024.
[7] Kubernetes Contributors, “Operator pattern,” https://kubernetes.io/docs/concepts/extend-kubernetes/operator/, 2024, Kubernetes Documentation.
[8] Kubernetes SIGs, “controller-runtime – Kubernetes controller runtime library,” https://github.com/kubernetes-sigs/controller-runtime, 2024.
[9] W. Soussi, G. Gür, and B. Stiller, “Democratizing Container Live Migration for Enhanced Future Networks – A Survey,” ACM Computing Surveys, vol. 57, no. 4, pp. 1–37, 2024.
[10] Y. Lu and Y. Jiang, “A Container Pre-Copy Migration Method Based on Dirty Page Prediction and Compression,” in 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2023, pp. 704–711.
[11] H. Zhang, S. Wu, H. Fan, Z. Huang, W. Xue, C. Yu, S. Ibrahim, and H. Jin, “KubeSPT: Stateful Pod Teleportation for Service Resilience With Live Migration,” IEEE Transactions on Services Computing, 2025.
[12] S. Nadgowda, S. Suneja, N. Bila, and C. Isci, “Voyager: Complete Container State Migration,” in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 2137–2142.
[13] A. Calagna, Y. Yu, P. Giaccone, and C. F. Chiasserini, “Design, Modeling, and Implementation of Robust Migration of Stateful Edge Microservices,” IEEE Transactions on Network and Service Management, vol. 21, no. 2, pp. 1877–1893, 2023.
[14] R. Laigner, G. Christodoulou, K. Psarakis, A. Katsifodimos, and Y. Zhou, “Transactional Cloud Applications: Status Quo, Challenges, and Opportunities,” in Companion of the 2025 International Conference on Management of Data, 2025, pp. 829–836.
