It's not a lie if you don't get caught: simplifying reconfiguration in SMR through dirty logs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Production state-machine replication (SMR) implementations are complex, multi-layered architectures comprising data dissemination, ordering, execution, and reconfiguration components. Existing research on consensus protocols rarely discusses reconfiguration, and the protocols that do address it tightly couple membership changes to a specific algorithm. This prevents the independent upgrade of individual building blocks and forces expensive downtime when transitioning to new protocol implementations. Yet modularity is essential for maintainability and system evolution in production deployments. We present Gauss, a reconfiguration engine designed to treat consensus protocols as interchangeable modules. By introducing a distinction between a consensus protocol’s inner log and a sanitized outer log exposed to the RSM node, Gauss allows engineers to upgrade membership, failure thresholds, and the consensus protocol itself independently and with minimal global downtime. Our initial evaluation on the Rialo blockchain shows that this separation of concerns enables a seamless evolution of the SMR stack across a sequence of diverse protocol implementations.


💡 Research Summary

The paper tackles two practical challenges that modern state‑machine replication (SMR) deployments face: (1) the need to change the set of participants (membership) over time, and (2) the need to replace the underlying consensus algorithm itself as the system evolves. Existing academic work typically assumes a static participant universe and tightly couples reconfiguration logic to a single consensus protocol, making upgrades expensive and often requiring full system downtime.

Gauss, the proposed reconfiguration engine, solves these problems by introducing a clean separation between the inner log—the raw, totally ordered sequence of commands produced by any consensus engine—and the outer log—the sanitized sequence that the rest of the SMR stack (dissemination, execution, client interface) actually consumes. A “log sanitizer” sits between the two logs and decides, based on external validity checks, which inner‑log entries become visible in the outer log. Because the outer log is the only interface exposed to other components, the inner log can be treated as a “dirty” artifact that need not be directly executed.
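The inner/outer split can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `LogSanitizer` class, the `is_valid` callback, and the `stale:` prefix are hypothetical names chosen for illustration, and entries are plain strings rather than real transactions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LogSanitizer:
    """Filters a consensus engine's inner log into the outer log (illustrative only)."""
    is_valid: Callable[[str], bool]                      # hypothetical external validity check
    outer_log: List[str] = field(default_factory=list)   # what the rest of the stack consumes

    def on_inner_commit(self, entry: str) -> None:
        # Inner-log entries arrive in the total order fixed by consensus.
        # Only entries that pass the validity check become visible downstream;
        # the relative order of the surviving entries is preserved.
        if self.is_valid(entry):
            self.outer_log.append(entry)


# Assumed example: entries tagged "stale:" are treated as invalid.
sanitizer = LogSanitizer(is_valid=lambda e: not e.startswith("stale:"))
for entry in ["tx1", "stale:tx2", "tx3"]:
    sanitizer.on_inner_commit(entry)

print(sanitizer.outer_log)  # ['tx1', 'tx3']
```

The execution layer in this sketch only ever reads `outer_log`, which is why the inner log may contain entries that are never executed.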

Gauss builds on the classic Horizontal Paxos reconfiguration technique but extends it to Byzantine‑fault‑tolerant settings and, crucially, to arbitrary changes in membership, failure thresholds, and consensus protocol. The reconfiguration protocol proceeds in three well‑defined phases:

  1. Prepare Phase – An operator injects an EpochChange transaction into the current consensus instance (epoch i). This transaction carries the description of the next epoch (epoch i+1): new membership, new failure bound, and the new consensus algorithm. All replicas in epoch i continue processing their inner log as usual. Replicas that belong to epoch i+1, upon seeing the EpochChange, synchronize their local state, start the new consensus engine, and submit a Ready transaction back to epoch i, proving they are prepared.

  2. Handover Phase – The system observes whether all Ready transactions are committed before another EpochChange appears. If they are, each replica in epoch i creates a handover certificate that records the log position where epoch i ends and includes a hash of the EpochChange and the previous certificate. This certificate is submitted as a Done transaction to epoch i+1, establishing a cryptographic trust chain between epochs. If a new EpochChange arrives first, the handover is aborted and a new preparation starts, guaranteeing safety.

  3. Shutdown Phase – After a handover certificate is committed, replicas in epoch i stop feeding inner‑log entries beyond the recorded cut‑off position into the outer log. They may then safely shut down or assume a passive role. Because the outer log is the only source of state for the execution engine, the system continues uninterrupted from the perspective of clients.
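The Prepare and Handover rules above can be sketched as a small tracker that watches committed transactions in epoch i. This is a simplified illustration under stated assumptions, not the authors' code: the tuple encoding of transactions, the `EpochTracker` and `HandoverCert` names, the use of SHA-256, and the reading of "all Ready transactions" as "one Ready per member of epoch i+1" are all assumptions. Submitting the certificate as a Done transaction to epoch i+1 is not modeled.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Set, Tuple


def digest(*parts: str) -> str:
    """SHA-256 over the joined parts; a stand-in for whatever hash Gauss uses."""
    return hashlib.sha256("|".join(parts).encode()).hexdigest()


@dataclass(frozen=True)
class HandoverCert:
    cutoff: int           # inner-log position where epoch i ends
    change_hash: str      # hash of the EpochChange being completed
    prev_cert_hash: str   # link to the previous epoch's certificate


class EpochTracker:
    def __init__(self, prev_cert_hash: str = "genesis") -> None:
        self.prev_cert_hash = prev_cert_hash
        self.change_id: Optional[str] = None   # EpochChange currently being prepared
        self.waiting: Set[str] = set()         # new-epoch members that have not sent Ready

    def on_commit(self, position: int, tx: Tuple) -> Optional[HandoverCert]:
        kind = tx[0]
        if kind == "EpochChange":
            # A newer EpochChange aborts any preparation still in flight.
            _, change_id, members = tx
            self.change_id = change_id
            self.waiting = set(members)
        elif kind == "Ready" and tx[1] == self.change_id:
            # Ready messages for an aborted change fail the check above and are ignored.
            self.waiting.discard(tx[2])
            if not self.waiting:
                cert = HandoverCert(position, digest(self.change_id), self.prev_cert_hash)
                self.change_id = None
                return cert
        return None


# Hypothetical committed sequence: change "c1" is superseded by "c2" before it completes.
inner_log = [
    ("EpochChange", "c1", {"a", "b"}),
    ("Ready", "c1", "a"),
    ("EpochChange", "c2", {"b", "c"}),
    ("Ready", "c1", "b"),   # stale: belongs to the aborted change
    ("Ready", "c2", "b"),
    ("Ready", "c2", "c"),
]
tracker = EpochTracker()
cert = None
for position, tx in enumerate(inner_log):
    cert = tracker.on_commit(position, tx) or cert

print(cert.cutoff)  # 5
```

Because every replica processes the same committed sequence, each one would derive the same cutoff position in this sketch, which is what lets the certificate mark a single agreed end of epoch i.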

The inner/outer log distinction gives Gauss several powerful properties:

  • Arbitrary membership changes – Any set of nodes can be replaced in a single reconfiguration, which is essential for blockchains where validator stakes change.
  • Protocol‑agnostic upgrades – Gauss makes no assumptions about the internal workings of the consensus algorithm beyond safety and liveness. Consequently, a system can migrate from a classic BFT protocol (e.g., PBFT) to a newer one (e.g., HotStuff or Mysticeti) without rewriting the reconfiguration code.
  • Minimal downtime – The only observable impact is a brief latency spike during the handover; the outer log remains monotonic, and clients never see a gap or inconsistency.
  • Compatibility with multi‑proposer and parallel consensus instances – Since the outer log simply ignores inner‑log entries after the handover point, ongoing proposals in the old epoch do not interfere with the new epoch’s progress.
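The protocol-agnostic and cutoff properties listed above can be made concrete with a short sketch. Everything here is hypothetical: `ConsensusEngine`, `ListEngine`, and `stitch_outer_log` are illustrative names, and the only thing assumed about an engine is that it exposes its committed entries in order.

```python
from typing import Iterable, List, Optional, Protocol, Tuple


class ConsensusEngine(Protocol):
    """The only capability this sketch requires from any protocol implementation."""
    def committed(self) -> Iterable[str]: ...


class ListEngine:
    """Stand-in engine backed by a fixed list; any real protocol could take its place."""
    def __init__(self, entries: List[str]) -> None:
        self._entries = list(entries)

    def committed(self) -> Iterable[str]:
        return iter(self._entries)


def stitch_outer_log(epochs: List[Tuple[ConsensusEngine, Optional[int]]]) -> List[str]:
    """Concatenate per-epoch inner logs, dropping entries past each epoch's cutoff."""
    outer: List[str] = []
    for engine, cutoff in epochs:
        for position, entry in enumerate(engine.committed()):
            if cutoff is not None and position > cutoff:
                break  # entries committed after the handover point are ignored
            outer.append(entry)
    return outer


old_epoch = ListEngine(["a", "b", "late-proposal"])  # keeps committing after handover
new_epoch = ListEngine(["d", "e"])                   # still running, so no cutoff yet
outer_log = stitch_outer_log([(old_epoch, 1), (new_epoch, None)])

print(outer_log)  # ['a', 'b', 'd', 'e']
```

The stitching function never inspects how an engine reached agreement, so swapping one engine class for another would not change it.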

The authors implemented Gauss inside the Rialo blockchain, which originally used a PBFT‑style protocol and later transitioned to HotStuff and then to Mysticeti. Their evaluation shows that each transition required no service interruption, and the latency spike stayed under 200 ms even under load. Moreover, the system preserved the standard SMR guarantees (safety, liveness, integrity, external validity) across all epochs, confirming that the outer‑log sanitization does not weaken correctness.

In summary, Gauss demonstrates that by treating the consensus engine’s output as a “dirty log” and exposing only a sanitized outer log, one can achieve fully modular, protocol‑independent reconfiguration with negligible downtime. This design greatly simplifies the operational lifecycle of production‑grade SMR systems, enabling seamless upgrades, dynamic validator sets, and long‑term maintainability without sacrificing the strong safety and liveness properties that underpin modern distributed databases and blockchains.

