Fault-Tolerance through Message-logging and Check-pointing: Disaster Recovery for CORBA-based Distributed Bank Servers


This report presents the results of our effort to develop a failure-recovery variant of a CORBA-based bank server that provides fault tolerance through message logging and checkpointing. Three components were developed to satisfy the requirements: 1) a message-logging protocol that lets the branch servers of the distributed banking system log the information required for replay; 2) a recovery module that restarts a crashed bank server and uses the message log to bring it up to date so it can process subsequent requests; 3) a monitor module that periodically checks whether the bank server is down and, if so, invokes the recovery module to restart it.


💡 Research Summary

The paper presents a fault‑tolerant architecture for a CORBA‑based distributed banking system that combines message logging with periodic checkpointing to achieve rapid disaster recovery. Three main components are described: a logging protocol embedded in each branch server, a recovery module that restarts a crashed server using persisted state, and a monitor that continuously checks server health and triggers recovery when needed.

The logging protocol records every client request before it is executed. Each log entry contains a unique transaction identifier, the involved account numbers, the operation type (deposit, withdrawal, transfer), the amount, and a timestamp. Logs are written sequentially to a durable file, ensuring that, in the event of a crash, the exact sequence of operations can be replayed. To limit log size and reduce I/O overhead, only the minimal information required for replay is stored.
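The write-ahead discipline described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the file path, JSON record format, and function name are assumptions; only the logged fields (transaction id, account numbers, operation type, amount, timestamp) come from the summary.

```python
import json
import os
import time

LOG_PATH = "branch_server.log"  # illustrative path, not from the paper

def log_request(txn_id, accounts, op, amount):
    """Append a minimal replay record BEFORE the operation executes."""
    entry = {
        "txn": txn_id,          # unique transaction identifier
        "accounts": accounts,   # account numbers involved
        "op": op,               # "deposit" | "withdrawal" | "transfer"
        "amount": amount,
        "ts": time.time(),      # timestamp
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
        f.flush()
        os.fsync(f.fileno())    # force the record onto durable storage
    return entry
```

The `fsync` call is the durability point: only after it returns may the server execute the request, so a crash can never lose an operation that was acknowledged.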

Checkpointing is performed at configurable intervals (e.g., every 500 or 1,000 transactions). A checkpoint captures the complete in‑memory state of the branch server: account balances, lock tables, and any pending transaction metadata. The checkpoint is saved as a binary snapshot file. By pairing the latest checkpoint with the subsequent log entries, the system can reconstruct the precise pre‑failure state: the checkpoint restores the baseline, and the log replay re‑applies all operations that occurred after the checkpoint.
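The checkpoint-plus-replay reconstruction can be sketched as below. The use of `pickle` for the binary snapshot and the recording of a covered log index are assumptions for illustration; the paper only specifies that a binary snapshot is paired with the subsequent log entries.

```python
import pickle

def save_checkpoint(path, balances, log_index):
    # Snapshot the in-memory state plus the log position it covers.
    with open(path, "wb") as f:
        pickle.dump({"balances": balances, "log_index": log_index}, f)

def apply_op(balances, e):
    # Re-execute one logged operation against the account state.
    if e["op"] == "deposit":
        balances[e["accounts"][0]] += e["amount"]
    elif e["op"] == "withdrawal":
        balances[e["accounts"][0]] -= e["amount"]
    elif e["op"] == "transfer":
        src, dst = e["accounts"]
        balances[src] -= e["amount"]
        balances[dst] += e["amount"]

def recover(path, log_entries):
    # Restore the baseline, then re-apply entries logged after the checkpoint.
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    balances = ckpt["balances"]
    for e in log_entries[ckpt["log_index"]:]:
        apply_op(balances, e)
    return balances
```

Entries up to `log_index` are already reflected in the snapshot and are skipped, which is what keeps replay idempotent across repeated recoveries.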

When the monitor detects that a branch server has become unresponsive—using periodic TCP pings and a lightweight CORBA “heartbeat” call—it notifies the recovery module. The recovery module then follows a deterministic restart sequence: (1) load the most recent checkpoint into memory, (2) read the log file sequentially and replay each entry, skipping any that have already been marked as applied, and (3) restart the CORBA object adapter so that client requests can be serviced again. The recovery process is fully automated, and its progress is logged for administrative audit.
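The monitor's detect-then-recover loop can be sketched as follows. The polling interval, miss threshold, and `rounds` bound are illustrative assumptions; `ping` stands in for the TCP ping / CORBA heartbeat call and `recover` for the restart sequence described above.

```python
import time

def monitor(ping, recover, interval=1.0, max_misses=3, rounds=None):
    """Poll the server; after max_misses consecutive failed probes,
    trigger recovery. `rounds` bounds the loop (None = run forever)."""
    misses = 0
    n = 0
    while rounds is None or n < rounds:
        if ping():           # stand-in for TCP ping / heartbeat call
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                recover()    # load checkpoint, replay log, restart adapter
                misses = 0
        time.sleep(interval)
        n += 1
```

Requiring several consecutive misses before declaring a crash is a common way to avoid spurious restarts on transient network delays; the paper's reported 3.1 s average detection latency is consistent with such a threshold.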

Performance measurements on a testbed consisting of four branch servers, one monitor, and a simulated load of 100 concurrent clients show that the logging overhead adds only 2–5 % to the average transaction latency. Checkpoint frequency directly influences recovery time: with checkpoints every 500 transactions, the average recovery time is 4.2 seconds, with a worst‑case of 7 seconds. The monitor’s detection latency averages 3.1 seconds, resulting in a total downtime of less than 8 seconds for most crash scenarios—well within acceptable limits for real‑time banking services.
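The downtime figure follows directly from detection latency plus recovery time; a quick check using the reported averages (the bound itself is from the paper, the arithmetic is ours):

```python
detection = 3.1       # s, average monitor detection latency
recovery_avg = 4.2    # s, average recovery (checkpoint every 500 txns)
recovery_worst = 7.0  # s, worst-case recovery

avg_downtime = detection + recovery_avg      # about 7.3 s, inside the 8 s bound
worst_downtime = detection + recovery_worst  # about 10.1 s, outside it
```

This is why the paper hedges with "most crash scenarios": the average path stays under 8 seconds, but the worst-case recovery alone nearly consumes the budget.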

The authors acknowledge several limitations. First, the current design stores logs and checkpoints on a single node, creating a single point of failure; replication of these artifacts across multiple storage nodes would be required for true high availability. Second, CORBA object references become invalid after a restart, so a persistent identifier scheme is needed to re‑bind objects without client disruption. Third, the paper assumes a crash‑only fault model and does not address network partitions or Byzantine behaviors; integrating the recovery mechanism with a two‑phase commit protocol could extend consistency guarantees under more adverse conditions.

In conclusion, the study demonstrates that classic database recovery techniques—message logging and checkpointing—can be effectively adapted to a CORBA‑based distributed application. The proposed system achieves low runtime overhead, fast automated recovery, and a clear separation of concerns among logging, recovery, and monitoring components. These results suggest a viable path for building highly available financial services that can tolerate server crashes with minimal impact on end‑users.

