Ensuring Timeliness in Primary-Backup Replication
25 October 2011
Mechanisms that add fault tolerance to applications (e.g., through a virtual-machine hypervisor) incur network overhead. In particular, existing protocols for primary-backup replication can add significant latency and jitter, seriously degrading the network performance of latency-sensitive applications. A recent impossibility proof shows that some delays are unavoidable for the standard primary-backup structure. Guided by hints from the proof, we propose a number of new techniques: a jitter smoother, an overlapped checkpointing mechanism, and a new structure we call Inline Backup. We demonstrate analytically and experimentally that these ideas dramatically improve network performance. For instance, under latency-sensitive workloads such as SPECWeb, our techniques reduce the latency overhead of state-of-the-art replication mechanisms from 95% to 35% for overlapped checkpointing and to 12.5% for Inline Backup. These gains come without weakening strong consistency guarantees on memory and disk data.