Repairing Replicas in Leaderless Systems
In leaderless replication, nodes can temporarily go down, miss writes, and fall out of sync. Three mechanisms ensure replicas eventually converge: hinted handoff, read repair, and anti-entropy with Merkle trees.
Hinted Handoff
When a write targets node C but node C is down, another node (say node D) can temporarily store the write along with a hint: "this data belongs on node C." Node D is not normally responsible for this key—it is just holding the data temporarily.
When node C comes back online, node D detects this (via gossip protocol or heartbeat) and forwards the stored writes to node C. Once node C confirms receipt, node D deletes the hint. This ensures that writes are not lost during temporary outages, even when using sloppy quorums.
Important limitation: Hinted handoff provides durability, not consistency. Until the hint is delivered, node C is missing data. If node D also fails before delivering the hint, the data is lost (unless other replicas have it).
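The store-and-forward cycle above can be sketched in a few lines. The `Node` and `Hint` types here are illustrative, not any real database's API:

```python
# Sketch of hinted handoff: a coordinator holds writes for a down node
# and forwards them once the node recovers. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Hint:
    target: str   # name of the node that should own this write
    key: str
    value: str

@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)
    hints: list = field(default_factory=list)

def write(coordinator: Node, target: Node, key: str, value: str):
    if target.up:
        target.data[key] = value
    else:
        # Target is down: coordinator stores the write plus a hint.
        coordinator.hints.append(Hint(target.name, key, value))

def deliver_hints(holder: Node, recovered: Node):
    """Called when the holder learns (e.g. via gossip) that `recovered` is back."""
    remaining = []
    for hint in holder.hints:
        if hint.target == recovered.name and recovered.up:
            recovered.data[hint.key] = hint.value  # forward the stored write
        else:
            remaining.append(hint)
    holder.hints = remaining  # delete hints once delivered

c = Node("C", up=False)
d = Node("D")
write(d, c, "k1", "v1")   # C is down; D holds the data plus a hint
c.up = True
deliver_hints(d, c)       # D forwards the write and deletes the hint
```

Note the failure mode from above: if `d` crashed before `deliver_hints` ran, the hinted write would be gone unless other replicas accepted it.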
Read Repair
During a normal quorum read, the coordinator contacts R nodes and compares their responses. If some nodes return stale values (older version numbers), the coordinator sends the latest value to those stale nodes after returning the result to the client.
Read repair is opportunistic: it only fixes data that is actually read. Infrequently-accessed keys may remain inconsistent indefinitely. This is why anti-entropy is also needed.
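A minimal sketch of the read-then-repair flow, using plain version numbers as a stand-in for whatever versioning scheme (timestamps or vector clocks) the system actually uses:

```python
# Sketch of read repair during a quorum read. Each replica is modeled
# as a dict mapping key -> (version, value); highest version wins.
def quorum_read(replicas, key):
    responses = [rep.get(key, (0, None)) for rep in replicas]
    latest = max(responses)  # pick the newest (version, value) pair
    # Read repair: after determining the result, push the latest
    # value to any replica that returned a stale version.
    for rep, resp in zip(replicas, responses):
        if resp < latest:
            rep[key] = latest
    return latest[1]

r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}   # stale replica
r3 = {"k": (2, "new")}
value = quorum_read([r1, r2, r3], "k")
# After the read, r2 has been repaired to version 2.
```

Keys that are never read never pass through this path, which is exactly why a background anti-entropy process is still required.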
Anti-Entropy with Merkle Trees
A Merkle tree (hash tree) is used to efficiently detect which key ranges are inconsistent between two replicas. Each leaf node is the hash of a key-value pair. Parent nodes are hashes of their children. The root hash summarizes the entire dataset.
To compare two replicas:
- Exchange root hashes. If they match, the replicas are identical—done.
- If roots differ, exchange hashes of child nodes. Only descend into subtrees where hashes differ.
- Continue recursively until the divergent key ranges are identified.
- Synchronize only the differing keys.
This is dramatically more efficient than comparing every key. For a dataset with millions of keys, you might only need to exchange a few hundred hashes to find the 10 keys that differ. Cassandra rebuilds Merkle trees periodically and uses them during anti-entropy repair (`nodetool repair`).
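The comparison steps above can be sketched as follows. This toy version hashes individual key-value pairs rather than key ranges, and assumes both replicas hold the same set of keys (real systems hash ranges so trees stay comparable):

```python
# Sketch of Merkle-tree construction and comparison between two replicas.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(items):
    """items: sorted list of (key, value). Returns levels, leaves first."""
    level = [h(f"{k}={v}".encode()) for k, v in items]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its children.
        level = [h(level[i] + (level[i + 1] if i + 1 < len(level) else b""))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(tree_a, tree_b):
    """Descend from the roots, returning leaf indices whose hashes differ."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []                       # roots match: replicas identical
    frontier = [0]                      # start at the root
    for depth in range(top - 1, -1, -1):
        nxt = []
        for i in frontier:
            for child in (2 * i, 2 * i + 1):
                if child < len(tree_a[depth]) and \
                   tree_a[depth][child] != tree_b[depth][child]:
                    nxt.append(child)   # descend only into differing subtrees
        frontier = nxt
    return frontier                     # indices of divergent leaves

a = build_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
b = build_tree([("k1", "v1"), ("k2", "x"), ("k3", "v3"), ("k4", "v4")])
divergent = diff(a, b)   # only the subtree containing k2 is explored
```

Only the keys at the divergent leaf positions need to be streamed between replicas; matching subtrees are skipped entirely.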
Real-Life: Cassandra Repair and DynamoDB
Cassandra anti-entropy repair:
- Administrators run `nodetool repair` periodically (recommended at least once within the `gc_grace_seconds` window, 10 days by default).
- The repair process builds Merkle trees for each token range, compares them between replicas, and streams the differing data.
- Incremental repair (since Cassandra 2.1): only SSTables written since the last repair are included, dramatically reducing the data compared.
- Sub-range repair: repair specific token ranges instead of the entire dataset, reducing the scope of each operation.
DynamoDB:
- Uses a similar anti-entropy mechanism internally, but it is fully managed—users never run repair manually.
- DynamoDB also implements hinted handoff within its storage nodes to handle temporary failures across Availability Zones.
Riak:
- Active anti-entropy (AAE) runs continuously in the background, building and comparing Merkle trees between replicas.
- Read repair is enabled by default: every read compares values across replicas and triggers repair if they disagree.
- Hinted handoff: writes destined for an unavailable node go to fallback vnodes, which persist the data in their storage backend and hand it off when the primary vnode returns.