
Leaderless Replication: The Dynamo Approach

Leaderless replication abandons the concept of a designated leader entirely. Any node can accept both reads and writes. This design was popularized by Amazon's Dynamo paper (2007) and inspired Cassandra, Riak, and Voldemort.

How Writes Work

When a client writes a value, it (or a coordinator node) sends the write to all N replicas that own the key, in parallel. The write is considered successful once at least W nodes acknowledge it; the client does not wait for the remaining N − W.

If some nodes are unreachable, the write still succeeds as long as W nodes respond. There is no single point of failure: any node going down does not block writes.
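A minimal sketch of this quorum write, using simple in-memory stand-ins for replicas (the Node class and send method here are illustrative, not a real API; a real client would issue the sends in parallel over the network):

```python
# Dynamo-style quorum write: send to all N replicas, succeed once W ack.

N, W = 3, 2  # N replicas own the key; W acks make the write successful

class Node:
    def __init__(self, up=True):
        self.up = up
        self.data = {}

    def send(self, key, value):
        if not self.up:
            raise ConnectionError("node unreachable")
        self.data[key] = value

def quorum_write(replicas, key, value):
    """Send to all replicas; succeed if at least W acknowledge."""
    acks = 0
    for node in replicas:          # sequential here for clarity
        try:
            node.send(key, value)
            acks += 1
        except ConnectionError:
            pass                   # a down node does not block the write
    return acks >= W

nodes = [Node(), Node(), Node(up=False)]   # third replica is down
ok = quorum_write(nodes, "cart:42", "v2")
print(ok)  # True: 2 acks >= W, the write succeeds despite the failure
```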

How Reads Work

The client reads from R nodes in parallel and takes the value with the highest version number. If responses disagree (some nodes hold older data), the most recent version is returned to the client, and the stale replicas can be updated with the newer value in the background (read repair).
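A sketch of the read side, again with in-memory stand-in nodes. Versions are plain integers here for simplicity; Dynamo-style systems track causality with version vectors, covered below:

```python
# Quorum read: query R replicas, keep the value with the highest version.

R = 2

class Node:
    def __init__(self, version, value):
        self.store = {"cart:42": (version, value)}

    def get(self, key):
        return self.store[key]

def quorum_read(replicas, key):
    """Read from R replicas and return the most recent (version, value)."""
    responses = []
    for node in replicas[:R]:              # sequential for clarity
        responses.append(node.get(key))
    return max(responses)                  # highest version wins

nodes = [Node(2, "v2"), Node(1, "v1"), Node(2, "v2")]
print(quorum_read(nodes, "cart:42"))  # (2, 'v2') -- the stale replica is outvoted
```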

Sloppy Quorum

In a strict quorum, the W writes and R reads must go to the designated N nodes for that key. In a sloppy quorum, if some designated nodes are unavailable, the write can go to any available node, even if that node isn't normally responsible for the key. This dramatically improves write availability during partial failures. The temporarily-storing node holds the data with a "hint" about which node it's meant for (hinted handoff).
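The hinted-handoff mechanics can be sketched as follows; the node names, `sloppy_write`, and `handoff` functions are hypothetical illustrations of the idea, not any system's real API:

```python
# Sloppy quorum with hinted handoff: a write meant for a down node goes
# to a substitute, tagged with a hint naming the intended owner; the
# substitute replays it when the owner recovers.

class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.data = {}
        self.hints = []  # (intended_owner, key, value) tuples

def sloppy_write(designated, fallback, key, value):
    for node in designated:
        if node.up:
            node.data[key] = value
        else:
            # store on the fallback with a hint about the real owner
            fallback.hints.append((node.name, key, value))
            fallback.data[key] = value

def handoff(fallback, nodes_by_name):
    """Replay hinted writes whose intended owners are back up."""
    remaining = []
    for owner, key, value in fallback.hints:
        if nodes_by_name[owner].up:
            nodes_by_name[owner].data[key] = value
        else:
            remaining.append((owner, key, value))
    fallback.hints = remaining

a, b, c = Node("A"), Node("B"), Node("C", up=False)
d = Node("D")                       # not normally responsible for the key
sloppy_write([a, b, c], d, "k", "v1")
c.up = True                         # C recovers
handoff(d, {"A": a, "B": b, "C": c})
print(c.data)  # {'k': 'v1'} -- the hinted write reached its owner
```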

Version Vectors (Vector Clocks)

Since writes can arrive at different nodes in different orders, the system needs to track causality. A version vector assigns a counter per node. When node A processes a write, it increments its own counter: {A:3, B:2, C:1}. When comparing two version vectors:

  • If every counter in V1 is ≥ the corresponding counter in V2, then V1 dominates V2 (it is a causal successor of V2).
  • If neither dominates, the writes are concurrent and must be resolved (siblings in Riak, or LWW in Cassandra).
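The dominance check above translates directly into code. A small sketch, treating a missing counter as 0:

```python
# Version-vector comparison: V1 dominates V2 when every counter in V1
# is >= the matching counter in V2; if neither dominates, the writes
# are concurrent.

def dominates(v1, v2):
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)

def compare(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "equal"
    if dominates(v1, v2):
        return "v1 succeeds v2"
    if dominates(v2, v1):
        return "v2 succeeds v1"
    return "concurrent"  # siblings in Riak, LWW in Cassandra

print(compare({"A": 3, "B": 2, "C": 1}, {"A": 2, "B": 2, "C": 1}))
# -> v1 succeeds v2
print(compare({"A": 3, "B": 1}, {"A": 1, "B": 2}))
# -> concurrent
```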

Anti-Entropy

A background process that compares data across replicas and repairs inconsistencies. Unlike read repair (which only fixes data that is actually read), anti-entropy proactively synchronizes all data, ensuring eventual convergence even for infrequently-read keys.
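A toy version of the repair step, comparing every key directly; real systems such as Cassandra and Riak use Merkle trees so replicas only exchange hashes of divergent ranges rather than full key sets:

```python
# Anti-entropy repair: walk all keys on two replicas and copy the
# higher-versioned value over, converging both sides.

def anti_entropy(replica_a, replica_b):
    """Converge two replicas; values are (version, payload) pairs."""
    for key in set(replica_a) | set(replica_b):
        va = replica_a.get(key, (0, None))
        vb = replica_b.get(key, (0, None))
        newest = max(va, vb)          # higher version wins
        replica_a[key] = newest
        replica_b[key] = newest

a = {"x": (2, "new"), "y": (1, "old")}
b = {"x": (1, "old"), "y": (3, "newer"), "z": (1, "only-on-b")}
anti_entropy(a, b)
print(a == b)  # True: both replicas converge, even for keys never read
```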

Real-Life: Amazon DynamoDB and Cassandra


Amazon DynamoDB evolved from the original Dynamo paper:

  • Each item is replicated across three Availability Zones within a region.
  • Writes use a Paxos-based protocol for the "leader" of each partition, but the system appears leaderless to the client since any node can coordinate.
  • Eventually consistent reads (default): read from any one replica, fast but possibly stale.
  • Strongly consistent reads: read from the partition leader, guaranteeing the latest value.

Apache Cassandra implements Dynamo-style leaderless replication:

  • Configurable consistency levels per query: ONE, QUORUM, ALL, LOCAL_QUORUM.
  • QUORUM = floor(N/2) + 1 nodes must respond. With replication factor 3, QUORUM = 2.
  • A write with QUORUM + a read with QUORUM guarantees overlap: at least one node in the read set has the latest write.
  • Lightweight transactions (LWT): Cassandra supports IF NOT EXISTS using Paxos for linearizable operations when needed.
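The quorum arithmetic from the bullets above, checked numerically:

```python
# QUORUM = floor(N/2) + 1, so QUORUM writes plus QUORUM reads always
# overlap in at least one replica (2 * quorum(n) > n).

def quorum(n):
    return n // 2 + 1

for n in (3, 5, 7):
    q = quorum(n)
    overlap = q + q - n        # replicas guaranteed in both sets
    print(n, q, overlap)
# 3 2 1
# 5 3 1
# 7 4 1
```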

Riak (now Riak KV): Stores conflicting concurrent writes as siblings and lets the application resolve them. This avoids data loss from LWW but requires application logic.

Leaderless Write and Read (W=2, R=2, N=3)

Write (W=2 of N=3): the client sends the write to Nodes 1, 2, and 3. Nodes 1 and 2 ACK; Node 3 is down. With W=2 ACKs received, the write succeeds despite Node 3 being unavailable.

Read (R=2 of N=3): the client reads from Node 1 (v2) and Node 3 (v1). The client picks the highest version, v2, then sends v2 to Node 3 (read repair).

Quorum rule: R + W > N ensures overlap. Here R=2 + W=2 = 4 > N=3, so at least one node in the read set has the latest write.