fivenines

Theory / 10 min

Failover Controller

Failover is what happens when Redis has to change who is allowed to be primary.

That sounds simple until the network becomes unreliable. Promoting a replica is easy. Knowing when to promote one, which one to promote, and how to avoid two primaries is the real problem.

The high-level sequence is:

detect failure -> agree failure is real -> choose replica -> promote -> reconfigure topology

Failure Is Observed, Not Revealed

A primary does not always fail cleanly. It may crash, pause, become unreachable from some nodes, or sit behind a network partition. One monitor failing to reach the primary does not prove the primary is dead. It may only prove that monitor's network path is broken.

This is why Sentinel-style systems distinguish subjective failure from objective failure.

subjective down: one observer believes the primary is unavailable
objective down: enough observers agree to act

Quorum matters because failover is a distributed decision made under uncertainty.

Choosing The Best Replica

Once failover is justified, the controller should promote a replica that is likely to have the most complete and healthy copy of the dataset.

Useful signals include:

replica reachability
replication offset
staleness
configured priority
health checks

A replica with a higher processed replication offset has seen more of the primary's write stream. Promoting a stale replica can lose more acknowledged writes.

Asynchronous replication makes some data loss possible during failover. The controller's job is to minimize that loss while restoring availability.

Avoiding Split Brain

The nightmare scenario is split brain: two nodes both believe they are primary and accept writes independently. Once that happens, there may be no automatic way to merge the histories correctly.

Failover systems use leader election, quorum, epochs, and configuration versioning to reduce that risk.

An epoch is a version number for topology decisions. Newer epochs supersede older ones, which helps prevent stale controllers from overwriting a more recent failover outcome.

Reconfiguration Is Part Of Failover

Promotion is only one step. The system also has to tell other replicas to follow the new primary, update discovery information, and guide clients toward the new write target.

The old primary may later return. When it does, it cannot simply resume as if nothing happened. It must learn the newer topology and usually become a replica of the promoted node.

Failover is therefore a topology transition, not a local command.

Availability Has A Price

Automatic failover improves availability, but it does not make distributed systems painless. Under partitions, the controller may have to choose between refusing writes and risking divergence. Under asynchronous replication, acknowledged writes may be missing from the promoted replica.

Redis failover is best understood as controlled damage reduction. It tries to restore service quickly, promote the safest available replica, and prevent conflicting primaries. The complexity is not accidental. It is the cost of making decisions when the old source of truth may be silent.