
Chaos Playground


Simulate Failed Networks & Clock Drift in Consensus Clusters

[Interactive simulator: nodes A (Leader, Term 1), B and C (Followers, Term 1), each with a clock Skew slider set to 1.0x, plus a Cluster Events panel where simulation logs appear.]

How to Use This Simulator

Simulate Failures:

  • Cut Wires: Click the icon on a connection to simulate packet loss or a network partition.
  • Clock Drift: Drag the "Skew" slider on a node. 0.0x freezes that node's clock; 2.0x runs it twice as fast (triggering premature timeouts).

Observe Raft:

  • Watch Leader Election when the Leader is partitioned.
  • See how Terms increment as Candidates try to win votes.

Chaos Engineering & Distributed Consensus: A Deep Dive

In the world of distributed systems, "happy path" testing is insufficient. Networks fail, clocks drift, and nodes crash. This simulator demonstrates the Raft Consensus Algorithm, a de facto industry standard for maintaining consistency in the face of these failures. Used by Kubernetes (via etcd), CockroachDB, and HashiCorp Consul, Raft is the bedrock of modern cloud infrastructure.


The Fallacy of the Reliable Network

L. Peter Deutsch famously formulated the "Fallacies of Distributed Computing" (seven at first; an eighth was added later). The first is: The Network is Reliable. It is not. In a cloud environment, network partitions (where one part of the network cannot talk to another) are inevitable.

When a partition occurs, a distributed system risks "Split Brain": a state where two different parts of the cluster think they are in charge, leading to data corruption. Consensus algorithms like Raft prevent this by requiring a Quorum (majority) to make any decision.
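The quorum rule is simple arithmetic. A minimal sketch (the `quorum` helper is illustrative, not part of any Raft library):

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster: strictly more than half."""
    return n // 2 + 1

# With 3 nodes a decision needs 2 votes, so a partitioned
# minority of 1 node can never act; with 5 nodes it needs 3.
assert quorum(3) == 2
assert quorum(5) == 3
```

Because two disjoint majorities of the same cluster cannot exist, at most one side of any partition can make progress, which rules out split brain.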

Understanding Raft Consensus

Raft decomposes consensus into three sub-problems: Leader Election, Log Replication, and Safety.

1. Leader Election

Raft uses a strong leader model. One node is the Leader, and it handles all client interactions. The others are Followers.

  • Heartbeats: The Leader sends periodic heartbeats to Followers to assert authority. This suppresses elections.
  • Election Timeout: If a Follower doesn't hear from the Leader within a randomized timeout (e.g., 150-300ms), it assumes the Leader is dead.
  • Candidate State: The Follower increments its Term counter, votes for itself, and requests votes from others.
  • Majority Rules: If the Candidate receives votes from a majority of the cluster (N/2 + 1), it becomes the new Leader.
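The four steps above can be sketched in a few lines. This is a hypothetical model for illustration (the `run_election` function and 3-node `CLUSTER` are assumptions, not a real Raft implementation; real followers also refuse to vote twice in the same term):

```python
import random

CLUSTER = ["A", "B", "C"]

def election_timeout_ms(lo: int = 150, hi: int = 300) -> int:
    # Randomized timeouts make simultaneous candidacies (split votes) unlikely.
    return random.randint(lo, hi)

def run_election(candidate: str, reachable: set, term: int):
    """A follower times out, becomes a Candidate, and requests votes."""
    term += 1                         # increment its Term
    votes = 1                         # vote for itself
    for peer in CLUSTER:
        if peer != candidate and peer in reachable:
            votes += 1                # assume the peer grants its vote
    won = votes >= len(CLUSTER) // 2 + 1   # majority: N/2 + 1
    return term, won

# B can reach C but not the partitioned A: 2 of 3 votes wins term 2.
term, won = run_election("B", {"C"}, term=1)
```

Here `won` is True: B plus C form a majority of the 3-node cluster even with A unreachable.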

2. Terms & Logical Clocks

Time in Raft isn't measured in seconds, but in Terms. A Term is a monotonically increasing integer. If a Leader sees a higher Term in a message from a peer, it realizes it is outdated and immediately steps down to Follower. This effectively handles "Zombie Leaders" that come back online after a partition.
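The step-down rule fits in one comparison. A minimal sketch (the `on_message` function is an illustrative name, not a real API):

```python
def on_message(state: str, current_term: int, msg_term: int):
    """Any node that sees a higher Term adopts it and reverts to Follower."""
    if msg_term > current_term:
        return "FOLLOWER", msg_term   # stale node steps down immediately
    return state, current_term        # otherwise nothing changes

# A "zombie" leader stuck at term 1 receives a heartbeat stamped term 3:
state, term = on_message("LEADER", 1, 3)
```

After the call, `state` is "FOLLOWER" and `term` is 3: the zombie can no longer acknowledge writes, so it cannot corrupt the new leader's log.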

3. The Dangers of Clock Skew

Distributed systems rely on timeouts. If a node's clock runs too fast (simulated here with the Skew slider), it might trigger an Election Timeout prematurely, causing disruptive leadership changes. If it runs too slow, it might not detect a dead leader in time. This is why standard NTP synchronization is critical in real production clusters.
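The effect of the Skew slider can be modeled directly. A minimal sketch (names are illustrative):

```python
def real_ms_until_timeout(timeout_ms: float, skew: float) -> float:
    """Wall-clock time before a node's election timer fires.
    skew > 1.0 means the local clock runs fast, so the timer fires early;
    skew == 0.0 freezes the clock and the timer never fires."""
    if skew == 0.0:
        return float("inf")
    return timeout_ms / skew

# At 2.0x skew a 300 ms election timeout fires after only 150 ms of real
# time, possibly before the leader's next heartbeat (say, every 200 ms).
fast = real_ms_until_timeout(300, 2.0)   # 150.0 -> premature election
slow = real_ms_until_timeout(300, 0.5)   # 600.0 -> slow failure detection
```

This is exactly the "Runaway Candidate" scenario below: the fast node's timer beats the heartbeat interval every round.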

Chaos Scenarios to Try

Scenario 1: The Isolated Leader

Isolate the current Leader (e.g., Node A) by cutting its links to B and C.

Outcome: Node A thinks it is still leader (for a while), but cannot replicate data. Meanwhile, B and C will timeout, realize A is gone, and since they form a majority (2 out of 3), they will elect a NEW Leader (e.g., B).

Scenario 2: Runaway Candidate

Set Node C's clock skew to 2.0x (Fast).

Outcome: Node C's election timer expires much faster than the Leader's heartbeat arrives. C will constantly interrupt the cluster, declare itself candidate, increment the Term, and force elections. This is a classic distributed system performance bug caused by misconfigured timeouts.

Scenario 3: No Quorum (System Halt)

Cut the link between A-B, B-C, and C-A. Complete isolation.

Outcome: No node can reach a majority (2 votes). The system enters a stalemate. No Leader can be elected. Writes would be rejected. The system chooses Consistency (Safety) over Availability (CAP Theorem).
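The stalemate follows from the majority rule. A minimal sketch of this scenario (the `can_elect` helper is illustrative):

```python
def can_elect(partition_size: int, cluster_size: int) -> bool:
    """A partition can elect a leader only if it holds a majority."""
    return partition_size >= cluster_size // 2 + 1

# Cutting A-B, B-C, and C-A leaves three partitions of one node each.
# No partition reaches the 2-vote majority, so no leader emerges and
# every write is rejected: consistency is preserved at the cost of
# availability.
results = [can_elect(1, 3) for _ in "ABC"]   # [False, False, False]
```

Contrast this with Scenario 1, where the {B, C} partition holds 2 of 3 nodes and therefore can elect.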

References & Further Reading