Chaos Playground
Runs in browser
Simulate Failed Networks & Clock Drift in Consensus Clusters
How to Use This Simulator
Simulate Failures:
- Cut Wires: Click the icon on connections to simulate packet loss / partitions.
- Clock Drift: Drag the "Skew" slider on a node. 0.0x stops time, 2.0x speeds it up (causing timeouts).
Observe Raft:
- Watch Leader Election when the Leader is partitioned.
- See how Terms increment as Candidates try to win votes.
Chaos Engineering & Distributed Consensus: A Deep Dive
In distributed systems, "happy path" testing is insufficient. Networks fail, clocks drift, and nodes crash. This simulator demonstrates the Raft consensus algorithm, a widely deployed protocol for keeping replicated state consistent in the face of these failures. Raft underpins etcd (the storage backend for Kubernetes), CockroachDB, and HashiCorp Consul, which makes it a foundation of much modern cloud infrastructure.
The Fallacy of the Reliable Network
L. Peter Deutsch and colleagues at Sun Microsystems are credited with the "Eight Fallacies of Distributed Computing". The first is: The Network is Reliable. It is not. In a cloud environment, network partitions (where one part of the network cannot talk to another) are inevitable.
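When a partition occurs, a distributed system risks "Split Brain": a state where two different parts of the cluster each think they are in charge and accept conflicting writes, leading to data corruption. Consensus algorithms like Raft prevent this by requiring a **Quorum** (a strict majority of the cluster) to make any decision. Because two disjoint groups cannot both contain a majority, at most one side of a partition can make progress.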
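The quorum arithmetic can be sketched in a few lines. This is an illustrative sketch only; the function names `quorum_size`, `has_quorum`, and `majorities_always_overlap` are hypothetical and not taken from the simulator.

```python
from itertools import combinations


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if a group of `reachable` nodes (including itself) is a majority."""
    return reachable >= quorum_size(cluster_size)


def majorities_always_overlap(nodes) -> bool:
    """Brute-force check that any two quorum-sized groups share a node."""
    q = quorum_size(len(nodes))
    groups = [set(c) for c in combinations(nodes, q)]
    return all(a & b for a in groups for b in groups)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nodes need", quorum_size(n), "votes")
    print("overlap holds for 5 nodes:", majorities_always_overlap("ABCDE"))
```

The overlap property is what rules out split brain: two leaders in the same term would each need a majority, and those majorities would have to share at least one node that voted twice.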
Understanding Raft Consensus
Raft decomposes consensus into three sub-problems: Leader Election, Log Replication, and Safety.
1. Leader Election
Raft uses a strong leader model. One node is the Leader, and it handles all client interactions. The others are Followers.
- Heartbeats: The Leader sends periodic heartbeats to Followers to assert authority. This suppresses elections.
- Election Timeout: If a Follower doesn't hear from the Leader within a randomized timeout (e.g., 150-300ms), it assumes the Leader is dead.
- Candidate State: The Follower increments its Term counter, votes for itself, and requests votes from others.
- Majority Rules: If the Candidate receives votes from a majority of the cluster (floor(N/2) + 1, e.g., 2 of 3 or 3 of 5), it becomes the new Leader.
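The election steps above can be sketched as follows. This is a simplified model, not the simulator's code: the `Node` class and `run_election` function are hypothetical, messages are delivered as direct method calls, and the log up-to-date check that full Raft applies before granting a vote is omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    term: int = 0
    voted_for: Optional[str] = None
    role: str = "follower"

    def new_election_timeout(self, low_ms: float = 150, high_ms: float = 300) -> float:
        """Pick a randomized timeout so nodes rarely become candidates together."""
        return random.uniform(low_ms, high_ms)

    def handle_vote_request(self, candidate: str, candidate_term: int) -> bool:
        """Grant at most one vote per term; reject candidates from older terms."""
        if candidate_term < self.term:
            return False
        if candidate_term > self.term:
            # A newer term resets our vote and demotes us to follower.
            self.term = candidate_term
            self.voted_for = None
            self.role = "follower"
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False


def run_election(candidate: Node, reachable_peers: List[Node], cluster_size: int) -> bool:
    """Candidate bumps its term, votes for itself, then asks reachable peers."""
    candidate.term += 1
    candidate.role = "candidate"
    candidate.voted_for = candidate.name
    votes = 1  # its own vote
    for peer in reachable_peers:
        if peer.handle_vote_request(candidate.name, candidate.term):
            votes += 1
    if votes >= cluster_size // 2 + 1:
        candidate.role = "leader"
    return candidate.role == "leader"


if __name__ == "__main__":
    b, c = Node("B"), Node("C")
    print("B wins with C's vote:", run_election(b, [c], cluster_size=3))
```

A candidate that reaches no peers stays a candidate with only its own vote, which is the situation an isolated node ends up in.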
2. Terms & Logical Clocks
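Time in Raft isn't measured in seconds, but in Terms. A Term is a monotonically increasing integer that acts as a logical clock: every message carries the sender's current Term. If a Leader (or any other node) sees a higher Term in a message from a peer, it realizes it is outdated, adopts the higher Term, and immediately steps down to Follower. Messages carrying a lower Term are rejected as stale. This effectively handles "Zombie Leaders" that come back online after a partition.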
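The term comparison can be sketched as a single function. `NodeState` and `observe_term` are hypothetical names used for illustration; a real implementation would apply this check at the top of every RPC handler.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    role: str  # "leader", "candidate", or "follower"
    term: int


def observe_term(state: NodeState, message_term: int) -> bool:
    """Apply the term rule to an incoming message.

    Returns True if the message should be processed, False if it is stale.
    """
    if message_term > state.term:
        # We are behind: adopt the newer term and step down.
        state.term = message_term
        state.role = "follower"
        return True
    # Equal terms are current; lower terms come from an outdated sender.
    return message_term == state.term


if __name__ == "__main__":
    zombie = NodeState(role="leader", term=3)
    observe_term(zombie, message_term=5)
    print(zombie)
```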
3. The Dangers of Clock Skew
Distributed systems rely on timeouts, and each node measures those timeouts with its own local clock. If a node's clock runs too fast (simulated here with the Skew slider), its Election Timeout fires early in real time, causing disruptive leadership changes. If it runs too slow, the node takes longer to notice a dead Leader. This is why keeping clocks disciplined (for example with NTP) matters in real production clusters.
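The effect of skew on a timer can be sketched with simple arithmetic. This assumes the Skew value acts as a multiplier on how fast a node's local clock advances; the function names and the 100 ms heartbeat interval in the example are hypothetical and are not taken from the simulator.

```python
import math


def real_ms_until_timeout(timeout_ms: float, skew: float) -> float:
    """Real elapsed time before a local timer of `timeout_ms` fires
    on a node whose clock advances `skew` times faster than real time."""
    if skew < 0:
        raise ValueError("skew must be non-negative")
    if skew == 0:
        return math.inf  # a stopped clock never reaches its timeout
    return timeout_ms / skew


def fires_before_heartbeat(timeout_ms: float, skew: float, heartbeat_interval_ms: float) -> bool:
    """True if the election timer expires before the next heartbeat can arrive."""
    return real_ms_until_timeout(timeout_ms, skew) < heartbeat_interval_ms


if __name__ == "__main__":
    for skew in (0.0, 0.5, 1.0, 2.0):
        print(f"skew {skew}x: 150 ms timer fires after "
              f"{real_ms_until_timeout(150, skew)} real ms")
```

Under these assumptions a 150 ms timer on a 2.0x node fires after 75 ms of real time, so any heartbeat interval longer than that leaves room for a spurious election.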
Chaos Scenarios to Try
The Isolated Leader
Isolate the current Leader (e.g., Node A) by cutting its links to B and C.
Outcome: Node A thinks it is still Leader (for a while), but it cannot replicate
data to a majority, so nothing new is committed. Meanwhile, B and C will time out, conclude
that A is gone, and since they form a majority (2 out of 3), they will elect a NEW Leader (e.g., B)
in a higher Term.
Runaway Candidate
Set Node C's clock skew to 2.0x (Fast).
Outcome: Node C's election timer expires before the Leader's next
heartbeat arrives. C will repeatedly interrupt the cluster: it declares itself Candidate,
increments the Term, and forces a new election each time. This is a classic distributed
system performance bug, usually caused by election timeouts that are too short relative to
the heartbeat interval.
No Quorum (System Halt)
Cut the links A-B, B-C, and C-A so that every node is completely isolated.
Outcome: Each node can only vote for itself, so no node can reach a majority
(2 votes). The system enters a stalemate: Terms keep climbing as timers expire, but no Leader
can be elected and writes would be rejected. The system chooses
Consistency (Safety) over Availability (CAP Theorem).
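The partition scenarios above can be checked with a small reachability sketch. It groups nodes by the links that are still intact and reports which groups are large enough to elect a Leader. The function names are hypothetical, and the sketch assumes that every node in a connected group can exchange votes with the rest of that group, which holds for the three-node cluster used here.

```python
def components(nodes, links):
    """Group nodes into connected components given the surviving (uncut) links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


def electable_groups(nodes, links):
    """Return the groups that hold a majority of the whole cluster."""
    quorum = len(nodes) // 2 + 1
    return [g for g in components(nodes, links) if len(g) >= quorum]


if __name__ == "__main__":
    cluster = ["A", "B", "C"]
    # The Isolated Leader: only the B-C link survives.
    print("isolated leader:", electable_groups(cluster, [("B", "C")]))
    # No Quorum: every link is cut.
    print("no quorum:", electable_groups(cluster, []))
```

At most one group can ever appear in the result, because two disjoint groups cannot both hold a majority.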
References & Further Reading
- The Secret Lives of Data (Visualization) - An amazing step-by-step visualization of Raft.
- etcd Documentation - The storage backend for Kubernetes, built on Raft.
- In Search of an Understandable Consensus Algorithm (PDF) - The original Raft paper by Diego Ongaro and John Ousterhout.
- Principles of Chaos Engineering - Why we need to break things in production.