Pod Eviction Simulator
Runs in the browser. Simulate K8s pod eviction scenarios.
How to Use
Simulate Kubernetes node and pod behavior.
- Cordon/Drain: Manage node scheduling and eviction
- Taints: Add taints to nodes to repel pods
- Pressure: Simulate CPU/Memory pressure
- Results: Watch pods get evicted or protected
The Definitive Guide to Kubernetes Pod Eviction
In a perfectly dimensioned Kubernetes cluster, every Pod has exactly the resources it needs. In reality, clusters are chaotic environments subject to traffic spikes, memory leaks, and noisy neighbors. Pod Eviction is the highly structured, merciless process by which the `kubelet` (the node agent) actively deletes running Pods to forcefully reclaim compute resources and prevent the entire physical Node from crashing.
Eviction is the ultimate safety valve. Understanding exactly why and which Pods get evicted is the difference between a minor service degradation and a cascading catastrophic cluster failure.
1. The Kubelet and Eviction Signals
Every 10 seconds (by default), the `kubelet` evaluates its host Node's physical resources against a set of strictly defined thresholds known as Eviction Signals. If a signal crosses a threshold, the Node enters a "Pressure" condition, repels new Pods by tainting itself, and begins evicting workloads.
memory.available
The most common trigger. If the Node's available RAM drops below this threshold (default: `100Mi`), the `kubelet` triggers `MemoryPressure`. It immediately calculates the RAM usage of every Pod on the node and prepares to kill those using more than they formally requested.
nodefs.available
Tracks the filesystem holding the `kubelet`'s root volume (logs, ephemeral storage). If this drops below `10%`, the node enters `DiskPressure`. Pods writing massive temporary files or exploding logs without log rotation will be targeted.
imagefs.available
Tracks the filesystem storing container images. It works differently than nodefs: before evicting Pods, the `kubelet` will first attempt to garbage collect unused container images on the disk. Only if GC fails to free enough space does it resort to eviction.
pid.available
Process ID exhaustion. A fork-bomb or improperly configured thread pool can rapidly consume all available PIDs on the Linux kernel without consuming much RAM. If this hits the threshold, `PIDPressure` begins.
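The four signals above map directly onto `evictionHard` keys in the kubelet's configuration file. A minimal sketch (the threshold values are illustrative, not recommendations):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing any of these triggers eviction.
evictionHard:
  memory.available: "100Mi"   # MemoryPressure trigger (the default)
  nodefs.available: "10%"     # kubelet root filesystem (logs, ephemeral storage)
  imagefs.available: "15%"    # container image filesystem
  pid.available: "10%"        # process ID exhaustion
```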
Hard vs Soft Eviction
A Soft threshold (e.g., `memory.available<200Mi` with a 1.5-minute grace period) allows the `kubelet` to wait, giving Pods their standard `terminationGracePeriodSeconds` to shut down gracefully (saving database state, draining connections).

A Hard threshold (e.g., `memory.available<100Mi`) has zero grace period. The `kubelet` instantly terminates targeted Pods using `SIGKILL` without any warning, treating the Node as being on the brink of total failure.
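In the kubelet configuration file, soft thresholds pair with explicit grace periods while hard thresholds stand alone. A sketch matching the example values above:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "200Mi"     # soft: the kubelet waits before acting
evictionSoftGracePeriod:
  memory.available: "1m30s"     # how long the signal must persist
evictionMaxPodGracePeriod: 60   # cap on pod termination grace during soft eviction
evictionHard:
  memory.available: "100Mi"     # hard: immediate SIGKILL, no grace
```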
2. The Kill Order: Quality of Service (QoS) Classes
When the `kubelet` enters a Pressure state, it doesn't kill Pods randomly. It strictly follows the Quality of Service (QoS) hierarchy assigned to each Pod. You do not define QoS classes directly; Kubernetes calculates them automatically based purely on how you define `requests` and `limits` in your Pod manifests.
Tier 1: BestEffort (First to Die)
requests = nil, limits = nil
If a Pod provides no resource hints whatsoever, it falls into the BestEffort class. These Pods are the absolute bottom of the food chain. The moment the Node experiences compute pressure, the `kubelet` annihilates all BestEffort Pods before touching anything else.
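A Pod lands in this class simply by omitting the `resources` block entirely (the name here is invented for illustration):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-demo   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
      # No resources block at all -> Kubernetes assigns qosClass: BestEffort
```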
Tier 2: Burstable
requests < limits (or limits = nil)
The most common class. The Pod guarantees a baseline (requests) but is allowed to spike (burst) up to its limits. If the `kubelet` has already killed all BestEffort pods and is still under pressure, it attacks Burstable Pods.
Crucial Detail: It doesn't kill Burstable Pods randomly. It compares the memory usage of each Burstable Pod against its request. The Pod exceeding its request by the highest percentage is targeted first. If two Pods are tied, the one with the lower `PriorityClass` dies first.
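The ranking described above can be sketched in a few lines of Python. This is a simplification of the kubelet's actual ranking logic, and the pod names and numbers are invented:

```python
def eviction_order(pods):
    """Rank Burstable pods: highest memory usage relative to request dies first.

    `pods` is a list of (name, usage_mib, request_mib, priority) tuples.
    Ties on overage percentage are broken by lower priority first.
    """
    def overage_pct(pod):
        name, usage, request, priority = pod
        return (usage - request) / request * 100

    # Sort: biggest percentage over request first; lower priority first on ties.
    return sorted(pods, key=lambda p: (-overage_pct(p), p[3]))

pods = [
    ("api",    300, 200, 1000),  # 50% over its request
    ("worker", 450, 250, 500),   # 80% over its request -> evicted first
    ("cache",  210, 200, 100),   # only 5% over
]
print([p[0] for p in eviction_order(pods)])  # -> ['worker', 'api', 'cache']
```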
Tier 3: Guaranteed (Protected Status)
requests = limits (must specify both CPU and RAM)
These Pods are the VIPs of your cluster (usually critical databases or high-priority API endpoints). They are essentially immune to eviction due to memory pressure caused by other Pods.
The `kubelet` will only evict a Guaranteed Pod if system daemons (like `docker` or `systemd` itself) are consuming so much memory that the Node is utterly exhausted, and there are no BestEffort or over-request Burstable Pods left to kill on the physical hardware.
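A Guaranteed Pod must set `requests` equal to `limits` for both CPU and memory on every container (the name and values here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # hypothetical name
spec:
  containers:
    - name: db
      image: postgres
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "500m"     # requests == limits on every resource -> Guaranteed
          memory: "1Gi"
```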
3. The Race: Kubelet Eviction vs Linux OOMKiller
There are two separate executioners constantly running on a Kubernetes Node: the `kubelet`'s eviction manager, and the underlying Linux operating system's native OOM (Out Of Memory) Killer. They serve different purposes, but frequently race against each other.
Linux OOMKiller (cgroups limit)
Trigger: A single container tries to allocate memory beyond the `limit` defined in its `cgroup`.
Action: The Linux kernel instantly kills the container's process; the container exits with code 137 and is reported as `OOMKilled`. The `kubelet` had nothing to do with it. The Pod remains on the node, and the `kubelet` will attempt to restart the container in place according to its `restartPolicy`.
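The cgroup path can be demonstrated with a Pod whose container deliberately allocates more than its limit. A sketch using the `polinux/stress` image (any memory hog would do; the Pod name is invented):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oomkill-demo       # hypothetical name
spec:
  restartPolicy: OnFailure # kubelet restarts the container in place after OOMKill
  containers:
    - name: hog
      image: polinux/stress
      command: ["stress", "--vm", "1", "--vm-bytes", "250M"]
      resources:
        limits:
          memory: "100Mi"  # allocating beyond this triggers the kernel OOM killer
```

Describing this Pod afterwards should show the container's last state as `Terminated`, reason `OOMKilled`, exit code 137, while the Pod itself stays on the same node.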
Kubelet Eviction (Node Limits)
Trigger: The total available RAM on the entire physical Node drops below the `memory.available` threshold.
Action: The `kubelet` identifies the worst-offending Pod (usually BestEffort or highly over-requested Burstable). It terminates the entire Pod and effectively marks it `Evicted`. It does not restart the Pod. The Deployment Controller must schedule a brand new replacement Pod on a different, healthier Node.
Architectural Trap: The `kubelet` only evaluates its signals periodically (every 10 seconds by default). If memory is exhausted faster than that, or the eviction threshold is set too low, the kernel OOM killer fires before the `kubelet` ever reacts. And the `kubelet` is itself a normal process: if the Node runs completely out of memory, the `kubelet` might be paused or OOMKilled by Linux, leaving the Node stuck in a `NotReady` state.
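The standard defense against this trap is to reserve memory for the OS and the kubelet itself in the kubelet configuration, so the eviction threshold fires before the kernel is starved (values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  memory: "500Mi"   # reserved for OS daemons (systemd, sshd, ...)
kubeReserved:
  memory: "500Mi"   # reserved for the kubelet and container runtime
evictionHard:
  memory.available: "200Mi"
# Node allocatable = capacity - systemReserved - kubeReserved - eviction threshold
```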
4. Taints, Tolerations & Voluntary Eviction
Beyond immediate emergency resource pressure, administrators intentionally evict Pods to perform routine maintenance (like upgrading the OS kernel or rotating EC2 instances) or to protect specific high-value hardware.
`kubectl drain node-1`
The standard operator CLI command. It first applies a Cordon (the `node.kubernetes.io/unschedulable:NoSchedule` taint), refusing all future workloads. It then gracefully requests the eviction of every existing running Pod, respecting their `terminationGracePeriodSeconds` and verifying that `PodDisruptionBudgets` are maintained. If you try to drain a node running a critical application with a strict `minAvailable: 100%` budget, the drain command blocks and hangs until you override it, preventing self-inflicted downtime.
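A PodDisruptionBudget that blocks a drain looks like this (the name and selector are invented for illustration):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-api-pdb   # hypothetical name
spec:
  minAvailable: "100%"     # no voluntary disruption may reduce availability
  selector:
    matchLabels:
      app: critical-api
# `kubectl drain node-1` will hang until this budget is relaxed or overridden.
```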
The `NoExecute` Taint
Taints are rules applied to Nodes. Most taints use the `NoSchedule` effect, which just means "new Pods can't land here." But the `NoExecute` effect is aggressive. The moment it is applied to a Node, the `kubelet` violently evicts any already-existing Pods that do not possess a matching `Toleration` in their YAML manifest. This is frequently used by cloud providers to instantly evacuate nodes when underlying hardware failures are detected.
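A Pod survives a `NoExecute` taint only with a matching toleration; `tolerationSeconds` optionally bounds how long it may stay. A sketch (the taint key and Pod name are invented):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-demo          # hypothetical name
spec:
  tolerations:
    - key: "hardware-failure"  # must match the taint's key
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300   # evicted anyway after 5 minutes
  containers:
    - name: app
      image: nginx
```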
Further Reading
- Kubernetes Official Docs: Node-pressure Eviction - The foundational documentation detailing every eviction signal calculation formula used by the `kubelet`.
- Sysdig: Understanding OOMKilled vs Evicted - A brilliant operational breakdown of how to use Prometheus and metrics to track down the difference between Kernel interventions and Kubelet interventions.