Pod Eviction Simulator

Simulate Kubernetes pod eviction scenarios in the browser: resource pressure, PodDisruptionBudgets, taints, and drains.

Nodes

  • node-1: CPU 1000m / 4000m, Memory 1024Mi / 8192Mi
  • node-2: CPU 1000m / 4000m, Memory 1024Mi / 8192Mi
  • node-3: CPU 2000m / 8000m, Memory 4096Mi / 16384Mi (taint: dedicated=gpu:NoSchedule)

How to Use

Simulate Kubernetes node and pod behavior.

  • Cordon/Drain: Manage node scheduling and eviction
  • Taints: Add taints to nodes to repel pods
  • Pressure: Simulate CPU/Memory pressure
  • Results: Watch pods get evicted or protected

Pods & Events

Pods

  • api-server-1: Running on node-1, priority 100, CPU 500m, Mem 512Mi (PDB protected)
  • api-server-2: Running on node-1, priority 100, CPU 500m, Mem 512Mi (PDB protected)
  • worker-1: Running on node-2, priority 50, CPU 1000m, Mem 1024Mi
  • gpu-job-1: Running on node-3, priority 200, CPU 2000m, Mem 4096Mi

Event Log

No events yet. Try draining a node or applying taints.

The Definitive Guide to Kubernetes Pod Eviction

In a perfectly dimensioned Kubernetes cluster, every Pod has exactly the resources it needs. In reality, clusters are chaotic environments subject to traffic spikes, memory leaks, and noisy neighbors. Pod eviction is the structured, merciless process by which the `kubelet` (the node agent) proactively terminates running Pods to reclaim compute resources and prevent the entire Node from failing.

Eviction is the ultimate safety valve. Understanding exactly why and which Pods get evicted is the difference between a minor service degradation and a cascading catastrophic cluster failure.


1. The Kubelet and Eviction Signals

Every 10 seconds (by default), the `kubelet` evaluates its host Node's resource usage against a set of configured thresholds known as Eviction Signals. If a signal crosses a threshold, the Node enters a pressure condition, repels new Pods by tainting itself, and begins evicting existing ones.

memory.available

The most common trigger. If the Node's available RAM drops below this threshold (default: `100Mi`), the `kubelet` triggers `MemoryPressure`. It ranks the Pods on the node by memory usage and prepares to evict those consuming more than they formally requested.

nodefs.available

Tracks the filesystem holding the `kubelet`'s root volume (logs, ephemeral storage). If this drops below `10%`, the node enters `DiskPressure`. Pods writing massive temporary files or exploding logs without log rotation will be targeted.

imagefs.available

Tracks the filesystem storing container images. It works differently than nodefs: before evicting Pods, the `kubelet` will first attempt to garbage collect unused container images on the disk. Only if GC fails to free enough space does it resort to eviction.
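The image garbage-collection behavior is tunable in the kubelet's configuration file. A minimal sketch (the percentages shown happen to match the documented defaults, but treat them as illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# GC kicks in once imagefs usage exceeds the high threshold,
# and deletes unused images until usage falls below the low one.
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
```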

pid.available

Process ID exhaustion. A fork-bomb or improperly configured thread pool can rapidly consume all available PIDs on the Linux kernel without consuming much RAM. If this hits the threshold, `PIDPressure` begins.

Hard vs Soft Eviction

A Soft threshold (e.g., `memory.available<200Mi` with a 90-second grace period) requires the signal to persist for that grace period before the `kubelet` acts. Evicted Pods are then shut down gracefully (saving database state, draining connections), with their termination grace period capped by `--eviction-max-pod-grace-period`.

A Hard threshold (e.g., `memory.available<100Mi`) has zero grace period. The `kubelet` instantly terminates targeted Pods using `SIGKILL` without any warning, treating the Node as being on the brink of total failure.
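Both kinds of threshold live in the KubeletConfiguration. A sketch using the values discussed above (the `nodefs` and `pid` entries are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                # zero grace period: immediate kill
  memory.available: "100Mi"
  nodefs.available: "10%"
  pid.available: "10%"
evictionSoft:                # must persist for the grace period below
  memory.available: "200Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60  # cap on graceful shutdown, in seconds
```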


2. The Kill Order: Quality of Service (QoS) Classes

When the `kubelet` enters a Pressure state, it doesn't kill Pods randomly. It strictly follows the Quality of Service (QoS) hierarchy assigned to each Pod. You do not define QoS classes directly; Kubernetes calculates them automatically based purely on how you define `requests` and `limits` in your Pod manifests.


Tier 1: BestEffort (First to Die)

requests = nil, limits = nil

If a Pod provides no resource hints whatsoever, it falls into the BestEffort class. These Pods are the absolute bottom of the food chain: the moment the Node experiences resource pressure, the `kubelet` evicts BestEffort Pods before touching anything else.
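A minimal manifest that lands in this class (names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
    # no resources block at all -> QoS class: BestEffort
```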


Tier 2: Burstable

requests < limits (or limits = nil)

The most common class. The Pod guarantees a baseline (requests) but is allowed to spike (burst) up to its limits. If the `kubelet` has already killed all BestEffort pods and is still under pressure, it attacks Burstable Pods.

Crucial Detail: It doesn't kill Burstable Pods randomly. The `kubelet` ranks them first by whether their memory usage exceeds their request, then by Pod Priority, and finally by how far usage exceeds the request. In practice, among Pods running above their request, the one with the lower `PriorityClass` dies first; ties are broken by the amount over the request.
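A sketch of a Burstable Pod, with requests below limits (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo    # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:             # limits > requests -> QoS class: Burstable
        cpu: 500m
        memory: 512Mi
```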


Tier 3: Guaranteed (Protected Status)

requests = limits (must specify both CPU and RAM)

These Pods are the VIPs of your cluster (usually critical databases or high-priority API endpoints). They are essentially immune to eviction due to memory pressure caused by other Pods.

The `kubelet` will only ever evict a Guaranteed Pod for memory reasons if system daemons (like the container runtime or `systemd` itself) are consuming so much memory that the Node is utterly exhausted, and there are literally zero BestEffort or over-request Burstable Pods left to kill on the physical hardware. Note that this protection applies to memory pressure; eviction for `DiskPressure` ranks Pods by disk consumption, not QoS class.
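A sketch of a Guaranteed Pod, with requests equal to limits for every resource (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # hypothetical name
spec:
  containers:
  - name: db
    image: postgres:16    # illustrative image
    resources:
      requests:           # requests == limits for both CPU and
        cpu: "1"          # memory -> QoS class: Guaranteed
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```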


3. The Race: Kubelet Eviction vs Linux OOMKiller

There are two separate executioners constantly running on a Kubernetes Node: the `kubelet`'s eviction manager, and the underlying Linux operating system's native OOM (Out Of Memory) Killer. They serve different purposes, but frequently race against each other.

Linux OOMKiller (cgroups limit)

Trigger: A single container tries to allocate memory beyond its mathematically defined `limit` in its `cgroup`.

Action: The Linux kernel kills a process inside the container's cgroup (usually the main process); the container is reported as `OOMKilled` with exit code 137. The `kubelet` had nothing to do with it. The Pod remains on the node, and the `kubelet` will attempt to restart the container in place according to the `restartPolicy`.
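The cgroup limit in question comes straight from the container's `limits` block. A sketch of a Pod whose container would be OOMKilled the moment it allocates past its limit (names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo            # hypothetical name
spec:
  restartPolicy: Always     # kubelet restarts the container in place
  containers:
  - name: hog
    image: busybox          # illustrative image
    command: ["sh", "-c", "sleep 3600"]
    resources:
      limits:
        memory: 128Mi       # cgroup memory limit; allocating beyond
                            # it triggers the kernel OOM killer (137)
```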

Kubelet Eviction (Node Limits)

Trigger: The total available RAM on the entire physical Node drops below the `memory.available` threshold.

Action: The `kubelet` identifies the worst-offending Pod (usually BestEffort or a Burstable Pod far above its request). It terminates the entire Pod and marks it `Evicted`. It does not restart the Pod; the owning controller (e.g., a Deployment's ReplicaSet) creates a brand new replacement, which the scheduler typically places on a different, healthier Node.

Architectural Trap: The kernel OOM killer has no configurable threshold; it fires the instant an allocation cannot be satisfied. If memory is consumed faster than the `kubelet`'s sync loop can react, the kernel may act before the `kubelet` even notices. And the `kubelet` is itself a normal process: if the Node runs completely out of memory, the `kubelet` can be stalled or OOMKilled by Linux, leaving the Node stuck in a `NotReady` state.


4. Taints, Tolerations & Voluntary Eviction

Beyond immediate emergency resource pressure, administrators intentionally evict Pods to perform routine maintenance (like upgrading the OS kernel or rotating EC2 instances) or to protect specific high-value hardware.

`kubectl drain node-1`

The standard operator CLI command. It first applies a Cordon (marking the Node unschedulable, surfaced as the `node.kubernetes.io/unschedulable:NoSchedule` taint), refusing all incoming future workloads. It then gracefully requests the eviction of every existing running Pod, respecting their `terminationGracePeriodSeconds` and verifying `PodDisruptionBudgets` are maintained. If you try to drain a node running a critical application that has a strict `minAvailable: 100%` budget, the drain command blocks and retries until you override it, preventing self-inflicted downtime.
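A sketch of the kind of PodDisruptionBudget a drain must honor (the name and label selector are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb    # hypothetical name
spec:
  minAvailable: 1         # at least one replica must survive any drain
  selector:
    matchLabels:
      app: api-server     # assumed label on the protected Pods
```

With `minAvailable: 1` and two replicas, a drain evicts one Pod, waits for its replacement to become Ready elsewhere, and only then evicts the second.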

The `NoExecute` Taint

Taints are rules applied to Nodes. Most taints use the `NoSchedule` effect, which just means "new Pods can't land here." But the `NoExecute` effect is aggressive: the moment it is applied to a Node, the `kubelet` immediately evicts any already-running Pods that do not carry a matching `Toleration` in their manifest. This is frequently used by cloud providers to instantly evacuate nodes when underlying hardware failures are detected.
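A sketch of a toleration against the well-known `node.kubernetes.io/not-ready` `NoExecute` taint; `tolerationSeconds` bounds how long the Pod may linger on the tainted Node (Pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-demo     # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # survive the taint for up to 5 minutes
```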
