Pod Eviction Simulator

Simulate Kubernetes pod eviction scenarios in the browser: resource pressure, PodDisruptionBudgets, taints, and drains.

Nodes

  • node-1: CPU 1000m / 4000m, Memory 1024Mi / 8192Mi
  • node-2: CPU 1000m / 4000m, Memory 1024Mi / 8192Mi
  • node-3: CPU 2000m / 8000m, Memory 4096Mi / 16384Mi (taint: dedicated=gpu:NoSchedule)

How to Use

Simulate Kubernetes node and pod behavior.

  • Cordon/Drain: Manage node scheduling and eviction
  • Taints: Add taints to nodes to repel pods
  • Pressure: Simulate CPU/Memory pressure
  • Results: Watch pods get evicted or protected

Pods & Events

Pods

  • api-server-1: Running on node-1, priority 100, CPU 500m, Mem 512Mi (PDB protected)
  • api-server-2: Running on node-1, priority 100, CPU 500m, Mem 512Mi (PDB protected)
  • worker-1: Running on node-2, priority 50, CPU 1000m, Mem 1024Mi
  • gpu-job-1: Running on node-3, priority 200, CPU 2000m, Mem 4096Mi

Event Log

No events yet. Try draining a node or applying taints.

The Definitive Guide to Kubernetes Pod Eviction

In a perfectly dimensioned Kubernetes cluster, every Pod has exactly the resources it needs. In reality, clusters are chaotic environments subject to traffic spikes, memory leaks, and noisy neighbors. Pod eviction is the structured, merciless process by which the `kubelet` (the node agent) proactively terminates running Pods to reclaim compute resources and prevent the entire Node from failing.

Eviction is the ultimate safety valve. Understanding exactly why and which Pods get evicted is the difference between a minor service degradation and a cascading catastrophic cluster failure.


1. The Kubelet and Eviction Signals

Every 10 seconds (by default), the `kubelet` evaluates its host Node's resource usage against a set of configured thresholds known as Eviction Signals. If a signal crosses a threshold, the Node enters a pressure condition, repels new Pods by tainting itself, and begins evicting existing ones.

memory.available

The most common trigger. If the Node's available RAM drops below this threshold (default: `100Mi`), the `kubelet` triggers `MemoryPressure`. It ranks the Pods on the node by memory usage and prepares to evict those consuming more than they formally requested.

nodefs.available

Tracks the filesystem holding the `kubelet`'s root volume (logs, ephemeral storage). If this drops below `10%`, the node enters `DiskPressure`. Pods writing massive temporary files or exploding logs without log rotation will be targeted.

imagefs.available

Tracks the filesystem storing container images. It works differently than nodefs: before evicting Pods, the `kubelet` will first attempt to garbage collect unused container images on the disk. Only if GC fails to free enough space does it resort to eviction.
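The image garbage-collection behavior is tunable in the kubelet's configuration file. A minimal sketch (the percentages shown happen to match the documented defaults, but treat them as illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# GC kicks in once imagefs usage exceeds the high threshold,
# and deletes unused images until usage falls below the low one.
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
```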

pid.available

Process ID exhaustion. A fork-bomb or improperly configured thread pool can rapidly consume all available PIDs on the Linux kernel without consuming much RAM. If this hits the threshold, `PIDPressure` begins.

Hard vs Soft Eviction

A Soft threshold (e.g., `memory.available<200Mi` with a 90-second grace period) requires the signal to persist for that grace period before the `kubelet` acts. Evicted Pods are then shut down gracefully (saving database state, draining connections), with their termination grace period capped by `--eviction-max-pod-grace-period`.

A Hard threshold (e.g., `memory.available<100Mi`) has zero grace period. The `kubelet` instantly terminates targeted Pods using `SIGKILL` without any warning, treating the Node as being on the brink of total failure.
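Both kinds of threshold live in the KubeletConfiguration. A sketch using the values discussed above (the `nodefs` and `pid` entries are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                # zero grace period: immediate kill
  memory.available: "100Mi"
  nodefs.available: "10%"
  pid.available: "10%"
evictionSoft:                # must persist for the grace period below
  memory.available: "200Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60  # cap on graceful shutdown, in seconds
```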


2. The Kill Order: Quality of Service (QoS) Classes

When the `kubelet` enters a Pressure state, it doesn't kill Pods randomly. It strictly follows the Quality of Service (QoS) hierarchy assigned to each Pod. You do not define QoS classes directly; Kubernetes calculates them automatically based purely on how you define `requests` and `limits` in your Pod manifests.


Tier 1: BestEffort (First to Die)

requests = nil, limits = nil

If a Pod provides no resource hints whatsoever, it falls into the BestEffort class. These Pods are the absolute bottom of the food chain: the moment the Node experiences resource pressure, the `kubelet` evicts BestEffort Pods before touching anything else.
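A minimal manifest that lands in this class (names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
    # no resources block at all -> QoS class: BestEffort
```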


Tier 2: Burstable

requests < limits (or limits = nil)

The most common class. The Pod guarantees a baseline (requests) but is allowed to spike (burst) up to its limits. If the `kubelet` has already killed all BestEffort pods and is still under pressure, it attacks Burstable Pods.

Crucial Detail: It doesn't kill Burstable Pods randomly. The `kubelet` ranks them first by whether their memory usage exceeds their request, then by Pod Priority, and finally by how far usage exceeds the request. In practice, among Pods running above their request, the one with the lower `PriorityClass` dies first; ties are broken by the amount over the request.
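A sketch of a Burstable Pod, with requests below limits (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo    # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:             # limits > requests -> QoS class: Burstable
        cpu: 500m
        memory: 512Mi
```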


Tier 3: Guaranteed (Protected Status)

requests = limits (must specify both CPU and RAM)

These Pods are the VIPs of your cluster (usually critical databases or high-priority API endpoints). They are essentially immune to eviction due to memory pressure caused by other Pods.

The `kubelet` will only ever evict a Guaranteed Pod for memory reasons if system daemons (like the container runtime or `systemd` itself) are consuming so much memory that the Node is utterly exhausted, and there are literally zero BestEffort or over-request Burstable Pods left to kill on the physical hardware. Note that this protection applies to memory pressure; eviction for `DiskPressure` ranks Pods by disk consumption, not QoS class.
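A sketch of a Guaranteed Pod, with requests equal to limits for every resource (names and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # hypothetical name
spec:
  containers:
  - name: db
    image: postgres:16    # illustrative image
    resources:
      requests:           # requests == limits for both CPU and
        cpu: "1"          # memory -> QoS class: Guaranteed
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```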


3. The Race: Kubelet Eviction vs Linux OOMKiller

There are two separate executioners constantly running on a Kubernetes Node: the `kubelet`'s eviction manager, and the underlying Linux operating system's native OOM (Out Of Memory) Killer. They serve different purposes, but frequently race against each other.

Linux OOMKiller (cgroups limit)

Trigger: A single container tries to allocate memory beyond its mathematically defined `limit` in its `cgroup`.

Action: The Linux kernel kills a process inside the container's cgroup (usually the main process); the container is reported as `OOMKilled` with exit code 137. The `kubelet` had nothing to do with it. The Pod remains on the node, and the `kubelet` will attempt to restart the container in place according to the `restartPolicy`.
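The cgroup limit in question comes straight from the container's `limits` block. A sketch of a Pod whose container would be OOMKilled the moment it allocates past its limit (names and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo            # hypothetical name
spec:
  restartPolicy: Always     # kubelet restarts the container in place
  containers:
  - name: hog
    image: busybox          # illustrative image
    command: ["sh", "-c", "sleep 3600"]
    resources:
      limits:
        memory: 128Mi       # cgroup memory limit; allocating beyond
                            # it triggers the kernel OOM killer (137)
```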

Kubelet Eviction (Node Limits)

Trigger: The total available RAM on the entire physical Node drops below the `memory.available` threshold.

Action: The `kubelet` identifies the worst-offending Pod (usually BestEffort or a Burstable Pod far above its request). It terminates the entire Pod and marks it `Evicted`. It does not restart the Pod; the owning controller (e.g., a Deployment's ReplicaSet) creates a brand new replacement, which the scheduler typically places on a different, healthier Node.

Architectural Trap: The kernel OOM killer has no configurable threshold; it fires the instant an allocation cannot be satisfied. If memory is consumed faster than the `kubelet`'s sync loop can react, the kernel may act before the `kubelet` even notices. And the `kubelet` is itself a normal process: if the Node runs completely out of memory, the `kubelet` can be stalled or OOMKilled by Linux, leaving the Node stuck in a `NotReady` state.


4. Taints, Tolerations & Voluntary Eviction

Beyond immediate emergency resource pressure, administrators intentionally evict Pods to perform routine maintenance (like upgrading the OS kernel or rotating EC2 instances) or to protect specific high-value hardware.

`kubectl drain node-1`

The standard operator CLI command. It first applies a Cordon (marking the Node unschedulable, surfaced as the `node.kubernetes.io/unschedulable:NoSchedule` taint), refusing all incoming future workloads. It then gracefully requests the eviction of every existing running Pod, respecting their `terminationGracePeriodSeconds` and verifying `PodDisruptionBudgets` are maintained. If you try to drain a node running a critical application that has a strict `minAvailable: 100%` budget, the drain command blocks and retries until you override it, preventing self-inflicted downtime.
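A sketch of the kind of PodDisruptionBudget a drain must honor (the name and label selector are assumptions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb    # hypothetical name
spec:
  minAvailable: 1         # at least one replica must survive any drain
  selector:
    matchLabels:
      app: api-server     # assumed label on the protected Pods
```

With `minAvailable: 1` and two replicas, a drain evicts one Pod, waits for its replacement to become Ready elsewhere, and only then evicts the second.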

The `NoExecute` Taint

Taints are rules applied to Nodes. Most taints use the `NoSchedule` effect, which just means "new Pods can't land here." But the `NoExecute` effect is aggressive: the moment it is applied to a Node, the `kubelet` immediately evicts any already-running Pods that do not carry a matching `Toleration` in their manifest. This is frequently used by cloud providers to instantly evacuate nodes when underlying hardware failures are detected.
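A sketch of a toleration against the well-known `node.kubernetes.io/not-ready` `NoExecute` taint; `tolerationSeconds` bounds how long the Pod may linger on the tainted Node (Pod name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-demo     # hypothetical name
spec:
  containers:
  - name: app
    image: nginx          # illustrative image
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300  # survive the taint for up to 5 minutes
```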
