The Engineering of Kubernetes: The Choreography of State
Kubernetes is not merely an engine for running containers; it is fundamentally a deeply robust, distributed state machine. The entire architecture is built around a single paradigm: Level-Triggered Declarative State. You declare what the world should look like, and independent, asynchronous control loops work relentlessly to make the physical reality match your declaration.
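This level-triggered model can be sketched in a few lines. The following is an illustrative toy, not Kubernetes code; the function and state names are invented:

```python
def reconcile(desired_state: dict, actual_state: dict) -> list[str]:
    """Compare desired vs. observed state and return the actions needed to converge.

    Level-triggered: decisions come from the current snapshot of reality,
    not from a stream of change events, so a missed event is harmless;
    the next pass of the loop sees the drift anyway.
    """
    actions = []
    for name in desired_state.keys() - actual_state.keys():
        actions.append(f"create {name}")  # declared but not running
    for name in actual_state.keys() - desired_state.keys():
        actions.append(f"delete {name}")  # running but no longer declared
    return actions
```

Every controller in the system, from the ReplicaSet controller to the Kubelet, is some variation of this loop, running forever against the API server.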
Part 1: The API Server and etcd
The kube-apiserver is the absolute center of the Kubernetes universe. It is a
stateless REST front end and the only component allowed to communicate
directly with etcd, the highly available, distributed key-value store that acts
as the cluster's permanent memory.
When you run kubectl apply -f pod.yaml, the API server executes a strict
defensive pipeline:
- Authentication: Verifies who you are by checking the cryptographic validity of your TLS client certificate or bearer token.
- Authorization (RBAC): Checks if your specific role (e.g., "Developer") has the "create" verb permission for the "pods" resource in the target namespace.
- Mutating Admission: Webhooks intercept the payload and modify it on the fly (e.g., Istio automatically injecting an Envoy proxy sidecar container into your Pod spec).
- Validating Admission: Webhooks perform final semantic checks (e.g., rejecting the pod if it doesn't specify CPU limits as required by security policy).
Only if all checks pass does the API server serialize the object into Protobuf format and
commit it to etcd. At that moment, the Pod officially "exists" in the
cluster's Desired State, even though no physical container is running yet.
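The defensive pipeline above can be sketched as a chain of checks where any raised error aborts the request before anything touches etcd. This is an illustrative toy: real authenticators, RBAC rules, and admission webhooks are separate components reached over the network, and every name below is invented:

```python
def authenticate(request: dict) -> str:
    # Assumption for this sketch: a valid bearer token maps to a username.
    if request.get("token") != "valid-token":
        raise PermissionError("401 Unauthorized")
    return "developer"

def authorize(user: str, verb: str, resource: str) -> None:
    # RBAC: does this role hold the requested verb on the resource?
    rules = {"developer": {("create", "pods")}}
    if (verb, resource) not in rules.get(user, set()):
        raise PermissionError("403 Forbidden")

def mutate(pod: dict) -> dict:
    # Mutating admission: e.g., inject a sidecar container into the spec.
    pod["spec"]["containers"].append(
        {"name": "istio-proxy", "image": "envoy:latest",
         "resources": {"limits": {"cpu": "100m"}}})
    return pod

def validate(pod: dict) -> None:
    # Validating admission: reject any container without a CPU limit.
    for c in pod["spec"]["containers"]:
        if "cpu" not in c.get("resources", {}).get("limits", {}):
            raise ValueError(f"container {c['name']} has no CPU limit")

def handle_create(request: dict, pod: dict) -> dict:
    user = authenticate(request)
    authorize(user, "create", "pods")
    pod = mutate(pod)
    validate(pod)
    return pod  # only now would the object be serialized and written to etcd
```

Note the ordering: mutation runs before validation, so the validating webhooks see the final shape of the object, sidecars and all.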
Part 2: The Scheduler's Mathematical Filter
The kube-scheduler continuously watches the API server for any Pod with an
empty spec.nodeName field. Its job is narrowly scoped: pick the
best-fit node for each pending Pod. It uses a two-phase algorithm:
- Filtering (Hard Constraints): It eliminates nodes that physically cannot run the Pod. Does the node lack sufficient CPU/RAM? Does it have a taint that the Pod doesn't tolerate? Is the node out of disk space? If a cluster has 1000 nodes, filtering might quickly reduce the eligible candidates to 50.
- Scoring (Soft Constraints): It ranks the surviving 50 nodes. It assigns higher scores to nodes that already have the required container image cached locally, or to nodes that spread the Pod across different physical availability zones (anti-affinity) to maximize fault tolerance.
The Scheduler selects the highest-scoring Node and issues a "Binding" POST request to the
API server, officially updating the Pod's nodeName.
Part 3: The Kubelet and the Container Runtime
Every physical worker node runs an agent called the Kubelet. It
continuously watches the API server, filtering strictly for Pods assigned to its own
nodeName.
When the Kubelet sees its new assignment, it acts as the local orchestrator:
- It tells the container runtime, via the CRI (Container Runtime Interface), such as containerd or CRI-O, to pull the container images from the registry.
- It commands the CNI (Container Network Interface) plugin (like Calico or Cilium) to wire up a virtual Ethernet (veth) pair, connecting the Pod's isolated network namespace to the host's root network namespace, and to assign the Pod an IP address that is unique within the cluster.
- It instructs the CRI to create the Linux cgroups and namespaces and start the container processes, bringing the application to life.
The Kubelet then updates the API server: "Status: Running."
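The sync steps above can be sketched as a function handed callbacks that stand in for the CRI and CNI; everything here is illustrative, not the real interfaces:

```python
def sync_pod(pod: dict, pull_image, setup_network, start_container) -> dict:
    """One pass of a toy Kubelet sync for a freshly bound Pod."""
    # 1. CRI stand-in: pull every container image first.
    for c in pod["spec"]["containers"]:
        pull_image(c["image"])
    # 2. CNI stand-in: wire up the veth pair and get the Pod IP.
    pod_ip = setup_network(pod["metadata"]["name"])
    # 3. CRI stand-in: create cgroups/namespaces and start the processes.
    for c in pod["spec"]["containers"]:
        start_container(c["name"])
    # 4. The status the Kubelet would report back to the API server.
    return {"phase": "Running", "podIP": pod_ip}
```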
Part 4: The Reconciliation Loop
Crucially, the Kubelet's job does not end when the container starts. It runs continuous Liveness and Readiness Probes.
If your Node physically loses power, its Kubelet stops sending heartbeat
updates to the API server. The kube-controller-manager notices the silence and
marks the Node "NotReady". After a grace period (five minutes by default), it evicts every Pod assigned
to the dead node.
If those Pods were managed by a Deployment, the ReplicaSet controller detects that the Current State (2 Pods) no longer equals the Desired State (3 Pods). Without any human intervention, it asks the API server to create a brand-new replacement Pod. The Scheduler sees the new pending Pod, binds it to a healthy Node, that Node's Kubelet spins it up, and the cluster heals itself. This is the chaotic, beautiful resilience of Kubernetes.
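The ReplicaSet controller's arithmetic is just a diff of counts. A sketch with invented names:

```python
def reconcile_replicaset(desired_replicas: int, live_pods: list[str]) -> list[tuple]:
    """Return the create/delete operations needed to converge on the spec."""
    diff = desired_replicas - len(live_pods)
    if diff > 0:
        # Too few Pods (e.g., a node died): ask the API server for replacements.
        return [("create", f"replica-{i}") for i in range(diff)]
    if diff < 0:
        # Too many Pods (e.g., the Deployment was scaled down): delete the surplus.
        return [("delete", name) for name in live_pods[:-diff]]
    return []  # reality matches the declaration; nothing to do
```

Note that the controller never starts a container itself; it only creates Pod objects in the API server, and the Scheduler and Kubelet take it from there.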