How Kubernetes Networking Works

Understanding the "Flat Network", CNI Plugins, and Service Discovery.

[Diagram: Node 1 (192.168.1.10) hosting Pods 10.244.1.2 and 10.244.1.3 on a cni0 bridge; Node 2 (192.168.1.11) hosting Pod 10.244.2.2 and CoreDNS on its own cni0 bridge; links out to an external web endpoint and a DHCP/IPAM request]

Pod IP Allocation

IPAM (IP Address Management)

What Happens

When a Pod is scheduled on a Node, the CNI plugin requests an IP address from the IPAM plugin.

Why

Kubernetes requires every Pod to have a unique IP across the entire cluster without NAT.

Technical Detail

IPAM assigns an IP from the Node's allocated Pod CIDR block.

Example Node 1 CIDR: 10.244.1.0/24 -> Pod gets 10.244.1.2
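The allocation above can be sketched with Python's standard `ipaddress` module. This is a toy IPAM, not a real CNI plugin interface; the class and method names are illustrative.

```python
import ipaddress

class NodeIPAM:
    """Toy IPAM: hands out Pod IPs from one Node's Pod CIDR block."""
    def __init__(self, pod_cidr: str):
        self.network = ipaddress.ip_network(pod_cidr)
        # hosts() yields .1 through .254 for a /24; reserve .1,
        # which conventionally lives on the cni0 bridge as the gateway.
        self.pool = self.network.hosts()
        self.gateway = str(next(self.pool))
        self.allocated = {}

    def allocate(self, pod_name: str) -> str:
        ip = str(next(self.pool))
        self.allocated[pod_name] = ip
        return ip

ipam = NodeIPAM("10.244.1.0/24")
print(ipam.allocate("web-7d4b9"))   # first Pod on Node 1 -> 10.244.1.2
```

The first allocation matches the article's example: the `.0` network address and `.1` gateway are reserved, so the first Pod receives `10.244.1.2`.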

Production Gotchas & Takeaways

The DNS "ndots:5" Issue

Alpine Linux and Node.js resolvers often struggle with Kubernetes' default DNS config (`ndots:5`): any name with fewer than five dots is first tried against every search domain, typically for both A and AAAA records, so a single external API call can trigger up to 10 sequential DNS queries. This frequently overwhelms CoreDNS in production.
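The expansion behaviour can be sketched in a few lines. This mimics the glibc resolver's `ndots` rule; the search domains shown are typical in-cluster defaults, used here for illustration.

```python
def expand_query(name, search_domains, ndots=5):
    """Mimic the resolver's ndots rule: names with fewer than `ndots`
    dots are tried against every search domain before the literal name."""
    if name.endswith("."):            # fully qualified: no expansion at all
        return [name]
    if name.count(".") >= ndots:      # "enough" dots: literal name first
        return [name] + [f"{name}.{d}" for d in search_domains]
    return [f"{name}.{d}" for d in search_domains] + [name]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
# "api.stripe.com" has only 2 dots, so every search domain is tried first:
for q in expand_query("api.stripe.com", search):
    print(q)
```

Each of those candidate names is usually queried twice (A and AAAA), which is how one external call balloons into many sequential round trips to CoreDNS.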

Ingress vs NodePort

A NodePort opens the same port (from the 30000-32767 range by default) on every Node's firewall, while Ingress is an L7 router (like NGINX) sitting behind a single entry point. Use Ingress to route 100 different websites through one IP address via Host header routing.

Beware IP Exhaustion

If your Node's `PodCIDR` is a `/24`, that Node has at most 254 usable Pod addresses (fewer in practice, since the bridge gateway consumes one). In AWS EKS, because Pod IPs are often drawn directly from the VPC subnet, running out of VPC IPs halts cluster scaling.
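The capacity figure is easy to verify with the stdlib `ipaddress` module:

```python
import ipaddress

node_cidr = ipaddress.ip_network("10.244.1.0/24")
# 256 addresses minus the network and broadcast addresses = 254 usable;
# the cni0 bridge/gateway consumes one more in practice.
usable = node_cidr.num_addresses - 2
print(usable)  # 254
```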

The Engineering of Kubernetes Networking: CNI, Overlays, and IPVS

Kubernetes networking is notoriously complex because it mandates a "Flat Network" topology. The fundamental rule is: Every Pod must be able to communicate with every other Pod across the entire cluster without using Network Address Translation (NAT). Achieving this across dozens of physical machines requires a sophisticated layer of virtual routing, masquerading, and kernel-level manipulation.


Part 1: The Linux Network Namespace

To understand Pod networking, we must first understand Linux isolation. By default, a Linux machine has one global networking stack (eth0, routing tables, iptables rules). If two processes try to bind to Port 80, the second bind fails with `EADDRINUSE`.

Containers solve this using Network Namespaces (netns). A netns is an entirely isolated, private network stack created by the kernel. When Kubernetes starts a Pod, it first spawns a hidden "Pause Container". The sole purpose of the Pause Container is to hold open a newly created netns. All other application containers in that Pod join the exact same netns, sharing the same IP address and localhost space.
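The kernel exposes each process's netns as an inode under `/proc`, so namespace sharing is directly observable. A Linux-only sketch (a plain `fork()` inherits the parent's namespaces, the same way application containers join the Pause Container's netns):

```python
import os

def netns_id(pid="self"):
    # The netns appears as a symlink target like "net:[4026531992]";
    # equal inodes mean two processes share one network stack.
    return os.readlink(f"/proc/{pid}/ns/net")

parent_ns = netns_id()
pid = os.fork()
if pid == 0:
    # fork() without CLONE_NEWNET stays in the parent's netns, just as
    # sibling containers in a Pod join the Pause Container's namespace.
    os._exit(0 if netns_id() == parent_ns else 1)
_, status = os.waitpid(pid, 0)
shared = os.waitstatus_to_exitcode(status) == 0
print("child shares parent netns:", shared)
```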

Part 2: vEth Pairs and The CNI Plugin

A private netns is useless if it cannot communicate with the outside world. Kubernetes delegates the physical wiring to a Container Network Interface (CNI) plugin (like Flannel, Calico, or Cilium).

When a Pod boots, the CNI performs three actions:

  1. IPAM: It requests an available IP address from its pool (e.g., 10.244.1.5) and assigns it to the Pod.
  2. vEth Pair Creation: It creates a Virtual Ethernet (vEth) cable. One end is plugged into the Pod's private netns (acting as eth0). The other end is plugged into the Root namespace of the physical Node.
  3. Bridging: It attaches the Root-side of the vEth cable into a virtual switch (a Linux Bridge, often named cni0) running on the Node.
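The three steps above can be modelled as data structures. This is a conceptual sketch, not the real CNI plugin API; class and field names are invented for illustration.

```python
import ipaddress

class NodeCNI:
    """Toy model of one Node's CNI state: an IP pool, veth pairs, a bridge."""
    def __init__(self, pod_cidr: str):
        hosts = ipaddress.ip_network(pod_cidr).hosts()
        self.gateway = str(next(hosts))   # .1 lives on the cni0 bridge
        self.pool = hosts
        self.bridge = {}                  # cni0: host-side veth -> Pod IP

    def attach_pod(self, pod: str) -> str:
        ip = str(next(self.pool))         # 1. IPAM: allocate from the pool
        host_end = f"veth-{pod}"          # 2. veth pair: Pod eth0 <-> host end
        self.bridge[host_end] = ip        # 3. plug the host end into cni0
        return ip

node1 = NodeCNI("10.244.1.0/24")
print(node1.attach_pod("web"))   # 10.244.1.2
print(node1.gateway)             # 10.244.1.1
```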

Part 3: The Overlay Network (VXLAN)

If Pod A (10.244.1.5) on Node 1 wants to talk to Pod B (10.244.2.10) on Node 2, we hit a massive problem. The physical datacenter routers connecting Node 1 and Node 2 only know the IP addresses of the physical Nodes themselves. They have absolutely no idea what a "Pod IP" is. If they see a packet destined for 10.244.2.10, they will drop it immediately.

CNIs solve this using an Overlay Network via Encapsulation (commonly VXLAN or IP-in-IP).

  • Node 1's CNI intercepts the packet leaving Pod A.
  • It takes the entire IP packet (Source: Pod A, Dest: Pod B) and wraps it, small VXLAN header and all, inside a brand new UDP datagram (sent to UDP port 4789).
  • The new outer UDP packet has Source: Node 1 IP, Dest: Node 2 IP.
  • The physical datacenter routers see a normal Node-to-Node UDP packet and happily route it across the wire.
  • Upon receiving the packet, Node 2's CNI unwraps the UDP payload, extracts the original Pod-to-Pod packet, and routes it into the cni0 bridge to reach Pod B.

Note: Some advanced CNIs (like Calico in BGP mode) avoid encapsulation entirely by advertising Pod IP routes directly to the physical datacenter routers using the Border Gateway Protocol, achieving near bare-metal performance.
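The wrap/unwrap step can be sketched with the RFC 7348 VXLAN header layout (8 bytes: a flags word with the I-bit set, then the 24-bit VNI). The "inner" packet here is stand-in bytes rather than a real Ethernet frame:

```python
import struct

VXLAN_PORT = 4789   # IANA-assigned UDP port for VXLAN

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header: flags word with the I-bit set
    (VNI valid), then the 24-bit VNI in the top of the second word.
    The result becomes the payload of an ordinary Node-to-Node UDP packet."""
    header = struct.pack(">II", 0x08 << 24, vni << 8)
    return header + inner_frame

def vxlan_decap(payload: bytes):
    """Receiving Node: strip the header, recover VNI and inner packet."""
    flags_word, vni_word = struct.unpack(">II", payload[:8])
    assert flags_word >> 24 == 0x08, "VNI flag not set"
    return vni_word >> 8, payload[8:]

inner = b"src=10.244.1.5 dst=10.244.2.10 payload"
wire = vxlan_encap(inner, vni=1)          # Flannel's default VNI is 1
vni, recovered = vxlan_decap(wire)
print(vni, recovered == inner)            # 1 True
```

Only the 8-byte header separates "a Pod-to-Pod packet the routers would drop" from "a normal UDP datagram they happily forward".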

Part 4: The ClusterIP Illusion and kube-proxy

Because Pod IPs are volatile (changing whenever a Pod restarts), we use Services. A Service provides a highly available Virtual IP (ClusterIP).

The shocking truth of the Kubernetes Service is that the ClusterIP (e.g., 10.96.0.100) does not physically exist on any network interface. You cannot ping it. It is a mathematical illusion maintained by a daemon called kube-proxy running on every single Node.

kube-proxy constantly watches the API Server. When a Service is created, it writes complex rules directly into the Node's Linux kernel (using iptables or IPVS).

IPTables vs IPVS

Historically, kube-proxy used iptables, creating a sequential list of routing rules. For massive clusters with 10,000 Services, finding the correct rule required an O(N) linear scan, adding measurable CPU overhead and latency. Modern K8s installations switch kube-proxy to IPVS mode, which uses an in-kernel hash table to match VIPs in O(1) time.
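The scaling difference is easy to see in miniature. A toy illustration (the in-kernel data structures are far more involved; a rule list and a dict stand in for them here):

```python
# 10,000 Services: ClusterIP -> backend (illustrative addresses).
services = {f"10.96.{i // 256}.{i % 256}": f"pod-{i}" for i in range(10_000)}
rules = list(services.items())      # iptables: an ordered rule chain

def iptables_lookup(vip):
    # O(N): every packet is checked against rules in sequence until one matches.
    for rule_vip, backend in rules:
        if rule_vip == vip:
            return backend

def ipvs_lookup(vip):
    # O(1): a hash-table lookup, independent of Service count.
    return services.get(vip)

print(iptables_lookup("10.96.39.15") == ipvs_lookup("10.96.39.15"))  # True
```

Both return the same backend; only the cost of finding it differs, and that cost is what separates the two kube-proxy modes at scale.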

When a Pod sends traffic to a ClusterIP, the packet hits the Node's kernel, triggers the iptables/IPVS rule, and is instantly rewritten via Destination NAT (DNAT). The destination IP is literally swapped from the ClusterIP to the IP of a selected backend Pod (random in iptables mode; round-robin by default in IPVS), providing native, distributed load-balancing without a central choke point.
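A minimal sketch of the DNAT rewrite, modelling packets as dicts (the real rewrite happens on kernel packet headers, and the addresses here are illustrative):

```python
import random

# A Service's (ClusterIP, port) maps to its current backend Pod endpoints.
endpoints = {
    ("10.96.0.100", 80): [("10.244.1.5", 8080), ("10.244.2.10", 8080)],
}

def dnat(packet: dict) -> dict:
    """Swap the destination, as the kernel's DNAT rule does.
    Picks a backend at random, mimicking iptables mode."""
    backends = endpoints.get((packet["dst_ip"], packet["dst_port"]))
    if backends:
        packet["dst_ip"], packet["dst_port"] = random.choice(backends)
    return packet

pkt = {"src_ip": "10.244.1.7", "dst_ip": "10.96.0.100", "dst_port": 80}
out = dnat(pkt)
print(out["dst_ip"])   # a real Pod IP, never the ClusterIP
```

By the time the packet leaves the Node's kernel, the ClusterIP is gone from its headers entirely, which is why the VIP never needs to exist on any interface.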

Conclusion: Software-Defined Magic

Kubernetes networking is a masterpiece of software-defined routing. By combining Linux Network Namespaces for isolation, VXLAN overlays for cross-node transit, and kernel-level IPVS hashing for distributed load balancing, K8s creates a seamless, global mesh network that completely abstracts away the physical hardware beneath it.

Glossary & Deep Dives

CNI (Container Network Interface)

A standard that defines how plugins should configure network interfaces in Linux containers. Kubernetes relies entirely on external CNI plugins (like Flannel, Calico, or Cilium) to implement its "flat network" requirement.

IPAM (IP Address Management) is often a sub-component of the CNI. It's responsible for managing the pool of available IPs and doling them out to newly scheduled Pods.

Overlay vs Underlay Networks

The Underlay is your physical data center network (the actual switches and routers connecting your servers). The Overlay is a virtual network built on top.

Because physical switches only know about Node routing, overlay networks tunnel traffic by wrapping a Pod's IP packet inside a standard UDP packet (VXLAN) that the physical switches know how to deliver to the target Node.

The ClusterIP Illusion

A ClusterIP is a Virtual IP (VIP) that provides a stable endpoint for a Service. Crucially, a ClusterIP does not exist on any physical or virtual network interface. It is purely an abstraction maintained by `kube-proxy` mapping rules within the Linux kernel. If you try to `ping` a ClusterIP, it typically fails: the rules match the Service's protocol and port, not ICMP, and no actual interface holds the address.

Kube-Proxy (IPTables vs IPVS)

A daemon on every node that watches the K8s API for Service changes and updates local routing rules.

  • IPTables mode: Creates linear rules in the `KUBE-SERVICES` chain. For large clusters (10,000+ Services), lookup cost grows as O(N) because traffic is checked against every rule in sequence.
  • IPVS mode: Resolves this by using a kernel hash table for O(1) matching, drastically improving routing speed in massive clusters.

Comparing Major CNI Plugins

Flannel

The simplest option. It sets up a straightforward Layer 3 IPv4 overlay utilizing VXLAN. Excellent for small or simple clusters, but lacks support for advanced Kubernetes Network Policies (firewalling Pods).

Calico

Highly scalable. Calico can operate without an overlay network by hooking into the data center's routers via BGP (Border Gateway Protocol). Fully supports strict Network Policies.

Cilium

The modern standard. It can replace `kube-proxy` entirely by injecting logic directly into the Linux Kernel using eBPF (Extended Berkeley Packet Filter), resulting in large performance gains and deep observability.