The Engineering of Kubernetes Networking: CNI, Overlays, and IPVS
Kubernetes networking is notoriously complex because it mandates a "Flat Network" topology. The fundamental rule is: Every Pod must be able to communicate with every other Pod across the entire cluster without using Network Address Translation (NAT). Achieving this across dozens of physical machines requires a sophisticated layer of virtual routing, masquerading, and kernel-level manipulation.
Part 1: The Linux Network Namespace
To understand Pod networking, we must first understand how Linux isolates network resources. By default, a Linux
machine has one global networking stack (eth0, routing tables, iptables
rules). If two processes try to bind to port 80, the second bind fails with EADDRINUSE ("Address already in use").
Containers solve this using Network Namespaces (netns). A netns is an entirely isolated, private network stack created by the kernel. When Kubernetes starts a Pod, it first spawns a hidden "Pause Container". The sole purpose of the Pause Container is to hold open a newly created netns. All other application containers in that Pod join the exact same netns, sharing the same IP address and localhost space.
Part 2: vEth Pairs and The CNI Plugin
A private netns is useless if it cannot communicate with the outside world. Kubernetes delegates the physical wiring to a Container Network Interface (CNI) plugin (like Flannel, Calico, or Cilium).
When a Pod boots, the CNI performs three actions:
- IPAM: It requests an available IP address from its pool (e.g., 10.244.1.5) and assigns it to the Pod.
- vEth Pair Creation: It creates a Virtual Ethernet (vEth) pair, a virtual cable. One end is plugged into the Pod's private netns (appearing as eth0). The other end is plugged into the Root namespace of the physical Node.
- Bridging: It attaches the Root-side end of the vEth pair to a virtual switch (a Linux Bridge, often named cni0) running on the Node.
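The IPAM step can be sketched with Python's standard ipaddress module. The PodIPAM class and the gateway reservation below are illustrative assumptions, not a real plugin's API (real plugins such as host-local also persist allocations to disk so they survive restarts):

```python
import ipaddress

# A toy IPAM allocator for the CNI's first step. The pool 10.244.1.0/24
# mirrors the article's example Node subnet.
class PodIPAM:
    def __init__(self, cidr: str):
        self.pool = ipaddress.ip_network(cidr)
        self.free = iter(self.pool.hosts())   # yields .1 through .254
        next(self.free)                       # reserve .1 for the cni0 gateway
        self.allocated = {}

    def allocate(self, pod_name: str) -> str:
        ip = str(next(self.free))
        self.allocated[pod_name] = ip
        return ip

ipam = PodIPAM("10.244.1.0/24")
print(ipam.allocate("pod-a"))   # -> 10.244.1.2 (first address after the gateway)
```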
Part 3: The Overlay Network (VXLAN)
If Pod A (10.244.1.5) on Node 1 wants to talk to Pod B (10.244.2.10) on Node 2, we hit a massive problem. The physical datacenter routers connecting Node 1
and Node 2 only know the IP addresses of the physical Nodes themselves. They have
absolutely no idea what a "Pod IP" is. If they see a packet destined for
10.244.2.10, they will drop it immediately.
CNIs solve this using an Overlay Network via Encapsulation (commonly VXLAN or IP-in-IP).
- Node 1's CNI intercepts the packet leaving Pod A.
- It takes the entire IP packet (Source: Pod A, Dest: Pod B), prepends a small VXLAN header, and wraps the result inside a brand new UDP datagram.
- The new outer UDP packet has Source: Node 1 IP, Dest: Node 2 IP.
- The physical datacenter routers see a normal Node-to-Node UDP packet and happily route it across the wire.
- Upon receiving the packet, Node 2's CNI unwraps the UDP payload, extracts the original Pod-to-Pod packet, and routes it into the cni0 bridge to reach Pod B.
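The encapsulation and decapsulation steps above can be sketched at the byte level. This models only the 8-byte VXLAN header (RFC 7348); the inner Pod-to-Pod frame is stood in for by a label, and the VNI of 1 is an arbitrary example:

```python
import struct

def vxlan_encap(inner_packet: bytes, vni: int) -> bytes:
    # VXLAN header: 1 flags byte (0x08 = "VNI present"), 3 reserved bytes,
    # then a 24-bit VNI followed by 1 reserved byte.
    header = struct.pack("!B3xI", 0x08, vni << 8)
    return header + inner_packet        # this blob becomes the UDP payload

def vxlan_decap(udp_payload: bytes) -> tuple[int, bytes]:
    flags, vni_field = struct.unpack("!B3xI", udp_payload[:8])
    assert flags & 0x08, "VNI flag must be set"
    return vni_field >> 8, udp_payload[8:]

# Pod A -> Pod B packet, represented here by a label.
inner = b"src=10.244.1.5 dst=10.244.2.10 payload=hello"
wire = vxlan_encap(inner, vni=1)        # carried Node 1 -> Node 2 inside UDP
vni, recovered = vxlan_decap(wire)
print(vni, recovered == inner)          # 1 True
```

The outer IP/UDP headers (Source: Node 1, Dest: Node 2) are added by the kernel's real VXLAN device; only the routers' view of that outer packet matters to the physical network.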
Note: Some advanced CNIs (like Calico in BGP mode) avoid encapsulation entirely by advertising Pod IP routes directly to the physical datacenter routers using the Border Gateway Protocol, achieving near bare-metal performance.
Part 4: The ClusterIP Illusion and kube-proxy
Because Pod IPs are volatile (changing whenever a Pod restarts), we use Services. A Service provides a highly available Virtual IP (ClusterIP).
The shocking truth of the Kubernetes Service is that the ClusterIP (e.g., 10.96.0.100) does not exist on any physical network interface, and you typically cannot ping it. It is an illusion maintained by a daemon called kube-proxy running on every single Node.
kube-proxy constantly watches the API Server. When a Service is created, it
writes complex rules directly into the Node's Linux kernel (using iptables or
IPVS).
iptables vs. IPVS
Historically, kube-proxy used iptables. It created a sequential chain of NAT rules, so for massive clusters with 10,000 Services, matching a packet required an O(N) linear scan, adding measurable per-packet latency. Modern Kubernetes installations switch kube-proxy to IPVS mode, which uses an in-kernel hash table to match Virtual IPs in constant O(1) time.
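The O(N)-versus-O(1) difference is easy to see with a toy model. The rule list and hash table below stand in for iptables chains and the IPVS service table; the Service count and addresses are made up for illustration:

```python
import timeit

# 10,000 fake ClusterIPs in the 10.96.0.0/16 Service range.
N = 10_000
services = [f"10.96.{i // 256}.{i % 256}" for i in range(N)]

rule_list = [(vip, ["backend"]) for vip in services]    # iptables-style chain
rule_table = {vip: ["backend"] for vip in services}     # IPVS-style hash table

def match_linear(dst):
    for vip, backends in rule_list:     # O(N): walk every rule in order
        if vip == dst:
            return backends
    return None

def match_hashed(dst):
    return rule_table.get(dst)          # O(1): a single hash probe

dst = services[-1]                      # worst case for the linear scan
t_linear = timeit.timeit(lambda: match_linear(dst), number=100)
t_hashed = timeit.timeit(lambda: match_hashed(dst), number=100)
print(t_linear > t_hashed)              # True: the scan is far slower
```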
When a Pod sends traffic to a ClusterIP, the packet hits the Node's kernel, triggers the iptables/IPVS rule, and is instantly rewritten via Destination NAT (DNAT). The destination IP is literally swapped from the ClusterIP to the IP of a selected backend Pod (chosen probabilistically in iptables mode, round-robin by default in IPVS mode), providing native, distributed load-balancing without a central choke point.
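The DNAT rewrite itself can be sketched as a table lookup plus an address swap. The Service table below reuses the article's example addresses and stands in for the kernel's NAT state; it is purely illustrative:

```python
import random

# One Service VIP mapping to two backend Pod IPs, as kube-proxy would
# program into the kernel after watching the API Server.
service_backends = {
    "10.96.0.100": ["10.244.1.5", "10.244.2.10"],
}

def dnat(packet: dict) -> dict:
    backends = service_backends.get(packet["dst"])
    if backends:                        # destination is a ClusterIP: rewrite it
        packet = {**packet, "dst": random.choice(backends)}
    return packet                       # non-Service traffic passes untouched

out = dnat({"src": "10.244.3.7", "dst": "10.96.0.100", "dport": 80})
print(out["dst"] in service_backends["10.96.0.100"])   # True
```

The reply path applies the reverse rewrite (SNAT of the source back to the ClusterIP), so the client Pod never learns a backend's real address.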
Conclusion: Software-Defined Magic
Kubernetes networking is a masterpiece of software-defined routing. By combining Linux Network Namespaces for isolation, VXLAN overlays for cross-node transit, and kernel-level IPVS hashing for distributed load balancing, K8s creates a seamless, global mesh network that completely abstracts away the physical hardware beneath it.