The Engineering of Kubernetes Networking: CNI, Overlays, and IPVS
Kubernetes networking is notoriously complex because it mandates a "Flat Network" topology. The fundamental rule is: Every Pod must be able to communicate with every other Pod across the entire cluster without using Network Address Translation (NAT). Achieving this across dozens of physical machines requires a sophisticated layer of virtual routing, masquerading, and kernel-level manipulation.
Part 1: The Linux Network Namespace
To understand Pod networking, we must first understand how Linux isolates network resources. By default, a Linux
machine has one global networking stack (eth0, routing tables, iptables
rules). If two processes try to bind to port 80, the second bind fails with EADDRINUSE ("Address already in use").
Containers solve this using Network Namespaces (netns). A netns is an entirely isolated, private network stack created by the kernel. When Kubernetes starts a Pod, it first spawns a hidden "Pause Container". The sole purpose of the Pause Container is to hold open a newly created netns. All other application containers in that Pod join the exact same netns, sharing the same IP address and localhost space.
Part 2: vEth Pairs and The CNI Plugin
A private netns is useless if it cannot communicate with the outside world. Kubernetes delegates the physical wiring to a Container Network Interface (CNI) plugin (like Flannel, Calico, or Cilium).
When a Pod boots, the CNI performs three actions:
- IPAM: It requests an available IP address from its pool (e.g., 10.244.1.5) and assigns it to the Pod.
- vEth Pair Creation: It creates a Virtual Ethernet (vEth) pair, a virtual cable. One end is plugged into the Pod's private netns (appearing as eth0). The other end is plugged into the Root namespace of the physical Node.
- Bridging: It attaches the Root-side end of the vEth pair to a virtual switch (a Linux Bridge, often named cni0) running on the Node.
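The IPAM step can be sketched with Python's standard ipaddress module. The PodIPAM class and the gateway reservation below are illustrative assumptions, not a real plugin's API (real plugins such as host-local also persist allocations to disk so they survive restarts):

```python
import ipaddress

# A toy IPAM allocator for the CNI's first step. The pool 10.244.1.0/24
# mirrors the article's example Node subnet.
class PodIPAM:
    def __init__(self, cidr: str):
        self.pool = ipaddress.ip_network(cidr)
        self.free = iter(self.pool.hosts())   # yields .1 through .254
        next(self.free)                       # reserve .1 for the cni0 gateway
        self.allocated = {}

    def allocate(self, pod_name: str) -> str:
        ip = str(next(self.free))
        self.allocated[pod_name] = ip
        return ip

ipam = PodIPAM("10.244.1.0/24")
print(ipam.allocate("pod-a"))   # -> 10.244.1.2 (first address after the gateway)
```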
Part 3: The Overlay Network (VXLAN)
If Pod A (10.244.1.5) on Node 1 wants to talk to Pod B (10.244.2.10) on Node 2, we hit a massive problem. The physical datacenter routers connecting Node 1
and Node 2 only know the IP addresses of the physical Nodes themselves. They have
absolutely no idea what a "Pod IP" is. If they see a packet destined for
10.244.2.10, they will drop it immediately.
CNIs solve this using an Overlay Network via Encapsulation (commonly VXLAN or IP-in-IP).
- Node 1's CNI intercepts the packet leaving Pod A.
- It takes the entire IP packet (Source: Pod A, Dest: Pod B), prepends a small VXLAN header, and wraps the result inside a brand new UDP datagram.
- The new outer UDP packet has Source: Node 1 IP, Dest: Node 2 IP.
- The physical datacenter routers see a normal Node-to-Node UDP packet and happily route it across the wire.
- Upon receiving the packet, Node 2's CNI unwraps the UDP payload, extracts the original Pod-to-Pod packet, and routes it into the cni0 bridge to reach Pod B.
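The encapsulation and decapsulation steps above can be sketched at the byte level. This models only the 8-byte VXLAN header (RFC 7348); the inner Pod-to-Pod frame is stood in for by a label, and the VNI of 1 is an arbitrary example:

```python
import struct

def vxlan_encap(inner_packet: bytes, vni: int) -> bytes:
    # VXLAN header: 1 flags byte (0x08 = "VNI present"), 3 reserved bytes,
    # then a 24-bit VNI followed by 1 reserved byte.
    header = struct.pack("!B3xI", 0x08, vni << 8)
    return header + inner_packet        # this blob becomes the UDP payload

def vxlan_decap(udp_payload: bytes) -> tuple[int, bytes]:
    flags, vni_field = struct.unpack("!B3xI", udp_payload[:8])
    assert flags & 0x08, "VNI flag must be set"
    return vni_field >> 8, udp_payload[8:]

# Pod A -> Pod B packet, represented here by a label.
inner = b"src=10.244.1.5 dst=10.244.2.10 payload=hello"
wire = vxlan_encap(inner, vni=1)        # carried Node 1 -> Node 2 inside UDP
vni, recovered = vxlan_decap(wire)
print(vni, recovered == inner)          # 1 True
```

The outer IP/UDP headers (Source: Node 1, Dest: Node 2) are added by the kernel's real VXLAN device; only the routers' view of that outer packet matters to the physical network.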
Note: Some advanced CNIs (like Calico in BGP mode) avoid encapsulation entirely by advertising Pod IP routes directly to the physical datacenter routers using the Border Gateway Protocol, achieving near bare-metal performance.
Part 4: The ClusterIP Illusion and kube-proxy
Because Pod IPs are volatile (changing whenever a Pod restarts), we use Services. A Service provides a highly available Virtual IP (ClusterIP).
The shocking truth of the Kubernetes Service is that the ClusterIP (e.g., 10.96.0.100) does not exist on any physical network interface, and you typically cannot ping it. It is an illusion maintained by a daemon called kube-proxy running on every single Node.
kube-proxy constantly watches the API Server. When a Service is created, it
writes complex rules directly into the Node's Linux kernel (using iptables or
IPVS).
iptables vs. IPVS
Historically, kube-proxy used iptables. It created a sequential chain of NAT rules, so for massive clusters with 10,000 Services, matching a packet required an O(N) linear scan, adding measurable per-packet latency. Modern Kubernetes installations switch kube-proxy to IPVS mode, which uses an in-kernel hash table to match Virtual IPs in constant O(1) time.
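The O(N)-versus-O(1) difference is easy to see with a toy model. The rule list and hash table below stand in for iptables chains and the IPVS service table; the Service count and addresses are made up for illustration:

```python
import timeit

# 10,000 fake ClusterIPs in the 10.96.0.0/16 Service range.
N = 10_000
services = [f"10.96.{i // 256}.{i % 256}" for i in range(N)]

rule_list = [(vip, ["backend"]) for vip in services]    # iptables-style chain
rule_table = {vip: ["backend"] for vip in services}     # IPVS-style hash table

def match_linear(dst):
    for vip, backends in rule_list:     # O(N): walk every rule in order
        if vip == dst:
            return backends
    return None

def match_hashed(dst):
    return rule_table.get(dst)          # O(1): a single hash probe

dst = services[-1]                      # worst case for the linear scan
t_linear = timeit.timeit(lambda: match_linear(dst), number=100)
t_hashed = timeit.timeit(lambda: match_hashed(dst), number=100)
print(t_linear > t_hashed)              # True: the scan is far slower
```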
When a Pod sends traffic to a ClusterIP, the packet hits the Node's kernel, triggers the iptables/IPVS rule, and is instantly rewritten via Destination NAT (DNAT). The destination IP is literally swapped from the ClusterIP to the IP of a selected backend Pod (chosen probabilistically in iptables mode, round-robin by default in IPVS mode), providing native, distributed load-balancing without a central choke point.
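The DNAT rewrite itself can be sketched as a table lookup plus an address swap. The Service table below reuses the article's example addresses and stands in for the kernel's NAT state; it is purely illustrative:

```python
import random

# One Service VIP mapping to two backend Pod IPs, as kube-proxy would
# program into the kernel after watching the API Server.
service_backends = {
    "10.96.0.100": ["10.244.1.5", "10.244.2.10"],
}

def dnat(packet: dict) -> dict:
    backends = service_backends.get(packet["dst"])
    if backends:                        # destination is a ClusterIP: rewrite it
        packet = {**packet, "dst": random.choice(backends)}
    return packet                       # non-Service traffic passes untouched

out = dnat({"src": "10.244.3.7", "dst": "10.96.0.100", "dport": 80})
print(out["dst"] in service_backends["10.96.0.100"])   # True
```

The reply path applies the reverse rewrite (SNAT of the source back to the ClusterIP), so the client Pod never learns a backend's real address.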
Conclusion: Software-Defined Magic
Kubernetes networking is a masterpiece of software-defined routing. By combining Linux Network Namespaces for isolation, VXLAN overlays for cross-node transit, and kernel-level IPVS hashing for distributed load balancing, K8s creates a seamless, global mesh network that completely abstracts away the physical hardware beneath it.