How Containers Actually Work

Linux namespaces, cgroups, and OverlayFS. The building blocks behind Docker and Kubernetes.

[Diagram: the host Linux kernel runs Container A (nginx, PID 1 inside) and Container B (postgres, PID 1 inside) side by side. Namespaces = isolation; cgroups = resource limits.]

Containers ≠ VMs

Shared kernel

What Happens

Containers are NOT virtual machines. They share the host Linux kernel. No guest OS.

Why

Faster startup (ms vs minutes), less overhead (MB vs GB), higher density.

Technical Detail

VMs virtualize hardware. Containers virtualize the OS (process isolation).

Example Docker container: ~100MB. Ubuntu VM: ~2GB.

Key Takeaways

Not VMs

Containers share the host kernel. Lighter but weaker isolation.

Linux Primitives

Namespaces + cgroups + OverlayFS = Container runtime.

Security Matters

Kernel exploits can escape. Use seccomp, AppArmor, rootless mode.

The Engineering of Linux Containers: Deconstructing the Magic

The single greatest misconception in modern software engineering is that containers are "lightweight virtual machines." They aren't. In fact, strictly speaking, Linux containers don't even exist: "container" is a user-friendly abstraction, a convenient name for the combination of three long-standing Linux kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.


Part 1: The Virtual Machine Tax

To understand why the industry abandoned Virtual Machines (VMs) for microservices, you must understand the "hypervisor tax."

When you boot an AWS EC2 instance (a VM), the physical host server runs a Hypervisor (like KVM or Xen). The hypervisor partitions the physical CPU, RAM, and disk, and then boots an entirely new "Guest" Operating System (like Ubuntu). That Guest OS has its own kernel, its own boot sequence, and its own background daemons.

If you want to run 10 isolated Node.js apps on a physical server, running 10 VMs means you are booting the Linux kernel 10 separate times, sacrificing gigabytes of RAM just to keep the redundant operating systems alive.

Containers share the host's kernel. A container is simply a standard Linux process (just like `htop` or `ssh`) that has been aggressively isolated using kernel features so that it believes it is alone on the machine. Because there is no Guest OS to boot, containers start in milliseconds and consume only the RAM required for the application itself.

Part 2: Namespaces (The Illusion of Solitude)

How do you trick a process into thinking it's alone on a server? You use Namespaces. Namespaces dictate what a process is allowed to see.

  1. PID Namespace: Normally, every process on a host shares a single global PID space. If you run `ps aux` on a host, you see thousands of processes. However, when Docker starts a container, it requests a new PID namespace. The first process inside the container is assigned PID 1. It cannot see the host's processes, nor the processes of any other container.
  2. NET Namespace: The container gets its own isolated network stack—its own routing table, firewall (iptables), and virtual ethernet (`veth`) interfaces. It binds to port 80 without conflicting with another container also listening on port 80.
  3. MNT Namespace: Provides an isolated filesystem mount tree. The container cannot see the host's `/etc` or `/var` directories; it only sees the directories explicitly provided to it by the container runtime.
  4. UTS Namespace: Allows the container to have its own isolated hostname and domain name, distinct from the physical server.
  5. USER Namespace: Allows a process to run as `root` (UID 0) inside the container, but maps that user to an unprivileged standard user (UID 1000) on the host system. This is crucial for security.
# You can create a PID namespace manually, without Docker (requires root):
$ sudo unshare --fork --pid --mount-proc /bin/bash
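Even without root, you can inspect which namespaces a process belongs to: the kernel exposes them as symlinks under `/proc/<pid>/ns`. A minimal, unprivileged sketch:

```shell
# List the namespaces the current shell belongs to; two processes in the
# same namespace resolve to the same identifier here.
ls /proc/self/ns

# Each entry resolves to a type and an inode number, e.g. 'pid:[4026531836]'.
readlink /proc/self/ns/pid
```

Comparing `readlink /proc/<pid>/ns/pid` across two processes tells you whether they share a PID namespace; container runtimes create a fresh entry here for every container.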

Part 3: cgroups (The Enforcement of Limits)

If Namespaces limit what a process can see, Control Groups (cgroups) limit what a process can use.

Without cgroups, a single runaway memory leak in Container A could consume 100% of the physical server's RAM, triggering the kernel's Out-Of-Memory (OOM) killer, which might indiscriminately murder Container B and Container C to save the system.

When you run `docker run --memory="512m" --cpus="0.5" nginx`, Docker instructs the Linux kernel to create a new cgroup and place the container's processes in it. The kernel then enforces the limits:

  • If the Nginx process attempts to exceed its 512MB memory limit, the kernel's OOM killer terminates it and the container exits with status OOMKilled.
  • If the process tries to consume 100% of a CPU, the kernel's Completely Fair Scheduler (CFS) throttles it to 50% of one core.
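Any process can see its own cgroup membership via `/proc/self/cgroup`; the limit files themselves live under `/sys/fs/cgroup`, and writing them needs root. A sketch, assuming a cgroup v2 host (the `demo` group name is illustrative):

```shell
# Show which cgroup the current process belongs to (one line on cgroup v2).
cat /proc/self/cgroup

# Creating and limiting a group requires root; this mirrors what Docker's
# --memory=512m --cpus=0.5 flags configure (sketch only):
#   mkdir /sys/fs/cgroup/demo
#   echo 536870912      > /sys/fs/cgroup/demo/memory.max   # 512 MiB
#   echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max      # 50ms per 100ms = 0.5 CPU
#   echo $$             > /sys/fs/cgroup/demo/cgroup.procs # move this shell in
```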

Part 4: Union Filesystems (The Illusion of Disk)

If you have 50 containers running Ubuntu on a server, and Ubuntu is 200MB, does the server waste 10GB of disk space storing 50 identical copies of the Ubuntu OS directories? No. This is solved by Union Filesystems (most notably OverlayFS).

A Docker image is not a single large file. It is a stack of read-only layers.

  • Layer 1: Base Alpine Linux (5MB)
  • Layer 2: `apk add nodejs` (20MB)
  • Layer 3: `COPY package.json` (1KB)

When Docker starts the container, it takes these read-only layers, stacks them logically, and places a thin, initially empty Read/Write (R/W) layer on top.

If 50 containers boot from that same image, they all share the exact same 25MB of read-only layers on disk. The only disk space consumed is the tiny R/W layer specific to each container. If a container needs to modify a file from a read-only layer, the kernel initiates a Copy-on-Write (CoW) operation: it copies the file up to the R/W layer, modifies it there, and obscures the read-only original beneath it.
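OverlayFS can be driven by hand, which makes the layering concrete. A sketch of the directory layout a runtime builds, with the mount step commented out because it needs root (the names `lower`, `upper`, `work`, and `merged` are conventional, not required):

```shell
# OverlayFS needs: read-only lower dir(s), a writable upper dir,
# a scratch workdir, and a mountpoint for the merged view.
mkdir -p lower upper work merged
echo "shipped in the image layer" > lower/app.conf

# The actual mount requires CAP_SYS_ADMIN (root); sketch:
#   mount -t overlay overlay \
#     -o lowerdir=lower,upperdir=upper,workdir=work merged
#
# Writing to merged/app.conf afterwards copies the file up into upper/
# (copy-on-write); lower/app.conf stays byte-for-byte unchanged.
cat lower/app.conf
```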

Part 5: Security and Kernel Exploits

Because containers share the host kernel, they are inherently less secure than hardware Virtual Machines.

If a hacker finds a vulnerability in the Linux Kernel itself (like a memory corruption bug in the TCP/IP stack), they can execute a "Container Escape." They exploit the bug from within the container to gain raw root access on the underlying host, instantly compromising every other container on the machine.

To mitigate this, modern container runtimes restrict the system calls containers are allowed to make using a technology called Seccomp (Secure Computing Mode). By default, Docker blocks containers from executing over 40 dangerous Linux syscalls (like `kexec_load` or `reboot`), closing off many known kernel exploit vectors before they ever reach kernel code.
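The kernel reports a process's seccomp state in `/proc/self/status`, which makes the filtering easy to observe: mode 0 means no filtering, mode 2 means a syscall filter is active. An unprivileged check:

```shell
# Seccomp mode of the current process:
#   0 = disabled, 1 = strict, 2 = filter (what Docker's default profile uses)
grep '^Seccomp:' /proc/self/status
```

Running the same check inside a default Docker container (assuming Docker is installed), e.g. `docker run --rm alpine grep Seccomp /proc/1/status`, should report mode 2.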

Conclusion: The Modern Symphony

Containers altered the trajectory of cloud computing. By orchestrating low-level Linux kernel features, they provide the illusion of virtualized isolation with near bare-metal performance. They are the atomic unit of the modern internet, the foundation on which the Kubernetes orchestration layer was built.

Glossary & Concepts

Linux Namespaces

Kernel feature that isolates process views: PIDs, network, filesystem mounts, hostnames, IPC, user IDs.

cgroups (Control Groups)

Kernel feature that limits, accounts for, and isolates resource usage (CPU, memory, I/O, network).

OverlayFS / Union FS

Union filesystem that stacks read-only layers with a thin read-write layer on top. Copy-on-Write semantics.

veth pair

Virtual Ethernet device pair. One end in the container, one end on the host bridge. Enables container networking.

OCI (Open Container Initiative)

Industry standard for container formats and runtimes. Docker images follow OCI spec. runc is the reference runtime.

seccomp

Secure computing mode. Filters system calls a process can make. Essential for container security.