Containers from Scratch

(For a better experience read this on my own blog - https://mraviral.in/blog/containers-from-scratch )

Docker often feels like magic. You type docker run, and milliseconds later, you have a pristine, isolated environment. For a long time, I settled for the standard explanation: "It’s like a lightweight VM."

While that helps conceptually, it’s technically misleading. A container isn't a single object inside the Linux kernel; it's a construct made of several kernel features working together. To truly understand containers, I decided to stop treating them as black boxes and build one myself.

In this post, we are going to replicate docker run using Go. We will peel back the layers of abstraction to see how Namespaces, File Systems, and Cgroups combine to create the illusion of a container.

The Goal

We will be replicating the docker run command, as it starts a containerized process. Let's observe some things about an actual Docker container first in order to replicate it. If we run a container using docker run --rm -it ubuntu /bin/bash, observe the differences in the output of commands between the container and the actual host.

container output

host output

hostname: The container acts like a different host with a random ID assigned as its hostname.
ps: The container shows processes with very low PIDs (like 1, 2... 10), whereas on the host, the PIDs are much higher (like 239321).

This means Docker containers act like separate hosts with isolated processes.

I will be using Go for this purpose, and we will be replicating the following command:

docker run <image>          <command> <params>
go     run  main.go run     <command> <params>

So, docker run -it ubuntu /bin/bash becomes go run main.go run /bin/bash in our case.

The Beginning

We create a main.go that will run any command given by the user:

package main

import (
    "fmt"
    "os"
    "os/exec"
)

func run() {
    fmt.Printf("[run()] running: %v\n", os.Args[2:])
    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        fmt.Println("[run()] error:", err)
        panic(err)
    }
}

func main() {
    fmt.Println("container demo")
    switch os.Args[1] {
    case "run":
        run()
    default:
        panic("bad command")
    }
}

This is a simple Go program that takes a command as an argument (like run) and executes the subsequent arguments as a new process. It connects the standard input, output, and error streams of the new process to the current terminal, allowing us to interact with it.
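One thing to watch out for: the program indexes os.Args directly, so running it without enough arguments panics with an index-out-of-range error. Below is a minimal sketch of how the arguments could be validated first. The parseArgs helper is my own hypothetical addition and is not part of the program above.

```go
package main

import (
    "errors"
    "fmt"
    "os"
)

// parseArgs (hypothetical helper) splits the arguments that follow the binary
// name into the subcommand, the program to execute, and that program's own
// parameters. It returns an error instead of letting an index panic.
func parseArgs(args []string) (string, string, []string, error) {
    if len(args) < 2 {
        return "", "", nil, errors.New("usage: run <command> [params...]")
    }
    return args[0], args[1], args[2:], nil
}

func main() {
    sub, prog, rest, err := parseArgs(os.Args[1:])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Printf("subcommand=%s program=%s params=%v\n", sub, prog, rest)
}
```

With a check like this, go run main.go with no arguments prints a usage line instead of a stack trace.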

Let's try running a simple echo command using this:

It works without any problem. Now, let's try to launch a bash shell inside this as a process:

go run main.go run /bin/bash

It ran without any problems. That should mean that we are inside the bash shell of our process, right? But how do we prove it? If I try to exit from this shell, I should return to my original shell session. Let's try it!

exit from container shell

Yes! This proves we were indeed inside a bash shell started by our program. Nothing is isolated yet, so it is just an ordinary child process, but this is starting to feel a bit like containers, and it only gets better from here.

But we want our container to have its own hostname. How do we do this? The answer is Namespaces.

Namespaces

Namespaces define what a process can see. They are created with syscalls to restrict the container's view of the host machine. This includes things like:

UNIX Time-Sharing System (UTS)
Process IDs (PID)
File System (Mount points)
Users
IPC
Networking

Each namespace type isolates one kind of resource for the processes inside it (e.g., UTS gives an isolated hostname, PID gives an isolated process tree).
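To see which namespaces a process currently belongs to, we can read the symlinks under /proc/<pid>/ns. Here is a small sketch, assuming a Linux host with /proc mounted. The parseNsLink helper is a hypothetical name of mine. Two processes that print the same inode number for a namespace type are in the same namespace.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// parseNsLink (hypothetical helper) parses the target of a
// /proc/<pid>/ns/<name> symlink, which looks like "uts:[4026531838]".
func parseNsLink(target string) (string, uint64, error) {
    kind, rest, ok := strings.Cut(target, ":[")
    if !ok || !strings.HasSuffix(rest, "]") {
        return "", 0, fmt.Errorf("unexpected namespace link %q", target)
    }
    inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
    if err != nil {
        return "", 0, err
    }
    return kind, inode, nil
}

func main() {
    for _, name := range []string{"uts", "pid", "mnt", "net", "ipc", "user"} {
        target, err := os.Readlink("/proc/self/ns/" + name)
        if err != nil {
            fmt.Println(name, "error:", err)
            continue
        }
        kind, inode, err := parseNsLink(target)
        if err != nil {
            fmt.Println(name, "error:", err)
            continue
        }
        fmt.Printf("%-4s namespace inode %d\n", kind, inode)
    }
}
```

Running this once on the host and once inside the container we build later is a quick way to check which namespaces actually differ.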

Giving Our Container its Hostname

We can do this in Go by setting SysProcAttr on the command before calling cmd.Run() (this needs the syscall package in the import list):

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS,
}

The CLONE_NEWUTS flag creates a new UTS (UNIX Time-Sharing) namespace. This namespace isolates the hostname and the NIS domain name. By using this flag, changes made to the hostname inside the new process will not affect the host system or other processes, effectively giving the container its own identity.
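Cloneflags is a bitmask, so several namespaces can be requested at once by OR-ing their flags together. Here is a rough sketch of that idea. The nsFlags table and the cloneFlags helper are hypothetical names of mine, and the code only builds on Linux because the CLONE_NEW* constants are Linux-specific.

```go
package main

import (
    "fmt"
    "syscall"
)

// nsFlags (hypothetical table) maps short namespace names to clone(2) flags.
var nsFlags = map[string]uintptr{
    "uts":  syscall.CLONE_NEWUTS,
    "pid":  syscall.CLONE_NEWPID,
    "mnt":  syscall.CLONE_NEWNS,
    "ipc":  syscall.CLONE_NEWIPC,
    "net":  syscall.CLONE_NEWNET,
    "user": syscall.CLONE_NEWUSER,
}

// cloneFlags ORs together the flags for the requested namespaces.
func cloneFlags(names ...string) (uintptr, error) {
    var flags uintptr
    for _, n := range names {
        f, ok := nsFlags[n]
        if !ok {
            return 0, fmt.Errorf("unknown namespace %q", n)
        }
        flags |= f
    }
    return flags, nil
}

func main() {
    flags, err := cloneFlags("uts", "pid")
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    // This value could be assigned to syscall.SysProcAttr.Cloneflags.
    fmt.Printf("Cloneflags: %#x\n", flags)
}
```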

Let's now launch a new bash shell using this and check the hostname:

Here we see that it has inherited the hostname of the host by default. Let's try changing it to container:

The change was successful! It now shows container as its new hostname. If we check the hostname of our host for confirmation, it is still the old one.

Now that we have achieved different hostnames for our container and host, let's take it a step further. I want the hostname of the container to be set automatically on launch so that the bash shell picks up the prompt, giving us better clarity.

If we explore a bit, this can be achieved by using syscall.Sethostname() in Go, but there is a problem. If I do this:

...
if err := cmd.Run(); err != nil {
    fmt.Println("[run()] error:", err)
    panic(err)
}
syscall.Sethostname([]byte("container"))
...

The hostname will be set after the process has finished running, so we won't be able to use it inside the container. And we can't place it before cmd.Run() either, because that would change the hostname of the host machine (since the new namespace hasn't been created yet). We need the hostname change to happen inside the new namespace, but before the user's command executes.

Hence, we need to do a split.

The Split

We split the code into run() and child() functions as follows:

package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func run() {
    fmt.Printf("[run()] running: %v\n", os.Args[2:])
    cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS,
    }

    if err := cmd.Run(); err != nil {
        fmt.Println("[run()] error:", err)
        panic(err)
    }
}

func child() {
    fmt.Printf("[child()] running %v\n", os.Args[2:])

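    // We are inside the new UTS namespace here, so the host keeps its hostname.
    if err := syscall.Sethostname([]byte("container")); err != nil {
        panic(err)
    }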

    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        fmt.Println("[child()] error:", err)
        panic(err)
    }
}

func main() {
    fmt.Println("container demo")
    switch os.Args[1] {
    case "run":
        run()
    case "child":
        child()
    default:
        panic("bad command")
    }
}

Here, run() is responsible for setting up the namespaces and starting the container process. However, instead of running the user's command directly, it runs the same program again (using /proc/self/exe) but with the command child.

The child() function is executed inside the new namespaces (because run() created them with Cloneflags). Inside child(), we can safely set the hostname and then execute the user's command.

Note: /proc/self/exe is a special symbolic link in the proc filesystem that points to the executable file of the currently running process. This allows our program to re-execute itself without needing to know its own path.
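It helps to look at the exact argument list the re-executed process receives, because that explains why child() reads the user's command from os.Args[2]. Here is a small sketch; the childArgv helper is a hypothetical name of mine that wraps the same append call used in run().

```go
package main

import (
    "fmt"
    "os"
    "os/exec"
)

// childArgv (hypothetical helper) builds the arguments for the re-executed
// program: the "child" subcommand followed by the user's command and params.
func childArgv(userArgs []string) []string {
    argv := make([]string, 0, len(userArgs)+1)
    argv = append(argv, "child")
    return append(argv, userArgs...)
}

func main() {
    cmd := exec.Command("/proc/self/exe", childArgv(os.Args[1:])...)
    // cmd.Args is what the re-executed process will see as os.Args:
    // index 0 is /proc/self/exe, index 1 is "child", index 2 is the command.
    for i, a := range cmd.Args {
        fmt.Printf("os.Args[%d] = %s\n", i, a)
    }
}
```

This sketch only prints the arguments and never starts the command, so it is safe to run without root.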

Now, run the command and see that the bash prompt automatically picks up the custom hostname of the container.

bash prompt automatically picks up the custom hostname of the container

Process Isolation

If we check the ps command inside our container, we get:

output of ps command initially in container

Observe the high PID numbers; this means that the container is still able to see the host processes. We want it to show only the container processes when ps is used inside it. For this, we need to give it a namespace for processes: syscall.CLONE_NEWPID.

So let's add it:

...
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID,
}
...

If we now run our container and list the processes, we should see new low PID numbers, right? Let's check.

ps output of root@container still showing high pid numbers

We can see that the container is still seeing the host processes. Why? For this, we need to understand in depth how the ps command and processes work in Unix-based systems.

Processes in Linux

ps looks into the /proc/ directory, which contains all the information related to the system's processes. Let's explore this directory on our host.

output of ps command on host

The output shows directories for all the processes on the system. If we go inside any of these, we can see the information related to that process. Let's find a running process and see what's inside.

ps command output

Let's go inside the process of our currently active bash shell.

inside proc/ of bash shell process

We can also do some more experiments here. For example, if we see what ls -l /proc/self gives:

output of ls -l /proc/self

Observe how each time it gives a new PID as output. This is because the ls command gets a new PID each time it runs, and /proc/self points to that process. Further, if we check what ls -l /proc/self/exe gives:

output of ls -l /proc/self/exe

Hence, we can see that /proc/self/exe points to the executable of the current process. As we are executing the ls command, it points to the location of the ls binary.
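To make the role of /proc concrete, here is a rough sketch of a tiny ps-like listing: it treats every numerically named directory under /proc as a process and prints the name found in its comm file. This is only an approximation of what the real ps does, and pidsFromNames is a hypothetical helper of mine.

```go
package main

import (
    "fmt"
    "os"
    "sort"
    "strconv"
)

// pidsFromNames (hypothetical helper) keeps only the names that are positive
// integers, which is how per-process directories appear under /proc.
func pidsFromNames(names []string) []int {
    var pids []int
    for _, n := range names {
        if pid, err := strconv.Atoi(n); err == nil && pid > 0 {
            pids = append(pids, pid)
        }
    }
    sort.Ints(pids)
    return pids
}

func main() {
    entries, err := os.ReadDir("/proc")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    names := make([]string, 0, len(entries))
    for _, e := range entries {
        if e.IsDir() {
            names = append(names, e.Name())
        }
    }
    for _, pid := range pidsFromNames(names) {
        comm, err := os.ReadFile(fmt.Sprintf("/proc/%d/comm", pid))
        if err != nil {
            continue // the process may have exited in the meantime
        }
        fmt.Printf("%6d %s", pid, comm) // comm already ends with a newline
    }
}
```

Whatever /proc this program can see determines what it lists, which is exactly why the container still shows host processes at this point.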

We now understand that, for isolated process management, our container needs its own /proc, and for that it needs its own filesystem. Let's see how this can be done.

Giving the Container its Filesystem

To generate an elementary root file system, we can use Docker to export the filesystem that the Ubuntu image uses. This can be done using the following commands:

docker create --name temp_export_container ubuntu
docker export temp_export_container > filesystem.tar
docker rm temp_export_container
mkdir -p container-fs
tar -xf filesystem.tar -C container-fs

Check that we now have a basic file system ready:

output of ls and contents of container-fs

Now we need to tell our container to use this filesystem as its root. This can be done by using chroot:

    syscall.Chroot("container-fs")
    syscall.Chdir("/")

To verify that the container indeed uses its own filesystem, I created a test file CONTAINER-ROOT-TEST-FILE. Now let's run the container and see the output of ls:

confirm container uses its own file system

We can see that container-fs is acting as the filesystem for our container.
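The Chroot snippet above ignores errors, so a typo in the directory name would go unnoticed. Below is a minimal sketch of a sanity check that could run before the chroot. checkRootfs is a hypothetical helper of mine, and it is only a rough check: os.Stat resolves symlinks against the host, so absolute symlinks inside the root filesystem can give misleading answers.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// checkRootfs (hypothetical helper) reports an error if dir is not a
// directory, or if prog (a path as seen from inside the container) does not
// exist underneath it.
func checkRootfs(dir, prog string) error {
    info, err := os.Stat(dir)
    if err != nil {
        return fmt.Errorf("rootfs: %w", err)
    }
    if !info.IsDir() {
        return fmt.Errorf("rootfs %q is not a directory", dir)
    }
    if _, err := os.Stat(filepath.Join(dir, prog)); err != nil {
        return fmt.Errorf("program not found in rootfs: %w", err)
    }
    return nil
}

func main() {
    if err := checkRootfs("container-fs", "/bin/bash"); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("container-fs looks usable")
}
```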

Let's do one more experiment here. Launch a long-running process on our container (e.g., sleep 200):

launch sleep 200 in container

We can find this process's PID using ps -C sleep. If we check the /proc of this process:

prove that host can see the root of container as the container-fs

This proves that the root of the container is the directory that we just chrooted into.

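Docker works in a similar way: an image packs the filesystem, and when we run a container, it unpacks the filesystem somewhere and switches the process's root to it, allowing the running container to use it as /. (Production runtimes typically use pivot_root rather than chroot for this step, but the idea is the same.)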

Getting Processes to Work

ps still does not work inside the container filesystem. That is because /proc is a pseudo-filesystem, and we have to mount it to use it.

ps does not work in the container

Modify our child() function as follows:

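func child() {
    fmt.Printf("[child()] running %v\n", os.Args[2:])

    // We are inside the new UTS namespace, so the host keeps its hostname.
    if err := syscall.Sethostname([]byte("container")); err != nil {
        panic(err)
    }
    // Make container-fs the root directory, then move into the new "/".
    if err := syscall.Chroot("container-fs"); err != nil {
        panic(err)
    }
    if err := syscall.Chdir("/"); err != nil {
        panic(err)
    }

    // Mount a fresh proc filesystem at /proc (relative to the new root).
    if err := syscall.Mount("proc", "proc", "proc", 0, ""); err != nil {
        panic(err)
    }
    // Deferred right after mounting so it also runs if cmd.Run() fails below.
    defer syscall.Unmount("/proc", 0)

    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin = os.Stdin
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr

    if err := cmd.Run(); err != nil {
        fmt.Println("[child()] error:", err)
        panic(err)
    }
}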

We mount the proc filesystem to /proc so that tools like ps can inspect processes. Since we are in a new PID namespace, mounting a fresh proc filesystem will show only the processes in this namespace. We use defer syscall.Unmount to ensure cleanup when the container exits.

Now let's check if ps works in our container.

ps command working now inside the container

Hurray!

Diving Deeper into Mounts and Processes

See the outputs of mount commands on our container and host:

output of mount in container

output of mount in host

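We can see the container's /proc mount from the host here. This happens because the container still shares the host's mount namespace, so every mount it makes is visible on the host. For true isolation, we should give it its own mount namespace: syscall.CLONE_NEWNS. A new mount namespace alone may not be enough, because on many systems (those using systemd, for example) the root mount uses shared propagation and mounts can still propagate back to the host. To prevent that, we also set Unshareflags to syscall.CLONE_NEWNS, which makes Go remount / as private, recursively, in the child.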

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
    Unshareflags: syscall.CLONE_NEWNS,
}

Now, the same mount command on the host does not show the container's /proc mount.

host does not show the container mount now
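Whether a mount can propagate to other mount namespaces is recorded in /proc/<pid>/mountinfo: mounts with shared propagation carry a shared:N tag in the optional fields. Here is a small sketch that reads this file for the current process. The propagation helper is a hypothetical name of mine, and the parsing assumes the field layout documented for mountinfo (mount point in the fifth field, optional fields from the seventh field up to a lone "-").

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// propagation (hypothetical helper) returns the mount point of a mountinfo
// line and whether it has a "shared:N" optional field. The last result is
// false when the line does not look like a mountinfo entry.
func propagation(line string) (string, bool, bool) {
    fields := strings.Fields(line)
    if len(fields) < 7 {
        return "", false, false
    }
    shared := false
    // Optional fields start at index 6 and end at the "-" separator.
    for _, f := range fields[6:] {
        if f == "-" {
            return fields[4], shared, true
        }
        if strings.HasPrefix(f, "shared:") {
            shared = true
        }
    }
    return "", false, false
}

func main() {
    data, err := os.ReadFile("/proc/self/mountinfo")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, line := range strings.Split(string(data), "\n") {
        if mp, shared, ok := propagation(line); ok {
            fmt.Printf("%-30s shared=%v\n", mp, shared)
        }
    }
}
```

If / shows up as shared on the host, that is the situation where mounts from a new mount namespace could still leak back without the extra remount.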

We can still access the container processes, though. Run sleep 200 on the container, and on the host, run ps -C sleep to get its PID. We can then cat /proc/<pid>/mounts to see the mount info:

ran sleep 200 on container

see mount info of process on host
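The same check can be scripted. Below is a minimal sketch that reads /proc/<pid>/mounts and prints only the proc mounts; the mount struct and parseMounts are hypothetical names of mine. Pass the PID found with ps -C sleep as the first argument, or omit it to inspect the current process.

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// mount holds the first four columns of a /proc/<pid>/mounts line.
type mount struct {
    Source, Target, FSType, Options string
}

// parseMounts (hypothetical helper) parses the whitespace-separated lines of
// a mounts file and skips lines that have fewer than four columns.
func parseMounts(data string) []mount {
    var out []mount
    for _, line := range strings.Split(data, "\n") {
        f := strings.Fields(line)
        if len(f) < 4 {
            continue
        }
        out = append(out, mount{f[0], f[1], f[2], f[3]})
    }
    return out
}

func main() {
    pid := "self"
    if len(os.Args) > 1 {
        pid = os.Args[1]
    }
    data, err := os.ReadFile("/proc/" + pid + "/mounts")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for _, m := range parseMounts(string(data)) {
        if m.FSType == "proc" {
            fmt.Printf("%s mounted at %s (%s)\n", m.Source, m.Target, m.Options)
        }
    }
}
```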

Resource Control and Limits

We need to have control over our containers and their resources for better isolation. In Linux, this is done using cgroups.

Cgroups (Control Groups) also use a pseudo-filesystem to manage the configuration and properties that we want the kernel to enforce. We can control the resources a process can use, including CPU, memory, disk I/O, network, and device permissions.

First, we will explore the cgroups of our host. Move to /sys/fs/cgroup and run ls:

output of ls in cgroups

You can see the different kinds of cgroup settings that we can set up for control.

The image shows the cgroup hierarchy in /sys/fs/cgroup. Each directory represents a control group where we can set limits.

Let's see a demo of how cgroups work. Run a Docker container in another shell using docker run --rm -it ubuntu /bin/bash. You will get the container ID. If we check /sys/fs/cgroup/system.slice/docker-<container-id>.scope/ on the host, we can see the newly created control group for that container:

docker container's cgroups

Let's check the current container's memory limits using cat memory.max. It will be unrestricted (max) by default.

docker memory limit max

If we instead run a restricted container using docker run --rm -it --memory=10M ubuntu /bin/bash and check this new container's limits, we can see the set values:

restricted container max memory by cgroup
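Limit files such as memory.max hold either the literal word max or a number, so reading them from code needs one small special case. Here is a sketch; parseLimit is a hypothetical helper of mine, and the path to the memory.max file must be supplied as the first argument.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// parseLimit (hypothetical helper) interprets the contents of a cgroup limit
// file: "max" means unlimited, anything else must be a decimal number.
func parseLimit(raw string) (int64, bool, error) {
    s := strings.TrimSpace(raw)
    if s == "max" {
        return 0, true, nil
    }
    limit, err := strconv.ParseInt(s, 10, 64)
    if err != nil {
        return 0, false, err
    }
    return limit, false, nil
}

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "usage: <path to memory.max>")
        os.Exit(1)
    }
    data, err := os.ReadFile(os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    limit, unlimited, err := parseLimit(string(data))
    switch {
    case err != nil:
        fmt.Fprintln(os.Stderr, "cannot parse limit:", err)
    case unlimited:
        fmt.Println("memory is unrestricted")
    default:
        fmt.Printf("memory limit: %d bytes\n", limit)
    }
}
```

For the --memory=10M container, the file should contain 10485760, which is 10 × 1024 × 1024 bytes.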

Limiting Processes in Our Container

For our container, we will use cgroups to limit the number of processes. We define a cg() (control group) function as:

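// cg needs "path/filepath" and "strconv" added to the import block.
func cg() {
    cgroupName := "mytestcgroup"
    cgroups := "/sys/fs/cgroup/"
    myCgroupPath := filepath.Join(cgroups, cgroupName)

    // Creating the directory is what creates the cgroup.
    err := os.Mkdir(myCgroupPath, 0755)
    if err != nil && !os.IsExist(err) {
        panic(err)
    }

    // Allow at most 10 processes in this cgroup.
    must(os.WriteFile(filepath.Join(myCgroupPath, "pids.max"), []byte("10"), 0700))
    // Move the current process into the cgroup; its children inherit membership.
    must(os.WriteFile(filepath.Join(myCgroupPath, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700))
}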

func must(err error) {
    if err != nil {
        panic(err)
    }
}

When we create the directory mytestcgroup inside /sys/fs/cgroup, the kernel creates a new cgroup and automatically populates the directory with its control files (pids.max, cgroup.procs, and so on).

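The cg() function creates a new cgroup directory, writes a limit of 10 to pids.max to restrict the number of processes, and finally writes os.Getpid() to cgroup.procs to move the calling process into this cgroup. Child processes inherit cgroup membership, so if cg() is called at the start of child(), before the Chroot call (the /sys/fs/cgroup path has to be resolved against the host filesystem), the user's command and everything it spawns count against the same limit.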
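If you want to experiment with this logic without root access, the path and the limit can be turned into parameters. Below is a sketch under that assumption; writeCgroup and demo are hypothetical names of mine. The demo runs against a throwaway temporary directory, so the files it writes are ordinary files and nothing is actually limited. Only a directory under /sys/fs/cgroup is backed by the kernel.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
)

// writeCgroup (hypothetical helper) creates dir and writes the process limit
// and the PID into the files a cgroup v2 directory would expose.
func writeCgroup(dir string, maxPids, pid int) error {
    if err := os.MkdirAll(dir, 0o755); err != nil {
        return err
    }
    limit := []byte(strconv.Itoa(maxPids))
    if err := os.WriteFile(filepath.Join(dir, "pids.max"), limit, 0o644); err != nil {
        return err
    }
    return os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0o644)
}

// demo exercises writeCgroup in a temporary directory and returns what ended
// up in pids.max. The directory is removed again before demo returns.
func demo() (string, error) {
    tmp, err := os.MkdirTemp("", "cg-demo")
    if err != nil {
        return "", err
    }
    defer os.RemoveAll(tmp)

    dir := filepath.Join(tmp, "mytestcgroup")
    if err := writeCgroup(dir, 10, os.Getpid()); err != nil {
        return "", err
    }
    data, err := os.ReadFile(filepath.Join(dir, "pids.max"))
    return string(data), err
}

func main() {
    got, err := demo()
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("pids.max contains:", got)
}
```

Returning errors instead of panicking also lets the caller decide whether a missing cgroup should stop the container from starting.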

Run the container, go into the cgroup directory, and check the pids.max file. It will show the limit we set.

container ran

pids.max limit is set

Moment of truth! Now let's run a command like sleep 100 on the container. Then run ps -C sleep on the host to find its PID, and cat mytestcgroup/cgroup.procs to see that this PID is listed there, which means the process is governed by the cgroup we just created.

control group controls that pid

Let's check this process limit by launching 25 sleep processes at once using for i in {1..25}; do sleep 60 & done.

We can even try a fork bomb (:() { : | : & }; :) confidently in our container.

fork bomb

We can see it is limited to a maximum of 10 processes using pids.current and ps fax.

pids.current

Conclusion

We started with a simple Go process and, step by step, transformed it into an isolated environment. We used:

Namespaces to hide the host's details (Hostname, PIDs, Mounts).
Chroot to swap the reality of the filesystem.
Cgroups to enforce strict resource limits.

This exercise proves that a "container" doesn't exist in the way a virtual machine does. There is no hypervisor. What we call a container is simply a standard Linux process with a very specific configuration of isolation rules.

Of course, production runtimes like Docker or runc handle much more, such as network bridges, layered filesystems (OverlayFS), and security profiles (seccomp/AppArmor). But the core logic? You just built it.


PS - Get the source code from here
