Deep dive into Linux containers in Docker

## Deep dive into
# Linux containers
## in Docker

## Outline

* What is a Linux container?
* The building blocks
  - Namespaces
  - Cgroups
  - Layered filesystem
  - User-space tools
* Summary

## What is a Linux container?

### High level overview:

* It's like a lightweight VM

### Low level overview:

* It's `chroot` on steroids
  - Uses the host kernel
  - Processes are visible on the host machine
* Limitations:
  - Can't boot a different OS
  - Can't have its own kernel modules

A container is just a bunch of processes configured to run in isolation.

## The building blocks

### Namespaces

### Namespaces

* Provide processes with their own view of the system
* There are multiple namespaces:
  - pid
  - net
  - mnt
* Each process is in one namespace of each type

Note:
Other namespaces: uts, ipc, user

### PID Namespace

* A process can only see processes in the same namespace
* Each namespace has its own numbering, starting at 1
* When PID 1 dies, the whole namespace is killed
* Namespaces can be nested
* A process has one PID in each namespace it is nested in

Note:
```
# T1 (first terminal): start a container, then run these inside it
docker run -it epd:os
ps -ax
vim

# T2 (second terminal, on the host): show the process tree with PID namespaces
ps a -o pidns,pid,cmd --forest
kill -9 [PID]
nsenter -p -t [PID] [CMD]
```
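
The "one PID per nested namespace" point can be observed in the `NSpid:` line of `/proc/<pid>/status` (Linux 4.1 and later), which lists a process's PID in each PID namespace, outermost first. The following is a minimal sketch; the function name `ns_pids` and its optional file argument are illustrative, not part of any existing tool.

```shell
# ns_pids [STATUS_FILE]
# Print the PIDs a process has in each nested PID namespace, outermost first.
# STATUS_FILE defaults to the status file of the current shell.
ns_pids() {
    awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }' "${1:-/proc/$$/status}"
}
```

Run on the host against a containerized process, for example `ns_pids /proc/1234/status` with a hypothetical PID 1234, this would print the host PID followed by the PID the process has inside the container.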

### NET Namespace

* Each network namespace has its own network stack:
  * network interfaces (including lo)
  * routing tables
  * iptables rules
  * sockets

Note:
A network interface can be in only one namespace at a time

### NET Namespace

* Typical use cases:
  * disable networking (`docker run --network=none`)
  * veth pairs: two virtual interfaces acting as a crossover cable
  * all the vethXXX interfaces bridged together (docker0)
  * reuse the network stack of another container

Note:
```
# Create a veth pair (veth0, veth1) and two named network namespaces
ip link add type veth
ip netns add epd0
ip netns add epd1

# Move one end of the pair into each namespace
ip link set veth0 netns epd0
ip link set veth1 netns epd1

# First terminal: enter epd0, configure veth0 and listen on port 4444
nsenter --net=/run/netns/epd0 /bin/bash
ip link set veth0 up
ip addr add 10.1.1.1/24 dev veth0
nc -l 4444
ip link delete veth0

# Second terminal: enter epd1, configure veth1 and connect to the listener
nsenter --net=/run/netns/epd1 /bin/bash
ip link set veth1 up
ip addr add 10.1.1.2/24 dev veth1
nc 10.1.1.1 4444
ip link delete veth1

# Clean up the namespaces
ip netns delete epd0
ip netns delete epd1
```

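### MNT Namespace

Set of filesystem mounts that are visible to a process:

* Change the filesystem root - "/"
* Private mounts
  * scoped /tmp
  * masking of /proc, /sys
* Shared mounts
* Hard to pass a mount from one namespace to another

Note:
```
# Start a shell in a new mount namespace
unshare -f -m /bin/bash
# Bind-mount a private directory over /tmp; only this namespace sees it
mount -o bind private_tmp /tmp
# Leaving the shell destroys the namespace and its mounts
exit
```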

### Namespace manipulation

* a namespace is created by `clone()` with extra flags (exposed by `unshare`)
* a process "enters" a namespace with `setns()` (exposed by `nsenter`)
* namespaces are materialized by pseudo-files in `/proc/<pid>/ns`
* a namespace is destroyed when its last process exits
* a namespace can be preserved by bind-mounting its pseudo-file
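
The pseudo-files under `/proc/<pid>/ns` are symbolic links whose target names the namespace, so two processes share a namespace exactly when the link targets match. A small sketch of that check follows; the helper names `ns_id` and `same_ns` are made up for this example.

```shell
# ns_id PID TYPE
# Print the namespace identifier of PID for TYPE (pid, net, mnt, uts, ipc, user),
# in the form "pid:[<inode number>]". Inspecting another user's process needs root.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE
# Succeed (exit status 0) when both processes are in the same namespace of TYPE.
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}
```

For example, `same_ns 1 $$ net && echo shared` reports whether the current shell uses the same network stack as PID 1.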

### Layered filesystem

* Instant creation of a new container
  - Without copying its image filesystem
* Easy creation of new images
  - An image is basically a snapshot of a container's filesystem at some point in time
* An image is immutable, while a container can modify data
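
To make the "immutable image, writable container" idea concrete, here is a plain-shell simulation of a two-layer stack. It is only a sketch of the copy-on-write principle, not how OverlayFS or any other driver is implemented: whiteouts (deleted files) and directory merging are ignored, and the function names are invented for this example.

```shell
# overlay_lookup LOWER UPPER NAME
# Print the path that backs NAME: the upper (writable) layer wins,
# otherwise the lower (read-only) layer is used.
overlay_lookup() {
    if [ -e "$2/$3" ]; then
        printf '%s\n' "$2/$3"
    elif [ -e "$1/$3" ]; then
        printf '%s\n' "$1/$3"
    else
        return 1
    fi
}

# overlay_write LOWER UPPER NAME CONTENT
# Append CONTENT to NAME. The first write copies the file up from the lower
# layer; every change then lands in the upper layer only.
overlay_write() {
    if [ ! -e "$2/$3" ] && [ -e "$1/$3" ]; then
        cp "$1/$3" "$2/$3"      # copy-up on first modification
    fi
    printf '%s\n' "$4" >> "$2/$3"
}
```

The lower directory plays the role of the image and the upper directory the role of the container: after any number of writes the lower directory is unchanged, which is why many containers can share one image.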

### Layered filesystem

* Available options:
  - OverlayFS (file level)
  - Device mapper (block level)
  - btrfs, zfs (FS level)

Note:
```
cd /tmp
mkdir lower upper workdir overlay
# Continuation lines are not indented so the option string stays unbroken
mount -t overlay -o \
lowerdir=/tmp/lower,\
upperdir=/tmp/upper,\
workdir=/tmp/workdir \
none /tmp/overlay
```

### Control groups

### Control groups

* Resource metering and limiting
  - memory
  - CPU
  - block I/O
  - network

### Control groups

* Each subsystem has a hierarchy tree
* Hierarchies are independent
* Each process belongs to exactly 1 node in each hierarchy
* PID 1 is placed at the root of each hierarchy
* When a process is created, it is placed in the same groups as its parent
* Groups are materialized in a pseudo-fs (/sys/fs/cgroup)

Note:
`systemd-cgls`
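
Group membership can be read from `/proc/<pid>/cgroup`, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. On a system using the unified cgroup v2 hierarchy there is a single line with an empty controller list, such as `0::/some/path`. The function below is an illustrative sketch; the name `cgroup_of` and its arguments are assumptions made for this example.

```shell
# cgroup_of CONTROLLER [CGROUP_FILE]
# Print the cgroup path a process belongs to for CONTROLLER (e.g. memory, cpu).
# Pass an empty CONTROLLER ("") to get the cgroup v2 unified path.
# CGROUP_FILE defaults to /proc/self/cgroup.
cgroup_of() {
    awk -F: -v want="$1" '
        {
            path = $0
            sub(/^[^:]*:[^:]*:/, "", path)   # keep everything after the second colon
            if ($2 == want) { print path; exit }
            n = split($2, ctrl, ",")         # v1 lines may list several controllers
            for (i = 1; i <= n; i++)
                if (ctrl[i] == want) { print path; exit }
        }
    ' "${2:-/proc/self/cgroup}"
}
```

For example, `cgroup_of memory /proc/1234/cgroup` (hypothetical PID) prints the node of the memory hierarchy that process 1234 sits in.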

### memory cgroup

* Keeps track of pages used by each group
* Each group can have hard and soft limits
* Hard limits will trigger a per-group customizable OOM killer
* Limits can be set for physical, kernel and total memory
* Pages can be shared across multiple groups
* When pages are shared, the groups "split the bill"

Note:
OOM killer:
- freeze all processes in the group
- notify user space
- kill processes
- etc...

Counter overhead: accounting is enabled or disabled at boot time

Pages are shared when multiple processes read from the same file

Cost sharing can cause an OOM when a process leaves a group
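
A group's hard limit is exposed as a file in its cgroup directory: `memory.max` on cgroup v2 and `memory.limit_in_bytes` on cgroup v1. The sketch below reads whichever exists; the function name `mem_hard_limit` is illustrative, and the caller is assumed to pass the path of the group's directory.

```shell
# mem_hard_limit CGROUP_DIR
# Print the hard memory limit of a group: a byte count, or "max" on
# cgroup v2 when no limit is set. Fails if neither interface file exists.
mem_hard_limit() {
    if [ -r "$1/memory.max" ]; then
        cat "$1/memory.max"              # cgroup v2
    elif [ -r "$1/memory.limit_in_bytes" ]; then
        cat "$1/memory.limit_in_bytes"   # cgroup v1
    else
        return 1
    fi
}
```

The directory to pass depends on the host layout, so it is left to the caller rather than guessed here.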

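### cpu cgroup

* Keeps track of user/system CPU time
* Allows setting weights
* Weights can't express hard CPU limits (capping CPU time needs a separate CFS quota per period)

Note:
1. Say you give N% of the CPU to a group
1. What if the CPU throttles to a lower clock speed?
1. The same problem applies to a time slot
1. Count instructions? Their speed varies wildly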
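
Weights are relative: when every group is busy, a group's share of CPU time is its weight divided by the sum of the weights of all competing siblings. A worked example of that arithmetic follows, using integer percentages; the function name `cpu_share_percent` is invented for this sketch.

```shell
# cpu_share_percent MINE [OTHER...]
# Given the weight of one group followed by the weights of its competing
# siblings, print the percentage of CPU time (rounded down) the group gets
# when all of them are busy.
cpu_share_percent() {
    mine=$1
    total=0
    for w in "$@"; do            # "$@" includes the group's own weight
        total=$((total + w))
    done
    echo $((100 * mine / total))
}
```

With weights 1024 and 512, `cpu_share_percent 1024 512` prints `66` and `cpu_share_percent 512 1024` prints `33`. The split only matters under contention: when one group is idle, the other can use the spare CPU time.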

### cpuset cgroup

* Pin groups to specific CPU(s)
* Reserve CPUs for specific apps
* Avoid processes bouncing between CPUs
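
The effect of pinning is visible in the `Cpus_allowed_list:` field of `/proc/<pid>/status`, which uses the kernel's list notation such as `0-2,5`. The helpers below are a sketch for reading it; the names `expand_cpu_list` and `allowed_cpus` are made up for this example, and `seq` is assumed to be available.

```shell
# expand_cpu_list LIST
# Expand a kernel CPU list such as "0-2,5" into one CPU number per line.
expand_cpu_list() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        [ -n "$first" ] || continue
        seq "$first" "${last:-$first}"   # a single CPU is a range of length one
    done
}

# allowed_cpus [STATUS_FILE]
# Print the CPUs a process may run on, read from its Cpus_allowed_list field.
# STATUS_FILE defaults to the status file of the current shell.
allowed_cpus() {
    expand_cpu_list "$(awk '$1 == "Cpus_allowed_list:" { print $2 }' "${1:-/proc/$$/status}")"
}
```

For a process pinned to the first two CPUs, `allowed_cpus` would print `0` and `1` on separate lines.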

### blkio cgroup

* Keeps track of I/O for each group
* Set throttles (limits) for each group
* Set relative weights for each group

### User-space

### User-space

* libcontainer/runc
* Ecosystem
  - CLI tools
  - Registries
  - Orchestration
  - etc.
* Lots of educational material elsewhere

### Summary

* A container is just a bunch of processes configured to run in isolation
* Even when you don't run containers... you are in a container
* Each and every process is executed in namespaces and cgroups
* Namespaces = isolation
* Cgroups = resource metering and limiting

### Summary

* Docker = user-space tools + ecosystem
* Docker did not invent containers
  - see LXC and others...
* Docker made containers easy to use
  - for both developers and admins

### Sources:

* [Jérôme Petazzoni: Anatomy of a Container](https://fr.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon)
* [Kyle Olivo: How Linux containers work](http://kyleolivo.com/dev/2016/08/15/containers-how-do-they-work/)
* [Eric Chiang: Containers from Scratch](https://ericchiang.github.io/post/containers-from-scratch/)
* Docker & Linux documentation
* ...many other blog posts