## Deep dive into
# Linux containers
## in Docker
## Outline
* What is Linux container?
* The building blocks
- Namepsaces
- Cgroups
- Layered filesystem
- User-space tools
* Summary
## What is Linux container?
### High level overview:
* It's lightweight VM
### Low level overview:
* It's `chroot` on steroids
- Uses host kernel
- Processes are visible on the host machine
* Limitations:
- Can't boot a different OS
- Can't have it's own kernel modules
Container is just a bunch of processes configured to run in isolation.
### Namespaces
* Provide processes with their own view of the system
* There are multiple namespaces:
- pid
- net
- mnt
* Each process is in one namespace of each type
Note:
Other namespaces: uts, ipc, user
### PID Namespace
* Process can only see processes in the same namespace
* Each namespace has own numbering starting at 1
* When PID 1 dies, the whole namespace is killed
* Namespaces can be nested
* Process has one PID per namespace in which is nested
Note:
```
// T1:
docker run -it epd:os
ps -ax
vim
// T2: Show process tree
ps a -o pidns,pid,cmd --forest
kill -9 [PID]
nsenter -p -t [PID] [CMD]
```
### NET Namespace
* Each network namespace has own network stack:
* network interfaces (including lo)
* routing tables
* iptables rules
* sockets
Note:
Network interface can be only in one namespace at a time
### NET Namespace
* Typical use-case:
* disable networking (docker run --network=none)
* veth pairs - two virtual interfaces acting as crossover cable
* all the vethXXX bridged together (docker0)
* reuse network stack of other container
Note:
```
ip link add type veth
ip netns add epd0
ip netns add epd1
ip link set veth0 netns epd0
ip link set veth1 netns epd1
nsenter --net=/run/netns/epd0 /bin/bash
ip link set veth0 up
ip addr add 10.1.1.1/24 dev veth0
nc -l 4444
ip link delete veth0
nsenter --net=/run/netns/epd1 /bin/bash
ip link set veth1 up
ip addr add 10.1.1.2/24 dev veth1
nc 10.1.1.1 4444
ip link delete veth1
ip netns delete epd0
ip netns delete epd1
```
### MNT Namespace
Set of filesystem mounts that are visible to a process:
* Change filesystem root - "/"
* Private mounts
* scoped /tmp
* masking of /proc, /sys
* Shared mounts
* Hard to pass mount from a namespace to another
Note:
```
unshare -f -m /bin/bash
mount -o bind private_tmp /tmp
exit
```
### Namepsace manipulation
* create ns by clone() with extra flags
- exposed by unshare
* "enter" ns with setns()
- exposed by nsenter
* materialized by pseudo-files in /proc/< pid >/ns
* destroyed when last porcess of ns exits
* preserve ns by bind-mounting pseudo fs
### Layered filesystem
* Instant creation of new container
- Without copying it's image filesystem
* Easy creation of new images
- It's basically a snapshot of container FS at some time
* Image is immutable, while container can modify data
### Layered filesystem
* Available options:
- OverlayFS (file level)
- Device mapper (block level)
- btrfs, zfs (FS level)
Note:
```
cd /tmp
mkdir lower upper workdir overlay
mount -t overlay -o \
lowerdir=/tmp/lower,\
upperdir=/tmp/upper,\
workdir=/tmp/workdir \
none /tmp/overlay
```
### Control groups
* Resource metering and limiting
- memory
- CPU
- block I/O
- network
### Control groups
* Each subsystem has a hierarchy tree
* Hierarchies are independent
* Each proces belongs to exaclty 1 node in each hierarhcy
* PID 1 is placed at the root of each hierarchy
* When process is created, it is palced in the same groups as its parent
* Groups are materialized in pseudo-fs (/sys/fs/cgroup)
Note:
`systemd-cgls`
### memory cgroup
* Keeps track of pages used by each group
* Each group can have hard and soft limits
* Hard limits will trigger a per-group customizable OOM killer
* Limits can be set for physical, kernel and total memory
* Pages can be shared across multiple groups
* When pages are shared, the groups "split the bill"
Note:
OOM killer:
- freeze all processes in group
- notify user space
- kill processes
- etc...
Couter overhead - enabled/disabled on boot time
Sharing when multiple processes reads from the same file
Cost sharing can cause OOM when process leavs a group
### cpu cgroup
* Keeps track of user/system CPU time
* Allows to set weights
* Can't set CPU limits
Note:
1. You give N% to group
1. CPU throttles to lower clock speed?
1. Same with time slot
1. instructions? their speed varies wildly
### cpuset cgroup
* Pin groups to specific CPU(s)
* Reserve CPU for specific apps
* Avoid processes bouncing between CPUs
### blkio cgroup
* Keeps track of I/O for each group
* Set throttle (limits) for each group
* Set relative weights for each group
### User-space
* libcontainer/runc
* Ecosystem
- CLI tools,
- Registers,
- Orchestration
- etc.
* Lot of educational materials elsewhere
### Summary
* Container is just a bunch of processes configured to run in isolation
* Even when you don't run containers... you are in a container
* Each and every process is executed in namespaces and cgroups
* Namespaces = isolation
* Cgroups = resource metering and limiting
### Summary
* Docker = user space tools + ecosystem
* Docker did not invented containers
- see LXC and others...
* Docker made containers easy to use
- for both developers and admins
### Sources:
* [Jérôme Petazzoni: Anatomy of a Container](https://fr.slideshare.net/jpetazzo/anatomy-of-a-container-namespaces-cgroups-some-filesystem-magic-linuxcon)
* [Kyle Olivo: How Linux containers work](http://kyleolivo.com/dev/2016/08/15/containers-how-do-they-work/)
* [Eric Chiang: Containers from Scratch](https://ericchiang.github.io/post/containers-from-scratch/)
* Docker & Linux documentation
* ...many other blog posts