Nytro, posted January 16, 2017

Containers from Scratch

Posted on January 7, 2017

This is a write-up of a talk I gave at CAT BarCamp, an awesome unconference at Portland State University. The talk started with the self-imposed challenge “give an intro to containers without Docker or rkt.”

Often thought of as cheap VMs, containers are just isolated groups of processes running on a single host. That isolation leverages several underlying technologies built into the Linux kernel: namespaces, cgroups, chroots and lots of terms you’ve probably heard before.

So, let’s have a little fun and use those underlying technologies to build our own containers.

On today’s agenda:

- setting up a file system
- chroot
- unshare
- nsenter
- bind mounts
- cgroups
- capabilities

Container file systems

Container images, the thing you download from the internet, are literally just tarballs (or tarballs in tarballs if you’re fancy). The least magical part of a container is the set of files you interact with.

For this post I’ve built a simple tarball by stripping down a Docker image. The tarball holds something that looks like a Debian file system and will be our playground for isolating processes.

    $ wget https://github.com/ericchiang/containers-from-scratch/releases/download/v0.1.0/rootfs.tar.gz
    $ sha256sum rootfs.tar.gz
    c79bfb46b9cf842055761a49161831aee8f4e667ad9e84ab57ab324a49bc828c  rootfs.tar.gz

First, explode the tarball and poke around.

    $ # tar needs sudo to create /dev files and set up file ownership
    $ sudo tar -zxf rootfs.tar.gz
    $ ls rootfs
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    $ ls -al rootfs/bin/ls
    -rwxr-xr-x. 1 root root 118280 Mar 14  2015 rootfs/bin/ls

The resulting directory looks an awful lot like a Linux system. There’s a bin directory with executables, an etc with system configuration, a lib with shared libraries, and so on.

Actually building this tarball is an interesting topic, but one we’ll be glossing over here.
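As a side note on the download step above, the checksum comparison and a quick look inside the archive can also be scripted. The following is only a minimal sketch in Python 3 (an assumption; the post itself uses Python 2 inside the rootfs). The helper names `sha256_of` and `entries_under` are made up for illustration, and the sketch assumes the archive unpacks into a single `rootfs/` directory, as the `ls rootfs` output suggests.

```python
import hashlib
import tarfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def entries_under(path, root="rootfs"):
    """Return the sorted names found directly under `root` in a .tar.gz archive."""
    names = set()
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            # Drop empty and "." components so "./rootfs/bin" and "rootfs/bin" match.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) >= 2 and parts[0] == root:
                names.add(parts[1])
    return sorted(names)


if __name__ == "__main__":
    # Digest copied from the sha256sum output in the post.
    expected = "c79bfb46b9cf842055761a49161831aee8f4e667ad9e84ab57ab324a49bc828c"
    actual = sha256_of("rootfs.tar.gz")
    print("checksum ok" if actual == expected else "checksum MISMATCH: " + actual)
    print(entries_under("rootfs.tar.gz"))
```

Listing the archive this way needs no sudo, since nothing is extracted; only the real `tar -zxf` step has to create device files and set ownership.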
For an overview of how such images are built, I’d strongly recommend the excellent talk “Minimal Containers” by my coworker Brian Redbeard.

chroot

The first tool we’ll be working with is chroot. A thin wrapper around the similarly named syscall, it allows us to restrict a process’s view of the file system. In this case, we’ll restrict our process to the “rootfs” directory, then exec a shell.

Once we’re in there we can poke around, run commands, and do typical shell things.

    $ sudo chroot rootfs /bin/bash
    root@localhost:/# ls /
    bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
    boot  etc  lib   media  opt  root  sbin  sys  usr
    root@localhost:/# which python
    /usr/bin/python
    root@localhost:/# /usr/bin/python -c 'print "Hello, container world!"'
    Hello, container world!
    root@localhost:/#

It’s worth noting that this works because of all the things baked into the tarball. When we execute the Python interpreter, we’re executing rootfs/usr/bin/python, not the host’s Python. That interpreter depends on shared libraries and device files that have been intentionally included in the archive.

Speaking of applications, instead of a shell we can run one in our chroot.

    $ sudo chroot rootfs python -m SimpleHTTPServer
    Serving HTTP on 0.0.0.0 port 8000 ...

If you’re following along at home, you’ll be able to view everything the file server can see at http://localhost:8000/.

Creating namespaces with unshare

How isolated is this chrooted process? Let’s run a command on the host in another terminal.

    $ # outside of the chroot
    $ top

Sure enough, we can see the top invocation from inside the chroot.

    $ sudo chroot rootfs /bin/bash
    root@localhost:/# mount -t proc proc /proc
    root@localhost:/# ps aux | grep top
    1000     24753  0.1  0.0 156636  4404 ?        S+   22:28   0:00 top
    root     24764  0.0  0.0  11132   948 ?        S+   22:29   0:00 grep top

Better yet, our chrooted shell is running as root, so it has no problem killing the top process.

    root@localhost:/# pkill top

So much for containment.

This is where we get to talk about namespaces.
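The reason the chrooted shell can see the host’s top process is that ps simply reads whatever /proc exposes. The sketch below illustrates this by listing every PID visible under a proc mount. It assumes Python 3 (the rootfs in the post ships Python 2, so run it on the host or wherever a Python 3 interpreter is available), and `visible_processes` is a made-up helper name.

```python
import os


def visible_processes(proc_root="/proc"):
    """Map each PID visible under proc_root to its command name."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "sys" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                found[int(entry)] = f.read().strip()
        except OSError:
            continue  # the process exited (or has no comm file) in the meantime
    return found


if __name__ == "__main__":
    for pid, name in sorted(visible_processes().items()):
        print(pid, name)
```

Run against a /proc mounted from inside a plain chroot, a listing like this would include host processes such as top, because chroot changes the file system root but not the process tree the kernel reports.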
Namespaces allow us to create restricted views of systems like the process tree, network interfaces, and mounts.

Creating a namespace is super easy: just a single syscall with one argument, unshare. The unshare command line tool gives us a nice wrapper around this syscall and lets us set up namespaces manually. In this case, we’ll create a PID namespace for the shell, then execute the chroot like the last example.

    $ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc \
        chroot rootfs /bin/bash
    root@localhost:/# ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         1  0.0  0.0  20268  3240 ?        S    22:34   0:00 /bin/bash
    root         2  0.0  0.0  17504  2096 ?        R+   22:34   0:00 ps aux
    root@localhost:/#

Having created a new process namespace, poking around our chroot we’ll notice something a bit funny. Our shell thinks its PID is 1?! What’s more, we can’t see the host’s process tree anymore.

Entering namespaces with nsenter

A powerful aspect of namespaces is their composability; processes may choose to separate some namespaces but share others. For instance, it may be useful for two programs to have isolated PID namespaces but share a network namespace (e.g. Kubernetes pods). This brings us to the setns syscall and the nsenter command line tool.

Let’s find the shell running in a chroot from our last example.

    $ # From the host, not the chroot.
    $ ps aux | grep /bin/bash | grep root
    ...
    root     29840  0.0  0.0  20272  3064 pts/5    S+   17:25   0:00 /bin/bash

The kernel exposes namespaces under /proc/(PID)/ns as files. In this case, /proc/29840/ns/pid is the process namespace we’re hoping to join.

    $ sudo ls -l /proc/29840/ns
    total 0
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 ipc -> 'ipc:[4026531839]'
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 mnt -> 'mnt:[4026532434]'
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 net -> 'net:[4026531969]'
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 pid -> 'pid:[4026532446]'
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 user -> 'user:[4026531837]'
    lrwxrwxrwx. 1 root root 0 Oct 15 17:31 uts -> 'uts:[4026531838]'

The nsenter command provides a wrapper around setns to enter a namespace. We’ll provide the namespace file, then run unshare to remount /proc and chroot to set up a chroot. This time, instead of creating a new namespace, our shell will join the existing one.

    $ sudo nsenter --pid=/proc/29840/ns/pid \
        unshare -f --mount-proc=$PWD/rootfs/proc \
        chroot rootfs /bin/bash
    root@localhost:/# ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         1  0.0  0.0  20272  3064 ?        S+   00:25   0:00 /bin/bash
    root         5  0.0  0.0  20276  3248 ?        S    00:29   0:00 /bin/bash
    root         6  0.0  0.0  17504  1984 ?        R+   00:30   0:00 ps aux

Having entered the namespace successfully, when we run ps in the second shell (PID 5) we see the first shell (PID 1).

Getting around chroot with mounts

When deploying an “immutable” container it often becomes important to inject files or directories into the chroot, either for storage or configuration. For this example, we’ll create some files on the host, then expose them read-only to the chrooted shell using mount.

First, let’s make a new directory to mount into the chroot and create a file there. (The directory is created without sudo so that the unprivileged shell redirect on the next line can write into it.)

    $ mkdir readonlyfiles
    $ echo "hello" > readonlyfiles/hi.txt

Next, we’ll create a target directory in our container and bind mount the directory, providing the -o ro argument to make it read-only. If you’ve never seen a bind mount before, think of this like a symlink on steroids.

    $ sudo mkdir -p rootfs/var/readonlyfiles
    $ sudo mount --bind -o ro $PWD/readonlyfiles $PWD/rootfs/var/readonlyfiles

The chrooted process can now see the mounted files.

    $ sudo chroot rootfs /bin/bash
    root@localhost:/# cat /var/readonlyfiles/hi.txt
    hello

However, it can’t write to them.

    root@localhost:/# echo "bye" > /var/readonlyfiles/hi.txt
    bash: /var/readonlyfiles/hi.txt: Read-only file system

Though a pretty basic example, it can be expanded quickly to cover things like NFS or in-memory file systems by switching the arguments to mount.
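To confirm from a script that the bind mount above really ended up read-only, one option is to look the mount point up in /proc/mounts and check its options for `ro`. This is a rough sketch in Python 3 (an assumption, as are the helper names `mount_options` and `is_read_only`). It also ignores the octal escaping /proc/mounts applies to unusual characters, such as `\040` for a space, so it only suits simple paths like the ones in this post.

```python
import os


def mount_options(mount_point, mounts_text):
    """Return the option list for mount_point, or None if it is not mounted.

    mounts_text uses the /proc/mounts layout:
    device, mount point, fs type, options, dump, pass.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep scanning: a later mount on the same path shadows earlier ones.
            options = fields[3].split(",")
    return options


def is_read_only(mount_point, mounts_text):
    """True if mount_point is mounted and carries the 'ro' option."""
    options = mount_options(mount_point, mounts_text)
    return options is not None and "ro" in options


if __name__ == "__main__":
    # Path from the post, resolved relative to the current directory.
    target = os.path.realpath("rootfs/var/readonlyfiles")
    with open("/proc/mounts") as f:
        print(target, "read-only:", is_read_only(target, f.read()))
```

Parsing a string rather than opening /proc/mounts inside the helper keeps the logic easy to try against sample text on any machine.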
Use umount to remove the bind mount (rm won’t work).

    $ sudo umount $PWD/rootfs/var/readonlyfiles

cgroups

cgroups, short for control groups, allow kernel-imposed isolation on resources like memory and CPU. After all, what’s the point of isolating processes if they can still kill neighbors by hogging RAM?

The kernel exposes cgroups through the /sys/fs/cgroup directory. If your machine doesn’t have one you may have to mount the memory cgroup to follow along.

    $ ls /sys/fs/cgroup/
    blkio  cpuacct      cpuset   freezer  memory   net_cls,net_prio  perf_event  systemd
    cpu    cpu,cpuacct  devices  hugetlb  net_cls  net_prio          pids

For this example we’ll create a cgroup to restrict the memory of a process. Creating a cgroup is easy: just create a directory. In this case we’ll create a memory group called “demo”. Once created, the kernel fills the directory with files that can be used to configure the cgroup.

    $ sudo su
    # mkdir /sys/fs/cgroup/memory/demo
    # ls /sys/fs/cgroup/memory/demo/
    cgroup.clone_children               memory.memsw.failcnt
    cgroup.event_control                memory.memsw.limit_in_bytes
    cgroup.procs                        memory.memsw.max_usage_in_bytes
    memory.failcnt                      memory.memsw.usage_in_bytes
    memory.force_empty                  memory.move_charge_at_immigrate
    memory.kmem.failcnt                 memory.numa_stat
    memory.kmem.limit_in_bytes          memory.oom_control
    memory.kmem.max_usage_in_bytes      memory.pressure_level
    memory.kmem.slabinfo                memory.soft_limit_in_bytes
    memory.kmem.tcp.failcnt             memory.stat
    memory.kmem.tcp.limit_in_bytes      memory.swappiness
    memory.kmem.tcp.max_usage_in_bytes  memory.usage_in_bytes
    memory.kmem.tcp.usage_in_bytes      memory.use_hierarchy
    memory.kmem.usage_in_bytes          notify_on_release
    memory.limit_in_bytes               tasks
    memory.max_usage_in_bytes

To adjust a value we just have to write to the corresponding file. Let’s limit the cgroup to 100MB of memory and turn off swap.
    # echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
    # echo "0" > /sys/fs/cgroup/memory/demo/memory.swappiness

The tasks file is special: it contains the list of processes which are assigned to the cgroup. To join the cgroup we can write our own PID.

    # echo $$ > /sys/fs/cgroup/memory/demo/tasks

Finally we need a memory-hungry application. Save the following as hungry.py.

    # hungry.py (Python 2): keep appending random data until something stops us
    f = open("/dev/urandom", "r")
    data = ""

    i = 0
    while True:
        data += f.read(10000000)  # 10mb
        i += 1
        print "%dmb" % (i*10,)

If you’ve set up the cgroup correctly, this program won’t crash your computer.

    # python hungry.py
    10mb
    20mb
    30mb
    40mb
    50mb
    60mb
    70mb
    80mb
    Killed

If that didn’t crash your computer, congratulations!

cgroups can’t be removed until every process in the tasks file has exited or been reassigned to another group. Exit the shell and remove the directory with rmdir (don’t use rm -r).

    # exit
    exit
    $ sudo rmdir /sys/fs/cgroup/memory/demo

Container security and capabilities

Containers are extremely effective ways of running arbitrary code from the internet as root, and this is where the low overhead of containers hurts us. Containers are significantly easier to break out of than a VM. As a result, many technologies used to improve the security of containers, such as SELinux, seccomp, and capabilities, involve limiting the power of processes already running as root.

In this section we’ll be exploring Linux capabilities.

Consider the following Go program which attempts to listen on port 80.

    package main

    import (
        "fmt"
        "net"
        "os"
    )

    func main() {
        // Binding to a port below 1024 normally requires extra privileges.
        if _, err := net.Listen("tcp", ":80"); err != nil {
            // Report the failure on stderr and exit non-zero.
            fmt.Fprintln(os.Stderr, err)
            os.Exit(2)
        }
        fmt.Println("success")
    }

What happens when we compile and run this?

    $ go build -o listen listen.go
    $ ./listen
    listen tcp :80: bind: permission denied

Predictably this program fails; listening on port 80 requires permissions we don’t have. Of course we can just use sudo, but we’d like to give the binary just the one permission to listen on lower ports.
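For readers without a Go toolchain at hand, the same check can be approximated in a few lines of Python 3. This is a sketch rather than a port of the author’s program: `try_listen` is a made-up name, and the function returns a string instead of exiting so it is easy to experiment with different ports.

```python
import errno
import socket


def try_listen(port, host="0.0.0.0"):
    """Try to bind and listen on a TCP port; return "success" or the error."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return "success"
    except OSError as err:
        # An unprivileged bind to a low port typically fails with EACCES.
        name = errno.errorcode.get(err.errno, "UNKNOWN")
        return "%s: %s" % (name, err.strerror)
    finally:
        sock.close()


if __name__ == "__main__":
    print(try_listen(80))
```

Note that file capabilities set with setcap attach to a binary, so granting them to a script like this would mean granting them to the Python interpreter itself. That is one reason a small compiled program makes a cleaner demonstration here.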
Capabilities are a set of discrete powers that together make up everything root can do. This ranges from things like setting the system clock to killing arbitrary processes. In this case, CAP_NET_BIND_SERVICE allows executables to listen on lower ports.

We can grant the executable CAP_NET_BIND_SERVICE using the setcap command.

    $ sudo setcap cap_net_bind_service=+ep listen
    $ getcap listen
    listen = cap_net_bind_service+ep
    $ ./listen
    success

For things already running as root, like most containerized apps, we’re more interested in taking capabilities away than granting them. First let’s see all the powers our root shell has:

    $ sudo su
    # capsh --print
    Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37+ep
    Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,37
    Securebits: 00/0x0/1'b0
     secure-noroot: no (unlocked)
     secure-no-suid-fixup: no (unlocked)
     secure-keep-caps: no (unlocked)
    uid=0(root)
    gid=0(root)
    groups=0(root)

Yeah, that’s a lot of capabilities.
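Under the hood, a capability set like the one capsh prints is a bitmask, which the kernel also reports in hex on the CapEff line of /proc/(PID)/status. The sketch below decodes such a mask into names. It assumes Python 3, the helper names are made up, and the name table is simply the order of the capsh listing above, where bit 0 is cap_chown. Bits beyond the table are printed as bare numbers, much like the trailing "37" in that output.

```python
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            # Fall back to the bit number for capabilities not in the table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else str(bit))
        bit += 1
    return names


def effective_mask(status_text):
    """Extract the CapEff bitmask from the text of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print(",".join(decode_caps(effective_mask(f.read()))))
```

Run as an ordinary user this usually prints an empty line, since unprivileged processes normally hold no effective capabilities. Run from a root shell it should resemble the Current line of capsh --print.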
As an example, we’ll use capsh to drop a few capabilities including CAP_CHOWN. If things work as expected, our shell shouldn’t be able to modify file ownership despite being root.

    $ sudo capsh --drop=cap_chown,cap_setpcap,cap_setfcap,cap_sys_admin --chroot=$PWD/rootfs --
    root@localhost:/# whoami
    root
    root@localhost:/# chown nobody /bin/ls
    chown: changing ownership of '/bin/ls': Operation not permitted

Conventional wisdom still states that VM isolation is mandatory when running untrusted code. But security features like capabilities are important to protect against hacked applications running in containers.

Beyond more elaborate tools like seccomp, SELinux, and capabilities, applications running in containers generally benefit from the same kind of best practices as applications running outside of one. Know what you’re linking against, don’t run as root in your container, and update for known security issues in a timely fashion.

Conclusion

Containers aren’t magic. Anyone with a Linux machine can play around with them, and tools like Docker and rkt are just wrappers around things built into every modern kernel. No, you probably shouldn’t go and implement your own container runtime. But having a better understanding of these lower level technologies will help you work with these higher level tools (especially when debugging).

There’s a ton of topics I wasn’t able to cover today, networking and copy-on-write file systems probably being the biggest two. However, I hope this acts as a good starting point for anyone wanting to get their hands dirty. Happy hacking!

Source: https://ericchiang.github.io/post/containers-from-scratch//