
Contain It Yourself: Build Your Own Linux Container (LXC)

Creating Linux containers is all (well, mostly) about setting up boundaries for the guest system. In this article we’ll see how to put boundaries on disk, CPU and network.

Disk Isolation

When you start a VM, say in VirtualBox, it has its own disk. It cannot access any files on the host, at least not with conventional disk access methods. We want to do a similar thing in our container. There is a tool (backed by a system call of the same name) which can do this for us: chroot. Here is a one-line definition from Wikipedia:

A chroot on Unix operating systems is an operation that changes the apparent root directory for the current running process and its children.

But if you are going to change the ‘apparent’ root of a process, that directory has to contain all the files the process needs to run. Let’s start a basic shell in it using BusyBox. BusyBox combines tiny versions of many common UNIX utilities into a single small executable.

Install busybox:

host # apt-get install busybox

We’ll need to create a directory which is going to be the root of our container, and put busybox in it.

host # mkdir rootfs
host # cp --parents /bin/busybox rootfs
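
One caveat: copying just the binary only works if busybox is statically linked; a dynamically linked build would also need its shared libraries inside rootfs. You can check what you have with file:

host # file /bin/busybox

If it reports ‘dynamically linked’, install the busybox-static package (available on Debian and Ubuntu) and copy that binary instead.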

We can start our chroot container using this command:

host # chroot rootfs /bin/busybox sh

We can try some commands to break out of it:

guest # ls
bin
guest # pwd
/
guest # cd ..
guest # pwd
/
guest # ls
bin

Clearly, this shell cannot access any file outside the rootfs directory. But at the same time it is pretty useless; ideally you’d want to install an OS in your VM. Guess what, that’s not so tough. If you want an Ubuntu or Debian instance, you just need to use debootstrap and you have a basic Ubuntu/Debian installation ready to be chrooted. Other distros have similar tools, but I have not explored them much yet.
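
debootstrap itself is an ordinary package on Debian and Ubuntu, so install it first if you don’t have it:

host # apt-get install debootstrap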

To install Debian sid in the rootfs directory:

host # debootstrap sid rootfs

It’ll download and install Debian sid. There are multiple ways to make your chroot container “useful”, but that’s content for another blog ;)
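
As a small teaser, one common first step is mounting the special filesystems that many programs inside the chroot expect; a minimal sketch, run from the host before chrooting:

host # mount -t proc proc rootfs/proc
host # mount --bind /dev rootfs/dev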

CPU Control

So, now you have isolated the disk and installed an OS (well, almost). But this container can and will use all the CPU power our host has, and if it is supposed to behave like a VM, we should be able to control that. We can, with control groups, or cgroups.

Another line from Wikipedia: cgroups is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups.

We’ll need a test to verify that we are actually controlling the CPU of our chroot container. For that we need a process which guarantees CPU usage; ‘stress’ is a utility which can help us here. But first we’ll need to install it.

host # chroot rootfs

guest # apt-get install stress

guest # stress -c 2

Now running htop on the host will show two stress processes, each taking one of your CPUs to 100%:

host # htop

Cgroups can restrict this! To manage this with cgroups, you do three things:

  1. Create a cgroup.
  2. Assign resources to it.
  3. Add processes to the group.

If you add a process to a cgroup, all its child processes go into the same group. So, you don’t need to worry about any process trying to escape our cgroup container.

You might find it a little amusing that you use filesystem operations to configure cgroups; that’s because the interface is implemented as a virtual filesystem.

On most modern systems this filesystem is already mounted for you at boot (systemd, for instance, mounts a tmpfs at /sys/fs/cgroup with one subdirectory per subsystem). If it isn’t on yours, mount a tmpfs there and then mount the subsystem we need:

host # mount -t tmpfs cgroup_root /sys/fs/cgroup
host # mkdir /sys/fs/cgroup/cpuset
host # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Cgroups have multiple subsystems; an ls into /sys/fs/cgroup will show you the list of these. CPU can be controlled using the cpu and cpuset subsystems: the cpu subsystem controls how processes share CPU time, and the cpuset subsystem assigns individual CPUs and memory nodes to groups.
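
We won’t use the cpu subsystem here, but just to give a flavour: assuming it is mounted at /sys/fs/cgroup/cpu and your kernel has CFS bandwidth control, you could cap a group to roughly half a CPU with its quota knobs (myThrottledGroup is just an example name):

host # mkdir /sys/fs/cgroup/cpu/myThrottledGroup
host # echo 100000 > /sys/fs/cgroup/cpu/myThrottledGroup/cpu.cfs_period_us
host # echo 50000 > /sys/fs/cgroup/cpu/myThrottledGroup/cpu.cfs_quota_us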

So, we can configure cpuset to assign one of the two CPUs on the host to stress. As listed above, we’ll follow the three steps.

The first step was to create a group, which is exactly as hard as creating a directory:

host # mkdir /sys/fs/cgroup/cpuset/myFirstCgroup
host # ls /sys/fs/cgroup/cpuset/myFirstCgroup
cgroup.clone_children  cpuset.cpu_exclusive  cpuset.mem_hardwall     cpuset.memory_spread_page  cpuset.sched_load_balance        tasks
cgroup.event_control   cpuset.cpus           cpuset.memory_migrate   cpuset.memory_spread_slab  cpuset.sched_relax_domain_level
cgroup.procs           cpuset.mem_exclusive  cpuset.memory_pressure  cpuset.mems                notify_on_release

Generally you wouldn’t expect a freshly created directory to contain files, but the myFirstCgroup directory has a bunch of them! That’s the cgroup virtual filesystem doing its magic. And yes, that’s really all you need to do to create a cgroup.

All these files in our newly created directory (cgroup) are either reporting files or config files. Reporting files report statistics about the cgroup; config files are where you specify limits, or more generally tune how the kernel handles processes in the group. Which brings us to our second step: assigning resources to the group.

Let’s assign CPU ‘0’ and memory node ‘0’ to the group. To do that we write to the files cpuset.cpus and cpuset.mems respectively:

host # echo 0 > /sys/fs/cgroup/cpuset/myFirstCgroup/cpuset.cpus
host # echo 0 > /sys/fs/cgroup/cpuset/myFirstCgroup/cpuset.mems

While we are done assigning resources to the cgroup, stress is still using two CPUs! Why? Because we never added it to the cgroup; the group is still empty. The ‘tasks’ file in a cgroup contains the list of process IDs which are in the group, and all future children of these processes are added to the cgroup automatically. To add a process to a cgroup, you just need to write its PID to the tasks file. As we noticed earlier, there were two stress processes, so we need to add both of them. Let’s assume their PIDs are 1551 and 1552:

host # echo 1551 > /sys/fs/cgroup/cpuset/myFirstCgroup/tasks
host # echo 1552 > /sys/fs/cgroup/cpuset/myFirstCgroup/tasks
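
Typing PIDs in by hand gets old quickly; a small loop over pidof does the same thing (assuming the only stress processes running are ours):

host # for pid in $(pidof stress); do echo $pid > /sys/fs/cgroup/cpuset/myFirstCgroup/tasks; done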

Now, if we fire up htop again, we’ll see only one of our CPUs revving like a superbike engine while the other one handles the rest of the system.

host # htop

So, both stress processes got ‘contained’ into CPU0.
If we write 1 to cpuset.cpus and check htop again:

host # echo 1 > /sys/fs/cgroup/cpuset/myFirstCgroup/cpuset.cpus
host # htop

The load shifts to CPU1, which proves we can control which process gets which CPU.

But if you want CPU control AND disk isolation, you can have that too, because the chroot-ed process is a child of the shell which started it. So, you can add the PID of the shell to a cgroup and use the same shell to start a chroot container.
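
A minimal sketch of that combination, reusing myFirstCgroup from above ($$ is the PID of the current shell):

host # echo $$ > /sys/fs/cgroup/cpuset/myFirstCgroup/tasks
host # chroot rootfs /bin/bash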

Networking Foo

Whenever we open a port in the guest, it is really opened on the host. Here is a test:

guest # nc -l -p 4444

host # netstat -nptel
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode       PID/Program name
tcp        0      0 0.0.0.0:4444            0.0.0.0:*               LISTEN      0          14031       2195/nc

host # nc localhost 4444

The last nc command connects. So, the guest is not really a guest; it is opening ports on the host!

Why does this happen? Because we never specified any network configuration for our container, host and guest share one network stack. So the host and the guest can’t both be listening on the same port, and even two guests can’t listen on the same port. At least not with just chroot and cgroups.

Network namespaces allow us to fix this (and do much more). So, as earlier, let me copy a definition from the interweb (as I couldn’t find one on Wikipedia):

A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.

So, just like the previous section, let’s list down the steps:

  1. Create a network namespace
  2. Give it a device
  3. Attach your process to it.

Too easy?

Can’t be, it’s networking! It is always a ‘little’ more complicated than it looks. But rather than solving all the complexities now, let’s solve them as we need to.

Creating a network namespace (netns) is easy:

host # ip netns add myFirstNetns
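
You can list the existing namespaces to confirm it’s there:

host # ip netns list
myFirstNetns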

A few experiments/observations before we proceed.

host # ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff

host # ip netns exec myFirstNetns bash
guest # ip link list
3: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

As we can see, the default namespace has the loopback interface and the ethernet interface, while our new netns has only a loopback interface. It needs an interface to talk to the outside world. But how do we give it one? Do we go to a computer hardware store, buy a NIC, attach it to our computer and assign it to the netns? I don’t see that scaling well. So, for the first time in this article, we’ll use virtualization: we’ll create virtual ethernet (veth) devices and wire them together (software wires!) to get what we want.

Before we create veth devices, we need to understand that veth devices come in pairs. A real ethernet device (eth0, typically) comes solo because the other end of its wire is plugged into some other device. A veth pair, on the other hand, is both ends of the wire: whatever goes in one interface comes out the other.

Let’s create our veth pair and name the interfaces veth0 and veth1:

host # ip link add veth0 type veth peer name veth1
host # ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
4: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 26:d0:a1:e0:78:0c brd ff:ff:ff:ff:ff:ff
5: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2e:c2:85:d6:63:27 brd ff:ff:ff:ff:ff:ff

This command just created two network interfaces and connected them with a wire. But as of now both of them live in the same netns. It’s like buying two NICs, attaching them to the same computer and connecting them with a crossover cable, which makes both of them completely useless. So, to make them useful, we can move one of them into our newly created netns, so that they connect the two network namespaces.

host # ip link set veth0 netns myFirstNetns
host # ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
4: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 26:d0:a1:e0:78:0c brd ff:ff:ff:ff:ff:ff

host # ip netns exec myFirstNetns bash

guest # ip link list
3: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: veth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 2e:c2:85:d6:63:27 brd ff:ff:ff:ff:ff:ff

Now we’ve put the interfaces where they need to be, but they are both down and have no IP address. This is how we set them up:

host # ip link set veth1 up
host # ip addr add 192.168.99.2/24 dev veth1
host # ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
4: veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 26:d0:a1:e0:78:0c brd ff:ff:ff:ff:ff:ff
host # ip addr list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:88:0c:a6 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
    inet6 fe80::a00:27ff:fe88:ca6/64 scope link 
       valid_lft forever preferred_lft forever
4: veth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 26:d0:a1:e0:78:0c brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.2/24 scope global veth1
    inet6 fe80::24d0:a1ff:fee0:780c/64 scope link 
       valid_lft forever preferred_lft forever

host # ip netns exec myFirstNetns bash

guest # ip link set veth0 up
guest # ip addr add 192.168.99.1/24 dev veth0
guest # ip link list
3: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 2e:c2:85:d6:63:27 brd ff:ff:ff:ff:ff:ff
guest # ip addr list
3: lo: <LOOPBACK> mtu 16436 qdisc noop state DOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: veth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 2e:c2:85:d6:63:27 brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.1/24 scope global veth0
    inet6 fe80::2cc2:85ff:fed6:6327/64 scope link 
       valid_lft forever preferred_lft forever
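
One more thing worth noticing in that output: lo inside the namespace is still DOWN, so even localhost won’t work in the guest until you bring it up:

guest # ip link set lo up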

So, now if we start listening on a port in the guest, it should not open on the host.

guest # nc -l -p 4444
(Don't close this shell)

host # netstat -nptel

It doesn’t show anything on port 4444.

host # nc localhost 4444

Doesn’t even connect.

host # nc 192.168.99.1 4444

Connects :)

Congratulations, you have achieved network isolation. You can go ahead and be a complete network ninja: configure internet access, firewall rules, routing tables and so on.
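
For example, to give the guest a route to the internet, you could point its default route at the host end of the veth pair and NAT the traffic on the host; a sketch, assuming eth0 is the host’s uplink (you’d still need DNS configured in the guest):

guest # ip route add default via 192.168.99.2

host # sysctl -w net.ipv4.ip_forward=1
host # iptables -t nat -A POSTROUTING -s 192.168.99.0/24 -o eth0 -j MASQUERADE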

How do we make our “cgroup chroot container” into “netns cgroup chroot container”?

Easy (a concrete sketch follows the list):

  1. Create and configure a rootfs
  2. Create and configure a cgroup
  3. Create and configure a netns
  4. Execute a shell in a netns
  5. Put this shell’s pid in tasks file of the cgroup
  6. Start chroot
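
Concretely, the last three steps could look like this, reusing the names from above:

host # ip netns exec myFirstNetns bash

guest # echo $$ > /sys/fs/cgroup/cpuset/myFirstCgroup/tasks
guest # chroot rootfs /bin/bash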

Well, that’s it for this article. But there are many more kinds of boundaries you can put a process into. LWN has a nice series of articles on Linux namespaces. I am going to experiment with these for the next few days and maybe write another article about them.

Until then, Happy Hacking!
