January 25, 2016
In our last blog post we gave you a short introduction to Linux namespaces. Part 2 will go deeper into user namespaces and current problems that Linux containers face today. Among them, resource accounting and container privileges are top culprits.
Currently, processes on the host may still share some resource accounting with processes inside containers. One example is the limit on how many processes a single user may have, when that same user also owns the containers. Even with all the isolation that Linux namespaces provide, containers in this situation may DoS each other. Containers can use user namespaces in a certain way to fix this, but even then user namespaces still have some problems at the filesystem level.
Containers and user namespaces
User namespaces, used in the right way, can improve containers: every container always sees the same UID mapping 0-65535 inside, while outside each container or group of containers gets its own range, for example 0x20000-0x2FFFF for range A, 0x30000-0x3FFFF for range B, and so on.
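As a minimal sketch of this numbering scheme, the snippet below computes the outside range for a container and formats it in the three-field layout used by /proc/PID/uid_map (inside start, outside start, length). The notion of a numbered "slot" per container and the helper names are assumptions made for illustration; they are not taken from any container manager.

```python
UID_RANGE = 0x10000   # 65536 IDs per container, i.e. 0-65535 inside
FIRST_BASE = 0x20000  # outside start of range A in the scheme above


def outside_base(slot: int) -> int:
    """Return the first host-side UID for a (hypothetical) container slot.

    Slot 0 is range A (0x20000-0x2FFFF), slot 1 is range B, and so on.
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return FIRST_BASE + slot * UID_RANGE


def uid_map_line(slot: int) -> str:
    """Format the mapping as '<inside start> <outside start> <length>'."""
    return f"0 {outside_base(slot)} {UID_RANGE}"


def to_outside(slot: int, inside_uid: int) -> int:
    """Translate a UID seen inside the container to the UID seen on the host."""
    if not 0 <= inside_uid < UID_RANGE:
        raise ValueError("inside UID must be in 0-65535")
    return outside_base(slot) + inside_uid


if __name__ == "__main__":
    print(uid_map_line(0))      # 0 131072 65536
    print(uid_map_line(1))      # 0 196608 65536
    print(to_outside(1, 1000))  # 197608
```

The same inside UID (1000 here) ends up as a different host-side UID in every slot, which is the property the rest of this post relies on.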
This scheme allows containers to run with different users. Because each unique UID range outside always maps to 0-65535 inside, container processes no longer share users with the host or with other containers. The technical reason is that inside the kernel, some resource accounting is done within struct user_struct, against the real user ID of the calling process. In other words, this struct is used to count how many processes a user has. Without user namespaces, your host and all containers share the same process accounting, and you may run into problems related to RLIMIT_NPROC. The same applies to:
- How many pending signals a user can have queued.
- How many inotify watches a user has.
- How many inotify instances (inotify devs) a user has.
- How many bytes can be allocated to POSIX message queues.
- Maximum bytes in shared memory segments that can be mlocked.
- The UID keyring is shared.
- UID capabilities and privileges are shared. Today, most containers run with the capability CAP_SYS_ADMIN, which is basically like being real root. It would be better if containers required CAP_SYS_ADMIN only in their own user namespace and not in the host namespace. The scope of CAP_SYS_ADMIN would then be reduced, and it could not affect the namespaces of the host.
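The shared accounting described above can be illustrated with a toy model. This is not kernel code: the `ProcessAccounting` class and the `spawn` helper are hypothetical, and the real checks around RLIMIT_NPROC have more nuances. The sketch only shows why counting per host-side UID lets one container exhaust a limit for another, and why distinct outside ranges avoid that.

```python
from collections import Counter


class ProcessAccounting:
    """Toy stand-in for per-user process counting, keyed by host-side UID."""

    def __init__(self, nproc_limit: int) -> None:
        self.nproc_limit = nproc_limit
        self.processes = Counter()

    def fork(self, host_uid: int) -> bool:
        """Account one more process for host_uid; refuse once the limit is hit."""
        if self.processes[host_uid] >= self.nproc_limit:
            return False
        self.processes[host_uid] += 1
        return True


def spawn(acct: ProcessAccounting, host_uid: int, count: int) -> int:
    """Try to start `count` processes; return how many were allowed."""
    return sum(acct.fork(host_uid) for _ in range(count))


if __name__ == "__main__":
    # No user namespaces: containers A and B both run as host UID 1000.
    shared = ProcessAccounting(nproc_limit=100)
    print(spawn(shared, 1000, 100))  # 100, container A uses up the limit
    print(spawn(shared, 1000, 1))    # 0, container B cannot start anything

    # Separate outside ranges: UID 1000 inside maps to different host UIDs.
    shifted = ProcessAccounting(nproc_limit=100)
    print(spawn(shifted, 0x20000 + 1000, 100))  # 100, container A
    print(spawn(shifted, 0x30000 + 1000, 1))    # 1, container B is unaffected
```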
Those are some important reasons why containers may start to use user namespaces to separate users. The plan is to map whole ranges such as 0x20000-0x2FFFF, 0x30000-0x3FFFF, etc. outside of containers onto 0-65535 inside. The container manager will be responsible for the setup, and containers will always end up with the same mapping inside (0-65535) but with different users outside. That way, containers will no longer share the same resource accounting, capabilities and privileges. Containers may even start to use the same base filesystem.
This last point of using the same base filesystem for all containers, or even the same filesystem for both the host and containers, is really important, as one does not need to construct a root image for each use case. However, user namespaces in their current form do not really support this: they work well for some virtual objects, but not for files. Inside user namespaces, files read from disk may appear with the wrong UIDs and GIDs. To fix that, containers usually have to do recursive chown() calls on all files of the root filesystem. That is how we have implemented it in rkt, and others use their own solutions to construct root filesystems and shift UIDs, since all containers or users expect to own their proper files.
A better solution would be to do this inside the kernel, at the virtual layer or in the overlay filesystem, by shifting UIDs and GIDs. This would get rid of the recursive chown overhead and complete user namespace support. Having the same base filesystem for all containers allows them to have their own private copy. That copy can be packed for backup, re-used later by another container, or redistributed for other use cases, even for virtual machines. In the end, the container manager should just set the UID mapping and boot on any root filesystem.
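For illustration, the recursive chown workaround can be sketched as follows. This is a simplified example under stated assumptions, not the code used by rkt or any other container manager: the function name `shift_tree` and its `chown` parameter are hypothetical, it assumes the image stores IDs in 0-65535, and it ignores details such as ACLs. Changing ownership to another user needs privileges, so the `chown` callable can be replaced for a dry run.

```python
import os

UID_RANGE = 0x10000  # IDs 0-65535 are assumed to be unshifted image IDs


def shift_tree(root: str, base: int, chown=os.lchown) -> int:
    """Add `base` to the UID and GID of every entry under `root`.

    Symlinks are changed themselves and never followed. Entries whose
    IDs are already outside 0-65535 are left alone. Returns the number
    of entries passed to `chown`.
    """
    changed = 0

    def shift(path: str) -> None:
        nonlocal changed
        st = os.lstat(path)
        if st.st_uid < UID_RANGE and st.st_gid < UID_RANGE:
            chown(path, base + st.st_uid, base + st.st_gid)
            changed += 1

    shift(root)
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            shift(os.path.join(dirpath, name))
    return changed


if __name__ == "__main__":
    import sys

    # Dry run: print what would be changed instead of calling lchown.
    def report(path: str, uid: int, gid: int) -> None:
        print(f"{path}: uid -> {uid}, gid -> {gid}")

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(shift_tree(target, 0x20000, chown=report), "entries")
```

Every entry of the root filesystem has to be visited once per container range, which is the overhead an in-kernel UID shift would avoid.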
We have experimented with user namespaces and UID shifting on the overlay filesystem. If you are interested in our results, check out the links below:
- Add overlayfs uidshift support to the kernel: https://github.com/endocode/linux/tree/tixxdz/overlayfs-uidshift-v2
- Testing with systemd-nspawn: https://github.com/systemd/systemd/issues/2404
-  https://lwn.net/Articles/637431/
-  http://lxr.free-electrons.com/ident?i=user_struct
-  http://lists.freedesktop.org/archives/systemd-devel/2015-February/027867.html
-  https://lkml.org/lkml/2014/6/3/608
-  https://lists.linuxcontainers.org/pipermail/lxc-devel/2014-June/009416.html
-  http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
-  http://lists.freedesktop.org/archives/systemd-devel/2015-February/027884.html
-  https://github.com/coreos/rkt/issues/986
-  https://github.com/coreos/rkt/pull/1250
-  https://www.stgraber.org/2014/01/17/lxc-1-0-unprivileged-containers/
-  https://github.com/docker/docker/pull/12648