<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Hi,<div><br></div><div>If you need a primer on container-based virtualization and why it's interesting, see these two (a bit dated) papers:</div><div><br></div><div><a href="http://nsg.cs.princeton.edu/publication/vserver_eurosys_07.pdf">http://nsg.cs.princeton.edu/publication/vserver_eurosys_07.pdf</a></div><div><a href="http://www.hpl.hp.com/techreports/2007/HPL-2007-59R1.pdf">http://www.hpl.hp.com/techreports/2007/HPL-2007-59R1.pdf</a></div><div><br></div><div><br></div><div>Coming from a Debian background, there are three things to consider when talking about LXC.</div><div><br></div><div>*) The kernel support. The implementation is quite good these days, with features having been worked on for close to 5 years. [2] </div><div><br></div><div>*) The userland tools [3], which are clearly in the early stages of the development cycle and have not seen many (large-scale) production environments, compared to OpenVZ's vzctl.</div><div><br></div><div>*) The integration work done by Debian & Ubuntu.</div><div><br></div><div><div><div><br></div><div>LXC is _almost_ at feature parity with OpenVZ these days, but it is not quite ready for production use (Ubuntu) or is even dangerous to use (Debian).</div></div></div><div><br></div><div><br></div><div>What works:</div><div><br></div><div>Getting LXC & guests up and running is on par with OpenVZ from a time & complexity PoV, which is a good thing. Limiting CPU & memory usage for guests works; introspection and configuration aren't too different from OpenVZ.</div><div><br></div><div><br></div><div>What's missing:</div><div><br></div><div>Hiding dangerous parts:</div><div><br></div><div>sysfs, proc pidfs (/proc/$PID/...), proc sysctlfs (/proc/sys) and the VFS itself are namespaced. 
What has been disregarded so far are boring things like /proc/sysrq-trigger (allows any guest to reboot the host, among other things) and /proc/kcore (the system's memory as seen by the kernel).</div><div><br></div><div>The current stop-gap measure would be to use AppArmor, but this really should be integrated into the kernel.</div><div><br></div><div><br></div><div><div>AppArmor Integration:</div></div><div><br></div><div>AppArmor [4] is a security module for Linux, extending the default UNIX/POSIX-defined discretionary access control [5]. Compared to SELinux it doesn't use extended attributes on files to define permissions but just uses VFS paths, which makes it considerably easier to maintain.</div><div><br></div><div>In a nutshell, you define which paths a given process may access and which operations it may perform on them.</div><div><br></div><div>For LXC guests the bare minimum would be to lock down /proc/sysrq-trigger and /proc/kcore. Ubuntu has integrated native AppArmor support into its lxc package and ships nice default profiles. [6] This is completely missing from LXC upstream as well as Debian's LXC package.</div><div><br></div><div><br></div><div>Quota support:</div><div><br></div><div>Linux offers (AFAIK!) no general-purpose (V)FS quota interface - and thus there is no quota support in LXC when run in the default configuration (guest root = /var/lib/lxc/$guestname/rootfs). OpenVZ had its own simfs, which wrapped the host's VFS and bolted quota support on top of it.</div><div><br></div><div>You can band-aid this by creating separate filesystems on LVM volumes for each guest, but this comes at a much higher IOPS cost since reads & writes aren't as local anymore and there is more housekeeping to be done.</div><div><br></div><div><br></div><div>Kernel log:</div><div><br></div><div>To lock down access to the kernel log ring buffer ("dmesg") you actually have to disable a syscall, which is aptly named syslog(2) [7], not to be confused with the Unix logging standard. 
There's talk about using seccomp [8] for this, but this is probably a few months if not years out.</div><div><br></div><div><br></div><div>Migration of live guests:</div><div><br></div><div>Migration of running containers is usually done in a three-step process.</div><div><br></div><div>During the first pass the filesystem and a snapshot of the running processes (memory, SYSV IPC resources, fds, sockets, etc.) are copied. After this is completed all processes get frozen on the source host, and a second copy pass is done over the filesystem and process state, picking up all the changes that happened in the meantime. Then the processes get destroyed on the source host and unfrozen/thawed on the destination host, resuming operation unfazed.</div><div><br></div><div>Freezing & thawing are already supported in LXC. What had been missing for a long time was a way to recreate TCP sockets on the destination host, but TCP connection repair has been merged in 3.5 [9]. I don't know if something else is missing or if we can expect live migration of LXC containers to be on the horizon soon.</div><div><br></div><div><br></div><div>In the meantime, if you're serious about LXC I'd suggest looking at Ubuntu, since the Debian packages (missing AppArmor support) and container templates (the created guest doesn't boot without some polishing) aren't too nice at the moment and probably won't be fixed in time for Wheezy.</div><div><br></div><div>You can find our collection of information including step-by-step notes on how to get LXC running on Debian at <a href="http://titanpad.com/ep/pad/view/ro.PHwVPcirW2K/rev.3326">http://titanpad.com/ep/pad/view/ro.PHwVPcirW2K/rev.3326</a></div><div><br></div><div><br></div><div>Thanks to Bernhard Miklautz, Stefan Schlesinger and Christian Hofstädtler, who helped to compile the information so far. 
Stefan Schlesinger also has a blog post in the works focusing a bit more on the practical side of things.</div><div><br></div><div>All the best,</div><div>Michael</div><div><br></div><div><br></div><div>[1] <a href="http://www.netmeister.org/blog/writing-tools.html">http://www.netmeister.org/blog/writing-tools.html</a></div><div><br></div><div>[2] Overview of the LXC development process compiled by Stefan Schlesinger:</div><div><br></div><div>HISTORY<br><br><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.24 -- Cgroups: Task control groups <br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.25 -- Cgroups: Memory Resource Controller, Sysfs: Initial version of Network Namespaces<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.26 -- Cgroups: Device Whitelists<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.27 -- UID namespaces: First appearance of User Namespaces (still incomplete)<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.28 -- Cgroups: Container Freezer <a href="http://lwn.net/Articles/287435/">http://lwn.net/Articles/287435/</a><br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.29 -- Cgroups: Swap Management Feature for Memory Resource Controller, Devpts: multiple instances support<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.30 -- Cgroups: Per-cgroup utime/stime statistics, struct mem_cgroup memory improvements<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.32 -- Cgroups: Add support for named cgroups.<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.34 -- Cgroups: Implement Memory Thresholds + Eventfd API for notification<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.35 -- Sysfs: Tagged Directories/Network Namespaces<br></div><div><span 
class="Apple-tab-span" style="white-space:pre"> </span> 2.6.37 -- Cgroups: I/O Throttling support (blkio, doesn't seem to be supported by lxc configuration yet)<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 2.6.38 -- Cgroups: performance improvements on smp systems for cpu-cgroups<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.0 -- Cgroups: ??? <a href="http://kernelnewbies.org/Linux_3.0">http://kernelnewbies.org/Linux_3.0</a><br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.1 -- Tomoyo Policy namespace support (MAC Framework)<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.2 -- Sysfs: Tagged Files<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.3 -- Cgroups: Per group TCP buffer limits <a href="https://lwn.net/Articles/470656/">https://lwn.net/Articles/470656/</a><br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.3 -- Network priority control group<br></div><div><span class="Apple-tab-span" style="white-space:pre"> </span> 3.5 -- TCP Connection Repair <a href="http://lwn.net/Articles/495304/">http://lwn.net/Articles/495304/</a><br></div><br></div><div>[3] <a href="http://lxc.sourceforge.net/">http://lxc.sourceforge.net/</a></div><div><br></div><div>[4] <a href="http://en.wikipedia.org/wiki/AppArmor">http://en.wikipedia.org/wiki/AppArmor</a></div><div><br></div><div>[5] <a href="http://en.wikipedia.org/wiki/Discretionary_access_control">http://en.wikipedia.org/wiki/Discretionary_access_control</a></div><div><br></div><div>[6] <a href="http://nopaste.narf.at/show/1108/">http://nopaste.narf.at/show/1108/</a></div><div><br></div><div>[7] <a href="http://linux.die.net/man/2/syslog">http://linux.die.net/man/2/syslog</a></div><div><br></div><div>[8] <a 
href="https://blueprints.launchpad.net/ubuntu/+spec/servercloud-p-lxc-sandboxing">https://blueprints.launchpad.net/ubuntu/+spec/servercloud-p-lxc-sandboxing</a></div><div><br></div><div>[9] <a href="http://lwn.net/Articles/495304/">http://lwn.net/Articles/495304/</a></div></body></html>