Using Kexec and LVM to Quickly Run OpenStack Integration Tests

In the OpenStack project, we want to test that all of the components install and operate correctly, and we want to do that for every commit of every project.  OpenStack can be deployed using several mechanisms (at the time of writing, we know that at least Puppet, Chef, and Juju are being used).  It can also be operated in many different configurations (hypervisor selection, authentication mechanism, etc.).  The number of testing variations is large, so we want to operate our test machines as efficiently as possible.

Our first pass at running integration tests on bare metal involved PXE booting a cluster of machines each time the test was run, installing an operating system from scratch, and then installing OpenStack and running the tests.  It didn't take long to notice that we were spending as much time testing whether the machines could boot and install an OS (which, oddly enough, turns out not to succeed 100% of the time) as we were testing whether OpenStack worked.

To align the test methodology with what we are actually interested in testing, all we really need is a set of machines as they would be just after completing an operating system installation, ready to install OpenStack.  We can use LVM snapshots to restore the disk to that state, and use kexec to quickly reset the state of the running system.

Kexec is a facility popular with Linux kernel developers that allows the running kernel to replace itself with a new kernel in the same way that the exec system call allows a process to replace itself with a new program.  With one command, a new kernel is loaded in memory, and with a second, the system starts booting into the new kernel immediately (losing any state such as running programs).  Using kexec, we can boot into a new system as fast as Linux itself takes to boot (perhaps 3-6 seconds on a server).
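The two commands described above can be sketched roughly as follows.  This is a minimal illustration rather than a definitive implementation: the run and kexec_into helper names and the DRY_RUN variable are assumptions made for this example, and with DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the two kexec steps. The helper names and DRY_RUN are
# illustrative assumptions, not part of kexec-tools.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # In dry-run mode, print the command instead of executing it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

kexec_into() {
    kernel="$1"; initrd="$2"; cmdline="$3"
    # Step 1: load the new kernel into memory; the system keeps running.
    run kexec -l "$kernel" --initrd="$initrd" --append="$cmdline"
    # Step 2: jump into the loaded kernel immediately, with no shutdown.
    run kexec -e
}
```

For example, kexec_into /vmlinuz /initrd.img "$(cat /proc/cmdline)" prints the two commands; running it as root with DRY_RUN=0 would actually replace the running kernel, so treat it with care.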

LVM allows us to take a snapshot of the filesystem just after operating system installation is finished.  We can use kexec to boot into a snapshot of that filesystem, install OpenStack, run tests, and when complete, simply kexec boot into a new snapshot of the pristine filesystem and start all over again, with nearly no elapsed time between tests.

To set this system up, when the operating system installation is finished, we run the following script as root:

apt-get install kexec-tools
sed -i /etc/default/kexec -e s/LOAD_KEXEC=false/LOAD_KEXEC=true/
lvrename /dev/main/root orig_root
lvcreate -L20G -s -n root /dev/main/orig_root

That installs kexec-tools and enables kexec (note that from now on, any time you reboot your system, the init scripts will use kexec to perform the reboot).  We then rename the logical volume holding the root filesystem to "orig_root", and create a copy-on-write snapshot of it called "root" (the -L20G option reserves 20GB to hold changes made relative to the original).  The bootloader and fstab are still configured to mount "root", so that's what will be used on the next boot.
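Before rebooting, it can be reassuring to confirm that "root" really is a snapshot of "orig_root".  The following is a small sketch of one way to check; the is_snapshot_of helper is a hypothetical name introduced here, and it simply parses the output of lvs, which should be run as root against the "main" volume group used above.

```shell
#!/bin/sh
# is_snapshot_of LV ORIGIN (hypothetical helper): reads the output of
#   lvs --noheadings -o lv_name,origin
# on stdin and succeeds only if LV is listed with ORIGIN as its origin.
is_snapshot_of() {
    awk -v lv="$1" -v origin="$2" '
        $1 == lv && $2 == origin { found = 1 }
        END { exit !found }
    '
}

# Usage on a real system (as root):
#   lvs --noheadings -o lv_name,origin main | is_snapshot_of root orig_root \
#       && echo "root is a snapshot of orig_root"
```

The function only inspects text, so it can be tried safely with sample input before being pointed at a real machine.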

Then any time we want to reset the system for a test, we run this script:

lvremove -f /dev/main/last_root
lvrename /dev/main/root last_root
lvcreate -L20G -s -n root /dev/main/orig_root
APPEND="$(cat /proc/cmdline)"
kexec -l /vmlinuz --initrd=/initrd.img --append="$APPEND"
nohup bash -c "sleep 2; kexec -e" </dev/null >/dev/null 2>&1 &

That removes the previous snapshot (if there is one) and creates a new snapshot of the pristine filesystem.  The last three lines load the current kernel and then invoke kexec without performing any shutdown tasks.  In our environment, that's okay because once a test is complete, we're going to completely discard the system (including even the current filesystem).  A more gentle approach would be to simply replace the last three lines with "reboot", which will still use kexec, but only after performing a proper shutdown.  The last line in particular is constructed so that it can be run over an SSH connection from the machine that drives the integration tests.  It runs kexec in the background after a two-second delay, which gives the SSH session a chance to log out and tear down the connection before the kexec happens.
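On the machine that drives the tests, the reset might be triggered and waited on along these lines.  This is only a sketch under stated assumptions: the wait_until and reset_and_wait helpers, the /root/reset.sh path for the reset script, and the timing values are all hypothetical and would need adjusting for a real environment.

```shell
#!/bin/sh
# wait_until TRIES DELAY CMD...: retry CMD until it succeeds, giving up
# after TRIES attempts and sleeping DELAY seconds after each failure.
wait_until() {
    tries="$1"; delay="$2"; shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# reset_and_wait HOST: run the reset script on HOST, then poll until SSH
# answers again. /root/reset.sh is an assumed location for that script.
reset_and_wait() {
    host="$1"
    ssh "root@$host" /root/reset.sh || return 1
    # Give the delayed "kexec -e" time to fire before polling.
    sleep 5
    wait_until 30 2 ssh -o ConnectTimeout=2 -o BatchMode=yes "root@$host" true
}
```

Calling reset_and_wait with a test machine's hostname would return success once the freshly booted system accepts SSH connections again, at which point the next test run can begin.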

This approach allows us to quickly execute a battery of integration tests on bare metal without incurring any of the penalties normally associated with booting, installing, and configuring real hardware.  This technique may be generally useful to anyone involved in continuous integration testing who needs unmediated (non-virtualized) access to the CPU.  On the other hand, if you can test your system on cloud servers, stay tuned because we're working on that too.


James E. Blair
