Cinder: a Success Story in Automating Project Infrastructure

OpenStack projects have gated trunks -- that is, every change to an OpenStack project must pass unit and integration tests.  Each one requires a number of Jenkins jobs to accomplish this, and some support within the project in the form of configuration files and test interfaces.  Until recently, this was managed in an ad-hoc manner, but as we add projects and tests, it won't scale.  We currently have 235 Jenkins jobs, and that's way too many to manage manually.

Enter the standardized Project Testing Interface.  It lays out all the processes for testing and distribution that a project needs to support to work with the OpenStack Jenkins system.  By standardizing this, we can start to manage Jenkins jobs collectively instead of individually.  This means that not only is it easier to add new projects, but we can be sure that existing projects benefit from improvements in the system and avoid bit-rot.

The OpenStack Common project helps ensure that the code in each project that handles project setup, dependencies, versions, etc, is kept in sync and standardized.  The Project Testing Interface depends on openstack-common for the project-side of its implementation.

Finally, the OpenStack CI team (Andrew Hutchings in particular) has been developing a system to manage our Jenkins configuration within puppet.  That's how we plan on managing groups of Jenkins jobs, and it also means that changes to the Jenkins configuration can go through code review, just like any other change to the project.  Anyone can submit changes to the running Jenkins configuration without any special administrative privileges.

All of these efforts came together this morning when we bootstrapped the Cinder project, the breakout of the volumes component from Nova.  Adding "cinder" to the list of standard python jobs in puppet caused all of our standard packaging and gating jobs to be created in Jenkins.  OpenStack Common generated the skeleton code for the project that conforms to the Project Testing Interface.  And when that code was submitted for review, it passed the automatically-created gate jobs.  From an infrastructure standpoint, Cinder went from an empty repository to a fully integrated OpenStack project in just a few minutes.


Passing the Devstack Gate

I've recently made some big changes to OpenStack's devstack gate scripts. As a developer, here's what you need to know about how they work, and what tools are available to help you diagnose a problem.

All changes to core OpenStack projects are "gated" on a set of tests, so a change will not be merged into the main repository unless it passes all of the configured tests. Most projects require unit tests in python2.6 and python2.7, and pep8. Those tests are all run only on the project in question. The devstack gate test, however, is an integration test and ensures that a proposed change still enables several of the projects to work together. Currently, any proposed change to the following projects must pass the devstack gate test:

  • nova
  • glance
  • keystone
  • horizon
  • python-novaclient
  • python-keystoneclient
  • devstack
  • devstack-gate

Obviously we test nova, glance, keystone, horizon and their clients because they all work closely together to form an OpenStack system. Changes to devstack itself are also required to pass this test so that we can be assured that devstack is always able to produce a system capable of testing the next change to nova. The devstack gate scripts themselves are included for the same reason.

A Tour of the Devstack Gate

The devstack test starts with an essentially bare virtual machine, installs devstack on it, and runs some simple tests of the resulting OpenStack installation. In order to ensure that each test run is independent, the virtual machine is discarded at the end of the run, and a new machine is used for the next run. In order to keep the actual test run as short and reliable as possible, the virtual machines are prepared ahead of time and kept in a pool ready for immediate use. The process of preparing the machines ahead of time reduces network traffic and external dependencies during the run.

The mandate of the devstack-gate project is to prepare those virtual machines, ensure that enough of them are always ready to run, bootstrap the test process itself, and clean up when it's done. The devstack gate scripts should be able to be configured to provision machines based on several images (eg, natty, oneiric, precise), and each of those from several providers. Using multiple providers makes the entire system somewhat highly-available since only one provider needs to function in order for us to run tests. Supporting multiple images will help with the transition of testing from oneiric to precise, and will allow us to continue running tests for stable branches on older operating systems.

To accomplish all of that, the devstack-gate repository holds several scripts that are run by Jenkins.

Once per day, for every image type (and provider) configured, the script checks out the latest copy of devstack, and then runs the script. It boots a new VM from the provider's base image, installs some basic packages (build-essential, python-dev, etc), runs puppet to set up the basic system configuration for the openstack-ci project, and then caches all of the debian and pip packages and test images specified in the devstack repository, and clones the OpenStack project repositories. It then takes a snapshot image of that machine to use when booting the actual test machines. When they boot, they will already be configured and have all, or nearly all, of the network accessible data they need. Then the template machine is deleted. The Jenkins job that does this is devstack-update-vm-image. It is a matrix job that runs for all configured providers, and if any of them fail, it's not a problem since the previously generated image will still be available.

Even though launching a machine from a saved image is usually fast, depending on the provider's load it can sometimes take a while, and it's possible that the resulting machine may end up in an error state, or have some malfunction (such as a misconfigured network). Due to these uncertainties, we provision the test machines ahead of time and keep them in a pool. Every ten minutes, a job runs to spin up new VMs for testing and add them to the pool, using the script. Each image type has a parameter specifying how many machines of that type should be kept ready, and each provider has a parameter specifying the maximum number of machines allowed to be running on that provider. Within those bounds, the job attempts to keep the requested number of machines up and ready to go at all times. The Jenkins job that does this is devstack-launch-vms. It is also a matrix job that runs for all configured providers.
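
To make the two bounds concrete, here is a minimal sketch of the arithmetic a single run of the launch job has to do. It is not the devstack-gate code; the function and argument names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: how many new VMs should one run of the launch job start?
# Arguments (illustrative names, not taken from devstack-gate):
#   $1 target_ready  - machines of this image type to keep ready
#   $2 ready         - machines of this image type currently ready
#   $3 provider_max  - maximum machines allowed to run on this provider
#   $4 running       - machines currently running on this provider
vms_to_launch() {
    wanted=$(( $1 - $2 ))      # shortfall against the ready target
    headroom=$(( $3 - $4 ))    # capacity left on the provider
    n=$wanted
    if [ "$headroom" -lt "$n" ]; then n=$headroom; fi
    if [ "$n" -lt 0 ]; then n=0; fi
    echo "$n"
}

# Target of 5 ready, 2 ready now, provider limit 10, 8 already running.
vms_to_launch 5 2 10 8
```

With those numbers the shortfall is 3 but the provider only has room for 2, so the job would start 2 machines and try again ten minutes later.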

When a proposed change is approved by the core reviewers, Jenkins triggers the devstack gate test itself. This job runs the script which checks out code from all of the involved repositories, merges the proposed change, fetches the next available VM from the pool that matches the image type that should be tested (eg, oneiric) using the script, rsyncs the Jenkins workspace (including all the source code repositories) to the VM, installs a devstack configuration file, and invokes devstack. Once devstack is finished, it runs which performs some basic integration testing. After everything is done, the script copies all of the log files back to the Jenkins workspace and archives them along with the console output of the run. If testing was successful, it deletes the node. The Jenkins job that does this is the somewhat awkwardly named gate-integration-tests-devstack-vm.

If testing fails, the machine is not immediately deleted. It's kept around for 24 hours in case it contains information critical to understanding what's wrong. In the future, we hope to be able to install developer SSH keys on VMs from failed test runs, but for the moment the policies of the providers who are donating test resources do not permit that. However, most problems can be diagnosed from the log data that are copied back to Jenkins. There is a script that cleans up old images and VMs that runs once per hour. It's and is invoked by the Jenkins job devstack-reap-vms.

How to Debug a Devstack Gate Failure

When Jenkins runs gate tests for a change, it leaves comments on the change in Gerrit with links to the test run. If a change fails the devstack gate test, you can follow it to the test run in Jenkins to find out what went wrong. The first thing you should do is look at the console output (click on the link labeled "[raw]" to the right of "Console Output" on the left side of the screen). You'll want to look at the raw output because Jenkins will truncate the large amount of output that devstack produces. Skip to the end to find out why the test failed (keep in mind that the last few commands it runs deal with copying log files and deleting the test VM -- errors that show up there won't affect the test results). You'll see a summary of the devstack tests near the bottom. Scroll up to look for errors related to failed tests.

You might need some information about the specific run of the test. At the top of the console output, you can see all the git commands used to set up the repositories, and they will output the (short) sha1 and commit subjects of the head of each repository.

It's possible that a failure could be a false negative related to a specific provider, especially if there is a pattern of failures from tests that run on nodes from that provider. In order to find out which provider supplied the node the test ran on, search for "NODE_PROVIDER=" near the top of the console output.
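
If you have saved the raw console output to a file, a small helper like the following can pull the provider out of it. This is only a sketch: the exact format of the line is an assumption, so the pattern matches "NODE_PROVIDER=" anywhere on a line and takes the first occurrence.

```shell
#!/bin/sh
# Print the provider recorded in a saved console log (first match only).
# Assumes the log has a line containing NODE_PROVIDER=<name>; uses the
# -m and -o options of GNU/BSD grep.
node_provider() {
    grep -m 1 -o 'NODE_PROVIDER=[^[:space:]]*' "$1" | cut -d= -f2
}
```

For example, `node_provider console.txt` (where console.txt is whatever name you saved the log under) prints just the provider name, which makes it easy to compare several failed runs.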

Below that, you'll find the output from devstack as it installs all of the debian and pip packages required for the test, and then configures and runs the services. Most of what it needs should already be cached on the test host, but if the change to be tested includes a dependency change, or there has been such a change since the snapshot image was created, the updated dependency will be downloaded from the Internet, which could cause a false negative if that fails.

Assuming that there are no visible failures in the console log, you may need to examine the log output from the OpenStack services. Back on the Jenkins page for the build, you should see a list of "Build Artifacts" in the center of the screen. All of the OpenStack services are configured to syslog, so you may find helpful log messages by clicking on "syslog.txt". Some error messages are so basic they don't make it to syslog, such as if a service fails to start. Devstack starts all of the services in screen, and you can see the output captured by screen in files named "screen-*.txt". You may find a traceback there that isn't in syslog.
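
When there are many screen logs, it can save time to let grep find the ones worth opening. The sketch below assumes you have downloaded the "screen-*.txt" artifacts into one directory, and it looks for the standard header line of a Python traceback.

```shell
#!/bin/sh
# List the screen-*.txt files in a directory that contain a Python
# traceback, with the number of tracebacks found in each.
# Assumption: the build artifacts were downloaded side by side into $1.
tracebacks_in() {
    for f in "$1"/screen-*.txt; do
        [ -e "$f" ] || continue    # glob matched nothing
        # grep -c prints 0 and exits non-zero when nothing matches
        n=$(grep -c 'Traceback (most recent call last)' "$f" || true)
        if [ "$n" -gt 0 ]; then
            echo "$(basename "$f"): $n"
        fi
    done
    return 0
}
```

Running `tracebacks_in .` in the download directory prints one line per log that contains at least one traceback.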

After examining the output from the test, if you believe the result was a false negative, you can retrigger the test by clicking on the "Retrigger" link on the left side of the screen. If a test failure is a result of a race condition in the OpenStack code, please take the opportunity to try to identify it, and file a bug report or fix the problem. If it seems to be related to a specific devstack gate node provider, we'd love it if you could help identify what the variable might be (whether in the devstack-gate scripts, devstack itself, OpenStack, or even the provider's service).

Contributions Welcome

All of the OpenStack developer infrastructure is freely available and managed in source code repositories just like the code of OpenStack itself. If you'd like to contribute, just clone and propose a patch to the relevant repository:

You can file bugs on the openstack-ci project:

And you can chat with us on Freenode in #openstack-dev or #openstack-infra

The next thing planned for the devstack-gate scripts is to start running Tempest, the OpenStack integration test suite, as part of the process. This will provide more thorough testing of the system that devstack sets up, and of course will help Tempest to evolve in step with the rest of the system.


Using Kexec and LVM to Quickly Run OpenStack Integration Tests

On the OpenStack project, we want to test that all of the components install and operate correctly, and we want to do that for every commit of every project.  OpenStack can be deployed using several mechanisms (we know that puppet, chef, and juju at least are being used at the time of writing).  It can also be operated in many different configurations (hypervisor selection, authentication mechanism, etc.). The number of testing variations is large, so we want to operate our test machines as efficiently as possible.

Our first pass at running integration tests on bare metal involved PXE booting a cluster of machines each time the test was run, installing an operating system from scratch, and then installing OpenStack and running the tests.  It didn't take long to notice that we were spending as much time testing whether the machines could boot and install an OS (oddly enough, not 100% of the time as it turns out) as we were testing whether OpenStack worked.

To align the test methodology with what we are actually interested in testing, all we really need is a set of machines as they would be just after completing an operating system installation, ready to install OpenStack.  We can use LVM snapshots to restore the disk to that state, and use kexec to quickly reset the state of the running system.

Kexec is a facility popular with Linux kernel developers that allows the running kernel to replace itself with a new kernel in the same way that the exec system call allows a process to replace itself with a new program.  With one command, a new kernel is loaded in memory, and with a second, the system starts booting into the new kernel immediately (losing any state such as running programs).  Using kexec, we can boot into a new system as fast as Linux itself takes to boot (perhaps 3-6 seconds on a server).

LVM allows us to take a snapshot of the filesystem just after operating system installation is finished.  We can use kexec to boot into a snapshot of that filesystem, install OpenStack, run tests, and when complete, simply kexec boot into a new snapshot of the pristine filesystem and start all over again, with nearly no elapsed time between tests.

To set this system up, when the operating system installation is finished, we run the following script as root:

apt-get install kexec-tools
sed -i /etc/default/kexec -e s/LOAD_KEXEC=false/LOAD_KEXEC=true/
lvrename /dev/main/root orig_root
lvcreate -L20G -s -n root /dev/main/orig_root

That installs the kexec-tools and enables kexec (note that now any time you reboot your system, the init scripts will use kexec to perform the reboot).  We then rename the logical volume holding the root filesystem to "orig_root", and create a copy-on-write snapshot of it called "root".  The bootloader and fstab are still configured to mount "root" so that's what will be used on the next boot.

Then any time we want to reset the system for a test, we run this script:

lvremove -f /dev/main/last_root
lvrename /dev/main/root last_root
lvcreate -L20G -s -n root /dev/main/orig_root
APPEND="`cat /proc/cmdline`"
kexec -l /vmlinuz --initrd=/initrd.img --append="$APPEND"
nohup bash -c "sleep 2; kexec -e" </dev/null >/dev/null 2>&1 &

That removes the previous snapshot (if there is one) and creates a new snapshot of the pristine filesystem.  The last three lines load and immediately invoke kexec without performing any shutdown tasks.  In our environment, that's okay because once a test is complete, we're going to completely discard the system (including even the current filesystem).  A more gentle approach would be to simply replace the last three lines with "reboot", which will still use kexec, but only after performing a proper shutdown.  The last line in particular is constructed so that it can be run over an SSH connection from the machine that drives the integration tests.  It gives it a chance to log out and tear down the SSH connection before performing kexec.
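
Because the reset script never checks whether "kexec -l" succeeded, a cautious variant might confirm that a kernel is actually loaded before executing it. The following is a sketch of such a check, not part of our script. It relies on the kernel's /sys/kernel/kexec_loaded file, which reads "1" once an image has been loaded; the optional path argument exists only so the function can be tried without root.

```shell
#!/bin/sh
# Succeed only if the kernel reports that a kexec image is loaded.
# $1 (optional) overrides the sysfs path, purely for experimentation.
kexec_image_loaded() {
    flag=${1:-/sys/kernel/kexec_loaded}
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}
```

The final line of the reset script could then be guarded with `kexec_image_loaded &&`, so a failed load leaves the machine running instead of attempting to execute a kernel that is not there.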

This approach allows us to quickly execute a battery of integration tests on bare metal without incurring any of the penalties normally associated with booting, installing, and configuring real hardware. This technique may be generally useful to anyone involved in continuous integration testing that needs unmitigated access to the CPU.  On the other hand, if you can test your system on cloud servers, stay tuned because we're working on that too.

Tags: openstack code

Using Zone Key Tool to Manage DNSSEC Signed Domains with NSD

Why Use NSD?

NSD is an original authoritative DNS server from NLnet Labs. Because it focuses solely on serving authoritative DNS data, rather than also acting as a recursive DNS server (ie, one used directly by client applications), it has a simpler code base and configuration file syntax. It does have a companion program, Unbound, which is a full featured DNS resolver (which in turn, does not serve authoritative data).

NSD is an efficient and simple alternative to BIND. Having major portions of Internet infrastructure rely on one piece of software can be risky, so many large DNS providers now use both authoritative servers in production (splitting them among their infrastructure). NSD's use of BIND format zone files makes this easy, and makes it compatible with many existing tools.

Why Use ZKT?

If you are new to DNSSEC, I recommend reading this DNSSEC HOWTO, a tutorial in disguise. It covers the different methods of key rollover. Zone Key Tool, or ZKT, is used to automate the management of the various DNSSEC signing keys for your domains. It implements automatic Zone Signing Key (ZSK) rollover using the pre-publish method, meaning that the most frequent key changes are handled automatically. Key Signing Keys (KSK, which are used to sign the ZSKs) must be provided to the maintainer of the parent zone (or a DLV registry) and so their rollover cannot be handled completely automatically, but ZKT automates as much of the process as is feasible using the double signature method.

ZKT uses the standard BIND command-line tools to do the work of actually signing the zone, and stores the keys and any metadata it needs in a simple hierarchical filesystem structure. This makes it easy to use on any system, and indeed, any DNS server, as well as providing for simple maintenance and the ability to easily incorporate it into other processes.

Setting up NSD

The following procedure was used on an Ubuntu 10.10 system, though it should apply generally to any current GNU/Linux system.

Install NSD and the BIND utilities (which include the DNSSEC tools that ZKT will use). Create a directory to hold the zones, copy the sample config and start customizing it:

root@zkt1:~# apt-get install nsd3 bind9utils
root@zkt1:~# cd /etc/nsd3/
root@zkt1:/etc/nsd3# mkdir zones
root@zkt1:/etc/nsd3# cp nsd.conf.sample nsd.conf

Edit nsd.conf and make the following changes:

This directory will have a subdirectory for each zone which will contain the zone file for that zone as well as its keys:

# The directory for zonefile: files.
zonesdir: "/etc/nsd3/zones"

If you will be performing zone transfers, you'll need to set the same TSIG key on the master and slave. Here's a quick way to generate a key:

root@zkt1:~# dd if=/dev/random bs=16 count=1 2>/dev/null | openssl base64 
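
On an idle machine, reading from /dev/random can block for a while. As an alternative sketch (a swapped-in variant, not the command I used above), the same 128-bit secret can be read from /dev/urandom and encoded with the coreutils base64 tool:

```shell
#!/bin/sh
# Generate a 16-byte (128-bit) random secret and print it base64-encoded.
# Reads /dev/urandom so the command does not block waiting for entropy.
new_tsig_secret() {
    head -c 16 /dev/urandom | base64
}

secret=$(new_tsig_secret)
echo "secret: \"$secret\""
```

Sixteen bytes always encode to 24 base64 characters ending in "==", which is a quick way to check that the value was copied intact.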

Add the key to the config file, with a unique name.

key:
    name: examplekeyname
    algorithm: hmac-md5
    secret: "AWGPLrhHD6oLOXI7vZBToQ=="

Add more key stanzas if you have more slave servers, then remove the remaining example key and zone stanzas. Finally, start the server:

root@zkt1:~# /etc/init.d/nsd3 start 

Setting Up a Slave with NSD

Install NSD and copy the sample config as before, and copy the appropriate key: section of nsd.conf from the master. Then start the server:

root@zkt1:~# /etc/init.d/nsd3 start 

Setting up ZKT

ZKT is not (yet) packaged in Debian or Ubuntu, so it will need to be downloaded, built, and installed:

root@zkt1:~# apt-get install build-essential
root@zkt1:~# wget
root@zkt1:~# tar xvfz zkt-1.0.tar.gz
root@zkt1:~# cd zkt-1.0
root@zkt1:~/zkt-1.0# ./configure --enable-configpath=/etc/nsd3
root@zkt1:~/zkt-1.0# make
root@zkt1:~/zkt-1.0# make install
root@zkt1:~/zkt-1.0# make install-man

Create a config file for ZKT. Normally the location defaults to /var/named, but since you may not have BIND installed on your system, the configure line supplied above changes the location to /etc/nsd3/dnssec.conf:

# @(#) dnssec.conf vT0.99d (c) Feb 2005 - Aug 2009 Holger Zuleger

# dnssec-zkt options
Zonedir: "/etc/nsd3/zones"
Recursive: True
PrintTime: True
PrintAge: True
LeftJustify: False

# zone specific values
ResignInterval: 1d # (86400 seconds)
Sigvalidity: 10d # (864000 seconds)
Max_TTL: 1h # (3600 seconds) maximum ttl actually in zone
Propagation: 5m # (300 seconds) expected slave propagation time
KEY_TTL: 1h # (3600 seconds)
Serialformat: incremental

# signing key parameters
Key_algo: NSEC3RSASHA1 # (Algorithm ID 7)
KSK_lifetime: 1y # (31536000 seconds)
KSK_bits: 2048
KSK_randfile: "/dev/urandom"
ZSK_lifetime: 4w # (2419200 seconds)
ZSK_bits: 1024
ZSK_randfile: "/dev/urandom"
SaltBits: 24

# dnssec-signer options
LogFile: ""
LogLevel: ERROR
SyslogFacility: NONE
SyslogLevel: NOTICE
VerboseLog: 0
Keyfile: "dnskey.db"
Zonefile: "zone.db"
DLV_Domain: ""
Sig_Pseudorand: False
Sig_GenerateDS: True
Sig_Parameter: ""

You may be interested in editing that file to tune the output format of the tools, resigning interval, and the key algorithm.
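
When tuning the intervals, it helps to double-check what a value such as 10d or 4w means in seconds. The helper below is a small sketch for that purpose. It is not ZKT's own parser; the suffix meanings are taken from values in the file above such as 5m (300 seconds), 10d (864000 seconds) and 1y (31536000 seconds, that is, 365 days).

```shell
#!/bin/sh
# Convert an interval like 5m, 1h, 1d, 4w or 1y to seconds.
to_seconds() {
    num=${1%?}          # everything except the last character
    unit=${1#"$num"}    # the last character (the unit suffix)
    case "$unit" in
        s) mult=1 ;;
        m) mult=60 ;;
        h) mult=3600 ;;
        d) mult=86400 ;;
        w) mult=604800 ;;
        y) mult=31536000 ;;
        *) echo "unknown unit: $1" >&2; return 1 ;;
    esac
    echo $(( num * mult ))
}

to_seconds 10d
to_seconds 4w
```

The two calls print 864000 and 2419200 respectively.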

To automate the signing of zones with ZKT and updating NSD, create the following script at /usr/local/sbin/update-dns and make it executable:


#!/bin/sh

echo "Current DNSSEC signing keys:"
/usr/local/bin/zkt-ls -f

echo "Resigning zones:"
/usr/local/bin/zkt-signer -v || exit

echo "Update NSD:"
/usr/sbin/nsdc rebuild && /usr/sbin/nsdc reload && /usr/sbin/nsdc notify

echo "Done."

The initial installation and configuration is complete. The next sections detail how to add a new zone, with instructions that can be repeated each time.

Adding a New Zone

Add the following to nsd.conf on the master:

notify: SLAVEIP examplekeyname
provide-xfr: SLAVEIP examplekeyname
notify-retry: 5

Use zone.db rather than zone.db.signed as the zonefile if you are not planning on signing this zone. Add this to any slaves:

allow-notify: MASTERIP examplekeyname
request-xfr: AXFR MASTERIP examplekeyname

Where SLAVEIP is the IP address of your slave server, and MASTERIP that of the master. On the master, create a directory to hold the zone file and keys. This directory must have the same name as the zone (ie, the $ORIGIN line):

root@zkt1:~# mkdir /etc/nsd3/zones/ 

I recommend starting with the following stub zone file for each new zone you create. Copy this to /etc/nsd3/zones/

; -*- mode: zone -*-                                                            
$TTL 1h
@ IN SOA (
1296347822 ; serial number unixtime
1h ; refresh (secondary checks for updates)
10m ; retry (secondary retries failed axfr)
10d ; expire (secondary ends serving old data)
1h ) ; min ttl (cache time for failed lookups)
$INCLUDE dnskey.db

If you edit that file within Emacs, the serial number will automatically be updated. Note that the dnskey.db file which is to be included has not been created yet. If you are not going to sign the zone, remove this line:

$INCLUDE dnskey.db 
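
If you do not edit zone files in Emacs, the unixtime serial in the stub above has to be bumped some other way. Here is a minimal sketch that does it with sed. It assumes GNU sed (for -i) and that the serial sits on the line carrying the "; serial" comment, as in the stub; the function name is made up.

```shell
#!/bin/sh
# Replace the number on the "; serial" line with the current Unix time.
# Assumes GNU sed and the comment layout of the stub zone file.
bump_serial() {
    now=$(date +%s)
    sed -i -E "s/^[[:space:]]*[0-9]+([[:space:]]*; serial)/${now}\1/" "$1"
}
```

Call it with the path to the zone file, for example `bump_serial zone.db` from inside the zone's directory, before running update-dns.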

Run nsdc restart on the master and any slaves after editing the configuration file. If you are not going to sign the zone, run:

root@zkt1:~# update-dns 

to recompile the zone database and send notifications to the slaves. It will also re-sign any zones that need it. If you are going to sign the zone, continue to the next section.

Signing a Zone for the First Time

This procedure is only required to initially sign a zone; subsequent updates are handled automatically. ZKT's global configuration is stored in /etc/nsd3/dnssec.conf. Local, per-zone settings may be added to a dnssec.conf in the zone directory if needed. To sign a zone, simply create an empty signed zone file, and then run the signing program:

root@zkt1:~# touch /etc/nsd3/zones/
root@zkt1:~# zkt-signer -v -v

This will sign any zones that need updating, and also create the keys for your new zone and sign it. If everything went well, run:

root@zkt1:~# update-dns 

which will perform another pass over the zones (extraneous but harmless), then update NSD on the master and slaves.

Adding the Zone to DLV

Since the root is not yet signed, you probably want to use DNSSEC Lookaside Validation (DLV). ISC, the maintainers of BIND, provide an easy to use and well regarded DLV registry at Create an account there and log into it. Click Manage Zones, and then (add a zone). Enter the name, and then it will prompt you to add a record or upload a file. Look in your zone directory for the dsset file:

root@zkt1:~# /etc/nsd3/zones/ 

Copy the last line out of that file. It should look like:  IN DS 38448 7 2 EFF15AE6AA31A1D552DDE68FE874253959A2DA8B28DE39D42FE025B5 541B5CBE 

Paste the entire line into the Add Record prompt on the DLV website. You will then be prompted to add a TXT record to the domain. Follow the instructions in this document for modifying a zone to add the TXT record. Very shortly afterwards (about 5 minutes), the DLV website will have checked for the TXT record. If everything worked, you can then remove the record, and you're done.

Updating a Zone

Edit the zone file at /etc/nsd3/zones/ Don't edit anything else in the zone directory; it's all managed by ZKT. When finished, run

root@zkt1:~# update-dns 

to re-sign any zones that need it, recompile the NSD database, reload it, and notify the slaves.

ZSK rollover

Zone Signing Keys are configured (ZSK_lifetime: 4w in dnssec.conf) to roll over using the pre-publish procedure every four weeks. To make sure this happens, set up the following cron job in /etc/cron.d/dnssec:

# Re-sign dnssec domains
17 3 * * * root /usr/local/sbin/update-dns

KSK rollover

Key Signing Keys, which are the DNSSEC Secure Entry Point (SEP) for the zone and used to sign the ZSKs, are configured for a one year lifetime. They are NOT rolled over automatically, since updating the upstream DS (or DLV) records is not automatic. Once a year, when they near expiration, use this procedure:

  1. Generate a new key and allow it to propagate through DNS:

    root@zkt1:~# zkt-keyman -1
    root@zkt1:~# update-dns

  2. After a while (zkt will calculate it for you), publish the new DS (or DLV) record:

    root@zkt1:~# zkt-keyman -2
    root@zkt1:~# update-dns

    Follow the instructions above for sending the DS record to the DLV registry.

  3. After that has propagated, you can remove the old key:

    root@zkt1:~# zkt-keyman -3
    root@zkt1:~# update-dns

At any time during the process, you can see what ZKT thinks the status is by running:

root@zkt1:~# zkt-keyman --ksk-status 


Tags: security code

SSH Agent Forwarding in Ubuntu's Gnome

It's been over two years since the bug was opened and the SSH agent built into gnome-keyring still does not support constrained identities, particularly the confirmation constraint.

If you are forwarding your SSH agent connection through an intermediate (or bastion) host and the intermediate host is compromised (or has an untrustworthy admin), your forwarded agent connection could be hijacked and your key could be used to access other hosts without your knowledge.  Therefore, when forwarding an SSH agent, it's important that your agent asks for confirmation before the key is used.  That way you will be alerted if your agent is used by someone else to access your key.

Because the SSH agent component in gnome-keyring does not support confirmation dialogs, it should be disabled if you want to use SSH keys in this way.  In order to do that, you must use gconf:

$ gconftool-2 --set -t bool /apps/gnome-keyring/daemon-components/ssh false

If that were the only bug in GNOME, the ssh-agent from openssh would take over on your next login and everything would be fine.  However, if you have seahorse-plugins installed (you probably do), you'll run into this bug.  The Xsession script provided by seahorse-plugins abuses a variable that is supposed to be available to all Xsession scripts, and in doing so, prevents ssh-agent from running.  You could edit the file to fix it, but it's perhaps better to just add another file that undoes the damage.  As root:

# cat > /etc/X11/Xsession.d/60seahorse-plugins-fix <<'EOF'
# This file is sourced by Xsession(5), not executed.
OPTIONS=$(cat "$OPTIONFILE") || true
EOF

Once that is done, you can add "/usr/bin/ssh-add -c" to your gnome startup items.
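
After logging in again, it is worth confirming which agent you are actually talking to. The check below is only a sketch: it assumes that gnome-keyring's agent socket lives under a path containing "keyring", which is how it appeared on my system, whereas OpenSSH's ssh-agent normally creates its socket in a /tmp/ssh-* directory.

```shell
#!/bin/sh
# Report whether an agent socket path looks like gnome-keyring's.
# The "keyring" path component is an assumption, not a guarantee.
agent_is_keyring() {
    case "$1" in
        *keyring*) return 0 ;;
        *) return 1 ;;
    esac
}

if agent_is_keyring "${SSH_AUTH_SOCK:-}"; then
    echo "SSH_AUTH_SOCK still points at gnome-keyring"
else
    echo "SSH_AUTH_SOCK does not point at gnome-keyring"
fi
```

If the first message appears, the gconf change or the Xsession fix has not taken effect yet, and keys added with "ssh-add -c" will not get confirmation prompts.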

Tags: code

James E. Blair

I love hacking Free Software and have been fortunate to do so professionally with some wonderful people and organizations throughout my career. This is my blog.