There is no Jenkins, only Zuul

Since its inception, the OpenStack project has used Jenkins to perform its testing and artifact building.  When OpenStack was two git repos, we had one Jenkins master, a few slaves, and we configured all of our jobs manually in the web interface.  It was easy for a new project like OpenStack to set up and maintain.  Over the years, we have grown significantly, with over 1,200 git repos and 8,000 jobs spread across 8 Jenkins masters and 800 dynamic slave nodes.  Long before we got to this point, we could not manage all of those jobs by hand, so we wrote Jenkins Job Builder, one of our more widely used projects, so that we could automatically generate those 8,000 jobs from templated YAML.

We also wrote Zuul.

Zuul is a system to drive project automation.  It directs our testing, running tens of thousands of jobs each day, responding to events from our code review system and stacking potential changes to be tested together.

We are working on a new version of Zuul (version 3) with some major changes: we want to make it easier to run jobs in multi-node environments, make it easier to manage large numbers of jobs and job variations, support in-tree job configuration, and add the ability to define jobs using Ansible.

With Zuul in charge of deciding which jobs to run, and when and where to run them, we use very few advanced features of Jenkins at this point.  While we are still working on Zuul v3, we are at a point where we can start to use some of the work we have done already to switch to running our jobs entirely with Zuul.

As of today, we have turned off our last Jenkins master and all of our automation is being run by Zuul.  It's been a great ride, and OpenStack wouldn't be where it is today without Jenkins.  Now we're looking forward to focusing on Zuul v3 and exploring the full potential of project automation.




Introducing the NNFI scheduler for Zuul

We recently made a change to Zuul's scheduling algorithm (how it determines which changes to combine together and run tests).  Now when a change fails tests (or has a merge conflict), Zuul will move it out of the series of changes that it is stacking together to be tested, but it will still keep that change's position in the queue.  Jobs for changes behind it will be restarted without the failed change in their proposed repo states.  And if something later fails ahead of it, Zuul will once again put it back into the stream of changes it's testing and give it another chance.

To visualize this, we've updated the status screen to include a tree view.

In Zuul, this is called the Nearest Non-Failing Item (NNFI) algorithm because in short, each item in a queue is at all times being tested based on the nearest non-failing item ahead of it in the queue.
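The rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not Zuul's actual scheduler code; the `Item` class and the `nearest_non_failing` function are names made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """A change in the queue (hypothetical stand-in for Zuul's queue items)."""
    name: str
    failing: bool = False


def nearest_non_failing(queue: List[Item]) -> Dict[str, Optional[str]]:
    """Map each item to the item it should be tested on top of.

    None means the item is tested directly on the tip of the branch.
    A failing item keeps its place in the queue, but nothing behind it
    is based on it.
    """
    parents: Dict[str, Optional[str]] = {}
    nnfi: Optional[str] = None  # nearest non-failing item seen so far
    for item in queue:
        parents[item.name] = nnfi
        if not item.failing:
            nnfi = item.name
    return parents


queue = [Item("A"), Item("B", failing=True), Item("C"), Item("D")]
print(nearest_non_failing(queue))
```

Here B has failed, so C is tested on top of A rather than B, while B itself stays in the queue and is still based on A. If A were to fail later, B would be rebased onto the branch tip and get another chance.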

On the infrastructure side, this is going to drive our use of cloud resources even more, as Zuul will now try to run as many jobs as it can, continuously.  Every time a change fails, all of the jobs for changes behind it will be aborted and restarted with a new proposed future state.

For developers, this means that changes should land faster and overall throughput should increase, as Zuul won't be waiting as long to re-test changes after a job has failed.  And that's what this is ultimately about -- virtual machines are cheap compared to developer time, so the more velocity our automated tests can sustain, the more velocity our project can achieve.


Scaling the OpenStack Test Environment

A year ago I introduced Zuul, a program I developed to drive the OpenStack project's gating system. In short, each change to an OpenStack project must pass unit and integration tests before it is merged. For more details on the system, see Zuul: a Pipelining Trunk Gating System.

Over the past year, the OpenStack project has grown tremendously, with 62 git repositories related to OpenStack, 30 for the project infrastructure, and an additional 75 unofficial projects that share the same testing infrastructure. In all, the development infrastructure currently serves 167 repositories. We run up to 720 test jobs per hour, and our dynamic provisioning system has pushed our test node count up to 328 nodes online and running tests simultaneously.

Over the past year, we've made a large number of changes to prepare for this load (we saw the graphs of the OpenStack project growth just like everyone else). Here are some of the key innovations that help us test at scale.

Gearman-Plugin and Multi-Master Jenkins

It was becoming apparent that as we kept adding more nodes to Jenkins, the Jenkins master was becoming a bottleneck for scaling, as well as a single point of failure. We decided to solve this by creating a system where we can have completely independent Jenkins masters.

We decided to use Gearman as a way to distribute jobs from Zuul to any number of systems that can run tests for it (Jenkins or otherwise). Instead of talking to Jenkins directly, Zuul now submits requests to run jobs to a Gearman server, which then distributes them to any worker that registers with it indicating that it can run a particular job.
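To make the register-then-dispatch idea concrete, here is a toy in-memory sketch. It does not speak the Gearman protocol and is not how Zuul or a real Gearman server is implemented; `ToyJobServer`, the worker names, and the job name are all invented for illustration, and the round-robin choice is a simplification.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class ToyJobServer:
    """Hands each submitted job to a worker that registered for it."""

    def __init__(self) -> None:
        # job name -> workers that said they can run it
        self.workers: Dict[str, Deque[str]] = defaultdict(deque)

    def register(self, worker: str, job_name: str) -> None:
        """A worker announces that it can run job_name."""
        self.workers[job_name].append(worker)

    def submit(self, job_name: str) -> Optional[str]:
        """Pick a worker for job_name, or None if nobody can run it."""
        candidates = self.workers[job_name]
        if not candidates:
            return None
        worker = candidates.popleft()
        candidates.append(worker)  # rotate so the next request goes elsewhere
        return worker


server = ToyJobServer()
server.register("jenkins01", "gate-example-unit-tests")
server.register("jenkins02", "gate-example-unit-tests")
print(server.submit("gate-example-unit-tests"))
print(server.submit("gate-example-unit-tests"))
```

The point of the design is that the submitter never needs to know which system will run the job; it only names the job and the server finds a registered worker.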

So that we can continue to run our Jenkins tests, Khai Do and I wrote the gearman-plugin for Jenkins. It connects to a Gearman server and registers every job defined in Jenkins as something it can run. We currently have three Jenkins masters which register their jobs (1129 of them) with Gearman, which distributes build requests from Zuul to them as they have nodes available.

This system gives us quite a bit of flexibility -- we can have any program (not just Zuul) trigger jobs, as well as have any system (not just Jenkins) run those jobs. We also now have a highly-available Jenkins system, with redundant Jenkins masters across which we can do rolling upgrades of Jenkins with no downtime.


While implementing the Gearman interface for Zuul, we found that the existing Python Gearman libraries didn't facilitate the kind of asynchronous concurrency we wanted to use in Zuul. So I wrote gear, which is a very simple and lightweight interface that tries to expose all of the flexibility of the Gearman protocol. Using gear, it's very simple to write a Gearman worker or client that can handle having thousands of jobs in-flight at a time. In the OpenStack project infrastructure it is used by Zuul, as well as the log processors in our Logstash system.


With multiple Jenkins masters in place, we now have quite a bit of capacity to add more test nodes. To fill that capacity, I wrote Nodepool.

Some of our most important and most complex tests involve taking over an entire virtual machine (and perhaps in the future, more than one) and installing and configuring software as root. This is clearly not an ideal environment for a long-running Jenkins node. Instead, we run these tests on single-use nodes.

Once a day, Nodepool spins up a new node (using novaclient, of course) and caches data locally on the host that tests can later make use of. It then creates a snapshot of the host. It spins up a number of machines from that image and adds them to Jenkins. Jenkins registers their availability with Gearman, and they wait to be assigned a job.

Clark Boylan wrote the ZeroMQ event publisher plugin for Jenkins. With that installed, Jenkins publishes start and end events for every build.

Nodepool subscribes to ZeroMQ events published by each of our Jenkins masters and notes when a build starts on a node that it manages, marks it as being in-use in its internal database, and immediately starts spinning up a replacement node. When the job completes, Nodepool removes the node from Jenkins and deletes it.
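The node lifecycle described above can be sketched as a small state tracker. This is a simplified illustration, not Nodepool's code: `ToyPool` and its method names are hypothetical, launching a node is instantaneous here, and no real cloud, Jenkins, or ZeroMQ calls are made.

```python
from typing import Dict


class ToyPool:
    """Tracks single-use nodes: ready -> in-use -> deleted."""

    def __init__(self, target_ready: int) -> None:
        self.nodes: Dict[int, str] = {}  # node id -> state
        self._next_id = 0
        for _ in range(target_ready):
            self._launch()

    def _launch(self) -> int:
        """Pretend to boot a node from the daily snapshot."""
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = "ready"
        return node_id

    def count(self, state: str) -> int:
        return sum(1 for s in self.nodes.values() if s == state)

    def on_build_started(self, node_id: int) -> None:
        """Start event: mark the node in-use and launch a replacement."""
        self.nodes[node_id] = "in-use"
        self._launch()

    def on_build_finished(self, node_id: int) -> None:
        """End event: the node is never reused, so drop it."""
        del self.nodes[node_id]


pool = ToyPool(target_ready=2)
pool.on_build_started(0)
print(pool.nodes)
pool.on_build_finished(0)
print(pool.nodes)
```

Because the replacement is launched when a build starts rather than when it ends, the number of ready nodes stays at the target while jobs are running.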

Nodepool is very fast and responsive. In our current configuration, developers and tests never have to wait for a Nodepool-managed node to become available, unless we hit our test node quota (of several hundred machines, which Nodepool is capable of exhausting in a few minutes). We like it so much that we're looking into having it manage all of our Jenkins nodes.


Zuul: a Pipelining Trunk Gating System

Before each change to the OpenStack projects is merged into the main tree, unit and integration tests are run on the change, and only if they pass is the change merged.  We call this "gating".  We use Jenkins to run the tests, along with the Gerrit Trigger Plugin to kick them off and manage the resulting approvals or rejections.

Currently, this process is (mostly) serialized due to the fact that Jenkins is configured to only run one build of each job at a time.  This serialized aspect of trunk gating is desirable -- it means that each change is tested exactly as it will be eventually merged into the repository.  For example, change A will be tested against HEAD, then merged, then change B will be tested against HEAD, which now includes change A.  If we allowed those jobs to run in parallel, it may be that change A introduces a condition that causes change B to fail, but without testing B against A, we would not detect it until after the change is merged.  Strict serialization of testing and merging changes is therefore useful.

However, a problem arises as the tests become longer or the rate of changes increases.  If a given test takes, say, one hour (which is entirely reasonable for some kinds of tests), then the entire project can only merge, at most, 24 changes each day.  That is the very definition of un-scalable, and quite inconvenient for developers too, who may have to wait a very long time for the tree to change.
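The ceiling is simple arithmetic, shown here as a small sketch (the function name is just for this example):

```python
MINUTES_PER_DAY = 24 * 60


def max_serial_merges_per_day(test_minutes: int) -> int:
    """Upper bound on merges per day when each change waits for the one before it."""
    return MINUTES_PER_DAY // test_minutes


print(max_serial_merges_per_day(60))  # 24
print(max_serial_merges_per_day(90))  # 16
```

The bound depends only on test duration, so adding more test machines does nothing to raise it as long as testing stays strictly serial.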

When processor designers hit the wall for how fast a processor could execute instructions, they branched out, so to speak.  Taking a page from processor design, I have written a program that performs speculative execution of tests.  By constructing a virtual queue of changes based on the order of their approval, it runs jobs in parallel assuming they will all be successful.  If any of them fail, then any jobs that were run based on the assumption they succeeded are re-run without the problematic changes included.  This means that in the best case, as many changes can be tested and merged in parallel as computing resources will allow for testing.  And of course, with cloud computing, that isn't much of a hurdle.
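A minimal sketch of the idea, assuming changes are identified by simple names and ignoring the actual git merging and job launching (the function names are invented for this example and are not Zuul's API):

```python
from typing import List


def proposed_states(queue: List[str]) -> List[List[str]]:
    """For each queued change, list the changes its test run is built from.

    Every change is tested on top of all the changes ahead of it, on the
    assumption that those will pass and merge.
    """
    return [queue[: i + 1] for i in range(len(queue))]


def states_after_failure(queue: List[str], failed: str) -> List[List[str]]:
    """Recompute the proposed states once `failed` is known to be bad."""
    return proposed_states([change for change in queue if change != failed])


queue = ["A", "B", "C"]
print(proposed_states(queue))            # [['A'], ['A', 'B'], ['A', 'B', 'C']]
print(states_after_failure(queue, "B"))  # [['A'], ['A', 'C']]
```

All three test runs in the first list can start at once. If B fails, only the runs that included B are thrown away: A's result still stands, and C is re-run on top of A alone.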

Most changes to OpenStack do pass tests the first time, so planning for the best case is very useful.  Other changes we are making, such as running tests on changes as soon as they are uploaded to Gerrit for review, will help to provide early feedback to developers so that reviewers (and Jenkins) don't waste time trying to merge changes that we know ahead of time will fail.

The program that now drives our execution of tests is called Zuul.  It is quite generalized and not at all specific to the OpenStack workflow.  In fact, it's so configurable, it doesn't even have the idea of gating programmed into it.  With only some YAML configuration, it can be made to run all of the kinds of jobs we've developed during the course of OpenStack development:

Check jobs: tests that run immediately on submission of a patch.  No speculative execution is done; all tests can simply run in parallel and provide early feedback to developers.

Gate jobs: changes are tested in parallel but in a virtually serialized manner so that each change is tested exactly as it will be merged.  Changes with failed tests don't merge.

Post jobs: jobs that run after a change is committed (e.g., generating a tarball or documentation).

Silent jobs: jobs that should not provide feedback (perhaps the jobs are not ready for production use).

Zuul can be found here:

It should be easy to use with any project that uses Gerrit and Jenkins.  The internal interfaces should be clean enough that if you don't use Jenkins, you can easily plug in another kind of job system (patches welcome!).  With a little more trouble, you could probably factor Gerrit out as well.

Development is done just like the rest of the OpenStack project. Clone the git repo, commit your change, and "git review".  Visit us in #openstack-infra on freenode if you want to chat about it.

Cinder: a Success Story in Automating Project Infrastructure

OpenStack projects have gated trunks -- that is, every change to an OpenStack project must pass unit and integration tests.  Each one requires a number of Jenkins jobs to accomplish this, and some support within the project in the form of configuration files and test interfaces.  Until recently, this was managed in an ad-hoc manner, but as we add projects and tests, it won't scale.  We currently have 235 Jenkins jobs, and that's way too many to manage manually.

Enter the standardized Project Testing Interface.  It lays out all the processes for testing and distribution that a project needs to support to work with the OpenStack Jenkins system.  By standardizing this, we can start to manage Jenkins jobs collectively instead of individually.  This means that not only is it easier to add new projects, but we can be sure that existing projects benefit from improvements in the system and avoid bit-rot.

The OpenStack Common project helps ensure that the code in each project that handles project setup, dependencies, versions, etc., is kept in sync and standardized.  The Project Testing Interface depends on openstack-common for the project side of its implementation.

Finally, the OpenStack CI team (Andrew Hutchings in particular) has been developing a system to manage our Jenkins configuration within puppet.  That's how we plan on managing groups of Jenkins jobs, and it also means that changes to the Jenkins configuration can go through code review, just like any other change to the project.  Anyone can submit changes to the running Jenkins configuration without any special administrative privileges.

All of these efforts came together this morning when we bootstrapped the Cinder project, the breakout of the volumes component from Nova.  Adding "cinder" to the list of standard python jobs in puppet caused all of our standard packaging and gating jobs to be created in Jenkins.  OpenStack Common generated the skeleton code for the project that conforms to the Project Testing Interface.  And when that code was submitted for review, it passed the automatically-created gate jobs.  From an infrastructure standpoint, Cinder went from an empty repository to a fully integrated OpenStack project in just a few minutes.


James E. Blair

I love hacking Free Software and have been fortunate to do so professionally with some wonderful people and organizations throughout my career. This is my blog.