Introducing the NNFI scheduler for Zuul

We recently made a change to Zuul's scheduling algorithm (how it determines which changes to combine and test together).  Now when a change fails tests (or has a merge conflict), Zuul will move it out of the series of changes that it is stacking together to be tested, but it will still keep that change's position in the queue.  Jobs for changes behind it will be restarted without the failed change in their proposed repo states.  And if something ahead of it later fails, Zuul will put it back into the stream of changes it's testing and give it another chance.

To visualize this, we've updated the status screen to include a tree view.

In Zuul, this is called the Nearest Non-Failing Item (NNFI) algorithm because, in short, each item in a queue is always being tested on top of the nearest non-failing item ahead of it in the queue.
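To make the rule concrete, here is a minimal sketch in Python. It is not Zuul's actual code: the `Item` class and the `nearest_non_failing` function are hypothetical names invented for illustration, and the sketch assumes a single queue where each item simply carries a failing flag.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a change in the queue (not a Zuul class)."""
    name: str
    failing: bool = False


def nearest_non_failing(queue: List[Item]) -> Dict[str, Optional[str]]:
    """Map each item's name to the name of the item it is tested on top of.

    None means there is no non-failing item ahead, so the item is tested
    directly against the current state of the branch.
    """
    based_on: Dict[str, Optional[str]] = {}
    nnfi: Optional[Item] = None  # nearest non-failing item seen so far
    for item in queue:
        based_on[item.name] = nnfi.name if nnfi else None
        # A failing item keeps its place in the queue, but items behind it
        # skip over it when looking for something to build on.
        if not item.failing:
            nnfi = item
    return based_on


queue = [Item("A"), Item("B", failing=True), Item("C"), Item("D")]
print(nearest_non_failing(queue))
# {'A': None, 'B': 'A', 'C': 'A', 'D': 'C'}
```

Here B has failed, so C is tested on top of A rather than B, while B stays in its original position in the queue.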

On the infrastructure side, this is going to increase our use of cloud resources even further, as Zuul will now try to run as many jobs as it can, continuously.  Every time a change fails, all of the jobs for changes behind it will be aborted and restarted with a new proposed future state.
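As a rough illustration of why a single failure restarts everything behind it, the sketch below compares each item's proposed future state before and after a failure. The function names `proposed_states` and `jobs_to_restart` are hypothetical, and the model is deliberately simplified: a proposed state is just the tuple of non-failing changes ahead of an item plus the item itself.

```python
from typing import Dict, List, Set, Tuple


def proposed_states(queue: List[str], failing: Set[str]) -> Dict[str, Tuple[str, ...]]:
    """Return, for each item, the stack of changes its jobs are run against."""
    states: Dict[str, Tuple[str, ...]] = {}
    ahead: List[str] = []  # non-failing items seen so far
    for name in queue:
        states[name] = tuple(ahead) + (name,)
        if name not in failing:
            ahead.append(name)
    return states


def jobs_to_restart(queue: List[str],
                    failing_before: Set[str],
                    failing_after: Set[str]) -> List[str]:
    """List the items whose proposed state changed and so need new jobs."""
    before = proposed_states(queue, failing_before)
    after = proposed_states(queue, failing_after)
    return [name for name in queue if before[name] != after[name]]


queue = ["A", "B", "C", "D"]
# B fails: everything behind it was being tested with B included.
print(jobs_to_restart(queue, failing_before=set(), failing_after={"B"}))
# ['C', 'D']
```

D is restarted even though it is still tested on top of C, because C's own proposed state no longer includes B.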

For developers, this means that changes should land faster and overall throughput should be higher, as Zuul won't be waiting as long to re-test changes after a job has failed.  And that's what this is ultimately about: virtual machines are cheap compared to developer time, so the more velocity our automated tests can sustain, the more velocity our project can achieve.

James E. Blair

I love hacking Free Software and have been fortunate to do so professionally with some wonderful people and organizations throughout my career. This is my blog.
