When multiple events for the same repository hit Drone in quick succession, is there a way to force Drone to queue builds for said repository? Our goal is to avoid the overlapped (simultaneous) execution of any pipeline step across revisions.
In more concrete terms, if a (GitHub) repository sends events push 1 and push 2 in the space of a couple of minutes, we would like all pipelines for event push 1 to complete execution before the first pipeline for event push 2 is scheduled for execution. It would be ideal if this queuing behaviour could be specified at the repository layer so that other repositories – not subject to unusual mutual exclusion constraints – are free to maximise concurrency and make full use of available resources on execution nodes.
We use Drone to execute some of our Terraform pipelines. I’ll try to summarise this problem in a general way, so that it will be understandable without any prior exposure to this specific tooling. Terraform is a doohickey that transforms plain text into live IaaS resources in some remote cloud. By necessity, the tool must interact with remote resources; these remote resources are obviously not contained within the ephemeral build context, and so, our builds are not perfectly hermetic. As a safety precaution, the tool relies upon a remote mutex to ensure that only one instance of itself is interacting with remote resources at any given time. You can probably see where this is going.
Our Drone builds work fine if there is only one build executing at any given time. Terraform is able to lock its mutex, do its work, then unlock its mutex in preparation for the next build.
Unfortunately, our Drone builds fail – with false negatives – if the Drone scheduler attempts to execute builds for multiple revisions at the same time. All concurrent builds subsequent to the first will be unable to lock that mutex. These builds would have succeeded had they been queued.
Our builds are wide and short. .drone.yml contains a number of pipelines (cardinality increases over time) that we endeavour to execute concurrently for a single revision. Each pipeline contains only a couple of steps, and each pipeline is associated with its own unique Terraform mutex. We have only one (Docker) runner at the moment, configured with RUNNER_CAPACITY=20 and RUNNER_MAX_PROCS=10.
I could perhaps work around this problem with a step that blocks on the mutex lock, but that would tie up an executor slot for no good reason. Other repositories may not be able to build in the meantime.
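For reference, the blocking workaround could look something like the sketch below. It assumes Terraform's `-lock-timeout` option, which makes Terraform keep retrying the state lock for the given duration instead of failing immediately. The pipeline name, step name, image tag, and ten-minute timeout are placeholders for illustration, not our real configuration.

```yaml
kind: pipeline
type: docker
name: foo  # hypothetical pipeline name

steps:
  - name: apply  # hypothetical step name
    image: hashicorp/terraform:latest  # placeholder image tag
    commands:
      - terraform init -input=false
      # Wait up to 10 minutes for the state lock rather than failing at once.
      # The step occupies a runner slot for the entire wait.
      - terraform apply -input=false -auto-approve -lock-timeout=10m
```

The comment on the last command is the drawback described above: a build that is only waiting for the lock still counts against the runner's capacity.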
In my original post, I had specified the following (ostensibly for the sake of simplicity):
The concurrency knob appears to apply to individual pipelines, which is a softer constraint, but would still work for us if it behaves as documented. Our pipelines are independent, so it is no problem if pipeline foo for revision X executes simultaneously with pipeline bar for revision Y.
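As I understand it, the per-pipeline setting would be written roughly as follows. This is a minimal sketch, assuming the `concurrency` attribute is honoured for Docker pipelines as described; the pipeline names `foo` and `bar` are the illustrative ones from above, and the steps and image tag are placeholders.

```yaml
kind: pipeline
type: docker
name: foo

# Allow at most one running instance of pipeline foo at a time.
concurrency:
  limit: 1

steps:
  - name: apply  # placeholder step
    image: hashicorp/terraform:latest  # placeholder image tag
    commands:
      - terraform init -input=false
      - terraform apply -input=false -auto-approve

---
kind: pipeline
type: docker
name: bar

# Limited independently of foo, so foo and bar may still overlap.
concurrency:
  limit: 1

steps:
  - name: apply  # placeholder step
    image: hashicorp/terraform:latest  # placeholder image tag
    commands:
      - terraform init -input=false
      - terraform apply -input=false -auto-approve
```

Because each pipeline carries its own limit, pipeline foo for revision X can still run alongside pipeline bar for revision Y, which is acceptable given that each pipeline has its own Terraform mutex.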
I had been perusing the documentation on docs.drone.io. These search terms:
do not seem to expose the existence of that knob. (There is some mention of a concurrency attribute for the Kubernetes Pipeline, though we do not use Kubernetes.) A revision to that documentation may help users in similar circumstances.
I shall bear this in mind. With only a single runner node, I wonder whether we are less likely to experience problems. (In our simple case, either the runner is up and established, or no Drone builds are able to be scheduled.) We have some other distributed systems we can use for this purpose, but I’ll first try to go without explicit locking for the sake of simplicity.
Please note the internal scheduler makes a reasonable effort to enforce pipeline concurrency, however, the system does not make formal guarantees. If you require formal guarantees you should consider integrating a locking system into your pipeline (such as redislock or zookeeper).
This concurrency warning in our documentation exists because we have not formally proved our locking mechanism. We could (and many systems do) claim distributed locking without building a formal model (in TLA+, for example) and testing it for correctness; however, I think it would be irresponsible for us to do so. So this warning is us erring on the side of caution and letting our users know that we cannot make strong mathematical claims with regard to correctness because we have not gone through this exercise. That being said, I am not aware of any real-world issues being reported with the latest stable version of our locking algorithm.