Drone "running" builds forever

I am running Drone 1.1 and I am finding that, under certain conditions, Drone believes builds are still running when they are not. For example, when the Drone server is temporarily unreachable, agents seem unable to report that they have finished during that window, and I believe the server never learns that the build is complete (although there may be some other explanation I haven’t thought of).

So I see a few builds that are more than a week old and still ‘running’, even though they are configured in Drone to time out after only 1 hour. If I click ‘cancel’ on these old builds in the UI, everything seems to return to normal. Shouldn’t drone-server do this for me periodically when there are running builds older than the ‘timeout’ interval?

For example, when the Drone server is temporarily unreachable, agents seem unable to report that they have finished during that window, and I believe the server never learns that the build is complete (although there may be some other explanation I haven’t thought of).

Not sure about this. Drone has logic in place to retry when the server is unavailable, and will continue to retry until the server comes back online. https://github.com/drone/drone/blob/master/operator/manager/rpc/client.go#L299
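
Conceptually the agent behaves like this shell pseudocode, where notify_server is a hypothetical stand-in for the status-update RPC in the Go code linked above:

# notify_server is a hypothetical stand-in for the agent's real RPC call.
until notify_server "stage complete"; do
  sleep 10   # back off and retry until the server is reachable again
done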

The only issue I am aware of that matches what you described is when you kill an agent while it is running a build. Drone does not gracefully handle this situation (it was not designed to). So in this case, the expected result is that someone needs to perform manual cleanup. Could this be the case?

Otherwise I am happy to assist with debugging, but would need to see detailed system data (logs, etc.) or the steps and conditions to reproduce.

The only issue I am aware of that matches what you described is when you kill an agent while it is running a build.

I don’t think my agents are dying while running a build, but I may have something close going on. My agents are part of an autoscaling system that I put together: the EC2 instances running the agents terminate themselves if they haven’t run any builds for more than 5 minutes. If the drone-server is unreachable before a build is complete and remains unreachable until after the agent’s instance is terminated, will I see the same behavior? If it is solely the drone-agent’s responsibility to update drone-server, then that would make sense as an explanation.
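
For the curious, the idle check on each instance is roughly the following sketch. The io.drone container label (which I believe Drone applies to build containers) and the shutdown-on-idle behavior are assumptions specific to my own setup:

#!/bin/bash
# Terminate this instance after 5 minutes with no build containers running.
idle_since=$(date +%s)
while true; do
  if [ "$(docker ps -q --filter 'label=io.drone' | wc -l)" -gt 0 ]; then
    idle_since=$(date +%s)   # a build is active; reset the idle clock
  elif [ $(( $(date +%s) - idle_since )) -gt 300 ]; then
    shutdown -h now          # the instance terminates itself on shutdown
  fi
  sleep 30
done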

Drone does not gracefully handle this situation (it was not designed to).

Is this part of the design open to change? My thought is it would be nice if drone-server would periodically check to see if any builds are ‘running’ and also past the configured ‘timeout’ and then react by cancelling them. Come to think of it, this is something I could probably implement outside of drone core as a cron job that consumes the drone api.

Are you sure? We have an experimental reaper built into the agent to handle this exact situation. It needs to be enabled with a feature flag. We also have an open issue around non-graceful shutdowns (here). I am happy to talk through these issues, but would ask that we avoid speculating what is or is not the problem without more data.

Is this part of the design open to change? My thought is it would be nice if drone-server would periodically check to see if any builds are ‘running’ and also past the configured ‘timeout’ and then react by cancelling them. Come to think of it, this is something I could probably implement outside of drone core as a cron job that consumes the drone api.

I prefer not to answer this question because we are talking about solutions before we have identified a root cause. I think we need to take a step back and gather more data before we start making assumptions about root causes and required design changes.

What data do you think would be most useful here? Will collecting drone-server logs help in this situation? At the current (non-debug) log level there was nothing useful in the logs, but I can easily change the log level if that would help.

Collecting the agent logs would be a challenge because I don’t currently have any way to save them; they die along with the host instances when those go idle and terminate. But if you think the agent logs are essential to debugging this issue, I can try to set something up.
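
In case it helps, this is how I plan to raise the server log level. A sketch, assuming DRONE_LOGS_DEBUG is the right switch for Drone 1.x; the other variables are placeholders for my existing server settings:

docker run --detach \
  --env=DRONE_LOGS_DEBUG=true \
  --env=DRONE_RPC_SECRET=${rpc_secret} \
  --env=DRONE_SERVER_HOST=${drone_host} \
  --env=DRONE_SERVER_PROTO=https \
  --publish=80:80 \
  drone/drone:1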

You should be able to look at when the build was running and on which instance. When was that instance terminated? Was it terminated while the build was running? This information should be available in the autoscaler logs.

I understand that this is difficult to debug, but unfortunately I have very little time I can allocate to this. So you will probably have to drive this effort and figure out how to gather more information to help identify the root cause.

Ok. I’m going to try to gather more info. I have not been supplying a DRONE_RUNNER_NAME to my agents, so I may have to make that change first so that I can establish a direct link between historical build stages and the instances they ran on.
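
For reference, here is roughly what my agent launch will look like. A sketch: DRONE_RPC_SERVER and DRONE_RPC_SECRET are the usual agent settings, and using the EC2 instance id as the runner name is my own convention:

docker run --detach \
  --volume=/var/run/docker.sock:/var/run/docker.sock \
  --env=DRONE_RPC_SERVER=https://${drone_host} \
  --env=DRONE_RPC_SECRET=${rpc_secret} \
  --env=DRONE_RUNNER_NAME=$(curl -s http://169.254.169.254/latest/meta-data/instance-id) \
  drone/agent:1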

The autoscaler sets this automatically to match the unique identifier of the instance. So when the autoscaler terminates an instance, you should see its id in the logs, which you can match to the machine id of the running build stage.
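
For example, a query like this (a sketch; each stage object carries the name of the machine it ran on, and ${repo_slug} and ${build_number} are placeholders) lists the machine for every stage of a build:

curl -s -H "Authorization: Bearer ${drone_token}" \
  "https://${drone_host}/api/repos/${repo_slug}/builds/${build_number}" |
  jq -r '.stages[] | "\(.number) \(.name) \(.machine)"'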

I actually didn’t know about the autoscaler. Unfortunately, I have my own autoscaling system 🙂

I’m still at a loss as to what’s happening. I did find one condition under which my agent instances were killed, but I’m also seeing more puzzling issues that may not be Drone-related. Nevertheless, I need some way to work around this problem, because whenever it happens it prevents my agent instances from discovering that they are out of work and scaling down.

So here is my glorious hack, which (when run repeatedly, i.e. as a cron job) will discover any builds that have exceeded their time allotment and cancel them:

#!/bin/bash
# List incomplete builds, keep those that have been running longer than their
# repo timeout (minutes), and cancel each one through the API.
curl -s -H "Authorization: Bearer ${drone_token}" "https://${drone_host}/api/builds/incomplete" |
  jq -r '.[]
    | select(.build.status == "running" and ((now - .build.started) / 60) > .timeout)
    | "/repos/" + .slug + "/builds/" + (.build.number | tostring)' |
  xargs -r -I{} curl -s -X DELETE -H "Authorization: Bearer ${drone_token}" "https://${drone_host}/api{}"
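
I run it from cron every five minutes; the script path below is just illustrative:

*/5 * * * * drone_token=... drone_host=... /usr/local/bin/drone-cancel-stale-builds.sh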