Runners must be terminated gracefully; they must not be force-terminated while pipelines are running, otherwise those pipelines are left stuck in a running state.
The server does not track runner connectivity, for a number of reasons: connections are not persistent, and the runners use long polling, frequently connecting and disconnecting to avoid TCP timeouts, which are common in many corporate networks. If you stop or restart the server while builds are running, or a runner loses connectivity with the server, the runner keeps running its pipelines and uploads the results with a backoff once it can re-establish a connection. This decentralized design makes the system more resilient to outages and flaky networks, but the tradeoff is that you must not shut down a runner while it is running a pipeline.
The server does scan for stuck jobs every 24 hours and terminates them. If you want to reduce the interval and scan more frequently, you can adjust the cleanup interval and deadlines by passing the following environment variables to your Drone server:
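As a rough illustration, the cleanup settings take the shape of the snippet below. Treat the variable names and values here as assumptions and verify the exact names against the Drone server documentation for your version:

```shell
# Hypothetical example values -- verify the exact variable names
# against the Drone server docs for your version.
DRONE_CLEANUP_INTERVAL=1h            # how often the stuck-job scan runs
DRONE_CLEANUP_DEADLINE_RUNNING=2h    # running jobs older than this are terminated
DRONE_CLEANUP_DEADLINE_PENDING=24h   # pending jobs older than this are terminated
```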
Just for clarification: if a (Docker) runner receives a signal to stop gracefully, does it 1. stop accepting (read: polling for) new jobs to execute, and 2. block until its currently executing jobs have finished before exiting itself?
Setting the runner container's grace period longer than the maximum build timeout (default 60 min) should do the job, provided the OS itself does not reap the processes at some point.
Would it be hard to implement a different, configurable strategy for runners that automatically cancels executing jobs (and notifies the server thereof) before exiting?