[drone-runner-kube] Cleanup failures

It looks like there’s room for improvement in the kube-runner’s ability to clean up hanging jobs. I understand that it’s the runner’s job to terminate zombie builds, and that it can’t do so if the runner itself isn’t shut down gracefully. However, even with runner pods that have not been interrupted, we are seeing orphaned jobs in our clusters. We are running the server at v2.4.0 with the default cleanup settings, and the kube-runner at v1.0.0-rc2.

Are you aware of a bug relating to this? Would it be feasible for the kube-runner to keep track of the jobs it has started in a persistent store? It could then periodically validate the status of those jobs and clean up any hanging ones.

We would like to avoid implementing our own housekeeping solution, so your views on this would be appreciated.
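
For reference, the kind of external sweeper we’d rather not maintain ourselves would look roughly like the minimal client-go sketch below. The namespace, the label selector, the retention cutoff, and the schedule are all assumptions to be adjusted to your setup; this is not how the runner itself tracks its pods.

```go
// Hypothetical standalone sweeper (not part of drone-runner-kube):
// periodically lists pods assumed to be created by the runner and deletes
// any that have sat in a failed phase for longer than a cutoff.
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the sweeper runs in-cluster with a service account allowed
	// to list and delete pods in the runner namespace.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const namespace = "drone"     // assumption: namespace the runner schedules pods into
	const cutoff = 24 * time.Hour // assumption: how long a failed pod may linger

	for range time.Tick(30 * time.Minute) {
		pods, err := client.CoreV1().Pods(namespace).List(context.Background(), metav1.ListOptions{
			// Assumption: pipeline pods carry a drone-specific label;
			// replace with whatever labels your runner actually sets.
			LabelSelector: "io.drone=true",
		})
		if err != nil {
			log.Printf("list pods: %v", err)
			continue
		}
		for _, pod := range pods.Items {
			if pod.Status.Phase != corev1.PodFailed {
				continue
			}
			age := time.Since(pod.CreationTimestamp.Time)
			if age < cutoff {
				continue
			}
			log.Printf("deleting stale failed pod %s (age %s)", pod.Name, age)
			if err := client.CoreV1().Pods(namespace).Delete(context.Background(), pod.Name, metav1.DeleteOptions{}); err != nil {
				log.Printf("delete %s: %v", pod.Name, err)
			}
		}
	}
}
```

Even with something like this in place, it would only tidy up the orphaned pods after the fact rather than address why the runner leaves them behind in the first place.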

edit: I’ve emailed some trace logs to Harness, mentioning this post.


I noticed the same behaviour, mainly with failed pipelines. The pods remain stuck in the Error state for weeks.

We run the kube-runner at v1.0.0-rc2.