When running jobs via the kube runner there are a few common issues we tend to hit:
- The job takes a long time to complete, or fails part-way through after running out of resources (e.g. the pod is OOMKilled)
- The job fails with some other unknown error and is terminated
Generally you can work through these by tweaking the pipeline's resource requests, or by checking events and pod state in Kubernetes to see what went wrong. There are problems with both approaches, though:
- Tweaking resource requests is a guessing game. Unless users have access to the cluster where the jobs run and/or some monitoring in place, they have to keep trying arbitrary resource requests until they get clean runs. This often ends with people over-provisioning resources just to get a job through.
- We run with ephemeral namespaces, so job names, namespace names etc. are randomly generated and all resources are torn down shortly after the job completes. With many users executing jobs in parallel, it becomes very difficult to work out which namespace or workload corresponds to a given job, because that information appears neither in the runner logs nor in the UI.
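To illustrate the first point: the calculation users end up doing by hand is roughly "observed peak plus some headroom". A minimal sketch of that, assuming the common Kubernetes quantity suffixes are enough (this is a hypothetical helper, not part of the runner, and deliberately not a full quantity parser):

```python
# Hypothetical sketch: turn an observed peak into a suggested resource
# request, instead of guessing. Handles only the common Kubernetes
# quantity suffixes (milli-CPU and binary/decimal memory units).

_SUFFIXES = {
    "m": 1e-3,                                 # milli (CPU, e.g. "250m")
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,     # binary memory units
    "k": 1e3, "M": 1e6, "G": 1e9,              # decimal units
}

def parse_quantity(q: str) -> float:
    """Convert a quantity string like '250m' or '512Mi' to a float in base units."""
    for suffix, factor in _SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)  # bare number, already in base units

def suggest_request(observed_peak: str, headroom: float = 1.2) -> float:
    """Observed peak plus a safety margin (default 20%), in base units."""
    return parse_quantity(observed_peak) * headroom
```

With observed peaks exposed, a user could run something like `suggest_request("512Mi")` once per resource instead of iterating on random values.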
I’d like to request some features to help with the above:
- Log the namespace and job name corresponding to the Drone job that's been kicked off (this should be fairly trivial)
- Expose that information in the UI
- Periodically fetch Pod resource utilisation during job execution and report the highest observed values for each job in the UI, so they can be used to tune resource requests to more sensible values
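The polling part of the last request could be as simple as a loop that samples usage and keeps a running maximum per resource. A minimal sketch of that peak-tracking logic, with the actual metrics call stubbed out (`sample_fn` is a placeholder for however the runner would query the metrics API; none of these names come from the runner itself):

```python
import time

def track_peaks(sample_fn, interval: float, stop_after: int) -> dict:
    """Poll sample_fn every `interval` seconds, keeping the highest value
    seen per resource. sample_fn should return a mapping such as
    {"cpu": 0.25, "memory": 1.2e8} for the job's pod.
    """
    peaks: dict = {}
    for _ in range(stop_after):
        for resource, value in sample_fn().items():
            peaks[resource] = max(peaks.get(resource, 0.0), value)
        time.sleep(interval)
    return peaks

# Usage with fake samples standing in for real metrics readings:
samples = iter([{"cpu": 0.1, "memory": 100}, {"cpu": 0.3, "memory": 80}])
peaks = track_peaks(lambda: next(samples), interval=0.0, stop_after=2)
# peaks == {"cpu": 0.3, "memory": 100}
```

In the real runner this would presumably run alongside the job and stop when the pod finishes, with the final `peaks` dict surfaced in the UI.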