[drone-runner-kube] Expose job information for debugging failures

When running jobs via the kube runner there are a few common issues we tend to hit:

  • Job takes a while to complete or fails early due to running out of resources (e.g. OOMKilled)
  • Job fails with some other unknown error and is terminated

Generally you can work through these by tweaking pipeline resource requests or by checking events etc. in Kubernetes to see what went wrong. There are problems with both of these approaches though:

  1. Tweaking resource requests is a bit of a guessing game. Unless the users of the system have access to where the jobs are run and/or some monitoring in place, they’d just have to keep testing random resource requests until they get some clean runs. Often this results in people over-provisioning resources just to get a job running.
  2. We’re running with ephemeral namespaces, so job names, namespace names etc. are randomly generated and all resources are torn down quickly after job completion. When you have loads of users executing jobs in parallel, it becomes very difficult for someone to figure out which namespace or workload corresponds to a job, because that information is not in the runner logs or the UI.

I’d like to request some features to help with the above:

  1. Log out the namespace + job name correlating to the Drone Job that’s been kicked off (should be fairly trivial to do)
  2. Expose that information on the UI
  3. Periodically fetch Pod resource utilisation during job execution and report the highest observed values on the UI for a Job, so this information can be used to tweak resource requests and set more sensible values
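
To make (1) and (3) a bit more concrete, here is a rough Go sketch of the idea. It is not taken from the runner's code: `JobRef`, `Usage`, `PeakTracker` and `TrackPeaks` are all hypothetical names, and the `sample` callback is a stand-in for however the runner would actually read Pod usage (presumably something like the Kubernetes metrics API, which needs metrics-server installed in the cluster).

```go
package main

import (
	"fmt"
	"time"
)

// JobRef ties a Drone build to the Kubernetes objects created for it.
// All field names are hypothetical and only for illustration.
type JobRef struct {
	Repo      string
	Build     int64
	Namespace string
	Pod       string
}

// LogLine formats the correlation the runner could log when a job starts.
func (j JobRef) LogLine() string {
	return fmt.Sprintf("repo=%s build=%d namespace=%s pod=%s",
		j.Repo, j.Build, j.Namespace, j.Pod)
}

// Usage is one sample of a Pod's resource consumption.
// Units are an assumption: CPU in millicores, memory in bytes.
type Usage struct {
	CPUMilli int64
	MemBytes int64
}

// PeakTracker remembers the highest CPU and memory values seen so far.
// It is not safe for concurrent use; a real runner would need locking.
type PeakTracker struct {
	peak Usage
}

// Observe folds one sample into the running maximum of each resource.
func (p *PeakTracker) Observe(u Usage) {
	if u.CPUMilli > p.peak.CPUMilli {
		p.peak.CPUMilli = u.CPUMilli
	}
	if u.MemBytes > p.peak.MemBytes {
		p.peak.MemBytes = u.MemBytes
	}
}

// Peak returns the highest values observed so far.
func (p *PeakTracker) Peak() Usage {
	return p.peak
}

// TrackPeaks calls sample every interval until sample reports false
// (job finished) and returns the highest values observed.
func TrackPeaks(sample func() (Usage, bool), interval time.Duration) Usage {
	var tracker PeakTracker
	for {
		u, ok := sample()
		if !ok {
			return tracker.Peak()
		}
		tracker.Observe(u)
		time.Sleep(interval)
	}
}
```

The runner would log `JobRef.LogLine()` once when the Pod is created, and store the result of `TrackPeaks` against the build so the UI can show it. Note that the CPU and memory peaks are tracked independently, so they may come from different samples.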

@Kash thanks, these are great suggestions, especially tracking resource usage and exposing it in the user interface. It would be interesting to show a graph of resource usage over the lifetime of the build. Once we have the data we could do some interesting things like trending resource usage of builds over time. I am going to forward this link to our product manager so that we can do some more research and figure out what it would take to get this added to the product.

Many thanks Brad. Graphs and trends over time would be an amazing addition and really valuable information to have.