[solved] [drone-runner-kube] Not correctly setting resource requests for Steps

I’ve noticed some unusual behaviour when setting Resource Requests for Pipeline Steps. In our runner we provide the following two env vars:

DRONE_RESOURCE_REQUEST_CPU: 200m
DRONE_RESOURCE_REQUEST_MEMORY: 100MiB

I assumed (perhaps wrongly) that these are the default CPU and Memory Requests applied to each individual Step in the absence of Resource Requests defined in the Pipeline. Instead, it seems this is the maximum Request applied to the entire Pipeline (and it is not overridable). So every Step in the Pipeline ends up with:

    Requests:
      cpu:     1m
      memory:  4Mi

But the first Step (usually the git clone) gets whatever remains of the total after the above values are subtracted. So if you have 4 more Steps after the git clone, the clone Step would reserve:

    Requests:
      cpu:     196m
      memory:  84Mi
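
For context, this is roughly where those variables live on our side: a minimal sketch of the relevant part of the drone-runner-kube Deployment. The image tag is an assumption, other required settings (e.g. DRONE_RPC_HOST, DRONE_RPC_SECRET) are omitted, and the values simply mirror the ones quoted above.

    spec:
      containers:
        - name: runner
          image: drone/drone-runner-kube:latest   # assumed tag
          env:
            # total requests applied to the whole build Pod, per the behaviour above
            - name: DRONE_RESOURCE_REQUEST_CPU
              value: "200m"
            - name: DRONE_RESOURCE_REQUEST_MEMORY
              value: "100MiB"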

If this is intentional behaviour, then the docs should probably be updated with a fuller explanation that setting these vars on the runner means they define the total reserved resources for the entire Pod that is scheduled (overriding any specification for individual Steps in a user’s Pipeline), split across the Containers.

I’ll test removing both env variables to make sure each step can define its own Requests. It would still be nice to set a minimum applied to all Steps (at the runner level), so that each is given a buffer of reserved resources, reducing the likelihood of over-allocating Jobs onto a single Node (another problem we hit).

It’s related to this logic which was merged in December but not officially tagged: drone-runner-kube/compiler.go at master · drone-runners/drone-runner-kube · GitHub

I’ve also tested the following two env vars (in isolation) introduced in that commit:

  • DRONE_RESOURCE_MIN_REQUEST_CPU
  • DRONE_RESOURCE_MIN_REQUEST_MEMORY

They override the Resource Requests for each Step, regardless of what the Step has defined. I would expect the runner to apply the higher of the two values: the env var above or what is defined in the Step. For example, if the env var is 200MiB for Memory and the Step requests 250MiB, the latter should be set in the Spec.
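
To make that expectation concrete, here is a minimal sketch (the pipeline name, step name and image are illustrative) of a Step requesting more memory than a runner that sets DRONE_RESOURCE_MIN_REQUEST_MEMORY to 200MiB:

kind: pipeline
type: kubernetes
name: min-request-example      # illustrative name

steps:
  - name: build                # illustrative step
    image: alpine
    commands:
      - echo test
    resources:
      requests:
        memory: 250MiB         # higher than the 200MiB minimum set on the runner

# Expected: the Spec gets the larger of the two values, i.e. 250MiB
# Observed: the Spec always gets the MIN_REQUEST value, i.e. 200MiB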

This thread explains the changes we are making to how the kubernetes runner works:

Please keep in mind the Kubernetes runner is in Beta and we do not have a stable specification. We are still making breaking changes which may result in inconsistencies in the docs (despite our best efforts to keep them up to date). If you need something more stable, we recommend the Docker runner.

Thanks for pointing us to the post that sets out the current resource strategy. According to our tests with the current master (5abe9d7), there doesn’t seem to be any way of overriding DRONE_RESOURCE_MIN_REQUEST_CPU or MEMORY values in pipeline manifests.

Each container gets the resources specified by the MIN_REQUEST env vars, regardless of the requests specified in the pipeline manifest. We tested the following four scenarios and checked the actual resource allocations reported by the K8s API (an example of the runner env for the last scenario follows the list):

  • no env vars are set in runner deployment
  • only minimum request values are set
  • only stage request values are set
  • both minimum and stage values are set.
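
For the last of those scenarios, the relevant part of our runner Deployment env looked roughly like the snippet below; the concrete values are illustrative and simply follow the format used earlier in the thread.

    env:
      # minimum applied to every step container
      - name: DRONE_RESOURCE_MIN_REQUEST_CPU
        value: "200m"
      - name: DRONE_RESOURCE_MIN_REQUEST_MEMORY
        value: "200MiB"
      # total ("stage") request for the whole build pod
      - name: DRONE_RESOURCE_REQUEST_CPU
        value: "1000m"
      - name: DRONE_RESOURCE_REQUEST_MEMORY
        value: "1000MiB"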

The pipeline manifests used for testing contained four steps (in addition to git clone); a sketch of one such manifest follows the list:

  • first container does not specify any requests
  • second requests resources that are lower than the MIN_REQUEST values set in the runner deployment
  • third requests resources that are higher than MIN_REQUEST but lower than stage values set in the runner deployment
  • fourth requests resources that are higher than stage values set in the runner deployment.
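
Here is a sketch of one such manifest. The concrete numbers are purely illustrative; they assume the runner (in the scenarios where the vars are set) uses roughly 200m CPU / 200MiB memory as the minimum values and 1000m CPU / 1000MiB memory as the stage values.

kind: pipeline
type: kubernetes
name: request-matrix             # illustrative name

steps:
  - name: no-requests            # 1. specifies nothing
    image: alpine
    commands:
      - echo one

  - name: below-min              # 2. lower than the MIN_REQUEST values
    image: alpine
    commands:
      - echo two
    resources:
      requests:
        cpu: 100
        memory: 100MiB

  - name: between-min-and-stage  # 3. above MIN_REQUEST, below the stage values
    image: alpine
    commands:
      - echo three
    resources:
      requests:
        cpu: 400
        memory: 400MiB

  - name: above-stage            # 4. higher than the stage values
    image: alpine
    commands:
      - echo four
    resources:
      requests:
        cpu: 1500
        memory: 1500MiB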

Looking at the post you linked, it is not clear to us whether this is the intended behaviour. Should we be able to set requests for individual pipeline steps, or should we only be setting limits per step instead?

We are also observing that the stage approach does not have the desired effect of reserving enough resources for each given step. A single step can go above its reserved amount (4Mi or the MIN_REQUEST value) and get OOMKilled even when the stage env vars reserve plenty of resources for the entire pod.

Do you have any suggestions?

@bradrydzewski would you be able to advise on the above investigation please?


Yes @bradrydzewski, you might be busy with other stuff since Drone got acquired, but please don’t forget about the old user base. We still rely on Drone, and wrong resource allocation is what makes the runner unstable now.

Recap:

One bug remains outstanding based on the documentation provided: the resources.requests property is NOT currently taken into account.

So how can we make sure we are allocating enough resources for a step, or for the entire pod if per-container is not feasible?

OK, maybe it’s just a case of lacking documentation. If the docs clearly stated that resource requests can ONLY be configured globally in the pipeline, and not per step, that would make a big difference. The way it is currently presented allows for a different interpretation: step limits are evidently applied per container, so we’d expect to be able to set requests per container as well. But that simply is not possible, and the docs fail to say so…

The decision to divide resources.requests evenly across containers is the problematic part; why that is forced puzzles me. If we are not allowed to set resources.requests per step, we run out of memory on steps that are assigned too little, because it is blindly assumed that distributing a global value evenly is fine. What is going on here? Why stray from the Kubernetes way? It seems trivial to let us assign resources per step.

Hey @Morriz ,

My original issue was addressed as part of this release, which included improvements to the resource-setting logic: GitHub - drone-runners/drone-runner-kube at v1.0.0-beta.9

The docs (Resources | Drone) do mention that resource requests tell Kubernetes how much a pipeline needs, and requests appear in the example YAML, but I agree it could definitely be made clearer, as this caught me out as well.

The Pod created by Drone only runs one container (Step) at a time. If you defined requests at a per-Step level, the sum of all those resources would be reserved on the Node for the entire Pod, which is wasteful because you’re overprovisioning. I believe this is why they went with defining requests at the Pipeline level. It shouldn’t matter how the total is divided across the containers, because the entire amount for the Pipeline is reserved. You also cannot patch a running Pod to change resource requests, which is why it wouldn’t work to define requests at the Step (container) level and patch them as execution moves from Step to Step (to avoid overallocating resources).
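
To put rough (made-up) numbers on the overprovisioning point: five Steps that each requested 200m CPU / 200MiB of memory would produce a Pod whose container requests sum to 1000m / 1000MiB, and Kubernetes reserves that full sum on the Node even though only one Step container is doing real work at any moment. The Pipeline-level approach lets the same build reserve only what it needs at its peak, along these lines:

kind: pipeline
type: kubernetes
name: shared-request-example   # illustrative name

# One request covers the whole Pod: Kubernetes reserves 200m / 200MiB on the
# Node instead of the 1000m / 1000MiB that five per-Step requests would sum to.
resources:
  requests:
    cpu: 200
    memory: 200MiB

steps:
  - name: step-1               # steps 2-5 would look the same, each running
    image: alpine              # one at a time inside the same Pod
    commands:
      - echo hello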

Just to add: I raised a feature request ([drone-runner-kube] Expose job information for debugging failures) which, if prioritised and addressed, should help a lot with understanding what is being allocated and consumed during execution to support setting more sensible values.

Resource requests are defined at the pipeline level and resource limits are defined at the step level. This decision was made based on the discussion in this thread and was modeled after Tekton.

kind: pipeline
type: kubernetes

# resource requests are defined on the pipeline level
resources:
  requests:
    cpu: 2000
    memory: 2000MiB

steps:
  - name: en
    image: alpine
    commands:
      - echo hello
    resources:
      # resource limits are defined for each step
      limits:
        cpu: 1000
        memory: 1000MiB

  - name: es
    image: alpine
    commands:
      - echo hola

  - name: fr
    image: alpine
    commands:
      - echo bonjour

We do have documentation on how to set resource requests both globally and in the yaml (you can find it here). I will pass along feedback that our documentation may be confusing.