We’ve been having an issue where builds on our Kubernetes pipelines seem to be getting stuck on various steps. A 1- or 2-minute step can take over 30 minutes before it recovers. The issue can occur on any step, including the clone step, and doesn’t occur consistently. There aren’t any errors in the logs for the runners or the build pods while a build is stuck, and the Drone server logs show this error once during the build:
{"build.id":52808,"build.number":29,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":2207438,"stage.id":56814,"time":"2021-03-08T13:28:14Z"}
I’ve also checked the nodes that the server and runners are on, and they aren’t running out of CPU or memory. However, the build pods do have several restarts.
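For reference, the checks were along these lines (the namespace and pod names below are placeholders, not our exact ones):

# node-level resource usage (requires metrics-server)
kubectl top nodes
kubectl describe node <node-name> | grep -A 8 "Allocated resources"

# restart counts and status of the build pods
kubectl get pods -n <build-namespace> -o wide
kubectl describe pod <build-pod-name> -n <build-namespace>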
Server version: 1.10.1
Runner built off commit: 5abe9d7
Here is the drone.yml file for one of the pipelines that have been affected. There are many other pipelines of varying sizes and tasks, all of which are having the same issue.
---
kind: pipeline
name: default
type: kubernetes

platform:
  os: linux
  arch: amd64

steps:
  # Build Docker image on every commit except master
  - name: build_image
    pull: if-not-exists
    image: <dind_image>
    environment:
      DOCKER_HOST: tcp://docker:2375
    commands:
      # wait for the docker service to be up before running docker build
      - n=0; while [ "$n" -lt 60 ] && ! docker stats --no-stream; do n=$(( n + 1 )); sleep 1; done
      - docker build -t <repo>:${DRONE_COMMIT_SHA} .
    when:
      branch:
        exclude:
          - master
      event:
        - push
{"build.id":55686,"build.number":72,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":2200026,"stage.id":59917,"time":"2021-03-11T14:47:11Z"}
{"build.id":55686,"build.number":72,"level":"debug","msg":"manager: build is finished, teardown","repo.id":2200026,"stage.id":59917,"time":"2021-03-11T15:47:11Z"}
It sounds like the build appeared stuck because Kubernetes was unable to schedule and start the pod, meaning the pod sits in a Pending state in Kubernetes until it can be assigned to a node and started. Perhaps you need to increase capacity to avoid prolonged scheduling delays?
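One way to confirm this the next time a step hangs is to look for Pending pods and their scheduling events while the build is stuck, for example (substitute the namespace your runner creates build pods in):

# pods stuck in Pending across the cluster
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# the Events section shows FailedScheduling and similar reasons
kubectl describe pod <stuck-pod-name> -n <build-namespace>

# recent events in the build namespace, newest last
kubectl get events -n <build-namespace> --sort-by=.lastTimestamp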
We have cycled the nodes on the cluster, which resolved the issue for a couple of days, but we are now getting reports of it occurring again. The issue isn’t limited to the start of a build; it can also happen in the middle of a build at the start of a new step, which is where we mainly see it.
We provide a guide (below) to help triage issues and submit patches for the Kubernetes runner. The Kubernetes runner is currently in Beta and may not be suitable for production use; you may want to consider using the Docker runner instead.
The guide you have linked me to says that support topics and issues should be created on this Discourse forum, so I will continue this topic. I am currently gathering any information that may help identify the underlying cause, whether that turns out to be an issue with our cluster or a bug in the runner.
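As a first step, I plan to enable debug and trace logging on the runner and capture its logs around a stuck build. A rough sketch of what I have in mind, assuming the runner is deployed as a Deployment named drone-runner-kube in a drone namespace (our actual names differ slightly):

# turn on verbose runner logging
kubectl -n drone set env deployment/drone-runner-kube DRONE_DEBUG=true DRONE_TRACE=true

# capture runner logs while a build is stuck
kubectl -n drone logs deploy/drone-runner-kube --since=1h > runner.log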
In the meantime, I was asked by colleagues who are responsible for license renewals what we can expect from support as paying customers. Is there any information that I can send them?
That works too. Any information you can provide is helpful.
Beta features are not subject to the same service levels (more details in section 2.3), since they are under active development and may be unstable or change significantly prior to a stable release. However, once we provide a stable release of the Kubernetes runner, it will be subject to our standard service levels.