Pipelines randomly getting stuck at various stages

Hello,

We’ve been having an issue where builds on our Kubernetes pipelines seem to be getting stuck on various steps. A 1- or 2-minute step can take over 30 minutes before it recovers. The issue can occur in any step, including the clone step, and doesn’t occur consistently. There aren’t any errors in the logs for the runners or the build pods when they are stuck, and the Drone server logs show this error once during the build:

{"build.id":52808,"build.number":29,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":2207438,"stage.id":56814,"time":"2021-03-08T13:28:14Z"}

I’ve also checked the nodes that the server and runners are on, and they aren’t running out of CPU or memory. However, the build pods do have several restarts.
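
For reference, the checks were roughly along these lines (the namespace and pod names here are placeholders rather than our real ones):

# Node CPU/memory headroom (requires metrics-server)
kubectl top nodes
# READY / RESTARTS for the build pod in the namespace the runner creates
kubectl get pods -n <build-namespace> -o wide
# Restart reasons and recent events for a specific build pod
kubectl describe pod <build-pod> -n <build-namespace>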

Server version: 1.10.1
Runner built off commit: 5abe9d7

Hello @DJamesHO,

Could you please provide the details below so we can understand the issue:

  1. The YAML file used
  2. Trace-level runner logs
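
If trace logging is not already enabled, it can usually be switched on by setting the standard Drone runner environment variables on the runner deployment and restarting it. A sketch, assuming the runner is deployed as a Deployment named drone-runner-kube in the drone namespace (adjust both to your install):

# Deployment name and namespace below are assumptions; adjust to your setup.
kubectl -n drone set env deployment/drone-runner-kube DRONE_DEBUG=true DRONE_TRACE=true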

Regards,
Harness Support

Here is the drone.yml file for one of the pipelines that has been affected. There are many other pipelines of varying sizes and tasks, all of which are having the same issue.

---
kind: pipeline
name: default
type: kubernetes

platform:
  os: linux
  arch: amd64

steps:
# Build Docker image on every commit except Master
- name: build_image
  pull: if-not-exists
  image: <dind_image>
  environment:
    DOCKER_HOST: tcp://docker:2375
  commands:
  # wait for docker service to be up before running docker build
  - n=0; while [ "$n" -lt 60 ] && ! docker stats --no-stream > /dev/null 2>&1; do n=$(( n + 1 )); sleep 1; done
  - docker build -t <repo>:${DRONE_COMMIT_SHA} .
  when:
    branch:
      exclude:
      - master
    event:
    - push

# Scan image for vulnerabilities
- name: scan-image
  image: <scanning_image>
  pull: always
  environment:
    IMAGE_NAME: :${DRONE_COMMIT_SHA}
    SHOW_ALL_VULNERABILITIES: true
    TOLERATE: medium
    FAIL_ON_DETECTION: false
  when:
    event:
    - push

# Build Docker image when git tag issued
- name: build_tag_ecr
  pull: if-not-exists
  image: plugins/ecr
  environment:
    AWS_REGION: eu-west-2
  settings:
    access_key:
      from_secret: AWS_ACCESS_KEY_ID
    secret_key:
      from_secret: AWS_SECRET_ACCESS_KEY
    repo: <repo>
    registry: <registry>
    tags:
    - ${DRONE_TAG}
  when:
    event:
    - tag

- name: build_push_ecr
  pull: if-not-exists
  image: plugins/ecr
  environment:
    AWS_REGION: eu-west-2
  settings:
    access_key:
      from_secret: AWS_ACCESS_KEY_ID
    secret_key:
      from_secret: AWS_SECRET_ACCESS_KEY
    repo: <repo>
    registry: <registry>
    tags:
    - latest
    - ${DRONE_COMMIT_SHA}
  when:
    branch:
    - master
    - feature/*
    event:
    - push

# Deploys image to <namespace1> when promote event triggered by dev
- name: deploy-to-kube
  pull: if-not-exists
  image: <kube_image>
  commands:
  - bin/deploy.sh
  environment:
    IMAGE_URL: <image>:${DRONE_COMMIT_SHA}
    KUBE_TOKEN_ACP_NOTPROD:
      from_secret: kube_token_acp_notprod
  when:
    event:
    - promote

# Builds image and pushes to ECR when feature MR merged to Master
- name: build_push_ecr_master
  pull: if-not-exists
  image: plugins/ecr
  environment:
    AWS_REGION: eu-west-2
  settings:
    access_key:
      from_secret: AWS_ACCESS_KEY_ID
    secret_key:
      from_secret: AWS_SECRET_ACCESS_KEY
    repo: <repo>
    registry: <registry>
    tags:
    - latest
    - ${DRONE_COMMIT_SHA}
  when:
    branch:
    - master
    event:
    - push

# Deploys image to <namespace2> when PR is merged with Master
- name: deploy-to-<namespace2>
  pull: if-not-exists
  image: <kube_image>
  commands:
  - bin/deploy-sys-test.sh
  environment:
    IMAGE_URL: <image>:${DRONE_COMMIT_SHA}
    KUBE_TOKEN_ACP_NOTPROD_SYSTEST:
      from_secret: KUBE_TOKEN_ACP_NOTPROD_SYSTEST
  when:
    branch:
    - master
    event:
    - push

- name: clone-repo
  image: alpine/git
  commands:
  - git clone <git_repo>
  - echo "The current branch listed below"
  - git branch -a
  - echo "Change directory to api automation"
  - cd api-automation
  - echo "checkout feature branch"
  - git checkout '<branch>'
  - echo "List content in branch <branch>"
  - ls -l
  - echo "Modifying Namespace value"
  - sed -i "s/# KUBE_NAMESPACE=<namespace1>/KUBE_NAMESPACE=<namespace2>/" ./bin/deploy.sh
  - echo "check contents"
  - cat ./bin/deploy.sh

services:
- name: docker
  image: <dind_image>
  environment:
    DOCKER_TLS_CERTDIR: ""

- name: anchore-submission-server
  image: <scanning_image>
  pull: always
  commands:
  - /run.sh server

Here are the runner logs relating to this build:

time="2021-03-11T14:47:11Z" level=debug msg="stage details fetched" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 thread=12

time="2021-03-11T14:47:11Z" level=debug msg="updated stage to running" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 thread=12

time="2021-03-11T15:24:44Z" level=debug msg="received exit code 0" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 step.name=clone thread=12

time="2021-03-11T15:31:35Z" level=debug msg="received exit code 0" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 step.name=build_image thread=12

time="2021-03-11T15:34:58Z" level=debug msg="received exit code 0" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 step.name=scan-image thread=12

time="2021-03-11T15:47:11Z" level=debug msg="destroying the pipeline environment" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 thread=12

time="2021-03-11T15:47:17Z" level=debug msg="successfully destroyed the pipeline environment" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project>stage.id=59917 stage.name=default stage.number=1 thread=12

time="2021-03-11T15:47:17Z" level=debug msg="updated stage to complete" build.id=55686 build.number=72 duration=3600 repo.id=2200026 repo.name=<repo> repo.namespace=<project> stage.id=59917 stage.name=default stage.number=1 thread=12

time="2021-03-11T15:47:17Z" level=debug msg="done listening for cancellations" build.id=55686 build.number=72 repo.id=2200026 repo.name=<repo> repo.namespace=<project> stage.id=59917 stage.name=default stage.number=1 thread=12

Here are the server logs:

{"build.id":55686,"build.number":72,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":2200026,"stage.id":59917,"time":"2021-03-11T14:47:11Z"}

{"build.id":55686,"build.number":72,"level":"debug","msg":"manager: build is finished, teardown","repo.id":2200026,"stage.id":59917,"time":"2021-03-11T15:47:11Z"}

This is the pod 19 minutes into the build. It was still stuck on the clone step at this point:

NAME READY STATUS RESTARTS AGE
drone-zii7z20zlffo89vr1351 0/11 ContainerCreating 0 19m

Here are the events from around that time:

Events:

Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 19m default-scheduler Successfully assigned drone-ftvu6xcu9z2idm3h07zp/drone-zii7z20zlffo89vr1351 to <node>
Normal Pulled 18m kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 16m kubelet Error: context deadline exceeded
Normal Pulled 16m kubelet Container image "drone/placeholder:1" already present on machine
Normal Created 15m kubelet Created container drone-9ov6ur02tt9jbmhvvuac
Normal Started 15m kubelet Started container drone-9ov6ur02tt9jbmhvvuac
Normal Pulling 15m kubelet Pulling image "drone/placeholder:1"
Normal Pulled 15m kubelet Successfully pulled image "drone/placeholder:1"
Normal Created 13m kubelet Created container drone-57of4y5bqrah1rs50qml
Normal Started 13m kubelet Started container drone-57of4y5bqrah1rs50qml
Normal Pulled 13m kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 11m kubelet Error: context deadline exceeded
Normal Pulling 11m kubelet Pulling image "drone/placeholder:1"
Normal Pulled 11m kubelet Successfully pulled image "drone/placeholder:1"
Warning Failed 9m40s kubelet Error: context deadline exceeded
Normal Pulled 9m40s kubelet Container image "drone/placeholder:1" already present on machine
Normal Created 8m16s kubelet Created container drone-6ex8eezik118aemf7507
Normal Started 8m15s kubelet Started container drone-6ex8eezik118aemf7507
Normal Pulled 8m15s kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 6m15s kubelet Error: context deadline exceeded
Normal Pulled 6m15s kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 4m15s kubelet Error: context deadline exceeded
Normal Pulled 4m15s kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 2m15s kubelet Error: context deadline exceeded
Normal Pulled 2m15s kubelet Container image "drone/placeholder:1" already present on machine
Warning Failed 15s kubelet Error: context deadline exceeded
Normal Pulled 15s kubelet Container image "drone/placeholder:1" already present on machine

This is the pod 11 minutes after the build was terminated due to a timeout:

NAME READY STATUS RESTARTS AGE
drone-zii7z20zlffo89vr1351 0/11 Terminating 13 71m

Sounds like it appeared stuck because Kubernetes was unable to schedule and start the Pod, meaning the Pod sits in a pending state in Kubernetes until it can be assigned to a node and started. Perhaps you need to increase capacity to avoid prolonged scheduling delays?
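
One quick way to check for that is to list pods stuck in the Pending phase, which covers both unscheduled pods and pods whose containers have not all started. A sketch (narrow the namespace to wherever the runner creates build pods):

# Pods whose containers are not all up yet, including pods still waiting to be scheduled
kubectl get pods --all-namespaces --field-selector=status.phase=Pending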

I don’t think it’s that, as you can see from the events that the pod has already been scheduled onto a node:

Normal Scheduled 19m default-scheduler Successfully assigned drone-ftvu6xcu9z2idm3h07zp/drone-zii7z20zlffo89vr1351 to <node>

We have cycled the nodes on the cluster, and that resolved the issue for a couple of days, but we are now getting reports of it occurring again. The issue isn’t just happening at the start of a build; it can also happen in the middle of a build at the start of a new step (which is when we mainly see it).
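
Next time a step hangs mid-build I’ll capture the build pod’s container statuses and recent events while it is stuck, roughly like this (the namespace and pod names are placeholders):

# Recent events for the stuck build pod, oldest first
kubectl -n <build-namespace> get events --field-selector involvedObject.name=<build-pod> --sort-by=.lastTimestamp
# Per-container restart counts and waiting/running state
kubectl -n <build-namespace> get pod <build-pod> -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.state}{"\n"}{end}'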

We provide a guide for helping triage issues and submitting patches for the Kubernetes runner (below). The Kubernetes runner is currently in Beta and may not be suitable for production use. You may want to consider using the Docker runner instead.

Hi Brad,

The guide you have linked me to says that I should create topics for support, and issues on this discourse forum, so I will continue this topic. I am currently gathering any information that may help with identifying the underlying cause, whether that is an issue with our cluster or a bug in the runner.

In the meantime, I was asked by colleagues who are responsible for license renewals what we can expect from support as paying customers. Is there any information that I can send them?

The guide you have linked me to says that I should create topics for support, and issues on this discourse forum, so I will continue this topic. I am currently gathering any information that may help with identifying the underlying cause, whether that is an issue with our cluster or a bug in the runner.

That works too. Any information you can provide is helpful.

In the meantime, I was asked by colleagues who are responsible for license renewals what we can expect from support as paying customers. Is there any information that I can send them?

Beta features are not subject to the same service levels (more details in section 2.3) given they are under active development and may be unstable or may change significantly prior to stable release. However, once we provide a stable release of the Kubernetes runner it would be subject to our standard service levels.

Moving the discussion to this thread: