[kubernetes runner] Builds are hanging after upgrade

We have had hanging build issues raised by multiple tenants across our platform.

Our runner is on version 1.0.0-rc.1 and our server is on version 2.4.0.

The build timeout has been set to 1 hour across the platform, but builds are running past this limit until they have to be cancelled.

I have attached a gist with the corresponding drone files here: Hanging Build Drone Files · GitHub

Are there any updates for this issue? It is affecting multiple tenants on our platform.

@jim5252,

Could you please enable trace logging and share the server and runner logs from around that time for our review?
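
For reference, trace logging can be enabled with environment variables on the two deployments. A minimal sketch, assuming the server and runner run as Kubernetes Deployments (the variable names are the standard Drone ones, so please double-check them against your manifests):

# server deployment, container env
- name: DRONE_LOGS_DEBUG
  value: "true"
- name: DRONE_LOGS_TRACE
  value: "true"

# runner deployment, container env
- name: DRONE_DEBUG
  value: "true"
- name: DRONE_TRACE
  value: "true"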

I have uploaded the server logs for the 12th and the runner logs from 09:20-09:25 on that day here: Drone Server and Runner for 12/10/21 · GitHub

Please let me know if you need anything else.

Hi @csgit, are there any updates on this issue?

@jim5252,

I assume you are using the Kubernetes runner. It looks like the build is stuck because Kubernetes was unable to schedule and start the Pod, meaning the Pod sits in a Pending state until it can be assigned to a node and started.
If possible, could you please take the latest runner (GitHub - drone-runners/drone-runner-kube at v1.0.0-beta.12), test the behaviour, and let us know the result?

As stated in the original post, we are already on a runner version newer than the one you have suggested: 1.0.0-rc.1. We upgraded to fix the failures with exit code 2 addressed in Release v1.0.0-rc.1 · drone-runners/drone-runner-kube · GitHub.

@bradrydzewski @marko-gacesa, is there any further input on this? It is still posing an issue for our tenants.

Can you please try with version v1.0.0-rc.2? Thanks.

I have updated the runner version and will observe tomorrow whether this has improved tenant builds.

An error that was reported to me today by a tenant is:

I have checked the runner logs and can see:

$ k -n drone-ci logs drone-gl-runner-595f67bdc4-thlg2 -c runner | grep timely
time="2021-10-26T17:43:53Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-834ajvn39jo1s3wtl6q2 error="kubernetes has failed: container failed to start in timely manner: id=drone-834ajvn39jo1s3wtl6q2 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=notify_slack_deploy_notprod step.name=notify_slack_deploy_notprod thread=93
time="2021-10-26T17:43:53Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-2y5qk1g22cr8wo5narfb error="kubernetes has failed: container failed to start in timely manner: id=drone-2y5qk1g22cr8wo5narfb image=quay.io/ukhomeofficedigital/hashicorp-vault:1.6.0" image="quay.io/ukhomeofficedigital/hashicorp-vault:1.6.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=retrieve-preprod-secrets step.name=retrieve-preprod-secrets thread=93
time="2021-10-26T17:51:54Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-24admemipy5x6mx0o9n4 error="kubernetes has failed: container failed to start in timely manner: id=drone-24admemipy5x6mx0o9n4 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=notify_slack_deploy_preprod step.name=notify_slack_deploy_preprod thread=93
time="2021-10-26T21:27:56Z" level=error msg="Engine: Container start timeout" build.id=217581 build.number=4636 container=drone-jfbwbm9c4fvpc7hius36 error="kubernetes has failed: container failed to start in timely manner: id=drone-jfbwbm9c4fvpc7hius36 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-ck79oywgjvyju7l6fso4 placeholder="drone/placeholder:1" pod=drone-kz0isgff5j8sh8zt3n6v repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260690 stage.name=default stage.number=1 step=notify_slack_deploy_preprod step.name=notify_slack_deploy_preprod thread=94

Can you advise on this error?

The container logs for every step that exhibits this error only contain:

unable to retrieve container logs for docker://f2b0c833fc27b4b5761eea1300fbe389a3bd4718398ccfc4aca9efc5487f7b83

unable to retrieve container logs for docker://f6023ab57fcd0f468abb33dc3cd6176c832fc413f8b80c42feb8041f57444e47

etc

This error, the “Container start timeout” message, is shown when a container in a pod fails to start within 8 minutes. The runner replaces a step's image (swapping the placeholder image for the actual image) and waits for a Kubernetes event indicating that the container has started. If that event does not arrive within 8 minutes, the error is returned.

There is an environment variable, DRONE_ENGINE_CONTAINER_START_TIMEOUT, which you can use to modify this timeout. The value is a number of seconds; the default is “480”, i.e. 8 minutes. The default should give Kubernetes enough time to download and start even a large image, so the cause is probably something else.
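
For example, raising the timeout to 15 minutes would mean adding something like the following to the runner's container env (a sketch only, assuming the runner runs as a Kubernetes Deployment; adjust the value to whatever suits your cluster):

# runner deployment, container env
# 900 seconds = 15 minutes; the default of 480 seconds (8 minutes) applies if unset
- name: DRONE_ENGINE_CONTAINER_START_TIMEOUT
  value: "900"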

Does this happen every time, or just occasionally (and how often)?

This appears to be happening regularly. One team is reporting very frequent

kubernetes has failed: container failed to start

and

container failed to start in timely manner

errors

The above issues appear to have been resolved by configuring higher CPU and memory requests and limits.
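
For anyone hitting the same thing, this is roughly what per-step requests and limits look like in a drone-runner-kube pipeline (a sketch only, not our exact configuration; the step name and image are taken from the logs above, the values are illustrative, and the resources syntax should be checked against the kube runner docs for your version):

kind: pipeline
type: kubernetes
name: default

steps:
- name: notify_slack_deploy_notprod
  image: plugins/slack:1.0
  resources:
    # cpu is in millicores, memory is a Kubernetes quantity string
    requests:
      cpu: 250
      memory: 256MiB
    limits:
      cpu: 500
      memory: 512MiB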