jim5252
(Jim Hamill)
October 14, 2021, 11:38am
1
We have had hanging build issues raised by multiple tenants across our platform.
Our runner is on version 1.0.0-rc.1 and our server is on version 2.4.0.
The timeout is set to 1 hour across the platform, but builds are running over this limit until they have to be cancelled.
I have attached a gist with the corresponding drone files here: Hanging Build Drone Files · GitHub
jim5252
(Jim Hamill)
October 18, 2021, 8:57am
2
Are there any updates for this issue? It is affecting multiple tenants on our platform.
@jim5252,
Could you please enable trace logging and share the server and runner logs from that time for our review?
jim5252
(Jim Hamill)
October 21, 2021, 1:17pm
4
I have uploaded server logs for the 12th, and runner logs from 09:20-09:25 on that day, here: Drone Server and Runner for 12/10/21 · GitHub
Please let me know if you need anything else.
jim5252
(Jim Hamill)
October 26, 2021, 9:09am
5
Hi, are there any updates on this issue? @csgit
@jim5252,
I assume you are using the Kubernetes runner. It looks like the build is stuck because Kubernetes was unable to schedule and start the Pod: the Pod sits in a pending state until it can be assigned to a node and started.
Could you please take the latest runner (GitHub - drone-runners/drone-runner-kube at v1.0.0-beta.12), test the behaviour, and let us know?
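For context, a build Pod stuck in that pending state would look roughly like this in kubectl get pod <build-pod> -o yaml output. This is an illustrative sketch only; the condition values and scheduler message are hypothetical, not taken from this platform:

# Illustrative status stanza of a build Pod the scheduler cannot place.
# Values are hypothetical examples, not observed output from this thread.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "False"
      reason: Unschedulable
      message: "0/6 nodes are available: 6 Insufficient cpu."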
jim5252
(Jim Hamill)
October 26, 2021, 2:19pm
7
As stated in the original issue, we are already on a runner version newer than the one you have suggested: 1.0.0-rc.1. We upgraded to fix the issue of builds failing with exit code 2 (Release v1.0.0-rc.1 · drone-runners/drone-runner-kube · GitHub).
jim5252
(Jim Hamill)
October 26, 2021, 2:22pm
8
@bradrydzewski @marko-gacesa is there any further input on this? This is still posing an issue to our tenants.
Can you please try the v1.0.0-rc.2 version? Thanks.
jim5252
(Jim Hamill)
October 26, 2021, 4:19pm
10
I have updated the runner version and will observe tomorrow whether this has made any improvement to tenant builds.
jim5252
(Jim Hamill)
October 27, 2021, 9:09am
11
A tenant reported an error to me today. I have checked the runner logs and can see:
$ k -n drone-ci logs drone-gl-runner-595f67bdc4-thlg2 -c runner | grep timely
time="2021-10-26T17:43:53Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-834ajvn39jo1s3wtl6q2 error="kubernetes has failed: container failed to start in timely manner: id=drone-834ajvn39jo1s3wtl6q2 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=notify_slack_deploy_notprod step.name=notify_slack_deploy_notprod thread=93
time="2021-10-26T17:43:53Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-2y5qk1g22cr8wo5narfb error="kubernetes has failed: container failed to start in timely manner: id=drone-2y5qk1g22cr8wo5narfb image=quay.io/ukhomeofficedigital/hashicorp-vault:1.6.0" image="quay.io/ukhomeofficedigital/hashicorp-vault:1.6.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=retrieve-preprod-secrets step.name=retrieve-preprod-secrets thread=93
time="2021-10-26T17:51:54Z" level=error msg="Engine: Container start timeout" build.id=217531 build.number=4635 container=drone-24admemipy5x6mx0o9n4 error="kubernetes has failed: container failed to start in timely manner: id=drone-24admemipy5x6mx0o9n4 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-2ftet7qdrh6h3irniji8 placeholder="drone/placeholder:1" pod=drone-247ckl5lmpzg4nsz8ce7 repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260633 stage.name=default stage.number=1 step=notify_slack_deploy_preprod step.name=notify_slack_deploy_preprod thread=93
time="2021-10-26T21:27:56Z" level=error msg="Engine: Container start timeout" build.id=217581 build.number=4636 container=drone-jfbwbm9c4fvpc7hius36 error="kubernetes has failed: container failed to start in timely manner: id=drone-jfbwbm9c4fvpc7hius36 image=docker.io/plugins/slack:1.0" image="docker.io/plugins/slack:1.0" namespace=drone-ck79oywgjvyju7l6fso4 placeholder="drone/placeholder:1" pod=drone-kz0isgff5j8sh8zt3n6v repo.id=177690 repo.name=entity-search-frontend repo.namespace=hodac stage.id=260690 stage.name=default stage.number=1 step=notify_slack_deploy_preprod step.name=notify_slack_deploy_preprod thread=94
Can you advise on this error?
jim5252
(Jim Hamill)
October 27, 2021, 11:15am
12
The container logs for all steps that exhibit this error return:
unable to retrieve container logs for docker://f2b0c833fc27b4b5761eea1300fbe389a3bd4718398ccfc4aca9efc5487f7b83
unable to retrieve container logs for docker://f6023ab57fcd0f468abb33dc3cd6176c832fc413f8b80c42feb8041f57444e47
etc
This error, the “Container start timeout” message, is shown when a container in a pod fails to start within 8 minutes. The runner replaces a step’s placeholder image with the actual image and then waits for a Kubernetes event confirming that the container has started. If that event does not arrive within 8 minutes, the error is returned.
There is an environment variable, DRONE_ENGINE_CONTAINER_START_TIMEOUT, which you can use to modify the timeout. The value is a number of seconds; the default is “480”, i.e. 8 minutes. The default should give Kubernetes enough time to download even a large image and start it, so the cause is probably something else.
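For example, the variable can be set on the runner’s Kubernetes Deployment. A minimal sketch, assuming a Deployment named drone-gl-runner in the drone-ci namespace (names inferred from the kubectl command earlier in this thread) and an illustrative 15-minute value:

# Sketch: raising the container start timeout on the runner Deployment.
# The Deployment/namespace names and timeout value are assumptions;
# adjust them to your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drone-gl-runner
  namespace: drone-ci
spec:
  replicas: 1
  selector:
    matchLabels:
      app: drone-runner
  template:
    metadata:
      labels:
        app: drone-runner
    spec:
      containers:
        - name: runner
          image: drone/drone-runner-kube:1.0.0-rc.2
          env:
            - name: DRONE_ENGINE_CONTAINER_START_TIMEOUT
              value: "900"  # seconds; default is "480" (8 minutes)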
Does this happen every time, or just occasionally (and how often)?
jim5252
(Jim Hamill)
October 28, 2021, 1:46pm
14
This appears to be happening regularly. One team is reporting very frequent
kubernetes has failed: container failed to start
and
container failed to start in timely manner
errors.
jim5252
(Jim Hamill)
October 29, 2021, 10:45am
15
The above issues appear to have been resolved by configuring higher CPU and memory via requests and limits.
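For anyone hitting the same symptom, a sketch of what that can look like per step in a Kubernetes-runner pipeline, using the runner’s resources.limits syntax. The step name, image, and values here are illustrative, not this platform’s actual configuration; container requests can also be configured globally on the runner via its DRONE_RESOURCE_REQUEST_CPU and DRONE_RESOURCE_REQUEST_MEMORY settings:

# Sketch of per-step resource limits for the Kubernetes runner.
# Step name, image, and values are illustrative only.
kind: pipeline
type: kubernetes
name: default

steps:
  - name: build
    image: node:14
    commands:
      - npm ci
      - npm run build
    resources:
      limits:
        cpu: 1000      # millicores
        memory: 512MiB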