Would anyone know how to prevent Exit Code 137 from happening?
We run about 50 builds a day, and 2-3 of them hit Exit Code 137. We have not been able to determine the root cause.
Each of our pipelines runs on an m6a.2xlarge node with 8 vCPUs and 32 GB of RAM (the request is 5 CPUs and 6 GB of RAM per pipeline). The containers show memory utilization well below 4 GB, so we have plenty of headroom.
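For context, the request is expressed roughly like this in our pipeline YAML. This is a simplified sketch: the step name, image, commands, and exact values are illustrative, and the `resources` block follows the kubernetes runner's step syntax as I understand it, so it is worth double-checking against the runner docs:

```yaml
kind: pipeline
type: kubernetes
name: default

steps:
- name: test-worker        # illustrative step name
  image: node:18           # illustrative image
  commands:
  - npm ci
  - npm test
  resources:
    requests:
      cpu: 5000            # millicores, i.e. 5 CPUs
      memory: 6GiB
    # no limits are set, so a step can in principle burst well past its request
```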
Exit code 137 means the process was killed with SIGKILL (128 + 9). There are two common reasons:

- the build was cancelled because it exceeded the repository timeout (the timeout can be increased in the repository settings in the user interface), or
- the build was killed by the host OS's OOM killer because the node ran out of memory.
Also note that the kubernetes runner is not recommended for production use because it is a community contribution, still has beta status, and is not yet stable. If you are looking to set up a production installation, we would strongly recommend installing the docker runner on Kubernetes instead (an official helm chart exists for it).

The docker runner is stable, with optional commercial support. The kubernetes runner is experimental and may require your team to get hands-on with the code to triage issues. Just something to consider as you ramp up your usage.
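If you do make that switch, the docker runner is driven by a handful of environment variables, so running it on Kubernetes essentially comes down to a small Deployment like the sketch below. The hostname, secret name, and capacity are placeholders, and I have left out the Docker daemon the runner needs (for example a dind sidecar or a host socket mount, with `DOCKER_HOST` pointed at it) to keep the sketch short:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drone-runner-docker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: drone-runner-docker
  template:
    metadata:
      labels:
        app: drone-runner-docker
    spec:
      containers:
      - name: runner
        image: drone/drone-runner-docker:1
        ports:
        - containerPort: 3000              # runner dashboard
        env:
        - name: DRONE_RPC_PROTO
          value: https
        - name: DRONE_RPC_HOST
          value: drone.example.com         # placeholder server address
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-rpc-secret       # placeholder secret
              key: secret
        - name: DRONE_RUNNER_CAPACITY
          value: "2"
        - name: DRONE_RUNNER_NAME
          value: docker-runner
        # A Docker daemon (dind sidecar or host socket) and DOCKER_HOST are
        # still required; omitted here for brevity.
```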
There may be some exceptions to the rule, but if the Drone server goes offline, the runners are designed to continue executing pipelines that are in progress. The runner uses exponential backoff to reconnect to the server and publishes the pipeline results once the server comes back online.
Because Drone uses an embedded queue, you cannot run multiple replicas of the Drone server at this time. But since the runners are generally resilient to server outages, you should only need to make sure the server restarts automatically if it goes offline.
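In practice that just means running the server as a single-replica Deployment and letting Kubernetes restart it. A rough sketch, assuming the server's /healthz endpoint for the probe (verify the path and port for your version; env configuration and storage are omitted):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: drone-server
spec:
  replicas: 1                     # embedded queue: keep this at exactly one
  selector:
    matchLabels:
      app: drone-server
  template:
    metadata:
      labels:
        app: drone-server
    spec:
      containers:
      - name: drone
        image: drone/drone:2
        ports:
        - containerPort: 80
        # The pod's default restartPolicy (Always) brings a crashed server back up;
        # the liveness probe also restarts it if it hangs without exiting.
        livenessProbe:
          httpGet:
            path: /healthz        # assumed health endpoint; check your version
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 15
        resources:
          requests:               # illustrative values; a request helps keep the
            cpu: 250m             # server pod from being evicted first when the
            memory: 512Mi         # node comes under memory pressure
```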
We managed to reduce the number of Exit 137 errors by increasing the replica count of the runner and making our auto-scaler less aggressive.

However, when they do happen, we notice the Drone server pod usually goes down and restarts (for about 20s to 1m) right before the OOM kill on the test-worker container. Is there any way to prevent this from happening?