Exit 137 from Drone Kubernetes Runner

Hello!

Cross-posting here from Slack:

Would anyone know how to prevent Exit Code 137 from happening?
We run about 50 builds a day, and 2-3 of them fail with Exit Code 137 each day. We have been unable to determine the root cause.

Each of our pipelines runs on an m6a.2xlarge node with 8 CPUs and 32GB of RAM (the request is 5 CPUs and 6GB of RAM per pipeline). Containers show memory utilization well below 4GB, so we have a lot of headroom.
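To put rough numbers on that headroom, here is a small sketch of how many pipelines fit on one node when only the resource requests are considered. It ignores the CPU and memory that Kubernetes reserves for system daemons, so the real allocatable capacity is somewhat lower, and the function name is made up for illustration.

```python
def pipelines_per_node(node_cpu_m: int, node_mem_mib: int,
                       req_cpu_m: int, req_mem_mib: int) -> int:
    """How many pipelines fit on a node, judged by resource requests alone."""
    by_cpu = node_cpu_m // req_cpu_m      # limit imposed by CPU requests
    by_mem = node_mem_mib // req_mem_mib  # limit imposed by memory requests
    return min(by_cpu, by_mem)


# m6a.2xlarge: 8 vCPU (8000 millicores) and 32 GiB (32768 MiB)
# Request per pipeline: 5000 millicores and 6000 MiB
print(pipelines_per_node(8000, 32 * 1024, 5000, 6000))  # 1
```

With these numbers the CPU request is the binding constraint, so each pipeline should end up alone on its node, and memory pressure from other pipelines on the same node looks unlikely.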

See logs attached


There are three reasons you can get a 137 exit code:

  1. The build is cancelled by an end user
  2. The build is cancelled because the timeout was exceeded (the timeout can be increased in the repository settings in the user interface)
  3. The build is killed by the host OS due to OOM (out of memory)
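For context, 137 follows the common convention of reporting 128 plus the signal number when a process is killed by a signal, and signal 9 is SIGKILL. All three cases above end with the container being killed, which is why they look identical from the exit code alone. A minimal sketch for decoding such codes (the helper name is hypothetical, and it assumes a Linux host where SIGKILL is defined):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable description."""
    if code > 128:
        signum = code - 128  # convention: 128 + signal number
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```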

Also note that the Kubernetes runner is not recommended for production use because it is a community contribution, has beta status, and is unstable overall at this stage. If you are looking to set up a production installation, we strongly recommend installing the Docker runner on Kubernetes (for which an official Helm chart exists).

Hi @brad
Thanks for replying!

We verified the below:

  1. Builds are not cancelled by end user
  2. The timeout is set to 60m. The pipelines usually hit the 137 code within 10-20 minutes. We’re using a Cypress Docker image
  3. We suspect this is what is happening. However, the logs don’t show much about why.
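One way to confirm or rule out the OOM case is to look at the container statuses Kubernetes records for the pipeline pod: a container killed for exceeding memory has a terminated state with reason `OOMKilled`. Below is a rough sketch that scans pod JSON (as produced by `kubectl get pod <name> -o json`) for that reason. It assumes you can capture the JSON before the pod is cleaned up; the helper name, container names, and sample data are made up for illustration.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or last state is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A container's termination may be recorded under state or lastState.
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term and term.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names


# Hypothetical sample: both containers exited with 137, only one was OOM-killed.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "step-install",
             "state": {"terminated": {"exitCode": 137, "reason": "Error"}}},
            {"name": "step-test",
             "state": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}}},
        ]
    }
}

print(oom_killed_containers(sample_pod))  # ['step-test']
```

If the reason is something other than `OOMKilled`, the SIGKILL came from elsewhere (for example a cancel or the pod being removed), and memory is probably not the culprit.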

Here’s a snippet of my .drone.yml. There are four workers, and they all run in parallel; the final pipeline waits on them via depends_on (the worker pipelines are copy and paste).

---
kind: pipeline
type: kubernetes
name: test-worker-1 #Worker 1 

metadata:
  namespace: drone 
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: false  

resources: #Pipeline Request 
    requests:
        cpu: 5000
        memory: 6000MiB

trigger:
    event:        
        - custom  #Trigger pipeline via Drone API Only 

default-cypress: &default-cypress
    image: public.ecr.aws/cypress-io/cypress/browsers:node16.16.0-chrome105-ff104-edge
    failure: always 
    privileged: true
    volumes:
        - name: cypress-cache
          path: /root/.cache/Cypress
    environment:
        CYPRESS_RECORD_KEY:
            from_secret: CYPRESS_RECORD_KEY
        CYPRESS_ENV_JSON:
            from_secret: CYPRESS_ENV_JSON
        CYPRESS_COMMAND: cy:run:ci --env=dev --suite=infra --test_browser=chrome --retry_amount=0 #Default if not specified 
  

steps:
    - name: install
      <<: *default-cypress
      commands:
          - /bin/bash drone/install.sh 
    - name: test
      <<: *default-cypress
      commands:
          - /bin/bash drone/test.sh default         

volumes:
    - name: cypress-cache
      temp: {}

---
...
...
---
kind: pipeline
type: kubernetes
name: after 

metadata:
  namespace: drone 
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: false


trigger: 
    event:        
      - custom 
    status:
      - failure 
    branch: 
      - master #Will be done for master builds only 
depends_on: #Induce Parallelism on 4 workers 
    - test-worker-1
    - test-worker-2
    - test-worker-3
    - test-worker-4

....

When using our Docker runner, if there is an OOM kill, you would see an entry in our logs:

You can install the docker runner on Helm using this chart:
https://github.com/drone/charts/tree/master/charts/drone-runner-docker

The Docker runner is stable, with optional commercial support. The Kubernetes runner is experimental and may require your team to get hands-on with the code to triage issues. Just something to consider as you ramp up your usage.


Thanks for the links!

We’ve also determined that right before the 137 errors happen, the Drone server pods appear to churn.

We’ve set our drone-runner replicas to 2 to help, but we are unable to do the same for the server.

Is it possible to run more than one replica of the server?

Thanks!

There may be some exceptions to the rule, but if the Drone server goes offline, the runners are designed to continue executing pipelines that are in-progress. The runner uses an exponential backoff to reconnect to the server and publish pipeline results once it comes back online.
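As a rough illustration of what exponential backoff means here: each failed reconnect attempt waits longer than the previous one, up to some ceiling. The base delay, factor, and cap below are assumptions chosen for the example, not the runner's actual settings.

```python
from typing import Iterator


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0, attempts: int = 8) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnect attempt."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)  # never wait longer than the cap
        delay *= factor        # grow the delay after every failed attempt


print(list(backoff_delays()))
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The practical consequence is that a short server restart costs the runner only a few quick retries, while a longer outage does not flood the server with reconnects.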

Because Drone uses an embedded queue you cannot have multiple replicas of the Drone server at this time. But since runners are generally resilient to server outages, you should just need to make sure the server automatically restarts if it goes offline.

Thanks for the helpful explanation!

We managed to reduce the number of Exit 137 errors by increasing the replica count of the runner and making our autoscaler less aggressive.

However, when they do happen, we notice that the Drone server pod usually goes down and restarts (taking about 20s to 1m) right before the OOM kill on the test-worker container. Is there any way to prevent this from happening?

Ex: