Service limit in Kubernetes runtime

I’m giving the new Kubernetes runtime a try on EKS and I’m running into a problem (screenshot attached).

[screenshot: build page showing the clone step complete, the first two services running, and the remaining services still pending]

The steps labeled mongo, memcached, mysql, solr, and redis are all services that need to start in order for this pipeline to succeed. But after the clone step runs, only the first two services start, and at that point the build hangs indefinitely. Inspecting the logs for drone-server shows no activity and no errors during this time. I do see the new namespace when I inspect with kubectl, and inside that namespace I see pods for the completed clone step and the two running services, but none for the others:

kubectl get all --namespace 5df4qu2u0efmrrgsk0h2oua7atj0yv02
NAME                                   READY   STATUS      RESTARTS   AGE
pod/bmztcu9hgmopsgvb3t6vys44jtgpucy9   0/1     Completed   0          9m8s
pod/hozuoms4ag2alixz76uwj9qgz262zjgs   1/1     Running     0          8m37s
pod/v9z8sipfya0d06619eyq112gw7xf5fd9   1/1     Running     0          8m26s
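
If it helps anyone trying to reproduce this, the individual pods can also be inspected with standard kubectl commands, for example (namespace and pod name taken from the listing above):

kubectl --namespace 5df4qu2u0efmrrgsk0h2oua7atj0yv02 describe pod hozuoms4ag2alixz76uwj9qgz262zjgs
kubectl --namespace 5df4qu2u0efmrrgsk0h2oua7atj0yv02 logs -f hozuoms4ag2alixz76uwj9qgz262zjgs

Given what turned out to be the cause (see below), the log stream for the memcached service's pod stays empty.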

I’ve enabled debug logging on drone server and collected the output in case it helps:

DEBU[0038] api: root access granted                      name=vdr-app namespace=stevecrozz request-id=1JSLUk6T4yZXiRDp5OW0fQhO8iW user.admin=true user.login=stevecrozz
DEBU[0038] api: root access granted                      name=vdr-app namespace=stevecrozz request-id=1JSLUk6T4yZXiRDp5OW0fQhO8iW user.admin=true user.login=stevecrozz
DEBU[0038] trigger: received                             commit=b89e0219939a8b338ddbd6c2b6f22cbc8204700e event=push ref=refs/heads/brenshanny/drone_ci repo=stevecrozz/vdr-app
DEBU[0038] kubernetes: creating job                      build-id=51 repo-id=24 stage-id=50 stage-name=default stage-number=1
DEBU[0038] kubernetes: successfully created job          build-id=51 repo-id=24 stage-id=50 stage-name=default stage-number=1
DEBU[0038]                                               fields.time="2019-04-05T16:59:29Z" latency=509.81041ms method=POST remote="10.32.1.46:46920" request=/api/repos/stevecrozz/vdr-app/builds/50 request-id=1JSLUk6T4yZXiRDp5OW0fQhO8iW
DEBU[0038] api: root access granted                      name=vdr-app namespace=stevecrozz request-id=1JSLUjSzlIAzAXNH4q4OZ8oVIWb user.admin=true user.login=stevecrozz
DEBU[0038]                                               fields.time="2019-04-05T16:59:29Z" latency=2.826318ms method=GET remote="10.32.1.46:46920" request=/api/repos/stevecrozz/vdr-app request-id=1JSLUjSzlIAzAXNH4q4OZ8oVIWb
DEBU[0038] api: root access granted                      name=vdr-app namespace=stevecrozz request-id=1JSLUgInCkcZ5X7Aj8dxjmSfZBF user.admin=true user.login=stevecrozz
DEBU[0038]                                               fields.time="2019-04-05T16:59:29Z" latency=5.965099ms method=GET remote="10.32.1.46:46876" request=/api/repos/stevecrozz/vdr-app/builds/51 request-id=1JSLUgInCkcZ5X7Aj8dxjmSfZBF
DEBU[0041] manager: fetching stage details               step-id=50
DEBU[0041] manager: updating step status                 step.id=251 step.name=clone step.status=running
DEBU[0041] api: root access granted                      name=vdr-app namespace=stevecrozz request-id=1JSLV5uULOMBX6L870meyfOqk9J user.admin=true user.login=stevecrozz
DEBU[0048]                                               fields.time="2019-04-05T16:59:39Z" latency="891.361µs" method=GET remote="10.32.1.46:46920" request=/manual/de/rewrite/intro.html request-id=1JSLW0terzuT2vcwA6dxxtomYjn
DEBU[0051]                                               fields.time="2019-04-05T16:59:42Z" latency="993.386µs" method=GET remote="10.32.1.46:46920" request=/ request-id=1JSLWLagIUU245qZ8bpJfRTyCRF
DEBU[0052]                                               fields.time="2019-04-05T16:59:43Z" latency="835.431µs" method=GET remote="10.32.1.46:46920" request=/ request-id=1JSLWQSkIg5i16jHsuTOwMEOxns
DEBU[0068] manager: updating step status                 step.id=251 step.name=clone step.status=success
DEBU[0068]                                               fields.time="2019-04-05T16:59:59Z" latency=26.357983386s method=GET remote="10.32.1.46:46876" request=/api/stream/stevecrozz/vdr-app/51/1/1 request-id=1JSLV5uULOMBX6L870meyfOqk9J
DEBU[0068] manager: updating step status                 step.id=252 step.name=mongo step.status=running
DEBU[0069] manager: updating step status                 step.id=253 step.name=memcached step.status=running
DEBU[0081]                                               fields.time="2019-04-05T17:00:12Z" latency="898.578µs" method=GET remote="10.32.1.46:46876" request=/ request-id=1JSLa7jLFm19uLu7lfDRxG1dUWx
DEBU[0082]                                               fields.time="2019-04-05T17:00:13Z" latency="766.503µs" method=GET remote="10.32.1.46:46876" request=/ request-id=1JSLaBkmM01E9gcuTEwaW2aBss5
DEBU[0111]                                               fields.time="2019-04-05T17:00:42Z" latency="811.701µs" method=GET remote="10.32.1.46:46876" request=/ request-id=1JSLds6BL1FfZwACwnkwXpG4R3l
DEBU[0112]                                               fields.time="2019-04-05T17:00:43Z" latency="757.592µs" method=GET remote="10.32.1.46:46876" request=/ request-id=1JSLe48U3WHbt4fEHkxHHOqoHLs

I believe I figured out the problem. After starting a service, Drone will not carry on to the next step until that service writes something to STDOUT. In this case the image I’m using is the official memcached image from https://hub.docker.com/_/memcached (memcached:1.5-alpine), which prints nothing to STDOUT when it starts.
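
This is easy to confirm outside of Drone by running the image locally and watching its output (plain Docker, nothing Drone-specific):

docker run --rm memcached:1.5-alpine

With its default flags the container starts memcached in the foreground and prints nothing, so anything following its log stream and waiting for output would wait indefinitely.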

I can work around the issue by forcing the container to emit something on STDOUT. Adding a command section to the memcached service definition in .drone.yml does the trick:

  - name: memcached
    image: memcached:1.5-alpine
    ports:
      - 11211
    command:
      - sh
      - -c
      - echo "Starting memcached on port 11211" &&
        /entrypoint.sh memcached

Is this the intended behavior, or is this a bug?

I am not aware of any code in Drone that enforces such logic. The Kubernetes log streaming code can be found here: https://github.com/drone/drone-runtime/blob/master/engine/kube/kube.go#L238

Notice that the code gets an io.ReadCloser back from Kubernetes, and Drone then performs an io.Copy that exits when the Kubernetes client closes the log stream. Would this imply that Kubernetes does not close the io.ReadCloser until some output is written? If so, that sounds like a Kubernetes client bug.
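
For context, here is a minimal, self-contained sketch of that pattern using the official client-go library. This is not the drone-runtime code itself: the kubeconfig setup is a placeholder, the namespace and pod name are just the ones from the kubectl output above, and it assumes a recent client-go (v0.18 or later) where Stream takes a context.

package main

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (placeholder setup,
	// not how drone-runtime configures its client).
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Namespace and pod name are taken from the kubectl output above.
	namespace := "5df4qu2u0efmrrgsk0h2oua7atj0yv02"
	pod := "hozuoms4ag2alixz76uwj9qgz262zjgs"

	// Ask the API server for a follow-mode log stream, which is what the
	// runtime does for each step or service container.
	req := client.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{Follow: true})
	rc, err := req.Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer rc.Close()

	// io.Copy blocks here until the server ends the stream or an error
	// occurs. If the container never writes to STDOUT/STDERR, this call
	// just sits and waits, which matches the hang described above.
	if _, err := io.Copy(os.Stdout, rc); err != nil {
		panic(err)
	}
}

Running something like this against a container that never writes anything would show where the wait actually happens: the io.Copy blocks on the stream, and nothing unblocks it until the server sends data or closes the connection.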

I’m not sure yet what the mechanism behind this issue is. I think the most helpful thing I can do to start is to share some simple steps to reproduce. Here’s a repo I just created to illustrate what I’m seeing:

Hi Steve, thanks for the link to reproduce. Unfortunately I do not have any bandwidth right now to dive deep into the issue (I have more immediate priorities); however, I will certainly provide as much support as possible if you are interested in debugging further and possibly authoring a patch.

Note that in the meantime you can always fall back to Drone agents.