Having trouble with healthchecks of drone agent 0.8.0-rc.5 in swarm

Ever since rc.5, the agent restarts about once a minute (it is killed by Docker).
Creating the service starts the agent, but the container stays stuck in status Up XX seconds (health: starting).
Then every minute the container is killed and a new one is created, over and over. The agent accepts jobs and processes them until it is killed.

The status of a killed container seems to be:

"Status": {
            "Timestamp": "2017-09-15T03:22:32.08665588Z",
            "State": "failed",
            "Message": "starting",
            "Err": "task: non-zero exit (2): dockerexec: unhealthy container",
            "ContainerStatus": {
                "ContainerID": "09bac37e7634b77d527ba9dfb9eebf5c2009554aa8867557d4a6733e3aebdc22",
                "ExitCode": 2
            },
            "PortStatus": {}
        },
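For anyone triaging the same symptom, here is a minimal sketch of checking whether a task failed because of the healthcheck rather than because the agent itself crashed. It assumes the "Status" object has the shape pasted above; the helper names `summarize_task_status` and `is_healthcheck_failure` are made up for illustration and are not part of Docker or Drone.

```python
# Sample copied from the status shown above.
SAMPLE_STATUS = {
    "Timestamp": "2017-09-15T03:22:32.08665588Z",
    "State": "failed",
    "Message": "starting",
    "Err": "task: non-zero exit (2): dockerexec: unhealthy container",
    "ContainerStatus": {
        "ContainerID": "09bac37e7634b77d527ba9dfb9eebf5c2009554aa8867557d4a6733e3aebdc22",
        "ExitCode": 2,
    },
    "PortStatus": {},
}


def is_healthcheck_failure(status: dict) -> bool:
    """True when the task failed because the container was marked unhealthy."""
    return status.get("State") == "failed" and "unhealthy container" in status.get("Err", "")


def summarize_task_status(status: dict) -> str:
    """Return a one-line summary of a task "Status" object (hypothetical helper)."""
    container = status.get("ContainerStatus") or {}
    short_id = (container.get("ContainerID") or "")[:12]
    return (
        f'{status.get("Timestamp", "?")} state={status.get("State", "?")} '
        f'exit={container.get("ExitCode")} container={short_id} '
        f'err={status.get("Err", "")!r}'
    )


if __name__ == "__main__":
    print(summarize_task_status(SAMPLE_STATUS))
    print("healthcheck failure:", is_healthcheck_failure(SAMPLE_STATUS))
```

With the status above, the check reports a healthcheck failure, which matches the "unhealthy container" error: the agent process did not exit on its own.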

The logs do not show anything interesting, just "request next execution" on each container start:

drone-agent.1.w3oeydwtzq2e@proxy-1    | {"time":"2017-09-15T03:26:02Z","level":"debug","message":"request next execution"}
drone-agent.1.v6e4dwrtawdn@proxy-1    | {"time":"2017-09-15T03:31:09Z","level":"debug","message":"request next execution"}
drone-agent.1.6tobp13atx6x@proxy-1    | {"time":"2017-09-15T03:27:44Z","level":"debug","message":"request next execution"}
drone-agent.1.u7mpqtdj0yqn@proxy-1    | {"time":"2017-09-15T03:29:26Z","level":"debug","message":"request next execution"}
drone-agent.1.hzvnyg5bsxgb@proxy-1    | {"time":"2017-09-15T03:32:51Z","level":"debug","message":"request next execution"}
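Since each container logs exactly one "request next execution" line when it starts, the timestamps can be used to measure how often the agent is actually replaced. The following is a rough sketch, assuming the log format shown above (task prefix, a pipe, then a JSON payload); `restart_intervals` is a hypothetical helper name.

```python
import json
from datetime import datetime

# Log lines copied from the output above.
LOG_LINES = [
    'drone-agent.1.w3oeydwtzq2e@proxy-1    | {"time":"2017-09-15T03:26:02Z","level":"debug","message":"request next execution"}',
    'drone-agent.1.v6e4dwrtawdn@proxy-1    | {"time":"2017-09-15T03:31:09Z","level":"debug","message":"request next execution"}',
    'drone-agent.1.6tobp13atx6x@proxy-1    | {"time":"2017-09-15T03:27:44Z","level":"debug","message":"request next execution"}',
    'drone-agent.1.u7mpqtdj0yqn@proxy-1    | {"time":"2017-09-15T03:29:26Z","level":"debug","message":"request next execution"}',
    'drone-agent.1.hzvnyg5bsxgb@proxy-1    | {"time":"2017-09-15T03:32:51Z","level":"debug","message":"request next execution"}',
]


def start_times(log_lines):
    """Return the sorted timestamps of every "request next execution" entry."""
    times = []
    for line in log_lines:
        # Everything after the first "|" is the JSON payload written by the agent.
        _, _, payload = line.partition("|")
        entry = json.loads(payload)
        if entry.get("message") == "request next execution":
            times.append(datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ"))
    return sorted(times)


def restart_intervals(log_lines):
    """Return the number of seconds between consecutive container starts."""
    times = start_times(log_lines)
    return [int((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(restart_intervals(LOG_LINES))  # [102, 102, 103, 102]
```

For the five lines above, the starts are about 102 seconds apart, so the replacement cycle is closer to a minute and a half than to exactly one minute. The very regular spacing points at a timer, such as the healthcheck, rather than at a random crash.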

The spec is as follows (I hope the debug envs are correct):

"ContainerSpec": {
                "Image": "drone/agent:0.8.0-rc.5@sha256:183bd066396e236cb531a789319b894fe2cd5eefb000c4e8af1466323236ccb6",
                "Env": [
                    "DRONE_SECRET=redacted",
                    "DRONE_SERVER=drone:9000",
                    "DRONE_DEBUG=true",
                    "DRONE_BROKER_DEBUG=true"
                ],
                "Mounts": [
                    {
                        "Type": "bind",
                        "Source": "/var/run/docker.sock",
                        "Target": "/var/run/docker.sock"
                    }
                ],
                "DNSConfig": {}
            },

This also leaves some builds in status “Running” forever (the clone step failed because of an authentication issue, and the following steps remain pending).
Both server and agent are at version 0.8.0-rc.5.

I am not seeing this behavior at beta.drone.io (which runs the latest build) or my private drone instance.

I recommend looking at the healthcheck code and sending a patch. The relevant code is here:

This is really weird: running the agent on another swarm with exactly the same configuration works, and the health status passes after about 35 seconds.

The only difference is the host machine's kernel and Debian version:
Debian 8 with backported kernel 4.9.18-1~bpo8+1 - runs fine
Debian 9 with kernel 4.9.30-2+deb9u3 - won’t pass the healthcheck, keeps restarting
Both have Docker version 17.06.2-ce, build cec0b72.

I think there is actually something really wrong with my networking, because if I expose port 3000 to the host machine, I still cannot access it. However, I can connect to it from another container if I attach both the agent and that container to an overlay network. I think I will resort to purging the whole Docker installation, as this is not the first time this has happened.
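For reference, this kind of reachability check can be scripted so that the same probe is run from the host and from inside another container. This is only a sketch: port 3000 comes from the post above, while the host name and the `/healthz` path in the usage section are assumptions and should be replaced with whatever the agent really serves.

```python
import http.client
import socket
import urllib.error
import urllib.request


def tcp_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def http_healthy(url, timeout=2.0):
    """True if an HTTP GET on url returns a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # HTTPError (4xx/5xx) is a subclass of URLError, so it lands here too.
        return False


if __name__ == "__main__":
    host, port = "localhost", 3000  # port taken from the post; host is an assumption
    print("tcp reachable:", tcp_reachable(host, port))
    # "/healthz" is an assumed path, not confirmed by this thread.
    print("http healthy:", http_healthy(f"http://{host}:{port}/healthz"))
```

If the TCP check fails on the host but succeeds from a container on the same overlay network, the problem is more likely in the host's port publishing or firewall rules than in the agent.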

Sorry to have raised this issue, you can close it