Unhealthy Agents cannot be "killed"

So I’m having some issues with gRPC and my agents.

I think what happens is this: an agent is working fine, hits some gRPC issues that it doesn’t “recover” from, and then gets a build that “doesn’t finish”. In reality the build is no longer running on the agent (no containers are running aside from the agent container itself). But when I hit /varz, the agent still shows the build as running and as having exceeded its timeout, so the agent becomes “unhealthy”.

But then gRPC starts working again, and since I have DRONE_MAX_PROCS set to 3 the agent continues to take builds. The build that isn’t actually running but has timed out takes up one slot, so two slots are still free to take and run builds. Now I have an agent that is working, but in a degraded state.

So I go and try to clean it up. I run drone kill -s SIGINT agent and see ctrl+c received, terminating process in the log. The agent stops taking any more jobs, but when I hit the /varz endpoint I still see that “hung” job, which isn’t really running, listed as “running”. The agent never exits. At that point I force-remove the agent container and start it again.

So a couple questions/thoughts I had after all this:

  • Why does the agent continue to track a build that has timed out? Why doesn’t it cancel/clean up the build once it reaches its timeout? If the agent knows the build has timed out, I think it should automatically try to stop the build’s containers and remove it from “running” (rough sketch of what I mean after this list).
  • Would it be considered a bug that drone kill -s SIGINT agent doesn’t work properly on an unhealthy agent? If so, I can open an issue. That said, if you agree with my first point, addressing it would also take care of this one.
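
Roughly what I have in mind for the first point, as a minimal sketch. The engine interface, the tracker type, and the function names here are made up to illustrate the idea; they are not actual agent code.

```go
// Hypothetical sketch only: a per-build watchdog that enforces the build
// timeout on the agent itself instead of just reporting it via /varz.
package agent

import (
	"context"
	"log"
	"sync"
	"time"
)

// engine abstracts the container runtime; destroy is assumed to stop and
// remove every container that belongs to the build.
type engine interface {
	destroy(ctx context.Context, buildID int64) error
}

// tracker holds what /varz would report as "running".
type tracker struct {
	mu      sync.Mutex
	running map[int64]context.CancelFunc
}

// run executes a build under a hard deadline. When the deadline passes, the
// build's containers are torn down and the build is dropped from "running",
// so the slot is freed instead of lingering as a hung entry.
func (t *tracker) run(eng engine, buildID int64, timeout time.Duration, exec func(context.Context) error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)

	t.mu.Lock()
	t.running[buildID] = cancel
	t.mu.Unlock()

	defer func() {
		cancel()
		t.mu.Lock()
		delete(t.running, buildID) // remove from "running" no matter what
		t.mu.Unlock()
	}()

	if err := exec(ctx); err != nil {
		// On timeout (or any other failure), make a best-effort attempt to
		// destroy the build's containers so nothing is left behind.
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), time.Minute)
		defer cleanupCancel()
		if derr := eng.destroy(cleanupCtx, buildID); derr != nil {
			log.Printf("build %d: cleanup failed: %v", buildID, derr)
		}
		log.Printf("build %d: finished with error: %v", buildID, err)
	}
}
```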

Why does the agent continue to track a build that has timed out? Why doesn’t it cancel/clean up the build once it reaches its timeout? If the agent knows the build has timed out, I think it should automatically try to stop the build’s containers and remove it from “running”.

Based on our Slack discussion, I think what is happening is that the docker logs command is hanging, which prevents Drone from uploading the logs and completing the pipeline. The issue is that no context is passed to the docker logs command. The context would carry a timeout and would cancel any blocking command that hangs for too long. BUT this is already fixed in our new runtime, which is planned for integration in 0.9: https://github.com/drone/drone-runtime/blob/master/engine/docker/docker.go#L147
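
To make that concrete, here is a minimal sketch of that kind of fix, assuming the official Docker Go client. The tailLogs helper and the surrounding wiring are illustrative, not the actual runtime code.

```go
package agent

import (
	"context"
	"io"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// tailLogs streams a container's logs but gives up once ctx is done, for
// example when the build timeout expires.
func tailLogs(ctx context.Context, cli *client.Client, containerID string, dst io.Writer) error {
	rc, err := cli.ContainerLogs(ctx, containerID, types.ContainerLogsOptions{
		ShowStdout: true,
		ShowStderr: true,
		Follow:     true,
	})
	if err != nil {
		return err
	}
	defer rc.Close()

	// When ctx is cancelled or times out, close the stream so the copy
	// below unblocks instead of hanging on a dead docker logs connection.
	go func() {
		<-ctx.Done()
		rc.Close()
	}()

	_, err = io.Copy(dst, rc)
	return err
}

// runWithBuildTimeout shows the wiring: the context deadline matches the
// build timeout, so a build whose log stream hangs is cut off instead of
// being tracked as "running" indefinitely.
func runWithBuildTimeout(cli *client.Client, containerID string, dst io.Writer, buildTimeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), buildTimeout)
	defer cancel()
	return tailLogs(ctx, cli, containerID, dst)
}
```

The important part is that the same context that carries the build timeout also bounds the log stream, so a dead docker logs connection cannot keep the pipeline (and the agent’s “running” slot) alive forever.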

In general, docker commands should not hang. So while this is a bug and Drone should handle this failure scenario (and will in 0.9), we should also get to the root cause of why docker logs is hanging on your instance. If we fix the latter, the issue should hopefully go away without having to wait for 0.9 :slight_smile:

I think there are two improvements we can make to prevent docker logs from hanging the system. The following should work in theory:

  1. disable the half-baked re-queue logic for now:
    https://github.com/cncd/queue/blob/master/fifo.go#L178:L184
  2. pass a context with timeout to the docker logs command (rough sketch after this list):
    1. make sure this context has a timeout equal to the build timeout https://github.com/drone/drone/blob/master/cmd/drone-agent/agent.go#L409
    2. pass the context to the tail command https://github.com/cncd/pipeline/blob/master/pipeline/pipeline.go#L132
    3. use the context in the docker logs command https://github.com/cncd/pipeline/blob/master/pipeline/backend/docker/docker.go#L155
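
A rough sketch of how step 2 could be wired, using simplified stand-in types (the real Engine and Step definitions in cncd/pipeline look different); it only illustrates how a single context, bounded by the build timeout, would travel from the agent down to the tail/logs call:

```go
package agent

import (
	"context"
	"io"
	"time"
)

// Step is a stand-in for a pipeline step; the real type carries much more.
type Step struct {
	Name string
}

// Engine is a trimmed-down backend interface. The key change proposed above
// is that Tail (which wraps docker logs) now takes a context.
type Engine interface {
	Tail(ctx context.Context, step *Step) (io.ReadCloser, error)
}

// runStep shows the wiring: one context, bounded by the build timeout,
// travels from the agent through the pipeline into the backend call.
func runStep(eng Engine, step *Step, buildTimeout time.Duration, dst io.Writer) error {
	// 2.1: a context whose deadline equals the build timeout.
	ctx, cancel := context.WithTimeout(context.Background(), buildTimeout)
	defer cancel()

	// 2.2 / 2.3: pass the context to the tail command, which in turn uses
	// it for the underlying docker logs call.
	rc, err := eng.Tail(ctx, step)
	if err != nil {
		return err
	}
	defer rc.Close()

	// Copy the step logs; if the deadline passes, the backend is expected
	// to close the stream and the copy returns.
	_, err = io.Copy(dst, rc)
	return err
}
```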