[autoscaler] [gce] Drone agents require restart after idling

Hi there! I’m testing out the Drone Autoscaler on GCE, and I’m running into an issue where, after idling for ~10 minutes, the agents no longer pick up new builds from Drone.

Restarting the drone agent container on the agent server instances fixes the problem, though.

    DRONE_POOL_MIN=2
    DRONE_POOL_MAX=10
    DRONE_SERVER_PROTO=https
    DRONE_SERVER_HOST=drone.example.io
    DRONE_SERVER_TOKEN=omitted
    DRONE_AGENT_HOST=drone.example.io:9000
    DRONE_AGENT_TOKEN=omitted
    DRONE_GOOGLE_PROJECT=example-1234
    DRONE_GOOGLE_ZONE=us-east1-b

The agents do pick up builds, and if builds keep coming in they continue to be picked up, but after ~10 or so minutes with no builds being processed, the agents stay idle until they are restarted.

I’m not sure if this is a bug or if it’s a misconfiguration. :thinking:

This sounds like you have a load balancer, proxy or network device sitting between the agent and the server that is breaking the network connection without properly signaling a close. The agents have very useful logs (especially if you enable debug logging). Have you taken a look at the agent logs to gather more details?

The autoscaler is not really relevant in terms of troubleshooting, since it is just doing the same docker run that a human would be doing. That said, the agent does have optional keepalive parameters, and it could be helpful for the autoscaler to provide a means to set these. So the autoscaler is probably not the problem, but it could be part of the solution :slight_smile:

Also, are you planning on purchasing an enterprise license or support? You might find this useful for your organization.

Unfortunately the agent logs aren’t particularly useful here.

The last thing that it reports is that the last pipeline it ran completed.

{"time":"2018-03-20T16:49:09Z","level":"debug","repo":"rune/rune","build":"94","id":"614","error":"","exit_code":1,"message":"pipeline complete"}
{"time":"2018-03-20T16:49:09Z","level":"debug","repo":"rune/rune","build":"94","id":"614","message":"uploading logs"}
{"time":"2018-03-20T16:49:09Z","level":"debug","repo":"rune/rune","build":"94","id":"614","message":"uploading logs complete"}
{"time":"2018-03-20T16:49:09Z","level":"debug","repo":"rune/rune","build":"94","id":"614","error":"","exit_code":1,"message":"updating pipeline status"}
{"time":"2018-03-20T16:49:09Z","level":"debug","repo":"rune/rune","build":"94","id":"614","message":"stop listening for cancel signal"}
{"time":"2018-03-20T16:49:10Z","level":"debug","repo":"rune/rune","build":"94","id":"614","message":"updating pipeline status complete"}
{"time":"2018-03-20T16:49:10Z","level":"debug","message":"request next execution"}
{"time":"2018-03-20T16:49:10Z","level":"debug","repo":"rune/rune","build":"94","id":"614","message":"pipeline done"}

This sounds like you have a load balancer, proxy or network device sitting between the agent and the server that is breaking the network connection without properly signaling a close

So this could be the case. And I imagine in most use cases on GCE, this will be the case.

In my setup specifically, I have Drone Server and Drone Autoscaler on Kubernetes, and I have Drone Server’s gRPC port exposed via serviceType: NodePort, so the only thing between the agent and the server is whatever Kubernetes uses to map nodePorts to pods, which is probably much less in the way than an nginx proxy or something similar.

Also, are you planning on purchasing an enterprise license or support?

We definitely will; our organization is pretty small. To be honest, I’m mostly trying to get a lot of this running on Kubernetes just for sport. I’d like to update your Drone helm chart at some point to include the Autoscaler once I have it totally figured out. :slight_smile:

Drone is a fantastic use case for Kubernetes.

If the TCP connection broke I would expect to see the following logs:

INFO: 2018/03/18 21:17:16 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp [::1]:9000: getsockopt: connection refused"; Reconnecting to {localhost:9000 <nil>}
INFO: 2018/03/18 21:17:17 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp [::1]:9000: getsockopt: connection refused"; Reconnecting to {localhost:9000 <nil>}

The absence of this log data would tell me that the TCP connection was closed without proper signaling, which means the agent doesn’t know there is a connection problem. The connection would likely time out eventually, after about 15 minutes (I believe that is the Go default), although I’m not certain.

You might be able to pass a keepalive time to the agent. The autoscaler is not yet able to set this parameter, but if you verify it works, we could add it. Also note that it must be > 5 minutes.

DRONE_KEEPALIVE_TIME=6m
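For context, that environment variable presumably maps onto the standard grpc-go client keepalive options. Here is a minimal, generic grpc-go sketch of that configuration (the server address is a placeholder, and this is not the actual agent code), assuming the value is wired into keepalive.ClientParameters:

    // Generic grpc-go sketch, not Drone's actual agent code: a client
    // keepalive ping every 6 minutes so a silently dropped connection
    // is eventually detected and re-dialed.
    package main

    import (
        "log"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        conn, err := grpc.Dial(
            "drone.example.io:9000", // placeholder server address
            grpc.WithInsecure(),
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:    6 * time.Minute,  // keep above the 5 minute server-side minimum
                Timeout: 20 * time.Second, // how long to wait for a ping ack
            }),
        )
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        // ... open streams / make calls on conn as usual
    }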

But overall gRPC has not been a great choice for Drone. When a team controls the code and the deployment environment, and needs high performance and low latency, I think gRPC is great. But Drone has to support diverse network requirements, load balancers, corporate proxies, etc. So that is my long-winded way of saying that I plan to eventually provide a REST alternative to the gRPC transport [1].

[1] Roadmap for the 1.0 release

We’ve had similar problems with gRPC, and we’ve found that Twirp has been a great way to provide HTTP endpoints from the same protobuf definitions while still having a native Go client.

Migrating from gRPC to Twirp was also super easy. Maybe it’s something to investigate. :slightly_smiling_face:

https://github.com/twitchtv/twirp

I’ll give that a shot and let you know, I definitely haven’t seen any gRPC related errors yet.

Yep, twirp is something I’m considering. The main benefit of REST or Twirp is that each call is a single http request, which means we no longer have to worry about keeping a persistent http2 connection alive, which seems to be the root of all of my support woes these days.
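To illustrate, here is a rough, purely hypothetical sketch of what a poll-per-request agent loop could look like; the /rpc/v1/next endpoint, payload, and header are invented, but the point is that every poll is an independent HTTP request, so there is no long-lived stream for a proxy or load balancer to drop silently:

    // Hypothetical sketch only: the /rpc/v1/next endpoint and bearer token
    // are invented for illustration. Each poll is a standalone HTTP request,
    // so there is no persistent HTTP/2 stream for a middlebox to kill silently.
    package main

    import (
        "fmt"
        "io/ioutil"
        "log"
        "net/http"
        "strings"
        "time"
    )

    func main() {
        client := &http.Client{Timeout: 30 * time.Second}
        for {
            req, err := http.NewRequest("POST", "https://drone.example.io/rpc/v1/next", strings.NewReader("{}"))
            if err != nil {
                log.Fatal(err)
            }
            req.Header.Set("Authorization", "Bearer <agent-token>") // placeholder
            resp, err := client.Do(req)
            if err != nil {
                log.Println("poll failed, retrying:", err)
                time.Sleep(10 * time.Second)
                continue
            }
            body, _ := ioutil.ReadAll(resp.Body)
            resp.Body.Close()
            fmt.Println("next execution payload:", string(body))
            time.Sleep(time.Second)
        }
    }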

I’m guessing you haven’t seen any grpc errors because it doesn’t know its connection is broken yet :slight_smile: But just a guess …

This was the only gRPC error I got:

2018/03/20 19:31:50 grpc error: extend(): code: Unknown: rpc error: code = Unknown desc = queue: task not found

I waited about 40 minutes after that and it didn’t pick up any new builds.

I’ve now re-created the docker container with -e DRONE_KEEPALIVE=6m and we’ll see how it goes.

docker version on the agent machine?

Client:
 Version:	17.12.1-ce
 API version:	1.35
 Go version:	go1.9.4
 Git commit:	7390fc6
 Built:	Tue Feb 27 22:17:40 2018
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.1-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	7390fc6
  Built:	Tue Feb 27 22:16:13 2018
  OS/Arch:	linux/amd64
  Experimental:	false

I let that one sit idle for ~1 hour; its logs are:

{"time":"2018-03-20T20:54:09Z","level":"debug","message":"request next execution"}
{"time":"2018-03-20T20:54:09Z","level":"debug","message":"request next execution"}

I don’t know if debugging is enabled, but this is what I used to run it:

docker run -e DRONE_KEEPALIVE=6m -e DRONE_SERVER=drone.example.io:9000 -e DRONE_SECRET=<omitted> -e DRONE_HOSTNAME=agent-xyz -e DRONE_MAX_PROCS=2 -e GODEBUG=netdns=go -v /var/run/docker.sock:/var/run/docker.sock -d drone/agent:0.8

and it’s not picking up any builds.

This would indicate the agent is connected to the server and is blocking while waiting to receive a build from the queue. Is it possible the load balancer killed the connection between itself and your server, but the server still has an open connection since a close signal was not sent?

The challenge here is that Drone does not really support load balancers and proxies, so I recommend avoiding such a configuration and directly connecting the agent and server. I just want to make sure we don’t spend too much time triaging something that has known issues.

Alternatively if you have some GRPC experience and can recommend the appropriate configuration parameters, we can try to issue a patch.

Gotcha. As far as I’m aware, my service isn’t using any sort of proxy or load balancer, other than some port mappings (basically docker run -p 31970:9000).

Setting the keepalive to 6m did not help; I also set the keepalive to 120s (and the min_keepalive on the server to 120s) just for the hell of it and that also didn’t help.

I think for now I’ll skip the autoscaler until the agent and server communicate via HTTP :grimacing:

Just as an FYI, gRPC will blacklist and block the client if you set a keepalive under 5 minutes. But yes, it probably makes sense to hold off until the HTTP protocol is in place if you are hitting issues.
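For what it’s worth, I believe that 5-minute floor corresponds to gRPC’s server-side keepalive enforcement policy, which answers overly aggressive client pings with a GOAWAY (“too_many_pings”). A generic grpc-go sketch of that server-side knob (not Drone’s actual server code) looks like this:

    // Generic grpc-go sketch, not Drone's actual server code: the enforcement
    // policy below is what rejects clients that ping more often than allowed
    // (the 5 minute MinTime shown here matches grpc-go's default).
    package main

    import (
        "log"
        "net"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    func main() {
        lis, err := net.Listen("tcp", ":9000")
        if err != nil {
            log.Fatal(err)
        }
        srv := grpc.NewServer(
            grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
                MinTime:             5 * time.Minute, // clients pinging more often get GOAWAY
                PermitWithoutStream: false,
            }),
        )
        log.Fatal(srv.Serve(lis))
    }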

ah, well TIL. I’ll probably keep messing around with it.

If you need any help writing the HTTP version I’d be happy to contribute. :slight_smile:

Experiencing the same thing on GCE.

Got no middleware between agents and server/autoscaler.

Server/autoscaler are (now) running in host mode, with a dedicated security group granting ‘protocol = “all”’ from agent to server.

The server is specified as an IP (we were using a DNS record before) in the autoscaler configuration.

But still, idle servers are not accepting new builds.

agent:

{"time":"2018-06-04T11:32:28Z","level":"debug","repo":"someteam/somerepo","build":"3708","id":"47874","error":"","exit_code":0,"message":"updating pipeline status"}
{"time":"2018-06-04T11:32:28Z","level":"debug","repo":"someteam/somerepo","build":"3708","id":"47874","message":"stop listening for cancel signal"}
{"time":"2018-06-04T11:32:28Z","level":"debug","repo":"someteam/somerepo","build":"3708","id":"47874","message":"updating pipeline status complete"}
{"time":"2018-06-04T11:32:28Z","level":"debug","message":"request next execution"}
{"time":"2018-06-04T11:32:28Z","level":"debug","repo":"someteam/somerepo","build":"3708","id":"47874","message":"pipeline done"}

server


Hope it helps somehow

This PR fixes it for us :wink:

Many thanks to genisd.