I don’t know if you’re using Amazon ECS (the term “stack” suggests it to me), but load balancers on AWS have given me a fair amount of trouble with Drone. If you’re using ECS, I think only the Application Load Balancer will work with Drone - the Classic Load Balancer causes the same error message to appear.
Drone uses HTTP/2, which may not play nicely with load balancers. You should check your load balancer’s documentation to see whether it supports HTTP/2 and/or gRPC. You might also want to check with AWS support to see if they have a recommended configuration. I do not use AWS, so this is not something I can confirm.
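If it helps, a quick way to see which protocol your external endpoint actually negotiates is to ask curl for HTTP/2 explicitly (the hostname below is a placeholder, and this only tests the client-to-load-balancer leg, not what the load balancer speaks to the backend):

# request HTTP/2 over TLS; the status line shows the negotiated protocol
curl -sI --http2 https://drone.example.com/
# "HTTP/2 200" means the front end negotiated HTTP/2; "HTTP/1.1 200 OK" means it did not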
Did you deploy Drone v0.8 behind an AWS ALB? There are two groups of ports: 443/80 and 9000. The HTTPS/HTTP traffic is handled by the external load balancer, but what about port 9000? That is the communication between the drone agents and the server, and it goes through the internal load balancer.
I currently use Classic ELBs for both the external and the internal load balancer.
I agree that I can switch the external load balancer to an ALB, which handles HTTPS/HTTP well. But what about the internal load balancer? Will TCP on port 9000 be supported by an ALB? I don’t think so.
The errors I get are mostly from the connection between the agents and the server, so they come from port 9000. How can I fix this?
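For context, this is roughly how my agents point at the internal load balancer today - the agent keeps a single long-lived gRPC connection to whatever DRONE_SERVER resolves to, so the internal ELB sits in the middle of that stream (hostname and secret below are placeholders):

docker run -d \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e DRONE_SERVER=internal-elb.example.internal:9000 \
  -e DRONE_SECRET=<shared-secret> \
  drone/agent:0.8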
According to the Amazon documentation, the load balancers convert HTTP/2 requests to HTTP/1.1, which means they would not be compatible with Drone [1]:
Application Load Balancers provide native support for HTTP/2 with HTTPS listeners. You can send up to 128 requests in parallel using one HTTP/2 connection. The load balancer converts these to individual HTTP/1.1 requests and distributes them across the healthy targets
Port 9000 exposes a gRPC endpoint, which requires HTTP/2.
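A rough way to check whether an HTTP/2 exchange survives whatever sits in front of port 9000 is curl’s prior-knowledge mode, which speaks cleartext HTTP/2 the way gRPC does (the hostname is a placeholder and the exact response depends on the load balancer):

# open a cleartext HTTP/2 (h2c) connection straight to the gRPC port
curl -v --http2-prior-knowledge http://internal-elb.example.internal:9000/
# if the hop in between only understands HTTP/1.1, the exchange fails rather than
# coming back as an HTTP/2 response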
I tried a lot of things, but in the end I just deployed 0.5, which worked perfectly over the Classic Load Balancer. I guess you could try customising it with HAProxy or something like that, but I found that using the earlier version solved all of my problems, as it happily communicated with the server via TCP. I might post a blog at some point about my experiences configuring Drone behind a classic AWS ELB, as it’s useful, but I didn’t get it going via the ALB.
I have the same issue but with docker swarm without any LB between containers.
2018-02-04T18:37:57.680823144Z infra_ci.1.s3zfafzg5au5@dc-m INFO: 2018/02/04 18:37:57 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
➜ infra git:(master) ✗ docker exec -ti c8 sh
/ # ping ci
PING ci (10.0.0.110): 56 data bytes
64 bytes from 10.0.0.110: seq=0 ttl=64 time=0.177 ms
64 bytes from 10.0.0.110: seq=1 ttl=64 time=0.093 ms
^C
--- ci ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.093/0.135/0.177 ms
/ #
The error grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing" is a gRPC warning. After this event the drone-server can’t use the drone-agent, which is why jobs get stuck.
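Worth noting: ping only proves that the overlay network routes packets between the containers; it says nothing about the gRPC port itself. Something like this checks the actual TCP port (assuming the image’s busybox includes nc; container and service names are the ones from my stack above):

# from the same container as the ping test, check the server's gRPC port rather than ICMP
docker exec -ti c8 nc -zv ci 9000
# nc exits 0 if the port accepts connections; the gRPC stream is what gets torn down later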
I believe I’m using the same stack as you: a single drone-server running on AWS ECS behind a Classic ELB for both external and internal access.
I did some tests a while ago and IIRC, increasing the ELB’s “Idle timeout” would increase the time delta between these error messages, so it seems that the ELB is closing the connection periodically. However, increasing it too much (so it would close the connection less often, e.g. only once every hour) would counter-intuitively cause the agents to permanently lose the connection and stay hanging.
So, like you, I’ve decided to just live with the error messages (Idle timeout is set to 30 sec ATM).
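For anyone who wants to repeat the experiment, the idle timeout can be adjusted from the CLI (the load balancer name is a placeholder):

# set the Classic ELB idle timeout to 30 seconds, matching the setup described above
aws elb modify-load-balancer-attributes \
  --load-balancer-name drone-internal-elb \
  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":30}}"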
I also gave ALB/NLB a try, but I couldn’t get them to support multiple origin ports.
AWS’s load balancers do not have full support for end-to-end HTTP/2 yet, as of Feb 2018.
The best way to avoid those error messages is to have the agent communicate with the server directly via the server’s IP (public or private), and not behind a load balancer or reverse proxy of any kind.
This communication works in Kubernetes because there is no load-balancing “product” at the Service layer; it is just some VIP mappings.
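For example (names below are placeholders), a plain ClusterIP Service created like this is nothing more than kube-proxy VIP rules (iptables), so the agents’ gRPC connections to port 9000 pass through untouched:

# expose the drone-server deployment's gRPC port as a ClusterIP Service
kubectl expose deployment drone-server --name=drone-server --port=9000 --target-port=9000
# agents then set DRONE_SERVER to drone-server:9000 inside the cluster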