Hey guys,
Been getting a strange SSL timeout issue that started a few days ago. It seems to happen in the default ‘clone’ step, and it looks something like this:
Initialized empty Git repository in /drone/src/.git/
+ git fetch origin +refs/heads/bugfix/UA-1656-admin-app---generated-data-text:
fatal: unable to access 'https://bitbucket.org/xxxxxx/admin-app.git/': OpenSSL SSL_connect: Connection reset by peer in connection to bitbucket.org:443
Our Drone setup is roughly as follows. We have one Drone VM instance always running, and we use drone-autoscaler to fire up build agents as needed. Our agents are set with a minimum of 0 and a maximum of 4, and I believe our concurrency per agent is set to 1.
The Drone VM and the agents all run on a private subnet in a dedicated VPC with no publicly exposed endpoints. I have a GCP HTTPS load balancer, which is the entry point for the Drone GUI and proxies requests back to the Drone VM. We also use Cloud NAT to allow the agents to access the internet.
After running a few tests, the issue seems to be caused by some sort of timeout during a long-running clone step, which happens when an agent is being dynamically provisioned. I noticed that if I already have 1 or 2 agents running for other builds, restarting a failed build usually succeeds.
If I set the minimum number of agents to 1 (so that there is always 1 agent running), the issue seems to happen much less frequently, but it still happens from time to time.
I am also seeing some 502 errors on the load balancer which look like this:
httpRequest: {
requestMethod: "POST"
requestUrl: "https://build.xxxxxx.dev/rpc/v2/build/1637/watch"
requestSize: "8"
status: 502
responseSize: "392"
userAgent: "Go-http-client/2.0"
remoteIp: "34.74.145.102"
serverIp: "10.142.0.2"
latency: "30.028465s"
}
The remoteIp is our NAT router on the VPC.
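One thing I have not done systematically is check whether these 502s all cluster around the same latency, which would point at a fixed timeout somewhere in the path. Below is a minimal sketch of how that could be checked, assuming the load balancer log entries have been exported as dictionaries with `status` and `latency` fields shaped like the entry above. The function names `parse_latency` and `latency_buckets` are made up for illustration, and all sample entries except the first are hypothetical.

```python
import re
from collections import Counter


def parse_latency(value):
    """Convert a latency string such as "30.028465s" to seconds as a float."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognised latency: {value!r}")
    return float(match.group(1))


def latency_buckets(entries, status=502):
    """Count entries with the given status, bucketed by latency rounded to whole seconds."""
    counts = Counter()
    for entry in entries:
        if entry.get("status") == status:
            counts[round(parse_latency(entry["latency"]))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"status": 502, "latency": "30.028465s"},  # taken from the log entry above
        {"status": 502, "latency": "30.011s"},     # hypothetical entry
        {"status": 200, "latency": "0.120s"},      # hypothetical entry
    ]
    # A single dominant bucket would suggest a fixed timeout rather than random resets.
    print(latency_buckets(sample))
```

If nearly all of the 502s land in one bucket, that would suggest a fixed timeout is cutting the connection, although it would not say which component is responsible.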
I’ve tried setting longer timeouts on the load balancer (up to 120s), but that has had no effect. I have not yet tried increasing the timeouts on the NAT router, which is where I’m starting to suspect the issue is.
Just wondering if anyone else here has had similar issues on a similar setup?
Thanks,
John