Bitbucket SSL timeout issues, drone-autoscaler on GCP

Hey guys,

I've been getting a strange SSL timeout issue that started a few days ago. It seems to happen in the default ‘clone’ step, and it looks something like this:

Initialized empty Git repository in /drone/src/.git/
+ git fetch origin +refs/heads/bugfix/UA-1656-admin-app---generated-data-text:
fatal: unable to access 'https://bitbucket.org/xxxxxx/admin-app.git/': OpenSSL SSL_connect: Connection reset by peer in connection to bitbucket.org:443 

Our Drone setup is roughly as follows. We have one Drone VM instance always running, and we use drone-autoscaler to fire up build agents as needed. Our agents are set with a minimum of 0 and a maximum of 4, and I believe our concurrency per agent is set to 1.
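Paraphrased from memory, the relevant part of our autoscaler configuration looks something like this (variable names as I recall them from the drone-autoscaler docs, so treat this as approximate):

# drone-autoscaler pool settings (from memory; values approximate)
DRONE_POOL_MIN=0              # scale down to zero agents when idle
DRONE_POOL_MAX=4              # never run more than 4 build agents
DRONE_AGENT_CONCURRENCY=1     # one concurrent build per agent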

The Drone VM and the agents are all running on a private subnet in a dedicated VPC with no publicly exposed endpoints. I have a GCP HTTPS load balancer which is the entry point for the Drone GUI, and it proxies the requests back to the Drone VM. We also use a Cloud NAT to allow the agents to access the internet.
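For context, the Cloud NAT side was set up with something along these lines (a from-memory sketch rather than the exact commands we ran; resource names and region are placeholders):

# from-memory sketch of the Cloud NAT setup (names/region are placeholders)
gcloud compute routers create drone-router \
  --network=drone-vpc --region=us-east1

gcloud compute routers nats create drone-nat \
  --router=drone-router --region=us-east1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges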

After running a few tests, the issue seems to be caused by some sort of timeout during a long-running clone step, which coincides with an agent being dynamically provisioned. I noticed that if I already have 1 or 2 agents running for other builds, then restarting a failed build usually succeeds.

If I set the minimum number of agents to 1 (so that there is always 1 agent running) the issue seems to happen much less frequently, but it still happens from time to time.

I am also seeing some 502 errors on the load balancer which look like this:

httpRequest: {
 requestMethod: "POST"
 requestUrl: "https://build.xxxxxx.dev/rpc/v2/build/1637/watch"
 requestSize: "8"
 status: 502
 responseSize: "392"
 userAgent: "Go-http-client/2.0"
 remoteIp: "34.74.145.102"
 serverIp: "10.142.0.2"
 latency: "30.028465s"
}

The remoteIp is our Cloud NAT router on the VPC.

I’ve tried setting longer timeouts on the load balancer (up to 120s) but that has had no effect. I have not yet tried increasing the timeouts on the NAT router, which is where I’m starting to suspect the issue is.
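If I do end up raising the NAT timeouts, I assume it would be something along these lines (untested on my side; flag names taken from the Cloud NAT docs, resource names are placeholders):

# untested sketch: raise Cloud NAT idle timeouts and port allocation
gcloud compute routers nats update drone-nat \
  --router=drone-router --region=us-east1 \
  --tcp-established-idle-timeout=1200s \
  --tcp-transitory-idle-timeout=60s \
  --min-ports-per-vm=128

# then confirm what actually got applied
gcloud compute routers nats describe drone-nat \
  --router=drone-router --region=us-east1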

Just wondering if anyone else here has had similar issues on a similar setup?

Thanks,
John

The SSL clone timeout and the 502 HTTP errors would be unrelated.

The git clone is a direct connection between the git client and Bitbucket, where the clone runs inside a Docker container. Drone does not route the clone traffic through the Drone server, or interfere with the clone execution in any way. It basically just runs the clone command in a container, waits for the container to exit, and then collects the logs. Like this:

# start your container and capture its id
CONTAINER=$(docker run -d <image> /bin/sh -c "git clone ...")

# wait for your container to finish
docker wait "$CONTAINER"

# collect the container logs
docker logs "$CONTAINER"

The runner, on the other hand, performs long polling: it connects to the drone server, holds the connection open for 30 seconds, and then disconnects. It is possible that the client or server force-closing an HTTP request could manifest as errors on a load balancer (which would line up with the ~30s latency in your 502 log entries); however, this would not necessarily be considered a problem, nor would it have any impact on cloning.

An SSL clone timeout is usually an issue with your host network configuration (or the host docker daemon network configuration) that prevents the git client from establishing the direct connection with Bitbucket.
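One quick way to take Drone out of the picture is to exercise the same path by hand from the agent host, inside a throwaway container. Something like the following (alpine/git is just a convenient image with a git entrypoint; the repository URL is a placeholder, and a private repo would also need credentials):

# run from the agent host; repo URL is a placeholder
docker run --rm alpine/git ls-remote https://bitbucket.org/<org>/<repo>.git

Even an authentication failure here would tell you the TLS connection itself was established, whereas a "Connection reset by peer" points at the host, the docker daemon, or the NAT path.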

@bradrydzewski thank you for the insight!

I suspect that the NAT router/gateway might then be the problem, as that is effectively the networking path from the drone host to Bitbucket. Or maybe it's something to do with the docker host itself. I'll look at both.

Any specifics that you think I should be looking at?

Thanks!
John

Just to follow up on this: I doubled the GCP NAT router timeouts and reduced the agents back to min=0, and so far so good. The issue seems to have been mitigated. I'll keep an eye on it.

Spoke too soon! Nothing has changed in the environment and the issue is now cropping up again more frequently.

I was able to SSH into one of the agent VMs and did notice that it is indeed happening in the git container. Networking seems fine, but I'll take another look…
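The next thing I plan to try is comparing the raw TLS handshake from the agent host itself against one made from inside a container on the same host, roughly like this (assuming openssl is available on the host; curlimages/curl is just a convenient image with TLS tooling):

# on the agent host itself
openssl s_client -connect bitbucket.org:443 -servername bitbucket.org < /dev/null

# from inside a container on the same host
docker run --rm curlimages/curl -sv https://bitbucket.org -o /dev/null

If the host handshake succeeds but the in-container one gets reset, that would narrow it down to the docker daemon networking rather than the NAT.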