Drone-runner-kube network issue on EKS

A known AWS issue can affect drone-runner-kube builds on EKS: a build intermittently fails with a network error and the job has to be rerun. Below is AWS support's explanation of this known behavior.

From AWS support:

In order to provide you with better assistance, I dug deeper and searched internally for related issues. I found that there is a known issue with accessing S3 objects over an S3 VPC gateway endpoint in us-east-1. Since ECR stores the image layers in S3, the ‘docker build’ command was failing intermittently. I can see that you are using the S3 gateway endpoint “vpce-0c95b134a65f122d9”.
To give a brief summary of the issue: when a network interface (ENI) has been created within the last 90 seconds, there is a small chance that accessing an S3 bucket through an S3 gateway endpoint in us-east-1 will time out; this is what happened in your failed builds. Once the network interface has been up for a while, the issue no longer occurs. Most common client applications that access S3 (such as the ECS Agent) have built-in retries that mitigate this issue, but Docker does not by default.
The internal team has acknowledged this as an issue and is actively working on a fix. In the meantime, the recommended workaround is either to add a “sleep 90” command at the start of the build (delaying the execution of the docker command) or to remove the S3 gateway endpoint from the VPC running the builds.
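
For anyone applying this in a build step, here is a minimal shell sketch (mine, not from AWS). It shows the "sleep 90" workaround from the reply above and, as an alternative that AWS did not suggest, a small retry wrapper that imitates the built-in retries other S3 clients have. The function name retry, the attempt count, the delay and the image tag example-image are all hypothetical; adjust them to your pipeline.

```shell
#!/bin/sh
# Hypothetical helper, not part of Drone or Docker:
# retry <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or max_attempts is reached.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while true; do
        # Success: stop retrying.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt budget is used.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "command failed after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Option 1 (workaround quoted above): wait until the ENI is older
# than 90 seconds before touching S3-backed image layers.
# sleep 90

# Option 2 (assumption, not from AWS): retry the build instead of
# always paying the 90 second delay. "example-image" is a placeholder.
# retry 3 30 docker build -t example-image .
```

Uncomment one of the two options at the start of the build step. The sleep always costs 90 seconds per build, while the retry only costs time when the timeout actually happens, but it will also retry builds that fail for unrelated reasons.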
