Drone version: 2.12
Autoscaler version: 1.8.2 running on AWS
We have a step called “subscriptions” that intermittently fails silently.
The step is marked as passing in the UI, despite there being no output.
A later step communicates with a web app in the “subscriptions” container and fails, apparently unable to resolve the container as it normally can.
Error Message: System.AggregateException : One or more errors occurred. (Name or service not known (subscriptions:80))
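To narrow down whether the container never comes up or just isn't resolvable yet, a check like the one below could be run at the start of the later step. This is only a minimal sketch and not something we currently have in the pipeline: `wait_for_service` and its parameters are hypothetical names, and it assumes Python is available in that step's image.

```python
import socket
import time


def wait_for_service(host, port, timeout=30.0, interval=1.0):
    """Poll host:port until it accepts a TCP connection or timeout expires.

    Returns (True, "ok") on success. On timeout returns (False, reason),
    where reason is "dns" if the last attempt failed name resolution, or
    "connect" if the name resolved but the connection failed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the name, then opens a TCP connection.
            with socket.create_connection((host, port), timeout=interval):
                return True, "ok"
        except socket.gaierror:
            # Must be caught before OSError, since gaierror is a subclass of it.
            reason = "dns"
        except OSError:
            reason = "connect"
        if time.monotonic() >= deadline:
            return False, reason
        time.sleep(interval)
```

Calling `wait_for_service("subscriptions", 80)` before the tests start would return a `"dns"` reason if the name never resolves, which matches the "Name or service not known" error above, and a `"connect"` reason if the container exists but the web app isn't listening.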
The step isn’t particularly complicated, as shown below. We pull the container image in an earlier step from our internal ECR. The ECR pull is fine, and other images we pull from our internal ECR run fine.
It normally runs successfully, like so
When it doesn’t work, there’s no output. I had a poke through our drone database, and there’s a null entry in the logs for that step.
There’s no error output or anything for that step, and everything else is working fine.
I was hoping that updating from the outdated AMIs mentioned in “Drone autoscaler AMIs too old for .net 6 application” (post #4 by Shruthikini) would have resolved this, but alas.
I can verify that the earlier step that pulls the image from ECR works, and that it has pulled the subscription image onto the agent, so that image is present.
The step that it depends on, which spins up a database, is also fine.
I don’t think there’s anything wrong with the subscription image, as it changes infrequently. I suspect something’s going wrong in Drone itself, or it’s a combination of the two.
We do not have this issue with other images that we’ve pulled from our internal ECR.
I can’t replicate this at will; it only happens occasionally, and rerunning the build will normally work just fine.
Is there a convenient way to link agents to a given build? I’d like to be able to go from the failed build to the agent to see if I can run the container manually.