Step intermittently fails with no output but indicates success

Hello,

Drone version: 2.12

Autoscaler version: 1.8.2 running on AWS

We have a step, called “subscriptions”, that intermittently fails silently.

The step is marked as passing in the UI, despite there being no output.

A later step communicates with a web app in the “subscriptions” container and fails, apparently unable to resolve the container name like it normally can.

  Error Message:
   System.AggregateException : One or more errors occurred. (Name or service not known (subscriptions:80))

The step isn’t particularly complicated, as shown below. We pull the container image in an earlier step from our internal ECR. The ECR pull is fine, and other images we pull from our internal ECR run fine.

It normally runs successfully, like so

When it doesn’t work, there’s no output. I had a poke through our drone database, and there’s a null entry in the logs for that step.

There’s no error output or anything for that step, and everything else is working fine.

I was hoping that updating from the outdated AMIs mentioned in “Drone autoscaler AMIs too old for .net 6 application” (#4, by Shruthikini) would have resolved this, but alas.

I can verify that the earlier step that pulls the image from ECR works, and that it has pulled the subscription image onto the agent, so that image is present.

The step that it depends on, which spins up a database, is also fine.

I don’t think there’s anything wrong with the subscription image, as it changes infrequently. I suspect something’s going wrong in Drone itself, or it’s a combination of the two.

We do not have this issue with other images that we’ve pulled from our internal ECR.

I can’t replicate this at will. It only happens occasionally, and rerunning the build will normally work just fine.

Is there a convenient way to link agents to a given build? I’d like to be able to go from the failed build to the agent to see if I can run the container manually.

I’m not entirely sure, because there’s no easy way to link a build to the agent/server it ran on, but I think I checked all our agents and couldn’t see anything in the logs related to this error. It’s possible the agent got autoscaled out between me spotting the failed build and checking, though.

If you choose the “graph view” and select any step, you can see information about that step, such as the name of the runner where it was executed.

Here is a step where the runner doesn’t have a name, so the IP address is provided. Would this be helpful?
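If you would rather script the lookup than click through the graph view, the same information may be available from the build's JSON. This is a rough sketch, not an official tool. It assumes the build endpoint (`/api/repos/{owner}/{repo}/builds/{number}`) returns a `stages` list whose entries carry `name` and `machine` fields, and that a personal token is accepted as a Bearer token. The helper names `stage_machines` and `fetch_build` are made up for illustration.

```python
import json
import urllib.request


def stage_machines(build: dict) -> dict:
    """Map each stage name to the machine (runner) recorded for it.

    Assumes the build JSON has a "stages" list with "name" and "machine"
    keys; missing values are reported as empty strings.
    """
    return {
        stage.get("name", ""): stage.get("machine", "")
        for stage in build.get("stages") or []
    }


def fetch_build(server: str, repo: str, number: int, token: str) -> dict:
    """Fetch one build as JSON (endpoint path and auth are assumptions)."""
    url = f"{server}/api/repos/{repo}/builds/{number}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

With a server URL, a repo slug such as `owner/name`, a build number, and a token, `stage_machines(fetch_build(...))` should give one entry per stage, which you can then match against your autoscaler's instance list.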

Huh, I had no idea that was added in the graph, that is helpful!

To be honest, I don’t look at the graph view, as drone defaults to the log view. It’d be nice if that were in the log view as well.

I think the subscription step is also running as detached, which doesn’t explain why it doesn’t output logs, but does explain why it continues when it’s in a bad state. We can’t run it as a service, because we need to auth against AWS ECR to pull it.
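Since a detached step lets the pipeline carry on regardless, one option might be a small readiness check at the start of the step that talks to the container, so a dead “subscriptions” container fails the build early with a clear message. A minimal sketch, assuming Python is available in that step's image; `wait_for_service` is a hypothetical helper, and the retry count and delay are arbitrary:

```python
import socket
import time


def wait_for_service(host: str, port: int, attempts: int = 10, delay: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, else False.

    DNS failures (socket.gaierror) and refused connections are both
    subclasses of OSError, so either one counts as "not reachable yet".
    """
    for attempt in range(1, attempts + 1):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError as exc:
            print(f"attempt {attempt}/{attempts}: {host}:{port} not reachable ({exc})")
            if attempt < attempts:
                time.sleep(delay)
    return False
```

Calling `wait_for_service("subscriptions", 80)` before the tests and exiting non-zero when it returns `False` would at least make the failure show up in the logs at the point where the container is missing, rather than as an `AggregateException` later.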

I’ll see if running that same container, undetached, with some junk commands causes something to break.