Error with autoscaled drone agents

We have an issue with agents spinning up and reporting an error.

Via the autoscaler CLI commands I can see

‘error during connect: Post “https://{IPADDRESS}:2376/v1.40/images/create?fromImage=drone%2Fdrone-runner-docker&tag=1”: EOF’

I can ssh into the agents, and can’t really see anything wrong with them, although I’m not sure what I’m meant to look for. Docker does seem to be running on them.

It’s been a while since I touched the setup, but we host our agents in AWS, and they have a custom cloudinit that loads in docker hub credentials so they don’t die to rate limiting. Pretty sure it’s the template from the docs with a config file jammed in.

It’d be nice to fix this because those agents wait around for a bit, preventing us from scaling to our maximum, until the autoscaler kills them (we’ve enabled the autoscaler reaper feature).

We’re still having this issue, and it occurs relatively often. Of the last 10 agents to spin up, two were spun up in an error state.

Hey @maxgruebneraeroqual happy to help with this, can you provide more info?

One root cause we observed in the past is the following …

You choose a base AMI where linux distro / package manager is configured to auto-upgrade packages on startup. As a result, when the autoscaler provisions a vm and it boots, the docker daemon starts and then the package manager immediately stops docker for upgrade, making the daemon unreachable for a period of time … when you manually login you see Docker daemon running, because the upgrade has since completed.

I believe aws linux is the biggest offender but other distros may be impacted as well. This can be solved by disabling the auto-upgrade in the base AMI. Here is a link to a past discussion, for reference

We were using the default AMI, which I thought would have been fine, but maybe because we’re using a custom cloudinit to shove the docker hub credentials in, that’s adjusted something.

After we updated to ubuntu 22.04 we’re getting the same error as in that post that you linked Brad, but much less frequently.

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

I’ll have a tinker and see where I get to.

Several months later, I’ve finally had the opportunity to look into this some more.

The docker service is indeed rebooting when an agent is created.

We’re using a custom cloud_init.yaml, basically identical to this one, to pass the docker hub credentials in, so that we can pull stuff without going over the docker hub rate limits.

At the bottom of that, there’s a couple of runcmd commands that reload the docker daemon systemd files are restart docker, which is probably what’s causing my issues.

I’ll see if I can come up with a better one


I encountered the same issue for many months. I tried with AMI ubuntu 20.04 & 22.04.

After reading logs in autoscaled instance, I saw that docker were restarted 2 times.

I guess this problem is random because sometimes the server is able to communicate with the agent after the second reboot (no problem case) and sometimes the server is able to communicate with the agent before the second reboot of docker and the software does not try to open a new connection with the agent.

So to fix this issue, I used the following tutorial to create a custom init configuration :

And I edited these lines :

  - [ systemctl, daemon-reload ]
  - [ systemctl, restart, docker ]

To this :

  - [ systemctl, daemon-reload ]
  - [ systemctl, start, docker ]

Since this change, I didn’t see ec2 instance stuck (since last week, and ~50 jobs)

To fix the issue, I think @brad should update the default init configuration.


1 Like

I totally forgot to update.

I did roughly the same Eowin, except I found that I didn’t need to run a start/restart command at all I think, and it still seems to load the docker hub credentials fine.

I’ve seen agents come up broken a couple of times since, but it happens much less often than it used to.

Hi @maxgruebneraeroqual,

I just changed the restart to start to be sure docker is started. This is maybe the reason you still have broken agent (to confirm), after ~90 jobs i didn’t encounter issue.

For docker hub credential, I have created a secret “dockerconfigjson” in the organization with the json code that you find in your ~/.docker/config.json.

In my .drone.yml I added it to log in to my docker registry :

  - dockerconfigjson

I think this solution is more secure than have clear credential in configuration file :slight_smile:

Thanks! This also fixed the issues I experienced with Hetzner.

Seems like a race condition: Docker was already started on the host (default) and Drone pulled the agent image and tries to start the container. But then the restart (cloud-init) of the docker daemon interrupts the startup of the containers and the autoscaler never recovers.

So yes, it should be systemctl start docker by default if we can ensure, that the override.conf of the docker service is already active. Or maybe force docker to not autostart after install, then the agent will simply retry until systemctl restart docker was invoked.

In general it would be nice, if the autoscaler is able to recover from failed image pull/starts.

1 Like