[duplicate] Make autoscaler more robust

From time to time, I’m getting this:

This happens when drone-autoscaler has just scaled up because of additional build jobs.

I can get around this by implementing a custom userData block:

    # use firewall to disable access to docker until it has restarted and has been able to pull an image
      - ufw default allow outgoing
      - ufw default allow incoming
      - ufw deny 2376
      - echo activating firewall
      - ufw enable
      - apt-get install -o Dpkg::Options::="--force-confold" --force-yes -y docker-ce # custom docker config is already in place; these options make sure installing docker doesn't overwrite it.
      - docker pull drone/drone-runner-docker
      - echo sleeping for 30 secs
      - sleep 30
      - echo opening firewall
      - ufw allow 2376

We inject this using the DRONE_AMAZON_USERDATA_FILE environment variable.
This lets Docker install without overwriting the existing config (daemon.json), start, and complete a docker pull before the daemon becomes reachable by Drone.

It would be good if more robustness were built into drone-autoscaler instead, so that we didn’t have to do this. For example, it could perform a docker pull (with appropriate retry logic) and only mark the runner as “ready for service” once that succeeds.
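To illustrate the kind of retry logic I mean, here is a minimal sketch in plain sh. It is not how drone-autoscaler works today; the `retry` and `open_when_ready` helpers, the attempt count, and the delay are all made up for illustration. The idea is to replace the fixed `sleep 30` with a pull that is retried, and to open port 2376 only once the pull has succeeded.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Hypothetical readiness gate (assumed values: 10 attempts, 5 seconds apart).
# The firewall is only opened if the image pull eventually succeeds.
open_when_ready() {
    retry 10 5 docker pull drone/drone-runner-docker && ufw allow 2376
}
```

Calling `open_when_ready` after `ufw enable` would keep the port closed on a node where Docker never becomes usable, instead of opening it after a fixed delay.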

I think we discussed this in another thread, and the maintainer said they’d never experienced issues with drone runners coming online, but we’re seeing it quite frequently. It seems to me that drone-autoscaler simply marks the node as healthy too soon. We’d very much like to not have to maintain a custom userdata config.

Just to note: we’ve also seen other situations where an EC2 instance becomes unhealthy, but Drone has no way of checking for or discarding unhealthy nodes. We therefore run a “drone-autoscaler-janitor” in a separate process that uses Drone’s API to compare AWS EC2 instance status with what’s found in Drone. This has been necessary in order to discard provisioning failures etc. It would be awesome if drone-autoscaler had this “unhappy path” logic built in.
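The core of that comparison is just a set difference. The following is a simplified sketch, not our actual janitor: `stale_servers` is a hypothetical helper, and it assumes you have already written two files with one instance ID per line, one listing the servers Drone knows about and one listing the instances AWS reports as healthy. How those lists are fetched is left out here.

```shell
#!/bin/sh
# stale_servers DRONE_LIST HEALTHY_LIST
# Hypothetical helper: prints every ID found in DRONE_LIST that is missing
# from HEALTHY_LIST. Both files hold one instance ID per line, in any order.
stale_servers() {
    drone_sorted=$(mktemp)
    healthy_sorted=$(mktemp)
    # comm needs sorted input; -u also drops duplicate IDs.
    sort -u "$1" > "$drone_sorted"
    sort -u "$2" > "$healthy_sorted"
    # -23 suppresses lines unique to the second file and lines common to both,
    # leaving only the IDs that appear in the Drone list alone.
    comm -23 "$drone_sorted" "$healthy_sorted"
    rm -f "$drone_sorted" "$healthy_sorted"
}
```

Each ID it prints is a candidate for removal from Drone, for example a node whose provisioning failed.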

It looks like there is an existing thread for this topic at “Autoscaler too brittle?”. Let’s move the discussion to this existing thread. I will reply to your messages above in the original thread.