Does anyone have a better solution than terminating the agent that showed this error? In our experience, once you see it, it starts happening more and more often until that agent is useless and must be replaced. (Our agents are in an ASG, so it's not a huge deal, but it's a pain.)
This time it took 13 days from boot to first incident of error.
Back-of-the-envelope math from our daily build averages (so not accurate) puts that at around 200 builds on that agent.
FYI this error originates in the Docker daemon, not in Drone, and could indicate a problem with Docker. We have limited ability to troubleshoot Docker issues; however, these are some things I would research:
check your Docker daemon logs for errors
check the Docker resources on the host. Is there a buildup of resources (containers, networks, volumes, images)?
do you need to run a prune?
do you need to upgrade to a newer version of Docker?
are you allowing developers to mount the host machine socket and interact with Docker directly? Is it possible they are not cleaning up after themselves?
are you using a tool that tries to periodically clean up Docker resources? These tools are often too aggressive and remove networks before Drone is done using them.
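If it helps, here is a rough Python sketch for collecting the resource checks above in one pass on an affected agent. It only shells out to read-only Docker CLI commands (`docker system df`, `docker ps --all`, `docker network ls`, `docker volume ls`, `docker version`) and does not prune anything. The `CHECKS` table and the `run_checks` helper are names I made up for illustration; they are not part of Drone or Docker. Daemon logs are not covered because their location depends on the host; on systemd hosts `journalctl -u docker.service` is the usual place to look, but that is an assumption about your setup.

```python
import subprocess

# Hypothetical table of read-only Docker CLI commands to run on the agent host.
# None of these modify or remove anything.
CHECKS = {
    "disk usage": ["docker", "system", "df"],
    "all containers": ["docker", "ps", "--all"],
    "networks": ["docker", "network", "ls"],
    "volumes": ["docker", "volume", "ls"],
    "version": ["docker", "version"],
}


def run_checks(runner=subprocess.run):
    """Run each check and return a dict of label -> command output.

    `runner` is injectable so the function can be exercised without Docker.
    """
    results = {}
    for label, cmd in CHECKS.items():
        try:
            proc = runner(cmd, capture_output=True, text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Docker CLI missing, or the daemon is not answering in time.
            results[label] = f"failed: {exc}"
            continue
        # Show stderr when the command fails so daemon errors stay visible.
        results[label] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    for label, output in run_checks().items():
        print(f"== {label} ==")
        print(output)
```

Running this on a fresh agent and again on one that has started failing should make any buildup of leftover containers, networks, or volumes easier to spot.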
FWIW this has not been a problem for us at cloud.drone.io.
This handles pruning images, containers and volumes created by Drone. This does not prune objects that your users create out-of-band by mounting the host machine Docker socket in their pipeline, so that is the only caveat.