Builds on our Drone server occasionally fail when a step cannot be started. The log for the step never gets initialized and instead shows the following message on a red background:
Error response from daemon: container is marked for removal and cannot be started
Here are a few configuration notes and observations:
DRONE_RUNNER_CAPACITY=7
We make extensive use of depends_on, so multiple steps of a job can run at the same time (see the pipeline sketch after this list)
Our CI/CD configuration and dev practices are such that Drone can be running multiple jobs on the same commit, where only the branch name differs
Failures only occur when multiple jobs are running simultaneously
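To give a sense of the shape of our pipelines, here is a minimal sketch (step names, images, and commands are illustrative, not our actual configuration):

```yaml
kind: pipeline
type: docker
name: default

steps:
- name: build
  image: golang:1.20
  commands:
  - go build ./...

# these two steps depend only on build, so Drone starts them in parallel
- name: unit-tests
  image: golang:1.20
  commands:
  - go test ./...
  depends_on:
  - build

- name: lint
  image: golang:1.20
  commands:
  - go vet ./...
  depends_on:
  - build
```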
My guess is that there’s some kind of container cache cleanup that is not ref-counting the number of jobs/steps that refer to the container… Or, Docker has race conditions when the same container is being started by multiple clients…
This error comes from the Docker daemon when trying to run docker start. I do not have any immediate reason to conclude that Drone is causing this error, since Drone would not remove a container before starting it. Also, we have not received similar reports from the broader community, which suggests this is isolated to your setup.

Drone does not make any attempt to garbage collect Docker resources (containers, volumes, networks) until after the pipeline is complete. It would not try to clean up while the pipeline is in progress.
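To make the lifecycle concrete, here is roughly what the Docker runner does per step, expressed as equivalent CLI calls (a sketch only; the runner talks to the Docker Engine API directly, and the names below are made up):

```sh
# one network per pipeline, one container per step
docker network create pipeline-net
docker create --name step-build --network pipeline-net golang:1.20 go build ./...
docker start step-build
# ...further steps are created and started the same way...

# garbage collection happens only once the whole pipeline has finished:
docker rm -f step-build
docker network rm pipeline-net
```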
Is it possible you have some custom Docker garbage collector routine running on your server that is prematurely removing stopped containers before they even start?

This is not out of the question. In the past we have received bug reports that were traced to the Docker daemon. Since this error comes from the daemon, I would recommend monitoring your Docker daemon logs, which may provide you with more details.
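For example, on a systemd host you could follow the daemon logs and filter for removal-related messages like this (adjust the unit name for your distro):

```sh
# follow dockerd logs and watch for messages about container removal
sudo journalctl -u docker.service -f | grep -i -E "remov|zfs"
```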
Is it possible you have some custom Docker garbage collector routine running on your server that is prematurely removing stopped containers before they even start?
We do not have any other Docker clients running on this host, only Drone and everything it manages.
I don’t know much about Docker containers and how Docker identifies them. I’m still guessing it’s some kind of race condition involving signatures as keys… The Docker daemon’s error message is not common in Google searches. I will monitor the Docker daemon more closely, and see if I can set up a test case (using Drone) that more reliably triggers the error.
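One crude way to stress the daemon the way parallel pipelines do, by hammering it with concurrent create/start/remove cycles, might look like this (purely a sketch; no guarantee it reproduces the error):

```sh
# 20 concurrent clients, each cycling create/start/rm 50 times
for i in $(seq 1 20); do
  (
    for j in $(seq 1 50); do
      id=$(docker create alpine true)
      docker start "$id" >/dev/null || echo "start failed: $id"
      docker rm -f "$id" >/dev/null
    done
  ) &
done
wait
```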
Summary: the recommended workaround for now is to have /var/lib/docker on a non-ZFS file system, e.g., ext4. If that is not possible, manually deleting the ZFS backing-store datasets, possibly after a reboot, is the way to go.
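For anyone hitting the same thing, the manual cleanup might look roughly like this (the dataset name is a placeholder; substitute whatever pool/dataset /var/lib/docker actually lives on, and note that zfs destroy is irreversible):

```sh
# confirm the daemon is actually using the zfs storage driver
docker info --format '{{.Driver}}'

# list the datasets Docker created under its pool (placeholder name)
zfs list -r tank/docker

# stop Docker (reboot first if the datasets stay busy), then destroy the leftovers
sudo systemctl stop docker
sudo zfs destroy -R tank/docker
```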