Drone GC stopped pruning images

Hi there,

I’ve been running drone-gc alongside my drone agent containers (1 agent and 1 gc per machine), and it worked mostly fine for a couple of weeks, until one of them suddenly stopped pruning images.

It had already been running for a few days when, on April 27th, it hit an error deleting a Node image (it had logged several similar errors in the past):

{
    "level": "error",
    "error": "Error response from daemon: conflict: unable to delete 406f227b21f5 (cannot be forced) - image has dependent child images",
    "id": "sha256:406f227b21f5842a19835bddced00a11a87cce2a179cecc1b9d02a4ac593f2d7",
    "image": [
        "node:8.9.4-alpine"
    ],
    "time": "2018-04-27T13:46:14Z",
    "message": "cannot remove image"
}

After that, it kept updating the image cache (“image used, update cache”), but never again logged a prune operation. I noticed it only a week later (on May 1st) when my instance ran out of disk :confused:

Any ideas what could cause this? Anyone had the same issue? I’ll take a look at the code later to see if I can find anything.

Actually, I missed something:

{
    "level": "debug",
    "id": "sha256:4c8c3896c3ef63a297f6d7e77c79455b078b22d011d5bf892a494a2a38e98f22",
    "size": "139.8MB",
    "created": 1524781712,
    "image": [
        "clojure:lein-2.8.1-alpine"
    ],
    "time": "2018-04-27T13:46:14Z",
    "message": "remove image"
}

This was the last log entry before the output turned into nothing but “image used, update cache” messages.

Looking at the code, it should log either an info or an error message after this, depending on whether it managed to delete the image. But it never did.

Something tells me that the docker remove call got stuck for some reason (it wouldn’t be the first time I’ve seen Docker freeze). Maybe some kind of timeout would be in order?

Anyway, I’ve only hit this once after running several gcs for a couple of weeks. I’ll keep monitoring them to see if the issue repeats…

All of our functions take a context, which means we could definitely add a timeout to each garbage-collection cycle. We would probably want a reasonably high timeout, since the primary goal is to avoid blocking indefinitely.

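+ // timeout caps the duration of a single garbage-collection cycle.
+ var timeout = time.Hour

func (c *collector) Collect(ctx context.Context) error {
+	// Bound the whole cycle so a hung Docker call cannot block forever.
+	ctx, cancel := context.WithTimeout(ctx, timeout)
+	defer cancel()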

	c.collectContainers(ctx)
	c.collectDanglingImages(ctx)
	c.collectImages(ctx)
	c.collectNetworks(ctx)
	c.collectVolumes(ctx)
	return nil
}

I just submitted a quick patch https://github.com/drone/drone-gc/commit/86a359e35f4d5d79b5eff7ea651c8119d1bc19b1