I use the drone-docker plugin to build images and it’s been working fine for a while. I hadn’t made commits that triggered builds for a few weeks, but now the process fails because the docker build process within drone-docker cannot pull images. I exec’d into the step during its execution and saw something interesting. There is the expected docker0 bridge within the container (172.18.0.1), as well as eth0 (172.18.0.2) and eth1 (172.17.0.2). It is eth1 that bridges to the host’s docker0 bridge. Running “ip route show” confirms that the container’s default route is via 172.18.0.1, which is the container’s own docker0 bridge. Is this correct behavior? If I manually remove the default route and force it through the host’s bridge, the container can reach routable IP space and the build proceeds further. I cannot verify this for previous builds, but it feels like the container is now being created, and its interfaces enumerated, differently, which is causing the routing confusion.
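To make the symptom concrete, here is a small sketch that parses route output shaped like what I saw inside the step. The sample text is modeled on the addresses above (the exact `dev` fields are my assumption, not a verbatim capture):

```python
import re

# Illustrative routing table modeled on what "ip route show" printed
# inside the stuck build step; addresses are from the post, the rest
# is assumed for the example.
SAMPLE_ROUTES = """\
default via 172.18.0.1 dev eth0
172.17.0.0/16 dev eth1 proto kernel scope link src 172.17.0.2
172.18.0.0/16 dev docker0 proto kernel scope link src 172.18.0.1
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
"""

def default_gateway(routes: str) -> str:
    """Return the gateway address of the default route, if any."""
    for line in routes.splitlines():
        match = re.match(r"default via (\S+)", line)
        if match:
            return match.group(1)
    raise ValueError("no default route found")

# The default gateway is 172.18.0.1 -- the same address the container's
# own docker0 bridge holds -- so outbound traffic never actually leaves
# via eth1 toward the host's bridge.
print(default_gateway(SAMPLE_ROUTES))
```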
The only difference I can tell between a few weeks ago and now is that I am on a slightly newer kernel. I was using 4.4.0-112-generic; now I am on 4.4.0-116-generic on the host.
It looks like docker0 is getting assigned 172.18.0.1 within the container, and the host docker network that gets ephemerally created for this stage also has 172.18.0.1 on the host. The container has two routes for 172.18.0.0/16: one via the container’s docker0 and another via eth0.
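A quick sketch of why those two routes collide, using the addresses above (the /16 prefix length is assumed from Docker’s default bridge sizing):

```python
import ipaddress

# The container's own internal bridge and its host-facing interface,
# using the addresses observed in the container (prefix length assumed):
docker0 = ipaddress.ip_interface("172.18.0.1/16")  # container-internal bridge
eth0 = ipaddress.ip_interface("172.18.0.2/16")     # ephemeral host network

# Both interfaces produce the identical connected route, so any packet
# to 172.18.x.x matches two routes of equal prefix length -- which one
# the kernel uses is effectively arbitrary from the container's side.
print(docker0.network)
print(eth0.network)
print(docker0.network == eth0.network)
```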
I was hoping this was down to a difference in host docker versions or kernel changes. The last working state used kernel 4.4.0-112-generic with docker 17.12.0, whereas I have since moved to kernel 4.4.0-116-generic and docker 18.0.2. I downgraded both, to no avail.
Very rarely, when I restart the same build, it will get network access and pull layers for a bit. Then it times out and can no longer reach the network. That seems to make sense to me, given it has two conflicting routes for the 172.18/16 net?
I found the problem. It looks like the dind image hardcodes its docker0 bridge at 172.18.0.1/16. This means the host must already have a bridge occupying 172.18.0.0/16, so that when the ephemeral network is created for the dind pipeline step it receives a block other than 172.18.0.0/16. In my case, I had modified my docker-compose setup to explicitly set my drone network to 172.100.0.0/16, which left 172.18.0.0/16 still available on the host. That put both the temporary “step” network and the internal docker bridge on 172.18.0.0/16 space, causing the routing confusion.
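To illustrate the mechanism, here is a rough sketch of subnet allocation. The allocator below is my simplified assumption of Docker’s behavior (walk the default 172.17–172.31 pools and take the first block not already in use), not its actual implementation:

```python
import ipaddress

# dind hardcodes its internal docker0 bridge here (per the finding above):
DIND_BRIDGE = ipaddress.ip_network("172.18.0.0/16")

# Simplified stand-in for Docker's default local address pools:
POOLS = [ipaddress.ip_network(f"172.{n}.0.0/16") for n in range(17, 32)]

def next_free_subnet(in_use):
    """Return the first pool subnet not overlapping any in-use network."""
    taken = [ipaddress.ip_network(n) for n in in_use]
    for candidate in POOLS:
        if not any(candidate.overlaps(t) for t in taken):
            return candidate
    raise RuntimeError("address pools exhausted")

# With the drone network moved to 172.100.0.0/16, only the default bridge
# occupies 172.17.0.0/16, so the ephemeral step network lands on
# 172.18.0.0/16 -- colliding with dind's hardcoded internal bridge:
step_net = next_free_subnet(["172.17.0.0/16", "172.100.0.0/16"])
print(step_net, step_net.overlaps(DIND_BRIDGE))

# If something on the host already holds 172.18.0.0/16, the step network
# is pushed to 172.19.0.0/16 and the collision disappears:
step_net2 = next_free_subnet(["172.17.0.0/16", "172.18.0.0/16"])
print(step_net2, step_net2.overlaps(DIND_BRIDGE))
```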
I suppose the proper fix would be for dind not to hardcode the 172.18.0.1/16 address of docker0, and instead pick something dynamic and non-overlapping.
Not sure if this is a bug in drone-docker or the upstream dind?
Glad to hear you figured it out. I believe this would need to be fixed upstream in the dind image, since by default the plugin starts docker without any flags or customization. You can find the dind github repository (and issue tracker) here https://github.com/docker-library/docker