I just transitioned the agent from one host to another (a beefier one) after about a year on the previous host, and am now seeing odd network flakiness. The repo under test has a 2 hr timeout and its test stage runs a VM under qemu. Luckily this stage is not very clever and will just hang around waiting to be killed by the timeout if the VM doesn’t power off, which makes it great for debugging odd network-related issues.
Well, my issue seems to stem from creating a dummy bridge in the container for qemu’s use and hard-coding its IP. The problem is that the IP I hard-code sometimes happens to land on eth0’s subnet and, worse, happens to be the default gateway, which breaks outbound traffic.
This is a NIC that seems to be set up by drone. Is this new/on purpose? I can’t recall this being the case before, and I’m not seeing much in the docs.
Here’s how things end up when we have problems:
agent host
$ ip addr show docker0
6: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:a6:11:fa:f6 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet 10.42.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:a6ff:fe11:faf6/64 scope link
valid_lft forever preferred_lft forever
stuck container
# ip route; ip addr show eth0; ip addr show eth1; ip addr show hv
default via 172.18.0.1 dev eth0
172.17.0.0/16 dev eth1 scope link src 172.17.0.5
172.18.0.0/24 dev hv scope link src 172.18.0.1
172.18.0.0/16 dev eth0 scope link src 172.18.0.2
345: eth0@if346: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.2/16 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
343: eth1@if344: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP
link/ether 02:42:ac:11:00:05 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.5/16 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe11:5/64 scope link
valid_lft forever preferred_lft forever
2: hv: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether 36:7c:2a:64:6b:86 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/24 scope global hv
valid_lft forever preferred_lft forever
inet6 fe80::dccc:19ff:febc:33d9/64 scope link
valid_lft forever preferred_lft forever
testing hunch
# ping -c1 -W1 8.8.8.8; ip addr del 172.18.0.1/24 dev hv; ping -c1 -W1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=122 time=993.446 ms
--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 993.446/993.446/993.446 ms
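For anyone following along, the hunch can be sketched with Python’s `ipaddress` module (addresses taken from the output above): both connected routes cover the gateway address, longest-prefix match prefers hv’s /24 over eth0’s /16, and on top of that 172.18.0.1 is literally assigned to hv, so the kernel treats the gateway as a local address.

```python
import ipaddress

gw = ipaddress.ip_address("172.18.0.1")       # default gateway (via eth0)
hv = ipaddress.ip_network("172.18.0.0/24")    # dummy bridge's subnet
eth0 = ipaddress.ip_network("172.18.0.0/16")  # drone-created network on eth0

# Both connected routes match the gateway address...
assert gw in hv and gw in eth0
# ...but longest-prefix match prefers the /24, so gateway-bound traffic
# goes out hv instead of eth0 -- and nothing on hv answers.
more_specific = max((hv, eth0), key=lambda n: n.prefixlen)
print(more_specific)  # 172.18.0.0/24
```

That matches the ping behavior: deleting 172.18.0.1/24 from hv removes the shadowing /24 route, lookups fall back to eth0’s /16, and the gateway is reachable again.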
Now I can definitely be smarter about dynamically choosing a subnet for hv, but I’m curious whether something is misconfigured on this new drone agent (it shouldn’t be) and/or why I never saw this issue on the old agent.
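On the “be smarter” front, a minimal sketch of what I have in mind (Python stdlib only; the route list and the `10.200.0.0/16` pool are just placeholders — in practice the routes would come from parsing `ip route` inside the container):

```python
import ipaddress

def pick_bridge_subnet(existing, pool="10.200.0.0/16"):
    """Return the first /24 inside `pool` that overlaps no existing connected route."""
    taken = [ipaddress.ip_network(n) for n in existing]
    for cand in ipaddress.ip_network(pool).subnets(new_prefix=24):
        if not any(cand.overlaps(t) for t in taken):
            return cand
    raise RuntimeError("subnet pool exhausted")

# Connected routes from the stuck container above:
routes = ["172.17.0.0/16", "172.18.0.0/16"]
print(pick_bridge_subnet(routes))  # 10.200.0.0/24
```

The chosen subnet would then feed into the `ip addr add` for hv instead of the hard-coded 172.18.0.1/24, so the bridge can never collide with whatever networks drone happens to attach.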