Service hostname is intermittently being resolved to wrong IPs

I have tried both the official drone-cli 0.7 and a self-built 0.8. My Docker version is 17.04.0-ce, build 78d1802, and I am running openSUSE Tumbleweed.

My .drone.yml:

pipeline:
  ping:
    image: podnov/network-utils
    commands:
      - cat /etc/hosts
      - cat /etc/resolv.conf
      - ping database -c 1
services:
  database:
    image: mongo:3.2
    command: [ --smallfiles ]

I am pasting the output of a few drone exec runs. Please pay attention to the IP address and hostname that database resolves to in the ping output.

The first run resolved database to 104.239.207.44, the second to 198.105.244.130, which does not answer, and the third correctly to the local service container. The name always resolves to one of these three: the first address, the second address, or the correct container IP. Note: when steps run in a group, the same hostname can resolve to two different IPs. Adding more delay does not seem to help.

» drone exec
[database:L0:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=50561b8f1ff5
[database:L1:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] db version v3.2.16
[database:L2:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] git version: 056bf45128114e44c5358c7a8776fb582363e094
[database:L3:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t  3 May 2016
[database:L4:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] allocator: tcmalloc
[database:L5:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] modules: none
[database:L6:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] build environment:
[database:L7:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten]     distmod: debian81
[database:L8:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten]     distarch: x86_64
[database:L9:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten]     target_arch: x86_64
[database:L10:0s] 2017-09-15T05:06:51.037+0000 I CONTROL  [initandlisten] options: { storage: { mmapv1: { smallFiles: true } } }
[database:L11:0s] 2017-09-15T05:06:51.040+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
[database:L12:0s] 2017-09-15T05:06:51.185+0000 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
[database:L13:0s] 2017-09-15T05:06:51.185+0000 I CONTROL  [initandlisten] 
[database:L14:0s] 2017-09-15T05:06:51.185+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
[database:L15:0s] 2017-09-15T05:06:51.185+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
[database:L16:0s] 2017-09-15T05:06:51.185+0000 I CONTROL  [initandlisten] 
[database:L17:0s] 2017-09-15T05:06:51.253+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
[database:L18:0s] 2017-09-15T05:06:51.253+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
[database:L19:0s] 2017-09-15T05:06:51.253+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
[ping:L0:0s] + cat /etc/hosts
[ping:L1:0s] 127.0.0.1  localhost
[ping:L2:0s] ::1        localhost ip6-localhost ip6-loopback
[ping:L3:0s] fe00::0    ip6-localnet
[ping:L4:0s] ff00::0    ip6-mcastprefix
[ping:L5:0s] ff02::1    ip6-allnodes
[ping:L6:0s] ff02::2    ip6-allrouters
[ping:L7:0s] 172.17.0.3 f4aaff0b8cae
[ping:L8:0s] 172.25.0.2 f4aaff0b8cae
[ping:L9:0s] + cat /etc/resolv.conf
[ping:L10:0s] search attlocal.net
[ping:L11:0s] nameserver 127.0.0.11
[ping:L12:0s] options ndots:0
[ping:L13:0s] + ping database -c 1
[ping:L14:0s] PING database (104.239.207.44) 56(84) bytes of data.
[ping:L15:0s] 64 bytes from 104.239.207.44 (104.239.207.44): icmp_seq=1 ttl=46 time=41.0 ms
[ping:L16:0s] 
[ping:L17:0s] --- database ping statistics ---
[ping:L18:0s] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
[ping:L19:0s] rtt min/avg/max/mdev = 41.066/41.066/41.066/0.000 ms
» drone exec
[database:L0:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=9f6b65f496df
[database:L1:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] db version v3.2.16
[database:L2:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] git version: 056bf45128114e44c5358c7a8776fb582363e094
[database:L3:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t  3 May 2016
[database:L4:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] allocator: tcmalloc
[database:L5:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] modules: none
[database:L6:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] build environment:
[database:L7:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten]     distmod: debian81
[database:L8:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten]     distarch: x86_64
[database:L9:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten]     target_arch: x86_64
[database:L10:0s] 2017-09-15T05:10:40.831+0000 I CONTROL  [initandlisten] options: { storage: { mmapv1: { smallFiles: true } } }
[database:L11:0s] 2017-09-15T05:10:40.833+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
[database:L12:0s] 2017-09-15T05:10:40.975+0000 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
[database:L13:0s] 2017-09-15T05:10:40.975+0000 I CONTROL  [initandlisten] 
[database:L14:0s] 2017-09-15T05:10:40.975+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
[database:L15:0s] 2017-09-15T05:10:40.975+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
[database:L16:0s] 2017-09-15T05:10:40.975+0000 I CONTROL  [initandlisten] 
[database:L17:0s] 2017-09-15T05:10:41.005+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
[database:L18:0s] 2017-09-15T05:10:41.005+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
[database:L19:0s] 2017-09-15T05:10:41.005+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
[ping:L0:0s] + cat /etc/hosts
[ping:L1:0s] 127.0.0.1  localhost
[ping:L2:0s] ::1        localhost ip6-localhost ip6-loopback
[ping:L3:0s] fe00::0    ip6-localnet
[ping:L4:0s] ff00::0    ip6-mcastprefix
[ping:L5:0s] ff02::1    ip6-allnodes
[ping:L6:0s] ff02::2    ip6-allrouters
[ping:L7:0s] 172.17.0.3 524afb980fe6
[ping:L8:0s] 172.22.0.2 524afb980fe6
[ping:L9:0s] + cat /etc/resolv.conf
[ping:L10:0s] search attlocal.net
[ping:L11:0s] nameserver 127.0.0.11
[ping:L12:0s] options ndots:0
[ping:L13:0s] + ping database -c 1
[ping:L14:10s] PING database (198.105.244.130) 56(84) bytes of data.
[ping:L15:10s] 
[ping:L16:10s] --- database ping statistics ---
[ping:L17:10s] 1 packets transmitted, 0 received, 100% packet loss, time 0ms
[ping:L18:10s] 
2017/09/15 00:10:54 drone_step_0 : exit code 1
» drone exec
[database:L0:0s] 2017-09-15T05:12:37.819+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=764697466d6f
[database:L1:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] db version v3.2.16
[database:L2:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] git version: 056bf45128114e44c5358c7a8776fb582363e094
[database:L3:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1t  3 May 2016
[database:L4:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] allocator: tcmalloc
[database:L5:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] modules: none
[database:L6:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] build environment:
[database:L7:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten]     distmod: debian81
[database:L8:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten]     distarch: x86_64
[database:L9:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten]     target_arch: x86_64
[database:L10:0s] 2017-09-15T05:12:37.831+0000 I CONTROL  [initandlisten] options: { storage: { mmapv1: { smallFiles: true } } }
[database:L11:0s] 2017-09-15T05:12:37.834+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
[database:L12:0s] 2017-09-15T05:12:37.989+0000 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
[database:L13:0s] 2017-09-15T05:12:37.989+0000 I CONTROL  [initandlisten] 
[database:L14:0s] 2017-09-15T05:12:37.989+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
[database:L15:0s] 2017-09-15T05:12:37.989+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
[database:L16:0s] 2017-09-15T05:12:37.989+0000 I CONTROL  [initandlisten] 
[database:L17:0s] 2017-09-15T05:12:38.015+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
[database:L18:0s] 2017-09-15T05:12:38.015+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
[database:L19:0s] 2017-09-15T05:12:38.015+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
[ping:L0:0s] + cat /etc/hosts
[ping:L1:0s] 127.0.0.1  localhost
[ping:L2:0s] ::1        localhost ip6-localhost ip6-loopback
[ping:L3:0s] fe00::0    ip6-localnet
[ping:L4:0s] ff00::0    ip6-mcastprefix
[ping:L5:0s] ff02::1    ip6-allnodes
[ping:L6:0s] ff02::2    ip6-allrouters
[ping:L7:0s] 172.17.0.3 2dcd8390c9f1
[ping:L8:0s] 172.26.0.3 2dcd8390c9f1
[ping:L9:0s] + cat /etc/resolv.conf
[ping:L10:0s] search attlocal.net
[ping:L11:0s] nameserver 127.0.0.11
[ping:L12:0s] options ndots:0
[ping:L13:0s] + ping database -c 1
[ping:L14:0s] PING database (172.26.0.2) 56(84) bytes of data.
[ping:L15:0s] 64 bytes from drone_services_0.drone_default (172.26.0.2): icmp_seq=1 ttl=64 time=0.063 ms
[ping:L16:0s] 
[ping:L17:0s] --- database ping statistics ---
[ping:L18:0s] 1 packets transmitted, 1 received, 0% packet loss, time 0ms
[ping:L19:0s] rtt min/avg/max/mdev = 0.063/0.063/0.063/0.000 ms
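
One way to tell the three outcomes above apart in a script is to check whether the address returned for the service name is in a private (RFC 1918) range, which is where Docker bridge networks normally live. The sketch below is only an illustration and not part of Drone; the function name is_private_ipv4 is hypothetical.

```shell
# Return 0 if the IPv4 address in $1 is in an RFC 1918 private range
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), otherwise return 1.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Classify the three addresses seen in the runs above.
for ip in 104.239.207.44 198.105.244.130 172.26.0.2; do
  if is_private_ipv4 "$ip"; then
    echo "$ip private"
  else
    echo "$ip public"
  fi
done
```

In a pipeline step the address could come from a lookup such as getent hosts database, assuming getent is available in the image.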

I also tried running this example:

pipeline:
  ping:
    image: mongo:3.0
    group: ping
    commands:
      - sleep 15
      - 'mongo --host mongo --eval "{ ping: 5 }"'

  ping2:
    image: mongo:3.0
    group: ping
    commands:
      - sleep 15
      - 'mongo --host mongo --eval "{ ping: 5 }"'

services:
  mongo:
    image: mongo:3.0
    command: [ --smallfiles ]

This is the result I got:

» drone exec
[mongo:L0:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1 port=27017 dbpath=/data/db 64-bit host=c3f2e1fa654d
[mongo:L1:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] db version v3.0.15
[mongo:L2:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] git version: b8ff507269c382bc100fc52f75f48d54cd42ec3b
[mongo:L3:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] build info: Linux ip-10-166-66-3 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 BOOST_LIB_VERSION=1_49
[mongo:L4:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] allocator: tcmalloc
[mongo:L5:0s] 2017-09-15T05:22:47.261+0000 I CONTROL  [initandlisten] options: { storage: { mmapv1: { smallFiles: true } } }
[mongo:L6:0s] 2017-09-15T05:22:47.270+0000 I STORAGE  [initandlisten] 
[mongo:L7:0s] 2017-09-15T05:22:47.270+0000 I STORAGE  [initandlisten] ** WARNING: Readahead for /data/db is set to 512KB
[mongo:L8:0s] 2017-09-15T05:22:47.270+0000 I STORAGE  [initandlisten] **          We suggest setting it to 256KB (512 sectors) or less
[mongo:L9:0s] 2017-09-15T05:22:47.270+0000 I STORAGE  [initandlisten] **          http://dochub.mongodb.org/core/readahead
[mongo:L10:0s] 2017-09-15T05:22:47.270+0000 I JOURNAL  [initandlisten] journal dir=/data/db/journal
[mongo:L11:0s] 2017-09-15T05:22:47.270+0000 I JOURNAL  [initandlisten] recover : no journal files present, no recovery needed
[mongo:L12:0s] 2017-09-15T05:22:47.806+0000 I JOURNAL  [initandlisten] preallocateIsFaster=true 3.6
[mongo:L13:1s] 2017-09-15T05:22:48.333+0000 I JOURNAL  [initandlisten] preallocateIsFaster=true 4.72
[ping2:L0:0s] + sleep 15
[ping:L0:0s] + sleep 15
[mongo:L14:2s] 2017-09-15T05:22:49.619+0000 I JOURNAL  [initandlisten] preallocateIsFaster=true 2.22
[mongo:L15:2s] 2017-09-15T05:22:49.619+0000 I JOURNAL  [initandlisten] preallocating a journal file /data/db/journal/prealloc.0
[mongo:L16:2s] 2017-09-15T05:22:49.921+0000 I JOURNAL  [initandlisten] preallocating a journal file /data/db/journal/prealloc.1
[mongo:L17:2s] 2017-09-15T05:22:50.224+0000 I JOURNAL  [initandlisten] preallocating a journal file /data/db/journal/prealloc.2
[mongo:L18:3s] 2017-09-15T05:22:50.604+0000 I JOURNAL  [durability] Durability thread started
[mongo:L19:3s] 2017-09-15T05:22:50.604+0000 I JOURNAL  [journal writer] Journal writer thread started
[mongo:L20:3s] 2017-09-15T05:22:50.610+0000 I CONTROL  [initandlisten] 
[mongo:L21:3s] 2017-09-15T05:22:50.610+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
[mongo:L22:3s] 2017-09-15T05:22:50.610+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
[mongo:L23:3s] 2017-09-15T05:22:50.610+0000 I CONTROL  [initandlisten] 
[mongo:L24:3s] 2017-09-15T05:22:50.611+0000 I INDEX    [initandlisten] allocating new ns file /data/db/local.ns, filling with zeroes...
[mongo:L25:3s] 2017-09-15T05:22:50.657+0000 I STORAGE  [FileAllocator] allocating new datafile /data/db/local.0, filling with zeroes...
[mongo:L26:3s] 2017-09-15T05:22:50.657+0000 I STORAGE  [FileAllocator] creating directory /data/db/_tmp
[mongo:L27:3s] 2017-09-15T05:22:50.667+0000 I STORAGE  [FileAllocator] done allocating datafile /data/db/local.0, size: 16MB,  took 0.006 secs
[mongo:L28:3s] 2017-09-15T05:22:50.670+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
[ping2:L1:14s] + mongo --host mongo --eval "{ ping: 5 }"
[ping2:L2:15s] MongoDB shell version: 3.0.15
[ping2:L3:15s] connecting to: mongo:27017/test
[ping:L1:14s] + mongo --host mongo --eval "{ ping: 5 }"
[ping:L2:15s] MongoDB shell version: 3.0.15
[ping:L3:15s] connecting to: mongo:27017/test
[ping:L4:15s] 2017-09-15T05:23:04.013+0000 W NETWORK  Failed to connect to 104.239.207.44:27017, in(checking socket for error after poll), reason: errno:113 No route to host
[ping:L5:15s] 2017-09-15T05:23:04.014+0000 E QUERY    Error: couldn't connect to server mongo:27017 (104.239.207.44), connection attempt failed
[ping:L6:15s]     at connect (src/mongo/shell/mongo.js:181:14)
[ping:L7:15s]     at (connect):1:6 at src/mongo/shell/mongo.js:181
[ping:L8:15s] exception: connect failed
[ping2:L4:20s] 2017-09-15T05:23:08.633+0000 W NETWORK  Failed to connect to 198.105.244.130:27017 after 5000 milliseconds, giving up.
[ping2:L5:20s] 2017-09-15T05:23:08.636+0000 E QUERY    Error: couldn't connect to server mongo:27017 (198.105.244.130), connection attempt failed
[ping2:L6:20s]     at connect (src/mongo/shell/mongo.js:181:14)
[ping2:L7:20s]     at (connect):1:6 at src/mongo/shell/mongo.js:181
[ping2:L8:20s] exception: connect failed
2017/09/15 00:23:12 drone_step_0 : exit code 1

In one run we have a correct and two incorrect resolutions.

I am currently unable to reproduce any networking issues. Some quick notes:

Because I cannot reproduce these errors, I am afraid there is not much more help I can offer here. If you continue to experience issues I recommend triaging the code and sending a patch.

Thanks for mentioning the multiple drone_default networks issue. It seems to be what’s causing the problem!

This is what I got on my dev machine:

» docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
68dbf40ffc55        bridge              bridge              local
b73b07455a0b        drone_default       bridge              local
1c81b2218a23        drone_default       bridge              local
37dcac44213c        drone_default       bridge              local
6945cd295b52        drone_default       bridge              local
ae56fbe36dd3        drone_default       bridge              local
dd56d92dbc44        drone_default       bridge              local
0920ed42678b        drone_default       bridge              local
3ab972a3bd20        drone_default       bridge              local
96deb0e84760        drone_default       bridge              local
13b713760e0a        drone_default       bridge              local
a23a30973969        drone_default       bridge              local
09e382fbafd1        drone_default       bridge              local
94d53f49a784        drone_default       bridge              local
641e77e7dfdb        drone_default       bridge              local
a71cd05922ea        drone_default       bridge              local
084d669350b2        drone_default       bridge              local
290ee38f5bea        drone_default       bridge              local
9f2325bb70c3        droneio_default     bridge              local
3da130fb4523        host                host                local
698f8253b8ed        none                null                local

Running docker network prune helped!
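
To check whether another machine is in the same state before pruning, the duplicated network names can be counted. This is a minimal sketch assuming a POSIX shell with sort, uniq and awk; the helper name duplicate_network_names is made up for illustration. It reads one network name per line on stdin, so it can be tried without Docker.

```shell
# Read network names (one per line) on stdin and print each name that
# occurs more than once, followed by how many times it occurs.
duplicate_network_names() {
  sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}

# Example with a shortened version of the listing above.
printf '%s\n' bridge drone_default drone_default drone_default host none \
  | duplicate_network_names

# With Docker available, the real list can be piped in instead:
#   docker network ls --format '{{.Name}}' | duplicate_network_names
```

Keep in mind that docker network prune removes every network not used by at least one container, not only the drone_default ones.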