Drone Docker Runner services often fail on Windows

We have a pipeline that starts two services (mssql and postgres) in addition to the normal steps. Very often a service does not start. Sleeping is not a solution, because in these cases the service never starts at all.

I noticed that when this happens, the following error is recorded in the Windows application log:

Resolver Setup/Start failed for container drone-qyuUtNPN5z0UlKGo8fss, “error in opening name server socket listen udp 172.22.48.1:53: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.”

Things seem to improve after I restart Docker, but the problem reappears quite quickly.

Is this a known problem and is there a workaround?

I am using Windows Server 2019.

@johanvdw,

Could you please provide the details below so that we can advise accordingly:

  1. The Docker version you are using
  2. Docker logs
  3. Event Viewer log entries from just before and after this event
  4. Confirmation that you are passing the Windows version correctly, as described in the docs (Platform | Drone)

Regards,
Harness Support

Docker version:

PS C:\Windows\system32> docker version
Client: Mirantis Container Runtime
 Version:           20.10.4
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        110e091
 Built:             04/12/2021 15:53:12
 OS/Arch:           windows/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.13.13
  Git commit:       f295753ffd
  Built:            08/05/2020 19:26:41
  OS/Arch:          windows/amd64
  Experimental:     false

Docker logs for drone-runner (they report nothing at the time of the failure):

time="2021-06-02T12:26:00+02:00" level=info msg="starting the server" addr=":3000"
time="2021-06-02T12:26:00+02:00" level=info msg="successfully pinged the remote server"
time="2021-06-02T12:26:00+02:00" level=info msg="polling the remote server" arch=amd64 capacity=44 endpoint="http://ijzer:8000" kind=pipeline os=windows type=docker
PS C:\Windows\system32>

Event log items:
before:

sending event [module=libcontainerd namespace=moby container=71f69a91e471c376c4cc2fbff7ca6c107392f735241e859a95755d11fe019429 event=exit event-info={71f69a91e471c376c4cc2fbff7ca6c107392f735241e859a95755d11fe019429 init 22500 3221225473 2021-06-03 14:22:49.2993386 +0200 CEST m=+93410.089022101 false <nil>}]

error:

Resolver Setup/Start failed for container drone-VBtSn062E24X6b8P9k8T, "error in opening name server socket listen udp 172.21.32.1:53: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted."

first event after:

sending event [event=create module=libcontainerd namespace=moby container=9793571181fb9a22aded0caab0d869436867455d50ea83e54fd5da3ca6238b6e]

No other containers are running on this host except the drone runner. We are passing the Windows version correctly: the pipeline runs successfully about two thirds of the time and fails the remaining third.

I have the impression that the issue occurs mostly when using MS SQL.

Here is a simplified pipeline file:

kind: pipeline
type: docker
name: windows-tests
platform:
  os: windows
  version: "1809"
environment:
  POSTGRES_PASSWORD: easy_to_guess_password
concurrency:
   limit: 1

steps:
  - name: setup ensol testdb
    image: chrml/mssql-server-windows-express:1809
    commands:
      - "$success=$false; $count=0; do { Write-Output 'Waiting for mssql DB'; $Result = Test-NetConnection -ComputerName mssql -Port 1433; $success = $Result.TcpTestSucceeded; Start-Sleep -Seconds 2; $count++; } until($count -eq 30 -or $success); if(-not($success)){exit 1;}"
      - "sqlcmd -S mssql -U SA -P $env:SA_PASSWORD -i testfiles/mssql/db.sql"
    environment:
      ACCEPT_EULA: "Y"
      SA_PASSWORD: yourStrong_Password

services:
 - name: pgsql
   image: our_repo/postgres-windows
   environment:
     POSTGRES_PASSWORD: easy_to_guess_password
     GDAL_DATA: c:\pgsql\gdal-data
     PROJ_LIB: c:\pgsql\share\contrib\postgis-3.1\proj
 - name: mssql
   hostname: mssql
   image: chrml/mssql-server-windows-express:1809
   environment:
     ACCEPT_EULA: "Y"
     SA_PASSWORD: yourStrong_Password

When MSSQL starts successfully, we see the following in the mssql service log:


VERBOSE: Starting SQL Server
VERBOSE: Changing SA login credentials
VERBOSE: Started SQL Server.

TimeGenerated         EntryType Message                                        
-------------         --------- -------                                        
6/3/2021 3:30:52 PM Information Parallel redo is shutdown for database 'CTMS...
6/3/2021 3:30:52 PM Information Parallel redo is started for database 'CTMSd...
6/3/2021 3:30:52 PM Information Starting up database 'CTMSdatabase'.           
6/3/2021 3:30:52 PM Information Parallel redo is shutdown for database 'CTMS...
6/3/2021 3:30:52 PM Information Parallel redo is started for database 'CTMSd...
6/3/2021 3:30:52 PM Information Starting up database 'CTMSdatabase'.           
6/3/2021 3:32:42 PM Information Parallel redo is shutdown for database 'CTMS...
6/3/2021 3:32:42 PM Information Parallel redo is started for database 'CTMSd...
6/3/2021 3:32:42 PM Information Starting up database 'CTMSdatabase'.    

However, if it fails (and the error appears in the event log), we only see:

VERBOSE: Starting SQL Server
VERBOSE: Changing SA login credentials
VERBOSE: Started SQL Server.

In that case, the main script in the pipeline is not able to reach the service, and the job stops after 30 tries.
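For readers who find the PowerShell one-liner in the pipeline hard to follow, here is a minimal Python sketch of the same wait-and-retry logic. It is not the step the pipeline actually runs; the function name `wait_for_tcp_port` and the 5-second connect timeout are assumptions made for illustration, while the 30 attempts and 2-second pause mirror the values in the pipeline above.

```python
import socket
import time


def wait_for_tcp_port(host: str, port: int, attempts: int = 30, delay: float = 2.0) -> bool:
    """Try to open a TCP connection up to `attempts` times.

    Returns True on the first successful connection, False if every attempt fails.
    """
    for _ in range(attempts):
        try:
            # create_connection resolves the host name and connects; a failed
            # lookup (socket.gaierror) is an OSError too, so it is retried as well.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

In the pipeline this would correspond to `wait_for_tcp_port("mssql", 1433)`. When the service container never gets its resolver, every attempt fails and the function returns False after the last try, which matches the behaviour described above.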

From what I’ve briefly looked at, this looks like IP/port exhaustion.

Maybe try checking which ports are available when the error starts showing up? Could a firewall be blocking any ports as well?

I’m only really experienced in Linux though, so your guess is probably going to be as good as mine.
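As a rough way to do that port check: "Only one usage of each socket address" is the Windows message for an address-already-in-use bind failure, so one option is to try binding the same UDP address yourself while the error is occurring. The sketch below is only an illustration in Python, not something from Drone or Docker; the function name `udp_port_in_use` is made up, and it assumes Python is available on the Docker host.

```python
import errno
import socket


def udp_port_in_use(host: str, port: int) -> bool:
    """Return True if a UDP bind to (host, port) fails because the address is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Anything else (permission denied, address not present on this
            # host, ...) is a different problem, so surface it to the caller.
            raise
        # The bind succeeded, so nothing else held the address; the socket is
        # released again when the with-block exits.
        return False
```

On the affected host this would be called with the address from the event log entry, for example `udp_port_in_use("172.22.48.1", 53)`. Note that the address differs between the two log entries quoted in this thread, so it has to be taken from the most recent error. A True result would suggest that something else is still holding UDP port 53 on that gateway address.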