Drone server not starting, stuck after "starting the zombie build reaper"

I had a working drone installation, running in my kubernetes cluster.
After the upgrade of the cluster to 1.24 and restart of the nodes, the drone server won’t start any more.
The logging simply stops after “starting the zombie build reaper”, but the server is unreachable.
Port 80 does not respond.

{"build.limit":0,"expires":"0001-01-01T00:00:00Z","kind":"trial","level":"debug","msg":"main: license loaded","repo.limit":0,"time":"2022-09-07T07:29:59Z","user.limit":0}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: create account","time":"2022-09-07T07:29:59Z","token":""}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: updating account","time":"2022-09-07T07:29:59Z","token":""}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: account already up-to-date","time":"2022-09-07T07:29:59Z","token":""}
{"acme":false,"host":"drone.fritz.box","level":"info","msg":"starting the http server","port":":80","proto":"https","time":"2022-09-07T07:29:59Z","url":"https://drone.fritz.box"}
{"interval":"30m0s","level":"info","msg":"starting the cron scheduler","time":"2022-09-07T07:29:59Z"}
{"interval":"24h0m0s","level":"info","msg":"starting the zombie build reaper","time":"2022-09-07T07:29:59Z"}

The configured runner is also unable to reach the server:

time="2022-09-07T07:26:43Z" level=error msg="cannot ping the remote server" error="Post \"http://drone.drone.svc.cluster.local/rpc/v2/ping\": dial tcp 10.107.227.4:80: connect: connection refused"

Cluster networking in general is working fine, there are lots of other pods/services running.

Enabling trace logging did not reveal anything useful. Is there a way to find out why the server is unreachable or stuck?

Hi @justsomebody42, you can check these links to see if they help:

  1. Drone and Runner not communicating properly behind reverse proxy! - #2 by csgitharness
  2. Pipeline fails with “clone: skipped” (Docker stack) - Drone

Feel free to drop in if the issue still persists.

Hi @Shruthikini !

Thanks a lot for the hints.
Unfortunately, both seem unrelated to my issue.
The runner itself seems to start just fine, but is unable to reach the server.

time="2022-09-07T09:25:55Z" level=info msg="starting the server" addr=":3000"
time="2022-09-07T09:25:55Z" level=error msg="cannot ping the remote server" error="Post \"http://drone.drone.svc.cluster.local/rpc/v2/ping\": dial tcp 10.107.227.4:80: connect: connection refused"

And since the server UI is unreachable as well, I assume the server has problems starting, but as the logging simply stops, I have no idea what to look for…

Thanks!

@justsomebody42 right now arm builds are stuck in pending state. Are you using arm32 servers?

I’m using amd64.
Tried with both drone/drone:2.12.1 and latest.

@justsomebody42 got it! let me check for more details on this and get back to you.

Thanks!

@justsomebody42 just wanted to check: can you tell me which version you were running previously? If it was 0.8.x, there are breaking changes.

Thanks!

I’ve been running the latest tag, so I’m not 100% sure of the exact version. But given that it was working until recently and that I started using Drone less than 2 years ago, I doubt that I have ever used 0.8.x.

Happened to me too when trying to upgrade; I am currently using chart version 0.2.5.
I guessed there are breaking changes, but couldn’t find any release notes.

@justsomebody42 @natali let me check with the team on this internally and get back to you.

Thanks!

@justsomebody42 @natali are you using the helm charts? There was an issue recently that would cause the behavior you are seeing: Breaking environment · Issue #84 · drone/charts · GitHub

I’m using my own config files which basically resemble the helm charts.
The image I’m using is 2.12.1
My server was still using port 80 as can be seen in the logs ("port":":80"):

{"acme":false,"host":"drone.fritz.box","level":"info","msg":"starting the http server","port":":80","proto":"https","time":"2022-09-07T07:29:59Z","url":"https://drone.fritz.box"}

I changed the configuration and added DRONE_SERVER_PORT: "8080" to adapt to the new default value and updated the corresponding URL configuration for the runner.
Unfortunately, as expected, this only changed the port the server reports. The runner is still unable to connect, the UI is still unreachable, and the server logging simply stops after starting the zombie build reaper.

Server:

{"acme":false,"host":"drone.fritz.box","level":"info","msg":"starting the http server","port":"8080","proto":"https","time":"2022-09-08T12:09:03Z","url":"https://drone.fritz.box"}
{"interval":"30m0s","level":"info","msg":"starting the cron scheduler","time":"2022-09-08T12:09:03Z"}
{"interval":"24h0m0s","level":"info","msg":"starting the zombie build reaper","time":"2022-09-08T12:09:03Z"}

Runner:

time="2022-09-08T12:11:45Z" level=error msg="cannot ping the remote server" error="Post \"http://drone.drone.svc.cluster.local:8080/rpc/v2/ping\": dial tcp 10.107.227.4:8080: i/o timeout"

Yes, I am using helm.
Tried this solution mentioned in the issue above and it helped.

service:
  type: ClusterIP
  port: 80

Thanks @Shruthikini !

@natali Glad to hear it helped :raised_hands: Feel free to reach out for any further queries. I am more than happy to help you out :smiley:

Is there anything else I can check?
I opened a shell in the container and checked for listening ports. None are open, although the log states that the HTTP server has been started:

/ # netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
/ #

The cron jobs are running though:

{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: updating account","time":"2022-09-08T16:27:18Z","token":""}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: account already up-to-date","time":"2022-09-08T16:27:18Z","token":""}
{"acme":false,"host":"drone...","level":"info","msg":"starting the http server","port":"8080","proto":"https","time":"2022-09-08T16:27:18Z","url":"https://drone..."}
{"interval":"24h0m0s","level":"info","msg":"starting the zombie build reaper","time":"2022-09-08T16:27:18Z"}
{"interval":"30m0s","level":"info","msg":"starting the cron scheduler","time":"2022-09-08T16:27:18Z"}
{"level":"debug","msg":"cron: begin process pending jobs","time":"2022-09-08T16:57:18Z"}
{"level":"debug","msg":"cron: found 0 pending jobs","time":"2022-09-08T16:57:18Z"}
{"level":"debug","msg":"cron: finished processing jobs","time":"2022-09-08T16:57:18Z"}
{"level":"debug","msg":"cron: begin process pending jobs","time":"2022-09-08T17:27:18Z"}

@justsomebody42 one of my teammates will get back to you on this shortly.

Thanks!

I would like to help, but there are two accounts responding in this thread, so I don’t know if there are two different users with different problems or not.

Since the runner can’t reach drone.drone.svc.cluster.local:8080, my guess is your drone server service is listening on port 80 as you mentioned.

Can you see if drone.drone.svc.cluster.local:80 is reachable within the cluster?

Hi @jimsheldon, thanks for joining the conversation!

I tried to run the server on both port 80 and 8080.
As you can see in the current logs, the server says that it’s starting on port 8080:

{"acme":false,"host":"drone...","level":"info","msg":"starting the http server","port":"8080","proto":"https","time":"2022-09-08T16:27:18Z","url":"https://drone..."}

I also already checked the open ports in the container itself with netstat and didn’t have any luck.
So I’m currently assuming that something is preventing the server from fully starting. But as the server logging stops, I cannot tell what I need to debug or change.


I’m completely confused.
I tried running the image 2.13.0 with the helm chart and the server started just fine.
When I shell into the container I can see the server running:

/ # netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 :::80                   :::*                    LISTEN      1/drone-server

Logs:

{"build.limit":0,"expires":"0001-01-01T00:00:00Z","kind":"trial","level":"debug","msg":"main: license loaded","repo.limit":0,"time":"2022-09-23T08:57:59Z","user.limit":0}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: create account","time":"2022-09-23T08:57:59Z","token":""}
{"admin":true,"level":"info","login":"...","machine":false,"msg":"bootstrap: account created","time":"2022-09-23T08:57:59Z","token":"..."}
{"acme":false,"host":"drone...","level":"info","msg":"starting the http server","port":":80","proto":"https","time":"2022-09-23T08:57:59Z","url":"https://drone..."}
{"interval":"24h0m0s","level":"info","msg":"starting the zombie build reaper","time":"2022-09-23T08:57:59Z"}
{"interval":"30m0s","level":"info","msg":"starting the cron scheduler","time":"2022-09-23T08:57:59Z"}
{"arch":"","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"","time":"2022-09-23T08:58:26Z","type":"kubernetes","variant":""}

The default still seems to be port 80, but anyway.

When I run the image from the Deployment.yaml I created, the logging looks the same, but the server does not seem to start listening:

/ # netstat -tupln
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
/ #

Logs:

{"build.limit":0,"expires":"0001-01-01T00:00:00Z","kind":"trial","level":"debug","msg":"main: license loaded","repo.limit":0,"time":"2022-09-23T08:55:54Z","user.limit":0}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: create account","time":"2022-09-23T08:55:54Z","token":""}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: updating account","time":"2022-09-23T08:55:54Z","token":""}
{"admin":true,"level":"debug","login":"...","machine":false,"msg":"bootstrap: account already up-to-date","time":"2022-09-23T08:55:54Z","token":""}
{"interval":"30m0s","level":"info","msg":"starting the cron scheduler","time":"2022-09-23T08:55:54Z"}
{"interval":"24h0m0s","level":"info","msg":"starting the zombie build reaper","time":"2022-09-23T08:55:54Z"}
{"acme":false,"host":"drone...","level":"info","msg":"starting the http server","port":"8080","proto":"http","time":"2022-09-23T08:55:54Z","url":"http://drone..."}

The loglevel is already set to trace.
Is there anything I can do to find out why the server won’t start listening?
Nothing has changed in my configuration since the last time it was working and I have no idea where to look next…

I gave up.
The latest error message I was able to get from the server was:

{"error":"listen tcp: address 80: missing port in address","level":"fatal","msg":"program terminated","time":"2022-09-26T07:17:22Z"}

I tried various combinations of environment variables found in different posts without success.
The exact same configuration works when used in the helm values file, but the server simply doesn’t start listening when deployed with manual config files.

I rewrote my complete config and use the helm deployment now, which is working.
The issue may be closed, but I’m still curious why the server wouldn’t start with the exact same configuration…
If anybody encounters a similar situation or has an idea what else could be tested, I’d be happy to continue the investigation :slight_smile:
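On the open question: the fatal message `listen tcp: address 80: missing port in address` is the error Go's `net.Listen` returns when the listen address contains no colon, and Drone is written in Go. Whether that is what happened in the manual deployment is an assumption, not something confirmed in this thread, but it would fit the logs: the working helm run reports `"port":":80"`, while the runs that never listen report `"port":"8080"` without the colon. A small Go sketch of the difference; `tryListen` is a hypothetical helper, not Drone code, and `:0` (any free port) is used so the example runs without privileges.

```go
package main

import (
	"fmt"
	"net"
)

// tryListen attempts to bind a TCP listener on addr and closes it again.
// It returns nil on success and the bind error otherwise.
func tryListen(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	return ln.Close()
}

func main() {
	// "8080" has no colon, so Go cannot split it into host and port.
	// ":0" has an empty host and lets the kernel pick a free port.
	for _, addr := range []string{"8080", ":0"} {
		if err := tryListen(addr); err != nil {
			fmt.Printf("%q: %v\n", addr, err)
			continue
		}
		fmt.Printf("%q: ok\n", addr)
	}
}
```

The first call fails with the same "missing port in address" text as the fatal log line, the second succeeds. If this is the cause, a port value with a leading colon (for example `":8080"`) in the manual config would be worth trying; that is untested here.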