This section will help triage why builds are stuck in a pending state.
In our experience this issue is always related to configuration, so to triage it we need to see configuration details and logs. Please take the following actions and provide the following data:
- Provide your server configuration
- Provide your agent configuration
- Enable DRONE_LOGS_TRACE=true on the server
- Enable DRONE_LOGS_TRACE=true on the agent
- Provide the agent logs with trace enabled
- Provide the server logs with trace enabled
- Provide the Yaml configuration file
- Provide the build details for your pending builds via this API endpoint.
Verify Runner Installation
Please make sure you have installed a runner that can process your pipelines. If you have only installed the server, and have not installed any runners to execute your pipelines, they will sit in a pending state.
See https://docs.drone.io/server/overview/
Did you use Helm?
If you used Helm to install Drone please make sure you used our official charts at drone/charts. The charts in Helm stable are not official charts and will not produce a working installation.
Check Server Settings
If you set DRONE_AGENTS_DISABLED or DRONE_KUBERNETES_ENABLED you should remove these settings. They are legacy settings and will prevent the server from assigning workloads to the runner.
What does a successful connection look like?
Before we discuss troubleshooting connectivity issues, we should first examine what a successful connection looks like. When debug mode is enabled on the server, and when agents successfully connect, you will see an entry in your server logs that looks like this:
{
"arch": "amd64",
"kernel": "",
"level": "debug",
"msg": "manager: request queue item",
"os": "linux",
"time": "2019-04-28T16:00:47-07:00",
"variant": ""
}
The "manager: request queue item" entry is proof that the agent is successfully connecting to the server. If you do not see these corresponding log entries, you can be certain that the agents are failing to connect with the server.
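If you have saved the server log to a file, the check above can be scripted. The following is a minimal sketch, assuming the server writes one JSON object per line (the entry above is pretty-printed for readability). The function name count_queue_requests and the sample lines are hypothetical and not part of Drone.

```python
import json


def count_queue_requests(lines):
    """Count log entries whose msg is 'manager: request queue item'."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if isinstance(entry, dict) and entry.get("msg") == "manager: request queue item":
            count += 1
    return count


# Made-up sample input: one matching entry, one unrelated entry, one non-JSON line.
sample = [
    '{"arch": "amd64", "kernel": "", "level": "debug", "msg": "manager: request queue item", "os": "linux"}',
    '{"level": "debug", "msg": "some other entry"}',
    "not json",
]
print(count_queue_requests(sample))  # prints 1
```

A count of zero over a period in which the agent was running suggests the agent never reached the server.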
Networking Problems
The most common root cause is network connectivity issues. The best way to triage connectivity issues is to pass DRONE_LOGS_TRACE=true to the agent. This will provide detailed logs for http attempts made to the server.
If the agent cannot establish a connection to the server you will see the below agent logs. Please note that this indicates a problem with either your Agent configuration, your Server configuration or your Network configuration (DNS, etc). This does not indicate a bug with Drone.
2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused
2019/04/28 16:05:57 [DEBUG] POST http://localhost:8080/rpc/v1/request: retrying in 2s (29 left)
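A "connection refused" error like the one above can be reproduced outside of Drone by opening a plain TCP connection from the agent's host to the server address. This is a rough sketch, not a Drone tool: the helper name can_connect is hypothetical, and the URL is simply the one shown in the sample log. It only checks that something accepts connections on that host and port, not that it is a Drone server.

```python
import socket
from urllib.parse import urlsplit


def can_connect(server_url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(server_url)
    if parts.hostname is None:
        raise ValueError("URL must include a scheme and host, e.g. http://localhost:8080")
    # Fall back to the default port for the scheme when none is given.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    print(can_connect("http://localhost:8080"))
```

Run it from the same machine or container as the agent, since DNS and routing can differ between hosts.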
Invalid Endpoint, Proxy Problems
Another common root cause is when you specify an invalid endpoint or when a reverse proxy is incorrectly routing the request. This will manifest as an error that includes HTML in the error message, for example:
2019/04/28 16:12:03 [DEBUG] POST https://drone.company.com/rpc/v1/request
{
"arch": "amd64",
"error": "\u003c!DOCTYPE html\u003e\n\u003chtml
You should also check to ensure you provide the correct server address, including the scheme (http vs https). If you are using the http address, and your reverse proxy automatically redirects to https, it can result in connectivity issues.
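When scanning many agent log entries, the HTML symptom can be detected mechanically. The sketch below assumes the agent log entry is JSON with an "error" field, as in the truncated example above. The helper name looks_like_html and the sample entry are made up for illustration and are not a real Drone log.

```python
import json


def looks_like_html(error_text):
    """Return True if an error message appears to be an HTML page."""
    head = error_text.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Made-up sample entry; \u003c and \u003e are JSON escapes for < and >.
raw = r'{"arch": "amd64", "error": "\u003c!DOCTYPE html\u003e\n\u003chtml\u003e"}'
entry = json.loads(raw)
print(looks_like_html(entry["error"]))      # prints True
print(looks_like_html("connection refused"))  # prints False
```

An HTML body in the error usually means the request reached a web page (a proxy error page, a login page, or a redirect target) rather than the Drone RPC endpoint.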
Protected Mode
Did you enable protected mode? If so you need to make sure you sign your yaml configuration file. Otherwise, your pipeline will sit in a pending state until it has been manually approved. If you change your yaml file, also be sure to re-generate the signature.
Incorrect Secret
Unfortunately a shared secret mismatch between the agent and server is the most difficult error to debug because it does not produce a useful error message. You should take care to ensure you are passing the correct secret to both the server and agent. Make sure the characters match exactly. If you are using cat to read the secret from a file, be careful, since this has caused problems (with newlines, etc) that can be difficult to troubleshoot.
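Because a stray newline is invisible when you eyeball a secret, it can help to inspect the raw value. The following sketch is only an illustration: describe_secret and secrets_match are hypothetical helpers, and how you obtain the two values (environment, file, orchestration tool) depends on your installation.

```python
def describe_secret(raw):
    """Return a list of suspicious properties of a shared secret string."""
    issues = []
    if raw == "":
        issues.append("empty")
    if raw != raw.strip():
        issues.append("leading or trailing whitespace")
    if "\n" in raw or "\r" in raw:
        issues.append("contains a newline")
    return issues


def secrets_match(server_secret, agent_secret):
    """Exact, character-for-character comparison."""
    return server_secret == agent_secret


# A secret read from a file often carries a trailing newline.
print(describe_secret("abc\n"))        # prints ['leading or trailing whitespace', 'contains a newline']
print(secrets_match("abc", "abc\n"))   # prints False
```

Printing repr() of each value is another quick way to make whitespace visible before comparing.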
Undefined Platform when using Arm or Arm64
Drone assumes all pipelines are amd64 unless otherwise specified. If you are using Drone with arm or arm64 agents please be sure to specify the architecture to ensure builds are routed to the correct agent.
kind: pipeline
name: default
+platform:
+  os: linux
+  arch: arm64
steps: ...
Possible architecture values are arm, arm64, and amd64.
Undefined Kernel when using Windows Docker Runner
If you are using the Docker runner on Windows please be sure to specify the kernel version to ensure builds are routed to the correct agent. This section only applies to the Docker runner. Do not specify a kernel version when using other runner types (kubernetes, exec, etc).
kind: pipeline
name: default
+platform:
+  os: windows
+  version: 1903
steps: ...
If you are unsure which kernel version your runner is using, you can check your server logs with debug mode enabled. When the runner pings the server it includes the kernel version in the payload.
{
"arch": "amd64",
"kernel": "1903",
"level": "debug",
"msg": "manager: request queue item",
"os": "windows",
"time": "2019-04-28T16:00:47-07:00",
"variant": ""
}
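To see at a glance which platforms your runners report, the os, arch, and kernel fields of these entries can be tallied from a saved server log. This is a sketch under the assumption that the log contains one JSON object per line. The function name runner_platforms and the sample lines are hypothetical.

```python
import json
from collections import Counter


def runner_platforms(lines):
    """Tally (os, arch, kernel) tuples from 'manager: request queue item' entries."""
    seen = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict) or entry.get("msg") != "manager: request queue item":
            continue
        key = (entry.get("os", ""), entry.get("arch", ""), entry.get("kernel", ""))
        seen[key] += 1
    return seen


# Made-up sample input modeled on the entries shown in this thread.
sample = [
    '{"arch": "amd64", "kernel": "1903", "msg": "manager: request queue item", "os": "windows"}',
    '{"arch": "amd64", "kernel": "1903", "msg": "manager: request queue item", "os": "windows"}',
    '{"arch": "amd64", "kernel": "", "msg": "manager: request queue item", "os": "linux"}',
]
for platform, count in runner_platforms(sample).items():
    print(platform, count)
```

Compare the reported tuples with the platform section of the pending pipeline. A pipeline whose platform matches none of them has no runner to pick it up.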
Invalid kind or type
Another common root cause is when your kind or type are invalid (due to a simple spelling mistake, etc). When you configure an invalid kind or type, the pipeline will sit in queue waiting for a runner with a matching kind and type to come online (which of course will never happen).
-kind: pipline
+kind: pipeline
type: docker
name: default
Incorrect type
The kind and type determine which runner can execute your pipeline. For example, the docker runner can only execute pipelines where type is docker; the kubernetes runner can only execute pipelines where type is kubernetes, etc. Please ensure the type in your yaml matches the value expected by your installed runners.
kind: pipeline
type: docker
name: default
steps: ...
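Spelling mistakes in kind or type are easy to miss by eye. The sketch below shows one way to flag them: it compares a value against a list you supply and suggests the closest entry. The helper check_field is hypothetical, and the list of expected values is an assumption you should replace with the runner types you have actually installed.

```python
import difflib


def check_field(value, expected):
    """Return None if value is expected, otherwise a message with a suggestion."""
    if value in expected:
        return None
    close = difflib.get_close_matches(value, expected, n=1)
    hint = f"; did you mean {close[0]!r}?" if close else ""
    return f"unexpected value {value!r}{hint}"


# Assumed values for illustration: adjust to your own installation.
installed_runner_types = ["docker", "kubernetes"]

print(check_field("pipline", ["pipeline"]))           # prints unexpected value 'pipline'; did you mean 'pipeline'?
print(check_field("docker", installed_runner_types))  # prints None
```

The comparison is deliberately exact, since a pipeline with a misspelled value waits for a runner that does not exist.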
Mismatched Labels and Nodes
When you set the runner labels the runner will only process pipelines where the node parameters are an exact match. All labels must match all node parameters and vice versa. If your pipeline’s node parameters are incorrect and do not exactly match any runners, the pipeline will sit in a pending state waiting for an available runner that is an exact match.
We recommend you double check your pipeline to ensure the node section is an exact match with the desired runner. Below is a sample yaml and a sample runner request to the server. You will note the node section is not an exact match to the runner labels and therefore would not be picked up by the runner.
{
"arch": "amd64",
"kernel": "",
"level": "debug",
"msg": "manager: request queue item",
"os": "linux",
"time": "2019-04-28T16:00:47-07:00",
"labels": {
"foo": "bar",
"baz": "qux"
}
}
node:
  foo: bar
  # missing baz: qux
Please note that labels and node selection is an advanced feature. Typical installations will not need to use this feature.
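The exact-match rule described above can be illustrated with a plain dictionary comparison. This is a sketch of the rule as stated in this thread, not Drone's actual implementation. The function names are hypothetical.

```python
def labels_match(runner_labels, pipeline_node):
    """True only when both mappings contain exactly the same keys and values."""
    return dict(runner_labels) == dict(pipeline_node)


def explain_mismatch(runner_labels, pipeline_node):
    """Return (entries only the runner has, entries only the pipeline has)."""
    only_runner = {k: v for k, v in runner_labels.items() if pipeline_node.get(k) != v}
    only_node = {k: v for k, v in pipeline_node.items() if runner_labels.get(k) != v}
    return only_runner, only_node


# Values taken from the sample above.
runner_labels = {"foo": "bar", "baz": "qux"}
pipeline_node = {"foo": "bar"}

print(labels_match(runner_labels, pipeline_node))      # prints False
print(explain_mismatch(runner_labels, pipeline_node))  # prints ({'baz': 'qux'}, {})
```

A partial overlap is not enough: the runner in the sample carries an extra label, so the pipeline is not picked up even though foo matches.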
Beware of False Positives
When you enable trace logs it is easy to misinterpret the results. The Agent uses long polling to request builds from the server queue. The agent connects to the server for up to 30 seconds. If the agent does not receive a build from the queue after 30 seconds, it terminates the connection and then reconnects. The connection is terminated after 30 seconds to prevent timeouts (from reverse proxies, load balancers, etc). It is therefore completely normal to see 524 status codes and context deadline exceeded errors in the trace logs.
The following trace log entries are therefore completely normal:
{
"arch": "amd64",
"level": "debug",
"machine": "bradleys-mbp.lan",
"msg": "runner: polling queue",
"os": "linux",
"time": "2019-04-28T16:22:16-07:00"
}
2019/04/28 16:22:16 [DEBUG] http: no content returned: re-connect and re-try
2019/04/28 16:22:46 [DEBUG] http: no content returned: re-connect and re-try
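When reading a long trace log it can help to filter out the expected long-polling noise first. The following sketch treats lines containing the phrases mentioned in this section as benign and keeps the remaining [ERR] lines. The marker list and the function names are assumptions for illustration, so extend the list to suit your own logs.

```python
# Phrases this section describes as normal long-polling behavior.
BENIGN_MARKERS = ("no content returned", "context deadline exceeded")


def is_benign(line):
    """True if the line contains one of the expected long-polling phrases."""
    return any(marker in line for marker in BENIGN_MARKERS)


def real_errors(lines):
    """Keep [ERR] lines that are not explained by long polling."""
    return [line for line in lines if "[ERR]" in line and not is_benign(line)]


# Sample lines copied from the logs shown in this thread.
sample = [
    "2019/04/28 16:22:16 [DEBUG] http: no content returned: re-connect and re-try",
    "2019/04/28 16:05:57 [ERR] POST http://localhost:8080/rpc/v1/request request failed: "
    "Post http://localhost:8080/rpc/v1/request: dial tcp [::1]:8080: connect: connection refused",
]
for line in real_errors(sample):
    print(line)
```

With the sample input only the "connection refused" line is printed, which is the one worth investigating.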
Check the runner architecture
We have seen issues where Docker for Mac (for whatever reason) pulls the arm image instead of the amd64 image. As a result the runner polls the server for arm pipelines, and amd64 pipelines are never picked up. If this happens you should use the architecture-specific docker image.
Still having issues?
We are happy to help you troubleshoot issues. However, before we can help, you will need to gather and provide the information below. We will not provide assistance until all requested information is provided.
- Provide your server configuration
- Provide your runner configuration
- Provide your full server logs with trace logging enabled
- Provide your full runner logs with trace logging enabled
a. Enable runner variable DRONE_RPC_DUMP_HTTP=true
b. Enable runner variable DRONE_RPC_DUMP_HTTP_BODY=true
- Provide your yaml configuration file
- Confirm you have checked all common issues described in this thread and quickly summarize how you ruled each item out.
- Provide the build details for your pending builds via this API endpoint.
We request this information up-front to streamline the support process. The alternative would be a prolonged back and forth of questions and answers, after which you end up providing us everything in the above list anyway.
We also request that you create a new Discourse thread when providing the above information. Please do not post this information to Gitter or our chatrooms.