No jobs taken from the queue anymore

I had a working drone server running in docker on NixOS. The configuration of the server is:

  systemd.services.pubdrone = {                                              
          wantedBy = [ "multi-user.target" ];                                
          after = [ "network.target" ];                                      
          description = "Drone.io CD-Server for public projects";            
          serviceConfig = {                                                  
                  Type = "oneshot";                                          
                  RemainAfterExit = "yes";                                   
                  ExecStart = ''${pkgs.docker}/bin/docker run \              
                --volume=/var/run/docker.sock:/var/run/docker.sock \         
                --volume=/var/lib/drone:/data \                              
                --env=DRONE_GIT_ALWAYS_AUTH=false \                          
                --env=DRONE_GITLAB_SERVER=https://gitlab.com \               
                --env=DRONE_GITLAB_CLIENT_ID=<hidden> \
                --env=DRONE_GITLAB_CLIENT_SECRET=<hidden> \
                --env=DRONE_SERVER_HOST=pub.drone.amessage.eu \              
                --env=DRONE_SERVER_PROTO=https \                             
                --env=DRONE_TLS_AUTOCERT=true \                              
                --env=DRONE_USER_CREATE=username:<hidden>,machine:false,admin:true \
                --env=DRONE_DATABASE_SECRET=<hidden> \
                --env=DRONE_RPC_SECRET=<hidden> \                    
                --publish=8081:80 \                                          
                --restart=always \                                           
                --detach=true \                                              
                --name=pubdrone \                                            
                drone/drone:1'';                                             
                  ExecStop = ''${pkgs.docker}/bin/docker stop pubdrone'';    
                  ExecStopPost = ''${pkgs.docker}/bin/docker rm -f pubdrone'';
          };                                                                 
  };

The runner (on a different host is):

  systemd.services.pubdronerun = {
          wantedBy = [ "multi-user.target" ];
          after = [ "network.target" ];
          description = "Drone.io Runner";
          serviceConfig = {
                  Type = "oneshot";
                  RemainAfterExit = "yes";
                  ExecStart = ''${pkgs.docker}/bin/docker run \
                --volume=/var/run/docker.sock:/var/run/docker.sock \
                --volume=/var/lib/drone:/data \
                --env=DRONE_RPC_HOST=pub.drone.amessage.eu \
                --env=DRONE_RPC_PROTO=https \
                --env=DRONE_RPC_SECRET=<hidden> \
                --env=DRONE_RUNNER_CAPACITY=2 \
                --env=DRONE_RUNNER_NAME=nahe.amessage.eu \
                --restart=always \
                --detach=true \
                --name=pubdronerun \
                drone/drone-runner-docker:1'';
                  ExecStop = ''${pkgs.docker}/bin/docker stop pubdronerun'';
                  ExecStopPost = ''${pkgs.docker}/bin/docker rm -f pubdronerun'';
          };
  };

In the log file of the runner I get the following:

time=“2020-03-26T11:29:31Z” level=info msg=“starting the server” addr=":3000"
time=“2020-03-26T11:29:31Z” level=info msg=“successfully pinged the remote server”
time=“2020-03-26T11:29:31Z” level=info msg=“polling the remote server” arch=amd64 capacity=2 endpoint=“https://pub.drone.amessage.eu” kind=pipeline os=linux type=docker

On the server I get nothing. Only the following if I restart the container:

{“level”:“info”,“msg”:“main: internal scheduler enabled”,“time”:“2020-03-26T15:48:27Z”}
{“acme”:true,“host”:“pub.drone.amessage.eu”,“level”:“info”,“msg”:“starting the http server”,“port”:":443",“proto”:“https”,“time”:“2020-03-26T15:48:27Z”,“url”:“https://pub.drone.amessage.eu”}
{“interval”:“30m0s”,“level”:“info”,“msg”:“starting the cron scheduler”,“time”:“2020-03-26T15:48:27Z”}

Any idea on how I could further debug this problem? Any obvious configuration error?

In the job queue there are 4 builds to be done and they don’t get executed for 14 days now.

Okay, this is the request from the runner to the server:

POST /rpc/v2/stage HTTP/1.0
Host: pub.drone.amessage.eu
X-Real-IP: 178.27.181.13
X-Forwarded-For: 178.27.181.13
X-Forwarded-Proto: https
X-Forwarded-Host: pub.drone.amessage.eu
X-Forwarded-Server: pub.drone.amessage.eu
Connection: close
Content-Length: 89
x-drone-token: <hidden>
user-agent: Go-http-client/2.0

{"kind":"pipeline","type":"docker","os":"linux","arch":"amd64","variant":"","kernel":""}

The server responds with:

HTTP/1.0 204 No Content
Cache-Control: no-cache, no-store, must-revalidate, private, max-age=0
Expires: Thu, 01 Jan 1970 00:00:00 UTC
Pragma: no-cache
X-Accel-Expires: 0
Date: Thu, 26 Mar 2020 16:22:47 GMT

Why isn’t the server returning the pending builds?

Update: After canceling the 4 pending builds and restarting them afterwards 3 of them got built while one keeps staying in the pending state. Three actually corresponds the the sum of the configured capacity of the runners. So it seems after running one job, the runners don’t get a new one.

Thanks in advance
Matthias

see the following thread which discusses all known reasons a build can be stuck in pending:

Thanks for the pointer. That looks promising and at least provides me with new thinks I can check.

Thank you!

I didn’t have the problem in the last weeks and now it seems back. I had created a new repository and imported it to Drone. The first build failed due to outside reasons. After (hopefully) fixing them, I tried to rerun the build by git commit --amend --reset-author ; git push --force.

Server Log:

{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:00:03Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:00:13Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:00:22Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:00:22Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:00:32Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:00:33Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:00:43Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:00:53Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:01:02Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:01:03Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:01:12Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:01:13Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:01:23Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:01:33Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:01:42Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:01:43Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:01:52Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:01:53Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:02:03Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:02:13Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:02:22Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:02:23Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:02:32Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:02:33Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:02:43Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:02:53Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:03:02Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:03:03Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:03:12Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:03:13Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T16:03:23Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T16:03:33Z","type":"docker","variant":""}

Log of one of the two runners:

time="2020-07-13T16:00:22Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T16:01:02Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T16:01:42Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T16:02:22Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T16:03:02Z" level=trace msg="http: no content returned: re-connect and re-try"

What’s the meaning of “manager: context canceled” in the server’s log?

And that’s the result of the REST endpoint:

{
    "id": 303,
    "repo_id": 62,
    "trigger": "mawis",
    "number": 2,
    "status": "pending",
    "event": "push",
    "action": "",
    "link": "https://git.amessage.eu/mawis/web-forms/-/commit/328d9606416796aa7fe9acaf3621e1081fe7c628",
    "timestamp": 0,
    "message": "Try to build with drone.\n",
    "before": "",
    "after": "328d9606416796aa7fe9acaf3621e1081fe7c628",
    "ref": "refs/heads/drone-build",
    "source_repo": "",
    "source": "drone-build",
    "target": "drone-build",
    "author_login": "mawis",
    "author_name": "Matthias Wimmer",
    "author_email": "m@tthias.eu",
    "author_avatar": "https://secure.gravatar.com/avatar/b1f2ad378f4eccf0df09d80d2f305e96?s=80&d=identicon",
    "sender": "mawis",
    "started": 0,
    "finished": 0,
    "created": 1594655182,
    "updated": 1594655182,
    "version": 1,
    "stages": [
        {
            "id": 303,
            "repo_id": 62,
            "build_id": 303,
            "number": 1,
            "name": "default",
            "kind": "pipeline",
            "type": "docker",
            "status": "pending",
            "errignore": false,
            "exit_code": 0,
            "machine": "nahe.amessage.eu",
            "os": "linux",
            "arch": "amd64",
            "started": 0,
            "stopped": 0,
            "created": 1594655182,
            "updated": 1594655183,
            "version": 2,
            "on_success": true,
            "on_failure": false
        }
    ]
}

The .drone.yml is:

kind: pipeline
name: default

steps:
- name: test
  image: rust:1.44.1
  commands:
  - cargo test
- name: lint
  image: rust:1.44.1
  commands:
  - rustup default stable
  - rustup component add clippy-preview
  - rustup component add rustfmt
  - cargo fmt -- --check
  - cargo clippy
- name: build
  image: rust:1.44.1
  commands:
  - cargo build --release

Configuration of the server is:

      systemd.services.drone = {
              wantedBy = [ "multi-user.target" ];
              after = [ "network.target" ];
              description = "Drone.io CD-Server";
              serviceConfig = {
                      Type = "oneshot";
                      RemainAfterExit = "yes";
                      ExecStart = ''${pkgs.docker}/bin/docker run \
                    --volume=/var/run/docker.sock:/var/run/docker.sock \
                    --volume=/var/lib/drone:/data \
                    --env=DRONE_GIT_ALWAYS_AUTH=false \
                    --env=DRONE_GITLAB_SERVER=https://git.amessage.eu \
                    --env=DRONE_GITLAB_CLIENT_ID=xxxxxxxx \
                    --env=DRONE_GITLAB_CLIENT_SECRET=xxxxxxxx \
                    --env=DRONE_SERVER_HOST=drone.amessage.eu \
                    --env=DRONE_SERVER_PROTO=https \
                    --env=DRONE_USER_CREATE=username:mawis,machine:false,admin:true \
                    --env=DRONE_DATABASE_SECRET=xxxxxxxx \
                    --env=DRONE_RPC_SECRET=xxxxxxxx \
                    --env=DRONE_LOGS_TRACE=true \
                    --publish=8080:80 \
                    --restart=always \
                    --detach=true \
                    --name=drone \
                    drone/drone:1'';
                      ExecStop = ''${pkgs.docker}/bin/docker stop drone'';
                      ExecStopPost = ''${pkgs.docker}/bin/docker rm -f drone'';
              };
      };

Configuration of the runner is:

      systemd.services.dronerun = {
              wantedBy = [ "multi-user.target" ];
              after = [ "network.target" ];
              description = "Drone.io Runner";
              serviceConfig = {
                      Type = "oneshot";
                      RemainAfterExit = "yes";
                      ExecStart = ''${pkgs.docker}/bin/docker run \
                    --volume=/var/run/docker.sock:/var/run/docker.sock \
                    --volume=/var/lib/drone:/data \
                    --env=DRONE_RPC_HOST=drone.amessage.eu \
                    --env=DRONE_RPC_PROTO=https \
                    --env=DRONE_RPC_SECRET=xxxxxxxx \
                    --env=DRONE_RUNNER_CAPACITY=1 \
                    --env=DRONE_RUNNER_NAME=nahe.amessage.eu \
                    --env=DRONE_LOGS_TRACE=true \
                    --restart=always \
                    --detach=true \
                    --name=dronerun \
                    drone/drone-runner-docker:1'';
                      ExecStop = ''${pkgs.docker}/bin/docker stop dronerun'';
                      ExecStopPost = ''${pkgs.docker}/bin/docker rm -f dronerun'';
              };
      };

Problem is that the job is stuck in the queue in pending state and doesn’t get taken by the runner.

I canceled the build, that got stuck in PENDING, made another change to the repository and committed it. This commit was built by drone immediately. – While this solved my problem for the moment, I still feel there is something wrong, and the fact that for one commit it hangs while for the next it doesn’t makes me even more unconfident of the setup.

If builds are stuck in a pending state it generally indicates a configuration issue. I believe all known root causes are discussed in this thread:

What’s the meaning of “manager: context canceled” in the server’s log?

the runner performs long polling with a 30 second timeout per request. These debug logs allow us to see that the runner is connecting to the server, and then the request is canceled after 30 seconds. These debug logs are expected.

Thanks for the link, but I (think I) already checked all reasons that are given there and everything seemed to be fine.

  • I didn’t use helm
  • Neither DRONE_AGENTS_DISABLED nor DRONE_KUBERNETES_ENABLED is set
  • It looks like a successful connection (there are two additional keys in the map, else it’s the same)
  • Networking problems don’t seem to exist, as the request arrives at the server
  • I checked the request/response and there is no HTML in it. The response is just HTTP code 204 with empty body.
  • Protected mode is disabled
  • I checked the secret for equality several times
  • Platform isn’t defined, but I don’t use arm – everything is amd64
  • kind is pipeline, no type defined
  • neither node nor label parameters are used
  • everything runs on a Linux system (NixOS to be more specific)

can you post the server logs and the logs for both runners for the time period when the build was first created and became stuck? I see the build was assigned to a machine (based on the API results you posted) which would in fact imply the build was picked up by the runner (the machine variable can only be set by the runner that receives the build). There would be numerous debug entries in the the runner logs for this (for example, here and here). So it seems that the relevant runner logs still need to be posted to this thread.

Second runner wasn’t up at that time (it wasn’t up for days now).

Logs of the runner are gone, as I restarted the docker container with the runner and removed the old one.

I’ll post them when it happens the next time. – For the time being, I think we cannot do more then and just have to wait for it to happen the next time.

Thanks for your support!

@ashwilliams1 It happend again already.

The sections of the log at the time it happened are below. For reference I also put the complete logs since I restarted the containers earlier today on the web:
This is the server’s log: https://nym.at/drone.log – and this is the (only) runner’s log: https://nym.at/dronerun.log

I did push around 19:24 UTC is the relevant entries on the runner only show the successfully run job #9:

time="2020-07-13T19:24:31Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:24:41Z" level=debug msg="stage received" stage.id=310 stage.name=default stage.number=1 thread=1
time="2020-07-13T19:24:41Z" level=debug msg="stage accepted" stage.id=310 stage.name=default stage.number=1 thread=1
time="2020-07-13T19:24:41Z" level=debug msg="stage details fetched" build.id=310 build.number=9 repo.id=62 repo.name=web-forms repo.namespace=mawis stage.id=310 stage.name=default stage.number=1 thread=1
time="2020-07-13T19:24:41Z" level=trace msg="secret: database: found matching secret" kind=secret name=DOCKER_AUTH thread=1
time="2020-07-13T19:24:42Z" level=debug msg="updated stage to running" build.id=310 build.number=9 repo.id=62 repo.name=web-forms repo.namespace=mawis stage.id=310 stage.name=default stage.number=1 thread=1
time="2020-07-13T19:25:11Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:25:51Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:26:31Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:27:11Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:27:51Z" level=trace msg="http: no content returned: re-connect and re-try"
time="2020-07-13T19:28:31Z" level=trace msg="http: no content returned: re-connect and re-try"

On the server at that time the log entries are:

{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T19:24:01Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T19:24:31Z","type":"docker","variant":""}
{"commit":"e9cce5a8fade15e6a37062dfbf9ec649e6369121","event":"push","level":"debug","msg":"webhook parsed","name":"web-forms","namespace":"mawis","time":"2020-07-13T19:24:34Z"}
{"commit":"e9cce5a8fade15e6a37062dfbf9ec649e6369121","event":"push","level":"debug","msg":"trigger: received","ref":"refs/heads/docker-image","repo":"mawis/web-forms","time":"2020-07-13T19:24:34Z"}
{"fields.time":"2020-07-13T19:24:34Z","latency":452590093,"level":"debug","method":"POST","msg":"","remote":"172.17.0.1:41342","request":"/hook","request-id":"1ee3Y2g6MTRk0iuZC7neEg3j32o","time":"2020-07-13T19:24:34Z"}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T19:24:41Z","type":"docker","variant":""}
{"level":"debug","machine":"nahe.amessage.eu","msg":"manager: accept stage","stage-id":310,"time":"2020-07-13T19:24:41Z"}
{"level":"debug","machine":"nahe.amessage.eu","msg":"manager: stage accepted","stage-id":310,"time":"2020-07-13T19:24:41Z"}
{"level":"debug","msg":"manager: fetching stage details","step-id":310,"time":"2020-07-13T19:24:41Z"}
{"build.id":310,"build.number":9,"error":"Cannot transition status via :enqueue from :pending (Reason(s): Status cannot transition via \"enqueue\")","level":"warning","msg":"manager: cannot publish status","repo.id":62,"stage.id":310,
"time":"2020-07-13T19:24:42Z"}
{"level":"debug","msg":"manager: updating step status","step.id":860,"step.name":"clone","step.status":"running","time":"2020-07-13T19:24:42Z"}
{"level":"debug","msg":"manager: updating step status","step.id":860,"step.name":"clone","step.status":"success","time":"2020-07-13T19:24:43Z"}
{"level":"debug","msg":"manager: updating step status","step.id":861,"step.name":"test","step.status":"running","time":"2020-07-13T19:24:43Z"}
{"commit":"863ddaf22744fd5193400e326a0f8d93bc9dc1fb","event":"push","level":"debug","msg":"webhook parsed","name":"web-forms","namespace":"mawis","time":"2020-07-13T19:24:50Z"}
{"commit":"863ddaf22744fd5193400e326a0f8d93bc9dc1fb","event":"push","level":"debug","msg":"trigger: received","ref":"refs/heads/master","repo":"mawis/web-forms","time":"2020-07-13T19:24:50Z"}
{"fields.time":"2020-07-13T19:24:50Z","latency":407087643,"level":"debug","method":"POST","msg":"","remote":"172.17.0.1:41406","request":"/hook","request-id":"1ee3ZxvxJ2I7daDB8fXLykQ72qp","time":"2020-07-13T19:24:50Z"}
{"level":"debug","msg":"api: root access granted","name":"web-forms","namespace":"mawis","request-id":"1ee3a6EJ4d98B6AjwsUlZafG3LM","time":"2020-07-13T19:24:51Z","user.admin":true,"user.login":"mawis"}
{"fields.time":"2020-07-13T19:24:51Z","latency":4237269,"level":"debug","method":"GET","msg":"","remote":"172.17.0.1:41410","request":"/api/repos/mawis/web-forms","request-id":"1ee3a6EJ4d98B6AjwsUlZafG3LM","time":"2020-07-13T19:24:51Z
"}
{"level":"debug","msg":"api: root access granted","name":"web-forms","namespace":"mawis","request-id":"1ee3aAlv8y0FjTeOC6PtbUo2Wtr","time":"2020-07-13T19:24:51Z","user.admin":true,"user.login":"mawis"}
{"fields.time":"2020-07-13T19:24:51Z","latency":2449309,"level":"debug","method":"GET","msg":"","remote":"172.17.0.1:41414","request":"/api/repos/mawis/web-forms/builds?page=1","request-id":"1ee3aAlv8y0FjTeOC6PtbUo2Wtr","time":"2020-0
7-13T19:24:51Z"}
{"level":"debug","msg":"api: root access granted","name":"web-forms","namespace":"mawis","request-id":"1ee3p5DeLAeIXsfAmBmGxdGzo1d","time":"2020-07-13T19:26:50Z","user.admin":true,"user.login":"mawis"}
{"level":"debug","msg":"api: root access granted","name":"web-forms","namespace":"mawis","request-id":"1ee3p4dBFWHPLyAMYhO5BkLXggd","time":"2020-07-13T19:26:50Z","user.admin":true,"user.login":"mawis"}
{"fields.time":"2020-07-13T19:26:50Z","latency":12120358,"level":"debug","method":"GET","msg":"","remote":"172.17.0.1:41606","request":"/api/repos/mawis/web-forms/builds/9","request-id":"1ee3p4dBFWHPLyAMYhO5BkLXggd","time":"2020-07-13
T19:26:50Z"}
{"fields.time":"2020-07-13T19:26:50Z","latency":13257361,"level":"debug","method":"GET","msg":"","remote":"172.17.0.1:41608","request":"/api/repos/mawis/web-forms","request-id":"1ee3p5DeLAeIXsfAmBmGxdGzo1d","time":"2020-07-13T19:26:50Z"}
{"level":"debug","msg":"api: root access granted","name":"web-forms","namespace":"mawis","request-id":"1ee3p45XCHThXlGVIvnjzS209Ix","time":"2020-07-13T19:26:50Z","user.admin":true,"user.login":"mawis"}
{"fields.time":"2020-07-13T19:26:50Z","latency":3241578,"level":"debug","method":"GET","msg":"","remote":"172.17.0.1:41612","request":"/api/repos/mawis/web-forms/builds/9/logs/1/1","request-id":"1ee3p45XCHThXlGVIvnjzS209Ix","time":"2020-07-13T19:26:50Z"}
{"level":"debug","msg":"manager: updating step status","step.id":861,"step.name":"test","step.status":"success","time":"2020-07-13T19:27:06Z"}
{"level":"debug","msg":"manager: updating step status","step.id":862,"step.name":"lint","step.status":"running","time":"2020-07-13T19:27:06Z"}
{"level":"debug","msg":"manager: updating step status","step.id":862,"step.name":"lint","step.status":"success","time":"2020-07-13T19:28:20Z"}
{"level":"debug","msg":"manager: updating step status","step.id":863,"step.name":"build","step.status":"running","time":"2020-07-13T19:28:20Z"}
{"level":"debug","msg":"manager: updating step status","step.id":863,"step.name":"build","step.status":"success","time":"2020-07-13T19:31:46Z"}
{"level":"debug","msg":"manager: updating step status","step.id":864,"step.name":"docker","step.status":"running","time":"2020-07-13T19:31:46Z"}
{"level":"debug","msg":"manager: updating step status","step.id":864,"step.name":"docker","step.status":"success","time":"2020-07-13T19:31:48Z"}
{"level":"debug","msg":"manager: stage is complete. teardown","stage.id":310,"time":"2020-07-13T19:31:48Z"}
{"build.id":310,"build.number":9,"level":"debug","msg":"manager: build is finished, teardown","repo.id":62,"stage.id":310,"time":"2020-07-13T19:31:48Z"}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T19:31:50Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T19:32:20Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: request queue item","os":"linux","time":"2020-07-13T19:32:30Z","type":"docker","variant":""}
{"arch":"amd64","kernel":"","kind":"pipeline","level":"debug","msg":"manager: context canceled","os":"linux","time":"2020-07-13T19:33:00Z","type":"docker","variant":""}

I was pushing two commits within a short amount of time. The first one was a commit in a branch that resulted in build #9 which was executed successfully. Build #10 was a merge from this request to master and is hanging in the pending state.

One additional thing I now noticed: in the web interface build #9 shows all the steps in the pipeline:

… while the pending step doesn’t show the steps in the pipeline:

Also I wonder: is it correct, that drone is switching “to a new branch ‘docker-image’ when actually what it is processing is the master branch? Also in #10 it does a checkout on the commit that was build in #9 (compare the commit ID in the white box with the commit ID shown in the dark box).

This is the JSON for the job:

{
    "id": 311,
    "repo_id": 62,
    "trigger": "@hook",
    "number": 10,
    "status": "pending",
    "event": "push",
    "action": "",
    "link": "https://git.amessage.eu/mawis/web-forms/-/commit/863ddaf22744fd5193400e326a0f8d93bc9dc1fb",
    "timestamp": 0,
    "message": "Merge branch 'docker-image'\n",
    "before": "",
    "after": "863ddaf22744fd5193400e326a0f8d93bc9dc1fb",
    "ref": "refs/heads/master",
    "source_repo": "",
    "source": "master",
    "target": "master",
    "author_login": "mawis",
    "author_name": "Matthias Wimmer",
    "author_email": "m@tthias.eu",
    "author_avatar": "https://secure.gravatar.com/avatar/b1f2ad378f4eccf0df09d80d2f305e96?s=80&d=identicon",
    "sender": "mawis",
    "started": 0,
    "finished": 0,
    "created": 1594668290,
    "updated": 1594668290,
    "version": 1,
    "stages": [
        {
            "id": 311,
            "repo_id": 62,
            "build_id": 311,
            "number": 1,
            "name": "default",
            "kind": "pipeline",
            "type": "docker",
            "status": "pending",
            "errignore": false,
            "exit_code": 0,
            "machine": "nahe.amessage.eu",
            "os": "linux",
            "arch": "amd64",
            "started": 0,
            "stopped": 0,
            "created": 1594668290,
            "updated": 1594668312,
            "version": 2,
            "on_success": true,
            "on_failure": false
        }
    ]
}

I can also confirm that nahe.amessage.eu is the correct runner.

The JSON for build 10 indicates the stage was assigned to machine nahe.amessage.eu. This cannot happen without the runner writing stage received and stage accepted in the logs for this particular stage id. The absense of this information in the logs would tell me that we are still missing logs that would be required to further debug.

For example, the runner would write the below log entries (assuming you are not using an outdated version of the runner), which includes the stage id which corresponds to the JSON output you provided above.

time="2020-07-13T19:24:41Z" level=debug msg="stage received" stage.id=311 stage.name=default stage.number=1 thread=1
time="2020-07-13T19:24:41Z" level=debug msg="stage accepted" stage.id=311 stage.name=default stage.number=1 thread=1

I think I finally found the problem why builds got stuck on this machine frequently.

I am running two instances of Drone on the host. One to build projects that have their repository on gitlab.com, the other to build projects that have their repository on my own private gitlab instance.

By accident, I had both instances of Drone sharing the same sqlite database.