Builds stuck in running 0.8-rc.3

I have received reports of builds getting stuck and I wanted to create a dedicated thread to track the issue. This is impacting a subset of Drone users, typically running older versions of Docker. This is not something I’ve been able to personally reproduce.

Please note that this is for builds stuck in running status. If your builds are stuck in pending status (with the clock) please do not use this thread.

The issue seems to be related to logs freezing either while streaming or uploading. We know this because we can follow the agent logs and pinpoint exactly where the agent appears to be blocking. We see the agent blocking after the first log line below:

	logger.Debug().
		Msg("uploading logs")

	uploads.Wait()

	logger.Debug().
		Msg("uploading logs complete")

Docker Bug

There is a known issue with older Docker versions (17.03 and below) that can cause the docker logs command to lock, and therefore causes Drone to lock. So if you are experiencing this issue, upgrading Docker should be your first step.

See the following docker issue https://github.com/moby/moby/issues/30135

See the comment below by @zaa. He has confirmed that upgrading docker resolved this issue for him.

Large Logs

It has been suggested that very large log outputs could also cause the system to lock. This is a theory I plan to test prior to an 0.8-final release. This should be tested using Docker 17.06 or higher to avoid confusion with the Docker issue described above.

How to Help

This has not yet been confirmed to be a Drone issue due to my inability to reproduce. If you upgrade Docker to 17.06 and continue to experience issues we need your help. We need someone that is able to reproduce to step through the code, debug and pinpoint a root cause.


PLEASE NOTE to keep this thread concise and focused we will delete all +1 or me too comments, including comments that describe in detail how you can reproduce this issue but do not provide new or useful information to advance this discussion.

2 Likes

I can consistently reproduce this on the latest stable Docker v17.06.1-ce running in Swarm Mode on a fully updated Ubuntu 16.04.

The configuration is completely stock: no special network or filesystem or anything.

I think I have tried everything, including nuking the DB and the Swarm service.

I am running some other services and I would estimate the total log messages per second to 20, so logging could be related, but I would be surprised.

It looks like this:

  1. I start the server and agent
  2. Trigger a build
  3. It builds successfully
  4. I push another commit in an hour or so
  5. The build appears as bending
  6. Restarting the agent service (by scaling it to 0 and back to 1) mostly results in another successful build. Sometimes even this doesn’t help, then the build remains as stuck forever
  7. An hour later and a new commit: pending again

Unfortunately, I couldn’t find anything interesting in logs. Maybe I need to enable an extra level of debugging? I would be glad for guidance on what I should do to provide all of the useful information.

Drone v0.7.3 works flawlessly.

Right now we need someone to step through the codebase and manually debug this issue.

I would recommend starting with the following section of code, responsible for streaming the logs, since this seems to be where things lock up:

For those unable or unwilling to step through the code and help debug this, please subscribe to the issue. Please avoid +1 and me too comments. The issue is acknowledged and pending a volunteer to debug the code and identify a root cause.

Regarding builds that get stucked in state “running”… (I am running drone 0.8.0-rc.3)

When i try to stop a running build with: "drone build stop [repo_name] [build_nr]“
It will say–> client error 400: Cannot cancel a non-running build”

I know there are ticket #2065 and #2090, but until they are fixed - is there a workaround for this issue, except for re-setting the hooks to vcs etcetera?

(In my case I think the cause for the build getting stucked in the first place is verbose logging)

@ztaf in this case you can use the drone repo kill <repo> <build> command. You will need to download and build the CLI from source until we have a final 0.8 release.

1 Like

@bradrydzewski Thanks so much! I test it right away!!

While I was diagnosing the issue I added a bunch of debug output to the agent, cncd’s docker backend and the vendored docker client library and found out that the issue was indeed connected to a bug in old versions of Docker. Namely, docker daemon was not closing log tailing requests when a docker container was stopped. So a docker container would stop, but client’s request to stream logs was still sitting there without an EOF from the daemon. Since drone waits for the EOF, the build was not being progressed to “done” or “failure” state.

As I mentioned in the original ticket, we resolved the issue by running drone agent and docker-in-docker container (version 17.06) in the same kubernetes pod.

ps. Oh, and I managed to reproduce the issue with stock docker daemon and docker cli version 1.12.6. I started a container, ran “docker logs -f” command and stopped the container. The “docker logs -f” command was stuck in the same fashion: no output, but no EOF either.

1 Like

If the drone process is stuck inside some blocking goroutine sending a SIGABRT to that process will print a stack traces for all goroutines before terminating.

When I have had similar issues I’ve also attached things wich outputs way more things on signal listeners. Like running cpu profiling w pprof for a minue to gather hotspots where the code is doing stuff if the problem isn’t related to a single blocking thing.

Here’s something similar I’ve written for a client applications “debug info export” … It’s just the basics…

SIGABRT is still very useful since it works without changing any code.

I was bitten by that problem I think. So I updated docker first to 17.05-ce.
With my gometalinter shut up (redirecting the log to a file), it seems it’s fixed (it wasn’t working before, and would timeout, with the stream not available).
I will try with the metalinter log going directly to drone, to see if it still works with large amount of data.

I tried by enabling the verbose log of gometalinter, and the drone webapp attempt a:

https://drone.domain.com/api/repos/team/project/logs/5/1/gometalinter

and gets a 404. So there is still some issue with that.
The log in the drone-server is:

time="2017-09-09T02:06:03Z" level=error msg="Error #01: sql: no rows in result set\n" ip=drone.domain.com latency=1.239146ms method=GET path="/api/repos/team/project/logs/5/1/gometalinter" status=404 time="2017-09-09T02:06:03Z" user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/60.0.3112.113 Chrome/60.0.3112.113 Safari/537.36"

[EDIT]
I let the task run the whole night. At some point, 1h later, the server did a:

INFO: 2017/09/09 03:02:55 grpc: Server.processUnaryRPC failed to write status stream error: code = DeadlineExceeded desc = "context deadline exceeded"

It repeated that message for 8h, every 10s until:

panic: send on closed channel

goroutine 92258 [running]:
github.com/drone/drone/server.LogStreamSSE.func2.1(0xc421d9a000, 0xa908, 0xc000)
        /home/anarmindon/go/src/github.com/drone/drone/server/stream.go:360 +0xc4
github.com/drone/drone/vendor/github.com/cncd/logging.(*log).Tail(0xc4203e3220, 0x14256c0, 0xc43702a600, 0xc436bb55c8, 0x2, 0xc43703cb40, 0xc4203f2ab0, 0xc4203f2ab0)
        /home/anarmindon/go/src/github.com/drone/drone/vendor/github.com/cncd/logging/log.go:93 +0x2bb
github.com/drone/drone/server.LogStreamSSE.func2(0x14256c0, 0xc43702a600, 0xc4218ce370, 0xc43703e240, 0x14290e0, 0xc4218ce000, 0xc436bb55a0)
        /home/anarmindon/go/src/github.com/drone/drone/server/stream.go:363 +0x13e
created by github.com/drone/drone/server.LogStreamSSE
        /home/anarmindon/go/src/github.com/drone/drone/server/stream.go:368 +0x838

I just merged a pull request fixes an issue where large log output causes the upload to return an error due to exceeding the maximum grpc payload size. The agent will continue to retry the upload indefinitely because the error will always be the same, thus causing the build to get stuck.

Thanks to @tboerger for pinpointing the exact error:

err: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger
than max (7399047 vs. 4194304)

Note that I think there were at least two different root causes for builds getting stuck:

I therefore believe that both upgrading docker and getting the drone/agent:latest image with the patch liming log size will resolve this issue for most, if not all, people.

1 Like

Just checked with the new 0.8.0-rc.5 and docker 17.05. I don’t see the problem anymore. And I can get my 997 gometalinter alerts :sleepy:

Thank you!

Cool, than it really resolved the issue.