[0.8.1] Connection lost to agent just after successful run is causing build to hang on stopping services

I have a couple of pipelines and two services in my .drone.yml. After all the pipelines finish, the two services still show as spinning in the UI and the build appears to hang. When I tried to cancel the build it said “Successfully cancelled your build”, but it hadn’t: after F5 it was still showing as spinning. The next attempt to cancel said “Failed to cancel your build”. What’s curious is that after a couple of minutes the build finally shows as successfully completed, with the original 4m run time.

Some facts:

  • all docker pipelines finish normally, it’s the reported status that seems to hang
  • I’m seeing connection failures in the log, so this seems to be the core issue:
2017-10-11T18:04:52Z |DEBU| log stream closed build=17 id=219 image=drillster/drone-volume-cache:latest repo=our/repo stage=rebuild-cache
INFO: 2017/10/11 18:05:10 transport: http2Client.notifyError got notified that the client transport was broken EOF.
INFO: 2017/10/11 18:05:10 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 35.190.188.41:9000: getsockopt: connection refused"; Reconnecting to {our.ci.io:9000 <nil>}
INFO: 2017/10/11 18:05:11 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp 35.190.188.41:9000: getsockopt: connection refused"; Reconnecting to {our.ci.io:9000 <nil>}
2017-10-11T18:05:13Z |DEBU| stop listening for cancel signal build=17 id=219 repo=our/repo
2017-10-11T18:05:35Z |DEBU| pipeline lease renewed build=17 id=219 repo=our/repo
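To get a rough idea of how long the connection was down, the reconnect errors in an excerpt like the one above can be counted and timed. This is only a minimal sketch: it assumes the gRPC lines keep the "INFO: YYYY/MM/DD HH:MM:SS message" shape seen here, and the function names connection_errors and outage_window are made up for illustration, not part of Drone or gRPC.

```python
import re
from datetime import datetime

# Matches the gRPC-style lines in the excerpt, e.g.
# "INFO: 2017/10/11 18:05:10 grpc: addrConn.resetTransport failed ..."
# (assumption: the prefix and timestamp format are stable across lines)
GRPC_LINE = re.compile(r"^INFO: (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def connection_errors(lines):
    """Return (timestamp, message) pairs for lines reporting a broken or refused connection."""
    hits = []
    for line in lines:
        match = GRPC_LINE.match(line.strip())
        if not match:
            continue  # skip the |DEBU| lines and anything else
        stamp, message = match.groups()
        if "connection refused" in message or "transport was broken" in message:
            hits.append((datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"), message))
    return hits


def outage_window(lines):
    """Return seconds between the first and last connection error, or None if there are none."""
    hits = connection_errors(lines)
    if not hits:
        return None
    return (hits[-1][0] - hits[0][0]).total_seconds()
```

Feeding it the full log rather than a short excerpt would show whether the refused connections cluster around the end of each build or happen all the time.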

Could this be caused by excessive logging from some pipelines? Like npm install being too verbose?

I do not think so. This sort of error would indicate network reliability issues between your server and agent. The root cause could be at the OS level, hardware level, or data center level. It is difficult to say, and it may be completely out of the control of the Drone codebase.

The grpc team would be more qualified to discuss potential network issues that would result in these error messages.
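One way to check for the kind of network unreliability described above, independently of Drone and gRPC, is to probe the server port from the agent host with plain TCP connections. The following is a rough sketch using only the Python standard library; can_connect and probe are hypothetical helper names, and a successful TCP connect only shows that the port is reachable, not that the gRPC stream itself is healthy.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def probe(host, port, attempts=10, interval=1.0):
    """Try to connect `attempts` times and return the number of failed attempts."""
    failures = 0
    for i in range(attempts):
        if not can_connect(host, port):
            failures += 1
        if i < attempts - 1:
            time.sleep(interval)  # pause between attempts, not after the last one
    return failures
```

For the setup in this thread, something like probe("our.ci.io", 9000) run on the agent machine while a build is finishing would show whether the port is intermittently refusing connections (host and port are taken from the log excerpt; substitute your own).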

Thanks. I was able to work around the issue by running npm install --silent. I’ll keep you updated if I find anything else.

Actually, it’s still happening. The mongo service shows as “running” for about 3 minutes after all pipelines finish. After 7m in total, the build finishes with a reported total time of 4m.

If you are able to reproduce this consistently, I would encourage sending a patch.