Drone 2.x UI freeze/crashes tab

We recently upgraded to the 2.x version of Drone with the new UI and have gotten a number of reports of browser tabs crashing when streaming logs for running builds. This only happens when rendering the logs for actively running builds, not for ones already completed (these have no issues).

I was able to come up with a minimal repro which provides some evidence that this happens only when rendering log-heavy builds. A fix would be appreciated since it’s been quite frustrating for our developers.

kind: pipeline
type: docker
name: ui-crash
  - name: crash
    image: ubuntu:bionic
      - for f in $(seq 0 10); do cat /dev/urandom | base64 | head -c 4000; sleep 1; done;
1 Like

I also ran into this issue:

We’re also running into this issue, with multiple users.

Most of our devs are using firefox. I’m not able to debug it because I can’t even get the dev tools to open when it happens, or refresh the page, it’s just locked solid.

I think it’s also worst for long running builds, for some reason. Had a dev do something funky that caused a step to run for an hour, which consistently hangs the page for several minutes.

Pretty replicable though, the steps in this post should work.

Can’t capture a performance recording in firefox, it crashes.

I’ve got a bug open with Harness support for this, opened August 29th.

Apparently they’ve been working on it since September 28th, I’ll add another post if it gets sorted.

This one is tricky

drone server is sending an event stream to the UI - every time an event is received the UI state is updated and the console component rerenders

I started seeing issues at 1500 events in a 30 second period

So I have changed the code to batch these events and update state every quarter of a second - batch streaming log state updates to improve UI performance by d1wilko · Pull Request #428 · drone/drone-ui · GitHub

It is currently live on the latest tag of drone if you want to give it a go!

If there are still issues we can extend the state timeout to half a second :slight_smile:

I’ve updated, I’m waiting for some of our builds to go haywire to see if it helps.

Out of curiosity, I don’t remember this being an issue with the previous version of drone. Is the mechanism that was using different? I vaguely remember drone 1 doing live updates of steps, and I’m fairly sure we had some long running steps when we were using that too.

OK, I’ve got a step that gets stuck running, and has been running for 1 1/2 hours.

It’ll still take just over a minute for drone to load the logs and become responsive again.

There’s 3223 lines in the logs for that step, and the last log message was 260s after the step started executing. We’ve got other, bigger steps that load fine. Are these ones being loaded slower because it’s a currently running step?

What are ‘events’ in this context? Updates from the agent, lines in the console log?

Also, if I cancel the step, the logs for the running step are lost.

It’s kind of an edge case for us - steps shouldn’t be running for an hour - but it’s because something’s deadlocked, and it’s useful to be able to look at the logs to try to figure out what’s going on.

I’ve also got a non long running step, it had been running for 6 minutes, and that took 1 1/2 minutes to load, locking the drone UI.

So the changes I have put in place definitely fix the issue at the head of this thread - I’m wondering if @maxgruebneraeroqual 's issue is something deeper than a UI bug.

Just to confirm - you are definitely using the “latest” tag of drone (Docker) as opposed to the latest tagged release of drone? - the change I made is not in v2.16.0 yet.

Also can I ask which runner you are using? if you can provide the drone yaml it might be helpful to diagnose the issue (feel free to leave out steps that are not an issue and sensitive info)

I’ll try to address your comments in order

I can’t speak to how the logs were handled in drone 1 (I wasn’t with the team back then) - but drone 2 has been a complete rewrite of the backend and frontend - so your issue could be in the UI, server or runner - thats the tricky part to diagnose.

With regards to the larger steps loading fine - we push logs to the UI in 2 ways
For finished steps, the finished logs are saved as a blob either to the DB (default) or to an S3 compatible bucket (needs to be set up by the user) - these are easy to load on the UI and would explain why you see them no problem.
For currently running steps the runner streams to the drone server and in turn the drone server streams an individual “event” for each log to the UI - the fix I put in place should reduce the chance of the browser being overwhelmed by these logs events by batching them instead of updating the UI every event

the events I am referring to are console log lines

The running time of the step shouldn’t actually matter - its more a question of the rate that logs are coming in - so the 6 minute step could well freeze up if there are tons of logs

Hopefully the above helps - I can provide a better response with some more info about your setup :slight_smile:

You’re right Dan, it is something deeper than a UI bug :woman_facepalming: (or an exciting new UI bug?).

We’ve just updated to .net 7, at the same time I updated to the latest drone, and there’s some kind of memory issue with our unit tests that is exhausting the memory on our runners.

They’re stalling out on a unit testing step.

When we come along and click on the stalling step, the page is still locking up, but I assume it’s because the drone UI server is taking a long time to get the logs as the runner is dying (to the point where they can’t be ssh’d into).

Does the drone UI server request the log stream from the runner only when someone clicks on a running step? That would explain what I’m seeing.

While the agent is definitely on fire, I’d still expect the entire UI not to lock up while it waits for logs to come in though - I’m seeing the UI lock up for about a minute and a half, for reference, including all the little timers next to the steps stopping, and not being able to switch steps etc.

It’s like the request for the log stream is blocking or something?

One difference is that instead of the browser being overloaded, it now seems that it’s more just the Drone interface in that tab is locking up, so it’s definitely better

This is using drone:latest, and the drone autoscaler spawning agents on ubuntu 22.04 (we ran into issues with the autoscaler using old ubuntu AMIs)

So if anyone else runs into this, check to see whether your runners are healthy.

Glad to hear you have figured out the root cause - but I appreciate it has been harder to diagnose due to the UI freezing :frowning:

The fact that you can’t SSH into the runner is a bit of a red flag for me - while I accept that the issue ultimately manifests in the UI (and the UI should handle it better) - it feels like the underlying issue is the real concern here.

Your summation of the log stream being requested when someone clicks the running step is basically right - technically it happens when a logs request is made for the step in question - this could be done with curl or a rest client aswell

If the step has been completed the request will return a json with all console log lines (at this point they have been saved as a blob).

If the step is in progress the request will return an event stream which will stream console logs from the runner as you said.

Glad to hear that it is only the tab locking up now - small victories haha - I have a feeling it is the event stream whenever an autoscaler runner freezes up that is the issue here - but am I right in saying you are unblocked in the short term?

I will need to raise a backlog bug around this behaviour and give it the attention it deserves in future.

FWIW; I experience the same thing since upgrading to 2.x today

UI hangs for upwards of 10+ seconds to render output of 1,250 lines which is pretty wild.