We’re using the Drone Kubernetes runner, version 1.0.0-rc.2.
In one of our repos, the log output is viewable while a step is running, but once the step fails and finishes, the logs are no longer viewable, which makes it hard to know what went wrong:
This does not happen in any of our other repos (which use the same Drone runner in the same Kubernetes cluster), which makes me wonder whether it could be caused by the rather large log output in the repo where we are seeing the failure.
Oh, that’s odd! We seem to have the exact same problem.
Anyway, some additional context (I work with Adrien, above, on the DevOps team):
Here is a synopsis of my research on this issue to date:
While a build is running, I can watch the logs of the failing step in either the web UI or the Kubernetes cluster. I can also tail the logs directly from the container while it is running, so we could stream them somewhere; there is no reason for them to be lost. The containers are definitely producing log output while they work.
However, once the step fails, something odd happens to the container (within the Job pod) and the logs for the failed stage go missing, leading to the UI screenshot posted above. The logs for the failed build are nowhere to be found: not in the web UI and not in the container.
Now here is how my Kube runner config is set up:
DRONE_RPC_HOST: <correctly mapped to drone service>
DRONE_RPC_SECRET: <yes, the secrets - and non-failing tasks show logs>
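For completeness, here is a sketch of the relevant part of our runner Deployment manifest. The names and values below are placeholders and assumptions, not our real config; DRONE_DEBUG and DRONE_TRACE are the runner’s verbose-logging switches:

```yaml
# Sketch of the drone-runner-kube Deployment env (values are placeholders).
env:
  - name: DRONE_RPC_HOST
    value: drone.drone.svc.cluster.local   # assumption: in-cluster Drone service
  - name: DRONE_RPC_PROTO
    value: http
  - name: DRONE_RPC_SECRET
    valueFrom:
      secretKeyRef:
        name: drone          # assumption: secret holding the shared RPC secret
        key: rpc-secret
  - name: DRONE_DEBUG        # verbose runner logging
    value: "true"
  - name: DRONE_TRACE        # even more verbose, trace-level logging
    value: "true"
```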
I did see previous conversations on this forum about similar issues, such as:
Some of those links point to the old drone.io Discourse forum from before Drone’s acquisition by Harness, so they are not all current.
Much more recently, I found a thread indicating a particular drone runner image was faulty:
I checked the drone/drone-runner-kube image we were running and bumped it one version, to 1.0.0-rc.2, which was released in October 2021, two months after the updated container that supposedly fixed the issue.
The issue remained, so if the problem is in the runner container, both 1.0.0-rc.1 and 1.0.0-rc.2 are affected. In short, updating the runner did not fix our issue.
The bug report that I believe most closely resembles our issue is this post (Logs go missing after fail):
The report is from July 2020. During that conversation, the Drone team shipped a modified runner image for the reporter that introduced the following configuration flags: DRONE_FEATURE_FLAG_RETRY_LOGS=true, DRONE_FEATURE_FLAG_DELAYED_DELETE=true, and DRONE_FEATURE_FLAG_DISABLE_DELETE=true.
These are described by @bradrydzewski in this post: Kubernetes runner intermittently fails steps - #7 by bradrydzewski, but unfortunately they are not referenced in the Drone documentation.
Hoping this would fix our problem, I proposed to a senior colleague that we try these flags in our dev environment. While doing our due diligence, however, we discovered that Drone had removed these flags from recent drone-runner images.
(This may also explain why they are not documented anywhere; my guess is the patch was intended as a short-lived aid to diagnose that single user’s issue.) So this fix is not an option either.
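For reference, this is how those flags would have been set on the runner Deployment. Since the flags were removed from recent images, this is historical context only, not a working fix:

```yaml
# Historical reference only: these feature flags were removed from recent
# drone-runner-kube images and have no effect there.
env:
  - name: DRONE_FEATURE_FLAG_RETRY_LOGS
    value: "true"
  - name: DRONE_FEATURE_FLAG_DELAYED_DELETE
    value: "true"
  - name: DRONE_FEATURE_FLAG_DISABLE_DELETE
    value: "true"
```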
Other threads have suggested enabling DRONE_DEBUG, checking the RPC configuration, dumping the HTTP configuration, and so on, all of which, as you can see above, I already have in place.
One particular ray of hope was a user who reported that this problem went away when they changed their Drone pipelines from the kubernetes type to the docker type, with no other changes. That seemed to resolve the issue for them, but I have not tested it.
What holds me back from suggesting this is that it would result in a docker-in-docker build flow running on Kubernetes, which is less resource-efficient: multiple pods would spin up for each build job, starving us of resources.
I am sharing it here anyway in case others are looking for a “quick and dirty” fix.
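For anyone who wants to try it, the change is a one-line edit to the pipeline type in .drone.yml (assuming a drone-runner-docker instance is registered with the server). The step below is a made-up example, not our real pipeline:

```yaml
kind: pipeline
type: docker        # was: kubernetes
name: default

steps:
  - name: build
    image: alpine:3
    commands:
      - echo "build steps unchanged"   # placeholder step
```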
We are Enterprise customers of both harness.io and drone.io, and I work hard to keep our engineers happy and satisfied with the tools we give them. We introduced Drone Enterprise specifically to use the Drone Kubernetes runners; the Docker runner is open source, so if we have to fall back to Docker pipelines, we might not need to pay for the Drone licence at all.
If that is the case, the “fix” amounts to not using the Enterprise features we pay for, which is why I’m hoping there is a better solution. My superiors licensed Drone so we could use the Kubernetes runners, which seemed logical at the time.
However, it is critically important that devs can see when their builds failed and why, right there in the UI. That is arguably more important than using the Kubernetes runner, because otherwise our infra team spends its time helping engineers dig out logs.
As an alternative, since I can tail the logs in real time during the build, I could deploy something to collect these logs and ship them to Loki/Grafana off-cluster. But that would mean devs need to use two web UIs, and they might end up preferring Grafana for all their logging!
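As a sketch of what that could look like, here is an abridged Promtail scrape config that tails the Drone job pods’ log files on each node and pushes them to Loki. The Loki URL, namespace, and selector below are assumptions for illustration only:

```yaml
# Abridged Promtail config (assumed values): ship Drone job-pod logs to Loki
# so they survive pod deletion after a failed step.
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki.example.com:3100/loki/api/v1/push   # assumed Loki endpoint
scrape_configs:
  - job_name: drone-build-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods in the namespace where the runner schedules builds
      - source_labels: [__meta_kubernetes_namespace]
        action: keep
        regex: drone          # assumption: builds run in the "drone" namespace
      # Map each container to its on-node log path so Promtail can tail it
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```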
Again, building an alternative logging platform to work around the missing failed-build logs in the web UI means more engineering time, which we hope to avoid.
Please help us find a better solution; we cannot understand why the build logs are disappearing!