Hi everyone.
We are running Drone 1.0.1 to run our tests in parallel.
The tests run in ~20 steps, each running an rspec command in a different folder, all in parallel on the same machine (an r5.xlarge instance).
The problem is that some steps fail with a “Failure” error.
The agent log shows an “Optimistic Lock Error”, and we don’t know what might be causing it.
This error usually indicates someone canceled the Build in the user interface. When the build is canceled all running stages are canceled (and updated in the database). Because we are dealing with distributed systems, it is possible the agent will try to update the stage before it receives the cancelation event. To prevent this race condition we implement optimistic locking.
So in general Optimistic Locking does not signal an error and can be considered normal. If a build is not canceled and is receiving such an error, we would require ALL server and agent debug logs to help you triage.
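To make the race condition described above concrete, here is a minimal sketch of version-based optimistic locking. It is not Drone's actual code: the `StageStore` class, the `version` field, the status strings, and `simulate_cancel_race` are all hypothetical names made up for illustration, and an in-memory dict stands in for the database.

```python
from dataclasses import dataclass


class OptimisticLockError(Exception):
    """Raised when a write is based on a stale version of the row."""


@dataclass
class Stage:
    status: str
    version: int


class StageStore:
    """In-memory stand-in for a database table of stages (hypothetical)."""

    def __init__(self):
        self._rows = {}

    def create(self, stage_id, status):
        self._rows[stage_id] = Stage(status=status, version=1)

    def find(self, stage_id):
        row = self._rows[stage_id]
        # Return a copy so the caller holds a snapshot, like a real database read.
        return Stage(status=row.status, version=row.version)

    def update(self, stage_id, snapshot, new_status):
        row = self._rows[stage_id]
        # Similar in spirit to:
        #   UPDATE stages SET status = ?, version = version + 1
        #   WHERE id = ? AND version = ?
        if row.version != snapshot.version:
            raise OptimisticLockError(
                f"stage {stage_id}: expected version {snapshot.version}, "
                f"found {row.version}"
            )
        row.status = new_status
        row.version += 1


def simulate_cancel_race():
    store = StageStore()
    store.create(1, "running")

    # Agent and server both read the stage while it is still running.
    agent_view = store.find(1)
    server_view = store.find(1)

    # The server processes the cancel first and bumps the version.
    store.update(1, server_view, "canceled")

    # The agent has not seen the cancel yet and writes with a stale version.
    try:
        store.update(1, agent_view, "success")
        outcome = "accepted"
    except OptimisticLockError:
        outcome = "rejected"
    return outcome, store.find(1).status
```

In this sketch the agent's late write is rejected and the stage keeps the status written by the cancel, which is why the lock error on its own would not indicate a problem.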
The error is raised even though nobody canceled the build, and it happens about 8 seconds after a step starts, in this case “test_helpers”.
The step reports the error “failure” [1] while the other steps run normally, but the build is still marked as failed.
As an attempted fix, we moved the Drone database to a larger RDS instance, and the error seems to occur less often, but it still happens occasionally.
I will collect the server and agent logs and attach them here.
After some deeper analysis, we agree with @bradrydzewski that the “Optimistic Lock Error” can be considered normal. We confused that log entry with the real build error we are getting, and I am going to open another discussion about that.