Manager: cannot update step with EOF error

Version: 1.5.1
Database: PostgreSQL 10.7

Hi,
We occasionally (in 0.05% of 10,273 builds, to be precise) see the message below in the drone-ci server log.

{"error":"EOF","level":"warning","msg":"manager: cannot update step","step.id":2767276,"step.name":"wait-datomic","step.status":"running","time":"2020-04-22T19:45:27Z"}
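To count how often this happens, a log line like the one above can be filtered programmatically. The following is only a minimal sketch: `logEntry` and `eofStepFailures` are hypothetical names, and it assumes the server writes one complete JSON object per line with the fields shown above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors only the fields visible in the log line above.
// The type name is hypothetical and not part of Drone.
type logEntry struct {
	Error    string `json:"error"`
	Level    string `json:"level"`
	Msg      string `json:"msg"`
	StepID   int64  `json:"step.id"`
	StepName string `json:"step.name"`
}

// eofStepFailures returns the entries whose message is
// "manager: cannot update step" and whose error is "EOF".
// Lines that are not valid JSON are skipped.
func eofStepFailures(logs string) []logEntry {
	var out []logEntry
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Msg == "manager: cannot update step" && e.Error == "EOF" {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	logs := `{"error":"EOF","level":"warning","msg":"manager: cannot update step","step.id":2767276,"step.name":"wait-datomic","step.status":"running","time":"2020-04-22T19:45:27Z"}`
	for _, e := range eofStepFailures(logs) {
		fmt.Printf("step %d (%s): %s\n", e.StepID, e.StepName, e.Error)
	}
}
```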

It results in a build failure.

It looks like the error comes from updater.go in harness/drone (at commit 2e876a4a000ac160b8b7f133de84358450f7c010), and I presume the implementation of Update is in step.go at the same commit.

Is it reasonable to think the EOF error comes from Postgres? Is there a way to work around it via configuration, such as extending the session timeout?

Thanks,

Do you have a reverse proxy or load balancer sitting between the server and your runners? If so, perhaps the logs sometimes exceed the maximum request size. The only similar reports of this error we have ever received were related to reverse proxies or load balancers terminating the request, hence the EOF.

Hi @bradrydzewski,

No. I deploy Docker runners via the autoscaler, and the EC2 instances are in the same VPC subnet as the host where the Drone Docker image runs. Based on a source code search for the error message, I interpreted manager: cannot update step as meaning that the EOF happened while the server was updating the step's information in the DB. I was wondering whether the DB session had timed out and caused the EOF error. For example, I see the following message in the psql CLI tool when I keep the connection idle for a while.

drone_ci=> select count(stage_build_id) from stages where stage_error='EOF' group by stage_build_id;
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Succeeded.

Can I work around it by specifying something like tcp_keepalives_idle as an option on the database connection?
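For reference, tcp_keepalives_idle is the name of the server-side setting; on the client side, libpq (and therefore psql) accepts keepalives, keepalives_idle, keepalives_interval and keepalives_count in the connection string. Below is a minimal sketch of building such a string. `keepaliveConnInfo` and the host name are hypothetical, and whether the Go driver used by the Drone server honors these parameters is an assumption I have not verified.

```go
package main

import (
	"fmt"
	"strings"
)

// keepaliveConnInfo appends libpq TCP keepalive parameters to a
// keyword/value connection string. Values are in seconds, except
// count, which is the number of unanswered probes tolerated.
func keepaliveConnInfo(base string, idleSec, intervalSec, count int) string {
	var parts []string
	if b := strings.TrimSpace(base); b != "" {
		parts = append(parts, b)
	}
	parts = append(parts,
		"keepalives=1",
		fmt.Sprintf("keepalives_idle=%d", idleSec),
		fmt.Sprintf("keepalives_interval=%d", intervalSec),
		fmt.Sprintf("keepalives_count=%d", count),
	)
	return strings.Join(parts, " ")
}

func main() {
	// db.example.internal is a placeholder host.
	fmt.Println(keepaliveConnInfo("host=db.example.internal dbname=drone_ci", 60, 10, 3))
}
```

The resulting string can be passed to psql as its connection argument to check whether idle sessions still get dropped.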