Diagnosing database issues

Sometimes databases can misbehave and this in turn can affect Drone. Bellow are some common scenarios and their causes.

  • “Optimistic Lock Error” - sometimes this shows up in drone logs, but it is 100% normal, this is a technique used by drone runners, to make sure that a single build only occurs once and on a single runner. It can also occur when cancelling builds through the UI.
  • “postgres or mysql is using lots of connections / 100% CPU” - a thundering herd problem can occur where so much traffic occurs causing the CPU of the DB to hit 100% then the UI becomes un-responsive. This is incredibly rare and you dont need to do anything. However if you want to limit the connections you can use, DRONE_DATABASE_MAX_CONNECTIONS | Drone
  • “other things to check” - check IOPS for your database, check the logs of the db, check the cpu usage and memory usage on your system. finally check the available disk space.

In the drone docs for DRONE_DATABASE_MAX_CONNECTIONS it says “must be configured before the system is first used.” The variable cannot be set later and then restart the drone server? Most other variables work that way. If that’s true you might even include additional text: “Changing the variable and restarting the drone server will not have an effect.”

I just ran into a “thundering herd problem”, too many db connections, and the message “FATAL: remaining connection slots are reserved for non-replication superuser connections”. It seems the default postgres max_connections is 100. If the default DRONE_DATABASE_MAX_CONNECTIONS is unlimited it could overwhelm the database. Jobs were getting permanently hung/frozen.

Not sure all the ramifications, so just an idea: the default drone max_connections should match the default postgres max_connections. If that configuration will more gracefully handle a sudden increase in traffic, and avoid errors.