Yesterday I noticed that even though we had 10 agents deployed, builds started waiting once 2 were running concurrently, which suggests that the other 8 agents were somehow locked.
It also wasn't clear from the Drone UI which of the 10 agents were actually working and which were not.
How can I pinpoint which agents are locked and possibly auto-restart them (using health checks in k8s)?
I think some of this might be resolved when 0.8 is released (planned for Monday, 9/18/17). It includes a number of fixes and improvements that you can take advantage of. Preventing agents from freezing is one of the major fixes we have been working on.
There are also plans to add health checks, but that probably will not land until 0.9: https://github.com/drone/drone/issues/2087
But I am pretty confident that 0.8 will fix most, if not all, of these issues for you.