Kubernetes "No space left on device"

Since 1-rc4 we have been seeing the following at the clone stage:

error: copy-fd: write returned: No space left on device
fatal: cannot copy '/usr/share/git-core/templates/hooks/pre-push.sample' to '/drone/src/.git/hooks/pre-push.sample': No space left on device
fatal: Not a git repository (or any parent up to mount point /drone)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
+ git fetch origin +refs/heads/master:
fatal: Not a git repository (or any parent up to mount point /drone)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).

My first question: have you checked whether the node has run out of disk space? If it has, can you run a disk usage analysis to see whether Drone is responsible for filling the disk (and, if so, where those files are stored) and report back?
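One way to do the disk usage analysis mentioned above is a small script run directly on the node. This is only a sketch: the choice of /tmp as the starting directory is an assumption, and the sizes are apparent file sizes, which can differ from what du reports for sparse files.

```python
import os
import shutil


def dir_size(path):
    """Sum the apparent size in bytes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_children(parent, top=5):
    """Return (size, path) pairs for the largest immediate subdirectories."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    start = "/tmp"  # assumed starting point; change to the directory you suspect
    if os.path.isdir(start):
        usage = shutil.disk_usage(start)
        print(f"{start}: {usage.used / 1e9:.1f} GB used of {usage.total / 1e9:.1f} GB")
        for size, path in largest_children(start):
            print(f"{size / 1e6:10.1f} MB  {path}")
```

If one directory dominates the list, that is the place to look at more closely before reporting back.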

If the node has not run out of disk space, then I would be interested in the following:

  • Does this happen on every node, or just some?
  • Does this happen for every repository, or just some? If just some, can you provide a sample yaml?
  • If this happens for every build, how were you able to successfully test this fix, and what might have changed since your last successful test?

Disk pressure

There is no disk pressure. The cluster has 4 nodes, each with 100 GB of allocatable storage, and currently only 4 GB, 7 GB, 6 GB, and 2 GB are in use. Each node also reports KubeletHasNoDiskPressure.

Does this occur on each build?

No, not on all builds. As you have noticed, we were able to test a fix for the volume mount issues in this issue.

Does this occur on each node?

All builds were running on the same node. To test whether the issue was node-dependent, we ran kubectl cordon node-id-xxxx and then ran the build again so that it was scheduled on a different node. This solved the issue: everything was fine, with no storage issues.

That said, the original node definitely has physical space. Other pods on the same node can use that space without issue.

Drone does not yet support persistent volume claims, and as a temporary workaround creates a host machine mount under /tmp (tracked at https://github.com/drone/drone-runtime/issues/19). Perhaps /tmp is mounted on a local block device with limited disk space, and the standard Kubernetes storage is mounted on another block device with more disk space?
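To check that hypothesis on a node, you can compare the filesystem that backs /tmp with the one that backs the rest of the node's storage. The sketch below assumes /var/lib/kubelet as the second path, which is a guess at where the node's main Kubernetes storage lives; substitute whatever applies to your nodes. It must run on the node itself, not inside a pod.

```python
import os
import shutil


def describe_mount(path):
    """Return the device ID and capacity of the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "device": os.stat(path).st_dev,
        "total_gb": usage.total / 1e9,
        "free_gb": usage.free / 1e9,
    }


def same_filesystem(path_a, path_b):
    """True when both paths live on the same mounted filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # "/var/lib/kubelet" is an assumed path, used here only for illustration.
    candidates = [p for p in ("/tmp", "/var/lib/kubelet") if os.path.exists(p)]
    for path in candidates:
        info = describe_mount(path)
        print(f"{info['path']}: device {info['device']}, "
              f"{info['free_gb']:.1f} GB free of {info['total_gb']:.1f} GB")
    if len(candidates) == 2:
        print("same filesystem:", same_filesystem(*candidates))
```

If the two paths report different devices and /tmp has a much smaller total, a node can show no disk pressure while writes under /tmp still fail with "No space left on device".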

This seems like a serious issue. @bradrydzewski it doesn’t appear that the /tmp directory gets cleaned after builds. Is it supposed to?

/tmp/ just grows forever until the node breaks. In our case, Drone filled the entire disk, and we didn’t even have enough room to transfer an SSH key to fix it.

I’m going through our other nodes to make sure it doesn’t break those too. What’s the correct way to handle this? Is it to regularly go onto the machine and rm -rf /tmp/.drone?
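As a stopgap, a periodic cleanup could look something like the sketch below. It is not a maintainer-endorsed fix. It assumes that the entries directly under /tmp/.drone belong to individual builds and are safe to delete once they are old enough, which this thread does not confirm. The 24-hour threshold is arbitrary, and the script defaults to a dry run for that reason.

```python
import os
import shutil
import time


def stale_entries(root, max_age_hours=24.0, now=None):
    """List immediate children of root last modified before the cutoff.

    A directory's mtime only changes when entries are added or removed
    directly inside it, so this is a rough staleness signal.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    with os.scandir(root) as entries:
        for entry in entries:
            try:
                mtime = entry.stat(follow_symlinks=False).st_mtime
            except OSError:
                continue  # entry disappeared while scanning
            if mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)


def remove_stale(root, max_age_hours=24.0, dry_run=True):
    """Delete stale children of root, or only print them when dry_run is set."""
    paths = stale_entries(root, max_age_hours)
    for path in paths:
        if dry_run:
            print("would remove", path)
        elif os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path, ignore_errors=True)
        else:
            os.remove(path)
    return paths


if __name__ == "__main__":
    root = "/tmp/.drone"  # path taken from the question above
    if os.path.isdir(root):
        remove_stale(root, max_age_hours=24.0, dry_run=True)
```

Run it with dry_run=True first and check the output against builds that are still in progress before switching to real deletion.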

When I SSH’d into the GKE node, I found that the /tmp folder is only 2 GB in size. The rest of the 30 GB is not assigned to /tmp/, but elsewhere.

What exactly is stored in /tmp/? Is it just the git repo, not what runs in containers?

The native Kubernetes implementation was experimental and has since been formally deprecated. We therefore recommend running agents on Kubernetes.

I see some people saying there is a new version of the Kubernetes runner. Is this true, or is it really deprecated?

The experimental kubernetes runtime (discussed in this thread) was deprecated in April. We launched a second iteration of a kubernetes runner a few weeks ago which you can find at https://docs.drone.io/runner/kubernetes/overview/