Concurrent build issues

I have a rather simple setup with 1 drone server, 4 agents and 1 dind (Docker-in-Docker) service. All of them are running in a Kubernetes cluster as pods (1 pod for the server, 1 pod for the 4 agents and dind).

In general the 4 agents are capable of handling concurrent builds, but that's not the case when at least 2 of them are using the ECR plugin (plugins/ecr) to build docker images. In such cases only one agent builds the image; the others report “Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?”.

My assumption is that this happens because ultimately dind is building the images, and it's not able to use the same docker socket for several builds at the same time. Could you please help me resolve this issue?

Here is my yaml config for agents and dind:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: drone-agent
  namespace: drone
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: drone-agent
    spec:
      dnsPolicy: ClusterFirstWithHostNet
      hostNetwork: true
      containers:
      - image: drone/agent:1.2.1
        imagePullPolicy: Always
        name: drone-agent
        volumeMounts:
          - name: docker-socket
            mountPath: /var/run/docker.sock
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        env:
        - name: DRONE_RPC_SERVER
          value: http://drone.drone.svc.cluster.local.
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-secrets
              key: DRONE_RPC_SECRET
        - name: DRONE_KEEPALIVE_MIN_TIME
          value: "5s"
        - name: DRONE_MAX_PROCS
          value: "3"
        - name: DRONE_AGENT_CONCURRENCY
          value: "3"  
        - name: DRONE_LOGS_DEBUG
          value: "false"
        - name: DRONE_LOGS_TRACE
          value: "true"
        - name: DRONE_LOGS_PRETTY
          value: "true"
        - name: DOCKER_HOST
          value: tcp://localhost:2375
      - image: drone/agent:1.2.1
        imagePullPolicy: Always
        name: drone-agent-2
        volumeMounts:
          - name: docker-socket
            mountPath: /var/run/docker.sock
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        env:
        - name: DRONE_RPC_SERVER
          value: http://drone.drone.svc.cluster.local.
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-secrets
              key: DRONE_RPC_SECRET
        - name: DRONE_KEEPALIVE_MIN_TIME
          value: "5s"
        - name: DRONE_MAX_PROCS
          value: "3"
        - name: DRONE_AGENT_CONCURRENCY
          value: "3"  
        - name: DRONE_LOGS_DEBUG
          value: "false"
        - name: DRONE_LOGS_TRACE
          value: "false"
        - name: DRONE_LOGS_PRETTY
          value: "true"
        - name: DOCKER_HOST
          value: tcp://localhost:2375
      - image: drone/agent:1.2.1
        imagePullPolicy: Always
        name: drone-agent-3
        volumeMounts:
          - name: docker-socket
            mountPath: /var/run/docker.sock
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        env:
        - name: DRONE_RPC_SERVER
          value: http://drone.drone.svc.cluster.local.
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-secrets
              key: DRONE_RPC_SECRET
        - name: DRONE_KEEPALIVE_MIN_TIME
          value: "5s"
        - name: DRONE_MAX_PROCS
          value: "3"
        - name: DRONE_AGENT_CONCURRENCY
          value: "3"  
        - name: DRONE_LOGS_DEBUG
          value: "false"
        - name: DRONE_LOGS_TRACE
          value: "false"
        - name: DRONE_LOGS_PRETTY
          value: "true"
        - name: DOCKER_HOST
          value: tcp://localhost:2375
      - image: drone/agent:1.2.1
        imagePullPolicy: Always
        name: drone-agent-4
        volumeMounts:
          - name: docker-socket
            mountPath: /var/run/docker.sock
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
        env:
        - name: DRONE_RPC_SERVER
          value: http://drone.drone.svc.cluster.local.
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-secrets
              key: DRONE_RPC_SECRET
        - name: DRONE_KEEPALIVE_MIN_TIME
          value: "5s"
        - name: DRONE_MAX_PROCS
          value: "3"
        - name: DRONE_AGENT_CONCURRENCY
          value: "3"  
        - name: DRONE_LOGS_DEBUG
          value: "false"
        - name: DRONE_LOGS_TRACE
          value: "false"
        - name: DRONE_LOGS_PRETTY
          value: "true"
        - name: DOCKER_HOST
          value: tcp://localhost:2375
      - name: dind
        image: "docker.io/library/docker:18.06-dind"
        imagePullPolicy: IfNotPresent
        env:
        - name: DOCKER_DRIVER
          value: overlay2
        securityContext:
          privileged: true
        volumeMounts:
          - name: docker-volumes-cache
            mountPath: /cache
      tolerations:
      - key: "purpose"
        operator: "Equal"
        value: "system"
        effect: "NoSchedule"
      volumes:
        - name: docker-socket
          hostPath:
            path: /var/run/docker.sock
        - name: docker-volumes-cache
          persistentVolumeClaim:
            claimName: drone-volumes-cache

By the way, here I saw that dind is not required anymore, but if I remove dind and change DOCKER_HOST to tcp://172.17.0.1:2375, all builds end up in a Pending state because the agents are not able to connect to that host.

I also tried replacing the agents + dind with a runner with DRONE_RUNNER_CAPACITY=4 and got exactly the same message: “Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?”

I see a few issues here. I have to ask … did you use an old, third-party installation guide to install Drone? The reason I ask is that Drone has not used GRPC in years, which means your configuration is incorrect and is why the agents are unable to connect and process jobs. Also, the drone/agent image was deprecated over a year ago and replaced by the new runner images. Finally, you should not install multiple runners on the same machine; you should use DRONE_RUNNER_CAPACITY to configure concurrency instead. Please see docs.drone.io for our official installation guide and an up-to-date list of configuration parameters.

@bradrydzewski thanks for the reply! Basically our drone setup is quite old; I've just been trying to modify it without major changes.

As I already mentioned, I tried replacing the agents with a runner and got exactly the same result. Here is my config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: drone-runner
  namespace: drone
  labels:
    app.kubernetes.io/name: drone-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: drone-runner
  template:
    metadata:
      labels:
        app.kubernetes.io/name: drone-runner
    spec:
      containers:
      - name: runner
        image: drone/drone-runner-docker:1
        volumeMounts:
          - name: docker-socket
            mountPath: /var/run/docker.sock
        ports:
        - containerPort: 3000
        env:
        - name: DRONE_RPC_HOST
          value: "drone-ci.idecodevsite.com"
        - name: DRONE_RPC_PROTO
          value: https
        - name: DRONE_RPC_SECRET
          valueFrom:
            secretKeyRef:
              name: drone-secrets
              key: DRONE_RPC_SECRET
        - name: DOCKER_API_VERSION
          value: "1.38"
        - name: DRONE_RUNNER_CAPACITY
          value: "4"
      tolerations:
      - key: "purpose"
        operator: "Equal"
        value: "system"
        effect: "NoSchedule"
      volumes:
        - name: docker-socket
          hostPath:
            path: /var/run/docker.sock

It works exactly the same way: normally there are no issues with concurrent builds, but 2 concurrent ECR plugin steps fail, complaining about the docker socket.

this is a docker limitation. If you have two separate processes that run docker build and use the same build parameters (name, tag, etc.), this can result in race conditions and errors. We provide a docker plugin that uses docker-in-docker to work around the limitations of the Docker daemon and to avoid such race conditions: http://plugins.drone.io/drone-plugins/drone-docker/
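A minimal build step with the Docker plugin looks something like this (the repo name and secret names below are placeholders):

kind: pipeline
name: default

steps:
- name: build
  image: plugins/docker
  settings:
    # hypothetical repository; replace with your own
    repo: example/my-app
    tags: latest
    username:
      from_secret: docker_username
    password:
      from_secret: docker_password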

Cool, one last question: is this plugin ECR-compatible?

we have a separate ECR plugin that is a small wrapper around the Docker plugin that you can use. See http://plugins.drone.io/drone-plugins/drone-ecr/
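Usage is essentially the same; a sketch with a placeholder registry, repo and secret names:

kind: pipeline
name: default

steps:
- name: publish
  image: plugins/ecr
  settings:
    access_key:
      from_secret: aws_access_key_id
    secret_key:
      from_secret: aws_secret_access_key
    # placeholder account id, region and repository
    registry: 000000000000.dkr.ecr.us-east-1.amazonaws.com
    repo: my-app
    tags: latest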

ah, the problem is that it's exactly the same plugin I'm using. The issue happens on the “build” step…


@bradrydzewski here is the complete output for the failing build:

[screenshot of the build output]

@bradrydzewski just reread your message: “If you have two separate processes that run docker build and use the same build parameters (name, tag, etc)”. I probably had incorrect test conditions: I was basically starting the same build 2-4 times at the same time. Running builds with different DRONE_COMMITs together didn't lead to any issues.
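For anyone else hitting this: tagging images with the commit SHA instead of a fixed tag keeps the build parameters unique across concurrent builds. A sketch of what I mean (step and repo names are just examples):

kind: pipeline
name: default

steps:
- name: publish
  image: plugins/ecr
  settings:
    repo: my-app
    tags:
      # unique per commit, so concurrent builds don't collide
      - ${DRONE_COMMIT_SHA:0:8}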

Many thanks for the quick and helpful reply and have a good day 🙂

@bradrydzewski, seems to still happen quite frequently:

There were 2 different commits leading to ECR plugin usage. One build passed, the other wasn't able to connect to the docker daemon.

Are you sure that the ECR plugin is capable of handling multiple concurrent builds? It seems like it's sharing the same host socket.

I also see that with multiple concurrent builds the CPU consumption of my EC2 t3.medium instance (completely dedicated to Drone) hits 100%. Is it possible that there's just not enough CPU for docker to build, and that this is improperly reported as a docker daemon connectivity issue?

yes, I am sure. The ecr plugin uses docker-in-docker and therefore does not share the same docker socket with other containers. You can audit the code here. The only way the ecr plugin would share the same socket is if you explicitly mounted the socket as a volume in your yaml (which is not something you should do).
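To be clear, the anti-pattern to look for (and remove) in a pipeline yaml is a host-path volume that hands the host socket to the step, something like this (names are illustrative):

kind: pipeline
name: default

steps:
- name: publish
  image: plugins/ecr
  volumes:
  # do NOT do this: it shares the host docker socket with the step
  - name: dockersock
    path: /var/run/docker.sock

volumes:
- name: dockersock
  host:
    path: /var/run/docker.sock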