Stop Other Steps When Any One Fails?

I’m having a hard time finding this via Google/the docs, etc. I have parallel steps running and when any of them fail I want the whole thing to full stop. This is what I currently have; is there an attribute/setting that I am not seeing?

kind: pipeline
type: docker
name: default

steps:
  - name: setup-build-env
    image: <ci-build-runner:debian-master.2c27fba>
    privileged: true
    commands:
      - while ! docker system info > /dev/null 2>&1; do sleep 1; done
      - chown docker:docker /drone/src/* -R
      - docker-compose logs --tail="10" -f &
      - make parallel-env-prepare
      - make parallel-env-up
    volumes:
      - name: docker_sock_internal
        path: /var/run/
      - name: docker_config
        path: /root/.docker/config.json
  - name: blender-feature-unit-api
    image: <ci-build-runner:debian-master.2c27fba>
    privileged: true
    commands:
      - make parallel-build-blender-base
    volumes:
      - name: docker_sock_internal
        path: /var/run/
      - name: docker_config
        path: /root/.docker/config.json
    depends_on:
      - setup-build-env
  - name: blender-laravel
    image: <ci-build-runner:debian-master.2c27fba>
    privileged: true
    commands:
      - make parallel-build-blender-laravel
    volumes:
      - name: docker_sock_internal
        path: /var/run/
      - name: docker_config
        path: /root/.docker/config.json
    depends_on:
      - setup-build-env
  - name: trousers-nightwatch
    image: <ci-build-runner:debian-master.2c27fba>
    privileged: true
    commands:
      - make parallel-build-trousers
    volumes:
      - name: docker_sock_internal
        path: /var/run/
      - name: docker_config
        path: /root/.docker/config.json
    depends_on:
      - setup-build-env

services:
  - name: docker-in-docker
    image: docker:dind
    privileged: true
    commands:
      - while ! dockerd > /dev/null 2>&1; do echo "Waiting for another build of the same project to complete..."; sleep 10; done
    environment:
      DOCKER_DRIVER: overlay2
    volumes:
      - name: docker_sock_internal
        path: /var/run
      - name: docker_var_lib
        path: /var/lib/docker

volumes:
  - name: docker_sock_host
    host:
      path: /var/run/docker.sock
  - name: docker_var_lib
    host:
      path: /var/lib/dind/blender/
  - name: docker_sock_internal
    temp: {}
  - name: docker_config
    host:
      path: /root/.docker/config.json

trigger:
  branch:
    - feature/drone-build
  event:
    - push
For reference, the versions in use:

cat docker-compose.yml | grep "image:"
    image: drone/drone:1.6.2
    image: drone/drone-runner-docker:1.0.1
    image: drone/cli:1.2-alpine

I have parallel steps running and when any of them fail I want the whole thing to full stop

this capability is not currently supported

How is this not a thing? I don’t want to wait for Drone to clean up the runner containers while one straggling step runs for another 10 minutes before it starts another build. No workarounds?

Honestly not many people have asked for this, and nobody has sent a pull request. I certainly see the value and see no reason why this shouldn’t be supported, but it needs to be prioritized against other requests or we need someone to send us a pull request.

In terms of workarounds, I am not aware of one.

I’m open to taking a look if you can point me in a direction in terms of how your development flow looks when trying to modify some of these services. I would imagine the behavior for handling this event would reside in the agent/runner. I haven’t dug into the code much at all, but I believe most of it is written in Go.

awesome. It would be somewhere in this file. It may be as simple as just canceling the context, although I haven’t dug too deep to say for sure.
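
To make the context-cancellation idea concrete, here is a rough, standalone Go sketch of the pattern (hypothetical types and names, not the runner’s actual code): parallel steps share a cancelable context, and a step flagged as fail-fast cancels it on error so the in-flight steps stop.

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// step is a stand-in for a pipeline step; failAll marks a step
// that should cancel the whole run when it errors.
type step struct {
	name    string
	failAll bool
	run     func(ctx context.Context) error
}

// sleepStep simulates a long-running step that exits early when
// the shared context is canceled.
func sleepStep(d time.Duration) func(ctx context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	steps := []step{
		{name: "test-1", run: sleepStep(60 * time.Second)},
		{name: "test-2", run: sleepStep(60 * time.Second)},
		{name: "test-3", failAll: true, run: func(context.Context) error {
			time.Sleep(2 * time.Second)
			return errors.New("early failure")
		}},
	}

	var wg sync.WaitGroup
	for _, s := range steps {
		s := s
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := s.run(ctx); err != nil {
				fmt.Printf("%s exited: %v\n", s.name, err)
				if s.failAll {
					// cancel the shared context so every other
					// in-flight step observes ctx.Done() and stops
					cancel()
				}
			}
		}()
	}
	wg.Wait()
}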

Sounds good - again, is there a way of getting set up to develop/iterate easily enough? Is there a reference guide on dev environment setup that I could use to get set up quickly and start poking at it? Are there logging variables you use for debugging? Otherwise, if I end up banging my head for too long it might be too much of a context switch right now.

My guess is it’s: clone the docker runner repository, and I can re-use the drone runner container and mount it where it runs its code currently? go install, go run, etc. Any quick high-level details you could shoot over to help get me going would be appreciated; then I can try to poke at it when I’m at a lull.

I’d love to get this capability going, as we are in the process of migrating a ton of our pipelines and this will definitely be a pain point.

the recommended way to develop the Docker runner is to build the binary file on your host machine and then execute a pipeline from the command line. Note that executing the pipeline has zero interaction with the Drone server, and is completely standalone. Here are the commands:

$ go build
$ ./drone-runner-docker exec /path/to/.drone.yml

I recommend testing with the below yaml file, which disables cloning. You should be able to mock up a yaml that launches steps in parallel and then fails one step while the others are running.

kind: pipeline
type: docker
name: default

clone:
  disable: true

steps: ...

Solid - that should be a great way to get started

Alright, so I put together a very rough lab setup reproducing this. Happy to clean things up and change things around later, but this is just for illustration.

The ideal solution would be to add an attribute to a step that can forcefully fail the rest of the pipeline; in this case I just called it fail_all (boolean). It could make sense as fail_all_on_failure, but that can be changed easily enough.

Ideally I would put it on each of the parallel steps, but I only put it on the one step that is quick to fail

Desired Result

When several parallel steps are running and one of them fails, fail all of the other steps right away so that we are not wasting resources waiting for a pipeline that should fail fast and immediately.

In our current pipeline we have 5 special Build Runner images with Docker Compose / Docker clients that talk to a Docker-in-Docker instance and then run test suites that interact with multiple codebases and services that all talk to each other. Each runner stands up an identical instance of the application using all of these services. One test suite could fail very early, and since we want to re-use the cache of the DinD instance per project, we make other builds wait until the current pipeline is 100% done, which could be another 10-15 minutes at least. This is very annoying for developers: the tests keep running, and the containers do not get killed until the command finishes, even after a cancellation.

Mock Lab Pipeline Setup

I quickly realized it made more sense to just put a quick mock pipeline together instead of using one of our CI pipelines. Here I just used a few Debian containers with some sleeps and echos.

kind: pipeline
type: docker
name: default

clone:
  disable: true

steps:
  - name: setup
    image: debian:stable-slim
    commands:
      - sleep 1 && echo "Done with Setup"
  - name: test-1
    image: debian:stable-slim
    commands:
      - sleep 60 && echo "Done 1"
    depends_on:
      - setup
  - name: test-2
    image: debian:stable-slim
    commands:
      - sleep 60 && echo "Done 2"
    depends_on:
      - setup
  - name: test-3
    image: debian:stable-slim
    fail_all: true
    commands:
      - sleep 2 && echo "Attempting early failure" && exit 1
    depends_on:
      - setup
  - name: test-4
    image: debian:stable-slim
    commands:
      - sleep 70 && echo "Done 4"
    depends_on:
      - setup

Pipeline Run Example

./drone-runner-docker exec --debug --dump ./.drone.yml

Executing step ["setup"] [fail_all: false]
Executing step ["test-1"] [fail_all: false]
Executing step ["test-2"] [fail_all: false]
Executing step ["test-3"] [fail_all: true]
Executing step ["test-4"] [fail_all: false]
DEBU[0000] Running Step                                  step.command="[echo \"$DRONE_SCRIPT\" | /bin/sh]" step.image="docker.io/library/debian:stable-slim" step.name=setup
[setup:1] + sleep 1 && echo "Done with Setup"
[setup:2] Done with Setup
step exited
DEBU[0002] Running Step                                  step.command="[echo \"$DRONE_SCRIPT\" | /bin/sh]" step.image="docker.io/library/debian:stable-slim" step.name=test-4
DEBU[0002] Running Step                                  step.command="[echo \"$DRONE_SCRIPT\" | /bin/sh]" step.image="docker.io/library/debian:stable-slim" step.name=test-1
DEBU[0002] Running Step                                  step.command="[echo \"$DRONE_SCRIPT\" | /bin/sh]" step.image="docker.io/library/debian:stable-slim" step.name=test-2
DEBU[0002] Running Step                                  step.command="[echo \"$DRONE_SCRIPT\" | /bin/sh]" step.image="docker.io/library/debian:stable-slim" step.name=test-3
[test-1:3] + sleep 60 && echo "Done 1"
[test-4:4] + sleep 70 && echo "Done 4"
[test-2:5] + sleep 60 && echo "Done 2"
[test-3:6] + sleep 2 && echo "Attempting early failure" && exit 1
[test-3:7] Attempting early failure
step exited
DEBU[0007] Failing all steps in build                    step.name=test-3
step exited
step exited
step exited
{
  "Build": {
    "id": 1,
    "repo_id": 0,
    "trigger": "",
    "number": 1,
    "status": "killed",
    "event": "push",
    "action": "",
    "link": "",
    "timestamp": 0,
    "message": "",
    "before": "",
    "after": "",
    "ref": "",
    "source_repo": "",
    "source": "",
    "target": "",
    "author_login": "",
    "author_name": "",
    "author_email": "",
    "author_avatar": "",
    "sender": "",
    "started": 0,
    "finished": 0,
    "created": 1574231669,
    "updated": 1574231669,
    "version": 0
  },
  "Repo": {
    "id": 1,
    "uid": "",
    "user_id": 0,
    "namespace": "",
    "name": "",
    "slug": "",
    "scm": "",
    "git_http_url": "",
    "git_ssh_url": "",
    "link": "",
    "default_branch": "",
    "private": false,
    "visibility": "",
    "active": false,
    "config_path": "",
    "trusted": false,
    "protected": false,
    "ignore_forks": false,
    "ignore_pull_requests": false,
    "timeout": 60,
    "counter": 0,
    "synced": 0,
    "created": 1574231669,
    "updated": 1574231669,
    "version": 0
  },
  "Stage": {
    "id": 1,
    "build_id": 0,
    "number": 1,
    "name": "default",
    "status": "killed",
    "errignore": false,
    "exit_code": 137,
    "os": "",
    "arch": "",
    "started": 1574231677,
    "stopped": 1574231677,
    "created": 1574231669,
    "updated": 1574231669,
    "version": 0,
    "on_success": false,
    "on_failure": false,
    "steps": [
      {
        "id": 0,
        "step_id": 1,
        "number": 1,
        "name": "setup",
        "status": "success",
        "exit_code": 0,
        "started": 1574231669,
        "stopped": 1574231671,
        "version": 0
      },
      {
        "id": 0,
        "step_id": 1,
        "number": 2,
        "name": "test-1",
        "status": "killed",
        "exit_code": 137,
        "started": 1574231671,
        "stopped": 1574231677,
        "version": 0
      },
      {
        "id": 0,
        "step_id": 1,
        "number": 3,
        "name": "test-2",
        "status": "killed",
        "exit_code": 137,
        "started": 1574231671,
        "stopped": 1574231677,
        "version": 0
      },
      {
        "id": 0,
        "step_id": 1,
        "number": 4,
        "name": "test-3",
        "status": "failure",
        "exit_code": 1,
        "started": 1574231671,
        "stopped": 1574231677,
        "version": 0
      },
      {
        "id": 0,
        "step_id": 1,
        "number": 5,
        "name": "test-4",
        "status": "killed",
        "exit_code": 137,
        "started": 1574231671,
        "stopped": 1574231677,
        "version": 0
      }
    ]
  },
  "System": {}
}

Diff

diff --git a/.gitignore b/.gitignore
index 1a47b29..bd2f260 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,4 @@ release/*
 .docker
 .env
 NOTES*
+.idea
\ No newline at end of file
diff --git a/engine/compiler/step.go b/engine/compiler/step.go
index d0bd093..74a0152 100644
--- a/engine/compiler/step.go
+++ b/engine/compiler/step.go
@@ -5,6 +5,7 @@
 package compiler
 
 import (
+	"fmt"
 	"strings"
 
 	"github.com/drone-runners/drone-runner-docker/engine"
@@ -22,6 +23,7 @@ func createStep(spec *resource.Pipeline, src *resource.Step) *engine.Step {
 		Entrypoint:   src.Entrypoint,
 		Detach:       src.Detach,
 		DependsOn:    src.DependsOn,
+		FailAll:      src.FailAll,
 		DNS:          src.DNS,
 		DNSSearch:    src.DNSSearch,
 		Envs:         convertStaticEnv(src.Environment),
@@ -47,6 +49,8 @@ func createStep(spec *resource.Pipeline, src *resource.Step) *engine.Step {
 		// Resources:    toResources(src), // TODO
 	}
 
+	fmt.Println(fmt.Sprintf("Executing step [%q] [fail_all: %t]", src.Name, src.FailAll))
+
 	// appends the volumes to the container def.
 	for _, vol := range src.Volumes {
 		dst.Volumes = append(dst.Volumes, &engine.VolumeMount{
diff --git a/engine/resource/pipeline.go b/engine/resource/pipeline.go
index c992fff..5e121f5 100644
--- a/engine/resource/pipeline.go
+++ b/engine/resource/pipeline.go
@@ -94,6 +94,7 @@ type (
 		Environment map[string]*manifest.Variable  `json:"environment,omitempty"`
 		ExtraHosts  []string                       `json:"extra_hosts,omitempty" yaml:"extra_hosts"`
 		Failure     string                         `json:"failure,omitempty"`
+		FailAll     bool                           `json:"fail_all,omitempty" yaml:"fail_all"`
 		Image       string                         `json:"image,omitempty"`
 		Network     string                         `json:"network_mode,omitempty" yaml:"network_mode"`
 		Name        string                         `json:"name,omitempty"`
diff --git a/engine/spec.go b/engine/spec.go
index 9b524dc..83d2390 100644
--- a/engine/spec.go
+++ b/engine/spec.go
@@ -40,6 +40,7 @@ type (
 		MemSwapLimit int64             `json:"memswap_limit,omitempty"`
 		MemLimit     int64             `json:"mem_limit,omitempty"`
 		Name         string            `json:"name,omitempty"`
+		FailAll      bool              `json:"fail_all,omitempty"`
 		Network      string            `json:"network,omitempty"`
 		Networks     []string          `json:"networks,omitempty"`
 		Privileged   bool              `json:"privileged,omitempty"`
diff --git a/runtime/execer.go b/runtime/execer.go
index 9071abf..45cfa5c 100644
--- a/runtime/execer.go
+++ b/runtime/execer.go
@@ -8,6 +8,7 @@ package runtime
 
 import (
 	"context"
+	"fmt"
 	"sync"
 
 	"github.com/drone-runners/drone-runner-docker/engine"
@@ -72,6 +73,13 @@ func (e *execer) Exec(ctx context.Context, spec *engine.Spec, state *pipeline.St
 	for _, s := range spec.Steps {
 		step := s
 		d.AddVertex(step.Name, func() error {
+			log := logger.FromContext(ctx).
+				WithField("step.name", step.Name).
+				WithField("step.command", step.Command).
+				WithField("step.image", step.Image)
+
+			log.Debug("Running Step")
+
 			return e.exec(ctx, state, spec, step)
 		})
 	}
@@ -197,6 +205,16 @@ func (e *execer) exec(ctx context.Context, state *pipeline.State, spec *engine.S
 		if err != nil {
 			multierror.Append(result, err)
 		}
+
+		fmt.Println("step exited")
+
+		if step.FailAll {
+			log.Debug("Failing all steps in build")
+			state.Cancel()
+			e.engine.Destroy(noContext, spec)
+			return nil
+		}
+
 		// if the exit code is 78 the system will skip all
 		// subsequent pending steps in the pipeline.
 		if exited.ExitCode == 78 {

Let me know your thoughts

@bradrydzewski, let me know your thoughts on this. Happy to change this around or even do it entirely differently, but I think it’d be really valuable to have something like this mainlined.

I completely agree, and thanks for posting the patch. At first glance it looks good, but I would like to apply the patch and run some tests. This time of year gets a little busy, but we will do our best to review and get it merged as soon as possible.


edit: I wanted to provide some more details as to why this could take a few days, aside from the holidays. I am trying to take the entire runtime package and move it to the runner-go library. The reason is that this package is virtually identical for all runners, but was difficult to centralize due to the lack of generics. I’m taking another stab at this, since otherwise we would need to apply this change to 5 different runners :)

Definitely wasn’t expecting it to be merged as-is since it was roughed together. Like I said, let me know; I’m happy to clean it up or change things around.

I wholly appreciate how responsive you are to the community as it is. So no worries about being busy, I understand. Plus getting your logic decoupled and centralized also makes sense

The only thing I could see being needed on top of this is a cleanup step that fires on a cancellation; I’m not sure if there is something for this already. In general, even when someone hits cancel from the frontend, it’d be nice to have that hook.

I like to run explicit cleanup commands to gracefully shut down docker-compose-built resources at the end of a build, fail or success, and doing a full stop would prevent that.
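
For concreteness, the kind of cleanup step I mean looks roughly like the sketch below (the step name and depends_on list are just illustrative, mirroring the pipeline at the top of this thread). It is gated on when: status so it runs on success or failure, but a hard kill of the whole pipeline would skip it:

  - name: cleanup
    image: <ci-build-runner:debian-master.2c27fba>
    commands:
      # gracefully dismantle the compose-built containers, networks, and volumes
      - docker-compose down -v
    volumes:
      - name: docker_sock_internal
        path: /var/run/
    depends_on:
      - blender-feature-unit-api
      - blender-laravel
      - trousers-nightwatch
    when:
      status:
        - success
        - failure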

The system will always clean up all resources it creates, but will not clean up resources created out of bounds. This is a negative side effect of mounting the host machine’s docker socket and creating resources on the host from inside your pipeline, which is something we recommend avoiding unless absolutely necessary. Aside from mutating the host, it has networking and volume limitations that can be non-obvious and result in user frustration (see Volume not getting mounted).

We are working on a VirtualBox runner which may be more tailored to your use case. It would allow you to execute pipelines inside a VM and launch containers inside a VM, and would ensure they are destroyed on cancel. We’ve also considered what a docker-compose runner might look like: https://github.com/drone/drone-runtime/issues/80

The DinD solution that I’m using works almost perfectly. The only issue I’ve run into is that, to reduce the possibility of orphaned resources in the DinD instance, I run docker-compose down -v at the end of a run so that volumes, networks, etc. can be dismantled. Sometimes, after enough build runs, we start seeing weird symptoms on the network side because things are not getting cleaned up. We can simply blow away the persisted DinD folder when we get to this state and it will rebuild the cache, but it is preferable to keep things self-cleaning.

Because the DinD volume is mounted and persisted, images and cache obviously persist through each run, which is hugely beneficial for performance, but you also end up with networks and volumes that dangle if you do not clean them up. Regardless of whether you use a VM runner or however else you want to encapsulate things, you still have the same problem of resources piling up over time.
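
As a rough sketch of what that periodic cleanup could look like (the prune-dind step name is hypothetical and this is not part of our current pipeline), unused networks and volumes can be pruned inside the persisted DinD instance while leaving images alone so the build cache survives:

  - name: prune-dind
    image: <ci-build-runner:debian-master.2c27fba>
    commands:
      # remove unused networks and volumes inside the DinD instance;
      # images are intentionally left untouched so the build cache persists
      - docker network prune -f
      - docker volume prune -f
    volumes:
      - name: docker_sock_internal
        path: /var/run/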

And just to note, each project persists its own Docker installation (DinD) to keep things separate and isolated, and it works quite nicely:

du -sh /var/lib/dind/*
9.2G	/var/lib/dind/project-a
3.3G	/var/lib/dind/project-b
2.5G	/var/lib/dind/project-c-backend
3.8G	/var/lib/dind/project-c-frontend
20G	/var/lib/dind/project-d
432M	/var/lib/dind/marketing-site
7.0G	/var/lib/dind/project-e
6.1G	/var/lib/dind/project-f
1.9G	/var/lib/dind/project-f-fe
13G	/var/lib/dind/project-g
48G	/var/lib/dind/project-h
24G	/var/lib/dind/project-i

I understand that what I’m doing with our pipeline is out of bounds, but we just have way too complex a setup to use traditional pipeline steps in this case. I use them in some of my own projects where I could greenfield things, but here we stand up 4-5 copies of the application across 3 different codebases to do full end-to-end testing, and that results in roughly 35-50 containers during a parallel test pipeline (single build) for 2 of our bigger projects. The parallelization is necessary in our use case, otherwise our builds creep up to 1 hour, and that makes for a painful feedback loop. Drone has otherwise been flexible enough to allow us to run this type of setup (thank you).

I think there would be a lot of value in having a single pipeline step that can fire on a build cancel/failure, regardless of what runner or solution you use.

Worst-case scenario, I can run a docker-compose down -v at the beginning of each run to clean up after the previous build, so it’s not the end of the world.
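
As a sketch of that worst-case fallback (the added command and the || true guard are just for illustration; volumes omitted for brevity), the existing setup-build-env step could start with the teardown:

  - name: setup-build-env
    image: <ci-build-runner:debian-master.2c27fba>
    privileged: true
    commands:
      - while ! docker system info > /dev/null 2>&1; do sleep 1; done
      # tear down whatever the previous (possibly killed) build left behind
      - docker-compose down -v || true
      - make parallel-env-prepare
      - make parallel-env-up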

In this case I think a VM runner (i.e. VirtualBox) would work quite well for your project. The system would execute your entire pipeline inside a VM with zero host machine access or mutation. The VM would have Docker installed so any containers you create would be running inside the VM, and if you cancel the build, the VM is completely destroyed and would not leave any artifacts behind. This is how Travis and Azure Pipelines work. There are pros and cons to Docker-based vs VM-based execution environments. It definitely sounds like your pipelines are better suited for VM-based execution environments, so hopefully that is a solution we can provide in the coming weeks. I had a PoC working that I need to post to GitHub.

Again, as I just mentioned, the entire pipeline relies on containers and on keeping the cache persisted through each run, otherwise our build times quickly get out of hand. I’d much prefer having control over that setup using the DinD solution we’re using now, where it is persisted and uses overlay2; it’s incredibly fast and robust. I’ve done benchmarks on non-persisted DinD runs and the results are dramatically slower.

Plus, using a VM runner is not exactly my preference, because I’m using hardware that is optimized to handle the intense resource load our pipelines put it through in order to keep our build times as low as possible. Picking some run-of-the-mill, throwaway VM template for builds is not ideal, because you typically don’t have any control over whether those are oversubscribed. It sounds like a fantastic option for those who seek it, though.

Back to the original point of this thread, though: being able to configure full-stop failure per step would be ideal. Let me know if you want me to do anything about this, or if you just want me to poke you on this one every once in a while since you wanted to consolidate some of the runner logic; I’m happy to do either.

I hope you have a great Thanksgiving!

I wanted to provide a status update: the pipeline execution code has been abstracted out of the individual runners into the shared runner-go library (see the runtime package). I am in the process of integrating it with the Docker runner.

I also added a new value to the failure enumeration. In addition to failure: ignore there is now the option of failure: fast. This is pretty much what you had above, but instead of adding a new field we decided to re-use the existing enumeration.

This sounds great! Great job as usual Brad <3

Fast fail is now implemented in master and can be tested with the drone/drone-runner-docker:latest image. Here is an example pipeline that I used for testing:

kind: pipeline
type: docker
name: test

steps:
- name: a
  image: alpine:3.8
  failure: fast
  commands:
  - echo foo
  - exit 1

- name: b
  image: alpine:3.8
  commands:
  - sleep 15
  - echo bar

- name: c
  image: alpine:3.8
  commands:
  - echo baz
  depends_on: [ a, b ]
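
With this pipeline, step a fails after a couple of seconds and, because it is marked failure: fast, the runner should stop step b while it is still sleeping and skip step c, rather than waiting out the full 15 seconds.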