I’m looking for feedback/brainstorming on how to solve the following with a Drone pipeline:
1. deploy an application to a test environment
2. run a series of tests against the newly deployed version of the application
3. if either of the above fails: roll back the deployment.
This works pretty easily with the status: failure conditional, but it starts to get more complicated when we consider adding the same three steps for a production deployment: if all three of the above succeed, do the same for prod. With all six steps, status: failure becomes a trap, because a failure in the test environment would then also trigger a rollback of the production environment…
pipeline:
  deploy:
    image: alpine
    commands:
      - echo 'executing a deployment'
  test:
    image: alpine
    commands:
      - echo 'running some tests'
      - exit 1
  rollback:
    image: alpine
    when:
      status: failure
    commands:
      - echo 'rolling back a deployment'
  promotion:
    image: alpine
    commands:
      - echo 'promoting a deployment'
  test-promotion:
    image: alpine
    commands:
      - echo 'running some tests'
  rollback-promotion:
    image: alpine
    when:
      status: failure
    commands:
      - echo 'rolling back a promotion'
I’ve come up with some incredibly ugly solutions to this… but wanted to know if anyone else has come up with something that feels a bit cleaner?
have you considered bundling deploy / test / rollback into a single step? you could encapsulate all of your logic in a shell script, or even a custom plugin (which is just a wrapper around a shell script)
or you could use something like jsonnet to generate a more complex yaml configuration file, so that you do not have to manually repeat steps …
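To make the single-step idea concrete, here is a minimal sketch of what such a script might look like. The deploy, run_tests and rollback functions are hypothetical placeholders (they only echo), and deploy_and_verify is a name made up for illustration; the real bodies would call whatever deployment and test tooling you already use.

```shell
#!/bin/sh
# Hypothetical placeholders: replace the bodies with your real commands.
deploy()    { echo "deploying to $1"; }
run_tests() { echo "running tests against $1"; }
rollback()  { echo "rolling back $1"; }

# Deploy to one environment, test it, and roll back only that environment.
deploy_and_verify() {
    env_name="$1"

    # If the deploy itself fails, nothing new went live, so skip the rollback.
    deploy "$env_name" || return 1

    # The deploy succeeded; a test failure now rolls back this environment only.
    if ! run_tests "$env_name"; then
        rollback "$env_name"
        return 1
    fi
    return 0
}

# Production is only attempted when the test environment passed.
deploy_and_verify test && deploy_and_verify production
```

Because the rollback is scoped inside the function, a failure in the test environment cannot roll back production, and a failed deploy does not trigger a rollback at all. The trade-off is the one raised in this thread: deploy and test now have to run from the same image.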
Yeah, that’s something I’ve started to consider more closely. I originally wanted to avoid it because, in my case, people are already using a plugin for the deployment part, and I like the freedom of using any possible container for the test step (which leaves people free to choose their own test harnesses etc.).
That being said… it might still be the cleanest/simplest solution.
yes, the challenge with using status: failure is that if any stage fails (such as the clone stage or the deploy itself) it will execute a rollback even though the deployment never actually succeeded, meaning it could roll back the wrong version. This is definitely an interesting use case, and something I need to think a bit more about.
But in the meantime, I recommend using a single step to avoid edge cases. And hopefully we can come up with a cleaner way of handling this particular workflow in a future release.
One idea that came to mind was a status condition that can distinguish between the statuses of different stages, but while thinking it through I kept running into edge cases that sound pretty ugly:
What if people give two stages the same name?
How would it interact with other status checks?
Does this just lead to a stage-based dependency graph?
Is the complexity worth it for this use case?
I think the use case is compelling, but then I see a lot of value in things like smoke tests for cloud-based web apps (so… obvious bias). That being said, part of what makes Drone so appealing to me is its simplicity, so I don’t want to complect things too much either.
One thing you could do is add a parameter, like “rollback: true/false”, to the deployment plugin and default it to “false”, so nothing changes for anyone who is already using it. I’ve used this strategy a lot in the past, but there the rollback was based on whether or not the application could start up (so an error in the deploy itself, or in the initial health checks). Here’s a plugin I did for DCOS/Marathon that uses this strategy.
However, if you want a “custom” test step, this wouldn’t really work that well.
There’s something I experimented with in the past that might work though: you could develop a plugin that triggers a “deployment” event after the “test” step (you could make a simple plugin that calls the Drone CLI directly), and then have everything from the “promotion” step onwards run only when “event: deployment”. This way, a failure in the first, “test environment” pipeline would not cause a rollback in production, since the production rollback would be attached to the deployment event. It’s not the simplest solution, but it could work.
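As a rough sketch of the “simple plugin that calls the Drone CLI” idea: the wrapper below assumes your CLI version has a `drone deploy <repo> <build> <environment>` subcommand (other releases may name or shape it differently, so check `drone --help`), and that the server address and token are already configured for the CLI. trigger_deployment and the DRONE_CLI override are names made up for this example.

```shell
#!/bin/sh
# Assumption: DRONE_CLI points at the Drone CLI binary.
# Setting DRONE_CLI=echo gives a dry run that only prints the arguments.
DRONE_CLI="${DRONE_CLI:-drone}"

# Hypothetical helper: trigger a deployment event for an existing build.
trigger_deployment() {
    repo="$1"
    build="$2"
    target="$3"

    if [ -z "$repo" ] || [ -z "$build" ] || [ -z "$target" ]; then
        echo "usage: trigger_deployment <repo> <build> <environment>" >&2
        return 2
    fi

    # Assumed subcommand shape: drone deploy <repo> <build> <environment>
    "$DRONE_CLI" deploy "$repo" "$build" "$target"
}
```

Inside a pipeline step this might be called as `trigger_deployment "$DRONE_REPO" "$DRONE_BUILD_NUMBER" production`, assuming those variables are set in your build environment. The production deploy, test and rollback steps would then be restricted with “event: deployment” as described above.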
Or you could change your deployment strategy a bit and run some kind of blue/green deployment: you deploy to production, but keep all requests going to the stable version. Then you run your tests against the new deployment, and only if they pass does a new step route all user requests to the new deployment. It’s a bit tricky and may depend a lot on your deployment environment though (you may also run into issues if multiple deployments happen at the same time).