Continuous Deployment: A Lesson in Masochism

Posted by Matt Farmer on December 12, 2012 · 11 mins read

I’ve determined over the past few weeks that the statement “deployment is hard” should be nominated as the understatement of the century. I’ve spent a large number of hours in that time getting our continuous deployment setup at Kevy functional. We started out pretty simple, deploying master automatically whenever it passed unit tests. Then we decided to get ambitious and implement something similar to the GitHub deployment process using our Hubot, Siri.

For those unfamiliar with the process, the general idea is this: while you’re writing your code, you also write a series of automated tests for it. When you publish changes to the code, an automated system (in our case, Jenkins) retrieves the latest code, runs the automated tests, and, if it determines all the tests have passed, pushes that code directly to the live server without any human intervention.
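
To make that a little more concrete, here’s a rough sketch of what a Jenkins shell build step for that flow might look like. The specific commands are assumptions for illustration (we happen to be a Rails shop, and the Capistrano task at the end is a stand-in for whatever actually pushes code to your servers), not a copy of our real configuration:

    #!/bin/bash
    # Hypothetical Jenkins "Execute shell" build step.
    set -e                         # abort the build on the first failing command

    bundle install --deployment    # install the gems the application declares
    bundle exec rake test          # run the automated test suite

    # If we got this far, every test passed, so kick off the deploy.
    # "cap production deploy" is a placeholder for your real deploy command.
    bundle exec cap production deploy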

I’ve spent more time doing operations work related to our deployment process than I’d like to admit over the last few weeks. This isn’t great, because while that’s going on I’m not contributing much in the way of actual product code or code reviews. However, the net net is that we’ve ended up with a pretty cool deployment process (although I’ve still got quite a bit to debug). The flow for shipping most features goes something like this:

If you’re making something we consider a small change (simple copy changes, a one- or two-liner, and a few other special cases), make your change and push it to master. Jenkins gets notified of the change, runs our test suite, and if it passes starts the deploy process. If master is the branch currently on production, your code is deployed. If not, the deploy is halted.
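
Here’s a sketch of the gate at the end of that path. How you track which branch production is locked to is up to you; for illustration I’m assuming a plain file on the Jenkins box, which is not necessarily what we do:

    #!/bin/bash
    # Hypothetical gate for pushes to master: only deploy if master is the
    # branch currently locked to production.
    LOCKED_BRANCH=$(cat /var/lib/jenkins/production-branch 2>/dev/null || echo "master")

    if [ "$LOCKED_BRANCH" != "master" ]; then
      echo "Production is locked to '$LOCKED_BRANCH'; halting the master deploy."
      exit 0
    fi

    echo "master is what's on production; deploying."
    ./deploy.sh master    # stand-in for the real deploy step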

If you’re making any other change, the process looks something like this:

  1. Fork a branch, implement your feature with tests.
  2. Open a pull request.
  3. Jenkins tests your branch and pushes status updates to the GitHub status API (there’s a sketch of that call after this list).
  4. Code review happens.
  5. We use our Hubot, Siri, to deploy the pull request to production for us assuming it passes a handful of business rule checks. We whack production with a stick and keep an eye on New Relic for any red flashing lights for a minute or two. At this point production is locked to your pull request. New commits to your branch will be continuously deployed, but new commits to master will not.
  6. Assuming everything is kosher, we merge your pull request. Jenkins does its test juice, then unlocks production and deploys the merged code to production. Continuous deployment from master resumes at this point.
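
The status API call in step 3 is simpler than it sounds: it’s an authenticated POST against the SHA that was just tested. Here’s a hedged sketch of what that post-build step might look like; the token, repository name, and Jenkins variables are placeholders rather than our exact setup:

    #!/bin/bash
    # Hypothetical post-build step: report the build result for a commit back to
    # GitHub so the pull request shows a pass/fail indicator.
    SHA="$GIT_COMMIT"      # the commit Jenkins just tested (set by the Git plugin)
    STATE="success"        # or "failure" / "pending"

    PAYLOAD="{\"state\":\"$STATE\",\"target_url\":\"$BUILD_URL\",\"description\":\"Jenkins build #$BUILD_NUMBER\"}"

    curl -s \
      -H "Authorization: token $GITHUB_TOKEN" \
      -d "$PAYLOAD" \
      "https://api.github.com/repos/your-org/your-app/statuses/$SHA"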

Part of this functionality includes the ability to roll back production to master in the event that your pull request bunks production. Perhaps not ironically, that’s the part that needs the most debugging at the moment. That said, there are a few useful lessons I learned in the process of getting this infrastructure up and running.
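
The relevant lesson shows up again below (rollbacks need to be idempotent), but the basic shape of a safer rollback is easy to sketch. The state file and script names here are hypothetical, reusing the same assumptions as the gate above:

    #!/bin/bash
    # Hypothetical rollback guard: do nothing if production is already on master,
    # so firing the command twice can't unwind anything twice.
    CURRENT=$(cat /var/lib/jenkins/production-branch 2>/dev/null || echo "master")

    if [ "$CURRENT" = "master" ]; then
      echo "Production is already running master; nothing to roll back."
      exit 0
    fi

    echo "Rolling production back from '$CURRENT' to master."
    ./deploy.sh master && echo "master" > /var/lib/jenkins/production-branch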

Lesson One: Just use Jenkins

We are presently on our third continuous integration system. We tried a hosted service first (Railsonfire), then CC.rb, and finally landed on Jenkins. It’s unfortunate, but nothing at all compares to the power Jenkins gives you in configuring builds and plugins. That said, it didn’t come without some strife. I had to implement some modifications to the GitHub Pull Request Builder to make it work for us. For the most part, though, configuring our integration testing has been a point-and-click experience once the server was up and running.

That said, save yourself the headache I endured: just use Jenkins from the start. You’ll find yourself beating your head against a wall to get anything else working just how you want it, whereas, despite its reputation to the contrary, you can get Jenkins up and running in no time.

Lesson Two: Ruby Environment Configuration Sucks, use a Login Shell

There’s really no other way to title this lesson. Configuring a functional Ruby runtime environment is annoying compared to Java or PHP. One thing I remember thinking from my interactions with Ruby and Rails back in the 2.3.x days was that the community’s love of moving quickly was its own worst enemy, and I still believe this to be true. For whatever reason it’s guaranteed that paths are going to get screwed up, this gem isn’t going to talk nicely to that gem, occasionally you’ll have to prefix everything you type with “bundle exec” because things got wonky, and various other peculiarities will crop up that make me occasionally long for the days when sbt did everything I needed. (Thankfully, for some projects it still does.)

This is further aggravated in a continuous deployment scenario where you’re running scripts without a tty attached. Something that took me forever to get to the bottom of was why the paths were all screwy while my deployment scripts were running. Well, we use rvm on production (for a number of reasons I’d love to write about later), and if you want rvm to load correctly you have to use this thing called a login shell. That’s fun, because if you use GNU screen you may not have a login shell. Most scripts don’t run in one, either. You can google how to make sure your screen is running a login shell. For scripts, use the following shebang line:

    #!/bin/bash --login

That’ll get everything rolling along nicely. I assume other shells (sh, zsh, etc) will work the same way.
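
If a login shell isn’t an option for a particular script, the other approach I know of is to source rvm yourself at the top of the script before doing anything gem-related. The path and Ruby version below are placeholders; adjust them for your install:

    #!/bin/bash
    # Fallback when a login shell isn't available: load rvm's shell function
    # manually. This path assumes a per-user rvm install; a system-wide install
    # usually lives at /usr/local/rvm/scripts/rvm instead.
    source "$HOME/.rvm/scripts/rvm"

    rvm use 1.9.3            # placeholder Ruby version
    bundle exec rake test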

Lesson Three: Dummy-proof Deployment

You’re a dummy. That’s ok though, because so am I. Everyone at one point or another is a dummy. Even when nobody is really trying to be a dummy, “shit happens,” as they say, and someone ends up being a dummy anyway. So, deployment needs to be dummy-proof. I didn’t follow this rule so well on the first iterations of our deployment code for our Hubot and it bit us. Specific lessons:

  • All deployment operations should be idempotent. This includes rollbacks. So, issuing the rollback command multiple times shouldn’t unwind all of your migrations. (That’s actually what happened to us. Oops.)
  • Implement steps to prevent multiple deployments from running at once. (We also need to do this one; there’s a sketch of one approach after this list.)
  • Occasionally try to break your deployment scripts by giving them crazy input parameters on test systems and seeing what happens. In short, act like a dummy might act if the dummy were very intent on being a dummy.
  • Be running a watchdog like Monit so that in the event something goes completely sideways there’s some automated process in place to try and get your systems up and running again after your deployment scripts have barfed all over your server. It may not succeed, but at least it can try.
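
For the second bullet, the simplest approach I’m aware of on Linux is to wrap the deploy in flock(1) so a second deploy bails out immediately if one is already holding the lock. The lock file path is arbitrary:

    #!/bin/bash
    # Hypothetical guard: take an exclusive lock so two deploys (or a deploy and
    # a rollback) can't run at the same time.
    LOCKFILE=/var/lock/deploy.lock

    exec 200>"$LOCKFILE"               # open the lock file on file descriptor 200
    if ! flock -n 200; then
      echo "Another deploy is already running; bailing out."
      exit 1
    fi

    # ...the actual deploy steps run here, with the lock held...
    ./deploy.sh "$1"

    # The lock is released automatically when the script exits.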

Lesson Four: Learn how to Barrel Roll

Here’s a hard lesson to learn about continuous deployment: it’s going to go sideways sometimes. It just is. Sorry, but it’s true. You’re going to break something. The purpose of the code review and test suites is to make sure that when things break, you break as little as possible, but something is eventually going to slip through the cracks. For this reason, every member of your team who can deploy code needs to have root access to the production servers, and they need to be armed with the knowledge of how to reset things manually when continuous deployment goes completely wrong. If you wouldn’t have the first clue how to fix it if it broke, you have no business deploying to a server. Doing so regardless will inevitably result in a coworker getting a phone call while they’re on vacation, you making things worse, or some combination of the two if Murphy’s Law is to be believed.

Take the time to learn how to do things by hand so you don’t have to learn when production is down, sirens are blaring, customers can’t use the product, and the guy normally in charge of ops is on a mountain outside of cell range.

Moving Forward

There are a handful of things on my short list for our deployment setup moving forward. Of course, I want to fix the bugs we’ve already got in the mix. Shortly thereafter I want to look into things like:

  • Hitting the New Relic API to record deployments to production in their system (there’s a sketch of that call after this list).
  • Automagically merging master into a branch before deployment, if that merge is needed.
  • Winding back a deploy that went sideways, from within the actual deploy script.
  • Abstracting the entirety of the work I’ve done in this area into a tool that others can use.
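
For the New Relic item, their deployment notification endpoint is just another authenticated POST, as far as I can tell from their docs; the application name, API key, and revision below are placeholders:

    #!/bin/bash
    # Hypothetical post-deploy step: record the deployment in New Relic so it
    # shows up as a marker on the performance charts.
    curl -s https://api.newrelic.com/deployments.xml \
      -H "x-api-key: $NEWRELIC_API_KEY" \
      --data-urlencode "deployment[app_name]=Your App (Production)" \
      --data-urlencode "deployment[revision]=$GIT_COMMIT" \
      --data-urlencode "deployment[user]=jenkins"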

This adventure down the continuous deployment road has been a bit crazy so far. I’m certainly not in Kansas anymore. As always, I’d love for you to leave me some comment love about your experience with continuous deployment, deployment in general, or whatever else about this post strikes your fancy. Until next time, kids.