Ideals

Interesting comment by Werner Vogels

the best way to completely automate operations is to have developers be responsible for running the software they develop. It is painful at times, but also means considerable creativity gets applied to a very important aspect of the software stack. It also brings developers into direct contact with customers and a very effective feedback loop starts.

If your developers have a survival instinct, this will absolutely be true. If not, your development group will get bogged down in operational issues. That said, I'm not sure it's possible or desirable to not have a separate operations group, Werner's claim about Amazon's policy notwithstanding. I personally prefer to put development effort into creating interfaces that support staff and operations people (and developers, for that matter) can use to diagnose problems quickly. Steve Loughran's proposal of merging ops and development is ideal, but not always doable. In addition, US companies with SOX requirements to satisfy may be legally required to restrict access to production machines to a select group - i.e. the operations staff. So realistically, you may be stuck with using contractors for operations: people with different interests than your development group.

When I was at Galileo, we were beginning the development of the first edition of Galileo Web Services and had the beginnings of a deployment plan - and then Galileo was acquired by Cendant. One of the first acts by new management was to sell the Galileo Data Center to IBM, and then contract with IBM Global Services to run it. I'm not sure of the specifics of who owned what, but the hard part was this: the people were the same, but the mandate was completely different. Whereas we used to have a collaborative relationship of colleagues, we now had a contractual relationship with competing interests. IGS' goal was system uptime, and Galileo's goal was to deploy new products. One of the big tenets of system stability is that you want to minimize changes, and deploying new software is one of the greatest sources of instability. So IGS wanted a very tight change control process. Since IGS was responsible for our servers, we (developers) also lost our login access to these boxes. So where we had relied extensively on the Event Viewer and having a debugger installed on the production server, we now had zero visibility into the system. As a consequence, the development team was completely reliant on others to do what would have been at least partly our job. This was not an easy transition, but it ended up being a good discipline. What we found was that we needed:

  1. For deployment, detailed instructions for deployment and fallback. We needed instructions at the level where the operator could perform them without any help from a developer - regardless of whether or not one of us was going to be on the conference call. A tip is in order here: if your installation or fallback instructions run to more than half a page, then you really need to automate. And half a page is a pretty liberal limit; Steve Loughran's ideal of a single command line is much better. My gold standard for this has always been the Perl build script (at least as of several years ago, I haven't used Perl in ages). When I first started with Perl about 10 years ago, I was tremendously impressed by how you'd type "make", everything would build, install itself, and then run a series of tests to make sure the installation worked, and when your command prompt returned, you had a working Perl on your system. Every application intended to run on a machine other than the one it was developed on should work this way.
  2. Status of the system, in an easily comprehensible format. What we found is that we needed a high-level "red/yellow/green" status on the entire system, with a drill-down status report, so you could break the system down by components. An application "traceroute" may be helpful. In Galileo's case, we had requests that served static files, requests that involved invoking ASP.NET but no other processing, requests that hit the database, requests that went to our mainframe backend, and requests that went through the mainframe backend to an external vendor's system. For instance, if the web server is serving static files, but is not serving vanilla ASP.NET, or is serving it very slowly, that's an important indication of what's wrong.
  3. Meaningful and actionable error messages. If an operator sees "java.net.NoRouteToHostException", they stop reading at "java" and page a developer. Better to say "No route to host x.x.x.x - verify routing from y.y.y.y to x.x.x.x" - this is something that a competent operator should understand, and there's a chance that they'll check the network before picking up the phone.
  4. Noise reduction. A system that constantly produces alerts is hardly better, and arguably worse, than a silent system. Noise reduction involves using a logging system that has the concepts of severity and channels, so you can filter on both. It also means accounting for jitter if you're using inherently noisy indicators. For example, the processor utilization number isn't a good indicator by itself. If you have one sample with utilization at 100%, that's interesting, but much less interesting if the next sample is 40% (assuming your samples are closely spaced). In another example, if you're monitoring the number of requests arriving at your webserver, this number can naturally spike, but sustained high arrival rates across all your servers probably indicate you need to add capacity. Certain metrics are naturally subject to spikes; you have to learn what those metrics are and what sort of smoothing is appropriate before throwing an alert.
  5. Visibility. Writing logfiles on a machine where you can't log in and can't share a drive is no good. You need to publish your status so that you can see it without needing local privileges. One way is to build a web application and do your best to secure it. It may be helpful to your customers to give them an unsecured, or partially privileged, status page. MSN does this for their system status for all users; you might also do a cut-down system status for customers if you have some sort of SLA. Much of the stress for a customer in this situation is that they have no visibility either; they depend on your services but have no clue what's happening with them.
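
To make the noise reduction point concrete, here is a minimal Python sketch of one possible smoothing rule: only raise an alert when every one of the last few samples is over the threshold. The class name, the 90% threshold, and the three-sample window are my own illustrative assumptions, not anything we ran at Galileo, and a moving average or percentile would be an equally reasonable choice depending on the metric.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fire only when each of the last `window` samples exceeds `threshold`.

    Hypothetical helper for illustration; threshold and window are
    assumptions you would tune per metric.
    """

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # A bounded deque silently drops the oldest sample when full.
        self.samples = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        """Record one sample and report whether the alert condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


# A single 100% reading followed by 40% never fires.
spiky_alert = SustainedThresholdAlert(threshold=90.0, window=3)
spiky_results = [spiky_alert.observe(v) for v in [100, 40, 100, 35]]

# Three consecutive high readings fire on the third sample.
sustained_alert = SustainedThresholdAlert(threshold=90.0, window=3)
sustained_results = [sustained_alert.observe(v) for v in [95, 97, 99]]
```

The trade-off is latency: with a three-sample window you hear about a real problem two samples later than you otherwise would, which is usually a fair price for not paging someone over a one-off spike.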

Even if you can't merge your operations and development staffs, you need to at the very least treat ops as a valued customer. I'm talking about web applications here, but the same factors apply to "fat client" applications, particularly the 3rd and 5th points. With an end user, you most certainly don't have local access (absent something like CoPilot). If you create an application log, and give your users an easy way to submit it to your support email, you'll be doing yourself a huge favor.
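
The third point is cheap to act on in code. Below is a hedged Python sketch of a small translation layer that turns low-level network exceptions into messages an operator (or an end user reading an application log) can act on. The function name, the host names, and the message wording are hypothetical; the exception types are standard Python ones, and a Java or .NET application would map its own exception classes the same way.

```python
import socket


def operator_message(exc: Exception, source: str, target: str) -> str:
    """Translate a low-level exception into an actionable message.

    `source` and `target` are the hosts involved; both are supplied by
    the caller. The wording is illustrative only.
    """
    if isinstance(exc, socket.gaierror):
        # Name resolution failed before any connection was attempted.
        return (f"DNS lookup failed for {target} - "
                f"check the resolver configuration on {source}")
    if isinstance(exc, ConnectionRefusedError):
        # The host answered, but nothing is listening on the port.
        return (f"Connection refused by {target} - "
                f"verify the service on {target} is running")
    if isinstance(exc, TimeoutError):
        # No answer at all: routing or a firewall is the usual suspect.
        return (f"Timed out connecting from {source} to {target} - "
                f"verify routing and firewall rules between them")
    # Anything unrecognized still names the host and says what to do next.
    return (f"Unexpected error talking to {target} "
            f"({type(exc).__name__}: {exc}) - escalate to development")


dns_message = operator_message(
    socket.gaierror(-2, "Name or service not known"), "web01", "db01")
refused_message = operator_message(ConnectionRefusedError(), "web01", "db01")
unknown_message = operator_message(ValueError("bad payload"), "web01", "db01")
```

The fallback branch matters as much as the specific ones: even when you can't diagnose the failure, you can still tell the reader which host was involved and who to call.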

— Gordon Weakliem at permanent link