Java and the five eights
Brad Wilson, who's wisely disabled comments on his site, links to .88888 - The Fabled "Five Eights" of J2EE Availability. I realize that most people won't bother to read the article, since it's much more fun to toss around unsubstantiated opinions, but I give it a hearty Amen, and second Brad's sentiment that "There are some universal lessons here."
There are two key points in this article:
Problem detection is a serious issue. In 40% of cases first notice of application problems are customer or executive complaints.
...actual problem diagnosis takes too long. It takes more than eight hours on average to get an accurate diagnosis, and more than a full day for 30% of organizations
What this boils down to is two items: monitoring and diagnostics. For the last big project I did, we ended up writing an extensive monitoring console that watched WMI events and performance counters. We also implemented self tests for each web service we deployed, which makes diagnostics easier: for instance, we've had database servers fall over, and the self test tells you that quickly. It's still really difficult to do this stuff. The self test is essentially a test for the environmental parts of the system. Ideally, your unit tests will factor out the environment, through mock objects or whatever; a self test applies the same idea to the environment itself. As with unit tests, we tried to add tests as we learned what could go wrong, and as with unit tests, it's a difficult discipline to keep up.
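Here's a minimal sketch of what a self test along these lines might look like. It's in Python and is not the code from that project; the names (CheckResult, tcp_check, run_self_test) and the host and port in the usage note are made up for illustration. Each check exercises one piece of the environment, and a failing check is recorded rather than allowed to stop the rest.

```python
import socket
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str
    elapsed_ms: float


def tcp_check(host: str, port: int, timeout: float = 2.0) -> Callable[[], str]:
    """Build a check that passes if a TCP connection to host:port can be opened."""
    def check() -> str:
        with socket.create_connection((host, port), timeout=timeout):
            return f"connected to {host}:{port}"
    return check


def run_self_test(checks: Dict[str, Callable[[], str]]) -> List[CheckResult]:
    """Run every environmental check, recording failures instead of raising."""
    results = []
    for name, check in checks.items():
        start = time.perf_counter()
        try:
            detail, ok = check(), True
        except Exception as exc:
            # Keep going: one dead dependency shouldn't hide the state of the others.
            detail, ok = f"{type(exc).__name__}: {exc}", False
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append(CheckResult(name, ok, detail, elapsed_ms))
    return results
```

A service might expose something like run_self_test({"database": tcp_check("db.example.internal", 1433)}) behind a status page. New checks get added to the dictionary as you learn what can go wrong.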
The first point, detection, is trickier. For monitoring to be effective, someone has to be watching the monitor. There are all kinds of fancy, expensive tools out there, but we ended up writing our own, because generalized tools don't have the specific knowledge of your application. It might be nice to know that one of the processors on your SMP server is pegged, but it's better to know which process is pegging it, and better still if you can take a dump of the process so you can do a post mortem and find out why. And there are subtler issues: I've ended up devising an application-level traceroute, where one request goes as far as the web server, a second request gets into ASP.NET, a third goes to the database, and so on. This has come in handy in diagnosing slowdowns: if I can see that our web server responds quickly to requests that it can handle itself, but bogs down accessing the application server, that gives me a better idea of where to look.
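The application-level traceroute idea can be sketched roughly like this. It's an illustration in Python, not the tool described above; trace_tiers, suspect_tier, and the tier names are invented here. Each probe is a callable that reaches one tier deeper than the last, and since a deeper probe also passes through every shallower tier, the extra time over the previous hop is a rough estimate of what that tier adds.

```python
import time
from typing import Callable, List, Optional, Tuple

Probe = Tuple[str, Callable[[], None]]
Hop = Tuple[str, Optional[float]]


def trace_tiers(probes: List[Probe],
                clock: Callable[[], float] = time.perf_counter) -> List[Hop]:
    """Run probes shallowest tier first; stop at the first tier that fails.

    Returns (tier, seconds) per hop, or (tier, None) for a failed hop.
    """
    hops: List[Hop] = []
    for tier, probe in probes:
        start = clock()
        try:
            probe()
        except Exception:
            hops.append((tier, None))
            break  # deeper tiers can't be reached through a broken one
        hops.append((tier, clock() - start))
    return hops


def suspect_tier(hops: List[Hop]) -> Optional[Hop]:
    """Pick the failed tier, or else the tier that added the most latency."""
    previous = 0.0
    worst: Optional[Hop] = None
    for tier, elapsed in hops:
        if elapsed is None:
            return (tier, None)
        added = elapsed - previous
        if worst is None or added > worst[1]:
            worst = (tier, added)
        previous = elapsed
    return worst
```

In practice each probe would be a small HTTP request, say a static file for the web server, a trivial page for ASP.NET, and a page that runs a one-row query for the database. Those endpoints are assumptions you'd have to build into the application yourself.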
Rick also points out that the people who wrote the application (hopefully) won't be the ones monitoring it. I first heard this from Steve Loughran, and it's worth repeating: make your error messages meaningful and actionable to the person who'll be reading them. Actionable means that monitors should be quiet when nothing's going on; you want the squeaky wheel to get the grease, not the boy who cried wolf. In other words, a noisy monitor gets shut off or ignored. Actionable also means that error messages should come with an action; for example, "if process X is using 100% CPU, restart it". Meaningful means the message makes sense to whoever is on duty. Steve put it this way: if someone in the NOC sees "java.net.NoRouteToHostException", they quit reading at "java" and page a developer. If they see "Network connection to database server X is down", they page a network admin.
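One way to get there is to translate low-level exceptions into operator-facing alerts before they reach the monitor. The sketch below is a hypothetical illustration in Python: the Alert type, the RULES table, and the suggested actions and teams are all invented, and a real table would come from your own operations runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    message: str  # what is wrong, in the operator's terms
    action: str   # what to do about it
    page: str     # who should be paged


# (exception types, message template, action template, who to page); first match wins.
RULES = [
    ((ConnectionError, TimeoutError),
     "Network connection to {target} is down",
     "Check the network path to {target} and confirm it is accepting connections",
     "network admin"),
    ((PermissionError,),
     "Access to {target} was denied",
     "Verify the service account's credentials and permissions on {target}",
     "system admin"),
]


def describe_failure(exc: BaseException, target: str) -> Alert:
    """Turn a raw exception into a meaningful, actionable alert."""
    for exc_types, message, action, page in RULES:
        if isinstance(exc, exc_types):
            return Alert(message.format(target=target),
                         action.format(target=target), page)
    # Anything unrecognized really is a developer's problem.
    return Alert(f"Unexpected {type(exc).__name__} while using {target}",
                 "Capture the log and stack trace for the development team",
                 "developer")
```

With this in place, a ConnectionRefusedError against "database server X" reaches the NOC as "Network connection to database server X is down" with the network admin as the person to page, and only the unrecognized failures land on a developer.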
All this will get you only so far, however. As James Robertson points out, we're rewriting a lot of systems these days, and much as I love to make fun of mainframes, IBM figured out a while back that the last few nines come from things like being able to update the OS without restarting it. Your application relies on a lot of underlying plumbing, so it'll be only as reliable as the OS, the runtime, the database, etc.
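A back-of-the-envelope calculation shows why the plumbing matters. The sketch below assumes, as a simplification, that the application needs every layer to be up and that layers fail independently, so the availabilities multiply. The layer names and the 0.999 figures are illustrative, not measurements.

```python
from math import prod
from typing import Dict

HOURS_PER_YEAR = 365 * 24


def serial_availability(layers: Dict[str, float]) -> float:
    """Availability of a stack that needs every layer up (independence assumed)."""
    return prod(layers.values())


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative numbers only: five layers at "three nines" each.
stack = {"os": 0.999, "runtime": 0.999, "database": 0.999,
         "network": 0.999, "application": 0.999}
combined = serial_availability(stack)
```

Under those assumptions, five layers at three nines each combine to roughly 0.995, or about 44 hours of downtime a year, compared with under 9 hours for any single layer on its own.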