January 13, 2012

[If you noticed the site was down for a bit this morning, here's why...]

It’s true, you really can have too much of a good thing sometimes… At about 7am this morning the alarms went off on our web server. CPU load was at 100% and there was zero network traffic. In other words, the server soiled itself and ran away to sulk. Looking at our logs, it’s clear that the heavy strain of traffic just beforehand had done the deed. (Mind you, most of that traffic was search engines, which tend to be the worst kind, hitting every page on your entire site all at once.)

So, I did what any good programmer still bleary-eyed from being up most of the night would do… restarted the server. When the server comes back up, you check the logs to see what caused the problem and fix it. ONLY… the server didn’t come back up! Worse, the start command was grayed out. (Our hosting provider gives us a clever little web UI.) What’s going on? A few quick Google searches revealed the problem…

You can never have too much User Experience (UX) design

You see, in my foggy morning brain I didn’t notice the difference between “Terminate” and “Reboot” right next to each other on the same menu. One of those commands is ordinary business when running a server. The other annihilates it with extreme prejudice and no chance of any recovery. It’s kind of like the difference between starting your car and turning the key to set off a car bomb! In one single click, with no confirmation of any kind, our entire web server, with months of work and 2 weeks’ worth of upcoming blog posts, was gone*. POOF!

[* Of course we have good backups. That we were back up in about 2 hours is not the point.]

Our provider has apparently been made aware of this problem and added protection measures. There’s now a “Terminate Protection” setting you can turn on somewhere else in the system (it didn’t exist when we created our server) that prevents you from terminating a server. Now, instead of burdening users with extra steps to gain protection, wouldn’t it have been simpler to just move “Terminate” to another place and perhaps rename it? Like… I dunno… say maybe, “Delete?” “Destroy?” “Detonate?”
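For the curious, here is roughly what flipping that setting looks like from a script. This is a minimal sketch, assuming the provider is AWS EC2 (as the comments on this post suggest) and that `ec2_client` is a boto3 EC2 client; the function name and the instance ID in the usage note are made up for illustration.

```python
def set_terminate_protection(ec2_client, instance_id, enabled=True):
    """Turn EC2 termination protection on or off for one instance.

    Assumption: ec2_client is a boto3 EC2 client, for example
    boto3.client("ec2"). While protection is enabled, a terminate
    request for this instance is rejected until the flag is cleared.
    """
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": enabled},
    )
    return enabled
```

You would call it as `set_terminate_protection(boto3.client("ec2"), "i-0123456789abcdef0")`, with your own instance ID in place of the placeholder.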

So the lessons here are:

  1. Don’t let programmers near the server racks before lunch.   ;-)
  2. Make sure your backups are current; you never know when you’re going to need them…
  3. Good UX is always important, even in “geek” systems.
  4. You can never spend enough time thinking about what the user needs or what the problems really are.  I’m sure it took our provider many hours to create that whole “protection” system.  It would have taken far less to rename the “Terminate” command or, even better, move it to another place and add an explanatory confirmation dialog.
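To make lesson 4 concrete, here is a minimal sketch of the kind of explanatory confirmation I have in mind. It is not how our provider's UI works; the function names and wording are hypothetical. The idea is simply that a destructive action spells out its consequences and only proceeds when the user types the server's name.

```python
def warning_message(action, server_name):
    """Build the text a confirmation dialog would show (hypothetical wording)."""
    return (
        f"{action} will permanently destroy '{server_name}' and all of its data. "
        "This cannot be undone. Type the server name to continue."
    )


def confirm_destructive(server_name, typed_text):
    """Return True only if the user typed the exact server name.

    A reflexive click or a sleepy "yes" is not enough to get through.
    """
    return typed_text.strip() == server_name
```

With something like this in place, `confirm_destructive("web-01", "yes")` is refused, and the bleary-eyed programmer gets one more chance to read what “Terminate” actually does.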

Putting our money where my mouth is: We’re actually flying our designer into Atlanta from the Netherlands this weekend and locking ourselves away in a suite to pound through every last little detail of Photosmith’s UI, its interactions, how it gets used, and anything else we can think of to make sure we get it right.  It was missing a little “detail” like the one mentioned above that caused us to drop batch-tagging from v1.0.

…we now return you to your regularly scheduled “Meet 2.0” program…

Posted in: Musings | 4 Comments

4 Responses to “Too much of a good thing”

  1. Kevin says:

    This is why you keep your EC2 data on an EBS volume. You should use this opportunity to correct your EC2 setup so terminating an instance doesn’t require backups to fix and just means you need to relaunch.

    • Chris Morse says:

      Lesson already learned and implemented! :-) We’d never used AWS before setting up the server the first time. Some things there are, um… different.

      • Kevin says:

        Ah nice work. Yeah it’s a totally different process but it’s well worth the time spent learning it, at least in my opinion :)

  2. Robbie says:

    Man this made me laugh – Just woke the wife up when I burst out laughing!! You just can’t imagine how stupid some developers are – Or perhaps it was a sadistic……………