How Not to Handle Downtime

Posted Sun Dec 30 @ 10:11:03 AM PDT 2012

My datacenter had a "short outage" on Friday. From about 8:45am to about 9:00pm, my websites were completely inaccessible. I looked through my Apache logs and confirmed that not a single request made it to my server during that time.
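For anyone who wants to run the same kind of check, here is a minimal sketch in Python. It assumes the access log uses the standard Common/Combined Log Format timestamp (for example `[28/Dec/2012:08:44:57 -0800]`). The function names and the 30-minute threshold are made up for illustration; this is not the exact procedure used above.

```python
import re
from datetime import datetime, timedelta

# Matches the bracketed timestamp in Common/Combined Log Format,
# e.g. [28/Dec/2012:08:44:57 -0800]
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def parse_times(lines):
    """Yield the request time of every log line that has a parsable timestamp."""
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            yield datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def find_gaps(lines, min_gap=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs where no request arrived for min_gap or longer.

    Assumes the lines are in chronological order, as in a normal access log.
    The 30-minute default is an arbitrary illustrative threshold.
    """
    gaps = []
    previous = None
    for current in parse_times(lines):
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps
```

To use it, pass an open log file, e.g. `find_gaps(open("access.log"))` (the path is a placeholder). Each returned pair marks the last request before a quiet period and the first request after it, so an all-day outage shows up as one long gap.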

Needless to say, I was not a happy camper about it. But I only pay $70/month for the privilege of using their network, HVAC and power, so I can't expect the world from them.

What made me really angry was the way they communicated during the outage, or rather, the way they didn't communicate during the outage. When I noticed the outage, I emailed them. I checked their website for news and didn't see anything. Their website was down on and off throughout the day (unlike mine, which didn't see the light of day for 12 hours). When they didn't respond via email, I attempted to call them. Busy signal. Awesome. I called several dozen times during the day (literally), and each time I either got a busy signal, got cut off, or got stuck in a queue that didn't move ("you are caller number 42").

At 7:36pm I finally got a generic email from them with a subject of "WORLDLINK Intermittent Network Issues". Yes, they consider 10+ hours of downtime (up to that point) an "intermittent" issue. Nothing pisses off your customers more than understating the significance of the problem they are facing. In a later email that evening, they said, "We were unable to communicate as the networking issue affected both our phone system and email."

At around 8pm, I finally got through on the phone and talked to someone. I asked for an estimate of when service would be restored, and I didn't get a firm answer. I asked, "Will network connectivity be restored before tomorrow morning?", and I got a cautious affirmative. Fortunately, an hour after this call, network connectivity to my server was restored.

How a Datacenter Should Handle "Issues"

  1. Immediately post on a third-party service (like Twitter) that you know about the issue
    It tells the customer two important things: you know about the issue, and there is nothing wrong with the customer's equipment. Use a third-party service so that even if your phones and email go down, you can still get a message out.
  2. Give an ETA
    The customer wants to know the scale of the issue and when it will be fixed. The frustrating thing about this downtime was that I had no idea when service would be restored. I have a backup web server and database slave in another datacenter that are ready to go. All I have to do is change my DNS. But I wasn't sure whether it was worth the trouble if service was going to be restored in an hour. Had I known it was going to be an all-day thing, I would have switched over to the backup machine immediately.
  3. Post updates
    It's frustrating not knowing what is going on during the outage. It helps to know if the problem has been identified, whether some service has been restored, and if there is an updated ETA.
  4. Notify us when the issue is supposedly resolved
    Consider this scenario: Service has been restored; the datacenter has not notified the customer about the restoration; the customer's system is still inoperable (i.e. the datacenter issue had a side effect on the customer's system). How does the customer know the problem is with their own equipment, and not still with the datacenter?
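The failover dilemma from point 2 can be written down as a small decision rule. This is only a sketch under stated assumptions: the function name, its parameters, and the one-hour tolerance are all hypothetical, and it only decides whether switching looks worthwhile. It does not touch DNS.

```python
from datetime import datetime, timedelta


def should_fail_over(outage_start, now, eta=None, tolerable=timedelta(hours=1)):
    """Return True when switching DNS to the backup site looks worthwhile.

    outage_start: when the outage was first noticed
    now:          the current time
    eta:          the provider's estimated restoration time, or None if none was given
    tolerable:    how much further downtime is acceptable (one hour is an assumption)
    """
    if eta is not None:
        # With an estimate, compare the remaining expected downtime to the tolerance.
        return eta - now > tolerable
    # Without an estimate, all we know is how long the outage has already lasted.
    return now - outage_start > tolerable
```

With an ETA of 9:00pm the rule says to switch right away; with no ETA it can only say so after an hour has already been lost, which is exactly the cost of the datacenter staying silent.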