
HPMD Bullets

BULLET #57
DISASTER RECOVERY EXAMPLE

The attached story on an outage at America Online Inc.'s Vienna, Virginia data center appeared recently on the Dow Jones News Wire 1. It illustrates a number of points we have made about disaster recovery. Here are some observations to chew on:

1. The most unexpected event can become a disaster. Thunderstorms may be common in a Virginia summer, but we bet the flooded drain pipe was a surprise.

2. Any item touching the data center is a potential single point of failure. Exposure to natural hazards, such as flooding, should be a major red flag. That is why SIAC, the NYSE data center, built its data center and support systems floors at the top of its MetroTech facility.

3. A four-hour outage impacting up to 800,000 subscribers is totally unacceptable. An 18-hour outage for a highly visible application (e-mail) is unfathomable. How would you like to have been the manager in charge of their operations?

4. The duration of the outage raises the question of how current their disaster recovery plan was, and when it was last rehearsed.

5. In the customer's eyes, how an outage is handled is as important as the outage itself. Prompt and frequent communication is essential. Notice the inadequacy of the on-line message.

6. Also notice how uninformed management was. It appears they were in the dark until the system was actually coming back on-line. Some estimate of recovery time should be possible, especially with a tried and tested recovery plan.

7. The decision to abandon the switch to the backup system is astonishing. What good is a backup if it takes too long to employ?

8. The question about insufficient backup capacity adds insult to injury. It also raises the question of why no reasonable triage plan for users and applications was available, one that could make effective use of a smaller backup system.

9. The fact that the company does not plan to change the way it handles similar emergencies means that it has not gotten the message. If "perception of the intangibles is really everything" (Tom Peters 2), how does America Online's quality rate in its customers' eyes?

10. It bears repeating: an ounce of prevention is worth a ton of cure.
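
The triage plan mentioned in observation 8 can be sketched in a few lines. This is only an illustration under stated assumptions: the service names, priorities, load figures, and backup capacity below are hypothetical and are not taken from the America Online story. The idea is simply to rank applications ahead of time and admit them to the smaller backup system in priority order until its capacity is used up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    priority: int    # 1 = restore first (hypothetical ranking)
    load_units: int  # capacity needed on the backup (hypothetical units)


def triage(services, backup_capacity):
    """Admit services in priority order while they still fit on the backup.

    Returns two lists of names: services to restore and services to defer.
    """
    restored, deferred = [], []
    remaining = backup_capacity
    for svc in sorted(services, key=lambda s: s.priority):
        if svc.load_units <= remaining:
            restored.append(svc.name)
            remaining -= svc.load_units
        else:
            deferred.append(svc.name)
    return restored, deferred


# Hypothetical figures, chosen only to show a backup smaller than the total load.
services = [
    Service("e-mail", 1, 40),
    Service("status notices", 2, 5),
    Service("news feeds", 3, 30),
    Service("bulletin boards", 4, 35),
]

restored, deferred = triage(services, backup_capacity=60)
print("Restore on backup:", restored)
print("Defer until primary returns:", deferred)
```

The value of such a list is that it is agreed and rehearsed before the outage, so that nobody has to decide during the emergency which applications and users the smaller system will carry.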
© Copyright 1994, HP Management Decisions Ltd., All Rights Reserved.
________________________

1 Dave Pettit, Dow Jones News/Retrieval, June 28, 1994.
2 Tom Peters, Thriving on Chaos, (New York: Alfred A. Knopf, 1987), p. 98.

AMER 6/28 (DJ) Heavy Rain Floods Amer Online Computers; Outage Raises Ire

By Dave Pettit, Dow Jones Staff Reporter

NEW YORK -DJ- Electronic mail is making inroads against the mailman, but it still can't answer to his creed.

Snow isn't a problem, nor is the gloom of night, but rain was a different matter, at least for those who rely on America Online Inc.'s (AMER) e-mail services.

Rain from severe thunderstorms flooded a drain pipe and poured into the Vienna, Va., company's computer facilities on Friday, shutting the entire on-line service for four hours and knocking out e-mail for 18 hours.

The disruptions renewed grousing among some of the service's roughly 800,000 subscribers, who had faced unrelated snags early this year. The company was forced then to ration access to its systems as demand from a wave of new users overwhelmed its capacity.

America Online, which provides news and other information services in addition to e-mail, has added to its computer systems since then, and a spokeswoman said this weekend's difficulties weren't related to capacity constraints.

But that is little comfort to users who quickly turned their e-mail woes into a topic of discussion on the company's electronic bulletin boards Friday night and Saturday morning. They were critical of the outage and of its handling by the company.

Some were peeved America Online didn't go further to explain the problem and offer an estimate on when e-mail services would resume. An on-line message simply said there was an emergency that was being addressed. The same was true for a terse recorded phone message.

''Our full efforts were put toward bringing the system back up,'' said Jean N. Villanueva, a company spokeswoman. ''At no time did we have an accurate estimate of when the system would be brought back up until we were within minutes'' of a resumption, she said.

Villanueva said the company received several hundred inquiries from subscribers about the problems and said there likely were a small number of cancellations. She didn't know the specific number. America Online doesn't plan to change the way it deals with such emergencies, she said.

The faulty drainpipe, meanwhile, has been capped, she said.

Paul Sweeney, an analyst at Wheat First/Butcher & Singer, said he doesn't view the flooding problem as indicative of wider problems with America Online's operations and said he is satisfied with the additional capacity the company has acquired. But he questioned why a backup computer system wasn't available Friday.

Villanueva said America Online opted against switching to its on-site backup because it determined that it would take longer to bring that system online than it would to restore the flooded computers.

None of the computers were damaged by the flood. The rainwater never reached the elevated equipment but was high enough to require a shutdown of electric power.

Villanueva couldn't estimate how much time is needed to make the backup system operational. The backup system has less capacity than the primary system, although she wouldn't quantify how much less.
© Copyright 1996, 2019, HP Management Decisions Ltd., All Rights Reserved.