BULLET # 49
DISASTER RECOVERY LESSONS
Here is a compilation of observations we have made about disaster recovery operations. Most of these were learned at the school of hard knocks. They are worth pondering.
1. Do not get creative and improvise during recovery. Trying something new usually fails.
2. Run the recovery like a military operation with battle-station positions and defined actions. When under fire, the roles and actions must be automatic.
3. The key roles are the Field General, the Diagnostician, and the Communicator. Choose these people carefully.
4. The Field General who runs the recovery process, and the Diagnostician who troubleshoots, are two separate full-time functions. They need to be separate people.
5. The recovery operators and the Communicator are two separate functions. Do not mix them.
6. Keep anyone who does not have an assigned battle station out of the data center.
7. Do not let individual customer troubleshooting and recovery interfere with those who are recovering systems. Separate the two recovery functions.
8. Assigned roles must each have a script and they need to be rehearsed regularly. The more practice, the better.
9. Recover the top ten customers first and fast.
10. Keep top customers informed frequently.
11. Keep management informed frequently.
12. Restoring the system is as important as recovery. In other words, after switching to backups, restoring the primary system correctly must follow. Failing to do this right prolongs the perception of unreliability.
13. Many people performing many tasks in a crisis usually means much confusion.
14. Recovery has priority over repair. Unless the fix can be done fast, execute the recovery first and repair the source of the outage later.
15. What constitutes a disaster should be clearly documented. The official disaster declaration that sets the recovery procedures in motion must be automatic.
16. The faster the recovery, the lower the pain.
17. An outage, no matter how small, needs to be quickly escalated.
18. Don't be lulled into non-action by a "small" outage. All outages demand recovery procedures once the defined time threshold is passed.
19. Get the system programmers and technical support desk involved early. Have them work with the Diagnostician, not the recovery operators.
20. During recovery, the Communicator should log and track all customer problems in a way that customer support reps can get quick updates and operators can make quick updates, without interrupting recovery actions.
21. All recovery action must be focused, automatic and followed to the letter. Thinking sinks the ship.
22. Supervisors, Managers and Senior Managers should first find out about problems from operators, not customers.
23. When using backup systems, do not try to recover more users than you have extra system capacity to handle; it will only degrade the remaining users' service and risk losing more user connections.
24. Do not use untested, still-under-development recovery systems. They will likely have bugs and be too difficult to diagnose quickly.
25. It's not recovered until the customer says it's recovered.
26. Automate as many of these tasks as possible, whenever and wherever you can. In a crisis, it's too easy to screw it up.
27. An ounce of prevention is worth a ton of cure.
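As an illustration of lesson 26, here is one way a few of the rules above might be automated. It is a minimal sketch, not a prescribed tool. It covers the automatic disaster declaration once a defined time threshold is passed (lessons 15 and 18) and the recovery of top customers first without exceeding spare backup capacity (lessons 9 and 23). The names Customer, disaster_declared, and plan_recovery, the priority ranking, and the per-customer user counts are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    priority: int  # hypothetical ranking: 1 = most important customer
    users: int     # hypothetical count of connections needed on the backup


def disaster_declared(outage_minutes: float, threshold_minutes: float) -> bool:
    """Lessons 15 and 18: declare automatically at or past the threshold.

    No judgment call is made about how "small" the outage looks.
    """
    return outage_minutes >= threshold_minutes


def plan_recovery(customers, spare_capacity):
    """Lessons 9 and 23: top customers first, never beyond spare capacity.

    Returns (recovered, deferred) lists of customer names.
    """
    recovered, deferred = [], []
    remaining = spare_capacity
    # Walk customers from highest priority (lowest number) downward.
    for customer in sorted(customers, key=lambda c: c.priority):
        if customer.users <= remaining:
            recovered.append(customer.name)
            remaining -= customer.users
        else:
            # Moving this customer would overload the backup system.
            deferred.append(customer.name)
    return recovered, deferred


if __name__ == "__main__":
    customers = [
        Customer("A", priority=1, users=50),
        Customer("B", priority=2, users=70),
        Customer("C", priority=3, users=30),
    ]
    if disaster_declared(outage_minutes=20, threshold_minutes=15):
        print(plan_recovery(customers, spare_capacity=100))
```

In this sketch a smaller, lower-priority customer may be recovered ahead of a larger one that does not fit in the remaining capacity. Whether that is acceptable, or whether recovery should stop at the first customer that does not fit, is a policy decision to settle in the documented procedure before any outage.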
© Copyright 1994, HP Management Decisions Ltd., All Rights Reserved.
© Copyright 1996, 2019, HP Management Decisions Ltd., All Rights Reserved.