I'm sorry to hear about the major outage that occurred in two of your top systems. I know what that's like.
When I took over an IT operations role for a Wall Street data provider, we were acutely aware of the pain any outage caused our customers. They often had millions of dollars on the line, and our data was a lifeline for their work. When I first came on board, outages were too frequent. The operations and engineering teams felt that each crash was a disaster, and assigning blame was the first reaction once systems were back online. Fear ruled. If we were going to improve quality, that had to change.
So I began asking three things whenever a problem occurred:
- Exactly what happened, and what were the root causes?
- What were we going to change to avoid it happening again?
- How were we going to be faster about recovery if it did happen again?
I made it very clear that if we could answer these three questions, we didn't have a problem; we had a learning experience. People learned that this was the expected result, and they worked hard to deliver. The teams shifted from blame to problem solving, and quality ratcheted up each week.
So ask a few questions like these, and allow your team to find the answers.