Lately it feels like there's no getting away from the terrible news about the oil spill in the Gulf of Mexico. The extent of the damage is mind-boggling: from the devastating impact on the wildlife and shoreline to the loss of livelihoods for locals. The oil spill is the result of two core issues: the failure to put in place adequate controls for prevention of catastrophic events, and an incredibly poor to non-existent disaster response and recovery (DR) plan. As most of us watch in helpless horror at the ongoing spill, it's easy to scoff at BP for their shoddy practices and inadequate response.
But how many of us in IT have been able to, or even wanted to, engage in highly rigorous prevention and real-world testing of DR? Very often, security controls are skimped on for cost purposes or because the business does not feel the time or overhead related to a control equals the risk to the business. And DR is, unfortunately, an afterthought for many organizations; something that may be written as a skeleton process, but never executed in a testing environment to ensure that the response plan will truly work as expected.
Take something as simple as back-ups. We all know we need to do them, and most of us have automated back-up systems in place. But to make sure they're working as expected, the back-ups should be reviewed on a schedule. Here's where many organizations fall down: the back-ups are run, the back-up system returns an "all good" result, and the admin goes on to the next task. Routine checks of the integrity and completeness of the back-ups may be run during the initial months of operation, but after the system has been running successfully for a few months, it's easy to skip a day, or a week, of validation testing, especially when there are other IT fires to put out.
This approach works well if the back-ups are actually running correctly and/or if recovery from a back-up is never required. The worst-case scenario, though, is that the back-up has malfunctioned, yet the admin is unaware until it's actually needed: for example, if someone asks for the data as part of eDiscovery during a legal case, or if there's a disaster. Have you ever gone to restore a nicely batched and dated back-up only to discover the data isn't there? I have. And it's not pretty. Luckily for me, my failed back-up was discovered during an overdue validation test, and I was able to fix the problem and complete subsequent back-ups with no catastrophic loss of data.
No such luck for BP, and no such luck for many admins when an incident occurs without proper business continuity (BC) and DR processes and procedures in place. So what can this teach us? Here are my top three lessons.
1. Get the Consequence Cost Right
BC and DR are important. Not in an "it's accepted best practice" or "this regulation requires that I do it" kind of way, but in an "if the potential damage and consequences truly matter, then the BC/DR plan has to be resilient and well-tested, even if it's expensive and time-consuming" kind of way. Many of us don't deal with life-and-death consequences, but catastrophic loss to a business could mean the loss of that business.
BP seems to have tried to skirt responsibility for not implementing better controls or having better DR in place by downplaying the potential consequences. The financial and environmental impact of the disaster will not be known for months, or possibly even years, but one thing is clear: BP appears to have failed to assess the impact at many steps along the way. Initial reports were that the spill was small and could be contained, but these reports had to be revised when it became clear that containment and shut-off were not possible, at least in the near term. Some estimates are that BP has already spent $1.25B on the spill. Would the $500,000 for an acoustic-trigger remote shut-off valve have been justified if they'd accurately assessed the consequence cost?
On the other hand, there's absolutely no sense in spending more to protect less. It wouldn't be a financially sound decision to spend $400/month for more than three months to insure replacement costs on a car valued at $1,200, just as it wouldn't make sense for the IT department at a $4 million company to invest $5 million in an anti-virus software solution.
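The arithmetic behind those comparisons can be sketched in a few lines of Python. This is only an illustration, not a risk model: the function names are made up for this post, the dollar figures are the ones quoted above, and treating the whole $4 million company as the most that could be lost is my own simplifying assumption. A fuller assessment would also weigh how likely the loss is.

```python
def months_until_premiums_exceed_value(monthly_premium, asset_value):
    """Return the first month in which total premiums paid exceed the asset's value."""
    if monthly_premium <= 0:
        raise ValueError("monthly_premium must be positive")
    months = 0
    paid = 0
    while paid <= asset_value:
        months += 1
        paid += monthly_premium
    return months


def control_is_justified(control_cost, consequence_cost):
    """A control is worth buying only if it costs less than what it protects against."""
    return control_cost < consequence_cost


# The car: $400/month against a $1,200 replacement value.
car_break_even = months_until_premiums_exceed_value(400, 1_200)

# The valve: $500,000 against the $1.25B reportedly spent so far.
valve_ok = control_is_justified(500_000, 1_250_000_000)

# The anti-virus purchase: $5 million against a company worth $4 million
# (assumption: the worst case is losing the entire company).
antivirus_ok = control_is_justified(5_000_000, 4_000_000)
```

With these figures, the premiums equal the car's value at month three and pass it in month four, the valve comes out justified, and the anti-virus purchase does not.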
2. Go for value, not volume
The BP response plan weighs in at a hefty 583 pages, has a 117-page quick guide, and is chock full of handy flow charts and accountability tables. Having worked on detailed policy and response plans in the past, I respect the attempt at thoroughness shown in this document.
But there are a couple of areas where this falls down. First of all, a flow chart that flows to nowhere isn't very useful. Consider this one that walks a responder through a dispersant use decision tree. It looks good, but the end state for at-risk shoreline habitats is "a case-by-case approval." Did we need a flow chart for that?
One of the striking aspects of the response plan is that the spill or slick appears to be treated as something slow-moving and containable. The gusher of oil in the Gulf is anything but. I'm not a Minerals Management guru, so I'm not sure how a spill normally occurs, but I do have extensive experience with network meltdowns and attacks. And when they happen, the damage moves fast and the responders need to be able to keep up.
While there's no denying that a BC/DR plan needs to have documented processes, it also needs to impart key decision points quickly and precisely so responders can act accordingly. Is the corporate policy to shut down the e-mail server if a virus is being re-transmitted, or to isolate the offending e-mail source (or sources) and stop them? Or is this an incident that hasn't been anticipated, and for which there are therefore no procedures? Unfortunately, despite best efforts, there will be unprecedented events in IT, and some decisions may have to be made on the fly.
Which leads us to the final lesson:
3. Practice makes (almost) perfect
The heroes on the supply boat that was attached to the rig at the time acted admirably: they got the boat to a safe distance and then sent crews back to evacuate the survivors, who had followed protocol and gotten into lifeboats. The Coast Guard deployed helicopters, cutters, and a plane to take survivors back to shore as quickly as possible. Had these teams practiced evacuating an oil rig that had turned into a sinking fireball? Not likely. In fact, one survivor reported that, "It was chaos. . . Nothing went as planned, like it was supposed to." But the crew members had taken part in weekly evacuation drills, and the evacuation effort succeeded in saving the lives of 115 of the 126 people on board at the time of the explosion.
What hasn't succeeded is the ongoing response effort, as evidenced by the inability to stanch the flow of oil. A variety of oddly named responses (including "top hat" and "top kill") were tried and subsequently failed. A successful cap was installed in early June, but experts believe this has done little to reduce the actual rate of the leak into the Gulf.
Practicing weekly evacuations led to a successful evacuation; failure to practice stemming a massive oil leak led to the ongoing torrent of oil into the Gulf more than six weeks after the initial explosion. This takes us full circle, back to the point about validating data back-ups. The validation process had two purposes: one was to confirm the data had been backed up correctly, and the other was to extract a piece of data from that back-up based on a set parameter. The second part was practice at finding data in the back-ups in case we needed them for business reasons. So when the time came, during a crisis, that data was needed, we had the confidence and experience to know we could find it.
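For readers who want to try a similar drill, here is a minimal Python sketch of that two-part validation. It is not the process or tooling I used at the time. It assumes, purely for illustration, that each back-up is a tar archive whose member names are paths relative to the backed-up directory, and the function names are hypothetical. It also reads whole files into memory, which is fine for a drill on small files but not for large ones.

```python
import hashlib
import tarfile
from pathlib import Path


def verify_backup(archive_path, source_dir):
    """Part one: confirm each source file is in the archive with identical content.

    Returns (missing, corrupted), two lists of paths relative to source_dir.
    """
    source_dir = Path(source_dir)
    missing, corrupted = [], []
    with tarfile.open(archive_path, "r:*") as archive:
        members = {m.name: m for m in archive.getmembers() if m.isfile()}
        for source_file in sorted(source_dir.rglob("*")):
            if not source_file.is_file():
                continue
            relative = source_file.relative_to(source_dir).as_posix()
            member = members.get(relative)
            if member is None:
                missing.append(relative)
                continue
            # Compare SHA-256 digests of the backed-up copy and the original.
            with archive.extractfile(member) as handle:
                backed_up = hashlib.sha256(handle.read()).hexdigest()
            original = hashlib.sha256(source_file.read_bytes()).hexdigest()
            if backed_up != original:
                corrupted.append(relative)
    return missing, corrupted


def retrieve_from_backup(archive_path, name_fragment):
    """Part two: practice pulling data out, selected by a set parameter.

    Returns {member name: contents} for files whose name contains name_fragment.
    """
    found = {}
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if member.isfile() and name_fragment in member.name:
                with archive.extractfile(member) as handle:
                    found[member.name] = handle.read()
    return found
```

Run on a schedule, the first function catches the "all good" back-up that silently isn't, and the second keeps the team in practice at finding a specific piece of data before anyone is asking for it under pressure.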
Catastrophes have a way of making people panic. Crisis causes an adrenaline surge, and it's hard to think clearly. When we have practiced doing something, it's much easier to execute those steps even in crisis mode. The middle of a disaster is the worst time to pick up new skills.
Since there's no way to anticipate everything a top IT response team will need to do in a crisis, at least take time to practice what can be practiced. Some tasks, such as shutting down zones or sub-zones, tracking outbreaks and attack vectors, or transferring operations over to another data center, can be practiced in a fire-drill manner without interrupting core business operations.
Then, when a serious attack on the network occurs, the team won't be scrambling to implement basic response activities. Sure, they may need to improvise a little, but let's work on being more like the folks who evacuated the rig, even though they'd never practiced evacuating from a fireball rig, and less like the engineers who are still struggling to cap the spill effectively.