I present and blog a lot about high availability and disaster recovery solutions, and in doing so I get to talk to a lot of folks about different strategies. Depending on your business needs and regulatory requirements, these can vary greatly in cost and complexity. However, no matter your DR solution, it is imperative that you have a sound backup strategy and that you test those backups on a regular basis.
I recently took part in an architectural review of several important applications. The reason for the review was that customer teams were asking for a real-time DR solution for systems that were producing multiple terabytes of transaction volume daily. This is possible, but only at great cost, so in order to craft a better solution we started asking for details about the current strategy and about the business requirements around their data. In the process, it came out that multiple application teams had never tested a restore of their backups. Excuse my fonts here…
If you aren’t regularly testing your restores, save the drive space and don’t back anything up.
Ok, rant over, sort of. At a conference I had the pleasure of attending this spring, an Infrastructure Executive from Goldman Sachs presented on how they had zero downtime during Hurricane Sandy. Granted, it is Goldman Sachs, and their IT budget is some-huge-number billion dollars, but several things she said really stood out. In addition to the extensive risk analysis work that GS had done, they regularly tested failovers and restores in all systems (computer, power, cooling and generation). That’s by far the most important thing you can do from a DR perspective. Even if you can’t physically run a test (you really should), get all of the teams (Database, Sys Admin, Network, Application) into one room and work out all of the moving parts and what needs to happen in the event of a disaster.
Lastly, I know it’s hard to get resources to do test restores; enterprise storage isn’t cheap. However, there are many options you can leverage if you don’t have space in your primary location for testing:
- Amazon Web Services: Take advantage of the cloud. DR, in my opinion, is one of the best use cases for it. You can fire up a server, use it for testing, and then blow it away. You have to get your data there, and that can be painful for large data sets, but it’s not a bad solution.
- Real Hardware: I know enterprise storage is pricey, but a lab server with some relatively slow but dense storage isn’t. You can build a really effective restore testing environment for less than $10,000, and for well under that if you are willing to buy older, unsupported hardware on eBay.
- Offsite Recovery Services: There are a number of firms that provide offsite recovery services (for example, Sungard). However, this option tends to be extremely expensive, as they guarantee hardware for you in the event of a disaster, and as part of that guarantee you are granted testing time.
I can’t express my feelings on this strongly enough: particularly if you are working with large data sets, with lots of opportunities for failure, it is absolutely critical to test those systems. Otherwise, don’t plan on being able to restore when you need to.
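To make "test your restores" a bit more concrete, here is a minimal sketch of an automated file-level restore check in Python. It assumes the backup is a plain zip archive and that you recorded checksums of the source files when the backup was taken; the function names (`sha256_of`, `manifest`, `verify_restore`) are made up for this illustration. A real database restore would use your engine's own restore command and consistency check instead, so treat this as the shape of the idea rather than a finished tool.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(archive: Path, expected: dict[str, str]) -> list[str]:
    """Restore the archive into a scratch directory and return the
    relative paths that are missing or whose checksum differs."""
    with tempfile.TemporaryDirectory() as scratch:
        # Assumption: the backup is a zip file we can simply extract.
        with zipfile.ZipFile(archive) as backup:
            backup.extractall(scratch)
        restored = manifest(Path(scratch))
    return [
        name for name, digest in expected.items()
        if restored.get(name) != digest
    ]
```

The idea is to call `manifest()` on the source data at backup time, keep the result somewhere safe, and later run `verify_restore()` on a schedule. An empty list means every recorded file came back intact; anything else is a restore you would not want to discover during an actual disaster.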
My last company was lucky enough to have a secondary HQ-type branch about 60 miles away. Management decided we could afford up to a 47 hour data loss. We had a SAN unit that would replicate our disk drives to a local backup unit, and from there the data would be replicated every night to the offsite location.
We would “test” mounting the data every week to do our tape backups. We had our recovery time for the servers down to the 5-6 hour range to go back live.
The big thing everyone missed in the whole planning was that the IT department was about one deep in each position. When Ike rolled through, we had people without power for days and with family to take care of. After that we all had to cross-train on the DR steps and make sure each of us had documented our portion well enough that there was a checklist for it.