First Steps for Limiting Your Downtime
March 25, 2014
"The more you sweat in peace, the less you bleed in war" is as true of IT backup as it is of military operations. Downtime is the enemy, and forethought, preparation and practice are the keys to minimizing it.
Downtime is a complex subject and so is the process of limiting it. The checklist that follows hits the top layer, so to speak. Each of these items has many sub-items and each of those sub-items could easily generate its own checklist, or group of checklists.
Manage changes and patches effectively
Changes, upgrades and patches are among the most fertile sources of downtime. This includes both the planned downtime it takes to install the changes and the unplanned downtime when things go wrong.
Change and patch installation isn't an event -- it's a process, and like any process it works best when it is standardized, documented and controlled as much as possible. A good change management process covers when changes and patches should be applied, how they should be installed and how they should be tested. It also includes procedures for dealing with the problems that arise. For example, you need to know what to do if a patch produces problems -- do you simply roll back to the pre-patch state, or do you attempt to keep the patch and fix the problem?
Use instant-restore technologies, such as snapshots and Volume Shadow Copies
The ability to instantly restore files or to roll back a system to a recent last-known-good state is a powerful tool for minimizing downtime. While they can't replace true backups, such techniques can solve many problems, especially the most common ones.
Although 'downtime' has several meanings, the most practical meaning is the amount of time you are out of business because of an IT-related occurrence. Minimizing that kind of downtime is critical. Not everything needs to be restored at the same time or with the same urgency. Prioritize your business-critical applications and restore the most important first.
If you think of downtime as 'the time that part of the computer system is unavailable,' setting priorities may not improve overall downtime, but that's usually less important than business continuity.
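As a rough illustration of setting priorities, the sketch below orders a restore queue by business priority and shows when each application would be back. The `App` record, the priority scale and the time estimates are all hypothetical, and the sketch assumes applications are restored one after another.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int         # hypothetical scale: 1 = most business-critical
    restore_minutes: int  # your own estimate of how long the restore takes


def restore_plan(apps):
    """Return (name, minutes_until_back) pairs, most critical first.

    Assumes restores run sequentially, so each application waits for
    everything ahead of it in the queue.
    """
    plan = []
    elapsed = 0
    for app in sorted(apps, key=lambda a: a.priority):
        elapsed += app.restore_minutes
        plan.append((app.name, elapsed))
    return plan


if __name__ == "__main__":
    apps = [
        App("file archive", 3, 120),
        App("order entry", 1, 45),
        App("email", 2, 30),
    ]
    for name, minutes in restore_plan(apps):
        print(f"{name}: back after about {minutes} min")
```

The total restore time is the same whatever the order. What changes is how soon the applications that keep you in business come back.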
Set goals for downtime
Your organization should have clear, measurable downtime-related goals, such as how long it will take to restore each business-critical application under various conditions.
Setting these goals isn't just a matter for the IT department. They should be set by, and bought into by, the entire organization. This not only lets everyone know what to expect, it also makes it easier to invest in needed equipment and training to meet those goals.
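One way to keep such goals measurable is to write them down as numbers and compare them with what your restores actually achieve. The following is a minimal sketch, assuming goals and measured restore times are both recorded in minutes per application; the function name and data layout are illustrative only.

```python
def goals_missed(goals_minutes, measured_minutes):
    """Return {app: (goal, measured)} for every application that missed its goal.

    goals_minutes: agreed restore-time goal per application, in minutes
    measured_minutes: restore time actually observed (for example in a drill)
    An application with no measurement is reported with measured = None,
    because an untested goal cannot be counted as met.
    """
    missed = {}
    for app, goal in goals_minutes.items():
        measured = measured_minutes.get(app)
        if measured is None or measured > goal:
            missed[app] = (goal, measured)
    return missed


if __name__ == "__main__":
    goals = {"order entry": 60, "email": 240}  # hypothetical goals
    measured = {"order entry": 95}             # hypothetical drill result
    for app, (goal, actual) in goals_missed(goals, measured).items():
        print(f"{app}: goal {goal} min, measured {actual}")
```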
You can't think of every possible cause of downtime, but you can sure try.
The best way to limit your downtime is to catch problems before you go down. Log files are your friend. Monitor your system's performance constantly and compare current performance in critical areas to a baseline record. Pay special attention to trends. Often you can spot hardware or software problems early and fix them before they shut you down. You should have some form of automatic warning if critical parameters exceed pre-set levels or if an operation needs a large number of re-tries. Needless to say, those levels should be high enough to be significant and low enough to give you warning. Among the things to keep an eye on are performance-critical measures such as storage system throughput.
Where to set the alarm levels depends very much on the application and the nature of your installation. Vendors can usually offer you guidance on their hardware and software.
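The baseline comparison and trend watching described above might look something like the sketch below. It assumes a "higher is better" metric such as storage throughput. The 25 percent tolerance and three-reading trend window are placeholder values, not recommendations; the right levels depend on your installation.

```python
def check_metric(name, samples, baseline, tolerance=0.25, trend_window=3):
    """Return a list of warning strings for one "higher is better" metric.

    samples: recent readings, oldest first (e.g. throughput in MB/s)
    baseline: the value recorded when the system was known to be healthy
    tolerance: fractional drop below baseline that raises an alarm (assumed)
    trend_window: number of consecutive falling readings that counts as a trend
    """
    warnings = []
    if not samples:
        return warnings

    # Alarm level: the latest reading has dropped too far below the baseline.
    current = samples[-1]
    if current < baseline * (1 - tolerance):
        warnings.append(
            f"{name}: {current} is more than {tolerance:.0%} below baseline {baseline}"
        )

    # Trend: every reading in the window is lower than the one before it.
    recent = samples[-trend_window:]
    if len(recent) == trend_window and all(
        later < earlier for earlier, later in zip(recent, recent[1:])
    ):
        warnings.append(f"{name}: falling for {trend_window} consecutive readings")

    return warnings


if __name__ == "__main__":
    for warning in check_metric("storage throughput", [100, 90, 70], baseline=100):
        print(warning)
```

The trend check can fire before the alarm level is reached, which is the point: a steady decline is worth a look even while the numbers are still acceptable.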
Test and drill regularly
Planning is wonderful, but it's not execution. The hard fact is that a depressingly large number of emergency restores -- something like two-thirds by some estimates -- suffer significant problems or fail entirely. Even something as simple as a misplaced (or worse, mislabeled) tape can add hours to your downtime.
Human ingenuity being what it is, we can usually find workarounds. However, you end up working a lot harder for a lot longer and sweating a lot more than if you'd tested everything out beforehand.
The only way to make sure you can execute your plan is to test it constantly. At the very least, make sure your restoration procedures work by doing test restores and comparing the results with the original files. It's better to test the entire recovery procedure from beginning to end, and best to conduct regular recovery drills to make sure everything works and everyone involved is prepared.
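Comparing a test restore with the original files can be automated. The sketch below is one possible approach, assuming the originals and the restored copy are both reachable as directory trees; it compares SHA-256 checksums file by file. The function names are illustrative and not part of any backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_restore(original_dir, restored_dir):
    """Return (relative_path, problem) pairs for files that did not restore cleanly.

    problem is "missing" if the file is absent from the restored tree and
    "differs" if its contents do not match the original.
    """
    original_dir = Path(original_dir)
    restored_dir = Path(restored_dir)
    problems = []
    for src in sorted(original_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(original_dir)
        dst = restored_dir / rel
        if not dst.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif file_digest(src) != file_digest(dst):
            problems.append((rel.as_posix(), "differs"))
    return problems
```

An empty result means every original file was found in the restored tree with identical contents. Files that changed after the backup was taken will show up as "differs", so the comparison is most meaningful against data that has not been modified since the backup.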
When the system is down you should never have to guess and never have to experiment. Ideally, you should have all the information you need at your fingertips, including all the required procedures to get back up. This should all be filed and cross-indexed, and at least one copy should be stored somewhere other than on the computer it describes. Keep a copy of your current documentation offsite as well.
Among the items you need are the version numbers of all the software and firmware you are using, including patches; complete system configuration information; and a duplicate of your tape inventory detailing what is stored on which tapes. It's also a good idea to keep lists of where recovery-related procedures are found in the documentation and current lists of phone numbers for vendor contacts.
While much of minimizing downtime is simply a matter of proper procedures, some of it requires investing in the right hardware and software. Consider your recovery goals and look for bottlenecks caused by your present hardware and software. Then spend the money to eliminate those bottlenecks.
You will often trade money for protection or speed. RAID arrays with hot swapping and dual power supplies are more expensive but they can prevent a lot of downtime.
Sometimes architectural changes can reduce downtime. For example, disk-based backup is more expensive than tape, but a disk-based backup system or a disk-to-disk-to-tape system can enormously reduce downtime. The only way to know if the expense is worth it for your enterprise is to do your own analysis.
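A back-of-the-envelope version of that analysis is sketched below. It assumes downtime cost can be approximated as incidents per year times hours per incident times cost per hour, which is a simplification, and every figure in the example is hypothetical.

```python
def annual_downtime_cost(incidents_per_year, hours_per_incident, cost_per_hour):
    """Rough yearly cost of downtime under a simple linear model."""
    return incidents_per_year * hours_per_incident * cost_per_hour


def net_annual_benefit(current_hours, improved_hours, incidents_per_year,
                       cost_per_hour, extra_annual_cost):
    """Yearly downtime savings from faster recovery, minus the added yearly cost.

    A positive result suggests the investment pays for itself under these
    assumptions; a negative one suggests it does not.
    """
    saved = (
        annual_downtime_cost(incidents_per_year, current_hours, cost_per_hour)
        - annual_downtime_cost(incidents_per_year, improved_hours, cost_per_hour)
    )
    return saved - extra_annual_cost


if __name__ == "__main__":
    # Hypothetical: 3 incidents a year, recovery cut from 8 hours to 2,
    # downtime costing 1,000 per hour, new system costing 10,000 more per year.
    print(net_annual_benefit(8, 2, 3, 1000, 10000))
```

A real analysis would also weigh factors this model leaves out, such as lost customers or the different cost of an outage at peak times.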