IT Manager's Handbook: Getting Your New Job Done
Authors: Bill Holtsnider, Brian D. Jaffe
Tags: #Business & Economics, #Information Management, #Computers, #Information Technology, #Enterprise Applications, #General, #Databases, #Networking
This documentation will become an indispensable resource. Memories fail, particularly in a crisis. In fact, aside from your backup tapes, your plan may be the only resource available to you during a disaster. As such, it's in your best interest to make it as useful as possible by including as much information as possible in this document. This should include documentation about the existing environment.
All documentation should be reviewed and updated at least once a year to reflect changes to the environment, operations, personnel, procedures, etc. These review opportunities can be a great time to keep key people current on the plans and to educate new employees—don't assume that everyone will read the plans on their own.
Real Estate and IT Facilities
One of the first questions that must be considered in regard to disaster recovery planning is “Where should we go now?” If you're unable to use your organization's facilities, where will everyone go when disaster strikes? If your organization is very small, you might be able to get away with operating for a short time out of someone's residence. A slightly larger organization might be able to use a meeting or banquet room at a nearby hotel, assuming that the same disaster hasn't impacted those facilities. Other alternatives might include a nearby branch office of your organization or perhaps the office of a sister, subsidiary, or parent company.
However, if your IT organization is more than a few people in size, you're probably going to need specialized facilities with sufficient space, air conditioning, electricity, telecommunications resources, and so on. This may be the case even if you're only supporting a portion of your normal operation, even for just an interim period. Many companies offer disaster recovery facilities. They can generally tailor their offerings to your needs, perhaps just providing space or, at the other extreme, providing specified computer hardware, telecommunications, and perhaps even some staffing.
Of course, the ultimate in disaster recovery facilities is for an organization to maintain its own standby site with redundant hardware. For the most critical environments, the standby site is always live, connected to the network, with a mirrored copy of the database and applications, and so on.
Disaster Recovery Facilities Considerations
When looking at companies that provide disaster recovery facilities, you have to consider several issues:
• Proximity to your location: You generally want a nearby location in order to get to it easily, but not so close that the facility is likely to be hit by the same disaster that affects yours. You may need to consider a facility that is reachable by mass transit if you're in a large metropolitan area where not everyone has their own car.
• Costs: The more services and facilities you want to be ready for your needs, the more it will cost. Contracts for disaster recovery usually last at least two years and are billed monthly. The monthly bill is only part of the picture, however; there are several aspects to disaster recovery fees:
  • Standby fees: The monthly fees you pay to have contracted facilities available for your use
  • Activation fee: A fee you pay when you decide that you have a disaster that warrants use of the facilities
  • Use fee: The rate (weekly, monthly) that you pay while you're using the facilities during a disaster
  • Test fee: A fee that is paid when you want to make use of the facilities to test your disaster recovery plans
• Number of clients: You want to be sure that the provider you're working with hasn't contracted with more clients than it can provide services for. If there is a regional disaster, and all the provider's customers suddenly need to use the facilities, will there be enough to go around? Disaster recovery providers can either provide you with dedicated space that will always be there for you, or non-dedicated space that is made available to their subscribers on a first-come, first-served basis. Of course, the dedicated space is far more expensive.
• Other required services: Space, hardware, staff, telecommunications, air conditioning, electricity. Don't forget basics such as furniture, phones, and so on.
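To compare providers, it can help to add the four fee types up over the life of a contract, both with and without an actual disaster. The following is a minimal sketch of that arithmetic. The class name, function name, and every dollar figure are hypothetical and chosen only for illustration; a real contract may bill use fees monthly rather than weekly.

```python
from dataclasses import dataclass


@dataclass
class DrContractFees:
    """Hypothetical fee schedule; all figures are illustrative assumptions."""
    standby_per_month: float   # paid every month of the contract
    activation: float          # paid each time a disaster is declared
    use_per_week: float        # paid while the facility is actually in use
    test_per_event: float      # paid for each scheduled test


def estimate_contract_cost(fees, contract_months, tests, activations, weeks_in_use):
    """Total cost over the contract term for a given usage scenario."""
    standby = fees.standby_per_month * contract_months
    testing = fees.test_per_event * tests
    disaster = fees.activation * activations + fees.use_per_week * weeks_in_use
    return standby + testing + disaster


fees = DrContractFees(standby_per_month=2_000, activation=10_000,
                      use_per_week=5_000, test_per_event=3_000)

# Two-year contract with one test per year and no disaster.
baseline = estimate_contract_cost(fees, contract_months=24, tests=2,
                                  activations=0, weeks_in_use=0)

# Same contract, but with one declared disaster lasting three weeks.
with_disaster = estimate_contract_cost(fees, contract_months=24, tests=2,
                                       activations=1, weeks_in_use=3)

print(baseline, with_disaster)  # 54000 79000
```

Running both scenarios side by side shows how much of the total is fixed (standby and test fees) and how much depends on whether you ever activate the site.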
In the event of a disaster, one of the critical decision points is when to fail-over to the recovery site. In the event of a catastrophe such as an earthquake that destroys your primary facility, the decision is pretty easy. However, in the case of a blackout, it is reasonable to think that the power will be back on soon enough, and that the time, cost, and effort to bring up the disaster recovery facility (along with reverting back to the primary facility) would outweigh the cost of being down for a short period.
Because of this, many environments don't configure their hardware and software to automatically fail-over to the backup facilities if a problem is detected at the primary site. Often, the fail-over process is initiated only after a human has specifically decided to do so.
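The reasoning above can be sketched as a simple decision aid. This is only an illustration under stated assumptions: the field names are hypothetical, time in hours stands in for the full "time, cost, and effort" of switching sites, and the final decision is gated on explicit human approval rather than being automatic.

```python
from dataclasses import dataclass


@dataclass
class OutageAssessment:
    """Hypothetical inputs a recovery team might estimate during an incident."""
    primary_site_destroyed: bool   # e.g., an earthquake versus a blackout
    expected_outage_hours: float   # best guess at when the primary site returns
    failover_hours: float          # time to bring up the recovery site
    failback_hours: float          # time to revert to the primary site later


def recommend_failover(assessment: OutageAssessment) -> bool:
    """Recommend fail-over only when waiting costs more than switching."""
    if assessment.primary_site_destroyed:
        return True  # the easy case: there is nothing to wait for
    switching_hours = assessment.failover_hours + assessment.failback_hours
    return assessment.expected_outage_hours > switching_hours


def decide_failover(assessment: OutageAssessment, human_approved: bool) -> bool:
    """Fail-over proceeds only if it is recommended and a person approves it."""
    return recommend_failover(assessment) and human_approved


earthquake = OutageAssessment(True, expected_outage_hours=0,
                              failover_hours=8, failback_hours=8)
blackout = OutageAssessment(False, expected_outage_hours=2,
                            failover_hours=8, failback_hours=8)

print(decide_failover(earthquake, human_approved=True))  # True
print(decide_failover(blackout, human_approved=True))    # False
```

Keeping the recommendation separate from the approval mirrors the practice described above: tooling can advise, but a person initiates the fail-over.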
Off-Site Storage of Data
Backup Tapes
If you need to activate a disaster recovery plan, make sure that you can get your company's systems and data up and available. Most likely, you'll have to do some sort of restore from your backup tapes. If your regular facility is destroyed or inaccessible, you'll have to retrieve the backup tapes from your off-site storage vendor.
To get those tapes, you'll need several items:
• Contact information for your off-site location
• A method of identifying which set of tapes you want retrieved
• A customer ID, account number, and possibly a password as a way of identifying yourself to the off-site location as someone authorized to request that the tapes be retrieved
• The address of, and probably directions to, the location of where the tapes should be delivered (you most likely won't want them delivered to your usual facility)
Getting the tapes is the first step. Then you have to begin the restore process. You'll need access to compatible hardware and software that can read those tapes, as well as procedures for doing the restore. Also, if you normally encrypt your backup tapes, your recovery site has to have the appropriate technology and copies of the encryption keys to ensure that the backup tapes can be decrypted. Some companies are moving to tapeless backup by copying their files to a remote facility over a network (see the next section, “Data Replication” on page 256). While this simplifies some of the issues of getting your tapes, you still have to make sure that your recovery site has connectivity to the backup data. For those organizations operating in the cloud (discussed further in Chapter 5, Software, Operating Systems, and Enterprise Applications on page 135), things are even simpler.
Data Replication
If you have an identified disaster recovery facility, with hardware, there are a number of options, in addition to backup tapes, that you can use for making data readily available in an emergency.
• A number of storage vendors (e.g., EMC, NetApp, IBM, Hewlett-Packard) have solutions for replicating data between sites. These utilities don't duplicate the entire data set each time, but merely the changes (usually referred to as the “deltas”), which keeps the two copies in sync. Third-party utilities can do the same thing (Double-Take, LinkPro, Neverfail, etc.).
• Database vendors (e.g., Oracle, IBM, Microsoft) have features and utilities for keeping multiple copies of databases in sync. This is similar to the data replication just discussed, but it applies strictly to databases.
• Transaction logs can be regularly replicated to your secondary site where they can be imported into the copy of the database.
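The idea of sending only the deltas can be illustrated with a small sketch. It is a simplified, block-by-block comparison and is not how any of the vendors named above implement replication; the function names and the tiny block size are assumptions made for readability. Real products typically track changes as they are written rather than comparing whole copies.

```python
import hashlib


def block_digests(data: bytes, block_size: int) -> list[str]:
    """Digest of each fixed-size block, used to spot blocks that differ."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def compute_deltas(primary: bytes, replica: bytes, block_size: int = 4) -> dict[int, bytes]:
    """Return {block_index: block_bytes} for blocks the replica lacks or has stale."""
    primary_digests = block_digests(primary, block_size)
    replica_digests = block_digests(replica, block_size)
    deltas = {}
    for index, digest in enumerate(primary_digests):
        if index >= len(replica_digests) or replica_digests[index] != digest:
            start = index * block_size
            deltas[index] = primary[start:start + block_size]
    return deltas


def apply_deltas(replica: bytes, deltas: dict[int, bytes],
                 new_length: int, block_size: int = 4) -> bytes:
    """Patch the replica with the changed blocks, resizing it to new_length."""
    buffer = bytearray(replica[:new_length].ljust(new_length, b"\0"))
    for index, block in deltas.items():
        start = index * block_size
        buffer[start:start + len(block)] = block
    return bytes(buffer)


primary = b"AAAABBBBCCCCDD"   # current data at the primary site
replica = b"AAAAXXXXCCCC"     # older copy at the recovery site

deltas = compute_deltas(primary, replica)
synced = apply_deltas(replica, deltas, new_length=len(primary))

print(sorted(deltas))     # [1, 3]
print(synced == primary)  # True
```

Only two of the four blocks cross the wire in this example, which is the whole point: the less that has changed since the last synchronization, the less has to be sent to the secondary site.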
Hardware Availability
If your regular computer hardware is unusable for any reason (e.g., a power outage or the destruction of your facilities), you'll have to quickly get your hands on some computer hardware before you can even begin rebuilding your environment.
Size of Your Environment
The smaller and more generic your environment is, the more options you'll have. For example, if your environment is based on Microsoft Windows and Intel PCs and servers, you may be able to rely on local retailers or your regular reseller. Alternatively, you can contract with your disaster recovery facility to keep a quantity of these units on hand for you.
With larger or more complex environments, it will be more difficult (and more expensive) to make sure the equipment will be available. You may want to purchase some of this equipment yourself to have in an emergency, or your disaster recovery facility provider may do this and pass the cost on to you. Your vendor may also have options and provisions available to allow you to receive emergency delivery of specified equipment in the event of a disaster.
For larger environments you may consider having a dedicated facility, either at your own site or from a disaster recovery provider, that is a running environment with all your necessary hardware.
Duplicating Your Entire Environment
In case of a disaster, you may not need to duplicate your entire environment. You probably just want to plan for bringing up the systems that are the most critical to the continued operation and survival of the organization (as discussed previously in this chapter in the section “Application Assessment” on page 250).
To make sure that your recovery operation is as smooth as possible, you'll want to ensure that you're using equipment as comparable as possible to your existing environment. The middle of a crisis isn't the time to find out that the emergency tape drive you have isn't compatible with the backup tapes you use, that you don't have the proper drivers for the network interface cards you're using, or that your application software has to be recompiled before it will run on the hardware you have.
Equipment at Home
If your plan includes people working from home, be sure they have what they need:
• Workstations with appropriate software (the applications they need, software for remote access to reach those applications and data, or both). If it's a laptop you gave them strictly to be used in the event of a disaster, you'll want to check the laptop periodically to make sure it's still functioning, has up-to-date software, etc.
• Familiarity with procedures for connecting remotely (especially for connecting to a recovery site).
Regular Updating and Testing
A disaster recovery plan needs to be reviewed and updated regularly. Just as important is testing the plan periodically.
Review and Update
At least once a year you should review your plan for the following:
• Is the emergency contact list current? Check it to verify that it doesn't contain individuals who have left the company or are no longer relevant to the plan, that new employees have been added, and that the contact information on the list is accurate.
• Are your own internal safety nets still working? You've probably installed a number of redundant resources to use in case of emergency. However, too often, when an emergency strikes, the backup resource turns out not to work either. Perhaps it hasn't been used for so long that it's fallen into disrepair or perhaps it hasn't been kept up to date with upgrades. Regular testing of your redundant resources is important. A spare tire is of no use if there's no air in it.
• Can the backup tapes be read by the equipment at the backup site?
• Do you have copies of the media and installation instructions for the requisite software (operating systems, applications, backup software, etc.) that may have to be installed before you can begin restoring your data from tape?
• Are all associated components factored in? If your check-printing application is considered critical, for example, then to be adequately prepared you have to plan for the availability of a printer during an emergency, not just for bringing up the application.
• Do you have current critical passwords for applications, servers, websites, databases, etc.?
Testing
Just as you check the pressure in your spare tire periodically to make sure it will be useful when you need it, you also need to test your disaster recovery plan periodically to make sure it will work for you when needed. This can be an enormous task, requiring a fair amount of planning of its own.
• You'll need a way to take your primary site off-line (or at least have it seem off-line). You may be able to power down your environment or disconnect its WAN connections. Regardless, your monitoring and management solution will start sending a number of alerts.
• Develop a test plan and script to run through to determine whether things are working as expected.
• Coordinate with all parts of IT and user department representatives to prepare for and participate in the test.
• To get the most value from the test, you want to have a rigorous postmortem process to evaluate what aspects didn't work, why they didn't work, and what has to be changed so that they'll work the next time.
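A test script of the kind described above can be as simple as a list of named checks that are run in order, with every failure recorded for the postmortem. The sketch below assumes hypothetical check names and uses stub functions in place of real checks; in practice each check would probe an actual system at the recovery site.

```python
from typing import Callable, NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool
    detail: str


def run_test_plan(checks: dict[str, Callable[[], bool]]) -> list[CheckResult]:
    """Run every check, recording failures instead of stopping at the first one."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
            detail = "" if passed else "check returned False"
        except Exception as exc:  # a crashed check is still a finding
            passed, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(CheckResult(name, passed, detail))
    return results


def postmortem_items(results: list[CheckResult]) -> list[CheckResult]:
    """The failed checks, i.e., the agenda for the postmortem."""
    return [result for result in results if not result.passed]


def restore_from_tape() -> bool:
    # Stub standing in for a real restore attempt at the recovery site.
    raise RuntimeError("tape drive cannot read backup media")


checks = {
    "replica database reachable": lambda: True,
    "restore from backup tape": restore_from_tape,
    "check printer available": lambda: False,
}

results = run_test_plan(checks)
for item in postmortem_items(results):
    print(f"FAILED: {item.name} ({item.detail})")
```

Running all the checks even after one fails matters here: a test of the plan is expensive to schedule, so you want a single run to surface as many problems as possible.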
If there is simply no way to plan downtime to your production environment (perhaps because your organization runs 24/7), you may have to consider doing tests in phases, testing just a few systems or components at a time. In very large environments, it can take months to plan out a single test of a disaster recovery plan.