
Selecting a Disaster Recovery Strategy (Cold, Warm or Hot)

Disaster Recovery Tutorials
Written by Gareth Eagar

In the previous tutorials, you should have determined the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for critical business processes and mapped those to various parts of your IT infrastructure. In this chapter we'll review the primary disaster recovery strategies that may be appropriate based on your objectives and budget.

It is likely that different systems will have different recovery strategies, based on the individual objectives for each system or group of systems. However, you must understand the dependencies between systems: systems that transfer real-time data between each other need to be grouped together in what can be called a 'dependency group'. All the systems in a dependency group should have the same recovery strategy. Remember this as you plan your strategy for the various systems.

A common way to look at the primary strategies for disaster recovery is based on definitions of the primary off-site recovery center facilities that these strategies use. The following definitions have been around for decades and vary depending on who you speak to, but they still provide a good way to group the differing primary recovery / continuity strategies. Note that the lines between the different strategies are not clearly defined and often overlap.

[Note that where estimates are given for recovery times, these are generalizations. Every recovery is different, and recovery time depends on the complexity of the system, the amount of data to be restored, the method of restoring data, etc.]

COLD SITE RECOVERY [Days or weeks to recover]

A cold site is an off-site recovery facility that contains basic data center facilities, without any actual systems to recover to. Usually a cold site provides a computer room with power and network cabling (and perhaps basic network infrastructure) as well as air-conditioning, but there is no actual hardware (servers, disk, tape, etc.) for recovery.
With cold site recovery, there is no data mirroring or replication, and recoveries will generally be done from backup tapes. A cold site is the lowest cost option but does not enable a fast recovery. It is also usually impractical to perform system recovery testing at a cold site, since no servers are permanently based there (so no testing unless hardware can be borrowed for use in a DR test). Cold site recovery is not a practical option for most situations and is not recommended, since it is difficult to borrow equipment for regular disaster recovery testing (and DR testing for all systems, even those with long RTOs and RPOs, is critical).

With a cold site recovery strategy, hardware (servers and disk) generally needs to be purchased at time of disaster. Depending on the type and number of servers that you require, it could take days or weeks to purchase and install the hardware required for recovery.

In summary, a cold site recovery strategy is not recommended, and in the modern IT environment very few companies, even small organizations, can safely select this recovery strategy. If this strategy is nevertheless selected for all or part of your IT infrastructure, you need to ensure that you can get access to recovery hardware as fast as possible. The most common ways of doing this are to set up an arrangement with your hardware supplier for preferential and urgent delivery of hardware in a disaster situation, or to contract with a DR provider to provide syndicated recovery hardware at the cold site in the event of a disaster.

WARM SITE RECOVERY [Hours or days to recover]

A warm site is an off-site recovery facility that contains data center facilities (backup power supply, cabling, air conditioning, etc.) and permanently available IT infrastructure (servers, disk, networking, etc.). With warm site recovery, there may be some data mirroring or replication, although mostly restores will be done from a backup.
Warm site recovery is the most common recovery strategy and provides a balance between cost and potential for a relatively fast recovery (generally 8 hours plus). With warm site recovery, hardware is available on stand-by and, in the event of a disaster, operating system and data restores can start virtually immediately.

In summary, a warm site recovery strategy is recommended for many environments and can be suitable where the Recovery Time Objective (RTO) is from 8 to 72 hours for priority systems. In most situations where the RTO is under 4 hours, a hot site strategy is likely to be more appropriate. Where the RTO is 4 to 8 hours, warm site recovery could be possible depending on individual circumstances such as the amount of data to be restored and the method used to restore it.

HOT SITE RECOVERY [Immediate to 4 hours to recover]

A hot site recovery facility contains dedicated hardware that can be ready to take over production system processing immediately, within minutes, or within a few hours at most. With hot site recovery, the data required to continue operations is generally replicated to the recovery site and so is available virtually immediately. In some cases high availability technology is used at a hot site so that processing can continue with virtually no downtime.

An alternative strategy is to keep the operating system of an equivalent recovery server in sync with the production system and have replicated SAN disk at the recovery site. In the event of a disaster the replicated disk is attached to the recovery server (where the operating system is already synchronized) and processing can continue (this process can take an hour or less).

Hot site recovery strategies are the most expensive solution as they require dedicated hardware and high bandwidth for replication of data / systems.
However, they do provide a very low RTO and RPO, and therefore in certain environments they are the only option (especially financial services environments and often telecom environments). As bandwidth costs have fallen in recent years and higher bandwidth over longer distances has become more common, the cost of hot site recovery strategies has fallen and they have become more popular with medium sized companies. However, the cost is certainly higher than that of a warm site recovery strategy, and hot site recovery is therefore generally still used only where the RTO and RPO are very low.

DO IT YOURSELF OR USE A COMMERCIAL DR SITE

For all of the options discussed in this tutorial you may elect to use your own facilities or the facilities of a commercial disaster recovery site provider (such as IBM BCRS or Sungard Availability Services). At a commercial DR provider you may also elect to contract for dedicated servers (i.e., for your use only) or for syndicated servers (servers that are shared for DR purposes between multiple customers).

There are also in-between options, such as an agreement with another company whereby you each keep extra facilities in your data centers and, in the event of a disaster, each can use the other's data center. Or you can contract with a DR provider for delivery of the required servers and disk to your own cold site, rather than using their data center facilities.

CONCLUSION

Your choice of a recovery strategy is going to have a big impact on your budget, so it is critical that you have accurate and realistic RTOs and RPOs for your critical business processes and, therefore, your systems. Do not select a recovery strategy until you are certain that the RTOs and RPOs have been set correctly and approved by senior management. Ultimately you may very well have different strategies for different systems; however, you must ensure that your dependency groups are accurate so that systems that depend on each other have the same recovery strategy.
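
As an illustration of the dependency group idea, the following is a minimal Python sketch rather than a prescribed method. It assumes you can list each pair of systems that exchange real-time data; the system names and RTO figures are hypothetical. It also assumes that a group should be planned against the strictest (smallest) RTO of any of its members, which is one reasonable reading of "the same recovery strategy" but is not stated in this tutorial.

```python
from collections import defaultdict


def dependency_groups(systems, realtime_links):
    """Group systems that exchange real-time data, directly or indirectly.

    systems: a re-iterable collection of system names (list, set or dict).
    realtime_links: pairs of system names that transfer real-time data.
    """
    parent = {name: name for name in systems}

    def find(name):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in realtime_links:
        # Merge the two groups that a and b currently belong to.
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in systems:
        groups[find(name)].add(name)
    return sorted(sorted(members) for members in groups.values())


def group_rto_hours(group, rto_hours):
    # Assumption: the whole group is planned against its strictest member RTO.
    return min(rto_hours[name] for name in group)


# Hypothetical systems and RTOs (in hours), for illustration only.
rto_hours = {"billing": 4, "crm": 24, "order-entry": 2, "reporting": 72, "warehouse": 48}
realtime_links = [("order-entry", "billing"), ("billing", "crm")]

for group in dependency_groups(rto_hours, realtime_links):
    print(group, "-> plan for an RTO of", group_rto_hours(group, rto_hours), "hours")
```

Here "crm" has a 24 hour RTO on its own, but because it is linked through "billing" to "order-entry", all three end up in one group and would be planned together.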
The three primary recovery site strategies do not have clear lines between them: you may select a strategy that has elements of cold site and warm site in it, or of warm site and hot site. For example, you may have dedicated, replicated disk but bare metal servers, where you recover the operating system at time of disaster and then attach the replicated disk.

Each option has different budgetary requirements and pros and cons. You need to carefully review your budget, objectives and options before making a decision. However, whichever option you choose, it is critical that you are able to regularly perform disaster recovery tests in which you go through the recovery of your systems. An untested disaster recovery plan can be highly dangerous, as it may give you a false sense of confidence in your ability to recover in the event of a disaster.
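
The RTO bands quoted in this tutorial can be written down as a simple first-pass lookup. The Python sketch below is only a starting point for discussion: the function and enum names are hypothetical, the thresholds are the generalized figures given above (under 4 hours, 4 to 8 hours, 8 to 72 hours), and treating anything over 72 hours as a possible cold site candidate is an assumption drawn from the "days or weeks" estimate. It ignores budget, RPO, data volume and restore method, all of which affect the real decision.

```python
from enum import Enum


class Strategy(Enum):
    HOT = "hot site"
    WARM_OR_HOT = "warm or hot site, decided case by case"
    WARM = "warm site"
    COLD_POSSIBLE = "cold site possible, though not recommended"


def suggest_strategy(rto_hours: float) -> Strategy:
    """Map an RTO in hours to the strategy band described in the tutorial."""
    if rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    if rto_hours < 4:
        return Strategy.HOT
    if rto_hours < 8:
        # Depends on the amount of data to restore and the restore method.
        return Strategy.WARM_OR_HOT
    if rto_hours <= 72:
        return Strategy.WARM
    # Assumption: beyond 72 hours a cold site becomes feasible on paper.
    return Strategy.COLD_POSSIBLE


# Hypothetical RTOs in hours, for illustration only.
for hours in (1, 6, 24, 120):
    print(hours, "hours ->", suggest_strategy(hours).value)
```

Whatever band a system falls into, the result still needs to be checked against its dependency group and confirmed through regular recovery testing.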