
Preparing Your Production Environment for Disaster Recovery (Resiliency and Backup)
Written by Gareth Eagar
Once you have determined a Disaster Recovery strategy, it's time to do some work in your primary IT environment with two very important objectives in mind. The first is to improve the resiliency of your systems, and the second is to ensure that when it's time to recover, you have everything you need to recover successfully.

RESILIENCY

The old adage 'prevention is better than cure' applies to IT and Disaster Recovery as well. It is better to prevent a disaster in the first place than to go through the stress of recovering once the systems have failed. So your challenge here is to increase the resiliency of your IT infrastructure, and there are various ways that you can do this.

One of the most important and yet simplest ways to increase system resiliency is to ensure that all disks in your IT environment are resilient, and this is easily achieved by using RAID technology. RAID can be applied to disks located physically inside a server or externally in a SAN. When configured with a redundant level (mirroring or parity, for example), it ensures that your system continues to function even if a disk fails. If you don't know what RAID is, don't worry about it too much - just ask your IT vendor or technical team to ensure that the disks used for all mission-critical servers have RAID protection (and read the Wikipedia article on RAID).

Other ways of increasing the resilience of a server include ensuring that you have dual power supplies and multiple network cards configured for resilience. Taking this further, you could also set up a cluster, which is a number of servers working together and sharing common tasks. If one server fails, another server in the cluster can take over the processing for that server.
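As a rough illustration of the cluster idea, the sketch below routes work to the first healthy server in a list. It is a simplified model only: the `Node` class, the `pick_node` function and the server names are hypothetical, and real cluster software also handles health checking, shared storage and failback.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical cluster member; 'healthy' would come from a real health check."""
    name: str
    healthy: bool = True


def pick_node(nodes):
    """Return the first healthy node, or None if every node has failed."""
    for node in nodes:
        if node.healthy:
            return node
    return None


# Two servers sharing the same task (names are made up for this example).
cluster = [Node("app1"), Node("app2")]
primary = pick_node(cluster)

# Simulate a failure of the first server: the second one takes over.
cluster[0].healthy = False
standby = pick_node(cluster)
print(primary.name, "->", standby.name)
```

The point of the sketch is that the clients of the cluster never depend on one particular server, only on at least one server being healthy.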
On the topic of power, it is critical that you have an Uninterruptible Power Supply (UPS), which will keep your servers running for a short period of time in the event of a power failure. Certain types will also ensure that your servers receive a 'clean' power supply, which prevents damage from power spikes and other common power problems. Ensure that you purchase a UPS from a specialist vendor who will be able to advise you correctly on the size of UPS you need to cover all your critical IT infrastructure. Note though that a UPS is designed to provide power for a short period of time only - often only long enough for your servers to be shut down correctly using the operating system shutdown procedure (data loss or corruption can occur if a server is powered off suddenly rather than shut down with the correct procedure).

If you need to be able to use your IT infrastructure for an extended period of time during a power failure, you need to invest in a generator. A generator will generally run on fuel (most often diesel) and can therefore run for as long as you keep it supplied with fuel. A generator is, however, a fairly expensive piece of equipment, so the decision on whether to purchase one will depend on your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) metrics. Also note that a generator does not replace a UPS but should be used in conjunction with one. The reason for this is that a generator takes a short while to start up, so the UPS is required to keep your infrastructure running until the generator can take over supplying power.

Beyond server and power resiliency, you may also need to investigate telecommunications or network resiliency. This could include having a phone service from two different suppliers and redundant network connections from different suppliers.

PREPARING FOR RECOVERY

Preparing for recovery means ensuring that you have copies of all your data available off-site and ready for an unexpected event.
This of course does not happen by chance and requires careful preparation and regular monitoring. The type of preparation required will depend on the recovery strategy for each system.

For hot site recovery strategies, you need to configure the high availability systems or disk replication technology that you will use. If you are replicating data at the disk level and plan to recover the operating system separately, then you need to ensure that you have proper operating system backups and that these are scheduled to take place regularly. You should perform regular small-scale testing to ensure that your data is being replicated correctly (this is in addition to the full Disaster Recovery tests that you should be conducting at least once every 6 months).

For cold and warm site recovery strategies, you need to ensure that you have good backups of all data and operating systems that need to be recovered. For databases there may be special backup requirements. If you are using a commercial backup product (such as EMC NetWorker or Symantec Backup Exec), you can purchase a database plugin that enables you to take a consistent backup of the database while it is running. Alternatively, you may need a script that shuts down the database before the backup starts.

It is strongly recommended that you take a full backup of all your systems. If you take selective backups, you run the risk of not having backed up something that you may require later. You also risk adding a new application or new data to your system and not including it in the selective backup (however, if your change control and DR processes are well integrated, this should not occur). It is important that you regularly review the backup strategies and selections for all systems included in your Disaster Recovery plan to ensure that all required data is being backed up.
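A review of selective backups can be partly automated. The sketch below is a minimal example, assuming you keep one list of the paths your Disaster Recovery plan requires and one list of the paths your backup job selects. The function name `uncovered_paths` and all the paths shown are hypothetical and not taken from any backup product.

```python
from pathlib import PurePosixPath


def uncovered_paths(required, selected):
    """Return the required paths that are not inside any selected backup path."""
    selected_paths = [PurePosixPath(s) for s in selected]
    missing = []
    for item in required:
        path = PurePosixPath(item)
        # A path is covered if it is selected itself or sits below a selected directory.
        covered = any(path == sel or sel in path.parents for sel in selected_paths)
        if not covered:
            missing.append(item)
    return missing


# Hypothetical example: a new application was installed but never added to the backup.
required = ["/var/lib/db", "/etc/app/config", "/srv/newapp/data"]
selected = ["/var/lib", "/etc"]
missing = uncovered_paths(required, selected)
print(missing)  # ['/srv/newapp/data']
```

A check like this only compares lists, so it complements rather than replaces a proper review of the backup selections.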
In addition, it is recommended that you regularly (at least once a month) attempt a test restore of data that you have backed up, restoring to a test system. This test should restore data from various systems to verify that your backups are working correctly and can be restored.

If you back up to tape, you should also follow best practice recommendations for writing to multiple tapes and cycling the tapes that you use. Also ensure that you do not use tapes for longer than the period recommended by the tape manufacturer.

CONCLUSION

Prevention is better than cure - if you can increase the resiliency of your systems, you reduce the risk to your IT environment (and therefore the risk to your business). However, you are unlikely to achieve complete resiliency, and there are events outside of your control and outside of individual servers that could require a full or partial recovery of your IT environment. Therefore, you should ensure that you are replicating or backing up all the data that you require, that you verify the replication or backup on a regular basis, and that you review your backup and replication strategy and selections regularly. Doing so gives you the peace of mind that you will want when you need to recover after things have gone wrong.