Disaster recovery plan
A disaster recovery plan (DRP) is a documented process or set of procedures to recover and protect a business IT infrastructure in the event of a disaster. Such a plan, ordinarily documented in written form, specifies procedures an organization is to follow in the event of a disaster. It is "a comprehensive statement of consistent actions to be taken before, during and after a disaster". The disaster could be natural, environmental or man-made. Man-made disasters could be intentional (for example, an act of a terrorist) or unintentional (that is, accidental, such as the breakage of a man-made dam).
Given organizations' increasing dependency on information technology to run their operations, a disaster recovery plan, sometimes erroneously called a Continuity of Operations Plan (COOP), is increasingly associated with the recovery of information technology data, assets, and facilities.
Objectives
Organizations cannot always avoid disasters, but careful planning can minimize their effects. The objective of a disaster recovery plan is to minimize downtime and data loss and, above all, to protect the organization in the event that all or part of its operations or computer services are rendered unusable. The plan minimizes the disruption of operations and ensures some level of organizational stability and an orderly recovery after a disaster. Downtime and data loss are measured in terms of two concepts: the recovery time objective (RTO) and the recovery point objective (RPO).
The recovery time objective is the time within which a business process must be restored after a major incident (MI) has occurred, in order to avoid unacceptable consequences associated with a break in business continuity. The recovery point objective is the maximum acceptable amount of data loss, measured in time: it is the age of the files or data that must be recovered from backup storage for normal operations to resume if a computer, system, or network goes down as a result of an MI. The RPO is expressed backwards in time (that is, into the past) starting from the instant at which the MI occurs, and can be specified in seconds, minutes, hours, or days.
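As an illustration of how the two objectives are applied, the following minimal sketch checks a single incident against an RTO and an RPO. The function name `meets_objectives`, the timestamps, and the target values are hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta

def meets_objectives(incident_at, last_backup_at, restored_at, rto, rpo):
    """Return (rto_met, rpo_met) for one incident.

    All names and values here are illustrative assumptions.
    """
    # Downtime runs from the incident until the process is restored.
    downtime = restored_at - incident_at
    # Data loss is measured backwards from the incident to the last backup.
    data_loss_window = incident_at - last_backup_at
    return downtime <= rto, data_loss_window <= rpo

incident = datetime(2024, 1, 10, 14, 0)
result = meets_objectives(
    incident_at=incident,
    last_backup_at=datetime(2024, 1, 10, 13, 30),  # 30 minutes before the incident
    restored_at=datetime(2024, 1, 10, 17, 0),      # 3 hours after the incident
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
)
print(result)  # (True, False)
```

In this example the process is restored within the four-hour RTO, but the newest backup is 30 minutes old, so the 15-minute RPO is missed.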
Relationship to the Business Continuity Plan
According to the SANS Institute, the Business Continuity Plan (BCP) is a comprehensive organizational plan that includes the disaster recovery plan. The Institute further states that a BCP consists of five component plans:
- Business Resumption Plan
- Occupant Emergency Plan
- Continuity of Operations Plan
- Incident Management Plan
- Disaster Recovery Plan
The Institute states that the first three plans (Business Resumption, Occupant Emergency, and Continuity of Operations Plans) do not deal with the IT infrastructure. It further states that the Incident Management Plan (IMP) does deal with the IT infrastructure, but since it establishes structure and procedures to address cyber attacks against an organization’s IT systems, it generally does not represent an agent for activating the Disaster Recovery Plan, leaving the Disaster Recovery Plan as the only BCP component of interest to IT.
Benefits
As with any insurance plan, there are benefits to be obtained from drafting a disaster recovery plan. Some of these benefits are:
- Providing a sense of security
- Minimizing risk of delays
- Guaranteeing the reliability of standby systems
- Providing a standard for testing the plan
- Minimizing decision-making during a disaster
- Reducing potential legal liabilities
- Reducing an unnecessarily stressful work environment
Types of plans
There is no one right type of disaster recovery plan, nor is there a one-size-fits-all disaster recovery plan. However, three basic strategies feature in all disaster recovery plans: (1) preventive measures, (2) detective measures, and (3) corrective measures.
Preventive measures try to prevent a disaster from occurring by identifying and reducing risks. They may include keeping data backed up and off site, using surge protectors, installing generators and conducting routine inspections.
Detective measures are taken to discover the presence of unwanted events within the IT infrastructure and to uncover new potential threats. They include installing fire alarms, using up-to-date antivirus software, holding employee training sessions, and installing server and network monitoring software.
Corrective measures aim to fix or restore systems after a disaster or otherwise unwanted event takes place. They may include keeping critical documents in the disaster recovery plan or securing proper insurance policies, typically after a "lessons learned" brainstorming session.
A disaster recovery plan must answer at least three basic questions: (1) what is its objective and purpose, (2) which people or teams will be responsible if a disruption occurs, and (3) what will these people do (the procedures to be followed) when the disaster strikes.
Types of disasters
Disasters can be natural or man-made. Man-made disasters could be intentional (for example, sabotage or an act of terrorism) or unintentional (that is, accidental, such as the breakage of a man-made dam). Disasters may encompass more than weather. They may involve Internet threats or take on other man-made manifestations such as theft.
A natural disaster is a major adverse event resulting from the earth's natural hazards. Examples of natural disasters are floods, tsunamis, tornadoes, hurricanes/cyclones, volcanic eruptions, earthquakes, heat waves, and landslides. Other types of disasters include the more cosmic scenario of an asteroid hitting the Earth.
Man-made disasters are the consequence of technological or human hazards. Examples include stampedes, urban fires, industrial accidents, oil spills, nuclear explosions/nuclear radiation and acts of war. Other types of man-made disasters include larger-scale scenarios such as catastrophic global warming, nuclear war, and bioterrorism.
The following table categorizes some disasters and notes first response initiatives. Note that whereas the sources of a disaster may be natural (for example, heavy rains) or man-made (for example, a broken dam), the results may be similar (flooding).
|Category||Disaster||Description||First response|
|Natural||Avalanche||The sudden, drastic flow of snow down a slope, occurring when either natural triggers, such as loading from new snow or rain, or artificial triggers, such as explosives or backcountry skiers, overload the snowpack||Shut off utilities; Evacuate building if necessary; Determine impact on the equipment and facilities and any disruption|
|Natural||Blizzard||A severe snowstorm characterized by very strong winds and low temperatures||Power off all equipment; Listen to blizzard advisories; Evacuate area, if unsafe; Assess damage|
|Natural||Earthquake||The shaking of the earth’s crust, caused by underground volcanic forces or by the breaking and shifting of rock beneath the earth’s surface||Shut off utilities; Evacuate building if necessary; Determine impact on the equipment and facilities and any disruption|
|Natural||Fire (wild)||Fires that originate in uninhabited areas and pose a risk of spreading to inhabited areas||Attempt to suppress fire in early stages; Evacuate personnel on alarm, as necessary; Notify fire department; Shut off utilities; Monitor weather advisories|
|Natural||Flood||Flash flooding: Small creeks, gullies, dry streambeds, ravines, culverts or even low-lying areas flood quickly||Monitor flood advisories; Determine flood potential to facilities; Pre-stage emergency power generating equipment; Assess damage|
|Natural||Freezing rain||Rain occurring when outside surface temperature is below freezing||Monitor weather advisories; Notify employees of business closure; Arrange for snow and ice removal|
|Natural||Heat wave||A prolonged period of excessively hot weather relative to the usual weather pattern of an area and relative to normal temperatures for the season||Listen to weather advisories; Power off all servers after a graceful shutdown if there is imminent potential of power failure; Shut down main electric circuit, usually located in the basement or on the first floor|
|Natural||Hurricane||Heavy rains and high winds||Power off all equipment; Listen to hurricane advisories; Evacuate area, if flooding is possible; Check gas, water and electrical lines for damage; Do not use telephones in the event of severe lightning; Assess damage|
|Natural||Landslide||Geological phenomenon which includes a range of ground movement, such as rock falls, deep failure of slopes and shallow debris flows||Shut off utilities; Evacuate building if necessary; Determine impact on the equipment and facilities and any disruption|
|Natural||Lightning strike||An electrical discharge caused by lightning, typically during thunderstorms||Power off all equipment; Listen to hurricane advisories; Evacuate area, if flooding is possible; Check gas, water and electrical lines for damage; Do not use telephones in the event of severe lightning; Assess damage|
|Natural||Limnic eruption||The sudden eruption of carbon dioxide from deep lake water||(not specified)|
|Natural||Tornado||Violent rotating columns of air which descend from severe thunderstorm cloud systems||Monitor tornado advisories; Power off equipment; Shut off utilities (power and gas); Assess damage once storm passes|
|Natural||Tsunami||A series of water waves caused by the displacement of a large volume of a body of water, typically an ocean or a large lake, usually caused by earthquakes, volcanic eruptions, underwater explosions, landslides, glacier calvings, meteorite impacts and other disturbances above or below water||Power off all equipment; Listen to tsunami advisories; Evacuate area, if flooding is possible; Check gas, water and electrical lines for damage; Assess damage|
|Natural||Volcanic eruption||The release of hot magma, volcanic ash and/or gases from a volcano||(not specified)|
|Man-made||Bioterrorism||The intentional release or dissemination of biological agents as a means of coercion||Get information immediately from your public health officials via the news media as to the right course of action; If you think you have been exposed, quickly remove your clothing and wash off your skin; Also put on a HEPA mask to help prevent inhalation of the agent|
|Man-made||Civil unrest||A disturbance caused by a group of people that may include sit-ins and other forms of obstructions, riots, sabotage and other forms of crime, and which is intended to be a demonstration to the public and the government, but can escalate into general chaos||Contact local police or law enforcement|
|Man-made||Fire (urban)||Even with strict building fire codes, people still perish needlessly in fires||Attempt to suppress fire in early stages; Evacuate personnel on alarm, as necessary; Notify fire department; Shut off utilities; Monitor weather advisories|
|Man-made||Hazardous material spills||The escape of solids, liquids, or gases that can harm people, other living organisms, property or the environment, from their intended controlled environment such as a container||Leave the area and call the local fire department for help; If anyone was affected by the spill, call your local Emergency Medical Services line|
|Man-made||Nuclear and radiation accidents||An event involving significant release of radioactivity to the environment or a reactor core meltdown and which leads to major undesirable consequences to people, the environment, or the facility||Recognize that a CBRN incident has occurred or may occur; Gather, assess and disseminate all available information to first responders; Establish an overview of the affected area; Provide and obtain regular updates to and from first responders|
|Man-made||Power failure||Caused by summer or winter storms, lightning or construction equipment digging in the wrong location||Wait 5–10 minutes; Power off all servers after a graceful shutdown; Do not use telephones in the event of severe lightning; Shut down main electric circuit, usually located in the basement or on the first floor|
In the realm of information technology per se, disasters may also be the result of a computer security exploit. Some of these are: computer viruses, cyberattacks, denial-of-service attacks, hacking, and malware exploits. These are ordinarily attended to by information security experts.
Planning methodology
According to Geoffrey H. Wold of the Disaster Recovery Journal, the entire process involved in developing a Disaster Recovery Plan consists of 10 steps:
Obtaining top management commitment
For a disaster recovery plan to be successful, the central responsibility for the plan must reside with top management. Management is responsible for coordinating the disaster recovery plan and ensuring its effectiveness within the organization. It is also responsible for allocating the time and resources required to develop an effective plan. Resources that management must allocate include both financial considerations and the effort of all personnel involved.
Establishing a planning committee
A planning committee is appointed to oversee the development and implementation of the plan. The planning committee includes representatives from all functional areas of the organization. Key committee members customarily include the operations manager and the data processing manager. The committee also defines the scope of the plan.
Performing a risk assessment
The planning committee prepares a risk analysis and a business impact analysis (BIA) that includes a range of possible disasters, including natural, technical and human threats. Each functional area of the organization is analyzed to determine the potential consequence and impact associated with several disaster scenarios. The risk assessment process also evaluates the safety of critical documents and vital records. Traditionally, fire has posed the greatest threat to an organization. Intentional human destruction, however, should also be considered. A thorough plan provides for the “worst case” situation: destruction of the main building. It is important to assess the impacts and consequences resulting from loss of information and services. The planning committee also analyzes the costs related to minimizing the potential exposures.
Establishing priorities for processing and operations
At this point, the critical needs of each department within the organization are evaluated in order to prioritize them. Establishing priorities is important because no organization possesses infinite resources and criteria must be set as to where to allocate resources first. Some of the areas often reviewed during the prioritization process are functional operations, key personnel and their functions, information flow, processing systems used, services provided, existing documentation, historical records, and the department's policies and procedures.
Processing and operations are analyzed to determine the maximum amount of time that the department and organization can operate without each critical system. This figure is later mapped to the recovery time objective. A critical system is defined as one that is part of a system or procedure necessary to continue operations should a department, computer center, main facility or a combination of these be destroyed or become inaccessible. One method used to determine the critical needs of a department is to document all the functions it performs. Once the primary functions have been identified, the operations and processes are ranked in order of priority: essential, important and non-essential.
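The ranking step can be sketched in a few lines. The example below is only an illustration: the `BusinessFunction` record, the sample departments, and the downtime figures are hypothetical, and breaking ties by the shortest tolerable downtime is an assumption rather than part of the methodology described above.

```python
from dataclasses import dataclass

# The three priority tiers named in the text, in recovery order.
PRIORITY_ORDER = {"essential": 0, "important": 1, "non-essential": 2}

@dataclass
class BusinessFunction:
    department: str
    name: str
    priority: str              # one of the keys of PRIORITY_ORDER
    max_downtime_hours: float  # longest time the department can operate without it

def recovery_order(functions):
    """Sort by priority tier, then by shortest tolerable downtime (an assumed tie-break)."""
    return sorted(
        functions,
        key=lambda f: (PRIORITY_ORDER[f.priority], f.max_downtime_hours),
    )

functions = [
    BusinessFunction("Finance", "payroll", "important", 72),
    BusinessFunction("Sales", "order entry", "essential", 4),
    BusinessFunction("Marketing", "newsletter", "non-essential", 168),
    BusinessFunction("Support", "help desk", "essential", 8),
]
ordered = [f.name for f in recovery_order(functions)]
print(ordered)  # ['order entry', 'help desk', 'payroll', 'newsletter']
```

The `max_downtime_hours` value recorded for each function is the figure that would later inform its recovery time objective.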
Determining recovery strategies
During this phase, the most practical alternatives for processing in case of a disaster are researched and evaluated. All aspects of the organization are considered, including physical facilities, computer hardware and software, communications links, data files and databases, customer services provided, user operations, the overall management information systems (MIS) structure, end-user systems, and any other processing operations.
Alternatives, dependent upon the evaluation of the computer function, may include: hot sites, warm sites, cold sites, reciprocal agreements, the provision of more than one data center, the installation and deployment of multiple computer systems, duplication of service centers, consortium arrangements, lease of equipment, and any combination of the above.
Written agreements for the specific recovery alternatives selected are prepared, specifying contract duration, termination conditions, system testing, cost, any special security procedures, procedure for the notification of system changes, hours of operation, the specific hardware and other equipment required for processing, personnel requirements, definition of the circumstances constituting an emergency, process to negotiate service extensions, guarantee of compatibility, availability, non-mainframe resource requirements, priorities, and other contractual issues.
Collecting data
In this phase, data collection takes place. The recommended data gathering materials and documentation often include various lists (employee backup position listing, critical telephone numbers list, master call list, master vendor list, notification checklist), inventories (communications equipment, documentation, office equipment, forms, insurance policies, workgroup and data center computer hardware, microcomputer hardware and software, office supply, off-site storage location equipment, telephones, etc.), a distribution register, software and data files backup/retention schedules, temporary location specifications, and any other such lists, materials, inventories and documentation. Pre-formatted forms are often used to facilitate the data gathering process.
Organizing and documenting a written plan
Next, an outline of the plan’s contents is prepared to guide the development of the detailed procedures. Top management reviews and approves the proposed plan. The outline can ultimately be used for the table of contents after final revision. Four other benefits of this approach are that it (1) helps to organize the detailed procedures, (2) identifies all major steps before the actual writing process begins, (3) identifies redundant procedures that only need to be written once, and (4) provides a road map for developing the procedures.
It is often considered best practice to develop a standard format for the disaster recovery plan so as to facilitate the writing of detailed procedures and the documentation of other information to be included in the plan later. This helps ensure that the disaster plan follows a consistent format and allows for its ongoing future maintenance. Standardization is also important if more than one person is involved in writing the procedures.
It is during this phase that the actual written plan is developed in its entirety, including all detailed procedures to be used before, during, and after a disaster. The procedures include methods for maintaining and updating the plan to reflect any significant internal, external or systems changes. The procedures allow for a regular review of the plan by key personnel within the organization. The disaster recovery plan is structured using a team approach. Specific responsibilities are assigned to the appropriate team for each functional area of the organization. Teams responsible for administrative functions, facilities, logistics, user support, computer backup, restoration and other important areas in the organization are identified.
The structure of the contingency organization may not be the same as the existing organization chart. The contingency organization is usually structured with teams responsible for major functional areas such as administrative functions, facilities, logistics, user support, computer backup, restoration, and any other important area.
The management team is especially important because it coordinates the recovery process. The team assesses the disaster, activates the recovery plan, and contacts team managers. The management team also oversees, documents and monitors the recovery process. It is helpful when management team members are the final decision-makers in setting priorities, policies and procedures. Each team has specific responsibilities that are completed to ensure successful execution of the plan. The teams have an assigned manager and an alternate in case the team manager is not available. Other team members may also have specific assignments where possible.
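The team structure described above, with an assigned manager and an alternate for each team, can be modeled as a simple record. The sketch below is illustrative only: the `RecoveryTeam` class, the `call_list` helper, and the names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RecoveryTeam:
    area: str        # functional area, e.g. "facilities" or "restoration"
    manager: str
    alternate: str   # steps in when the manager is not available
    members: list = field(default_factory=list)

    def lead(self, unavailable=frozenset()):
        """Return the manager, or the alternate if the manager is unavailable."""
        for person in (self.manager, self.alternate):
            if person not in unavailable:
                return person
        return None  # neither is reachable; the management team must reassign

def call_list(teams, unavailable=frozenset()):
    """Map each functional area to the person the management team should contact."""
    return {team.area: team.lead(unavailable) for team in teams}

teams = [
    RecoveryTeam("facilities", "A. Rivera", "B. Chen"),
    RecoveryTeam("restoration", "C. Okafor", "D. Silva"),
]
contacts = call_list(teams, unavailable={"A. Rivera"})
print(contacts)  # {'facilities': 'B. Chen', 'restoration': 'C. Okafor'}
```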
Developing testing criteria and procedures
Best practices dictate that DR plans be thoroughly tested and evaluated on a regular basis (at least annually). Thorough DR plans include documentation with the procedures for testing the plan. The tests will provide the organization with the assurance that all necessary steps are included in the plan. Other reasons for testing include:
- Determining the feasibility and compatibility of backup facilities and procedures.
- Identifying areas in the plan that need modification.
- Providing training to the team managers and team members.
- Demonstrating the ability of the organization to recover.
- Providing motivation for maintaining and updating the disaster recovery plan.
Testing the plan
After testing procedures have been completed, an initial "dry run" of the plan is performed by conducting a structured walk-through test. The test will provide additional information regarding any further steps that may need to be included, changes in procedures that are not effective, and other appropriate adjustments. These may not become evident unless an actual dry-run test is performed. The plan is subsequently updated to correct any problems identified during the test. Initially, testing of the plan is done in sections and after normal business hours to minimize disruptions to the overall operations of the organization. As the plan is further polished, future tests occur during normal business hours.
Types of tests include: checklist tests, simulation tests, parallel tests, and full interruption tests.
Obtaining plan approval
Once the disaster recovery plan has been written and tested, the plan is then submitted to management for approval. It is top management’s ultimate responsibility that the organization has a documented and tested plan. Management is responsible for (1) establishing the policies, procedures and responsibilities for comprehensive contingency planning, and (2) reviewing and approving the contingency plan annually, documenting such reviews in writing.
Organizations that receive information processing from service bureaus will also need to (1) evaluate the adequacy of their service bureau's contingency plans, and (2) ensure that their own contingency plan is compatible with the service bureau’s plan.
Caveats/controversies
Lack of buy-in
The perception by executive management that DR planning is "just another fake earthquake drill", or a CEO's failure to make DR planning and preparation a priority, is often a significant contributor to the failure of a DR plan.
Incomplete RTOs and RPOs
Another critical point is failure to include each and every important business process or a block of data. "Every item in your DR plan requires a Recovery Time Objective (RTO) defining maximum process downtime or a Recovery Point Objective (RPO) noting an acceptable restore point. Anything less creates ripples that can extend the disaster's impact." As an example, "payroll, accounting and the weekly customer newsletter may not be mission-critical in the first 24 hours, but left alone for several days, they can become more important than any of your initial problems."
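A coverage check of this kind is easy to automate. The following minimal sketch flags inventory items that define neither an RTO nor an RPO; the inventory layout and the field names `rto_hours` and `rpo_hours` are assumptions made for the example.

```python
def missing_objectives(inventory):
    """Return the names of items that define neither an RTO nor an RPO."""
    return [
        name
        for name, objectives in inventory.items()
        if objectives.get("rto_hours") is None
        and objectives.get("rpo_hours") is None
    ]

# Hypothetical inventory of business processes and blocks of data.
inventory = {
    "order database": {"rto_hours": 4, "rpo_hours": 0.25},
    "payroll": {"rto_hours": 72},
    "customer newsletter": {},  # easy to overlook, as the quotation warns
}
gaps = missing_objectives(inventory)
print(gaps)  # ['customer newsletter']
```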
A third point of failure involves focusing only on DR without considering the larger business continuity needs: "Data and systems restoration after a disaster are essential, but every business process in your organization will need IT support, and that support requires planning and resources." As an example, corporate office space lost to a disaster can result in an instant pool of teleworkers which, in turn, can overload a company's VPN overnight, overwork the IT support staff in the blink of an eye, and cause serious bottlenecks and monopolies with the dial-in PBX system.
When there is a disaster, an organization's data and business processes become vulnerable. As such, security can be more important than the raw speed involved in a disaster recovery plan's RTO. The most critical consideration then becomes securing the new data pipelines: from new VPNs to the connection from offsite backup services. Another security concern includes documenting every step of the recovery process—something that is especially important in highly regulated industries, government agencies, or in disasters requiring post-mortem forensics. Locking down or remotely wiping lost handheld devices is also an area that may require addressing.
Another important aspect that is often overlooked involves the frequency with which DR Plans are updated. Yearly updates are recommended but some industries or organizations require more frequent updates because business processes evolve or because of quicker data growth. To stay relevant, disaster recovery plans should be an integral part of all business analysis processes, and should be revisited at every major corporate acquisition, at every new product launch and at every new system development milestone.
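The review triggers mentioned above can be expressed as a simple check. This is a sketch under stated assumptions: the one-year interval follows the yearly recommendation, while the function name `review_due` and the event labels are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # yearly updates, as recommended above

def review_due(last_reviewed, today, events_since_review=()):
    """Return the reasons a plan review is due; an empty list means none."""
    reasons = []
    if today - last_reviewed > REVIEW_INTERVAL:
        reasons.append("annual review overdue")
    # Acquisitions, product launches and system milestones each trigger a review.
    reasons.extend(f"triggered by: {event}" for event in events_since_review)
    return reasons

reasons = review_due(
    last_reviewed=date(2023, 1, 15),
    today=date(2024, 3, 1),
    events_since_review=["new product launch"],
)
print(reasons)  # ['annual review overdue', 'triggered by: new product launch']
```

Organizations that require more frequent updates would shorten `REVIEW_INTERVAL` accordingly.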
See also
- Disaster recovery
- Business continuity planning
- Federal Emergency Management Agency
- Backup rotation scheme
- Seven tiers of disaster recovery
References
- Abram, Bill (14 June 2012). "5 Tips to Build an Effective Disaster Recovery Plan". Small Business Computing. Retrieved 9 August 2012.
- Wold, Geoffrey H. (1997). "Disaster Recovery Planning Process". Disaster Recovery Journal. Adapted from Volume 5 #1. Disaster Recovery World. Archived from the original on 15 August 2012. Retrieved 8 August 2012.
- An Overview of the Disaster Recovery Planning Process - From Start to Finish. Comprehensive Consulting Solutions Inc. ("Disaster Recovey Planning, An Overview: White Paper."). March 1999. Retrieved 8 August 2012.
- Definition: Recovery point objective (RPO). Retrieved 10 August 2012.
- "Recovery Point Objective (RPO): Definition - What does Recovery Point Objective (RPO) mean?". Techopedia. Janalta Interactive Inc. 2012. Retrieved 10 August 2012.
- The Disaster Recovery Plan. Chad Bahan. GSEC Practical Assignment version 1.4b. SANS Institute InfoSec Reading Room. June 2003. Retrieved 24 August 2012.
- "Disaster Recovery Planning - Step by Step Guide". Michigan State University. Archived from the original on 8 March 2014. Retrieved 9 May 2014.
- "Backup Disaster Recovery". Email Archiving and Remote Backup. 2010. Retrieved 9 May 2014.
- "Disaster Recovery & Business Continuity Plans". Stone Crossing Solutions. 2012. Archived from the original on 23 August 2012. Retrieved 9 August 2012.
- "Disaster Recovery – Benefits of Getting Disaster Planning Software and Template and Contracting with Companies Offering Data Disaster Recovery Plans, Solutions and Services: Why Would You Need a Disaster Recovery Plan?". Continuity Compliance. 7 June 2011. Archived from the original on 8 May 2014. Retrieved 14 August 2012.
- Business Continuity Planning (BCP): Sample Plan For Nonprofit Organizations. Archived June 2, 2010, at the Wayback Machine. Pages 11-12. Retrieved 8 August 2012.
- What should I do if there has been a bioterrorism attack?. Edmond A. Hooker. WebMD. 9 October 2007. Retrieved 18 September 2012.
- Report of the Joint Fire/Police Task Force on Civil Unrest (FA-142): Recommendations for Organization and Operations During Civil Disturbance. Page 55. FEMA. Retrieved 21 October 2012.
- Business Continuity Planning: Developing a Strategy to Minimize Risk and Maintain Operations. Archived March 27, 2014, at the Wayback Machine. Adam Booher. Retrieved 19 September 2012.
- Hazardous Materials. Archived October 11, 2012, at the Wayback Machine. Tennessee Emergency Management Office. Retrieved 7 September 2012.
- Managing Hazardous Materials Incidents (MHMIs). Center for Disease Control. Retrieved 7 September 2012.
- Guidelines for First Response to a CBRN Incident. Project on Minimum Standards and Non-Binding Guidelines for First Responders Regarding Planning, Training, Procedure and Equipment for Chemical, Biological, Radiological and Nuclear (CBRN) Incidents. NATO. Emergency Management. Retrieved 21 October 2012.
- Five Mistakes That Can Kill a Disaster Recovery Plan. Cormac Foster. Dell Corporation. 25 October 2010. Archived at archive.org. Retrieved 8 August 2012.