User:Xilinx sample

XCS Disaster Recovery
	XCS & CSS corp
Original author(s)	Phani Kumar
Stable release	1.0 / 29 April 2012
Preview release	28 April 2012


About

From an XCS point of view, a disaster is anything that causes unscheduled downtime of the production XCS system where the time to return to full production is unknown or longer than an acceptable period. Some examples of possible XCS disasters are:

- Complete and permanent loss of the latest XCS repository data due to, for example:
  - Complete destruction of the main IT data center and its contents.
  - Complete destruction of the main XCS Longmont building and its contents.
  - Complete destruction of the main XCS Longmont and Fordham buildings and their contents.
- Permanent corruption of all or some of the data in the XCS repository.
- Loss of access to the latest XCS repository data where the time to return to full access is greater than an acceptable period, due to any situation where replacing damaged hardware takes much longer than expected.

Any problem where data is not lost and an access solution can be found within an acceptable window is NOT considered a disaster.

Disaster Recovery Terminology

Disaster recovery in information technology is the ability of an infrastructure to restart operations after a disaster. The term is used in the context of both data loss prevention and data recovery. Two primary metrics describe recoverability following a failure:

Recovery Point Objective (RPO) is the point in time that the restarted infrastructure will reflect. Essentially, this is the roll-back that will be experienced as a result of the recovery. Reducing RPO requires increasing the synchronicity of data replication.

Recovery Time Objective (RTO) is the amount of time that will pass before the infrastructure is available again. Reducing RTO requires data to be online and available at a failover site.

XCS Recovery Point Objective (RPO)

Given the current XCS backup process, in the event of a disaster the recovery point that can be achieved is a minimum of 18 hours and a maximum of 42 hours prior to the time the disaster occurred. Both the minimum and the maximum will increase as the repository grows, but they will never be greater than 24 hours and 48 hours respectively. Therefore the XCS RPO is a maximum of 48 hours.

Using the illustration above, the best-case scenario from an RPO point of view is a disaster that happens immediately after a nightly backup completes. In this case the backup snapshot is 18 hours old. The worst-case scenario is a disaster that occurs during a backup, leaving that backup unusable. For example, if Tuesday night's backups are not usable, Monday night's backups would be the next most recent; these are a snapshot of Monday at 7pm, which is 42 hours old.
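The 18-hour and 42-hour figures follow from simple arithmetic. The sketch below assumes, as the example implies, that one snapshot is taken per night and that a backup currently takes 18 hours to complete and become usable. The function name and parameters are illustrative only and are not part of any XCS tooling.

```python
from datetime import timedelta

def rpo_window(backup_duration, backup_interval=timedelta(hours=24)):
    """Return (best_case, worst_case) age of the newest usable snapshot.

    Assumptions (illustrative, not taken from XCS tooling):
    - one snapshot is taken per backup_interval (nightly);
    - a snapshot only becomes usable once its backup has completed,
      backup_duration after the snapshot was taken.
    """
    # Best case: the disaster strikes just after a backup completes,
    # so the newest usable snapshot is exactly backup_duration old.
    best = backup_duration
    # Worst case: the disaster strikes late in the following backup and
    # makes it unusable, so the previous snapshot is one interval older.
    worst = backup_interval + backup_duration
    return best, worst

def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600

current = rpo_window(timedelta(hours=18))  # today's backup duration
ceiling = rpo_window(timedelta(hours=24))  # stated upper limit
print([hours(d) for d in current])  # [18.0, 42.0]
print([hours(d) for d in ceiling])  # [24.0, 48.0]
```

Under these assumptions, the 48-hour RPO corresponds to a backup that takes the full 24 hours between snapshots.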

XCS Recovery Time Objective (RTO)

XCS Recovery Time Objective (RTO) is the amount of time that will pass before XCS is available again after a disaster. If all XCS data is lost, the RTO will be the amount of time it takes to roll over to the backup system and get back online. If access to the XCS data has been lost but the data is still intact, the RTO is determined by such issues as:

- The estimated time for full recovery of access to the production data.
- Willingness to move to a full failover/backup system, determined by such issues as:
  - The day of the week on which the disaster occurred.
  - The stage in the release (many stakeholders, different release schedules).
  - The amount of activity in the past 42-48 hours.
  - Implications of moving to a backup system:
    - Loss of up to 48 hours of checked-in code changes.
    - Developers would need to create new sandboxes and update them with their changes.
    - Relman would have to redo all branching and building activities.
    - No backup system would be available while the production system is still down.
    - Further downtime would be needed afterwards to move back to the original setup.

In the event of a disaster, it is the DSD staff's responsibility to weigh up these issues and decide how long to wait before bringing XCS back online. The general DSD RTO is 5 days, i.e. in the event of a disaster, systems will not be down for more than 5 days. Therefore, in the event of a disaster, the XCS RTO is a maximum of 5 days.

Known Risks with Current Setup

In the event of a disaster affecting the main XCS building in XCS Colorado, if it is necessary to move to the backed-up data stored in the Fordham building, that data will reflect a point somewhere between 0 and 48 hours prior to the time the disaster occurred, depending on the time of the disaster and the state of the latest backups.
If both the main building and the Fordham building in XCS Colorado are affected by a disaster, a full copy of the XCS repository would have to be retrieved from a tape storage facility in Denver, CO, and a suitable machine found to function as an XCS server. There is no documented process for this scenario.
There is no XCS recovery plan that covers the situation where a disaster affects both XCS Longmont buildings and also the tape storage facility in Denver.
XCS currently shares a NAC fileserver with ATS for the backup system. This could cause performance problems (CPU and I/O) and dependency problems when the system is used for failover.

Summary of XCS RPO and RTO

XCS RPO: In the event of a disaster, the point in time to which the XCS data would have to be rolled back is a maximum of 48 hours prior to the occurrence of the disaster.
XCS RTO: In the event of a disaster, the maximum amount of time that the XCS server and repository would be unavailable is 5 days.
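
As a rough illustration of how these two limits could be checked after a recovery or a recovery exercise, the sketch below compares the achieved roll-back and downtime with the 48-hour RPO and 5-day RTO stated above. The function name, its parameters, and the example timestamps are hypothetical and do not describe an existing XCS or DSD tool.

```python
from datetime import datetime, timedelta

XCS_RPO = timedelta(hours=48)  # maximum roll-back, from the summary above
XCS_RTO = timedelta(days=5)    # maximum downtime, from the summary above

def check_recovery(snapshot_time, disaster_time, restored_time):
    """Compare one recovery against the XCS RPO and RTO.

    All three arguments are hypothetical inputs:
    snapshot_time - when the restored snapshot was taken
    disaster_time - when the disaster occurred
    restored_time - when XCS was back online
    """
    data_loss = disaster_time - snapshot_time   # achieved recovery point
    downtime = restored_time - disaster_time    # achieved recovery time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= XCS_RPO,
        "rto_met": downtime <= XCS_RTO,
    }

# Made-up example: Monday 7pm snapshot, disaster on Wednesday at 1pm,
# service restored on the following Saturday at 9am.
result = check_recovery(
    snapshot_time=datetime(2012, 4, 23, 19, 0),
    disaster_time=datetime(2012, 4, 25, 13, 0),
    restored_time=datetime(2012, 4, 28, 9, 0),
)
print(result["data_loss"], result["rpo_met"])  # 1 day, 18:00:00 True
print(result["downtime"], result["rto_met"])   # 2 days, 20:00:00 True
```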

Additional Information