Recovery point objective
A recovery point objective (RPO) is defined by business continuity planning. It is the maximum targeted period in which data might be lost from an IT service due to a major incident. The RPO gives systems designers a limit to work to. For instance, if the RPO is set to four hours, then in practice off-site mirrored backups must be continuously maintained; a daily off-site backup on tape will not suffice. Care must be taken to avoid two common mistakes around the use and definition of RPO. Firstly, business continuity staff use business impact analysis to determine the RPO for each service; the RPO is not determined by the existing backup regime. Secondly, when off-site data requires any level of preparation, the period during which data is lost very often starts near the time work begins on preparing the backups that are eventually offsited, rather than at the time those backups actually leave the site.
Recovery point objective (RPO)
When computers used for normal business services are affected by a "major incident" that cannot be fixed quickly, the Information Technology Service Continuity (ITSC) Plan is carried out by the ITSC recovery team. This plan will always assume that the production computing equipment, and the wider geographic location at which it normally resides, might become completely out of bounds at an unpredictable time, without any warning. The location chosen to rebuild the service (the recovery site) must be distant (for example, at least 10 miles) from the normal production site and must share no threats with it (e.g. the two sites should not be near the same coastline). The ITSC Plan must also satisfy two measurements for any potentially affected services: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). These measures are determined by a team of people, called the Business Continuity (BC) team, that quantifies what losses might ensue if the services are not available. It is sobering to think that "potential loss of life" appears in far more IT service risk assessments than one might assume. The RTO and RPO are time intervals, typically expressed in hours, that the BC team specifies as the longest the business can tolerate without incurring significant risk or significant loss. They allow system designers to specify designs that are as cost-effective as the RTO and RPO will permit.
The RTO is the amount of time the business can be without the service without incurring significant risks or significant losses. The events that mark the start and end of the RTO duration must be pre-agreed between Business Continuity and ITSC staff. It is best to agree to start the RTO clock at the moment when it is decided to proceed with the recovery: sometimes too much time is taken over the decision to invoke recovery, and sometimes Major Incidents do not start at easily definable wall-clock times anyway. The RTO clock should be deemed to stop once the team responsible for testing the service (before it is released to the wider user community) begins work. Defined in this way, the RTO can be set to a very specific time period, which allows better decision making at all levels, accepting that this compromises a little the principle of setting the RTO to be "the amount of time the business can be without the service".
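The clock convention described above can be sketched in a few lines of Python. This is only an illustration: the function names `rto_elapsed` and `within_rto` and the timestamps are hypothetical, and the sketch assumes the start event is the decision to invoke recovery and the stop event is the moment the testing team begins work.

```python
from datetime import datetime, timedelta


def rto_elapsed(recovery_invoked: datetime, testing_started: datetime) -> timedelta:
    """Time on the RTO clock: from the decision to recover until testers begin work."""
    if testing_started < recovery_invoked:
        raise ValueError("testing cannot start before recovery is invoked")
    return testing_started - recovery_invoked


def within_rto(recovery_invoked: datetime, testing_started: datetime,
               rto_hours: float) -> bool:
    """True if the measured recovery time did not exceed the agreed RTO."""
    return rto_elapsed(recovery_invoked, testing_started) <= timedelta(hours=rto_hours)


# Hypothetical timestamps, for illustration only.
invoked = datetime(2024, 1, 10, 9, 30)   # decision to proceed with recovery
testing = datetime(2024, 1, 10, 13, 0)   # testing team begins work

print(rto_elapsed(invoked, testing))     # 3:30:00
print(within_rto(invoked, testing, 4))   # True
```

Because both events are explicit timestamps, the result does not depend on when the Major Incident itself is judged to have begun.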
The RPO is deceptively difficult to explain. The RPO is only a measure of the maximum time period in which data might be lost if there is a Major Incident affecting an IT Service; it is not a direct measure of how much data might be lost. Knowing this maximum period, BC staff can more easily take steps to cover it and make plans to avoid or mitigate the impact of losing any data entered within the period defined by the RPO. Consider a very simple example: a data entry clerk transfers data to an IT Service by copy typing from paper forms. If the only consideration is RPO, the clerk needs to keep back enough recent paper forms to be certain of being able to retype all of them, going back the amount of time defined in the RPO. This article does not seek to address the complexities that arise if transactions are completed electronically between organisations and the home side of such transactions is lost because of a Major Incident.
Data synchronization points
A data synchronization point is a point in time. It is used to assess the way in which data backups relate to each other. Data backups need to be correctly related to each other with respect to the time of day they were made, or to computer system activity events. A data synchronization point is a point in time at which a set of backups exists that, if restored, can be synchronized to a common point in time. That common point is often some hours before the last backup is completed, i.e., some hours before the data synchronization point itself. Backups that have no synchronization points are generally useless.
A frequent mistake when setting the RPO for traditional daily tape offsited backups is to assume an RPO of 24 hours. This mistake results from overlooking the fact that the RPO period begins with the start of the first data backup used in the synchronization point. The RPO must also include the time for boxing the tapes, the inevitable contingency time that must be allowed for "waiting for courier transport", and the time for loading and final escape from site. Escape does not always happen at exactly the same time of day, so the RPO must be increased by an amount of time equivalent to any such variability. It is also risky to assume that tapes will always be physically intact, so the RPO should include enough time to fall back to a previous synchronization point, too.
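The components listed above can be added up explicitly. The following Python sketch is one plausible way to do the arithmetic, not a standard formula: the function `tape_rpo_hours`, its parameter names, and every figure in the example are assumptions made for illustration. It treats the worst case as an incident that strikes just before the next set of tapes escapes the site.

```python
def tape_rpo_hours(backup_interval: float,
                   backup_duration: float,
                   boxing: float,
                   courier_wait: float,
                   loading_and_escape: float,
                   escape_variability: float,
                   fallback_sync_points: int = 1) -> float:
    """Estimate the RPO (in hours) that a couriered tape regime can deliver.

    The clock starts at the start of the first backup in the synchronization
    point and runs until the tapes have left the site. One further backup
    interval is added for each older synchronization point that may have to
    be used if the newest tapes turn out to be unreadable.
    """
    # Lag between the data being captured and the tapes being safely off site.
    capture_to_escape = backup_duration + boxing + courier_wait + loading_and_escape
    # Worst case: the incident happens just before the next tape set escapes.
    exposure = backup_interval * (1 + fallback_sync_points)
    return exposure + capture_to_escape + escape_variability


# Hypothetical daily regime: 3 h backup window, 30 min boxing, 1 h courier wait,
# 30 min loading, up to 1 h variation in collection time.
print(tape_rpo_hours(24, 3, 0.5, 1, 0.5, 1, fallback_sync_points=0))  # 30.0
print(tape_rpo_hours(24, 3, 0.5, 1, 0.5, 1, fallback_sync_points=1))  # 54.0
```

Even with no fallback synchronization point, the estimate for this hypothetical regime is well above the naive 24 hours.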
How RTO and RPO values affect computer system design
The RTO and RPO form part of the first specification for any IT Service. The RTO and the RPO have a very significant effect on the design of computer services and for this reason must be considered in concert with all the other major system design criteria.
When assessing the ability of system designs to meet RPO criteria, for practical reasons the RPO capability of a proposed design is tied to the times backups are sent offsite. If, for instance, offsiting is on tape and only daily (still quite common), then 49 or, better, 73 hours is the best RPO the proposed system can deliver, so as to cover for tape hardware problems (tape failure is still too frequent, and one bad tape can write off a whole daily synchronization point). As another example, if a service is properly set up to restart from any point (data is capable of synchronization at all times) and offsiting is via synchronous copies to an offsite mirror data storage device, then the RPO capability of the proposed service is to all intents and purposes 0 hours, although it is normal to allow an hour for RPO in this circumstance to cover any unforeseen difficulty.
If the RTO and RPO can be set to more than 73 hours, then daily backups to tape (or other transportable media), couriered daily to an offsite location, comfortably cover backup needs at a relatively low cost. Recovery can be enacted at a predetermined site. Very often this site will belong to a specialist recovery company, which can provide serviced floor space and hardware for recovery more cheaply because it manages the risks to its clients and carefully shares (or "syndicates") hardware between them according to these risks.
If the RTO is set to 4 hours and the RPO to 1 hour, then a mirror copy of production data must be continuously maintained at the recovery site, and dedicated or nearly dedicated recovery hardware must be available there: hardware that can always be pressed into service within 30 minutes or so. These shorter RTO and RPO settings demand a fundamentally different hardware design, one that is much more expensive than tape backup designs.
If very high volumes of high-value transactions are to be planned for, then the production hardware can be split across two sites; with a high-bandwidth network connection between the two sites, constant mirroring of data can be achieved. If the user community is dispersed, or at least split across two geographic areas, then the configuration is resilient to single-site Major Incidents, with zero RTO and RPO being achievable and, very often, little loss of service being experienced at most times of day.
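The three design tiers described above can be summarised as a simple decision rule. The Python sketch below is a deliberately coarse illustration: the `Design` enum and the function `cheapest_design` are hypothetical names, the 73-hour threshold is the figure quoted in the text, and treating every requirement between the two extremes as needing the mirrored design is a simplifying assumption.

```python
from enum import Enum


class Design(Enum):
    DAILY_TAPE = "daily couriered tape backups, recovery at a syndicated site"
    MIRRORED = "continuous offsite mirror plus near-dedicated recovery hardware"
    SPLIT_SITE = "production split across two constantly mirrored sites"


# Figure quoted in the text for daily tape offsiting.
TAPE_THRESHOLD_HOURS = 73


def cheapest_design(rto_hours: float, rpo_hours: float) -> Design:
    """Pick the least expensive tier that can plausibly meet the RTO and RPO."""
    if rto_hours < 0 or rpo_hours < 0:
        raise ValueError("RTO and RPO must be non-negative")
    if rto_hours > TAPE_THRESHOLD_HOURS and rpo_hours > TAPE_THRESHOLD_HOURS:
        return Design.DAILY_TAPE
    if rto_hours == 0 or rpo_hours == 0:
        # Only the split-site configuration is described as achieving zero.
        return Design.SPLIT_SITE
    # Assumption: everything in between needs the mirrored design.
    return Design.MIRRORED


print(cheapest_design(96, 96).name)  # DAILY_TAPE
print(cheapest_design(4, 1).name)    # MIRRORED
print(cheapest_design(0, 0).name)    # SPLIT_SITE
```

A real assessment would weigh cost, transaction volume and the distribution of users as well, so the rule is a starting point for discussion rather than a selection procedure.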