RPO and RTO
Recovery Point Objective (RPO): RPO defines the maximum acceptable amount of data loss for a system or service when a disaster occurs, expressed as a period of time. For example, if the RPO of a system is 1 hour, the system data must be backed up at least once every hour so that no more than the most recent hour of data is lost after a disaster. Typically, important services require a short RPO and high bandwidth for data transmission during DR.
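The 1-hour example can be sketched as a small check. This is a minimal illustration, not part of the product: the function names data_loss_window and meets_rpo and the timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data written after the last backup is lost, so the loss window is the gap."""
    if disaster < last_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster - last_backup


def meets_rpo(last_backup: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True if the data lost at the time of the disaster stays within the RPO."""
    return data_loss_window(last_backup, disaster) <= rpo


# Hypothetical values: last backup at 09:00, RPO of 1 hour.
rpo = timedelta(hours=1)
last_backup = datetime(2024, 1, 1, 9, 0)

print(meets_rpo(last_backup, datetime(2024, 1, 1, 9, 45), rpo))   # True
print(meets_rpo(last_backup, datetime(2024, 1, 1, 10, 30), rpo))  # False
```

If backups run at least once per RPO period, the loss window at any moment cannot exceed the RPO, which is why the backup interval is tied to the RPO.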
Recovery Time Objective (RTO): RTO defines the maximum acceptable downtime of a system or service after a disaster before normal operations are recovered. For example, if the RTO of a system is 4 hours, the operations of the system must be recovered within 4 hours to ensure business continuity and availability after a disaster occurs. The RTO depends on the importance and complexity of the system or service. In most cases, important systems or services require a short RTO to minimize downtime and its impact.
During the implementation of DR, the RPO and RTO constrain each other. A shorter RPO usually means that data must be backed up more often and that a larger amount of backup data must be handled, which can lengthen backup and recovery time. This increases the risk of failing to meet the RTO. Therefore, when you design a DR solution, consider both the RPO and RTO based on your actual systems or services, and choose an appropriate backup policy, recovery plan, and DR technology to ensure optimal performance.
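One way to picture this trade-off is to check candidate designs against both objectives at once. The sketch below is illustrative only: the DrOption class, the option names, and all durations are made-up assumptions, and the estimated recovery times would in practice come from DR tests rather than from a formula.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrOption:
    """Hypothetical candidate DR design."""
    name: str
    backup_interval: timedelta      # worst-case data loss window
    estimated_recovery: timedelta   # assumed time needed to restore service


def satisfies(option: DrOption, rpo: timedelta, rto: timedelta) -> bool:
    """A design is viable only if it meets the RPO and the RTO together."""
    return option.backup_interval <= rpo and option.estimated_recovery <= rto


# Illustrative numbers: more frequent backups are assumed to lengthen recovery.
options = [
    DrOption("hourly backup", timedelta(hours=1), timedelta(hours=3)),
    DrOption("15-minute backup", timedelta(minutes=15), timedelta(hours=5)),
    DrOption("daily backup", timedelta(hours=24), timedelta(hours=2)),
]

rpo, rto = timedelta(hours=1), timedelta(hours=4)
viable = [o.name for o in options if satisfies(o, rpo, rto)]
print(viable)  # ['hourly backup']
```

In this example the 15-minute option easily meets the RPO but misses the RTO, and the daily option does the opposite, so only the design that balances the two is acceptable.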
DR Site and Site Pair
A site is the smallest unit in which customer services run in a DR system, and corresponds to an HCI cluster managed by SCP. A DR activity requires at least two sites: one primary site and one secondary site. Sites are usually interconnected by using networking technologies such as DWDM or MPLS VPN, depending on the distance between the sites and the implementation requirements.
To fail over VMs from the primary site to the secondary site, configure the two sites as a site pair and create DR links between the DR transmission interfaces of the primary and secondary sites by using the corresponding networking technology. Data can then be transmitted from the primary site to the secondary site.
DR Policy
A DR policy defines the RPO, protected VMs, local backup method (backup repository), and destination to which backups are transmitted. You can create different DR policies for VMs that require different RPOs, such as the VMs of web or app systems. For example, you can set the RPO of a web system to 2 hours and the RPO of middleware to 1 hour. VMs that have the same DR requirements can use the same DR policy. This enables you to safeguard data based on multiple levels of protection.
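The elements of a DR policy can be modeled as a simple record. The sketch below is not the SCP data model or API: the DrPolicy class, its field names, and the VM, repository, and site names are all hypothetical and only mirror the four items listed above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class DrPolicy:
    """Hypothetical record mirroring the four elements of a DR policy."""
    name: str
    rpo: timedelta
    protected_vms: Tuple[str, ...]
    backup_repository: str   # local backup method
    destination: str         # where backups are transmitted


# Example from the text: web system at 2 hours, middleware at 1 hour.
policies = [
    DrPolicy("web-policy", timedelta(hours=2), ("web-01", "web-02"),
             "local-repo", "secondary-site"),
    DrPolicy("middleware-policy", timedelta(hours=1), ("mw-01",),
             "local-repo", "secondary-site"),
]


def rpo_for_vm(vm: str, candidates: List[DrPolicy]) -> timedelta:
    """Return the RPO of the first policy that protects the given VM."""
    for policy in candidates:
        if vm in policy.protected_vms:
            return policy.rpo
    raise KeyError(f"no DR policy protects {vm}")


print(rpo_for_vm("web-02", policies))  # 2:00:00
print(rpo_for_vm("mw-01", policies))   # 1:00:00
```

Here web-01 and web-02 share one policy because they have the same DR requirements, while the middleware VM gets a stricter RPO through its own policy.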
Protection Group
Typically, a service recovers only after the app, middleware, and database of the service recover. A protection group is a group of associated service VMs that fail over together between the primary and secondary sites when a disaster occurs. It ensures that all VMs running the service can recover on the secondary site together, which keeps the service available. VMs in a protection group can use different DR policies. In addition, you can create a DR policy to define the order in which VMs are powered on. In a regular business system, VMs for different components are powered on in the order of database, middleware, and app service.
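The power-on order described above (database, then middleware, then app service) can be sketched as a sort over the VMs of a protection group. The role labels, VM names, and the power_on_sequence function are assumptions made for illustration and do not reflect how the platform stores or applies the order.

```python
from typing import Dict, List

# Order from the text: database first, then middleware, then app service.
BOOT_ORDER = ["database", "middleware", "app"]


def power_on_sequence(group: Dict[str, str]) -> List[str]:
    """Sort VM names by the boot rank of their role, then by name.

    `group` maps a VM name to its component role (hypothetical labels).
    """
    rank = {role: index for index, role in enumerate(BOOT_ORDER)}
    return sorted(group, key=lambda vm: (rank[group[vm]], vm))


# Hypothetical protection group for one service.
group = {
    "app-01": "app",
    "db-01": "database",
    "mw-01": "middleware",
    "app-02": "app",
}

print(power_on_sequence(group))  # ['db-01', 'mw-01', 'app-01', 'app-02']
```

Sorting by name within a role only makes the result deterministic; VMs of the same role have no ordering requirement in the text.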
Recovery Plan and Recovery Type
You can associate a recovery plan with protection groups to perform primary/secondary site failover or failback operations on VMs in the associated protection groups. You can perform the following primary/secondary site operations: DR test, scheduled recovery, and DR.
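The relationship between a recovery plan, its protection groups, and the three operations can be pictured as follows. This is a sketch under stated assumptions: RecoveryType, RecoveryPlan, and the describe method are hypothetical names, and the code only lists what would be targeted without performing any failover.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RecoveryType(Enum):
    """The three primary/secondary site operations named in the text."""
    DR_TEST = "DR test"
    SCHEDULED_RECOVERY = "scheduled recovery"
    DR = "DR"


@dataclass
class RecoveryPlan:
    """Hypothetical plan that is associated with protection groups."""
    name: str
    protection_groups: List[str] = field(default_factory=list)

    def describe(self, recovery_type: RecoveryType) -> List[str]:
        """List the groups an operation would cover; nothing is executed."""
        return [f"{recovery_type.value}: {group}" for group in self.protection_groups]


# Hypothetical plan covering two protection groups.
plan = RecoveryPlan("erp-plan", ["erp-core", "erp-reporting"])
print(plan.describe(RecoveryType.DR_TEST))
# ['DR test: erp-core', 'DR test: erp-reporting']
```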