
Disaster Recovery (DR) Fundamentals: Definitions, Roles, Metrics, and Implementation

This article provides a comprehensive overview of disaster recovery, covering its definition, the distinction between backup and DR, their respective roles, key metrics such as RPO and RTO, various replication technologies, and practical implementation methods across storage, network, and host layers.

Architects' Tech Alliance

Jan 1, 2022


1. Definition of Disaster Recovery

1.1 What is Disaster Recovery?

Disaster recovery (DR) means using available technical means to establish reliable emergency procedures for responding to sudden incidents. It encompasses both backup systems and disaster‑recovery systems.

1.2 Concepts of Backup and Disaster Recovery

1.2.1 Backup

Backup: Ensures data safety by copying all or part of the data from production hosts or arrays to other storage media.

1.2.2 Disaster Recovery

Disaster Recovery: Guarantees business continuity by building two or more functionally identical IT systems (including compute, network, storage, power, cooling, etc.) at geographically separate sites and keeping their data in step over the network; when the primary data center fails, the backup data center can quickly take over services.

1.2.3 Differences

Protection Object: Backup protects data, while DR protects business continuity.

Implementation Method: Backup uses backup software; DR uses replication or mirroring software.

Time Cycle: Backup cycles are longer; replication/mirroring cycles are shorter.

Note: Archiving uses backup.

1.2.4 Relationship

Only Backup: Business cannot recover quickly; data recovery may take time, causing potentially huge losses.

Only DR: Business can recover quickly, but erroneous operations or failed upgrades on the production side may be replicated to the DR site, causing service interruption.

1.3 Protection Provided by Disaster Recovery

2. Role of Disaster Recovery

2.1 Problems in Data Centers

Viruses and OS vulnerabilities

Human errors

Terrorist attacks

Power failures

Hardware failures

Natural disasters (earthquake, flood, typhoon)

2.1.2 Consequences of No DR

Business interruption

Data loss

Customer complaints

Revenue decline

Financial compensation

Company bankruptcy

(Data is priceless; loss can be catastrophic.)

2.2 Role of Backup

2.2.1 Storage Layer – Five Parts of Backup Configuration

Backup Sub‑Client: Execution carrier for backup tasks.

Storage Strategy: Includes backup media, deduplication policy, retention policy, write I/O count.

Backup Content: What to back up and what to exclude.

Backup Policy: Deduplication strategy, backup type, backup schedule.

Performance Optimization: Client read stream count.

2.2.2 Cloud Computing Layer

Cloud Server Backup Service (CSBS): Provides whole‑machine backup for cloud servers, supporting local snapshots and remote replication, enabling data recovery and ensuring business safety.

Volume Backup Service (VBS): Allows creating backups of cloud disks and rolling back using backup data, maximizing data correctness and security.

2.2.3 Replication Types

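Synchronous Replication: Each write is copied to the remote site in real time and acknowledged only after both copies are committed.

Asynchronous Replication: Writes are acknowledged locally and copied to the remote site afterwards, so the replica may lag behind the source.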
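The difference between the two modes can be illustrated with a toy model. This is a minimal sketch, not how any particular array implements replication: the class `ReplicatedVolume` and its methods are hypothetical names, and network latency is ignored so that only the acknowledgement order is visible.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a production volume with one remote replica (hypothetical)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []        # writes committed at the production site
        self.replica = []        # writes committed at the DR site
        self._pending = deque()  # writes queued for asynchronous shipping

    def write(self, block: str) -> None:
        """Commit a write; returning models the acknowledgement to the host."""
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the remote copy is committed before the acknowledgement.
            self.replica.append(block)
        else:
            # Asynchronous: acknowledge now, ship the write later.
            self._pending.append(block)

    def ship_pending(self, max_blocks: int) -> None:
        """Transfer up to max_blocks queued writes to the replica."""
        for _ in range(min(max_blocks, len(self._pending))):
            self.replica.append(self._pending.popleft())

    def lag(self) -> int:
        """Number of acknowledged writes the replica does not hold yet."""
        return len(self.primary) - len(self.replica)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for block in ["a", "b", "c"]:
    sync_vol.write(block)
    async_vol.write(block)
async_vol.ship_pending(max_blocks=1)

print(sync_vol.lag())   # 0
print(async_vol.lag())  # 2
```

If the production site failed at this moment, the synchronous replica would hold every acknowledged write, while the asynchronous replica would be missing two.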

2.3 Role of Disaster Recovery

2.3.1 Application Scenarios

Local High‑Availability (HA)

Active‑Standby (AS)

Active‑Active (AA) data centers

Two‑Site‑Three‑Center (3DC) – cascade/parallel

2.3.2 Solution Overview

Local Production Center: Implements local HA solutions.

In‑City DR (<100 km): Active‑Active or Active‑Standby solutions.

Inter‑City DR (>100 km): Two‑Site‑Three‑Center or Active‑Standby solutions.
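The distance thresholds above can be written down as a small lookup. This is only a sketch of the overview, not a sizing tool: the function name `candidate_schemes` is hypothetical, a distance of 0 stands for the local production center, and treating exactly 100 km as inter-city is an assumption because the overview does not cover that boundary.

```python
def candidate_schemes(distance_km: float) -> list[str]:
    """Map the distance between sites to the candidate DR schemes."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    if distance_km == 0:
        # Same site: the local production center.
        return ["Local HA"]
    if distance_km < 100:
        # In-city DR.
        return ["Active-Active", "Active-Standby"]
    # Inter-city DR (assumption: exactly 100 km is treated as inter-city).
    return ["Two-Site-Three-Center", "Active-Standby"]


print(candidate_schemes(30))    # ['Active-Active', 'Active-Standby']
print(candidate_schemes(800))   # ['Two-Site-Three-Center', 'Active-Standby']
```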

2.3.3 Local High‑Availability Scheme

Advantages: Zero business interruption, zero data loss, high reliability.

Uses real‑time mirroring and synchronous replication; the short distance and high bandwidth typically make RPO = 0 achievable.

2.3.4 Active‑Standby Scheme

Advantages: RPO≈0, low TCO, cross‑vendor storage compatibility, centralized topology and alerts, automated one‑click DR drills and recovery.

Key Technology: HyperReplication.

2.3.5 Active‑Active Data Center Scheme

Advantages: Six‑layer active‑active architecture, zero business interruption, zero data loss.

Key Technology: HyperMetro.

2.3.6 Two‑Site‑Three‑Center (Cascade/Parallel)

Network Type: Cascade. Advantages: Minimal impact on production performance. Disadvantages: During regional disasters, if the same‑city DR site fails, RPO becomes large due to asynchronous replication.

Network Type: Parallel. Advantages: Effectively avoids cascade drawbacks during regional disasters. Disadvantages: Higher performance requirements on the production center.
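The trade-off between the two network types can be sketched by modelling replication links as a small graph. The site names and the function `sites_receiving_updates` are illustrative assumptions; the model also assumes that links are never re-pointed after a failure, which real products may be able to do.

```python
from collections import deque

# Replication links as (source, target) pairs. Site names are illustrative.
CASCADE = [("production", "same_city_dr"), ("same_city_dr", "remote_dr")]
PARALLEL = [("production", "same_city_dr"), ("production", "remote_dr")]


def sites_receiving_updates(links, failed):
    """Return the DR sites still fed from production over healthy links."""
    if "production" in failed:
        return set()
    reached, queue = set(), deque(["production"])
    while queue:
        site = queue.popleft()
        for source, target in links:
            if source == site and target not in failed and target not in reached:
                reached.add(target)
                queue.append(target)
    return reached


# The same-city DR site fails while production keeps running.
print(sites_receiving_updates(CASCADE, failed={"same_city_dr"}))   # set()
print(sites_receiving_updates(PARALLEL, failed={"same_city_dr"}))  # {'remote_dr'}
```

In the cascade layout the remote site stops receiving updates as soon as the same-city site is lost, so its copy grows stale. In the parallel layout production keeps feeding the remote site, at the cost of maintaining two replication streams.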

3. Measurement of Disaster Recovery

3.1 Backup Types

Backup Window: The time interval during which backup can be performed without affecting normal business operations.

3.1.1 Full Backup

Creates a complete copy of all data at a point in time. Advantages: Small recovery window. Disadvantages: Large storage consumption, long backup time.

3.1.2 Cumulative Incremental Backup

Based on the last full backup, backs up all changes since then. Advantages: Saves storage, smaller backup and recovery windows. Disadvantages: Recovery depends on the last full backup plus the most recent cumulative set.

3.1.3 Differential Incremental Backup

Based on the last backup (full or differential), backs up changes since that point. Advantages: Minimal storage usage, small backup window. Disadvantages: Recovery requires the last full backup and every differential set taken since, applied in order, leading to longer reconstruction time.
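How the backup type affects recovery can be made concrete by computing the restore chain, that is, the list of backup sets needed to rebuild the data. This is a minimal sketch; the `Backup` record and the `restore_chain` function are hypothetical names and not part of any backup product.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    day: int
    kind: str  # "full", "cumulative", or "differential"


def restore_chain(backups: list[Backup], restore_day: int) -> list[Backup]:
    """Return the backup sets needed, in order, to restore to restore_day."""
    history = [b for b in sorted(backups, key=lambda b: b.day) if b.day <= restore_day]
    full_positions = [i for i, b in enumerate(history) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup on or before the restore day")
    last_full = full_positions[-1]
    chain = [history[last_full]]
    for b in history[last_full + 1:]:
        if b.kind == "cumulative":
            # A cumulative set holds every change since the full backup,
            # so it supersedes all earlier incremental sets.
            chain = [chain[0], b]
        elif b.kind == "differential":
            # A differential set only holds changes since the previous backup.
            chain.append(b)
    return chain


cumulative = [Backup(0, "full")] + [Backup(d, "cumulative") for d in (1, 2, 3)]
differential = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print([b.day for b in restore_chain(cumulative, 3)])    # [0, 3]
print([b.day for b in restore_chain(differential, 3)])  # [0, 1, 2, 3]
```

With cumulative incrementals the restore needs two sets regardless of how many days have passed; with differential incrementals the chain grows by one set per day.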

3.1.4 Backup Strategy Principles

Combine full backup with either cumulative or differential incremental backups, but avoid mixing cumulative and differential in the same policy.

In environments with strict storage and window constraints, prefer full + differential backups.

3.2 DR Metrics

3.2.1 Recovery Point Objective (RPO)

Maximum tolerable data loss, measured in time. Example: With a backup at 08:00 and a failure at 09:00, one hour of data is lost; an RPO of 1 hour is met only if backups run at least hourly.

3.2.2 Recovery Time Objective (RTO)

Maximum tolerable business interruption time. Example: Must restore within 12 hours after a disaster → RTO = 12 hours.
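Both metrics are objectives that an actual incident is measured against. The sketch below reuses the 08:00 backup and 09:00 failure from the RPO example; the calendar date and the 19:30 restore time are made-up values for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of data lost: everything written after the last recoverable copy."""
    return failure - last_backup


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """An objective is met when the actual value does not exceed it."""
    return actual <= objective


last_backup = datetime(2022, 1, 1, 8, 0)   # backup at 08:00
failure = datetime(2022, 1, 1, 9, 0)       # failure at 09:00
restored = datetime(2022, 1, 1, 19, 30)    # hypothetical time service resumed

loss = data_loss(last_backup, failure)
downtime = restored - failure

print(loss, meets_objective(loss, timedelta(hours=1)))           # 1:00:00 True
print(downtime, meets_objective(downtime, timedelta(hours=12)))  # 10:30:00 True
print(meets_objective(loss, timedelta(minutes=30)))              # False
```

The last line shows that the same incident would violate a 30-minute RPO, which is why the replication or backup interval has to be chosen from the RPO rather than the other way around.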

3.2.3 Comprehensive Standards

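Each DR capability level pairs an RTO with an RPO:

Level 1: RTO >2 days; RPO 1–7 days

Level 2: RTO <24 hours; RPO 1–7 days

Level 3: RTO >12 hours; RPO few hours–1 day

Level 4: RTO few hours–2 days; RPO few hours–1 day

Level 5: RTO few minutes–2 days; RPO 0–30 minutes

Level 6: RTO few minutes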

4. Implementation of Disaster Recovery

4.1 Backup Methods

There are three classic backup methods: LAN‑Base, LAN‑Free, and Server‑Free.

4.1.1 LAN‑Base

A backup agent is installed on each production server, and backup data travels over the LAN to a backup server. The setup is simple, but it is unsuitable for very large data volumes because the backup traffic consumes LAN bandwidth.
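A rough transfer-time estimate shows why LAN bandwidth becomes the limit. This is a back-of-the-envelope sketch: the 70% link efficiency, the 10 TB data volume, and the 8-hour backup window are illustrative assumptions, sizes are in decimal gigabytes, and `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(data_gb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Hours needed to move data_gb gigabytes over a link of link_gbps gigabits/s."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link speed must be positive and efficiency in (0, 1]")
    gigabytes_per_second = link_gbps * efficiency / 8  # 8 bits per byte
    return data_gb / gigabytes_per_second / 3600


window_hours = 8  # assumed nightly backup window
for link_gbps in (1, 10):
    hours = transfer_hours(data_gb=10_000, link_gbps=link_gbps)
    print(link_gbps, round(hours, 1), hours <= window_hours)
# 1 31.7 False
# 10 3.2 True
```

Under these assumptions a full backup of 10 TB over a shared 1 Gbit/s LAN takes well over a day, which is the kind of case where LAN‑Free or Server‑Free designs become attractive.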

4.1.2 LAN‑Free

Data flows from the file server through an FC switch directly to tape, bypassing the LAN and reducing network load, though the file server still participates in the I/O path.

4.1.3 Server‑Free

Backup data does not pass through the server’s bus or memory; the file server issues SCSI Extended Copy (third‑party copy) commands and the storage system copies data directly to tape, greatly reducing server load.

Another option is NDMP (Network Data Management Protocol), which lets storage devices communicate directly with backup targets without involving the host.

4.2 Backup Media

Disk arrays

Tape libraries

Virtual tape libraries

Optical libraries

Cloud storage

Integrated appliances (e.g., Huawei HDP3500E)

4.3 Backup Design Principles

Customer requirements (data type, volume, objects).

Backup strategy (frequency, timing).

Network planning (bandwidth, topology).

Storage planning (current volume, growth).
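Storage planning can start from a simple capacity estimate. The sketch below assumes one full backup per week plus six daily differential incrementals, a fixed daily change rate, and compound annual growth; deduplication and compression are ignored. All parameter values and the function name `backup_capacity_tb` are illustrative assumptions.

```python
def backup_capacity_tb(
    data_tb: float,
    daily_change: float,
    retained_weeks: int,
    annual_growth: float,
    years: int,
) -> float:
    """Estimate repository size for weekly fulls plus six daily incrementals."""
    projected = data_tb * (1 + annual_growth) ** years      # protected data after growth
    fulls = retained_weeks * projected                      # one full backup per retained week
    incrementals = retained_weeks * 6 * projected * daily_change
    return fulls + incrementals


estimate = backup_capacity_tb(
    data_tb=10, daily_change=0.05, retained_weeks=4, annual_growth=0.2, years=3
)
print(round(estimate, 1))  # 89.9
```

Even this rough figure shows that retained full backups dominate the requirement, so retention policy and data growth deserve as much attention as the current volume.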

4.4 DR Methods

4.5 DR Technologies

4.5.1 Host‑Level Data Replication

Installs replication software on servers in both production and DR sites; requires network connectivity and may consume significant host and network resources.

4.5.2 Network‑Level Data Replication

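Replication is handled by devices in the storage network, such as storage virtualization gateways placed between the hosts and the arrays, rather than by software on the application servers; it adds equipment to the I/O path but places little load on the servers themselves.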

4.5.3 Storage‑Level Data Replication

Deploys a pair of storage systems that handle replication internally, often via direct fiber links or DWDM; minimal impact on application servers and widely used for high‑availability DR solutions.

Author: SkyBiuBiu Original link: https://www.cnblogs.com/Skybiubiu/p/14992848.html
