Disaster Recovery (DR) Fundamentals: Definitions, Roles, Metrics, and Implementation
This article provides a comprehensive overview of disaster recovery, covering its definition, the distinction between backup and DR, their respective roles, key metrics such as RPO and RTO, various replication technologies, and practical implementation methods across storage, network, and host layers.
1. Definition of Disaster Recovery
1.1 What is Disaster Recovery?
Disaster recovery (DR) refers to the use of established technical means to build reliable emergency procedures for responding to sudden incidents. In the broad sense it encompasses both backup systems and disaster‑recovery systems.
1.2 Concepts of Backup and Disaster Recovery
1.2.1 Backup
Backup: Ensures data safety by copying all or part of the data from production hosts or arrays to other storage media.
1.2.2 Disaster Recovery
Disaster Recovery: Guarantees business continuity by building two or more functionally identical IT systems (including compute, network, storage, power, cooling, etc.) at geographically separate sites; when the primary data center fails, the backup data center takes over and quickly restores services using data transferred over the network.
1.2.3 Differences
Protection Object: Backup protects data, while DR protects business continuity.
Implementation Method: Backup uses backup software; DR uses replication or mirroring software.
Time Cycle: Backup cycles are longer; replication/mirroring cycles are shorter.
Note: Archiving uses backup.
1.2.4 Relationship
Only Backup: Business cannot recover quickly; data recovery may take time, causing potentially huge losses.
Only DR: Business can recover quickly, but erroneous operations or failed upgrades on the production side may be replicated to the DR site, causing service interruption.
1.3 Protection Provided by Disaster Recovery
2. Role of Disaster Recovery
2.1 Problems in Data Centers
Virus and OS vulnerabilities
Human errors
Terrorist attacks
Power failures
Hardware failures
Natural disasters (earthquake, flood, typhoon)
2.1.2 Consequences of No DR
Business interruption
Data loss
Customer complaints
Revenue decline
Financial compensation
Company bankruptcy
(Data is priceless; loss can be catastrophic.)
2.2 Role of Backup
2.2.1 Storage Layer – Five Parts of Backup Configuration
Backup Sub‑Client : Execution carrier for backup tasks.
Storage Strategy : Includes backup media, deduplication policy, retention policy, write I/O count.
Backup Content : What to back up and what to exclude.
Backup Policy : Deduplication strategy, backup type, backup schedule.
Performance Optimization : Client read stream count.
2.2.2 Cloud Computing Layer
Cloud Server Backup Service (CSBS): Provides whole‑machine backup for cloud servers, supporting local snapshots and remote replication, enabling data recovery and ensuring business safety.
Volume Backup Service (VBS): Allows creating backups of cloud disks and rolling back using backup data, maximizing data correctness and security.
2.2.3 Replication Types
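Synchronous Replication: Each write is copied to the remote site in real time, before the write is acknowledged.
Asynchronous Replication: Writes are copied to the remote site after a delay, so the remote copy may lag behind the source.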
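The difference between the two replication types can be illustrated with a toy model. This is a minimal sketch, not a real replication engine: the class `ReplicatedVolume` and its methods are hypothetical names, and the "network" is just an in-memory queue.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a primary volume replicating writes to a secondary copy."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []       # writes committed at the production site
        self.secondary = []     # writes committed at the DR site
        self.pending = deque()  # acknowledged but not yet shipped (async only)

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the remote copy is updated before the write returns.
            self.secondary.append(block)
        else:
            # Asynchronous: return at once and ship the block in a later cycle.
            self.pending.append(block)

    def replicate_cycle(self) -> None:
        """Ship every queued write to the secondary (one asynchronous cycle)."""
        while self.pending:
            self.secondary.append(self.pending.popleft())

    def lost_on_primary_failure(self) -> list:
        """Writes that exist only at the primary and would be lost if it failed now."""
        return list(self.pending)


sync_vol = ReplicatedVolume(synchronous=True)
async_vol = ReplicatedVolume(synchronous=False)
for blk in ["w1", "w2", "w3"]:
    sync_vol.write(blk)
    async_vol.write(blk)

print(sync_vol.lost_on_primary_failure())   # []
print(async_vol.lost_on_primary_failure())  # ['w1', 'w2', 'w3']
async_vol.replicate_cycle()
print(async_vol.lost_on_primary_failure())  # []
```

In the model, the synchronous volume never has unshipped writes, while the asynchronous volume is exposed to data loss until the next replication cycle completes.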
2.3 Role of Disaster Recovery
2.3.1 Application Scenarios
Local High‑Availability (HA)
Active‑Standby (AS)
Active‑Active (AA) data centers
Two‑Site‑Three‑Center (3DC) – cascade/parallel
2.3.2 Solution Overview
Local Production Center: Implements local HA solutions.
In‑City DR (<100km): Active‑Active or Active‑Standby solutions.
Inter‑City DR (>100km): Two‑Site‑Three‑Center or Active‑Standby solutions.
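The distance-based overview above can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, a distance of 0 is assumed to mean the local production center, and exactly 100 km (which the overview leaves unspecified) is assumed to count as inter-city.

```python
from typing import List


def candidate_solutions(distance_km: float) -> List[str]:
    """Return the DR schemes the overview lists for a given site distance."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    if distance_km == 0:
        # Assumption: 0 km means protection inside the production center.
        return ["Local High-Availability"]
    if distance_km < 100:
        # In-city DR (<100 km).
        return ["Active-Active", "Active-Standby"]
    # Inter-city DR; treating exactly 100 km as inter-city is an assumption.
    return ["Two-Site-Three-Center", "Active-Standby"]


print(candidate_solutions(40))   # ['Active-Active', 'Active-Standby']
print(candidate_solutions(800))  # ['Two-Site-Three-Center', 'Active-Standby']
```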
2.3.3 Local High‑Availability Scheme
Advantages: Zero business interruption, zero data loss, high reliability.
Uses real‑time mirroring and synchronous replication; RPO = 0 is typically achievable because of the short distance and high bandwidth.
2.3.4 Active‑Standby Scheme
Advantages: RPO close to 0, low TCO, cross‑vendor storage compatibility, centralized topology and alerts, automated one‑click DR drills and recovery.
Key Technology: HyperReplication.
2.3.5 Active‑Active Data Center Scheme
Advantages: Six‑layer active‑active architecture, zero business interruption, zero data loss.
Key Technology: HyperMetro.
2.3.6 Two‑Site‑Three‑Center (Cascade/Parallel)
Network Type | Advantages | Disadvantages
Cascade | Minimal impact on production performance. | During regional disasters, if the same‑city DR site fails, RPO becomes large due to asynchronous replication.
Parallel | Effectively avoids cascade drawbacks during regional disasters. | Higher performance requirements on the production center.
3. Measurement of Disaster Recovery
3.1 Backup Types
Backup Window: The time interval during which backup can be performed without affecting normal business operations.
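Whether a job fits its backup window is largely arithmetic: data volume divided by sustained throughput. The sketch below assumes a constant throughput and 1 GB = 1024 MB; the function names and the figures in the example are illustrative, not taken from any product.

```python
def backup_duration_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Estimate how long a backup takes at a constant sustained throughput."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1024) / throughput_mb_s  # assumes 1 GB = 1024 MB
    return seconds / 3600


def fits_window(data_gb: float, throughput_mb_s: float, window_hours: float) -> bool:
    """True if the estimated backup duration fits inside the backup window."""
    return backup_duration_hours(data_gb, throughput_mb_s) <= window_hours


# 2 TB of data and a 6-hour nightly window (illustrative numbers).
print(round(backup_duration_hours(2048, 200), 2))  # 2.91
print(fits_window(2048, 200, 6))                   # True
print(round(backup_duration_hours(2048, 50), 2))   # 11.65
print(fits_window(2048, 50, 6))                    # False
```

When the estimate exceeds the window, the options are to raise throughput, shrink the data moved per run (for example with the incremental types described next), or widen the window.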
3.1.1 Full Backup
Creates a complete copy of all data at a point in time. Advantages: Small recovery window. Disadvantages: Large storage consumption, long backup time.
3.1.2 Cumulative Incremental Backup
Based on the last full backup, backs up all changes since then. Advantages: Saves storage, smaller backup and recovery windows. Disadvantages: Recovery depends on the last full backup plus the incremental set.
3.1.3 Differential Incremental Backup
Based on the last backup (full or differential), backs up changes since that point. Advantages: Minimal storage usage, small backup window. Disadvantages: Recovery requires the last full backup and each differential set, leading to longer reconstruction time.
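The recovery dependencies of the three backup types can be made concrete by computing which backup sets are needed to restore the latest state. This is a minimal sketch: the catalog format (a chronological list of `(label, kind)` tuples) and the function name `restore_chain` are assumptions made for illustration.

```python
from typing import List, Tuple

KINDS = {"full", "cumulative", "differential"}


def restore_chain(catalog: List[Tuple[str, str]]) -> List[str]:
    """Return the backups needed to restore the newest state, oldest first.

    catalog is ordered oldest to newest; each entry is (label, kind).
    """
    chain = []
    i = len(catalog) - 1
    while i >= 0:
        label, kind = catalog[i]
        if kind not in KINDS:
            raise ValueError(f"unknown backup kind: {kind}")
        chain.append(label)
        if kind == "full":
            return list(reversed(chain))
        i -= 1
        if kind == "cumulative":
            # A cumulative set holds all changes since the last full backup,
            # so skip straight back to that full backup.
            while i >= 0 and catalog[i][1] != "full":
                i -= 1
        # A differential set holds changes since the previous backup only,
        # so the loop simply continues with the previous entry.
    raise ValueError("no full backup found in catalog")


cumulative_week = [("sun-full", "full"), ("mon", "cumulative"),
                   ("tue", "cumulative"), ("wed", "cumulative")]
differential_week = [("sun-full", "full"), ("mon", "differential"),
                     ("tue", "differential"), ("wed", "differential")]

print(restore_chain(cumulative_week))    # ['sun-full', 'wed']
print(restore_chain(differential_week))  # ['sun-full', 'mon', 'tue', 'wed']
```

The cumulative policy restores from two sets regardless of how many days have passed, while the differential policy needs every set since the last full backup, which is why its reconstruction time is longer.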
3.1.4 Backup Strategy Principles
Combine full backup with either cumulative or differential incremental backups, but avoid mixing cumulative and differential in the same policy.
In environments with strict storage and window constraints, prefer full + differential backups.
3.2 DR Metrics
3.2.1 Recovery Point Objective (RPO)
Maximum tolerable data loss, measured in time. Example: If the last backup ran at 08:00 and a failure occurs at 09:00, one hour of data is lost; this is acceptable only if the RPO is at least 1 hour.
3.2.2 Recovery Time Objective (RTO)
Maximum tolerable business interruption time. Example: Must restore within 12 hours after a disaster → RTO = 12 hours.
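The two metrics can be checked against an actual incident with simple time arithmetic. The sketch below reuses the figures from the examples above (backup at 08:00, failure at 09:00, a 12-hour RTO); the 19:00 restoration time, the 1-hour RPO, and the function name `check_objectives` are assumptions added for illustration.

```python
from datetime import datetime, timedelta


def check_objectives(last_recovery_point: datetime,
                     failure_time: datetime,
                     service_restored: datetime,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare the actual data loss and downtime of an incident with RPO and RTO."""
    data_loss = failure_time - last_recovery_point  # work done since the last copy
    downtime = service_restored - failure_time      # how long the business was down
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


day = datetime(2024, 1, 1)  # arbitrary date for the example
result = check_objectives(
    last_recovery_point=day.replace(hour=8),
    failure_time=day.replace(hour=9),
    service_restored=day.replace(hour=19),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=12),
)
print(result["data_loss"], result["rpo_met"])  # 1:00:00 True
print(result["downtime"], result["rto_met"])   # 10:00:00 True
```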
3.2.3 Comprehensive Standards
DR Capability Level | RTO | RPO
1 | >2 days | 1–7 days
2 | >24 hours | 1–7 days
3 | >12 hours | Few hours–1 day
4 | Few hours–2 days | Few hours–1 day
5 | Few minutes–2 days | 0–30 minutes
6 | Few minutes | 0
4. Implementation of Disaster Recovery
4.1 Backup Methods
Refer to the three classic backup methods (LAN‑Base, LAN‑Free, Server‑Free).
4.1.1 LAN‑Base
Installs a backup agent on production servers and a backup server; simple but unsuitable for very large data volumes due to LAN bandwidth consumption.
4.1.2 LAN‑Free
Data flows from the file server through an FC switch directly to tape, bypassing the LAN and reducing network load, though the file server still participates in the I/O path.
4.1.3 Server‑Free
Backup data does not pass through the server’s bus or memory; the file server issues SCSI replication commands and the storage system copies data directly to tape, greatly reducing server load.
Another option is NDMP (Network Data Management Protocol), which lets storage devices communicate directly with backup targets without involving the host.
4.2 Backup Media
Disk arrays
Tape libraries
Virtual tape libraries
Optical libraries
Cloud storage
Integrated appliances (e.g., Huawei HDP3500E)
4.3 Backup Design Principles
Customer requirements (data type, volume, objects).
Backup strategy (frequency, timing).
Network planning (bandwidth, topology).
Storage planning (current volume, growth).
4.4 DR Methods
4.5 DR Technologies
4.5.1 Host‑Level Data Replication
Installs replication software on servers in both production and DR sites; requires network connectivity and may consume significant host and network resources.
4.5.2 Network‑Level Data Replication
Similar to host‑level but focuses on network devices; still resource‑intensive on servers.
4.5.3 Storage‑Level Data Replication
Deploys a pair of storage systems that handle replication internally, often via direct fiber links or DWDM; minimal impact on application servers and widely used for high‑availability DR solutions.
Author: SkyBiuBiu Original link: https://www.cnblogs.com/Skybiubiu/p/14992848.html