Disaster Recovery Key Concepts
Disaster recovery (DR): the processes and tools used to restore IT systems after a catastrophic event (hardware failure, natural disaster, ransomware, fire, flood).
Business Continuity Plan (BCP): broader than DR; covers all aspects of keeping the business running during a disruption.
DR plan: the IT-focused subset of the BCP.
RPO (Recovery Point Objective): the maximum acceptable data loss, measured as time. How far back can data be lost? If RPO = 4 hours, backups must occur at least every 4 hours. Determines backup frequency.
RTO (Recovery Time Objective): the maximum acceptable downtime. How long can the business be offline? If RTO = 2 hours, systems must be restored within 2 hours of a disaster. Determines infrastructure investment (hot site vs cold site).
MTTR (Mean Time to Repair): average time to restore a failed component. Lower is better.
MTBF (Mean Time Between Failures): average operating time between failures. Higher is better for reliability.
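The relationship between these metrics can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function names and sample figures are hypothetical. The availability calculation uses the common steady-state formula MTBF / (MTBF + MTTR), which is not stated in the notes above.

```python
def meets_rpo(backup_interval_hours: float, rpo_hours: float) -> bool:
    """Worst-case data loss equals the gap between backups,
    so that gap must not exceed the RPO."""
    return backup_interval_hours <= rpo_hours


def meets_rto(estimated_restore_hours: float, rto_hours: float) -> bool:
    """The estimated time to restore service must fit inside the RTO."""
    return estimated_restore_hours <= rto_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures for illustration only.
print(meets_rpo(backup_interval_hours=6, rpo_hours=4))      # backups every 6 h against a 4 h RPO
print(meets_rto(estimated_restore_hours=1.5, rto_hours=2))  # 1.5 h restore against a 2 h RTO
print(f"{availability(mtbf_hours=1000, mttr_hours=2):.4%}")
```

The first check fails because a 6-hour backup interval can lose up to 6 hours of data, which exceeds a 4-hour RPO.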
Backup Types and Strategies
Full backup: backs up all data. Longest to complete, shortest to restore (1 restore operation).
Differential backup: backs up all data changed since the last full backup, so each differential is larger than the one before it. Restore: last full + latest differential = 2 restore operations.
Incremental backup: backs up only data changed since the last backup of any type (full or incremental). Fastest to complete, slowest to restore: requires the last full plus every incremental since then, applied in sequence.
3-2-1 backup rule: 3 copies of data, on 2 different media types, with 1 copy offsite.
Offsite backup: protects against site-level disasters (fire, flood).
Cloud backup: offsite by nature.
Backup rotation schemes: grandfather-father-son (GFS) retains daily (son), weekly (father), and monthly (grandfather) backups.
Backup testing: regularly restore test files to verify backup integrity.
Tape rotation: tapes are reused on a schedule, with selected tapes set aside for long-term archival (LTO is the common tape format).
Snapshot: point-in-time copy of data. Not a substitute for a full backup, because a snapshot depends on the original data.
Immutable backup: write-once backup that cannot be modified or deleted, so ransomware cannot encrypt it.
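The difference in restore effort between differential and incremental schedules can be shown with a short sketch. The `Backup` class and `restore_chain` function below are hypothetical names made up for this example, and the sketch assumes a schedule uses either differentials or incrementals after each full backup, not a mix of both.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    day: int   # hypothetical day number on which the backup ran
    kind: str  # "full", "differential", or "incremental"


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Return the backups to restore, in order, to reach the latest state."""
    ordered = sorted(backups, key=lambda b: b.day)
    full_positions = [i for i, b in enumerate(ordered) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available to restore from")
    # Only the most recent full backup and whatever follows it matter.
    start = full_positions[-1]
    full, later = ordered[start], ordered[start + 1:]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("mixed schedules are outside the scope of this sketch")
    if differentials:
        # The latest differential already holds every change since the full.
        return [full, differentials[-1]]
    # Every incremental since the full must be applied in sequence.
    return [full] + incrementals


incremental_week = [Backup(0, "full")] + [Backup(d, "incremental") for d in (1, 2, 3)]
differential_week = [Backup(0, "full")] + [Backup(d, "differential") for d in (1, 2, 3)]

print([b.day for b in restore_chain(incremental_week)])   # [0, 1, 2, 3]
print([b.day for b in restore_chain(differential_week)])  # [0, 3]
```

The incremental schedule needs four restore operations after three days, while the differential schedule always needs two.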
Recovery Sites
Hot site: fully configured, operational duplicate of the primary site. Hardware, software, and up-to-date data replicas are maintained continuously, so it can take over almost immediately (RTO = minutes). Most expensive.
Warm site: partially configured site. Hardware is present but needs software configuration and data restoration. RTO = hours to days. Moderate cost.
Cold site: facility with power, cooling, and connectivity but no hardware pre-deployed. Hardware must be shipped and configured during a disaster. RTO = days to weeks. Least expensive.
Cloud recovery: virtual machines and infrastructure deployed in the cloud (AWS, Azure, GCP) on demand during a disaster. Standby resources cost little while idle and can be scaled up quickly when needed. Cloud-based DR increasingly replaces physical recovery sites for many organizations.
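The cost versus RTO trade-off can be sketched as a simple lookup: pick the least expensive site type whose recovery time still fits the required RTO. The hour figures below are assumptions chosen only to stand in for "minutes", "hours to days", and "days to weeks"; real values depend on the organization. Cloud recovery is left out because its recovery time depends on how much is kept running on standby.

```python
from typing import Optional

# (site type, assumed worst-case recovery time in hours),
# listed from least to most expensive. Illustrative figures only.
SITE_OPTIONS = [
    ("cold", 14 * 24),  # days to weeks
    ("warm", 3 * 24),   # hours to days
    ("hot", 1),         # minutes
]


def cheapest_site(rto_hours: float) -> Optional[str]:
    """Return the least expensive site type that can meet the RTO,
    or None if even a hot site is too slow under these assumptions."""
    for name, worst_case_hours in SITE_OPTIONS:
        if worst_case_hours <= rto_hours:
            return name
    return None


print(cheapest_site(2))        # hot
print(cheapest_site(4 * 24))   # warm
print(cheapest_site(30 * 24))  # cold
```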
Redundancy and High Availability
Redundancy: duplicating critical components to eliminate single points of failure.
Server redundancy: clustering (multiple servers share the workload and automatically fail over if one fails).
Storage redundancy: RAID, storage area networks (SAN), replication to a secondary site.
Network redundancy: multiple ISP connections, redundant switches and routers, with Spanning Tree Protocol (STP) preventing switching loops on the redundant links.
Power redundancy:
UPS (Uninterruptible Power Supply): provides battery backup during brief outages and protects against power surges.
Generator: provides extended power during prolonged outages.
PDU (Power Distribution Unit): distributes power to rack equipment.
Dual power supplies in servers: if one supply fails, the other continues.
High availability (HA): design goal of maximum uptime. '5 nines' (99.999%) uptime = about 5.26 minutes of downtime per year.
Failover: automatic switch to a backup system when the primary fails.
Failback: returning to the primary system after recovery.
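The '5 nines' figure follows from simple arithmetic, which a minimal sketch makes explicit. The function name is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, percent in [("3 nines", 99.9), ("4 nines", 99.99), ("5 nines", 99.999)]:
    print(f"{label} ({percent}%): {downtime_minutes_per_year(percent):.2f} minutes/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, from roughly 525.6 minutes per year at 99.9% to roughly 5.26 minutes at 99.999%.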
DR Plan Components
A DR plan documents procedures for responding to disasters. Key sections:
Risk assessment: identify threats (fire, flood, ransomware, hardware failure) and their likelihood and impact.
Business Impact Analysis (BIA): identify critical systems and processes, and define an RPO and RTO for each.
Recovery procedures: step-by-step procedures for restoring each critical system.
Contact lists: emergency contacts for staff, vendors, ISPs, cloud providers.
Communication plan: how to notify stakeholders (employees, customers, management) during an incident.
DR drills and testing: tabletop exercises (walk through the plan without actual recovery), simulation (test recovery in an isolated environment), actual failover test (most thorough, but disruptive).
Plan maintenance: update the DR plan when systems change, after incidents, and on a scheduled review cycle (annually at minimum).
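The output of a BIA can be recorded as a small table of per-system objectives. The sketch below is one possible representation, with hypothetical system names and figures. It orders systems by RTO as a simple heuristic for recovery priority, assuming there are no dependencies between systems. A real plan would also account for those dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SystemObjective:
    name: str
    rpo_hours: float  # maximum acceptable data loss
    rto_hours: float  # maximum acceptable downtime


def recovery_order(systems: List[SystemObjective]) -> List[str]:
    """Order systems by tightest RTO first, using RPO to break ties."""
    ranked = sorted(systems, key=lambda s: (s.rto_hours, s.rpo_hours))
    return [s.name for s in ranked]


# Hypothetical BIA results for illustration only.
bia = [
    SystemObjective("file-archive", rpo_hours=24, rto_hours=72),
    SystemObjective("order-database", rpo_hours=1, rto_hours=2),
    SystemObjective("email", rpo_hours=4, rto_hours=8),
]

print(recovery_order(bia))  # ['order-database', 'email', 'file-archive']
```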