Multi-AZ and Multi-Region architectures
Availability Zones are physically separate groups of one or more data centers within a region, connected by low-latency links. Deploying across multiple AZs protects against a single data center or AZ failure. If AZ-a loses power, your resources in AZ-b and AZ-c keep serving traffic. This is the minimum acceptable architecture for production workloads. Single-AZ deployments have no protection against AZ-level failures.
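The effect of spreading capacity across AZs can be sketched with a small simulation. The AZ names and instance counts below are illustrative, not tied to any real account:

```python
# Sketch: round-robin instances across AZs so that losing one AZ
# still leaves most of the fleet serving traffic.

def spread_across_azs(instance_count, azs):
    """Place instances round-robin across the given AZs."""
    placement = {az: 0 for az in azs}
    for i in range(instance_count):
        placement[azs[i % len(azs)]] += 1
    return placement

def surviving_capacity(placement, failed_az):
    """Capacity left after an entire AZ goes down."""
    return sum(n for az, n in placement.items() if az != failed_az)

placement = spread_across_azs(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
# Losing one AZ leaves 4 of 6 instances serving traffic.
print(surviving_capacity(placement, "us-east-1a"))  # 4
```

With a single-AZ placement the same failure would take surviving capacity to zero, which is why even distribution across AZs is the baseline.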
Multi-region architectures protect against entire region failures or provide lower latency to users in different geographies. They are significantly more complex and expensive. A Pilot Light architecture keeps a minimal version of the environment running in a secondary region, ready to scale up in an emergency. A Warm Standby keeps a scaled-down but fully functional version running in the secondary region. Active-Active runs full capacity in multiple regions simultaneously, with traffic distributed between them.
For RDS, Multi-AZ creates a synchronous standby replica in a different AZ. If the primary fails, AWS automatically promotes the standby, typically within 60 to 120 seconds, by flipping the DNS endpoint to point at it. The standby is not readable during normal operation. Read Replicas are asynchronous and readable, used for read scaling rather than HA.
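The key property of that failover is that applications never change connection strings; the same endpoint simply resolves to the promoted standby. A minimal model of that behavior, with all names (including the endpoint) hypothetical:

```python
# Sketch of the Multi-AZ failover model: applications keep using one
# DNS endpoint, and failover re-points it at the promoted standby.

class MultiAZDatabase:
    def __init__(self, primary, standby):
        # Hypothetical endpoint name; applications connect only to this.
        self.endpoint = "mydb.xyz.us-east-1.rds.example.com"
        self._active = primary    # serves reads and writes
        self._standby = standby   # synchronous replica, not readable

    def resolve(self):
        """The instance the shared endpoint currently points at."""
        return self._active

    def fail_over(self):
        """Promote the standby; the endpoint name never changes."""
        self._active, self._standby = self._standby, self._active

db = MultiAZDatabase(primary="az-a-instance", standby="az-b-instance")
db.fail_over()
print(db.resolve())  # az-b-instance
```

After `fail_over()`, clients reconnecting to `db.endpoint` reach the new primary without any configuration change, which is what keeps the recovery window to the promotion-plus-DNS time.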
Health checks, failover, and resiliency patterns
Load balancers perform health checks and route traffic only to healthy targets. If a target fails its health check, it is removed from rotation. New targets are added when Auto Scaling launches them. This combination of load balancer health checks and Auto Scaling is the foundation of elastic, self-healing architectures: when demand spikes, more instances launch; when they fail, they are replaced automatically.
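That remove-then-replace loop can be sketched in a few lines. The target group and launch function below are simplified stand-ins, not any real AWS API:

```python
# Sketch: a load balancer routes only to healthy targets, and an
# Auto Scaling-style loop replaces failed ones up to desired capacity.

class TargetGroup:
    def __init__(self, targets):
        self.healthy = set(targets)

    def mark_unhealthy(self, target):
        self.healthy.discard(target)   # removed from rotation

    def register(self, target):
        self.healthy.add(target)       # new instance enters rotation

def maintain_capacity(group, desired, launch):
    """Launch replacements until healthy count meets desired capacity."""
    while len(group.healthy) < desired:
        group.register(launch())

ids = iter(range(100))
launch = lambda: f"i-replacement-{next(ids)}"

group = TargetGroup({"i-aaa", "i-bbb", "i-ccc"})
group.mark_unhealthy("i-bbb")          # failed health check
maintain_capacity(group, desired=3, launch=launch)
print(len(group.healthy))  # 3
```

The same loop handles scale-out: raise `desired` when demand spikes, and the fleet grows; any failure just re-triggers the loop.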
Route 53 health checks enable DNS-level failover between resources, including between regions. A failover routing policy keeps a primary record active as long as health checks pass and switches to a secondary record if the primary fails. This provides region-level failover where the user-visible disruption is limited to roughly the DNS TTL.
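The routing decision itself is simple to model. The record values below are made up for illustration:

```python
# Sketch of a failover routing policy: answer with the primary record
# while its health check passes, otherwise with the secondary.

def resolve_failover(primary, secondary, primary_healthy):
    """Return the record a failover policy would answer with."""
    return primary if primary_healthy else secondary

PRIMARY = "app.us-east-1.example.com"
SECONDARY = "app.eu-west-1.example.com"

print(resolve_failover(PRIMARY, SECONDARY, primary_healthy=True))
print(resolve_failover(PRIMARY, SECONDARY, primary_healthy=False))
```

Clients that cached the old answer keep using it until their TTL expires, which is why low TTLs are typical on failover records.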
Designing for failure means assuming every component will fail and building so that failure of any single component does not bring down the system. Single points of failure (SPOF) are the enemy. If removing any one component takes down the system, that component is a SPOF and must be made redundant.
How to choose the correct answer
Availability goal: survive a single instance failure = Auto Scaling group. Survive an AZ failure = deploy across multiple AZs with a load balancer. Survive a region failure = Multi-region with Route 53 failover.
RDS HA: Multi-AZ for automatic failover. Read Replicas for read scaling, not HA.
High availability vs fault tolerance: HA = system recovers quickly (some downtime). FT = system continues without interruption. Fault tolerance costs more because active redundancy is always running.
RPO and RTO: RPO = maximum acceptable data loss (how old can the restored data be?). RTO = maximum acceptable downtime (how long to recover?). Architect backup and replication frequency to meet RPO. Architect failover speed to meet RTO.
Pilot Light: minimal standby, scales up in emergency. Warm Standby: scaled-down but functional. Active-Active: full traffic in multiple regions simultaneously.
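The RPO/RTO rules of thumb above reduce to two inequalities: backup (or replication) interval bounds worst-case data loss, and detection-plus-recovery time bounds downtime. A sketch with illustrative numbers:

```python
# Sketch: checking a design against stated RPO/RTO targets.

def meets_rpo(backup_interval_min, rpo_min):
    """Worst-case data loss equals the gap between backups."""
    return backup_interval_min <= rpo_min

def meets_rto(detection_min, recovery_min, rto_min):
    """Total recovery time is detection plus failover/restore."""
    return detection_min + recovery_min <= rto_min

# Hourly snapshots against a 1-hour RPO: just meets it.
print(meets_rpo(backup_interval_min=60, rpo_min=60))            # True
# 5 min to detect + 30 min to restore against a 15-minute RTO: fails.
print(meets_rto(detection_min=5, recovery_min=30, rto_min=15))  # False
```

This is also how the DR strategies rank: Active-Active minimizes both terms at the highest cost, Warm Standby shrinks recovery time, and Pilot Light accepts a longer recovery in exchange for a smaller standing footprint.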