Availability and Redundancy Concepts
Availability is measured as the percentage of time a system is operational. 99.9% ('three nines') = about 8.76 hours of downtime per year. 99.99% ('four nines') = about 52.6 minutes. 99.999% ('five nines') = about 5.26 minutes. Each additional '9' cuts the permitted downtime by a factor of ten and requires significantly more redundancy and cost. An SLA (Service Level Agreement) specifies the availability commitment between a service provider and a customer.
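The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name and constant are illustrative, not from any standard library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The unavailable fraction of the year is (100 - availability) / 100.
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Running it for three, four, and five nines shows each extra nine dividing the downtime budget by ten.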
MTTR (Mean Time To Repair): average time to restore a failed component. MTBF (Mean Time Between Failures): average operating time between failures. Higher MTBF = more reliable hardware. Lower MTTR = faster recovery (on-hand spare parts, documented procedures, monitoring). Availability = MTBF / (MTBF + MTTR), with both values in the same unit. Both metrics inform high-availability (HA) design.
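A short worked example of the availability formula may help. The MTBF and MTTR values below are made-up illustration figures, not vendor data, and the function name is hypothetical:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), both expressed in the same unit."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical device: fails on average every 10,000 hours.
    slow_repair = availability(mtbf_hours=10_000, mttr_hours=4)
    fast_repair = availability(mtbf_hours=10_000, mttr_hours=2)
    print(f"MTTR 4 h: {slow_repair:.5%}")
    print(f"MTTR 2 h: {fast_repair:.5%}")
```

Comparing the two calls shows that shortening repair time raises availability even when the hardware itself is no more reliable.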
Redundancy eliminates single points of failure (SPOF): dual power supplies, redundant links (EtherChannel, dual ISP), redundant switches, server clustering, and RAID storage. Hot spare: a powered-on, already configured standby component that takes over immediately (no manual intervention). Warm spare: installed but needs brief activation or minimal configuration before taking over. Cold spare: kept in storage; requires manual installation and configuration.
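The benefit of removing a single point of failure can be estimated with standard reliability arithmetic. The sketch below assumes component failures are independent, which real deployments only approximate (shared power, shared software bugs). The function names and the 99% figures are illustrative assumptions:

```python
import math
from typing import Sequence


def series_availability(parts: Sequence[float]) -> float:
    """All parts are required: the chain is up only if every part is up."""
    return math.prod(parts)


def parallel_availability(parts: Sequence[float]) -> float:
    """Redundant parts: the group is down only if every part is down."""
    return 1 - math.prod(1 - a for a in parts)


if __name__ == "__main__":
    single_psu = 0.99  # hypothetical availability of one power supply
    dual_psu = parallel_availability([single_psu, single_psu])
    print(f"one PSU:  {single_psu:.4%}")
    print(f"two PSUs: {dual_psu:.4%}")
    # A redundant pair still sits in series with any non-redundant part.
    print(f"two PSUs + one 99% switch: {series_availability([dual_psu, 0.99]):.4%}")
```

The last line illustrates why every SPOF matters: a redundant pair placed behind one non-redundant device is limited by that device.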
First Hop Redundancy Protocols
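An FHRP (First Hop Redundancy Protocol) provides default-gateway redundancy for hosts, which are normally configured with a single default gateway address. Without an FHRP, if the default gateway router fails, all hosts on that subnet lose connectivity to everything outside the subnet (other VLANs and the internet), even if another router is physically available.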
HSRP (Hot Standby Router Protocol): Cisco proprietary. Two or more routers share a virtual IP and virtual MAC. One router is Active (forwards traffic), one is Standby (ready to take over), and any additional routers listen. If the Active router fails, the Standby router takes over. Hosts configure the virtual IP as their default gateway. Uses UDP port 1985.
VRRP (Virtual Router Redundancy Protocol): IETF standard (RFC 5798), a vendor-neutral equivalent of HSRP. One Master router, one or more Backup routers.
GLBP (Gateway Load Balancing Protocol): Cisco proprietary. Allows multiple routers to actively forward traffic simultaneously (load balancing, not just failover).
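The failover idea behind HSRP and VRRP can be sketched as a small election. This is a simplified model, not either protocol's state machine: it assumes the highest priority wins with the highest interface IP as tie-breaker, that preemption is enabled, and that router health is already known. The `Router` class and `elect_forwarder` function are hypothetical names for illustration:

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional, Sequence


@dataclass(frozen=True)
class Router:
    name: str
    ip: IPv4Address
    priority: int = 100  # 100 is the default priority in HSRP and VRRP
    alive: bool = True


def elect_forwarder(routers: Sequence[Router]) -> Optional[Router]:
    """Pick the router that should own the virtual IP (Active / Master)."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None  # no gateway left: hosts lose off-subnet connectivity
    # Highest priority wins; highest IP address breaks ties.
    return max(candidates, key=lambda r: (r.priority, r.ip))


if __name__ == "__main__":
    r1 = Router("R1", IPv4Address("10.0.0.2"), priority=110)
    r2 = Router("R2", IPv4Address("10.0.0.3"))
    print("normal:", elect_forwarder([r1, r2]).name)
    r1_failed = Router("R1", IPv4Address("10.0.0.2"), priority=110, alive=False)
    print("after R1 fails:", elect_forwarder([r1_failed, r2]).name)
```

Hosts never see the change because they keep sending to the same virtual IP; only the router answering for it differs.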
NIC Teaming and Link Redundancy
NIC teaming (bonding): combining multiple physical NICs on a server into a single logical interface. Provides bandwidth aggregation and/or failover. Active-passive: one NIC active, the other on standby. Active-active: both NICs carry traffic simultaneously. Active-active modes that aggregate the links into one bundle generally require matching switch configuration, typically LACP (IEEE 802.3ad); some switch-independent teaming modes do not.
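The difference between the two teaming modes can be illustrated with a toy link selector. This is a conceptual sketch only, not how any operating system's bonding driver is implemented. The `Nic` class, function names, and the CRC32 flow hash are assumptions chosen for illustration:

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Nic:
    name: str
    link_up: bool = True


def pick_active_passive(nics: List[Nic]) -> Optional[Nic]:
    """Use the first NIC in preference order whose link is up."""
    for nic in nics:
        if nic.link_up:
            return nic
    return None


def pick_active_active(nics: List[Nic], flow_key: str) -> Optional[Nic]:
    """Spread flows across all healthy NICs; one flow always maps to one NIC."""
    healthy = [nic for nic in nics if nic.link_up]
    if not healthy:
        return None
    # Hashing the flow identifier keeps a flow's packets on a single link.
    return healthy[zlib.crc32(flow_key.encode()) % len(healthy)]


if __name__ == "__main__":
    team = [Nic("eth0"), Nic("eth1")]
    print("active-passive:", pick_active_passive(team).name)
    for flow in ("10.0.0.5:443", "10.0.0.6:443", "10.0.0.7:22"):
        print(flow, "->", pick_active_active(team, flow).name)
    team[0].link_up = False  # simulate a cable pull on eth0
    print("after failure:", pick_active_passive(team).name)
```

In both modes the server stays reachable after one link fails; only active-active also uses the second link's bandwidth while both are healthy.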
Dual-homed servers: servers connected to two different switches for switch-level redundancy. If one switch fails, the server remains connected via the other switch. Clustering: multiple servers operate as a single logical service. Active-active clusters process requests across all nodes; active-passive clusters have standby nodes that activate on failure.