SNMP Monitoring
SNMP (Simple Network Management Protocol) is the most widely used network device monitoring protocol. SNMP manager: the monitoring server (NMS, Network Management System). SNMP agent: software running on network devices (routers, switches, servers) that answers the manager's requests. MIB (Management Information Base): a hierarchical database of variables, each identified by an OID (object identifier), that can be queried or set on a device, for example CPU usage, interface counters, and error rates.
SNMP versions: v1 and v2c use community strings (a shared secret sent in plain text), so there is no encryption and only weak authentication. v2c adds bulk transfers (GETBULK) and 64-bit counters. v3 adds per-user authentication (MD5/SHA) and encryption (DES/AES), making it the only secure version. Always use SNMPv3 on production networks.
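The difference between v2c and v3 shows up clearly in how a query is issued. The sketch below only builds argument lists for the net-snmp `snmpget` command; it does not run anything. The host address, user name, passphrases, and function names are placeholders chosen for illustration, and it assumes the device supports SHA authentication and AES privacy.

```python
from typing import List

# sysUpTime.0, a standard MIB-II object that most agents expose.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"


def snmpget_v2c_args(host: str, community: str, oid: str) -> List[str]:
    """Arguments for an SNMPv2c GET: only a plain-text community string."""
    return ["snmpget", "-v", "2c", "-c", community, host, oid]


def snmpget_v3_args(host: str, user: str, auth_pass: str,
                    priv_pass: str, oid: str) -> List[str]:
    """Arguments for an SNMPv3 GET with authentication and encryption (authPriv)."""
    return [
        "snmpget", "-v", "3",
        "-l", "authPriv",            # security level: authenticate and encrypt
        "-u", user,                  # SNMPv3 user name
        "-a", "SHA", "-A", auth_pass,  # authentication protocol and passphrase
        "-x", "AES", "-X", priv_pass,  # privacy (encryption) protocol and passphrase
        host, oid,
    ]


# Placeholder values for illustration only.
v2c = snmpget_v2c_args("192.0.2.10", "public", SYS_UPTIME_OID)
v3 = snmpget_v3_args("192.0.2.10", "monitor", "auth-secret", "priv-secret",
                     SYS_UPTIME_OID)
print(" ".join(v2c))
print(" ".join(v3))
```

On a host with net-snmp installed, either list could be passed to `subprocess.run`. Note that passphrases given on a command line are visible to other local users, so a real deployment would more likely keep them in a protected configuration file.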
SNMP polling: the NMS periodically queries (GET) devices for metrics; agents listen on UDP 161. SNMP traps: devices proactively send alerts to the NMS when significant events occur (interface down, high CPU, authentication failure), using UDP 162. Informs: like traps but with acknowledgment. The NMS acknowledges receipt, and the device resends if no acknowledgment arrives. Traps are fire-and-forget; informs are reliable.
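Polled interface counters only become useful once two readings are turned into a rate. A minimal sketch of that arithmetic follows, assuming octet counters such as ifInOctets (32-bit) or ifHCInOctets (64-bit) and at most one counter wrap between polls. The function names and sample values are illustrative.

```python
def counter_delta(prev: int, curr: int, bits: int = 64) -> int:
    """Difference between two counter readings, allowing for one wraparound.

    SNMP counters only increase and wrap to zero at 2**bits, so the
    difference is taken modulo the counter size.
    """
    return (curr - prev) % (1 << bits)


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_s: float, link_bps: float,
                        bits: int = 64) -> float:
    """Average link utilization over one polling interval, in percent."""
    delta_bits = counter_delta(prev_octets, curr_octets, bits) * 8
    return 100.0 * delta_bits / (interval_s * link_bps)


# A 32-bit counter that wrapped between two polls.
wrapped = counter_delta(4_294_000_000, 1_000_000, bits=32)
print(wrapped)  # 1967296

# 3.75 GB transferred in 60 s on a 1 Gbps link.
util = utilization_percent(0, 3_750_000_000, 60, 1_000_000_000)
print(util)  # 50.0
```

A 32-bit octet counter on a busy 1 Gbps link can wrap in well under a minute, which is why the 64-bit counters added in v2c matter for high-speed interfaces.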
Flow-Based Monitoring
NetFlow (Cisco) / IPFIX (open standard) / sFlow: capture summary information about traffic flows rather than full packet content. A flow is the set of packets sharing the same source IP, destination IP, source port, destination port, and protocol (the 5-tuple). Flow data shows who is talking to whom, how much traffic they exchange, and which applications are in use. Used for bandwidth analysis, capacity planning, and security (detecting anomalous traffic patterns).
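The flow definition above can be sketched as a grouping operation: packets with the same 5-tuple are folded into one record with packet and byte counts. This is an illustration of the idea, not how any particular exporter is implemented; the `Packet` type, function names, and sample addresses are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size_bytes: int


FlowKey = Tuple[str, str, int, int, str]


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, Dict[str, int]]:
    """Group packets by 5-tuple and total their packet and byte counts."""
    flows: Dict[FlowKey, Dict[str, int]] = defaultdict(
        lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size_bytes
    return dict(flows)


def top_talkers(flows: Dict[FlowKey, Dict[str, int]],
                n: int = 3) -> List[Tuple[FlowKey, Dict[str, int]]]:
    """Return the n flows that carried the most bytes."""
    return sorted(flows.items(), key=lambda kv: kv[1]["bytes"], reverse=True)[:n]


packets = [
    Packet("10.0.0.5", "10.0.0.20", 51000, 443, "TCP", 1500),
    Packet("10.0.0.5", "10.0.0.20", 51000, 443, "TCP", 900),
    Packet("10.0.0.7", "10.0.0.53", 40000, 53, "UDP", 80),
    # Return traffic has a different 5-tuple, so it is a separate flow.
    Packet("10.0.0.20", "10.0.0.5", 443, 51000, "TCP", 600),
]

flows = aggregate_flows(packets)
for key, stats in top_talkers(flows):
    print(key, stats)
```

Four packets collapse into three flow records here, which is the reason flow export scales better than full packet capture.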
NetFlow collector: receives and stores flow records from routers/switches. Flow analyzer: processes records for reports and alerts. sFlow is a sampling-based technology (samples 1 in N packets) — lower overhead than NetFlow on high-speed links. IPFIX is the IETF standard based on NetFlow v9.
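With 1-in-N sampling, the collector scales the sampled counts back up by N to estimate total traffic. The sketch below shows that scaling. It picks every Nth packet deterministically so the result is reproducible, which is a simplification; real sFlow agents sample randomly at an average rate of 1 in N. All names and packet sizes are illustrative.

```python
from typing import List, Sequence, Tuple


def sample_every_nth(packet_sizes: Sequence[int], n: int) -> List[int]:
    """Keep one packet out of every n (deterministic stand-in for random sampling)."""
    return list(packet_sizes[n - 1::n])


def estimate_totals(sampled_sizes: Sequence[int], n: int) -> Tuple[int, int]:
    """Scale sampled packet and byte counts by the sampling rate n."""
    return len(sampled_sizes) * n, sum(sampled_sizes) * n


# 10,000 packets cycling through three sizes (bytes).
sizes = [(1500, 500, 100)[i % 3] for i in range(10_000)]
actual_bytes = sum(sizes)

samples = sample_every_nth(sizes, 100)
est_packets, est_bytes = estimate_totals(samples, 100)

print(len(samples), est_packets)   # 100 10000
print(actual_bytes, est_bytes)     # 7000800 7080000
```

The device inspects only 100 packets instead of 10,000, and the byte estimate lands within about 1% of the true total in this example. The estimate is approximate by nature, and its accuracy improves as more samples are collected.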
Monitoring Tools and Alerting
Common monitoring platforms: PRTG Network Monitor (Windows, GUI-driven, sensor-based), Nagios (open source, runs on Linux, monitors Windows hosts through agents, highly extensible), Zabbix (open source, enterprise features), SolarWinds NPM (enterprise, expensive), Datadog (cloud-based SaaS), Grafana + Prometheus (open source metrics stack).
Alerting thresholds: configure alerts when metrics exceed normal ranges. Examples: CPU > 80% for 5 minutes, interface utilization > 90%, ping RTT > 100ms. Alerts should be actionable: too many false positives lead to alert fatigue, where real problems are ignored. Base thresholds on established baselines of normal behavior.
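Two of these ideas translate directly into code: requiring a breach to be sustained (CPU > 80% for 5 minutes) and deriving a threshold from a baseline. The sketch below assumes one sample per minute and uses mean plus k standard deviations as the baseline rule. That rule is one common choice picked for illustration, not something the monitoring tools listed earlier necessarily use, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     required: int) -> bool:
    """True if the most recent `required` samples all exceed the threshold."""
    if len(samples) < required:
        return False
    return all(value > threshold for value in samples[-required:])


def baseline_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Threshold set k sample standard deviations above the historical mean."""
    return mean(history) + k * stdev(history)


# One CPU reading per minute, so 5 samples cover 5 minutes.
steady_high = [70, 85, 82, 90, 88, 91]
brief_dip = [85, 82, 79, 88, 91]

print(sustained_breach(steady_high, 80, 5))  # True
print(sustained_breach(brief_dip, 80, 5))    # False

# Threshold derived from a quiet baseline period.
limit = baseline_threshold([20, 22, 18, 21, 19])
print(round(limit, 2))  # 24.74
```

Requiring several consecutive breaches filters out short spikes, which is one practical way to reduce the false positives that cause alert fatigue.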