SRE Principles: SLIs, SLOs, and Error Budgets
Service Level Indicator (SLI): a quantitative measurement of service behaviour. Common SLIs: request success rate (fraction of non-error responses), latency (p50/p99 response time), throughput (requests per second), and availability (% of time the service responds). Choose SLIs that reflect what users actually experience.
Service Level Objective (SLO): the target for an SLI, defined over a time window (e.g., 99.9% of requests succeed over a 30-day rolling window).
Error budget: the amount of downtime or failure you can afford while still meeting the SLO. Formula: error budget = (1 - SLO target) x time window. When the error budget is consumed, slow down feature deployments and focus on reliability work.
Service Level Agreement (SLA): the contractual commitment to customers, usually less strict than the SLO to preserve headroom, so an SLO breach does not automatically mean an SLA breach.
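The error-budget formula above can be checked with a quick worked example (illustrative numbers, using the 99.9%/30-day SLO from the text):

```python
# Error budget for a 99.9% availability SLO over a 30-day rolling window.
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in 30 days

# error budget = (1 - SLO target) x time window
error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Error budget: {error_budget_minutes:.1f} minutes")  # ~43.2 minutes
```

So a 99.9% SLO leaves roughly 43 minutes of tolerable unavailability per 30 days; a 99.99% SLO would leave only about 4.3 minutes.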
CI/CD Pipelines and Release Engineering
Cloud Build + Cloud Deploy pipeline: Cloud Build compiles, tests, and packages the application into a container image. Cloud Deploy promotes the image through environments (dev > staging > prod) with optional approval gates. Automated testing gates prevent broken builds from reaching production.
Release strategies: Blue/Green (two identical environments; switch traffic atomically, so rollback is easy), Canary (route a percentage of traffic to the new version and increase it gradually), Rolling Update (replace pods incrementally; the Kubernetes default). Traffic splitting in Cloud Run and GKE Ingress supports canary patterns natively.
GitOps: Config Sync (a GKE add-on) and Anthos Config Management sync Kubernetes manifests from a Git repository. The Git repo is the source of truth for cluster state: changes are made via pull request, not kubectl apply.
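The canary strategy above can be sketched as a weighted router plus a promotion rule. This is an illustrative simulation (the function names, step schedule, and thresholds are assumptions, not a Cloud Run or Cloud Deploy API):

```python
import random

def route_request(canary_weight: float) -> str:
    """Route one request: canary_weight is the fraction of traffic
    sent to the new revision, the rest goes to the stable revision."""
    return "canary" if random.random() < canary_weight else "stable"

def next_weight(current: float, canary_error_rate: float, max_error_rate: float) -> float:
    """Advance the canary through a fixed step schedule while it stays
    within the error-rate gate; roll back to 0% traffic otherwise."""
    if canary_error_rate > max_error_rate:
        return 0.0  # gate failed: send all traffic back to stable
    for step in (0.05, 0.25, 0.50, 1.0):  # hypothetical rollout schedule
        if step > current:
            return step
    return 1.0  # fully promoted
```

In a real pipeline the "error-rate gate" would be an SLO-based check against monitoring data before each promotion, rather than an in-process function.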
GKE Operations and Reliability
GKE cluster topology for reliability: multi-zone node pools spread pods across zones; regional clusters additionally replicate the control plane across zones. Pod Disruption Budgets (PDBs) protect services during voluntary disruptions such as node drains and upgrades.
Autoscaling: Horizontal Pod Autoscaler (HPA) scales pod replicas based on CPU/memory or custom metrics (from Cloud Monitoring, formerly Stackdriver, or Prometheus). Vertical Pod Autoscaler (VPA) adjusts resource requests and limits based on observed usage. Node auto-provisioning (NAP): GKE automatically creates node pools sized for pending pods that do not fit existing pools; useful for heterogeneous workloads.
Spot VMs (successor to preemptible VMs): up to 91% cheaper, but can be reclaimed with 30 seconds' notice; use them for fault-tolerant batch workloads, not stateful services.
Fleet management with Anthos: manage multiple GKE clusters (and clusters on other clouds) with unified policy, config, and service mesh. Anthos Service Mesh (based on Istio) provides mTLS, traffic management, and observability across the fleet.
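A minimal sketch of the arithmetic behind a PDB with minAvailable, showing why a drain evicts pods one at a time when the budget is tight (the function is hypothetical; the real enforcement is done by the Kubernetes eviction API):

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods a voluntary disruption (node drain, cluster upgrade)
    may evict right now while respecting minAvailable=min_available."""
    return max(0, healthy_pods - min_available)

# 5 healthy replicas with minAvailable=4: only one eviction at a time,
# so a node drain waits for a replacement pod to become Ready between evictions.
print(allowed_disruptions(5, 4))  # 1
print(allowed_disruptions(3, 4))  # 0 -> drain blocks until pods recover
```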
Observability Stack: Logging, Monitoring, and Tracing
Cloud Logging: structured JSON logs are parsed automatically. The Log Router sends logs to Cloud Storage, BigQuery, or Pub/Sub sinks. Log-based metrics convert log patterns into custom metrics for alerting. Exclusion filters reduce storage costs for verbose or irrelevant logs.
Cloud Monitoring: metric kinds (gauge, delta, cumulative), metric descriptors, and time series. Alerting policies combine conditions (threshold, metric absence, rate of change) with notification channels (email, PagerDuty, Slack, or Pub/Sub for custom integrations). Uptime checks probe HTTP/HTTPS/TCP endpoints from multiple global locations.
Cloud Trace: distributed tracing for latency analysis. Automatically integrated with App Engine, Cloud Run, and Cloud Functions; GKE workloads are instrumented with OpenTelemetry. The trace sampling rate trades completeness for cost.
Cloud Profiler: continuous, low-overhead profiling (CPU time, heap, threads/goroutines) via a lightweight language agent; valuable for finding latency regressions.
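Since Cloud Logging parses structured JSON automatically, emitting one JSON object per line to stdout is often enough for services on GKE or Cloud Run. A minimal sketch; the helper name and the extra fields (order_id, latency_ms) are illustrative, but "severity" and "message" are fields Cloud Logging recognises specially, with remaining keys landing in jsonPayload:

```python
import json
import sys

def log_structured(severity: str, message: str, **fields) -> None:
    """Emit one JSON object per line to stdout. Cloud Logging maps
    'severity' and 'message' to log-entry fields and keeps the rest
    as jsonPayload, which log-based metrics and filters can query."""
    entry = {"severity": severity, "message": message, **fields}
    print(json.dumps(entry), file=sys.stdout)

log_structured("ERROR", "checkout failed", order_id="A-123", latency_ms=842)
```

Keeping fields consistent (always the same key names and types) makes log-based metrics and exclusion filters much easier to write later.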