Designing for High Availability and Reliability
PCA-level reliability design starts with defining SLOs (Service Level Objectives) and working backwards to the architecture. Three terms to keep apart: the SLI (Service Level Indicator) is the metric that measures the SLO, such as request success rate, latency, or throughput; the SLO is the internal target for that metric (e.g., 99.9% availability allows about 8.76 hours of downtime per year); the SLA is the contractual commitment to customers, usually set looser than the internal SLO so the team has a safety margin before a breach carries penalties.
Multi-zone and multi-region design: deploy across zones within a region for protection against zone failure (99.99% SLA for instances spread across multiple zones, as with regional managed instance groups). Use multi-region Cloud Storage and Cloud Spanner for active/active global deployments. GKE regional clusters distribute nodes across zones automatically.
Load balancing for HA: the global HTTP(S) Load Balancer exposes a single anycast IP and routes users to the nearest healthy backend globally, failing over across regions when backends become unhealthy. Add Cloud Armor and Cloud CDN for DDoS resilience.
Health checks: configure thresholds that are not too sensitive (to avoid flapping), and for stateless applications use path-based HTTP health checks that hit a request path reflecting whether the instance can actually serve traffic.
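The conversion from an availability target to a downtime budget, and the relationship between SLI and SLO, can be sketched with a few lines of arithmetic. This is a minimal illustration, not a Google Cloud API: the function names are hypothetical, the year is taken as 365 days, and the SLI is assumed to be a simple request success rate.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def downtime_budget_hours(slo_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted by an availability SLO over the period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return period_hours * (1 - slo_percent / 100)


def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Request success rate as a percentage (one common availability SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * successful_requests / total_requests


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        hours = downtime_budget_hours(slo)
        print(f"{slo}% availability allows {hours:.2f} hours of downtime per year")
    sli = availability_sli(9_995, 10_000)
    print(f"SLI {sli:.2f}% meets a 99.9% SLO: {meets_slo(sli, 99.9)}")
```

Each extra nine shrinks the budget by a factor of ten, which is why the SLO has to be agreed before choosing between single-zone, multi-zone, and multi-region designs.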
Hybrid and Multi-Cloud Architecture
Many PCA exam case studies involve companies with existing on-premises infrastructure that must integrate with Google Cloud.
Connectivity options: Cloud VPN (IPsec over the public internet, up to roughly 3 Gbps per tunnel; HA VPN uses multiple redundant tunnels for availability), Dedicated Interconnect (10 Gbps or 100 Gbps private circuits, highest performance and reliability), Partner Interconnect (provisioned through a service provider, 50 Mbps to 50 Gbps).
Anthos: Google's hybrid and multi-cloud platform. It runs GKE clusters on-premises (Anthos on-prem), on AWS, or on Azure, all managed from a single pane of glass. Key components: Config Sync (GitOps for GKE configuration), Policy Controller (built on OPA Gatekeeper, for policy enforcement), and Anthos Service Mesh (built on Istio, for service-to-service security and observability).
Data migration: Storage Transfer Service (bulk transfer from AWS S3, Azure Blob Storage, or on-premises storage), Transfer Appliance (a physical device for offline transfer of petabytes), Datastream (change data capture (CDC) streaming replication from Oracle, MySQL, and PostgreSQL to BigQuery or Cloud Storage).
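Choosing between an online transfer and Transfer Appliance usually comes down to how long the data would take to move over the available link. The sketch below is a rough estimate only: the function name and the 70% average utilisation default are assumptions for illustration, and it uses decimal units (1 TB = 10^12 bytes).

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_terabytes: float, link_gbps: float, utilisation: float = 0.7) -> float:
    """Days needed to move the data over a link at a given average utilisation.

    Decimal units: 1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second.
    The 0.7 default utilisation is an illustrative assumption, not a measured figure.
    """
    if data_terabytes < 0 or link_gbps <= 0 or not 0 < utilisation <= 1:
        raise ValueError("invalid input")
    total_bits = data_terabytes * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * utilisation
    return total_bits / effective_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    petabyte_in_tb = 1_000
    # Link speeds taken from the connectivity options listed above.
    for label, gbps in (("one VPN tunnel", 3), ("10G Interconnect", 10), ("100G Interconnect", 100)):
        days = transfer_days(petabyte_in_tb, gbps)
        print(f"1 PB over {label}: about {days:.1f} days")
```

If the estimate runs to weeks or months and the migration window is shorter, an offline transfer is the more plausible answer in a case study.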
Data Architecture: BigQuery and Analytics at Scale
BigQuery is Google Cloud's flagship data product and a PCA exam staple.
Architecture decisions: pricing model (on-demand, billed by the bytes each query scans, or capacity-based, where you reserve slots, BigQuery's units of compute), BigQuery Omni (query data in AWS S3 or Azure Blob Storage without copying it), BigQuery ML (train and run ML models in SQL).
Performance optimisation: partitioning (by ingestion time or a date/timestamp column; queries that filter on the partition column scan fewer bytes), clustering (sorts data within partitions by frequently filtered columns), materialised views (pre-computed results that BigQuery keeps up to date).
Data pipeline architecture: batch (Cloud Storage > Dataflow > BigQuery, with pipelines written in Apache Beam), streaming (Pub/Sub > Dataflow > BigQuery for real-time ingestion), Cloud Composer (managed Apache Airflow for complex workflow orchestration). Looker and Looker Studio cover BI and dashboards.
Data governance: column-level security with policy tags, Data Catalog for metadata management, CMEK (customer-managed encryption keys) for compliance.
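The effect of partitioning on bytes scanned, and therefore on on-demand cost, can be approximated as follows. This is a simplified sketch: it assumes evenly sized partitions and a query that filters cleanly on the partition column, the function names are hypothetical, and the price passed in is a placeholder rather than a quoted BigQuery rate.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def scanned_bytes_with_pruning(table_bytes: int, total_partitions: int, partitions_read: int) -> float:
    """Bytes scanned when a query reads only some partitions.

    Assumes all partitions are the same size, which real tables rarely are.
    """
    if total_partitions <= 0 or not 0 <= partitions_read <= total_partitions:
        raise ValueError("partition counts are inconsistent")
    return table_bytes * partitions_read / total_partitions


def on_demand_cost(bytes_scanned: float, price_per_tib: float) -> float:
    """Cost of a query under bytes-scanned pricing at the given price per TiB."""
    return bytes_scanned / TIB * price_per_tib


if __name__ == "__main__":
    placeholder_price = 5.0  # illustrative only; check current pricing before relying on it
    table_bytes = 365 * TIB  # one year of data, one TiB per daily partition
    full_scan = on_demand_cost(table_bytes, placeholder_price)
    last_week = on_demand_cost(scanned_bytes_with_pruning(table_bytes, 365, 7), placeholder_price)
    print(f"Full scan: {full_scan:.2f}, last 7 days only: {last_week:.2f}")
```

Clustering reduces scanned bytes further within each partition, but the saving depends on the data distribution and is not modelled here.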
Security Architecture and Compliance
PCA security architecture requires defence-in-depth across the resource hierarchy.
Google Cloud hierarchy: Organisation > Folder > Project > Resource. IAM policies and organisation policies inherit downward. Design principle: set restrictive organisation policy constraints at the highest level possible and allow exceptions at lower levels. IAM allow policies behave differently: they are additive, so a role granted at a higher level cannot be taken away further down, and roles should be granted at the lowest level that meets the need.
Network security layers: VPC firewall rules (stateful, targeted by network tag or service account), hierarchical firewall policies (org-wide enforcement), VPC Service Controls (service perimeters around Google Cloud APIs to limit data exfiltration), Cloud Armor (WAF and DDoS protection at the edge), Private Google Access (lets VMs without public IPs reach Google APIs privately).
Identity: use Google Workspace or Cloud Identity for human users (no service account keys for humans), and Identity-Aware Proxy (IAP) to control access to web applications based on identity and context (zero trust access, no VPN required).
Key management: CMEK (Customer-Managed Encryption Keys) for regulatory compliance (you control key rotation and can revoke access), CSEK (Customer-Supplied Encryption Keys: you provide the key material and Google does not store it persistently).
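Downward inheritance of IAM policies can be pictured as a union of role bindings collected while walking from a resource up to the organisation. The following is a toy model only, not the IAM API: the `Node` class, the member names, and the node names are hypothetical, and it ignores deny policies and IAM conditions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

Binding = Tuple[str, str]  # (member, role)


@dataclass
class Node:
    """One level of the hierarchy: organisation, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    bindings: Set[Binding] = field(default_factory=set)


def effective_bindings(node: Node) -> Set[Binding]:
    """Union of bindings set on the node and on every ancestor."""
    result: Set[Binding] = set()
    current: Optional[Node] = node
    while current is not None:
        result |= current.bindings
        current = current.parent
    return result


if __name__ == "__main__":
    org = Node("example-org", bindings={("group:auditors@example.com", "roles/viewer")})
    folder = Node("prod-folder", parent=org)
    project = Node("billing-app", parent=folder,
                   bindings={("user:dev@example.com", "roles/editor")})
    for member, role in sorted(effective_bindings(project)):
        print(f"{member} has {role} on {project.name}")
```

Because the result is a union, nothing set on the project can remove the organisation-level viewer grant, which is why broad roles granted high in the hierarchy deserve extra scrutiny.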
Cost Optimisation and Operational Excellence
PCA exam case studies often present cost constraints.
Committed Use Discounts (CUDs): commit to a minimum resource level for 1 or 3 years in exchange for a discount of roughly 20-57% on Compute Engine VMs and Cloud SQL. Best for steady, predictable workloads.
Spot VMs (formerly preemptible): up to 91% discount. They are reclaimed when Google needs the capacity back, with a 30-second warning, and there is no charge if the VM is interrupted in its first minute. Use Spot VMs for fault-tolerant work such as batch jobs, CI/CD, and ML training.
Sustained Use Discounts (SUDs): applied automatically to eligible machine types once a VM runs for more than 25% of the month. No action is required, and the discount reaches 20-30%.
Recommendations Hub: right-sizing recommendations for over-provisioned VMs, idle resource identification, committed use discount recommendations.
Budget alerts: set monthly budgets per project or billing account and alert at 50%, 90%, and 100% of budget. Alerts notify; they do not cap spending by themselves.
BigQuery cost control: set custom quotas per user or project (maximum bytes processed per day), and use slot reservations for predictable cost at high volume.
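The discount levels and budget thresholds above lend themselves to a quick comparison. This is a minimal sketch under stated assumptions: the hourly rate is a made-up figure, the discounts are the upper bounds quoted in the text rather than actual prices for any machine type, and the function names are hypothetical.

```python
from typing import List, Sequence


def discounted_cost(on_demand_hourly: float, hours: float, discount: float) -> float:
    """Cost after applying a fractional discount (0.57 means 57% off)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * hours * (1 - discount)


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[float] = (0.5, 0.9, 1.0)) -> List[float]:
    """Budget alert thresholds that current spend has reached or passed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return [t for t in thresholds if spend >= budget * t]


if __name__ == "__main__":
    hourly_rate = 0.10      # hypothetical on-demand price per hour
    hours_in_month = 730    # approximate hours in a month
    options = {"on-demand": 0.0, "SUD (upper bound)": 0.30,
               "CUD (upper bound)": 0.57, "Spot (upper bound)": 0.91}
    for name, discount in options.items():
        cost = discounted_cost(hourly_rate, hours_in_month, discount)
        print(f"{name}: {cost:.2f} per month")
    print("Alerts fired at 950 of 1000 budget:", crossed_thresholds(950, 1000))
```

The comparison only covers price. Spot VMs can be reclaimed at any time and CUDs are billed whether or not the resources are used, so the cheapest line is not automatically the right answer for a given workload.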