Data Ingestion and Storage Architecture
PDE data architecture starts with choosing the right storage for each use case.

Cloud Storage (GCS): object storage for raw and processed data, and the foundation of a data lake on GCP. Choose a storage class by access frequency: Standard (frequently accessed, e.g. streaming pipeline outputs), Nearline (roughly monthly access, e.g. backups), Coldline (roughly quarterly access, e.g. disaster recovery), Archive (roughly yearly access, e.g. regulatory retention). Lifecycle rules automate class transitions and deletions (sketch below).

BigQuery: serverless data warehouse and the destination for most analytical workloads on GCP. Optimise with partitioning (partition by ingestion time or a date column to reduce bytes scanned), clustering (co-locate rows with similar values to cut scans further), and materialised views (pre-computed query results refreshed automatically); a partitioned-and-clustered table sketch also follows below. BigQuery Omni: run queries directly against data in AWS S3 or Azure Blob Storage, with no data copying.

Cloud Bigtable: managed NoSQL for high-throughput time-series and IoT data; wide-column model, petabyte scale, millisecond latency, designed for billions of rows. Do not use Bigtable for small datasets; the overhead is not worth it.

Cloud Spanner: globally consistent relational database. Use it when you need both SQL semantics and global horizontal scale (financial transactions, global inventory).
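As a sketch of lifecycle automation, the snippet below uses the google-cloud-storage Python client to add transition and deletion rules; the bucket name and age thresholds are hypothetical examples, not a prescription.

    # A minimal sketch, assuming the google-cloud-storage library is installed
    # and application-default credentials are configured.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("example-data-lake")  # hypothetical bucket name

    # Move objects to Nearline after 30 days and Coldline after 90 days,
    # then delete anything older than roughly seven years (2555 days).
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.add_lifecycle_delete_rule(age=2555)
    bucket.patch()  # persist the updated lifecycle configuration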
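And a sketch of creating a partitioned, clustered table with the google-cloud-bigquery client; the project, dataset, and schema are hypothetical.

    # A minimal sketch, assuming the google-cloud-bigquery library.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table(
        "example-project.analytics.events",  # hypothetical table ID
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("payload", "STRING"),
        ],
    )
    # Partition by the date column so date-filtered queries scan fewer bytes,
    # and cluster on customer_id so rows with similar values are co-located.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_date"
    )
    table.clustering_fields = ["customer_id"]
    client.create_table(table)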
Data Processing: Dataflow, Dataproc, and Pub/Sub
Processing-layer choices for the PDE.

Pub/Sub: fully managed messaging and event streaming. Publishers send messages to topics; subscribers consume from subscriptions. Delivery is at-least-once, so exactly-once processing requires deduplication in the consumer (Dataflow does this by deduplicating on message IDs; windowing and watermarks handle event-time correctness, not duplicates). Pub/Sub Lite: lower cost and zonal availability, for high-volume use cases where global availability is not required. A publish/consume sketch appears below.

Dataflow: managed Apache Beam execution environment, with a unified model for batch (bounded) and streaming (unbounded) pipelines. Beam concepts: PCollection (distributed dataset), PTransform (data transformation; ParDo for element-wise processing, GroupByKey for aggregation), Pipeline (the DAG of transforms), Runner (execution environment; the Dataflow runner for GCP, the Direct runner for local testing). Streaming concepts: windowing (fixed windows, sliding windows, and session windows for activity-based grouping), watermarks (the pipeline's estimate of event-time progress; results for a window are held until the watermark passes its end, and data arriving after that is late), triggers (when to emit results from a window). A windowed-pipeline sketch appears below.

Dataproc: managed Hadoop and Spark, for existing Hadoop/Spark workloads that cannot be rewritten as Beam pipelines. Prefer ephemeral clusters (spin up for a job, spin down after, pay only for processing time) over permanent clusters.

Cloud Composer (managed Apache Airflow): orchestrate complex pipelines with dependencies, scheduling, and monitoring; DAGs define the workflow structure (sketch below).
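A minimal publish/consume sketch with the google-cloud-pubsub client; the project, topic, and subscription names are hypothetical.

    from google.cloud import pubsub_v1

    # Publish: messages are bytes; publish() returns a future holding the message ID.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "events")
    future = publisher.publish(topic_path, b'{"user": "a1", "action": "click"}')
    print("published message id:", future.result())

    # Consume: handlers should be idempotent, since at-least-once delivery
    # means a message can be redelivered until it is acknowledged.
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("example-project", "events-sub")

    def callback(message):
        print("received:", message.data)
        message.ack()

    streaming_pull = subscriber.subscribe(sub_path, callback=callback)
    streaming_pull.result(timeout=30)  # pull for 30 seconds, then time out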
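A windowed streaming pipeline sketch in the Beam Python SDK; the topic is hypothetical, and the same code runs locally on the Direct runner or on GCP with --runner=DataflowRunner.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # unbounded pipeline

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events")
            | "KeyByUser" >> beam.Map(lambda data: (data.decode("utf-8"), 1))
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 60 s windows
            # With the default trigger, counts are emitted once the watermark
            # passes the end of each window.
            | "CountPerKey" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )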
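And a minimal Cloud Composer (Airflow) DAG sketch; the task commands are placeholders standing in for real pipeline steps.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_pipeline",          # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        # Dependencies define the DAG structure: extract, then transform, then load.
        extract >> transform >> load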
Machine Learning on GCP: Vertex AI
Vertex AI is Google's unified ML platform and the answer to almost every ML question on the PDE exam.

Vertex AI components: Workbench (managed Jupyter notebooks for data exploration and model development), Training (custom training jobs using your own Docker containers or pre-built containers for TensorFlow, PyTorch, and sklearn), AutoML (train high-quality models without writing training code: AutoML Tables for structured data, AutoML Image, AutoML Text, AutoML Video), Model Registry (versioned model storage with lineage), Endpoints (deploy models for real-time, online prediction), Batch Prediction (run predictions over a large dataset in Cloud Storage), Pipelines (Kubeflow Pipelines or TFX for ML workflow orchestration), and Feature Store (centralised feature management with online and offline serving). A deploy-and-predict sketch appears below.

BigQuery ML: run ML models using SQL directly in BigQuery (a CREATE MODEL statement to train, SELECT * FROM ML.PREDICT to score), aimed at data analysts who prefer SQL over Python. Supported algorithms: linear regression, logistic regression, k-means, matrix factorisation, boosted trees, deep neural networks, and ARIMA for time-series (sketch below).
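A deploy-and-predict sketch with the google-cloud-aiplatform SDK; the project, artifact path, serving container image, and feature values are hypothetical.

    from google.cloud import aiplatform

    aiplatform.init(project="example-project", location="us-central1")

    # Register a trained model artifact, then deploy it to an endpoint
    # for online (real-time) prediction.
    model = aiplatform.Model.upload(
        display_name="churn-model",
        artifact_uri="gs://example-bucket/models/churn/",
        serving_container_image_uri=(
            "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
        ),
    )
    endpoint = model.deploy(machine_type="n1-standard-2")

    # Instances must match the input shape the model was trained on.
    prediction = endpoint.predict(instances=[[0.3, 12, 1, 0]])
    print(prediction.predictions)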
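And a BigQuery ML sketch run through the Python client (the same statements work directly in the console); the dataset, table, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic regression model in place with CREATE MODEL.
    client.query("""
        CREATE OR REPLACE MODEL analytics.churn_model
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, churned
        FROM analytics.customers
    """).result()

    # Score new rows with ML.PREDICT.
    rows = client.query("""
        SELECT *
        FROM ML.PREDICT(
            MODEL analytics.churn_model,
            (SELECT tenure_months, monthly_spend FROM analytics.new_customers))
    """).result()
    for row in rows:
        print(dict(row))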
Data Governance, Security, and Reliability
PDE data governance: data classification (the DLP API scans BigQuery, Cloud Storage, and Datastore for sensitive data such as PII, payment card data, and credentials; inspection sketch below), data lineage (Dataplex tracks how data flows and transforms, built on Data Catalog metadata), and BigQuery column-level security (policy tags restrict access to specific columns, so sensitive columns such as SSN are invisible to users without the policy tag role).

Dataplex: unified data governance. It organises data across GCS and BigQuery into lakes, zones, and assets, applies data quality rules, automates metadata discovery and classification, and manages data lifecycle policies.

Data reliability: BigQuery slot reservations for consistent performance (no resource contention from other queries), Pub/Sub subscription acknowledgement deadlines (set longer than the maximum processing time to avoid redelivery), and Dataflow autoscaling (worker count scales automatically with input volume, reducing cost during off-peak periods).

Monitoring: Cloud Monitoring metrics for Dataflow job health (data freshness, system lag, backlog bytes), Pub/Sub oldest unacknowledged message age (a high age indicates a slow consumer; metric query sketch below), and BigQuery audit logs (who ran which query against which table, via the Data Access logs).
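An inspection sketch with the google-cloud-dlp client, scanning an inline string for two built-in infoTypes; the project ID and sample text are hypothetical.

    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(
        request={
            "parent": "projects/example-project",
            "inspect_config": {
                "info_types": [
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "CREDIT_CARD_NUMBER"},
                ],
            },
            # Inline sample; production jobs point at BigQuery or GCS instead.
            "item": {"value": "Contact jane@example.com, card 4111-1111-1111-1111"},
        }
    )
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood)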
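And a sketch that reads the Pub/Sub oldest-unacked-message-age metric through the Cloud Monitoring API; the project ID and ten-minute window are hypothetical.

    import time

    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
    )
    results = client.list_time_series(
        request={
            "name": "projects/example-project",
            "filter": 'metric.type='
                      '"pubsub.googleapis.com/subscription/oldest_unacked_message_age"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        sub = series.resource.labels["subscription_id"]
        age = series.points[0].value.int64_value  # seconds; points are newest first
        print(sub, age)  # a high age indicates a slow or stuck consumer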