Data Ingestion and Storage Architecture
PDE data architecture starts with choosing the right storage for each use case.

Cloud Storage (GCS): object storage for raw and processed data, and the foundation of a data lake on GCP. Choose a storage class by access frequency: Standard (frequently accessed, e.g. streaming pipeline outputs), Nearline (roughly monthly access, e.g. backups), Coldline (roughly quarterly access, e.g. disaster recovery), Archive (roughly yearly access, e.g. regulatory retention). Lifecycle rules automate class transitions and deletions (sketch below).

BigQuery: serverless data warehouse and the destination for most analytical workloads on GCP. Optimise with partitioning (partition by ingestion time or a date column to reduce bytes scanned), clustering (co-locate rows with similar values to cut scans further), and materialised views (pre-computed query results refreshed automatically); a partitioned-and-clustered table sketch also follows below. BigQuery Omni: run queries directly against data in AWS S3 or Azure Blob Storage, with no data copying.

Cloud Bigtable: managed NoSQL for high-throughput time-series and IoT data; wide-column model, petabyte scale, millisecond latency, designed for billions of rows. Do not use Bigtable for small datasets; the overhead is not worth it.

Cloud Spanner: globally consistent relational database. Use it when you need both SQL semantics and global horizontal scale (financial transactions, global inventory).
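As a sketch of lifecycle automation, the snippet below uses the google-cloud-storage Python client to add transition and deletion rules; the bucket name and age thresholds are hypothetical examples, not a prescription.

    # A minimal sketch, assuming the google-cloud-storage library is installed
    # and application-default credentials are configured.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("example-data-lake")  # hypothetical bucket name

    # Move objects to Nearline after 30 days and Coldline after 90 days,
    # then delete anything older than roughly seven years (2555 days).
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
    bucket.add_lifecycle_delete_rule(age=2555)
    bucket.patch()  # persist the updated lifecycle configuration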
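And a sketch of creating a partitioned, clustered table with the google-cloud-bigquery client; the project, dataset, and schema are hypothetical.

    # A minimal sketch, assuming the google-cloud-bigquery library.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table(
        "example-project.analytics.events",  # hypothetical table ID
        schema=[
            bigquery.SchemaField("event_date", "DATE"),
            bigquery.SchemaField("customer_id", "STRING"),
            bigquery.SchemaField("payload", "STRING"),
        ],
    )
    # Partition by the date column so date-filtered queries scan fewer bytes,
    # and cluster on customer_id so rows with similar values are co-located.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="event_date"
    )
    table.clustering_fields = ["customer_id"]
    client.create_table(table)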
Data Processing: Dataflow, Dataproc, and Pub/Sub
Processing-layer choices for the PDE.

Pub/Sub: fully managed messaging and event streaming. Publishers send messages to topics; subscribers consume from subscriptions. Delivery is at-least-once, so exactly-once processing requires deduplication in the consumer (Dataflow does this by deduplicating on message IDs; windowing and watermarks handle event-time correctness, not duplicates). Pub/Sub Lite: lower cost and zonal availability, for high-volume use cases where global availability is not required. A publish/consume sketch appears below.

Dataflow: managed Apache Beam execution environment, with a unified model for batch (bounded) and streaming (unbounded) pipelines. Beam concepts: PCollection (distributed dataset), PTransform (data transformation; ParDo for element-wise processing, GroupByKey for aggregation), Pipeline (the DAG of transforms), Runner (execution environment; the Dataflow runner for GCP, the Direct runner for local testing). Streaming concepts: windowing (fixed windows, sliding windows, and session windows for activity-based grouping), watermarks (the pipeline's estimate of event-time progress; results for a window are held until the watermark passes its end, and data arriving after that is late), triggers (when to emit results from a window). A windowed-pipeline sketch appears below.

Dataproc: managed Hadoop and Spark, for existing Hadoop/Spark workloads that cannot be rewritten as Beam pipelines. Prefer ephemeral clusters (spin up for a job, spin down after, pay only for processing time) over permanent clusters.

Cloud Composer (managed Apache Airflow): orchestrate complex pipelines with dependencies, scheduling, and monitoring; DAGs define the workflow structure (sketch below).
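A minimal publish/consume sketch with the google-cloud-pubsub client; the project, topic, and subscription names are hypothetical.

    from google.cloud import pubsub_v1

    # Publish: messages are bytes; publish() returns a future holding the message ID.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "events")
    future = publisher.publish(topic_path, b'{"user": "a1", "action": "click"}')
    print("published message id:", future.result())

    # Consume: handlers should be idempotent, since at-least-once delivery
    # means a message can be redelivered until it is acknowledged.
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("example-project", "events-sub")

    def callback(message):
        print("received:", message.data)
        message.ack()

    streaming_pull = subscriber.subscribe(sub_path, callback=callback)
    streaming_pull.result(timeout=30)  # pull for 30 seconds, then time out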
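A windowed streaming pipeline sketch in the Beam Python SDK; the topic is hypothetical, and the same code runs locally on the Direct runner or on GCP with --runner=DataflowRunner.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True  # unbounded pipeline

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic="projects/example-project/topics/events")
            | "KeyByUser" >> beam.Map(lambda data: (data.decode("utf-8"), 1))
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))  # 60 s windows
            # With the default trigger, counts are emitted once the watermark
            # passes the end of each window.
            | "CountPerKey" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )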
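And a minimal Cloud Composer (Airflow) DAG sketch; the task commands are placeholders standing in for real pipeline steps.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_pipeline",          # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        # Dependencies define the DAG structure: extract, then transform, then load.
        extract >> transform >> load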
Machine Learning on GCP: Vertex AI
Vertex AI is Google's unified ML platform and the answer to almost every ML question on the PDE exam.

Vertex AI components: Workbench (managed Jupyter notebooks for data exploration and model development), Training (custom training jobs using your own Docker containers or pre-built containers for TensorFlow, PyTorch, and sklearn), AutoML (train high-quality models without writing training code: AutoML Tables for structured data, AutoML Image, AutoML Text, AutoML Video), Model Registry (versioned model storage with lineage), Endpoints (deploy models for real-time, online prediction), Batch Prediction (run predictions over a large dataset in Cloud Storage), Pipelines (Kubeflow Pipelines or TFX for ML workflow orchestration), and Feature Store (centralised feature management with online and offline serving). A deploy-and-predict sketch appears below.

BigQuery ML: run ML models using SQL directly in BigQuery (a CREATE MODEL statement to train, SELECT * FROM ML.PREDICT to score), aimed at data analysts who prefer SQL over Python. Supported algorithms: linear regression, logistic regression, k-means, matrix factorisation, boosted trees, deep neural networks, and ARIMA for time-series (sketch below).
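A deploy-and-predict sketch with the google-cloud-aiplatform SDK; the project, artifact path, serving container image, and feature values are hypothetical.

    from google.cloud import aiplatform

    aiplatform.init(project="example-project", location="us-central1")

    # Register a trained model artifact, then deploy it to an endpoint
    # for online (real-time) prediction.
    model = aiplatform.Model.upload(
        display_name="churn-model",
        artifact_uri="gs://example-bucket/models/churn/",
        serving_container_image_uri=(
            "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
        ),
    )
    endpoint = model.deploy(machine_type="n1-standard-2")

    # Instances must match the input shape the model was trained on.
    prediction = endpoint.predict(instances=[[0.3, 12, 1, 0]])
    print(prediction.predictions)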
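And a BigQuery ML sketch run through the Python client (the same statements work directly in the console); the dataset, table, and column names are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a logistic regression model in place with CREATE MODEL.
    client.query("""
        CREATE OR REPLACE MODEL analytics.churn_model
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT tenure_months, monthly_spend, churned
        FROM analytics.customers
    """).result()

    # Score new rows with ML.PREDICT.
    rows = client.query("""
        SELECT *
        FROM ML.PREDICT(
            MODEL analytics.churn_model,
            (SELECT tenure_months, monthly_spend FROM analytics.new_customers))
    """).result()
    for row in rows:
        print(dict(row))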
Data Governance, Security, and Reliability
PDE data governance: data classification (the DLP API scans BigQuery, Cloud Storage, and Datastore for sensitive data such as PII, payment card data, and credentials; inspection sketch below), data lineage (Dataplex tracks how data flows and transforms, built on Data Catalog metadata), and BigQuery column-level security (policy tags restrict access to specific columns, so sensitive columns such as SSN are invisible to users without the policy tag role).

Dataplex: unified data governance. It organises data across GCS and BigQuery into lakes, zones, and assets, applies data quality rules, automates metadata discovery and classification, and manages data lifecycle policies.

Data reliability: BigQuery slot reservations for consistent performance (no resource contention from other queries), Pub/Sub subscription acknowledgement deadlines (set longer than the maximum processing time to avoid redelivery), and Dataflow autoscaling (worker count scales automatically with input volume, reducing cost during off-peak periods).

Monitoring: Cloud Monitoring metrics for Dataflow job health (data freshness, system lag, backlog bytes), Pub/Sub oldest unacknowledged message age (a high age indicates a slow consumer; metric query sketch below), and BigQuery audit logs (who ran which query against which table, via the Data Access logs).
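An inspection sketch with the google-cloud-dlp client, scanning an inline string for two built-in infoTypes; the project ID and sample text are hypothetical.

    from google.cloud import dlp_v2

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(
        request={
            "parent": "projects/example-project",
            "inspect_config": {
                "info_types": [
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "CREDIT_CARD_NUMBER"},
                ],
            },
            # Inline sample; production jobs point at BigQuery or GCS instead.
            "item": {"value": "Contact jane@example.com, card 4111-1111-1111-1111"},
        }
    )
    for finding in response.result.findings:
        print(finding.info_type.name, finding.likelihood)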
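And a sketch that reads the Pub/Sub oldest-unacked-message-age metric through the Cloud Monitoring API; the project ID and ten-minute window are hypothetical.

    import time

    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
    )
    results = client.list_time_series(
        request={
            "name": "projects/example-project",
            "filter": 'metric.type='
                      '"pubsub.googleapis.com/subscription/oldest_unacked_message_age"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    for series in results:
        sub = series.resource.labels["subscription_id"]
        age = series.points[0].value.int64_value  # seconds; points are newest first
        print(sub, age)  # a high age indicates a slow or stuck consumer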