Data Ingestion: Kinesis, MSK, and AWS Glue
Data ingestion patterns form the foundation of any data engineering architecture.
- Kinesis Data Streams: real-time streaming. Shards are the unit of capacity (1 MB/s or 1,000 records/s write, 2 MB/s read per shard); retention is configurable up to 365 days; replay lets late consumers reprocess records.
- Kinesis Data Firehose: managed delivery to S3, Redshift, OpenSearch, and Splunk. Buffers by size or time, supports optional inline Lambda transformation, and requires no consumer management.
- Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): real-time SQL or Flink processing on streams: window functions, anomaly detection, aggregations.
- Amazon MSK (Managed Streaming for Apache Kafka): fully managed Kafka. Bring existing Kafka workloads to AWS, use Kafka Connect for source/sink integrations, and use MSK Serverless for variable throughput.
- AWS Glue: serverless ETL. Glue Crawlers automatically discover schemas and populate the Data Catalog; Glue ETL jobs run PySpark or Python Shell scripts for transformation; Glue DataBrew provides visual no-code data preparation.
- AWS Glue Data Catalog: central metadata repository (tables, partitions, schemas) integrated with Athena, Redshift Spectrum, and EMR for schema-on-read queries.
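The per-shard write limits above (1 MB/s or 1,000 records/s) imply a simple sizing rule: a stream needs enough shards to satisfy whichever limit binds first. A minimal sketch; the workload numbers in the example are hypothetical.

```python
import math

# Per-shard write limits for Kinesis Data Streams (from the notes above).
SHARD_WRITE_MB_PER_S = 1.0
SHARD_WRITE_RECORDS_PER_S = 1_000

def shards_needed(ingress_mb_per_s: float, ingress_records_per_s: float) -> int:
    """Minimum shard count satisfying both the throughput and record-rate limits."""
    by_throughput = math.ceil(ingress_mb_per_s / SHARD_WRITE_MB_PER_S)
    by_records = math.ceil(ingress_records_per_s / SHARD_WRITE_RECORDS_PER_S)
    return max(1, by_throughput, by_records)

# Hypothetical workload: 4.5 MB/s of small events arriving at 12,000 records/s.
print(shards_needed(4.5, 12_000))  # the record rate, not throughput, dominates here
```

Note that small, frequent records can force more shards than raw throughput suggests, which is one reason producers batch records (e.g. via PutRecords or the Kinesis Producer Library).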
Data Storage: S3 Lake House, Redshift, and DynamoDB
Storage layer decisions for data engineering.
- S3 data lake: the foundation. Use S3 Intelligent-Tiering for cost optimisation; partition data by date or category (year=2025/month=01/day=15/) for Athena and Redshift Spectrum query efficiency; enable S3 Event Notifications to trigger Glue or Lambda on new file arrival.
- Apache Iceberg on S3: open table format for the data lake. ACID transactions on S3 files, time travel (query previous snapshots), and schema evolution without rewriting data; supported natively by Athena, Glue, and EMR.
- Amazon Redshift: fully managed columnar data warehouse. Redshift Serverless for variable workloads (pay per RPU-hour); provisioned clusters for consistent high-volume workloads.
- Distribution styles: KEY (collocate matching rows for join performance), EVEN (round-robin, balanced load), ALL (replicate small dimension tables to every node).
- Sort keys: COMPOUND (best when queries filter on the leading columns in order), INTERLEAVED (multiple columns weighted equally).
- Redshift Spectrum: query S3 data directly from Redshift SQL with no data movement, separating compute from storage.
- DynamoDB Streams and Kinesis Data Streams integration: capture all DynamoDB item-level changes for downstream processing.
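The Hive-style year=/month=/day= layout above is what lets Athena and Redshift Spectrum prune partitions. A small sketch of generating such prefixes consistently; the `events` base prefix is a hypothetical name.

```python
from datetime import date

def partition_prefix(day: date, base: str = "events") -> str:
    """Build a Hive-style year=/month=/day= S3 key prefix.

    Month and day are zero-padded so lexicographic key ordering
    matches chronological ordering, which partition pruning relies on.
    """
    return (f"{base}/year={day.year}/"
            f"month={day.month:02d}/day={day.day:02d}/")

print(partition_prefix(date(2025, 1, 15)))
# events/year=2025/month=01/day=15/
```

Writers (a Glue job, a Firehose dynamic-partitioning prefix, or a Lambda) should all derive keys from one function like this, so a table's partitions never mix padded and unpadded formats.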
Data Transformation and Orchestration
Transformation and orchestration are the glue of data engineering.
- AWS Glue workflows: chain crawlers and jobs into dependency graphs, triggered on a schedule or by event.
- AWS Step Functions: orchestrate multi-step workflows involving Lambda, Glue, EMR, and ECS. Standard workflows provide a full audit trail and run up to one year; Express workflows handle high-volume executions up to five minutes, suited to streaming pipelines.
- Amazon EMR: managed Hadoop/Spark/Hive/Presto clusters. Use ephemeral clusters for cost efficiency (terminate after the job completes), or EMR Serverless for fully managed compute. Choose EMR for large-scale Spark transformations that exceed Glue's capabilities, such as complex ML feature engineering or custom Spark configurations.
- Amazon Athena: serverless SQL on S3, billed per TB scanned. Use partitioning and columnar formats (Parquet, ORC) to reduce scan volume and cost.
- Athena Federated Query: query data in RDS, DynamoDB, and on-premises sources alongside S3 via Lambda data source connectors.
- AWS Lake Formation: security and governance layer for the data lake. Column-level and row-level permissions on Glue Data Catalog tables; cross-account data sharing with Lake Formation permissions.
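Athena's pay-per-TB-scanned model is why partitioning and columnar formats matter: they shrink the bytes each query reads. A back-of-envelope cost sketch; the $5/TB rate is an assumption (pricing varies by region), and the dataset size, partition fraction, and column fraction are hypothetical.

```python
# Rough Athena cost model: you pay per TB of data a query scans.
# ASSUMPTION: $5.00/TB, a commonly cited rate; check current regional pricing.
PRICE_PER_TB = 5.00

def scan_cost(tb_scanned: float) -> float:
    """Dollar cost of scanning the given number of TB."""
    return tb_scanned * PRICE_PER_TB

# Hypothetical 10 TB table holding one year of events.
full_scan = scan_cost(10.0)                  # raw JSON, no pruning: read everything
# Same query against date-partitioned Parquet: one day's partition (1/365 of
# the data), and columnar storage means only the ~20% of columns referenced
# are read.
pruned = scan_cost(10.0 / 365 * 0.2)
print(f"full scan: ${full_scan:.2f}, partitioned Parquet: ${pruned:.4f}")
```

The two levers compound: partition pruning cuts the rows read, the columnar format cuts the columns read, so well-laid-out data can be orders of magnitude cheaper to query.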
Data Quality, Security, and Monitoring
Production data engineering requires quality and governance controls.
- AWS Glue Data Quality: define rules in DQDL (Data Quality Definition Language) covering completeness, uniqueness, accuracy, and consistency; rules run as part of Glue ETL jobs.
- Deequ: AWS Labs' open-source data quality library for Spark, used on EMR for programmatic quality checks.
- Data lineage: AWS Glue lineage tracking records data origins and transformations automatically, visualised in the Glue console.
- Security: S3 bucket policies and IAM for data lake access control; Lake Formation for fine-grained column and row permissions; KMS customer managed key (CMK) encryption for S3 and Redshift; VPC endpoints for private access to S3 and Redshift from EMR and Glue.
- Monitoring: CloudWatch metrics for Kinesis (GetRecords.IteratorAgeMilliseconds measures consumer lag; a high age means consumers are falling behind); Glue job metrics (bytes read/written, error counts); Redshift query monitoring rules (alert on long-running or memory-intensive queries).
- AWS CloudTrail data events: log every S3 object-level access and DynamoDB API call, essential for data audit compliance.
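The consumer-lag check above can be reduced to a single comparison: GetRecords.IteratorAgeMilliseconds reports how old the oldest unread record is, so a rising value means consumers are falling behind producers. A minimal sketch; the one-minute threshold is a hypothetical choice, and in practice this would be a CloudWatch alarm on the metric rather than application code.

```python
# Flag consumer lag from Kinesis's GetRecords.IteratorAgeMilliseconds metric.
# ASSUMPTION: 60 s is an illustrative threshold; tune it to your stream's
# retention and the latency your downstream consumers can tolerate.
LAG_THRESHOLD_MS = 60_000

def consumer_is_lagging(iterator_age_ms: int) -> bool:
    """True when the oldest unread record exceeds the lag threshold,
    i.e. consumers are not keeping pace with producers."""
    return iterator_age_ms > LAG_THRESHOLD_MS

print(consumer_is_lagging(250))       # near-real-time consumer: healthy
print(consumer_is_lagging(180_000))   # three minutes behind: falling behind
```

An age approaching the stream's retention period is the critical case: records older than retention are dropped, so sustained high iterator age risks silent data loss.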