AWS DEA-C01

AWS Data Engineer Associate: Data Pipelines, Storage, and Analytics on AWS

The AWS Data Engineer Associate (DEA-C01) validates your ability to design, build, and maintain data pipelines and analytics infrastructure on AWS. It bridges the gap between the developer and data analyst roles: you need to know how to ingest data at scale, store it efficiently, transform it with the right tools, and make it available for analysis. If you build the data infrastructure that data scientists and analysts consume, this exam validates that foundational engineering capability.


Data Ingestion: Kinesis, MSK, and AWS Glue

Data ingestion patterns form the foundation of any data engineering architecture.

  • Kinesis Data Streams: real-time streaming. Shards are the unit of capacity (1 MB/s write, 2 MB/s read per shard); retention up to 365 days; replay capability for late consumers.
  • Kinesis Data Firehose: managed delivery to S3, Redshift, OpenSearch, and Splunk. Buffered by size or time, with optional inline Lambda transformation; no consumer management required.
  • Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): real-time SQL or Flink processing on streams: window functions, anomaly detection, aggregations.
  • Amazon MSK (Managed Streaming for Apache Kafka): fully managed Kafka. Bring existing Kafka workloads to AWS, use Kafka Connect for source/sink integrations, and MSK Serverless for variable throughput.
  • AWS Glue: serverless ETL. Glue Crawlers automatically discover schemas and populate the Data Catalog; Glue ETL jobs run PySpark or Python Shell scripts for transformation; Glue DataBrew offers visual no-code data preparation.
  • AWS Glue Data Catalog: central metadata repository (tables, partitions, schemas), integrated with Athena, Redshift Spectrum, and EMR for schema-on-read queries.
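The per-shard limits above drive capacity planning: you need enough shards to satisfy write throughput, record rate, and read throughput simultaneously. A minimal sketch of that sizing arithmetic, using the documented per-shard defaults (1 MB/s or 1,000 records/s in, 2 MB/s out); the workload numbers in the example are hypothetical:

```python
# Estimate the Kinesis Data Streams shard count for a given workload.
import math

def shards_needed(write_mb_per_s: float, records_per_s: int,
                  read_mb_per_s: float) -> int:
    """Return the minimum shard count satisfying all three per-shard limits."""
    by_write = math.ceil(write_mb_per_s / 1.0)    # 1 MB/s ingest per shard
    by_records = math.ceil(records_per_s / 1000)  # 1,000 records/s per shard
    by_read = math.ceil(read_mb_per_s / 2.0)      # 2 MB/s egress per shard
    return max(by_write, by_records, by_read, 1)

# 5 MB/s in, 3,000 records/s, two consumers each reading the full stream
# (10 MB/s out total): write and read both demand 5 shards.
print(shards_needed(5.0, 3000, 10.0))  # 5
```

Note that enhanced fan-out gives each registered consumer its own 2 MB/s per shard, which changes the read term; the sketch assumes shared (classic) throughput.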

Data Storage: S3 Lake House, Redshift, and DynamoDB

Storage-layer decisions for data engineering.

  • S3 data lake: the foundation. Use S3 Intelligent-Tiering for cost optimisation; partition data by date or category (year=2025/month=01/day=15/) for Athena and Redshift Spectrum query efficiency; enable S3 Event Notifications to trigger Glue or Lambda on new file arrival.
  • Apache Iceberg on S3: open table format for the data lake. ACID transactions on S3 files, time travel (query previous snapshots), and schema evolution without rewriting data; supported natively by Athena, Glue, and EMR.
  • Amazon Redshift: fully managed columnar data warehouse. Redshift Serverless for variable workloads (pay per RPU-hour); provisioned clusters for consistent high-volume workloads.
  • Redshift distribution styles: KEY (colocate matching rows for join performance), EVEN (round-robin, balanced load), ALL (replicate small dimension tables to every node).
  • Redshift sort keys: COMPOUND (sequential queries on leading columns), INTERLEAVED (multiple columns equally weighted).
  • Redshift Spectrum: query S3 data directly from Redshift SQL; no data movement, separating compute from storage.
  • DynamoDB Streams and the Kinesis Data Streams integration: capture all DynamoDB item changes for downstream processing.
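The date-based layout above is the Hive-style partitioning convention that Athena and Redshift Spectrum use for partition pruning. A minimal sketch of building such an object key; the dataset and file names are hypothetical:

```python
# Build a Hive-style partition prefix (year=/month=/day=) so query engines
# can skip whole partitions when a query filters on the date columns.
from datetime import date

def partition_prefix(d: date, dataset: str = "events") -> str:
    return f"{dataset}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

key = partition_prefix(date(2025, 1, 15)) + "part-0000.parquet"
print(key)  # events/year=2025/month=01/day=15/part-0000.parquet
```

The zero-padded months and days keep prefixes lexicographically sortable, which also helps S3 listing and lifecycle rules scoped by prefix.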

Data Transformation and Orchestration

Transformation and orchestration are the glue of data engineering.

  • AWS Glue workflows: chain crawlers and jobs into dependency graphs, triggered on a schedule or by events.
  • AWS Step Functions: orchestrate multi-step workflows involving Lambda, Glue, EMR, and ECS. Standard workflows give an audit trail and executions up to 1 year; Express workflows handle high volume with executions up to 5 minutes, suiting streaming pipelines.
  • Amazon EMR: managed Hadoop/Spark/Hive/Presto clusters. Use ephemeral clusters for cost efficiency (terminate after the job completes) or EMR Serverless for fully managed compute. Choose EMR for large-scale Spark transformations that exceed Glue's capabilities: complex ML feature engineering, custom Spark configurations.
  • Amazon Athena: serverless SQL on S3, billed per TB scanned. Use partitioning and columnar formats (Parquet, ORC) to reduce scan volume and cost.
  • Athena Federated Query: query data in RDS, DynamoDB, and on-premises sources alongside S3 via Lambda data source connectors.
  • AWS Lake Formation: security and governance layer for the data lake. Column-level and row-level permissions on Glue Data Catalog tables; cross-account data sharing with Lake Formation permissions.
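A common Athena pattern that ties transformation and cost optimisation together is CTAS (CREATE TABLE AS SELECT): rewrite a CSV-backed table as partitioned Parquet in one SQL statement. A sketch that composes such a statement; the database, table, and bucket names are hypothetical, while `format`, `external_location`, and `partitioned_by` are standard Athena CTAS properties:

```python
# Compose an Athena CTAS statement converting a CSV table to partitioned Parquet.
def ctas_to_parquet(src: str, dest: str, location: str, partition_col: str) -> str:
    return (
        f"CREATE TABLE {dest}\n"
        f"WITH (format = 'PARQUET',\n"
        f"      external_location = '{location}',\n"
        f"      partitioned_by = ARRAY['{partition_col}'])\n"
        f"AS SELECT * FROM {src}"
    )

sql = ctas_to_parquet("raw.events_csv", "curated.events",
                      "s3://example-bucket/curated/events/", "event_date")
print(sql)
```

In Athena the partition column must come last in the SELECT list; `SELECT *` works here only if `event_date` is already the final column of the source table.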

Data Quality, Security, and Monitoring

Production data engineering requires quality and governance controls.

  • AWS Glue Data Quality: define rules with DQDL (Data Quality Definition Language) covering completeness, uniqueness, accuracy, and consistency; rules run as part of Glue ETL jobs.
  • Deequ (from AWS Labs): open-source data quality library for Spark, used on EMR for programmatic quality checks.
  • Data lineage: AWS Glue lineage tracking records data origins and transformations automatically, visualised in the Glue console.
  • Security: S3 bucket policies and IAM for data lake access control; Lake Formation for fine-grained column and row permissions; KMS customer managed key (CMK) encryption for S3 and Redshift; VPC endpoints for private access to S3 and Redshift from EMR and Glue.
  • Monitoring: CloudWatch metrics for Kinesis (GetRecords.IteratorAgeMilliseconds tracks consumer lag; a high age means consumers are falling behind); Glue job metrics (bytes read/written, error counts); Redshift query monitoring rules (alert on long-running or memory-intensive queries).
  • AWS CloudTrail data events: log every S3 object access and DynamoDB API call, essential for data audit compliance.
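To make the DQDL rule semantics concrete, here is a sketch of completeness and uniqueness checks in plain Python. In a real pipeline these would be DQDL rules evaluated by Glue Data Quality; the sample rows and thresholds are hypothetical:

```python
# Plain-Python analogues of two DQDL rule types.
def completeness(rows, col):
    """Fraction of rows where `col` is present and non-null,
    like DQDL: Completeness "col" > threshold."""
    return sum(1 for r in rows if r.get(col) is not None) / len(rows)

def is_unique(rows, col):
    """True if all non-null values of `col` are distinct,
    like DQDL: IsUnique "col"."""
    vals = [r[col] for r in rows if r.get(col) is not None]
    return len(vals) == len(set(vals))

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
]
print(is_unique(rows, "id"))         # True: the IsUnique rule passes
print(completeness(rows, "email"))   # ~0.667: fails a > 0.9 completeness rule
```

Glue Data Quality evaluates such a ruleset inline and can either fail the job or just record the outcome, matching the "fail job or log issues" behaviour noted in the key facts.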

Key exam facts — DEA-C01

  • Kinesis Data Streams: 1 MB/s write per shard; Firehose: managed delivery with buffering, no consumer code
  • Glue Data Catalog: central schema registry — integrated with Athena, Redshift Spectrum, EMR
  • Apache Iceberg on S3: ACID transactions, time travel, schema evolution — no rewrite needed
  • Redshift KEY distribution: collocate matching rows to eliminate shuffle joins
  • Athena: use Parquet/ORC columnar format + partitioning to minimise data scanned per query
  • Lake Formation: column-level and row-level security on top of Glue Data Catalog
  • Step Functions Standard: audit trail up to 1 year; Express: high-volume, up to 5 minutes
  • EMR ephemeral clusters: terminate after job completes to avoid idle costs
  • Kinesis IteratorAgeMilliseconds: high value means consumer is falling behind — scale shards or consumers
  • Glue Data Quality DQDL rules run inline in ETL jobs — fail job or log issues on quality violations

Common exam traps

AWS Glue and AWS EMR do the same transformation work

Glue is managed Spark with a simplified development model — great for standard ETL. EMR gives full control over Spark, Hive, Presto, and Hadoop — needed for complex ML feature engineering, custom Spark configurations, or workloads that need fine-grained cluster tuning.

Storing all data in S3 in CSV format is sufficient for analytics

CSV is row-oriented and uncompressed — Athena scans the entire file regardless of the columns you query. Columnar formats like Parquet store data by column, compressed, and support predicate pushdown. A Parquet file is typically 10-100x cheaper to query in Athena than the equivalent CSV.
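The cost gap follows directly from Athena's per-TB pricing. A sketch of the arithmetic: the $5/TB rate is Athena's published price in many regions, and the 90% reduction from compression plus column pruning is an illustrative assumption at the conservative end of the 10-100x range:

```python
# Rough Athena scan-cost comparison: CSV (full scan) vs Parquet (pruned).
PRICE_PER_TB = 5.00  # USD per TB scanned; varies by region

def scan_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB

csv_tb = 10.0                # row-oriented: every query scans the whole file
parquet_tb = csv_tb * 0.10   # compressed, only the queried columns are read

print(scan_cost(csv_tb), scan_cost(parquet_tb))  # 50.0 5.0 per query
```

Multiply by query volume and the difference dominates the one-time cost of converting the data, which is why "convert to Parquet and partition" is the default exam answer for Athena cost questions.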

Kinesis Data Streams and Kinesis Data Firehose are interchangeable

Kinesis Data Streams is a low-latency streaming buffer you consume directly with custom consumers (Lambda, KCL app, Flink). Firehose is a managed delivery service — it buffers and delivers to a destination (S3, Redshift, Splunk) without you writing consumer code. Choose Streams when you need real-time processing; Firehose when you need managed near-real-time delivery.
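Firehose's "managed near-real-time" behaviour comes from its buffering hints: it flushes a batch when either the size hint or the interval hint is reached, whichever comes first. A toy simulation of that semantics; the thresholds mirror Firehose's configurable hints, but the class itself is purely illustrative:

```python
# Toy Firehose-style buffer: flush on size OR elapsed time, whichever first.
class Buffer:
    def __init__(self, max_mb: float = 5.0, max_secs: int = 300):
        self.max_bytes = int(max_mb * 1024 * 1024)
        self.max_secs = max_secs
        self.records, self.size, self.age = [], 0, 0
        self.flushed = []  # stands in for delivery to S3, Redshift, etc.

    def put(self, record: bytes, elapsed_secs: int = 0):
        self.records.append(record)
        self.size += len(record)
        self.age += elapsed_secs
        if self.size >= self.max_bytes or self.age >= self.max_secs:
            self.flushed.append(self.records)
            self.records, self.size, self.age = [], 0, 0

buf = Buffer(max_mb=0.001, max_secs=60)  # tiny ~1 KB limit for the demo
for _ in range(3):
    buf.put(b"x" * 512)
print(len(buf.flushed))  # 1: the size hint triggered a flush on the third put
```

This is also why Firehose is near-real-time rather than real-time: a record can sit in the buffer up to the interval hint before delivery, whereas a Kinesis Data Streams consumer sees it within milliseconds.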
