Data Ingestion: Kinesis, MSK, and AWS Glue
Data ingestion patterns form the foundation of any data engineering architecture.
- Kinesis Data Streams: real-time streaming. Shards are the unit of capacity (1 MB/s or 1,000 records/s write, 2 MB/s read per shard); retention is configurable up to 365 days; replay lets late consumers reprocess records.
- Kinesis Data Firehose: managed delivery to S3, Redshift, OpenSearch, and Splunk. Buffers by size or time, supports optional inline Lambda transformation, and requires no consumer management.
- Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics): real-time SQL or Flink processing on streams: window functions, anomaly detection, aggregations.
- Amazon MSK (Managed Streaming for Apache Kafka): fully managed Kafka. Bring existing Kafka workloads to AWS, use Kafka Connect for source/sink integrations, and use MSK Serverless for variable throughput.
- AWS Glue: serverless ETL. Glue Crawlers automatically discover schemas and populate the Data Catalog; Glue ETL jobs run PySpark or Python Shell scripts for transformation; Glue DataBrew provides visual no-code data preparation.
- AWS Glue Data Catalog: central metadata repository (tables, partitions, schemas) integrated with Athena, Redshift Spectrum, and EMR for schema-on-read queries.
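The per-shard write limits above (1 MB/s or 1,000 records/s) imply a simple sizing rule: a stream needs enough shards to satisfy whichever limit binds first. A minimal sketch; the workload numbers in the example are hypothetical.

```python
import math

# Per-shard write limits for Kinesis Data Streams (from the notes above).
SHARD_WRITE_MB_PER_S = 1.0
SHARD_WRITE_RECORDS_PER_S = 1_000

def shards_needed(ingress_mb_per_s: float, ingress_records_per_s: float) -> int:
    """Minimum shard count satisfying both the throughput and record-rate limits."""
    by_throughput = math.ceil(ingress_mb_per_s / SHARD_WRITE_MB_PER_S)
    by_records = math.ceil(ingress_records_per_s / SHARD_WRITE_RECORDS_PER_S)
    return max(1, by_throughput, by_records)

# Hypothetical workload: 4.5 MB/s of small events arriving at 12,000 records/s.
print(shards_needed(4.5, 12_000))  # the record rate, not throughput, dominates here
```

Note that small, frequent records can force more shards than raw throughput suggests, which is one reason producers batch records (e.g. via PutRecords or the Kinesis Producer Library).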
Data Storage: S3 Lake House, Redshift, and DynamoDB
Storage layer decisions for data engineering.
- S3 data lake: the foundation. Use S3 Intelligent-Tiering for cost optimisation; partition data by date or category (year=2025/month=01/day=15/) for Athena and Redshift Spectrum query efficiency; enable S3 Event Notifications to trigger Glue or Lambda on new file arrival.
- Apache Iceberg on S3: open table format for the data lake. ACID transactions on S3 files, time travel (query previous snapshots), and schema evolution without rewriting data; supported natively by Athena, Glue, and EMR.
- Amazon Redshift: fully managed columnar data warehouse. Redshift Serverless for variable workloads (pay per RPU-hour); provisioned clusters for consistent high-volume workloads.
- Distribution styles: KEY (collocate matching rows for join performance), EVEN (round-robin, balanced load), ALL (replicate small dimension tables to every node).
- Sort keys: COMPOUND (best when queries filter on the leading columns in order), INTERLEAVED (multiple columns weighted equally).
- Redshift Spectrum: query S3 data directly from Redshift SQL with no data movement, separating compute from storage.
- DynamoDB Streams and Kinesis Data Streams integration: capture all DynamoDB item-level changes for downstream processing.
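The Hive-style year=/month=/day= layout above is what lets Athena and Redshift Spectrum prune partitions. A small sketch of generating such prefixes consistently; the `events` base prefix is a hypothetical name.

```python
from datetime import date

def partition_prefix(day: date, base: str = "events") -> str:
    """Build a Hive-style year=/month=/day= S3 key prefix.

    Month and day are zero-padded so lexicographic key ordering
    matches chronological ordering, which partition pruning relies on.
    """
    return (f"{base}/year={day.year}/"
            f"month={day.month:02d}/day={day.day:02d}/")

print(partition_prefix(date(2025, 1, 15)))
# events/year=2025/month=01/day=15/
```

Writers (a Glue job, a Firehose dynamic-partitioning prefix, or a Lambda) should all derive keys from one function like this, so a table's partitions never mix padded and unpadded formats.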
Data Transformation and Orchestration
Transformation and orchestration are the glue of data engineering.
- AWS Glue workflows: chain crawlers and jobs into dependency graphs, triggered on a schedule or by event.
- AWS Step Functions: orchestrate multi-step workflows involving Lambda, Glue, EMR, and ECS. Standard workflows provide a full audit trail and run up to one year; Express workflows handle high-volume executions up to five minutes, suited to streaming pipelines.
- Amazon EMR: managed Hadoop/Spark/Hive/Presto clusters. Use ephemeral clusters for cost efficiency (terminate after the job completes), or EMR Serverless for fully managed compute. Choose EMR for large-scale Spark transformations that exceed Glue's capabilities, such as complex ML feature engineering or custom Spark configurations.
- Amazon Athena: serverless SQL on S3, billed per TB scanned. Use partitioning and columnar formats (Parquet, ORC) to reduce scan volume and cost.
- Athena Federated Query: query data in RDS, DynamoDB, and on-premises sources alongside S3 via Lambda data source connectors.
- AWS Lake Formation: security and governance layer for the data lake. Column-level and row-level permissions on Glue Data Catalog tables; cross-account data sharing with Lake Formation permissions.
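Athena's pay-per-TB-scanned model is why partitioning and columnar formats matter: they shrink the bytes each query reads. A back-of-envelope cost sketch; the $5/TB rate is an assumption (pricing varies by region), and the dataset size, partition fraction, and column fraction are hypothetical.

```python
# Rough Athena cost model: you pay per TB of data a query scans.
# ASSUMPTION: $5.00/TB, a commonly cited rate; check current regional pricing.
PRICE_PER_TB = 5.00

def scan_cost(tb_scanned: float) -> float:
    """Dollar cost of scanning the given number of TB."""
    return tb_scanned * PRICE_PER_TB

# Hypothetical 10 TB table holding one year of events.
full_scan = scan_cost(10.0)                  # raw JSON, no pruning: read everything
# Same query against date-partitioned Parquet: one day's partition (1/365 of
# the data), and columnar storage means only the ~20% of columns referenced
# are read.
pruned = scan_cost(10.0 / 365 * 0.2)
print(f"full scan: ${full_scan:.2f}, partitioned Parquet: ${pruned:.4f}")
```

The two levers compound: partition pruning cuts the rows read, the columnar format cuts the columns read, so well-laid-out data can be orders of magnitude cheaper to query.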
Data Quality, Security, and Monitoring
Production data engineering requires quality and governance controls.
- AWS Glue Data Quality: define rules in DQDL (Data Quality Definition Language) covering completeness, uniqueness, accuracy, and consistency; rules run as part of Glue ETL jobs.
- Deequ: AWS Labs' open-source data quality library for Spark, used on EMR for programmatic quality checks.
- Data lineage: AWS Glue lineage tracking records data origins and transformations automatically, visualised in the Glue console.
- Security: S3 bucket policies and IAM for data lake access control; Lake Formation for fine-grained column and row permissions; KMS customer managed key (CMK) encryption for S3 and Redshift; VPC endpoints for private access to S3 and Redshift from EMR and Glue.
- Monitoring: CloudWatch metrics for Kinesis (GetRecords.IteratorAgeMilliseconds measures consumer lag; a high age means consumers are falling behind); Glue job metrics (bytes read/written, error counts); Redshift query monitoring rules (alert on long-running or memory-intensive queries).
- AWS CloudTrail data events: log every S3 object-level access and DynamoDB API call, essential for data audit compliance.
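The consumer-lag check above can be reduced to a single comparison: GetRecords.IteratorAgeMilliseconds reports how old the oldest unread record is, so a rising value means consumers are falling behind producers. A minimal sketch; the one-minute threshold is a hypothetical choice, and in practice this would be a CloudWatch alarm on the metric rather than application code.

```python
# Flag consumer lag from Kinesis's GetRecords.IteratorAgeMilliseconds metric.
# ASSUMPTION: 60 s is an illustrative threshold; tune it to your stream's
# retention and the latency your downstream consumers can tolerate.
LAG_THRESHOLD_MS = 60_000

def consumer_is_lagging(iterator_age_ms: int) -> bool:
    """True when the oldest unread record exceeds the lag threshold,
    i.e. consumers are not keeping pace with producers."""
    return iterator_age_ms > LAG_THRESHOLD_MS

print(consumer_is_lagging(250))       # near-real-time consumer: healthy
print(consumer_is_lagging(180_000))   # three minutes behind: falling behind
```

An age approaching the stream's retention period is the critical case: records older than retention are dropped, so sustained high iterator age risks silent data loss.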