Azure DP-203

DP-203 Azure Data Engineer Associate: Data Storage, Processing, and Security at Scale

DP-203 is Microsoft's data engineering certification for Azure. It validates your ability to design and implement data engineering solutions — data ingestion pipelines, storage architecture, batch and streaming processing, and data security on Azure. This is the credential for data engineers who build the plumbing that moves, transforms, and stores data at scale, enabling the data analysts and scientists who consume it downstream.

11 min
3 sections · 10 exam key points

Data Storage: ADLS Gen2, Synapse, and Cosmos DB

DP-203 storage architecture:

  • Azure Data Lake Storage Gen2 (ADLS Gen2): hierarchical namespace (HNS) over Azure Blob — a true directory and file structure (unlike flat blob containers), POSIX ACLs for fine-grained security, optimised for analytics workloads.
  • Data lake zones: Raw/Bronze (ingested data as-is), Curated/Silver (cleaned and joined), Refined/Gold (aggregated for reporting).
  • Azure Synapse Analytics: unified analytics platform with four compute options — Dedicated SQL Pool (formerly SQL DW — a Massively Parallel Processing (MPP) database that spreads tables across 60 distributions: hash distribution for large fact tables, round-robin for staging tables, replicated for small dimension tables), Serverless SQL Pool (query files in ADLS Gen2 or Blob with T-SQL — pay per TB scanned, no provisioning), Apache Spark Pool (managed Spark for data engineering, ML, and Python notebooks), Synapse Link (near-real-time analytics on Cosmos DB operational data without ETL).
  • Distribution design: hash distribute on the column used in joins and GROUP BY — this avoids data movement (shuffle) during query execution and is the largest performance lever in a dedicated SQL pool.
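The shuffle-avoidance point can be made concrete with a small simulation. This is a plain-Python sketch of the idea, not Synapse internals: rows hashed on the same key land on the same distribution, so a fact table and a dimension table hash-distributed on the shared join key can be joined locally on each distribution. All names and sample rows below are illustrative.

```python
# Illustrative sketch (not Synapse internals): hash vs round-robin
# placement of rows across the 60 distributions of a dedicated SQL pool.

DISTRIBUTIONS = 60

def hash_distribution(key):
    """Rows with the same key always land on the same distribution."""
    return hash(key) % DISTRIBUTIONS

def round_robin_distribution(row_index):
    """Rows spread evenly regardless of content - fast for staging
    loads, but any join then needs data movement between distributions."""
    return row_index % DISTRIBUTIONS

# Hypothetical fact and dimension rows, both keyed on customer id.
fact_rows = [("cust-1", 100), ("cust-2", 250), ("cust-1", 75)]
dim_rows = [("cust-1", "Alice"), ("cust-2", "Bob")]

fact_placement = {k: hash_distribution(k) for k, _ in fact_rows}
dim_placement = {k: hash_distribution(k) for k, _ in dim_rows}

for key in fact_placement:
    # Same key -> same distribution on both sides: a shuffle-free join.
    assert fact_placement[key] == dim_placement[key]
```

Round-robin gives even data skew but no co-location, which is why it suits staging tables that are loaded fast and then redistributed.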

Data Pipelines: Azure Data Factory and Synapse Pipelines

Azure Data Factory (ADF) and Synapse Pipelines (the same engine) orchestrate data movement.

  • Pipeline building blocks: Linked Services (connection configurations to sources and sinks — Azure SQL, Blob, ADLS, Snowflake, REST APIs, SAP, Salesforce), Datasets (a named reference to data within a linked service — a table, file path, or query), Activities (the work performed — Copy Data moves data between linked services, Data Flow transforms data visually, Execute Pipeline chains pipelines, Stored Procedure runs SQL, Web Activity calls REST APIs, Until/ForEach loop).
  • Integration Runtimes: Azure IR (managed, regional — cloud-to-cloud data movement), Self-Hosted IR (installed on-premises or in a private network — for on-premises sources behind a firewall), Azure-SSIS IR (runs SSIS packages in the cloud).
  • Mapping Data Flows: visually designed transformations (select, filter, join, aggregate, pivot, rank, lookup, conditional split, derived column, flatten JSON arrays) — compiled to Spark under the hood and debuggable in the UI.
  • Triggers: Schedule (time-based), Tumbling Window (non-overlapping intervals with dependency chains between runs), Event-based (blob created/deleted, or a custom event from Event Grid).
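What distinguishes a tumbling-window trigger from a plain schedule trigger is that each run carries explicit, contiguous window bounds. A minimal sketch of that behaviour in plain Python (the function name is illustrative, not the ADF API):

```python
# Sketch: a tumbling-window trigger produces back-to-back,
# non-overlapping windows with explicit start/end bounds, so each run
# processes exactly one slice of time and can depend on the previous
# window's run having succeeded.
from datetime import datetime, timedelta

def tumbling_windows(start, interval, count):
    """Yield (window_start, window_end) pairs back-to-back."""
    for i in range(count):
        w_start = start + i * interval
        yield (w_start, w_start + interval)

start = datetime(2024, 1, 1, 0, 0)
windows = list(tumbling_windows(start, timedelta(hours=1), 3))

# Contiguous and non-overlapping: each window's end is the next
# window's start, so every event belongs to exactly one window.
for (_, end_a), (start_b, _) in zip(windows, windows[1:]):
    assert end_a == start_b
```

A schedule trigger, by contrast, just fires at wall-clock times and carries no window state, which is why backfills and run dependencies are tumbling-window features.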

Stream Processing and Security

Streaming data engineering:

  • Azure Event Hubs: high-throughput event streaming (millions of events/second), partitioned log model (retention is configurable — 1 day by default, up to 7 days on the Standard tier, longer on Premium/Dedicated), consumer groups for independent consumers, Kafka-compatible protocol.
  • Azure Stream Analytics (ASA): real-time SQL-based stream processing — queries use SQL syntax with time windows (Tumbling: non-overlapping fixed-size intervals; Hopping: overlapping fixed-size intervals; Sliding: event-triggered — all events within a time range of each event; Session: variable-size, grouped by activity) and reference-data joins (enrich stream events with lookup data from Blob or SQL). Outputs: Blob/ADLS (data lake), SQL Database, Cosmos DB, Power BI (real-time dashboards), Event Hubs (forward processed events).
  • Data security: dynamic data masking in Synapse (hides column values from non-privileged users), column-level security (SQL GRANT on specific columns), row-level security (CREATE SECURITY POLICY with a predicate function — filters rows by user context), encryption (TDE for SQL pools; ADLS Gen2 encrypted at rest with service-managed or customer-managed keys), and private endpoints for all Azure data services — eliminating public internet exposure of the data infrastructure.
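The tumbling-versus-hopping distinction is a frequent exam point and is easy to see in code. This is a plain-Python sketch of the window-assignment maths, not ASA syntax: a tumbling window puts each event in exactly one bucket, while a hopping window (size greater than hop) puts it in several overlapping buckets.

```python
# Sketch of ASA-style window assignment over integer event timestamps.

def tumbling_bucket(ts, size):
    """The single window [start, start + size) containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_buckets(ts, size, hop):
    """All windows of length `size`, advancing by `hop`, that contain ts."""
    buckets = []
    # Smallest multiple of `hop` strictly greater than ts - size is the
    # earliest window start whose window still contains ts.
    start = ((ts - size) // hop + 1) * hop
    while start <= ts:
        buckets.append((start, start + size))
        start += hop
    return buckets

# An event at t=7 with 10s tumbling windows falls in exactly one bucket;
# with 10s hopping windows every 5s it falls in two overlapping buckets.
assert tumbling_bucket(7, 10) == (0, 10)
assert hopping_buckets(7, 10, 5) == [(0, 10), (5, 15)]
```

When size equals hop, a hopping window degenerates into a tumbling window, which is a handy way to remember the relationship.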

Key exam facts — DP-203

  • ADLS Gen2: hierarchical namespace + POSIX ACLs — optimised for analytics on top of Azure Blob
  • Synapse Dedicated SQL Pool: hash distribute fact tables on join/GROUP BY column — avoids shuffle
  • Serverless SQL Pool: query ADLS Gen2 files with T-SQL — pay per TB scanned, no provisioning
  • ADF activities: Copy Data (movement), Mapping Data Flow (transformation), Execute Pipeline (orchestration)
  • Integration Runtime: Azure IR for cloud-to-cloud; Self-Hosted IR for on-premises sources
  • Stream Analytics tumbling window: non-overlapping fixed intervals — aggregate per time bucket
  • Event Hubs: consumer groups enable independent stream consumers with separate offsets
  • Mapping Data Flows compile to Spark — visually designed transformations, debuggable in UI
  • Row-Level Security in Synapse: CREATE SECURITY POLICY with predicate function filters rows by user
  • Private endpoints for Synapse, ADLS, Event Hubs — remove public internet access to data services
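Row-level security and dynamic data masking from the list above can be modelled in a few lines. This is a toy Python model, not T-SQL — the predicate, mask, and user shapes are all hypothetical — but it mirrors the mechanics: a predicate function filters rows by user context, and a mask rewrites column values for non-privileged users.

```python
# Toy model of Synapse row-level security + dynamic data masking.

ROWS = [
    {"region": "emea", "customer": "Alice", "email": "alice@example.com"},
    {"region": "apac", "customer": "Bob",   "email": "bob@example.com"},
]

def rls_predicate(row, user):
    """Like a SECURITY POLICY predicate function: sales users see only
    their own region; admins see everything."""
    return user["role"] == "admin" or row["region"] == user["region"]

def mask_email(value, user):
    """Like a dynamic data mask: non-privileged users get a redacted value."""
    if user["role"] == "admin":
        return value
    name, _, domain = value.partition("@")
    return name[0] + "***@" + domain

def query(user):
    """Apply RLS first, then masking - as the engine would on SELECT."""
    return [
        {**row, "email": mask_email(row["email"], user)}
        for row in ROWS
        if rls_predicate(row, user)
    ]

emea_sales = {"role": "sales", "region": "emea"}
assert query(emea_sales) == [
    {"region": "emea", "customer": "Alice", "email": "a***@example.com"}
]
```

Note the layering: RLS decides which rows exist for the user at all, and masking then redacts columns within the rows that survive — the exam tests that these are independent controls.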

Common exam traps

Azure Data Factory and Synapse Pipelines are different products

Synapse Pipelines is the same ADF pipeline engine embedded within Azure Synapse Analytics. The code is identical — the difference is the workspace it is managed in. Choose ADF standalone for pure data integration; choose Synapse Pipelines when you also use Synapse SQL or Spark.

Dedicated SQL Pool and Serverless SQL Pool are interchangeable

Dedicated SQL Pool is a provisioned MPP database with billable capacity — optimised for complex queries over large structured tables. Serverless SQL Pool queries files directly in the data lake on demand — pay-per-scan, no management, best for ad-hoc exploration and lightly structured data.
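The trade-off is ultimately a billing-model question, which a back-of-envelope calculation makes tangible. The prices below are placeholder assumptions, not Azure list prices — check the pricing page for your region.

```python
# Back-of-envelope cost sketch: pay-per-scan vs provisioned capacity.
# Both rates are assumed placeholders for illustration only.

SERVERLESS_PER_TB = 5.00    # assumed $/TB scanned by serverless SQL pool
DEDICATED_PER_HOUR = 1.20   # assumed $/hour for a small dedicated pool

def serverless_cost(tb_scanned_per_day, days):
    """Pay only for data processed - nothing while idle."""
    return tb_scanned_per_day * days * SERVERLESS_PER_TB

def dedicated_cost(hours_running_per_day, days):
    """Pay for provisioned capacity whenever the pool is running."""
    return hours_running_per_day * days * DEDICATED_PER_HOUR

# Light ad-hoc exploration (0.1 TB/day) is far cheaper serverless than
# keeping a dedicated pool running around the clock for a month.
assert serverless_cost(0.1, 30) < dedicated_cost(24, 30)
```

The crossover flips for heavy, constant workloads: once scans are large and continuous, provisioned capacity amortises better, which matches the guidance above.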
