Data Storage: ADLS Gen2, Synapse, and Cosmos DB
DP-203 storage architecture.

Azure Data Lake Storage Gen2 (ADLS Gen2): a hierarchical namespace (HNS) over Azure Blob — true directory and file structure (unlike flat blob containers), POSIX-style ACLs for fine-grained security, optimised for analytics workloads. Data lake zones: Raw/Bronze (ingested data as-is), Curated/Silver (cleaned and joined), Refined/Gold (aggregated for reporting).

Azure Synapse Analytics: unified analytics platform with four compute options. Dedicated SQL Pool (formerly SQL DW): a Massively Parallel Processing (MPP) database that spreads every table across 60 distributions — hash distribution for large fact tables, round-robin for staging tables, replicated for small dimension tables. Serverless SQL Pool: query files in ADLS Gen2 or Blob with T-SQL — pay per TB scanned, nothing to provision. Apache Spark Pool: managed Spark for data engineering, ML, and Python notebooks. Synapse Link: real-time analytics on Cosmos DB operational data without ETL.

Distribution design: hash-distribute on the column used in joins and GROUP BY — this avoids data movement (shuffle) during query execution and is the largest performance lever in a dedicated SQL pool.
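A minimal T-SQL sketch of the three distribution choices in a dedicated SQL pool (table and column names are hypothetical):

```sql
-- Large fact table: hash-distribute on the join/GROUP BY key
-- so joins on that key need no data movement.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT NOT NULL,
    CustomerKey INT    NOT NULL,
    Amount      DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Staging table: round-robin spreads rows evenly during bulk ingest.
CREATE TABLE stg.SalesLoad
(
    SaleKey     BIGINT,
    CustomerKey INT,
    Amount      DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Small dimension: replicate a full copy to every compute node.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT NOT NULL,
    Name        NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
```

A join between FactSales and DimCustomer on CustomerKey then runs distribution-local, with no shuffle.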
Data Pipelines: Azure Data Factory and Synapse Pipelines
Azure Data Factory (ADF) and Synapse Pipelines (same engine) orchestrate data movement.

Pipeline building blocks: Linked Services (connection configurations to sources and sinks — Azure SQL, Blob, ADLS, Snowflake, REST APIs, SAP, Salesforce), Datasets (a named reference to data within a linked service — a table, file path, or query), Activities (the work performed — Copy Data copies between linked services, Data Flow transforms data visually, Execute Pipeline chains pipelines, Stored Procedure runs SQL, Web Activity calls REST APIs, Until/ForEach provide loops).

Integration Runtimes: Azure IR (managed, regional — for cloud-to-cloud data movement), Self-Hosted IR (installed on-premises or in a private network — for sources behind a firewall), Azure-SSIS IR (runs SSIS packages in the cloud).

Mapping Data Flows: visually designed transformations (select, filter, join, aggregate, pivot, rank, lookup, conditional split, derived column, flatten for JSON arrays) — compiled to Spark under the hood, debuggable in the UI.

Triggers: Schedule (time-based), Tumbling Window (non-overlapping intervals with dependency chains between runs), Event-based (blob created/deleted, or a custom event from Event Grid).
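A common pattern these building blocks combine into is incremental (delta) copy driven by a watermark: a Lookup activity reads the last watermark, the Copy activity's source query selects only newer rows, and a Stored Procedure activity advances the watermark after a successful run. A sketch of the supporting T-SQL objects — table, procedure, and query are illustrative names, not anything ADF prescribes:

```sql
-- Watermark store read by the pipeline's Lookup activity.
CREATE TABLE dbo.WatermarkTable
(
    TableName      SYSNAME   NOT NULL PRIMARY KEY,
    WatermarkValue DATETIME2 NOT NULL
);

-- Source query used by the Copy activity; @LastWatermark is
-- injected from the Lookup activity's output.
-- SELECT * FROM dbo.Orders WHERE ModifiedDate > @LastWatermark;

-- Called by the Stored Procedure activity after the copy succeeds.
CREATE PROCEDURE dbo.usp_UpdateWatermark
    @TableName    SYSNAME,
    @NewWatermark DATETIME2
AS
BEGIN
    UPDATE dbo.WatermarkTable
    SET WatermarkValue = @NewWatermark
    WHERE TableName = @TableName;
END;
```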
Stream Processing and Security
Streaming data engineering. Azure Event Hubs: high-throughput event streaming (millions of events/second), partitioned log model, configurable retention (1 day by default on the Standard tier, up to 7 days; longer on Premium/Dedicated), consumer groups for independent consumers, Kafka-compatible protocol.

Azure Stream Analytics (ASA): real-time stream processing with SQL-syntax queries over time windows — Tumbling (non-overlapping fixed-size intervals), Hopping (overlapping fixed-size intervals), Sliding (output recomputed as events enter and leave a fixed-length window), Session (variable size, grouped by periods of activity). Reference data joins enrich stream events with lookup data from Blob or SQL. Outputs: Blob/ADLS (data lake), SQL Database, Cosmos DB, Power BI (real-time dashboards), Event Hubs (forward processed events).

Data security: dynamic data masking in Synapse (obscures column values for non-privileged users), column-level security (GRANT specific users access to specific columns), row-level security (CREATE SECURITY POLICY with a predicate function — filters rows by user context), encryption (TDE for SQL pools; ADLS Gen2 encrypted at rest with service-managed keys or CMK). Private endpoints for all Azure data services eliminate public internet exposure of the data infrastructure.
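An illustrative ASA query using a tumbling window (the input/output aliases and the EventTime column are hypothetical):

```sql
-- Count events per device in non-overlapping 5-minute windows,
-- keyed on the event's own timestamp rather than arrival time.
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO [powerbi-output]
FROM [eventhub-input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```

Swapping TumblingWindow for HoppingWindow(minute, 5, 1) would instead emit a 5-minute count every minute, with windows overlapping.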
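A row-level security sketch in T-SQL, modelled on the CREATE SECURITY POLICY pattern (schema, function, table, and user names are hypothetical):

```sql
-- Predicate function: a row is visible only to the matching
-- sales rep, or to a designated manager account.
CREATE FUNCTION security.fn_salesPredicate(@SalesRep AS SYSNAME)
    RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_result
    WHERE @SalesRep = USER_NAME() OR USER_NAME() = 'Manager';

-- Bind the predicate to the table; non-matching rows are
-- silently filtered out of every query.
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE security.fn_salesPredicate(SalesRep)
ON dbo.Sales
WITH (STATE = ON);
```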