Data Engineering for ML: Ingestion, Processing, and Labelling
Data is the foundation of ML.
- Data ingestion: S3 as the central data lake for training data; streaming ingestion via Kinesis Data Streams or Kinesis Data Firehose for real-time features; AWS Glue for ETL (serverless Spark jobs that extract, clean, and transform data for ML training).
- Data exploration and preparation: SageMaker Data Wrangler — visual interface for data preparation, joins, transformations, and bias detection; exports the data flow to a SageMaker Pipeline. EMR (managed Hadoop/Spark cluster) for large-scale distributed data processing with Spark MLlib. AWS Glue DataBrew: no-code data preparation with 250+ built-in transformations, for analysts who prefer a visual tool over code.
- Data labelling: SageMaker Ground Truth — managed data labelling service using a workforce (public Amazon Mechanical Turk, private internal workforce, or AWS Marketplace vendor workforce) with automated labelling via active learning: the model labels high-confidence samples and humans label low-confidence ones, reducing labelling cost by up to roughly 70% versus fully human labelling.
- Feature engineering: SageMaker Feature Store — centralised repository for ML features with an online store (low-latency real-time inference) and an offline store (training, S3-backed). Features are versioned and reusable across multiple models.
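The active-learning split that Ground Truth automates can be sketched in a few lines. This is an illustrative simplification, not Ground Truth's actual internals: the confidence threshold and data below are made up.

```python
# Sketch of the active-learning idea: the model's label is accepted for
# high-confidence samples; low-confidence samples go to human annotators.
CONFIDENCE_THRESHOLD = 0.90  # hypothetical auto-label cutoff

def split_for_labelling(predictions):
    """predictions: list of (item_id, predicted_label, confidence) tuples."""
    auto_labelled, needs_human = [], []
    for item_id, label, confidence in predictions:
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_labelled.append((item_id, label))  # model's label accepted
        else:
            needs_human.append(item_id)             # routed to the workforce
    return auto_labelled, needs_human

preds = [("img1", "cat", 0.98), ("img2", "dog", 0.55), ("img3", "cat", 0.93)]
auto, human = split_for_labelling(preds)
# auto → [("img1", "cat"), ("img3", "cat")]; human → ["img2"]
```

The cost saving comes from the auto-labelled fraction: only the items in `needs_human` are paid for.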
Model Training and Hyperparameter Tuning with SageMaker
SageMaker is the central ML platform on AWS.
- Built-in algorithms: XGBoost (gradient boosted trees — tabular data, classification and regression, most versatile), Linear Learner (linear and logistic regression — binary/multi-class classification, regression), Object Detection and Image Classification (CNNs — computer vision), BlazingText (word embeddings and text classification), Sequence-to-Sequence (NLP translation and summarisation), Factorisation Machines (recommendation systems with sparse data), K-Means (unsupervised clustering), PCA (dimensionality reduction).
- Training job configuration: choose a container (built-in algorithm, AWS deep learning container for TensorFlow/PyTorch/MXNet, or a custom Docker image), an instance type (ml.p3 for GPU training, ml.c5 for CPU), and input channels (S3 data sources; RecordIO or CSV format; File or Pipe mode — Pipe mode streams data directly from S3 to the training job, faster for large datasets).
- SageMaker Automatic Model Tuning (hyperparameter optimisation): define the hyperparameter ranges, choose a tuning strategy (Bayesian is more efficient than random search because it learns from prior evaluations), and set the objective metric (validation:accuracy, validation:rmse); the tuner runs parallel training jobs to find optimal hyperparameters.
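The training-job knobs above (container, instance type, input channels, Pipe mode) map directly onto the `CreateTrainingJob` API. A minimal sketch of the boto3 request body for built-in XGBoost — the job name, role ARN, bucket paths, and image URI are placeholders, not real values:

```python
# Sketch of a SageMaker CreateTrainingJob request for built-in XGBoost.
# Role ARN, S3 paths, and the ECR image URI below are placeholders.
training_job_request = {
    "TrainingJobName": "xgboost-demo-job",
    "AlgorithmSpecification": {
        # Built-in algorithm images are region-specific ECR URIs (placeholder)
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "Pipe",  # stream from S3 rather than copy to disk
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "InputDataConfig": [{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": "text/csv",
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "ResourceConfig": {"InstanceType": "ml.c5.xlarge", "InstanceCount": 1,
                       "VolumeSizeInGB": 10},
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "HyperParameters": {"objective": "binary:logistic", "num_round": "100"},
}
# With AWS credentials configured, this would be submitted as:
# boto3.client("sagemaker").create_training_job(**training_job_request)
```

Automatic Model Tuning wraps the same job definition and varies the `HyperParameters` values across parallel runs.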
Model Evaluation, Deployment, and Monitoring
Model evaluation and deployment:
- Classification metrics: accuracy (correct predictions / total), precision (TP/(TP+FP) — of predicted positives, how many were correct?), recall/sensitivity (TP/(TP+FN) — of actual positives, how many did we catch?), F1 score (harmonic mean of precision and recall — balanced metric), AUC-ROC (area under the receiver operating characteristic curve — 1.0 = perfect classifier, 0.5 = random).
- Regression metrics: RMSE (Root Mean Square Error — penalises large errors more), MAE (Mean Absolute Error — robust to outliers).
- Confusion matrix: TP, TN, FP (Type I error), FN (Type II error).
- SageMaker endpoints: Real-time inference (persistent endpoint — synchronous predictions, auto-scales with SageMaker Auto Scaling), Serverless inference (pay-per-request, no idle cost — for intermittent traffic where cold starts are acceptable), Batch Transform (offline predictions on an S3 dataset — no endpoint needed).
- Model Registry: versioned model catalogue with an approval workflow (PendingManualApproval > Approved > Rejected); approval-status changes emit events that can trigger automated deployment to production (e.g. via EventBridge).
- SageMaker Model Monitor: detects data drift (distribution shift between training data and inference data), concept drift (model output distribution changes), and bias drift (fairness metrics change over time) — sends CloudWatch alarms.
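The classification metrics above all derive from the four confusion-matrix counts, which makes them easy to sanity-check by hand. A small worked sketch (counts are made up):

```python
# Compute the standard classification metrics from confusion-matrix counts.
def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted positives, how many were correct
    recall = tp / (tp + fn)      # of actual positives, how many did we catch
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=80, tn=90, fp=20, fn=10)
# precision = 80/100 = 0.80, recall = 80/90 ≈ 0.889, accuracy = 170/200 = 0.85
```

Note the trade-off: lowering the classification threshold typically raises recall (fewer FN) at the cost of precision (more FP), which is exactly what the ROC curve traces out.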
AWS AI Services and MLOps
AWS pre-built AI services for common use cases:
- Rekognition: image and video analysis (object detection, face recognition, content moderation, celebrity recognition, PPE detection).
- Comprehend: NLP service (sentiment analysis, entity recognition, key phrase extraction, language detection, topic modelling).
- Transcribe: automatic speech recognition (ASR) — audio to text.
- Translate: real-time neural machine translation.
- Textract: document analysis — extracts text, tables, and forms from PDFs and images (more than OCR — understands document structure).
- Lex: build conversational interfaces (chatbots) — NLU (Natural Language Understanding) + ASR — the same technology that powers Amazon Alexa.
- Forecast: time-series forecasting using ML — predicts future values from historical data (inventory demand, website traffic, energy consumption).
- Personalize: real-time personalisation and recommendation — similar to Amazon.com's recommendation engine, no ML expertise required.
MLOps with SageMaker Pipelines: define ML workflows as directed acyclic graphs (DAGs) — data processing > training > evaluation > conditional deployment — version-controlled, reproducible, triggered on a schedule or by an event.
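The DAG idea behind SageMaker Pipelines can be sketched with the standard library: each step declares its dependencies, and a topological sort yields a valid execution order. The step names below are illustrative, not real Pipelines API objects.

```python
# Toy sketch of a pipeline DAG: each step maps to the set of steps it
# depends on; a topological sort gives a valid execution order.
from graphlib import TopologicalSorter  # Python 3.9+ standard library

pipeline = {
    "train":    {"process"},   # training depends on processed data
    "evaluate": {"train"},
    "deploy":   {"evaluate"},  # conditional deployment step runs last
}

order = list(TopologicalSorter(pipeline).static_order())
# order → ["process", "train", "evaluate", "deploy"]
```

In real SageMaker Pipelines the dependencies are inferred from step inputs/outputs (or set explicitly), and the conditional deployment step is gated on the evaluation metric.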