ML Problem Framing and Data Strategy
Before building a model, frame the problem correctly: What is the prediction target? What labels do you have? What are the business metrics, and how do they relate to ML metrics (accuracy, AUC, RMSE)? Is this classification, regression, ranking, or generation? Could rules or simple statistics solve the problem without ML at all? Data strategy: structured data (Cloud SQL, BigQuery, Spanner), unstructured data (Cloud Storage for images/audio/video/text). Feature engineering in BigQuery ML or Vertex AI Feature Store (serves features consistently between training and serving, avoiding training-serving skew). Data validation with TensorFlow Data Validation (TFDV), surfaced in TFX (TensorFlow Extended) pipelines as the ExampleValidator component, detects schema drift and anomalies.
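The schema-validation idea can be sketched without any GCP dependencies. This is a toy illustration of what TFDV-style tools do; `infer_schema` and `validate` are local helpers invented here, not the TFDV API:

```python
# Minimal illustration of schema inference and validation, in the spirit of
# TensorFlow Data Validation. All names here are local helpers, not TFDV APIs.

def infer_schema(rows):
    """Infer a {column: type} schema from training rows (list of dicts)."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            schema.setdefault(col, type(val))
    return schema

def validate(rows, schema):
    """Return a list of anomalies: missing columns or type mismatches."""
    anomalies = []
    for i, row in enumerate(rows):
        for col, expected in schema.items():
            if col not in row:
                anomalies.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], expected):
                anomalies.append(f"row {i}: '{col}' expected "
                                 f"{expected.__name__}, got {type(row[col]).__name__}")
    return anomalies

train = [{"age": 34, "country": "DE"}, {"age": 51, "country": "FR"}]
serving = [{"age": "29", "country": "US"}, {"country": "GB"}]

schema = infer_schema(train)
print(validate(serving, schema))
# Flags the string-typed "age" (schema drift) and the missing "age" column.
```

Real tools add much more (statistics comparison, environment-specific schemas), but the core contract is the same: infer a schema from training data, then validate serving data against it before it reaches the model.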
Vertex AI: Training and Model Registry
Vertex AI is GCP's unified ML platform. Custom Training: run training code in a managed container on CPUs, GPUs, or TPUs. Training pipelines in Vertex AI Pipelines (Kubeflow Pipelines SDK or TFX) orchestrate multi-step workflows with automatic caching and artifact tracking. Hyperparameter tuning: Vertex AI Vizier (Bayesian optimisation) explores the hyperparameter space more efficiently than grid or random search. Vertex AI Experiments tracks runs, parameters, and metrics for comparison. Model Registry: versioned model artefacts with aliases (production, staging, challenger) — separates model management from deployment. Distributed training: data parallelism (same model on multiple workers, each sees a different batch), model parallelism (split model layers across devices for models too large for one device). MirroredStrategy (single node, multiple GPUs) versus MultiWorkerMirroredStrategy (multiple nodes).
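Data parallelism rests on a simple identity: with equal-sized shards, averaging the per-worker gradients (the all-reduce step that MirroredStrategy performs for you) reproduces the full-batch gradient exactly. A NumPy sketch of that identity (illustrative only, not the tf.distribute API):

```python
import numpy as np

# Data parallelism in miniature: each "worker" computes the gradient of a
# mean-squared-error loss on its own shard; averaging the per-worker
# gradients (what an all-reduce does) recovers the full-batch gradient
# exactly when the shards are equal-sized.

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

def grad(Xs, ys, w):
    # d/dw mean((Xs @ w - ys)^2) = (2/n) * Xs.T @ (Xs @ w - ys)
    return 2.0 / len(ys) * Xs.T @ (Xs @ w - ys)

full = grad(X, y, w)

# Two workers, each holding half of the batch.
shard_grads = [grad(X[:4], y[:4], w), grad(X[4:], y[4:], w)]
averaged = np.mean(shard_grads, axis=0)   # the all-reduce step

print(np.allclose(full, averaged))  # True
```

Model parallelism has no such shortcut: layers live on different devices and activations must flow between them, which is why it is reserved for models that genuinely do not fit on one device.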
Model Deployment and Serving
Vertex AI Endpoints: deploy model versions with traffic splits for A/B testing and canary rollouts. Dedicated online endpoints stay always on (autoscaling adjusts replica count with load but keeps at least one replica provisioned); batch prediction runs without a standing endpoint. Online prediction: low-latency, single-record requests. Batch prediction: high-throughput, asynchronous, for scoring large datasets. Model optimisation for serving: quantisation (FP32 to INT8 reduces model size and improves latency, with some accuracy trade-off), distillation (train a smaller student model to mimic a larger teacher), TensorRT or ONNX Runtime for GPU inference optimisation. Feature latency: pre-compute slow features offline, serve fast features online from Memorystore. Explainability: Vertex Explainable AI provides feature attributions using SHAP (Shapley values) or Integrated Gradients. Required for regulated industries and for debugging unexpected model behaviour.
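Affine INT8 quantisation can be sketched in a few lines of NumPy. This is a toy per-tensor version; production toolchains such as TensorRT and TFLite typically quantise per channel and calibrate activations as well:

```python
import numpy as np

# Post-training quantisation in miniature: map FP32 weights to INT8 with a
# per-tensor scale and zero point (asymmetric affine quantisation), then
# dequantise and measure the error.

rng = np.random.default_rng(42)
w = rng.normal(scale=0.2, size=1000).astype(np.float32)

qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = int(round(qmin - w.min() / scale))

q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
w_hat = (q.astype(np.float32) - zero_point) * scale   # dequantised weights

print(q.nbytes, w.nbytes)              # 1000 vs 4000 bytes: 4x smaller
print(float(np.abs(w - w_hat).max()))  # error on the order of `scale`
```

The 4x size reduction is exact (one byte per weight instead of four); the latency win comes from INT8 arithmetic on hardware that supports it, and the accuracy trade-off is the per-weight rounding error bounded by the step size.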
MLOps: Pipelines, Monitoring, and Governance
MLOps maturity levels: Level 0 (manual, notebook-driven), Level 1 (automated training pipeline, triggered by schedule or data drift), Level 2 (full CI/CD for ML — code changes trigger pipeline, evaluation gates before promotion). Model monitoring: Vertex AI Model Monitoring detects training-serving skew (difference between training data distribution and live prediction input distribution) and prediction drift (change in model output distribution over time). Alerts trigger retraining pipelines. Data freshness: stale feature data degrades model performance before accuracy metrics detect it. Governance: Dataplex for data cataloguing and lineage, BigQuery Authorized Views for column-level access control on training data, Vertex AI Model Cards for model documentation. Privacy-preserving ML: differential privacy (add calibrated noise during training), federated learning (train on device, aggregate model updates centrally — no raw data leaves the device).
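Prediction drift detection can be illustrated with the Population Stability Index, a common drift statistic; the 0.2 threshold below is an industry rule of thumb, not a Vertex AI Model Monitoring default:

```python
import numpy as np

# Prediction drift in miniature: compare the model's output distribution at
# deployment time against the live output distribution using the Population
# Stability Index (PSI). PSI > 0.2 is a widely used rule of thumb for
# "significant drift, investigate or retrain".

def psi(expected, actual, bins=10):
    edges = np.histogram_bin_edges(expected, bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(7)
baseline = rng.beta(2, 5, size=10_000)   # scores at deployment time
live = rng.beta(2, 5, size=10_000)       # same distribution: no drift
shifted = rng.beta(5, 2, size=10_000)    # population changed: drift

print(psi(baseline, live) < 0.2)     # True: no alert
print(psi(baseline, shifted) > 0.2)  # True: trigger retraining pipeline
```

The same statistic applied to input features instead of outputs detects training-serving skew; wiring the alert to launch a Vertex AI Pipelines run is what moves a team from maturity Level 0 to Level 1.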