Vertex AI Model Monitoring PoC
An end-to-end proof-of-concept running ML model monitoring across three deployment paradigms — custom sklearn, AutoML Tabular, and BigQuery ML — on Google Vertex AI with Terraform-provisioned infrastructure.
Teams evaluating Vertex AI Model Monitoring V2 had no single reference showing how drift detection actually differs across custom-trained, AutoML, and BigQuery ML models. Each path has different serving, logging, and drift-validation mechanics, and stitching them together from documentation alone was slow and error-prone.
Built an end-to-end PoC that takes all three model types through the full lifecycle: synthetic data generation, training (sklearn / AutoML / BQML), endpoint and batch deployment, and Model Monitoring V2 jobs, including BigQuery-native drift via ML.VALIDATE_DATA_DRIFT and attribution drift via ML.EXPLAIN_PREDICT. Provisioned the full GCP footprint in Terraform (GCS, BigQuery, a Pub/Sub topic acting as a retraining gate, separate provisioner and runtime service accounts, IAM, logging), with layered config that resolves from env vars → Terraform outputs → defaults.
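For readers unfamiliar with the BigQuery-native drift path, the sketch below shows one way such a check could be assembled. It is an illustration, not the PoC's actual code: the function name `build_drift_query`, the table names, and the 0.3 thresholds are assumptions, and the STRUCT option names should be checked against the current ML.VALIDATE_DATA_DRIFT documentation before use.

```python
import re

# Accepts fully qualified names of the form project.dataset.table.
_TABLE_NAME = re.compile(r"^[A-Za-z0-9_\-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_drift_query(
    base_table: str,
    study_table: str,
    categorical_threshold: float = 0.3,  # assumed value, tune per feature set
    numerical_threshold: float = 0.3,    # assumed value, tune per feature set
) -> str:
    """Build a query comparing a study (serving) table against a base (training) table."""
    for name in (base_table, study_table):
        # Reject anything that is not a plain table reference, since the
        # name is interpolated directly into the SQL text.
        if not _TABLE_NAME.match(name):
            raise ValueError(f"not a fully qualified table name: {name!r}")
    return (
        "SELECT *\n"
        "FROM ML.VALIDATE_DATA_DRIFT(\n"
        f"  TABLE `{base_table}`,\n"
        f"  TABLE `{study_table}`,\n"
        "  STRUCT(\n"
        f"    {categorical_threshold} AS categorical_default_threshold,\n"
        f"    {numerical_threshold} AS numerical_default_threshold\n"
        "  )\n"
        ")"
    )


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    print(build_drift_query("my-project.monitoring.training", "my-project.monitoring.serving"))
```

The resulting string could be passed to a BigQuery client or scheduled query. Keeping the query builder separate from execution makes the drift thresholds easy to review and test without touching GCP.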
A reference PoC with full technical documentation.
The value of a PoC like this lies in covering breadth cleanly: three genuinely different MLOps paradigms (a custom sklearn model, an AutoML Tabular model, and a BQML logistic regression) unified behind one configuration system and one Terraform footprint. Without a reference that spans all three, teams evaluating the platform end up testing one path and assuming the others behave similarly. They don’t.
The Pub/Sub topic standing in as a retraining gate reflects an intentional architectural choice: rather than triggering retraining automatically on any drift signal, the gate gives a human or downstream system the chance to validate the signal before committing to a retraining run. That’s the pattern production MLOps actually needs, and the PoC demonstrates it rather than simplifying it away.
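The gate pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PoC's implementation: `DriftSignal`, `build_gate_message`, `gate`, and the `PENDING_REVIEW` status are hypothetical names, and the publisher is injected as a plain callable so the logic can be exercised without GCP credentials.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DriftSignal:
    """One drift measurement for one feature (hypothetical schema)."""
    model_id: str
    feature: str
    metric: str
    value: float
    threshold: float


def build_gate_message(signal: DriftSignal) -> Optional[bytes]:
    """Return a review-request payload, or None if the signal is within threshold."""
    if signal.value <= signal.threshold:
        return None
    payload = asdict(signal)
    # The gate only requests review; it never starts retraining itself.
    payload["status"] = "PENDING_REVIEW"
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def gate(signal: DriftSignal, publish: Callable[[bytes], object]) -> bool:
    """Publish a review request for drifted signals; return True if one was sent."""
    message = build_gate_message(signal)
    if message is None:
        return False
    publish(message)
    return True


if __name__ == "__main__":
    outbox = []  # stand-in for a Pub/Sub topic
    signal = DriftSignal("churn-model", "tenure", "jensen_shannon_divergence", 0.42, 0.3)
    print(gate(signal, outbox.append), outbox)
```

In a real deployment, `publish` could wrap the `publish(topic_path, data)` method of `google.cloud.pubsub_v1.PublisherClient`, with a human reviewer or downstream service subscribed to the topic deciding whether a retraining run actually starts.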
The config-resolution order (env vars → Terraform outputs → defaults) means the PoC can be pointed at a different GCP project by changing a single variable — a small detail that makes the difference between a demo that only works in its original environment and a reference that teams can actually adopt.
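The resolution order is simple enough to show directly. The sketch below is an assumption about how such layering might look, not the PoC's actual code: the `POC_` environment prefix, the key names, and the default values are all hypothetical. The Terraform layer is assumed to come from the JSON emitted by `terraform output -json`, where each output is an object with a `value` field.

```python
import json
import os
from typing import Mapping

# Hypothetical fallback values used when no other layer supplies a key.
DEFAULTS = {"project_id": "my-demo-project", "region": "us-central1"}


def load_terraform_outputs(raw_json: str) -> dict:
    """Flatten `terraform output -json` text into a {name: value} dict."""
    parsed = json.loads(raw_json)
    return {name: entry["value"] for name, entry in parsed.items()}


def resolve(
    key: str,
    env: Mapping[str, str],
    tf_outputs: Mapping[str, object],
    defaults: Mapping[str, str] = DEFAULTS,
    env_prefix: str = "POC_",  # assumed prefix, for illustration
) -> str:
    """Resolve one setting: env var first, then Terraform output, then default."""
    env_value = env.get(f"{env_prefix}{key.upper()}")
    if env_value:
        return env_value
    tf_value = tf_outputs.get(key)
    if tf_value is not None:
        return str(tf_value)
    return defaults[key]


if __name__ == "__main__":
    tf_json = '{"project_id": {"sensitive": false, "type": "string", "value": "tf-project"}}'
    outputs = load_terraform_outputs(tf_json)
    print(resolve("project_id", os.environ, outputs))
    print(resolve("region", os.environ, outputs))
```

Passing `env` in as a mapping rather than reading `os.environ` inside the function keeps the precedence rules testable, and setting a single environment variable such as the hypothetical `POC_PROJECT_ID` is enough to repoint everything that reads `project_id`.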