Pull-quote: “Production ML is not training a model. It’s the disciplines around training, registering, serving, monitoring, retraining, and retiring.”

Why this matters

Most teams shipping their first ML model on Databricks underestimate the discipline required. Training is the small part. The system around training is the large part.

The reference stack

   Data ──►  Feature Store  ◄────  online + offline serving
                  │
                  ▼
   Training pipeline (Databricks Job)
                  │
                  ▼
   MLflow Model Registry  ◄────  versions, stages, approvals
                  │
                  ▼
   Mosaic AI Model Serving  ◄────  A/B + canary
                  │
                  ▼
   Monitoring (drift, calibration, performance)
                  │
                  ▼
   Retraining trigger (event, schedule, drift threshold)

Feature Store — point-in-time correctness

The Feature Store enforces point-in-time correctness: each training row is joined to the feature values as they stood at the timestamp its label was generated, never to values recorded later. This prevents the leakage of future information that makes offline evaluation look better than production performance. Online serving uses the same feature definitions, which keeps training and serving consistent.
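The point-in-time join can be illustrated outside the Feature Store with a small pandas sketch. It does not use the Feature Store API; the table contents and the column names (`customer_id`, `label_ts`, `feature_ts`, `orders_30d`, `churned`) are hypothetical, and `pandas.merge_asof` stands in for the lookup the Feature Store performs.

```python
import pandas as pd

# Hypothetical label events: one row per (customer, label timestamp).
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "label_ts": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-20"]),
    "churned": [0, 1, 0],
})

# Hypothetical feature snapshots, each valid from feature_ts onward.
features = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "feature_ts": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-15", "2024-02-15"]
    ),
    "orders_30d": [3, 5, 9, 1, 4],
})


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Attach to each label the latest feature row at or before label_ts."""
    # merge_asof requires both frames to be sorted by their time keys.
    left = labels.sort_values("label_ts")
    right = features.sort_values("feature_ts")
    return pd.merge_asof(
        left,
        right,
        left_on="label_ts",
        right_on="feature_ts",
        by="customer_id",       # match only within the same entity
        direction="backward",   # never look at snapshots after the label
    )


training_set = point_in_time_join(labels, features)
```

A plain join on `customer_id` would pair the January label for customer 1 with the March snapshot as well, which is exactly the leakage the point-in-time lookup avoids.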

MLflow Model Registry — lifecycle stages

Models progress through stages with explicit gates:

Stage       Gate
Staging     Passes regression suite + calibration checks
Production  Passes A/B + canary criteria
Archived    Replaced by a newer Production model

Every stage transition is logged with the user, the reason, and the metrics that justified it.
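The gate-then-log pattern can be sketched in plain Python. This is an illustration of the bookkeeping only, not the MLflow Model Registry API; the check names, the `Transition` record, and `request_transition` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Checks required before entering each stage, mirroring the gates above.
# The check identifiers themselves are hypothetical labels.
REQUIRED_CHECKS = {
    "Staging": {"regression_suite", "calibration_checks"},
    "Production": {"ab_test", "canary"},
    "Archived": set(),
}


@dataclass(frozen=True)
class Transition:
    model: str
    version: int
    to_stage: str
    user: str
    reason: str
    metrics: dict
    at: datetime


def request_transition(audit_log, model, version, to_stage, user, reason,
                       passed_checks, metrics):
    """Refuse the transition unless every gate passed, then record it."""
    if to_stage not in REQUIRED_CHECKS:
        raise ValueError(f"unknown stage: {to_stage}")
    missing = REQUIRED_CHECKS[to_stage] - set(passed_checks)
    if missing:
        raise ValueError(f"cannot move to {to_stage}; missing checks: {sorted(missing)}")
    record = Transition(model, version, to_stage, user, reason,
                        dict(metrics), datetime.now(timezone.utc))
    audit_log.append(record)
    return record


audit_log = []
request_transition(
    audit_log, "churn_model", 7, "Staging", "alice",
    "passed nightly regression run",
    passed_checks={"regression_suite", "calibration_checks"},
    metrics={"ece": 0.012},
)
```

Keeping the user, reason, and metrics on the same record as the transition means the audit trail can answer "why is this version in Production?" without consulting a second system.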

Calibration-first evaluation

We require every model to ship with Expected Calibration Error (ECE) and conformal prediction intervals (LACP). Headline accuracy is reported but is not the gate.

Gate                    Default threshold
ECE                     < 0.02 on holdout
Reliability diagram     No bin > 0.05 deviation
Conformal coverage      Within 2pp of stated coverage
Performance regression  No metric below the prior production model
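As a rough sketch of how these gates could be computed, the NumPy code below uses the binary-classification form of ECE with equal-width bins (observed positive rate versus mean predicted probability per bin). The bin count, the function names, and the assumption that every performance metric is higher-is-better are illustrative choices, not part of the stack described here; the thresholds are the defaults from the table.

```python
import numpy as np


def calibration_report(y_true, y_prob, n_bins=10):
    """Return (ECE, largest per-bin |observed rate - mean predicted probability|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against interior edges gives bin ids 0..n_bins-1,
    # with a probability of exactly 1.0 landing in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece, max_dev = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight by the share of samples in the bin
        max_dev = max(max_dev, gap)
    return ece, max_dev


def coverage_gap(y, lower, upper, stated_coverage):
    """Absolute gap between empirical interval coverage and the stated level."""
    y, lower, upper = (np.asarray(a, dtype=float) for a in (y, lower, upper))
    covered = np.mean((y >= lower) & (y <= upper))
    return abs(covered - stated_coverage)


def no_regression(candidate, production):
    """Assumes every metric is higher-is-better (an illustrative simplification)."""
    return all(candidate[name] >= value for name, value in production.items())


def passes_gates(ece, max_dev, cov_gap, regression_ok=True):
    # Default thresholds from the table: ECE < 0.02, no bin above 0.05
    # deviation, coverage within 2 percentage points of the stated level.
    return ece < 0.02 and max_dev <= 0.05 and cov_gap <= 0.02 and regression_ok
```

Note that a model can clear the ECE gate and still fail the reliability-diagram gate: ECE is a weighted average, so one badly calibrated but sparsely populated bin barely moves it.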

Mosaic AI Model Serving — A/B and canary

Traffic splits and canary rollouts are first-class. A new version starts at 5% of traffic, is observed against SLAs and metrics, and then ramps up. Rollback is one click.
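The serving layer performs the split itself, so the sketch below is only a way to reason about the logic, not the Mosaic AI Model Serving API. It assumes deterministic hash-based routing on a hypothetical `request_id` and a hypothetical ramp schedule; only the 5% starting point comes from the text above.

```python
import hashlib

# 5% is the starting share described above; the later steps are hypothetical.
RAMP_STEPS = [5, 25, 50, 100]


def route(request_id: str, canary_percent: float) -> str:
    """Send a stable pseudo-random share of requests to the canary version."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


def next_traffic_percent(current: int, healthy: bool) -> int:
    """Advance one ramp step while healthy; drop to 0 (rollback) otherwise."""
    if not healthy:
        return 0
    later = [step for step in RAMP_STEPS if step > current]
    return later[0] if later else current
```

Hashing the request (or user) identifier rather than drawing a random number keeps each caller on the same version for the whole observation window, which makes the canary metrics easier to interpret.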

Monitoring — drift, calibration, performance

Three things to monitor:

  • Feature drift — input distribution shift
  • Calibration drift — ECE moving
  • Performance drift — labeled outcomes degrading

Monitoring runs as a Databricks Job. Alerts go to Slack / Teams / PagerDuty.
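For the feature-drift check, one common statistic is the Population Stability Index (PSI); it is used here as an illustrative choice, not necessarily the metric this stack uses. The sketch assumes a single numeric feature, bins taken from the reference sample's quantiles, and hypothetical names throughout.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one numeric feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Quantile edges from the reference give roughly equal-mass bins.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    interior = edges[1:-1]  # values outside the reference range fall in the end bins
    ref_counts = np.bincount(np.digitize(reference, interior), minlength=n_bins)
    cur_counts = np.bincount(np.digitize(current, interior), minlength=n_bins)
    eps = 1e-6  # floor to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


def drift_alert(psi: float, threshold: float = 0.2) -> bool:
    # 0.2 is a widely quoted rule of thumb, used here as an assumed default.
    return psi > threshold
```

The same scheduled job could recompute ECE on newly labeled data for the calibration-drift check and compare it with the value recorded at release.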

Closing

Production ML on Databricks is straightforward when the stack is right: Feature Store for consistency, MLflow Registry for lifecycle, Mosaic AI Model Serving for delivery, calibration-first evaluation, and disciplined monitoring. The training is the easy part.