Pull-quote: “Production ML is not training a model. It’s the disciplines around training, registering, serving, monitoring, retraining, and retiring.”
Why this matters
Most teams shipping their first ML model on Databricks underestimate the discipline required. Training is the small part. The system around training is the large part.
The reference stack
Data ──► Feature Store ◄──── online + offline serving
│
▼
Training pipeline (Databricks Job)
│
▼
MLflow Model Registry ◄──── versions, stages, approvals
│
▼
Mosaic AI Model Serving ◄──── A/B + canary
│
▼
Monitoring (drift, calibration, performance)
│
▼
Retraining trigger (event, schedule, drift threshold)
Feature Store — point-in-time correctness
The Feature Store enforces point-in-time correctness: each training row is joined to the feature values as they stood at the moment its label was generated, never to values recorded later. This prevents leakage of future information, which would otherwise inflate offline metrics and make offline evaluation unreliable. Online serving uses the same feature definitions, so training and serving stay consistent.
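To make the point-in-time idea concrete, here is a minimal sketch in plain pandas. It is not the Feature Store API. The table layout, the column names (`customer_id`, `label_ts`, `feature_ts`, `orders_30d`, `churned`), and the helper `point_in_time_join` are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def point_in_time_join(labels, features, key="customer_id",
                       label_ts="label_ts", feature_ts="feature_ts"):
    """Attach to each label row the latest feature row at or before the label time."""
    # merge_asof requires both frames to be sorted by their time column.
    labels_sorted = labels.sort_values(label_ts)
    features_sorted = features.sort_values(feature_ts)
    return pd.merge_asof(
        labels_sorted,
        features_sorted,
        left_on=label_ts,
        right_on=feature_ts,
        by=key,
        direction="backward",  # only look at the past, never the future
    )


# Hypothetical data: one customer, two labels, three feature snapshots.
labels = pd.DataFrame({
    "customer_id": [1, 1],
    "label_ts": pd.to_datetime(["2024-01-10", "2024-02-10"]),
    "churned": [0, 1],
})
features = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "feature_ts": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "orders_30d": [5, 2, 0],
})

training_set = point_in_time_join(labels, features)
```

Each label picks up the snapshot that was current when the label was generated. The March snapshot is never joined, because it postdates both labels. A naive join on `customer_id` alone would have leaked it into training.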
MLflow Model Registry — lifecycle stages
Models progress through stages with explicit gates:
| Stage | Gate |
|---|---|
| Staging | Passes regression suite + calibration checks |
| Production | Passes A/B + canary criteria |
| Archived | Replaced by a newer Production model |
Every stage transition is logged with the user, the reason, and the metrics that justified it.
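One way to picture such an audit record is the plain-Python sketch below. It does not use the MLflow client. The `StageTransition` class, the `record_transition` helper, the allowed-transition set, and the example model name are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Assumed transition graph, following the stages in the table above.
# "None" stands for a freshly registered, unassigned version.
ALLOWED_TRANSITIONS = {
    ("None", "Staging"),
    ("Staging", "Production"),
    ("Production", "Archived"),
}


@dataclass(frozen=True)
class StageTransition:
    model_name: str
    version: int
    from_stage: str
    to_stage: str
    user: str      # who approved the transition
    reason: str    # why it was approved
    metrics: dict  # the metrics that justified it
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_transition(log, transition):
    """Validate a transition and append it to the log as one JSON line."""
    pair = (transition.from_stage, transition.to_stage)
    if pair not in ALLOWED_TRANSITIONS:
        raise ValueError(f"transition {pair[0]} -> {pair[1]} is not allowed")
    log.append(json.dumps(asdict(transition), sort_keys=True))
    return log


audit_log = []
record_transition(audit_log, StageTransition(
    model_name="churn_model",  # hypothetical model name
    version=3,
    from_stage="Staging",
    to_stage="Production",
    user="alice",
    reason="passed A/B and canary criteria",
    metrics={"ece": 0.012},
))
```

Storing the justifying metrics next to the user and reason means a later reviewer can reconstruct why a version was promoted without rerunning the evaluation.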
Calibration-first evaluation
We require every model to ship with Expected Calibration Error (ECE) and conformal prediction intervals (LACP). Headline accuracy is reported but is not the gate.
| Gate | Default threshold |
|---|---|
| ECE | < 0.02 on holdout |
| Reliability diagram | No bin > 0.05 deviation |
| Conformal coverage | Within 2pp of stated coverage |
| Performance regression | No metric worse than the prior production model |
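The calibration gates can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions: equal-width confidence bins, ten bins by default, and a `covered` array that says, per holdout example, whether the true value fell inside the conformal interval. How the intervals themselves are built is out of scope here. The function names are hypothetical, and the thresholds mirror the defaults in the table.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Return (ECE, largest per-bin gap) using equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins: (0.0, 0.1], (0.1, 0.2], ...; a confidence of 0.0 lands in bin 0.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece, max_gap = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        # Gap between observed accuracy and mean predicted confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the share of samples in the bin
        max_gap = max(max_gap, gap)
    return float(ece), float(max_gap)


def passes_calibration_gates(confidences, correct, covered, stated_coverage,
                             ece_max=0.02, bin_gap_max=0.05, coverage_tol=0.02):
    """Evaluate the ECE, reliability-diagram, and conformal-coverage gates."""
    ece, max_gap = expected_calibration_error(confidences, correct)
    empirical_coverage = float(np.mean(covered))
    return {
        "ece": ece < ece_max,
        "reliability": max_gap <= bin_gap_max,
        "coverage": abs(empirical_coverage - stated_coverage) <= coverage_tol,
    }
```

A model that predicts 0.8 confidence but is always right fails the ECE gate with a gap of 0.2 even though its accuracy is perfect. This is why headline accuracy alone is not used as the gate.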
Mosaic AI Model Serving — A/B and canary
Traffic splits and canary rollouts are first-class. A new version first receives 5% of traffic, is observed against SLAs and metrics, and then ramps up. Rollback is one click.
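The mechanics behind a canary split can be illustrated in a few lines of plain Python. This is a conceptual sketch, not the Model Serving configuration API, which handles traffic splitting at the endpoint. The function names are hypothetical, and the ramp steps after 5% are assumptions, since only the initial 5% is specified here.

```python
import hashlib


def route_request(request_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically route a request to 'canary' or 'production'."""
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be in [0, 1]")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets (basis points of traffic).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "canary" if bucket < round(canary_fraction * 10_000) else "production"


def next_canary_fraction(current, healthy, ramp=(0.05, 0.25, 0.5, 1.0)):
    """Move to the next ramp step when healthy; drop to 0.0 (rollback) otherwise."""
    if not healthy:
        return 0.0
    for step in ramp:
        if step > current:
            return step
    return ramp[-1]  # already fully ramped
```

Hashing the request ID keeps routing stable, so the same ID always lands on the same version at a given fraction. That makes canary metrics easier to compare against production.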
Monitoring — drift, calibration, performance
Three things to monitor:
- Feature drift — input distribution shift
- Calibration drift — ECE moving
- Performance drift — labeled outcomes degrading
Monitoring runs as a Databricks Job. Alerts go to Slack / Teams / PagerDuty.
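As one possible feature-drift check, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample of a single numeric feature. PSI is swapped in here as an example metric; it is not named in the stack above. The function name, the choice of ten quantile bins, and the smoothing constant are assumptions.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two samples of one numeric feature (0.0 means identical bin shares)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from reference quantiles, so each reference bin
    # holds roughly equal mass; the outer bins are open-ended.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty in one of the samples.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 5000)  # synthetic reference data
psi_same = population_stability_index(training_sample, training_sample)
psi_shifted = population_stability_index(training_sample, training_sample + 1.0)
```

A monitoring job could compute this per feature on a schedule and raise an alert when the value crosses a team-chosen threshold. The same pattern applies to calibration drift by tracking ECE over time on newly labeled data.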
Closing
Production ML on Databricks is straightforward when the stack is right: Feature Store for consistency, MLflow Registry for lifecycle, Mosaic AI Model Serving for delivery, calibration-first evaluation, and disciplined monitoring. The training is the easy part.


