Production ML on Databricks: MLflow, Feature Store, Calibration

Zorost Intelligence — Tue, 03 Mar 2026 09:00:00 +0000

Pull-quote: “Production ML is not training a model. It’s the disciplines around training, registering, serving, monitoring, retraining, and retiring.”

Why this matters

Most teams shipping their first ML model on Databricks underestimate the discipline required. Training is the small part. The system around training is the large part.

The reference stack

   Data ──►  Feature Store  ◄────  online + offline serving
                  │
                  ▼
   Training pipeline (Databricks Job)
                  │
                  ▼
   MLflow Model Registry  ◄────  versions, stages, approvals
                  │
                  ▼
   Mosaic AI Model Serving  ◄────  A/B + canary
                  │
                  ▼
   Monitoring (drift, calibration, performance)
                  │
                  ▼
   Retraining trigger (event, schedule, drift threshold)

Feature Store — point-in-time correctness

The Feature Store enforces point-in-time correctness: each training row is joined to the feature values that were in effect when its label was generated, never to later ones. This prevents future information from leaking into training data, which would otherwise inflate offline metrics and make offline evaluation unreliable. Online serving uses the same feature definitions, which keeps training and serving consistent.
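To make the point-in-time idea concrete, here is a minimal sketch in plain Python. It is not the Feature Store API; the `feature_history` layout, the entity ID, and the function name `feature_as_of` are hypothetical stand-ins chosen for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history per entity: (effective_time, value), sorted by time.
feature_history = {
    "customer_1": [
        (datetime(2026, 1, 1), 10.0),
        (datetime(2026, 1, 15), 12.5),
        (datetime(2026, 2, 1), 20.0),
    ],
}


def feature_as_of(entity_id, label_time):
    """Return the latest feature value effective at or before label_time, else None."""
    history = feature_history.get(entity_id, [])
    times = [t for t, _ in history]
    idx = bisect_right(times, label_time)  # count of entries with time <= label_time
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical labels: (entity_id, label_time, label).
labels = [
    ("customer_1", datetime(2026, 1, 20), 1),
    ("customer_1", datetime(2025, 12, 31), 0),
]

# Each training row only sees feature values known at its own label time.
training_rows = [
    (entity, ts, feature_as_of(entity, ts), label) for entity, ts, label in labels
]
```

The first label row picks up the value from 15 January rather than the later one from 1 February, and the second row gets no value at all because no feature existed yet. A naive join on entity ID alone would have attached the latest value to both rows.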

MLflow Model Registry — lifecycle stages

Models progress through stages with explicit gates:

Stage	Gate
Staging	Passes regression suite + calibration checks
Production	Passes A/B + canary criteria
Archived	Replaced by a newer Production model

Every stage transition is logged with the user, the reason, and the metrics that justified it.
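A gated, audited transition could be sketched roughly as follows. This is plain Python rather than the MLflow client API; the `ModelVersion` class, the `transition` function, the `ALLOWED` set, and the initial stage name `"None"` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Assumed legal transitions, following the stage table above.
ALLOWED = {("None", "Staging"), ("Staging", "Production"), ("Production", "Archived")}


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "None"
    audit_log: list = field(default_factory=list)


def transition(mv, to_stage, user, reason, metrics, gate_passed):
    """Move a model version to a new stage, recording who, why, and on what evidence."""
    if (mv.stage, to_stage) not in ALLOWED:
        raise ValueError(f"illegal transition: {mv.stage} -> {to_stage}")
    if not gate_passed:
        raise PermissionError(f"gate for {to_stage} not passed")
    mv.audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "from": mv.stage,
        "to": to_stage,
        "user": user,
        "reason": reason,
        "metrics": dict(metrics),  # copy so later mutation cannot rewrite history
    })
    mv.stage = to_stage
```

The point of the sketch is that the log entry and the stage change happen in the same call, so a transition cannot exist without its justification.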

Calibration-first evaluation

We require every model to ship with Expected Calibration Error (ECE) and conformal prediction intervals (LACP). Headline accuracy is reported but is not the gate.

Gate	Default threshold
ECE	< 0.02 on holdout
Reliability diagram	No bin > 0.05 deviation
Conformal coverage	Within 2pp of stated coverage
Performance regression	No metric below the prior production model
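
The calibration gates in the table can be sketched as below. This assumes a binary classifier, equal-width probability bins, and the common binned form of ECE (the count-weighted average gap between mean predicted probability and observed positive rate). The function names and the 90% default for stated coverage are illustrative assumptions; the thresholds are the defaults from the table.

```python
def calibration_bins(probs, labels, n_bins=10):
    """Group predictions into equal-width bins.

    Returns (count, mean_predicted_prob, observed_positive_rate) per non-empty bin.
    """
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[b][0] += 1
        sums[b][1] += p
        sums[b][2] += y
    return [(n, sp / n, sy / n) for n, sp, sy in sums if n > 0]


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted mean of |mean predicted prob - observed rate| over bins."""
    bins = calibration_bins(probs, labels, n_bins)
    total = sum(n for n, _, _ in bins)
    return sum(n / total * abs(conf - rate) for n, conf, rate in bins)


def passes_calibration_gates(probs, labels, ece_max=0.02, bin_max=0.05):
    """ECE gate plus the per-bin reliability-diagram gate."""
    bins = calibration_bins(probs, labels)
    worst_bin = max(abs(conf - rate) for _, conf, rate in bins)
    return expected_calibration_error(probs, labels) < ece_max and worst_bin <= bin_max


def coverage_within_tolerance(intervals, actuals, stated=0.90, tol=0.02):
    """Check that empirical interval coverage is within tol of the stated coverage."""
    covered = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, actuals))
    return abs(covered / len(actuals) - stated) <= tol
```

A model that predicts 0.9 for a group where only half the outcomes are positive fails both calibration gates, whatever its headline accuracy.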

Mosaic AI Model Serving — A/B and canary

Traffic splits and canary rollouts are first-class. A new version starts with 5% of traffic, is observed against SLAs and model metrics, and then ramps up. Rollback is one click.
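
The ramp logic can be sketched in a few lines. This is not the Model Serving API: only the 5% starting share comes from the text above, while the later ramp steps, the hash-based routing, and the function names are assumptions for illustration.

```python
import hashlib

# Only the 5% starting point is from the post; the later steps are assumed.
RAMP_STEPS = [5, 25, 50, 100]


def next_traffic_percent(current, healthy):
    """Return the canary's next traffic share; 0 means roll back."""
    if not healthy:
        return 0
    later = [step for step in RAMP_STEPS if step > current]
    return later[0] if later else current


def routes_to_canary(request_id, percent):
    """Deterministically assign a request to the canary based on a hash bucket."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return bucket < percent
```

Hashing the request ID keeps a given request on the same side of the split across retries, which makes canary metrics easier to interpret than purely random routing.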

Monitoring — drift, calibration, performance

Three things to monitor:

- Feature drift — input distribution shift
- Calibration drift — ECE moving
- Performance drift — labeled outcomes degrading

Monitoring runs as a Databricks Job. Alerts go to Slack / Teams / PagerDuty.
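
As one possible feature-drift check, the sketch below computes the Population Stability Index (PSI) between a training-time sample and a recent serving sample. PSI is a technique chosen here for illustration, not something the stack above prescribes. The function name, the bin count, and the smoothing constant `eps` are assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between two samples, using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            b = int((v - lo) / width)
            counts[min(max(b, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A scheduled job could compute this per feature and alert when the value crosses a threshold. A cutoff around 0.2 is a commonly quoted rule of thumb, but the right threshold depends on the feature and should be tuned rather than taken as given.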

Closing

Production ML on Databricks is straightforward when the stack is right: Feature Store for consistency, MLflow Registry for lifecycle, Mosaic AI Model Serving for delivery, calibration-first evaluation, and disciplined monitoring. The training is the easy part.


Building Multi-Agent Workflows on Databricks (Mosaic AI Agent Framework)

Zorost Intelligence — Tue, 24 Feb 2026 09:00:00 +0000

Pull-quote: “Agents on the Lakehouse mean tools that read and write Delta tables, models that serve under MLflow, and evaluations that ship as Delta tables themselves.”

Why this matters

Agentic workflows are the next layer on the Lakehouse — agents that reason, plan, call tools, and produce verifiable artifacts. The Mosaic AI Agent Framework provides the runtime. The architectural decisions still belong to you.

Reference architecture

┌──────────────────────────────────────────────────────────────────┐
│                    AGENT (LangGraph / LlamaIndex / Custom)        │
│                                                                    │
│   Planner ──► Executor ──► Critic ──► Referee                    │
└─────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Typed Tools                 │ ◄── Tool catalog
       │   - read Delta tables         │     (Unity Catalog)
       │   - write Delta tables        │
       │   - call MLflow models        │
       │   - call REST APIs            │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Mosaic AI Model Serving     │
       │   - foundation models         │
       │   - fine-tuned models         │
       │   - per-agent traffic split   │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Evaluations as Delta tables │ ◄── Versioned
       │   - golden datasets           │
       │   - regression suite          │
       │   - hallucination detection   │
       └──────────────────────────────┘

What “typed tools” means

Every tool has a JSON schema for inputs and outputs. The agent cannot call a tool with invalid inputs — the schema rejects the call. This eliminates an entire class of failure that plagues unconstrained agents.
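
A minimal sketch of schema-gated tool calls follows. It validates only a small subset of JSON Schema (`type`, `required`, `additionalProperties`) with the standard library; a real deployment would more likely use a full validator. The tool name `read_delta_table`, its fields, and the `registry` layout are hypothetical.

```python
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate(args, schema):
    """Return a list of validation errors (empty if args match the schema subset)."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected field: {name}")
            continue
        expected = props[name]["type"]
        # bool is a subclass of int in Python, so exclude it for non-boolean types.
        type_ok = isinstance(value, JSON_TYPES[expected])
        if expected != "boolean" and isinstance(value, bool):
            type_ok = False
        if not type_ok:
            errors.append(f"field {name} must be of type {expected}")
    return errors


def call_tool(name, args, registry):
    """Run a tool only if its arguments pass the tool's input schema."""
    schema, fn = registry[name]
    errors = validate(args, schema)
    if errors:
        raise ValueError(f"rejected call to {name}: {errors}")
    return fn(**args)


# Hypothetical tool definition for illustration.
READ_SCHEMA = {
    "type": "object",
    "properties": {"table": {"type": "string"}, "limit": {"type": "integer"}},
    "required": ["table"],
    "additionalProperties": False,
}
registry = {"read_delta_table": (READ_SCHEMA, lambda table, limit=10: (table, limit))}
```

Because rejection happens before the tool function runs, a malformed call never touches a table or an external API. The error list can be fed back to the agent so that it can correct the call.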

What “evaluations as Delta tables” means

Evaluation results are stored as rows in versioned Delta tables. Each row is (agent_version, input, expected_output, actual_output, score, metadata). Regression analysis is a JOIN between two agent_version slices. New versions don’t promote unless they pass.
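
The regression JOIN can be illustrated with an in-memory SQLite table standing in for a Delta table. The column names follow the row layout above; the table name `eval_results`, the sample rows, and the promotion rule (no input may score lower than the baseline) are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a versioned Delta table
conn.execute(
    """CREATE TABLE eval_results (
           agent_version TEXT, input TEXT, expected_output TEXT,
           actual_output TEXT, score REAL, metadata TEXT)"""
)
# Hypothetical evaluation rows for two agent versions.
rows = [
    ("v1", "q1", "a", "a", 1.0, "{}"),
    ("v1", "q2", "b", "b", 1.0, "{}"),
    ("v2", "q1", "a", "a", 1.0, "{}"),
    ("v2", "q2", "b", "x", 0.0, "{}"),
]
conn.executemany("INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?)", rows)


def regressions(conn, baseline, candidate):
    """Inputs where the candidate scores lower than the baseline version."""
    query = """
        SELECT b.input, b.score, c.score
        FROM eval_results AS b
        JOIN eval_results AS c ON b.input = c.input
        WHERE b.agent_version = ? AND c.agent_version = ? AND c.score < b.score
        ORDER BY b.input
    """
    return conn.execute(query, (baseline, candidate)).fetchall()


def may_promote(conn, baseline, candidate):
    """Assumed gate: promote only if no input regressed."""
    return not regressions(conn, baseline, candidate)
```

In this sample, `v2` regresses on `q2`, so `may_promote(conn, "v1", "v2")` is false. The same query shape should carry over to Spark SQL against a Delta table, though that is not shown here.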

The agent / human contract

Where humans fit:

- High-risk operations require human-in-the-loop checkpoints. Agents can propose; humans approve.
- Critic disagreements with the executor route to humans when the referee cannot adjudicate.
- Periodic spot-checks on agent decisions are scheduled into the evaluation harness.

This is not “manual override.” This is a designed-in contract about which decisions are agent-final and which are human-final.
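
One way to make such a contract explicit is to encode it as a small routing function that the orchestrator consults before finalizing a decision. The enum, the function name, and the three boolean inputs are hypothetical. The sketch covers the first two rules above and leaves spot-check scheduling to the evaluation harness.

```python
from enum import Enum


class Decider(Enum):
    AGENT = "agent-final"
    HUMAN = "human-final"


def who_decides(high_risk, critic_agrees, referee_resolved):
    """Apply the contract: which party has the final say on this decision."""
    if high_risk:
        # High-risk operations always need human approval.
        return Decider.HUMAN
    if not critic_agrees and not referee_resolved:
        # Critic and executor disagree and the referee could not adjudicate.
        return Decider.HUMAN
    return Decider.AGENT
```

Keeping the rules in one function means the contract is versioned and testable like any other code, rather than scattered across prompts.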

Common architectural decisions

Decision	Default
Number of executors	One unless sub-goals are independent
Critic per executor or shared	Shared unless executors are heterogeneous
Memory model	Working memory in agent state; long-term memory in Delta table
Tool call timeout	30 s default, with retries on idempotent tools
Cost ceiling per session	Configurable; defaults to a hard cap
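
Two of these defaults, the tool call timeout with retries and the per-session cost ceiling, can be sketched with the standard library. The names `call_with_timeout` and `CostBudget`, the retry count, and the choice to retry only on timeout are assumptions. Note that a timed-out worker thread is abandoned rather than killed, which is a known limitation of this approach.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(fn, timeout_s=30.0, idempotent=False, max_retries=2):
    """Call fn with a timeout; retry on timeout only if the tool is idempotent."""
    attempts = 1 + (max_retries if idempotent else 0)
    last_error = None
    for _ in range(attempts):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout as exc:
            last_error = exc
        finally:
            pool.shutdown(wait=False)  # do not block on an abandoned worker
    raise TimeoutError(f"tool call timed out after {attempts} attempt(s)") from last_error


class CostBudget:
    """Hard per-session cost cap: a charge that would exceed it is refused."""

    def __init__(self, ceiling):
        self.ceiling = ceiling
        self.spent = 0.0

    def charge(self, amount):
        if self.spent + amount > self.ceiling:
            raise RuntimeError("session cost ceiling reached")
        self.spent += amount
```

Restricting retries to idempotent tools matters because a timed-out write may still have succeeded, and retrying it could apply the same side effect twice.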

Closing

Multi-agent workflows on Databricks are productive when the framework is paired with discipline: typed tools, deterministic logging, evaluations as Delta tables, and a designed-in agent / human contract. The Mosaic AI Agent Framework is the runtime; the architecture is yours.

