Zorost Intelligence, Author at Zorost Intelligence | AI, Cloud & Data Experts

Databricks Cost Optimization & FinOps: Where the Real Savings Are

Zorost Intelligence — Tue, 21 Apr 2026 09:00:00 +0000

Pull-quote: “Cost optimization is not a one-time project. It’s a recurring discipline. The tooling is there. The discipline is the ask.”

Why this matters

Most Databricks deployments have 30–60% slack in their spend within twelve months of go-live. Some of it is unavoidable (early-stage discovery). Some of it is technical (file layout, cluster sizing). Most of it is organizational (no cost ownership, no tagging, no review cadence).

Where the real savings are

Lever	Typical impact
Right-sized cluster types (Photon, autoscaling, spot)	15–30%
Job orchestration (concurrent runs, dependencies, retries)	5–15%
File layout and compaction (`OPTIMIZE`, `ZORDER BY`, liquid clustering)	10–25% on read-heavy workloads
Caching strategies (disk cache, formerly Delta cache; query result cache)	5–15%
Workload migration to Serverless SQL where appropriate	10–25%
BI semantic-model rationalization	10–20% on Power BI / Tableau queries
Autoscaling thresholds	5–10%
Tombstone management (`VACUUM`)	Cleanup, not a direct saving, but sustainable

Ranges are typical for engagements where the team has not previously focused on cost. Mature deployments have less to find.

Tagging and ownership — the prerequisite

Without tagging, you can’t optimize. Required tags:

cost_center
environment (dev / stage / prod)
owner (team or person)
workload (training / serving / ETL / BI / ad-hoc)

These flow into the system tables for cost reporting (system.billing.usage).
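A minimal sketch of a tag-coverage check, assuming usage rows have already been exported as dictionaries with a `custom_tags` mapping and a `usage_quantity` field. Those field names are modeled on `system.billing.usage` and should be verified against your workspace; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict

# The four tags the section lists as required.
REQUIRED_TAGS = ("cost_center", "environment", "owner", "workload")

def untagged_usage(rows, required=REQUIRED_TAGS):
    """Sum the usage quantity of rows that are missing each required tag."""
    missing = defaultdict(float)
    for row in rows:
        tags = row.get("custom_tags") or {}   # treat a null tag map as empty
        for tag in required:
            if not tags.get(tag):
                missing[tag] += row["usage_quantity"]
    return dict(missing)

# Hypothetical sample rows, not real billing data.
rows = [
    {"usage_quantity": 10.0, "custom_tags": {"cost_center": "fin", "environment": "prod",
                                             "owner": "data-eng", "workload": "ETL"}},
    {"usage_quantity": 4.0, "custom_tags": {"environment": "dev"}},
    {"usage_quantity": 6.0, "custom_tags": None},
]
print(untagged_usage(rows))
```

Reporting untagged usage per tag, rather than a single pass/fail flag, shows which tag to chase first.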

The audit, in twelve hours

A typical audit takes about twelve hours of senior engineering time:

Pull system.billing.usage for the last 90 days, joined with cluster metadata
Identify the top 10 jobs by cost
For each, evaluate: is the cluster the right type? Is autoscaling tuned? Are files compacted? Is the workload running at the right cadence?
Identify candidates for serverless migration
Identify candidates for materialized view replacement
Produce a prioritized list with estimated savings

Most teams find five to ten actions that together deliver 20–40% savings.
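The first two audit steps can be sketched in a few lines, assuming the usage rows have been flattened into dictionaries with `job_id`, `sku_name`, and `usage_quantity` fields and that a per-SKU price map is available. The field names, the SKU labels, and the prices below are illustrative assumptions, not values taken from any real account.

```python
from collections import defaultdict

def top_jobs_by_cost(usage_rows, price_per_unit, n=10):
    """Rank jobs by estimated cost: usage quantity times the unit price of the SKU."""
    cost = defaultdict(float)
    for row in usage_rows:
        job_id = row.get("job_id")
        if job_id is None:   # skip usage that is not attributable to a job
            continue
        cost[job_id] += row["usage_quantity"] * price_per_unit[row["sku_name"]]
    return sorted(cost.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical prices and usage rows.
prices = {"JOBS": 0.25, "ALL_PURPOSE": 0.5}
usage = [
    {"job_id": "nightly_batch", "sku_name": "JOBS", "usage_quantity": 100.0},
    {"job_id": "nightly_batch", "sku_name": "ALL_PURPOSE", "usage_quantity": 20.0},
    {"job_id": "hourly_ingest", "sku_name": "ALL_PURPOSE", "usage_quantity": 40.0},
    {"job_id": None, "sku_name": "ALL_PURPOSE", "usage_quantity": 999.0},
]
print(top_jobs_by_cost(usage, prices))
```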

Common findings

A nightly batch job using a high-end cluster size when a Photon-enabled smaller cluster would do
A streaming pipeline running with a cluster sized for peak when traffic is bimodal
A Power BI model importing 80% of data that nobody queries
A SELECT * materialized in a downstream view, doubling storage cost on a hot dataset
An ad-hoc cluster left running over a weekend

Cost ownership cadence

The discipline that holds savings is a monthly cost review with data leadership and the FinOps lead. Each owner explains anomalies. Tags get fixed. Wasteful patterns get retired.

Closing

Cost optimization on Databricks is not a one-time project. It is a recurring discipline backed by tagging, system tables, and a monthly review. The platform tooling is there. The discipline is the ask.

The post Databricks Cost Optimization & FinOps: Where the Real Savings Are appeared first on Zorost Intelligence | AI, Cloud & Data Experts.

Air-Gapped Agentic Stacks for Sovereign Environments

Zorost Intelligence — Tue, 14 Apr 2026 09:00:00 +0000

Pull-quote: “Sovereign AI is not ‘AI minus features.’ It is ‘AI plus discipline.’”

Why this matters

Some federal mission environments cannot accept internet egress. Some cannot accept any data leaving the customer boundary. Some cannot accept models that the customer cannot inspect end-to-end. Cloud-only AI vendors do not serve these environments.

The good news: air-gapped agentic AI is operationally feasible in 2026. The bad news: it requires engineering discipline that most vendors don’t have.

The reference stack (engineering view)

Local LLM serving. Open-weights models (Llama 3.x, Qwen 2.5, Mistral, Phi-4, Gemma 3, code-tuned variants) served via Ollama, vLLM, or llama.cpp on customer hardware.
Local embeddings. Open-source embedding models on the same stack.
Local vector database. pgvector, Weaviate, or Qdrant on a private subnet.
Local model registry. MLflow Model Registry running inside the boundary.
Local RAG pipeline. Ingestion, chunking, embedding, retrieval, re-ranking, generation — all inside the boundary.
Local evaluation harness. Golden datasets, regression suites, hallucination detection, grounding scoring — version-controlled and runnable inside the boundary.
Local observability. Grafana, Prometheus, Loki running inside the boundary.
Local update pipeline. Models, weights, and corpus updates delivered as signed bundles via approved transfer.
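The signed-bundle step in the last item can be illustrated with a short sketch. It assumes a bundle is a set of named files plus a manifest of SHA-256 digests, and it uses an HMAC with a shared key as a standard-library stand-in for the signature. A real deployment would more likely use asymmetric signatures. The function names and file names are hypothetical.

```python
import hashlib
import hmac
import json

def sign_manifest(files, key):
    """Build a manifest of SHA-256 digests plus an HMAC over the digest map."""
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    payload = json.dumps(digests, sort_keys=True).encode()
    return {"digests": digests, "mac": hmac.new(key, payload, hashlib.sha256).hexdigest()}

def verify_bundle(files, manifest, key):
    """Accept the bundle only if the manifest is authentic and every file matches it."""
    payload = json.dumps(manifest["digests"], sort_keys=True).encode()
    expected_mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected_mac, manifest["mac"]):
        return False                       # manifest was altered or signed with another key
    if set(files) != set(manifest["digests"]):
        return False                       # a file was added or dropped in transit
    return all(
        hmac.compare_digest(hashlib.sha256(data).hexdigest(), manifest["digests"][name])
        for name, data in files.items()
    )

# Hypothetical bundle contents.
bundle = {"model.gguf": b"weights", "corpus.jsonl": b"documents"}
key = b"shared-secret-for-illustration"
manifest = sign_manifest(bundle, key)
print(verify_bundle(bundle, manifest, key))
```

Verifying the manifest before the individual files means a tampered digest list is rejected without hashing anything large.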

The reference stack (governance view)

Documented model selection — which model, which version, which quantization, why
Documented evaluation — what the golden dataset is, what it tests, what passing looks like
Documented update procedure — who signs the update bundle, who imports it, who validates it post-import
Documented retirement — when and why a model is retired
Audit trail — every decision the system makes is logged with model version, prompt, output, and grounding evidence

Trade-offs vs. cloud

Latency. Comparable for the smaller models; better for chained calls (no network round-trip).
Capability. Behind the absolute frontier of closed-source models. Open-weights models in 2026 are excellent but not at parity with the strongest closed-source options.
Cost. Higher up-front (hardware), lower over time (no per-token bills).
Update cadence. Slower because updates must clear the boundary.
Evaluation discipline. Tighter, because there is no vendor evaluation to lean on.
Sovereignty. Complete. The customer owns the stack end-to-end.

Where it fits in federal posture

Air-gapped agentic stacks fit:

Classified or otherwise sensitive environments without internet egress
Mission environments where data cannot leave the customer boundary
Programs where the agency requires end-to-end inspection and audit of the AI stack

They do not fit:

Environments where the very latest closed-source model capability is required and the data sensitivity allows cloud
Environments where rapid model iteration is more important than sovereignty

Closing

Sovereign agentic AI is real. It requires engineering discipline. We’ve built it for our manufacturing-quality platform (with a partner) and we apply the same discipline to federal mission environments. The deployment shape is different from cloud. The trade-offs are real. For the customers who need it, no other shape fits.


Power BI Direct Lake on Databricks SQL: a Modernization Playbook

Zorost Intelligence — Tue, 31 Mar 2026 09:00:00 +0000

Pull-quote: “Direct Lake is not faster DirectQuery. It is a different mode that eliminates a class of refreshes that should never have existed.”

Why this matters

Power BI has been deployed in three modes for a decade: Import, DirectQuery, and Composite. Each has trade-offs. Import is fast but stale; DirectQuery is fresh but slow; Composite is a compromise. Direct Lake — Power BI talking directly to Delta tables in Databricks SQL — is a fourth mode that eliminates a class of refresh problems that should never have existed.

The four modes

Mode	Freshness	Performance	When to use
Import	Stale until next refresh	Fast	Small models, infrequent updates
DirectQuery	Live	Slow on large fact tables	Real-time-ish dashboards over modest volume
Composite	Mixed	Mixed	Hybrid scenarios
Direct Lake	Live (on Delta)	Fast	Lakehouse-native consumption

Why Direct Lake works

Direct Lake reads Delta files directly into Power BI’s analytics engine without import. There is no refresh schedule. There is no DirectQuery overhead. The semantic model points at Unity Catalog tables and the engine handles the rest.

The conditions for it to work:

Source data must be in Delta format
Tables must be in Unity Catalog
Model size must fit in the engine’s memory budget for the SKU
DAX must be Direct Lake-compatible (most is; some isn’t)

Migration playbook

Phase	Output
Discovery	Catalog of existing Power BI models · usage telemetry
Source landing in Delta	Sources moved to Delta tables in Unity Catalog
Semantic model rebuild	New model on Direct Lake
Visual rebuild	Reports and dashboards rebuilt against the new model
Parallel run	Old and new models in production simultaneously
Cutover	Old retired

Governance benefits

Row- and column-level security lives in dynamic views in Unity Catalog, not in the semantic model. One source of truth for security.
Lineage covers the entire path from source through Delta to Power BI.
Performance tuning happens at the Delta layer (liquid clustering, OPTIMIZE, Z-ordering) and benefits every consumer, not just Power BI.

Closing

Direct Lake is the modern Power BI mode for Lakehouse-native consumption. The migration is methodical, the trade-offs are clear, and the result is faster, fresher dashboards with simpler operations.


Calibration-First AI for Federal Decision Support

Zorost Intelligence — Tue, 24 Mar 2026 09:00:00 +0000

Pull-quote: “Federal procurement should require calibration metrics in every AI proposal. Anything less is buying a black box.”

Why this matters

Federal decision support runs on AI now. Risk scoring, fraud detection, predictive maintenance, safety analysis, mission planning — every category has at least one AI vendor pitching the agency. The procurement question is: how does an agency tell the credible vendors from the rest?

Headline accuracy doesn’t help. Every vendor claims high accuracy. The number doesn’t translate into operational trust.

The right standard is calibration — and conformal prediction for individual uncertainty.

Calibration as a procurement requirement

Expected Calibration Error (ECE) is the standard metric. Below 0.02 is very good. Below 0.01 is excellent. The metric is widely adopted in academic ML evaluation, and reporting it is the right floor for any high-stakes federal AI use.

A procurement RFP for an AI system should require:

ECE on a documented holdout slice of representative size
Reliability diagrams showing calibration across the full probability range
Sensitivity analysis on how calibration degrades under common distribution shifts (seasonal, regime change, missing data)
A monitoring plan for calibration drift in production

Every vendor that ships calibrated models can produce this. Every vendor that ships only headline accuracy will struggle to.

Conformal prediction as the second standard

Calibration tells you the stated probabilities are honest on average. Conformal prediction tells you the individual uncertainty is honest. Locally Adaptive Conformal Prediction (LACP) produces distribution-free prediction intervals: when the model says “between 18 and 47 minutes with 90% coverage,” the actual answer falls in such intervals 90% of the time across cases, regardless of underlying distribution shape, provided production data stays exchangeable with the calibration data.

For federal decision support, this is non-negotiable. A point estimate without coverage is operationally meaningless.
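To make the coverage idea concrete, here is a minimal sketch of split conformal prediction with normalized residual scores, one common way to obtain locally adaptive intervals. It is not necessarily the LACP variant referred to above. The `spread` values stand for a per-case difficulty estimate from a separate model, which is an assumption of this sketch, and all names and numbers are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))   # 1-indexed order statistic
    if rank > n:
        return math.inf                        # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate(y_true, y_pred, spread, alpha=0.1):
    """Score each calibration case by its residual relative to the local spread."""
    scores = [abs(y - p) / s for y, p, s in zip(y_true, y_pred, spread)]
    return conformal_quantile(scores, alpha)

def interval(y_pred, spread, q):
    """Interval that is wider where the estimated spread is larger."""
    return (y_pred - q * spread, y_pred + q * spread)

# Hypothetical calibration slice: ten held-out cases.
y_true = [10 + i for i in range(10)]
y_pred = [10] * 10
spread = [2] * 10
q = calibrate(y_true, y_pred, spread, alpha=0.1)
print(interval(30.0, 4.0, q))
```

The guarantee is marginal: it holds on average over cases drawn exchangeably with the calibration slice, which is why a drift monitoring plan belongs next to any stated coverage.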

NIST AI RMF alignment

The NIST AI Risk Management Framework articulates four functions: Map, Measure, Manage, Govern. Calibration and conformal prediction sit squarely in Measure. They are the operationally meaningful measurements of model trustworthiness — far more useful than the marketing accuracy a vendor leads with.

What this implies for vendor evaluation

Three concrete recommendations for federal AI procurement:

Require ECE and reliability diagrams in every AI proposal.
Require a stated coverage method (preferably conformal) for any system that produces numerical estimates.
Require a monitoring plan for calibration drift, not just accuracy drift.

A vendor that cannot answer those is not a credible vendor for high-stakes use.

Closing

Federal decision support is too consequential to run on headline accuracy. Calibration and conformal prediction are the right standards. Procurement should require them. Vendors should ship them. We do, and we think the field should follow.


What We Open-Sourced This Year — and Why

Zorost Intelligence — Tue, 10 Mar 2026 09:00:00 +0000

Pull-quote: “We don’t open-source everything. We open-source the things that should belong to the community.”

Why this matters

A lot of AI startups treat “open source” as a marketing posture. We treat it as a deliberate decision per project. Some projects belong in the commons because the community is better served by everyone using and improving them. Others stay proprietary because the R&D investment is significant and the value flows back to customers through the product.

This year we open-sourced four projects.

MarkForge

What it is. Bi-directional Markdown ↔ PDF / Office / HTML conversion. Built on Microsoft’s MarkItDown with extensions for PDF rendering, page sizes, and a WordPress plugin.

Why open source. Document conversion is plumbing. Plumbing should be free. Every team — internal docs, technical writers, AI ingestion engineers — needs it, and there is no defensible business advantage in hoarding it.

Status. Production. Used inside several of our platforms as an ingestion stage for RAG.

Weaviate Local UI

What it is. A local desktop interface for the Weaviate vector database. Schema browsing, object inspection, vector search, RAG chat with multi-provider LLM support, document upload with chunking and embedding.

Why open source. Vector databases are an active part of the agentic AI stack. Tooling that makes them accessible benefits the entire community. Weaviate is excellent and deserves a great local UX.

Status. Production. Used inside our development workflow for any RAG system in early design.

DevOps Monitor

What it is. A complete Docker-based monitoring stack: Grafana, Prometheus, Loki, Alertmanager, cAdvisor, Node Exporter. Configuration-driven target management. HTTP health checks. Per-application dashboards.

Why open source. Every multi-service deployment needs this. Most teams either rebuild it from scratch (slow) or adopt a vendor SaaS (expensive, and it ships telemetry outside your boundary). The reference stack is a community good.

Status. Production. Runs in front of every internal Zorost service.

Sigma Axion (selected components)

What it is. Components of our quantitative research framework — indicator chains, walk-forward backtesting infrastructure, transaction-cost modeling — published under MIT.

Why open source for these components. The plumbing of a quant stack should be a community good. The actual edges (the strategies themselves) are not open-sourced, because that is where the R&D investment lives.

Status. Production. Live at sigmaaxion.com.

What we don’t open-source

We don’t open-source the platforms with significant proprietary R&D investment: AeroFarr (causal AI for aviation), EvidAI (pharma evidence synthesis), FreightCortex (freight intelligence), Aquil (geopolitical intelligence), SPCio (co-developed with a manufacturing intelligence partner), or ComplyGrid. The investment is real and the value flows back to customers through the product.

Closing

Open source is a deliberate decision, not a posture. Some things belong to the community; some things belong to customers. We try to draw the line clearly.


Production ML on Databricks: MLflow, Feature Store, Calibration

Zorost Intelligence — Tue, 03 Mar 2026 09:00:00 +0000

Pull-quote: “Production ML is not training a model. It’s the disciplines around training, registering, serving, monitoring, retraining, and retiring.”

Why this matters

Most teams shipping their first ML model on Databricks underestimate the discipline required. Training is the small part. The system around training is the large part.

The reference stack

   Data ──►  Feature Store  ◄────  online + offline serving
                  │
                  ▼
   Training pipeline (Databricks Job)
                  │
                  ▼
   MLflow Model Registry  ◄────  versions, stages, approvals
                  │
                  ▼
   Mosaic AI Model Serving  ◄────  A/B + canary
                  │
                  ▼
   Monitoring (drift, calibration, performance)
                  │
                  ▼
   Retraining trigger (event, schedule, drift threshold)

Feature Store — point-in-time correctness

The Feature Store enforces point-in-time correctness: training features are joined as they were at the historical point in time the label was generated. This eliminates leakage that destroys offline evaluation reliability. Online serving uses the same feature definitions to keep training and serving consistent.
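The point-in-time rule can be shown in plain Python, independent of any Feature Store API. This is a sketch of the join semantics only: for each label, take the latest feature value observed at or before the label timestamp. The entity names, timestamps, and function name are hypothetical.

```python
from bisect import bisect_right

def point_in_time_join(labels, feature_history):
    """Attach to each label the latest feature value with timestamp <= label timestamp."""
    joined = []
    for entity, label_ts, label in labels:
        history = feature_history.get(entity, [])   # list of (ts, value), sorted by ts
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, label_ts)    # count of observations at or before label_ts
        value = history[idx - 1][1] if idx else None
        joined.append((entity, label_ts, value, label))
    return joined

# Hypothetical feature history and labels.
feature_history = {"acct_1": [(1, 0.2), (5, 0.7), (9, 0.9)]}
labels = [("acct_1", 6, 1), ("acct_1", 0, 0), ("acct_2", 3, 1)]
print(point_in_time_join(labels, feature_history))
```

The label at time 6 sees the value written at time 5, never the later value from time 9. That later value is exactly the leakage a naive latest-value join would introduce.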

MLflow Model Registry — lifecycle stages

Models progress through stages with explicit gates:

Stage	Gate
Staging	Passes regression suite + calibration checks
Production	Passes A/B + canary criteria
Archived	Replaced by a newer Production model

Every stage transition is logged with the user, the reason, and the metrics that justified it.

Calibration-first evaluation

We require every model to ship with Expected Calibration Error (ECE) and conformal prediction intervals (LACP). Headline accuracy is reported but is not the gate.

Gate	Default threshold
ECE	< 0.02 on holdout
Reliability diagram	No bin > 0.05 deviation
Conformal coverage	Within 2pp of stated coverage
Performance regression	No metric below the prior production model
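The gate table translates directly into a promotion check. The sketch below hard-codes the default thresholds from the table and assumes every performance metric is higher-is-better. The function name, argument names, and sample numbers are hypothetical.

```python
def passes_gates(ece, max_bin_deviation, observed_coverage, stated_coverage,
                 candidate_metrics, production_metrics):
    """Evaluate the four default gates; return (overall_pass, per_gate_results)."""
    checks = {
        "ece": ece < 0.02,
        "reliability_diagram": max_bin_deviation <= 0.05,
        # "within 2pp" is read here as an absolute gap of at most 0.02
        "conformal_coverage": abs(observed_coverage - stated_coverage) <= 0.02,
        # assumes higher is better for every metric of the prior production model
        "performance_regression": all(
            candidate_metrics[name] >= value for name, value in production_metrics.items()
        ),
    }
    return all(checks.values()), checks

# Hypothetical candidate evaluated against a hypothetical production model.
ok, detail = passes_gates(
    ece=0.012, max_bin_deviation=0.03,
    observed_coverage=0.89, stated_coverage=0.90,
    candidate_metrics={"auc": 0.91, "recall": 0.80},
    production_metrics={"auc": 0.90, "recall": 0.80},
)
print(ok, detail)
```

Returning the per-gate results alongside the overall verdict gives the stage-transition log something concrete to record.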

Mosaic AI Model Serving — A/B and canary

Traffic splits and canary rollouts are first-class. New versions get 5% of traffic, observed for SLAs and metrics, then ramp. Rollback is one click.

Monitoring — drift, calibration, performance

Three things to monitor:

Feature drift — input distribution shift
Calibration drift — ECE moving
Performance drift — labeled outcomes degrading

Monitoring runs as a Databricks Job. Alerts go to Slack / Teams / PagerDuty.
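For feature drift, one widely used statistic is the Population Stability Index (PSI). It is not named in this post and is swapped in here purely as an illustration. The bin edges, the smoothing constant, and the sample data are assumptions of the sketch.

```python
import math

def psi(baseline, current, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1      # index of the bin v falls into
        return [max(c / len(values), eps) for c in counts]  # eps avoids log(0)
    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))

# Hypothetical training-time and production samples of one feature.
baseline = [1, 2, 3, 4, 5, 6, 7, 8]
shifted = [5, 6, 7, 8, 7, 8, 7, 8]
edges = [2.5, 4.5, 6.5]
print(psi(baseline, baseline, edges), psi(baseline, shifted, edges))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the alert threshold should be set per feature from observed history.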

Closing

Production ML on Databricks is straightforward when the stack is right: Feature Store for consistency, MLflow Registry for lifecycle, Mosaic AI Model Serving for delivery, calibration-first evaluation, and disciplined monitoring. The training is the easy part.


Building Multi-Agent Workflows on Databricks (Mosaic AI Agent Framework)

Zorost Intelligence — Tue, 24 Feb 2026 09:00:00 +0000

Pull-quote: “Agents on the Lakehouse mean tools that read and write Delta tables, models that serve under MLflow, and evaluations that ship as Delta tables themselves.”

Why this matters

Agentic workflows are the next layer on the Lakehouse — agents that reason, plan, call tools, and produce verifiable artifacts. The Mosaic AI Agent Framework provides the runtime. The architectural decisions still belong to you.

Reference architecture

┌──────────────────────────────────────────────────────────────────┐
│                    AGENT (LangGraph / LlamaIndex / Custom)        │
│                                                                    │
│   Planner ──► Executor ──► Critic ──► Referee                    │
└─────────────────────┬────────────────────────────────────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Typed Tools                 │ ◄── Tool catalog
       │   - read Delta tables         │     (Unity Catalog)
       │   - write Delta tables        │
       │   - call MLflow models        │
       │   - call REST APIs            │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Mosaic AI Model Serving     │
       │   - foundation models         │
       │   - fine-tuned models         │
       │   - per-agent traffic split   │
       └──────────────┬───────────────┘
                      │
                      ▼
       ┌──────────────────────────────┐
       │   Evaluations as Delta tables │ ◄── Versioned
       │   - golden datasets           │
       │   - regression suite          │
       │   - hallucination detection   │
       └──────────────────────────────┘

What “typed tools” means

Every tool has a JSON schema for inputs and outputs. The agent cannot call a tool with invalid inputs — the schema rejects the call. This eliminates an entire class of failure that plagues unconstrained agents.
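A minimal sketch of the idea, using a hand-rolled subset of JSON Schema (required fields and primitive types only) so that it runs with the standard library. A real deployment would use a full schema validator. The tool name, its schema, and the helper names are hypothetical and are not part of the Mosaic AI Agent Framework API.

```python
TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

class ToolInputError(ValueError):
    """Raised when arguments do not match the tool's declared input schema."""

def validate(args, schema):
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            raise ToolInputError(f"missing required field: {name}")
    for name, value in args.items():
        if name not in properties:
            raise ToolInputError(f"unexpected field: {name}")
        type_name = properties[name]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        bool_mismatch = isinstance(value, bool) != (type_name == "boolean")
        if bool_mismatch or not isinstance(value, TYPES[type_name]):
            raise ToolInputError(f"{name} must be of type {type_name}")

def call_tool(tool, args):
    """Validate first; the tool function never sees invalid input."""
    validate(args, tool["input_schema"])
    return tool["fn"](**args)

# Hypothetical tool that only builds a query string.
read_rows_tool = {
    "name": "read_delta_rows",
    "input_schema": {
        "type": "object",
        "properties": {"table": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["table", "limit"],
    },
    "fn": lambda table, limit: f"SELECT * FROM {table} LIMIT {limit}",
}
print(call_tool(read_rows_tool, {"table": "main.sales.orders", "limit": 5}))
```

Because the rejection is raised as a typed error, the agent loop can feed the message back to the model and let it retry with corrected arguments.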

What “evaluations as Delta tables” means

Evaluation results are stored as rows in versioned Delta tables. Each row is (agent_version, input, expected_output, actual_output, score, metadata). Regression analysis is a JOIN between two agent_version slices. New versions don’t promote unless they pass.
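The regression JOIN can be sketched over in-memory rows with the same columns. In practice this would be a query over the Delta table. The tolerance parameter, the function name, and the sample rows are hypothetical.

```python
def regressions(rows, old_version, new_version, tolerance=0.0):
    """Join two agent_version slices on input; return inputs where the new score dropped."""
    old = {r["input"]: r["score"] for r in rows if r["agent_version"] == old_version}
    new = {r["input"]: r["score"] for r in rows if r["agent_version"] == new_version}
    return sorted(
        (inp, old[inp], new[inp])
        for inp in old.keys() & new.keys()        # inner join on input
        if new[inp] < old[inp] - tolerance
    )

# Hypothetical evaluation rows (expected/actual outputs and metadata omitted for brevity).
rows = [
    {"agent_version": "v1", "input": "q1", "score": 0.9},
    {"agent_version": "v1", "input": "q2", "score": 0.8},
    {"agent_version": "v2", "input": "q1", "score": 0.95},
    {"agent_version": "v2", "input": "q2", "score": 0.6},
]
failed = regressions(rows, "v1", "v2")
print("promote" if not failed else f"blocked by {failed}")
```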

The agent / human contract

Where humans fit:

High-risk operations require human-in-the-loop checkpoints. Agents can propose; humans approve.
Critic disagreements with the executor route to humans when the referee cannot adjudicate.
Periodic spot-checks on agent decisions are scheduled into the evaluation harness.

This is not “manual override.” This is a designed-in contract about which decisions are agent-final and which are human-final.

Common architectural decisions

Decision	Default
Number of executors	One unless sub-goals are independent
Critic per executor or shared	Shared unless executors are heterogeneous
Memory model	Working memory in agent state; long-term memory in Delta table
Tool call timeout	30 s default, with retries on idempotent tools
Cost ceiling per session	Configurable; defaults to a hard cap

Closing

Multi-agent workflows on Databricks are productive when the framework is paired with discipline: typed tools, deterministic logging, evaluations as Delta tables, and a designed-in agent / human contract. The Mosaic AI Agent Framework is the runtime; the architecture is yours.


When Agents Call Agents: Why the MCP Server Matters in Freight

Zorost Intelligence — Tue, 24 Feb 2026 09:00:00 +0000

Pull-quote: “If your platform isn’t callable by other agents, your platform isn’t future-proof.”

Why this matters

The next generation of enterprise software is being shaped by a simple fact: users have agents now. Claude Desktop, custom internal agents, vendor-provided agents — they’re all going to call your platform. Either they call it through your REST API (and the agent has to know your URL structure, your authentication, your error semantics) or they call it through a standard protocol.

That standard is Model Context Protocol (MCP).

What MCP is

MCP is an open protocol developed by Anthropic and adopted across the agent ecosystem. It defines how a server describes its tools, how a host (the agent’s runtime) discovers and calls those tools, and how results are returned. The result is a clean separation: tools are advertised, agents discover and call them, and you can swap tool servers without touching the agent.

For FreightCortex, the MCP server is a thin layer that exposes our 16 tools using the protocol. An external agent — a customer’s internal Claude Desktop, an OEM’s analytics chatbot, or a third-party tool — can connect to our MCP endpoint and use FreightCortex like a native tool.

What this unlocks

Three things:

Native callability from any MCP-compatible agent. Customers do not need to write custom integrations. Their agent just connects to our MCP server.
Composability with other tools. A customer agent can use FreightCortex tools alongside their own internal tools. The agent decides when to call which.
Future-proofing. As the agent ecosystem grows, MCP-compatible platforms are accessible by default. REST-only platforms have to be manually integrated, one customer at a time.

What it requires

Three engineering investments:

Tool contracts — every tool we want to expose has a typed schema. (We already had this.)
The MCP server itself — a thin transport layer over those tools.
Authentication and rate limiting — MCP doesn’t replace your existing auth; it sits on top of it.

A concrete example

An analyst is using Claude Desktop on her workstation. She asks “what’s driving the cost increase on the Atlanta–Dallas corridor?” Claude knows about the FreightCortex MCP server (configured once per workstation) and decides to use it. It calls query_corridor_metrics, compute_anomaly_score, query_carrier_metrics, and run_capacity_simulation — and produces an answer with the same structure as the answer it would have given inside the FreightCortex web app, except this time it is in her existing analyst environment.

The customer never had to log in to FreightCortex.
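On the wire, each of those tool invocations is a JSON-RPC 2.0 request using the MCP `tools/call` method. The sketch below only builds the request body. Transport, session initialization, and authentication are omitted, and the argument names passed to `query_corridor_metrics` are hypothetical because the post does not show the tool's schema.

```python
import json
from itertools import count

_request_ids = count(1)

def tools_call(name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# Argument names and values are illustrative assumptions.
request = tools_call(
    "query_corridor_metrics",
    {"origin": "ATL", "destination": "DFW", "window_days": 90},
)
print(json.dumps(request, indent=2))
```

The agent never hard-codes this shape per platform. It learns the tool names and input schemas from the server's `tools/list` response and fills in the arguments itself.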

Closing

If your platform isn’t callable by other agents, your platform isn’t future-proof. MCP is how you make it callable. It is a small engineering investment with very high leverage.


Hybrid Retrieval: Why Vector Alone Isn’t Enough

Zorost Intelligence — Tue, 17 Feb 2026 09:00:00 +0000

Pull-quote: “Pure vector retrieval is the most common production-grade RAG mistake. Pure BM25 is the second most common.”

Why this matters

A pattern repeats in every RAG project that goes wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is an architectural change: hybrid retrieval.

The components

Query
  │
  ├──► Dense (vector)   — pgvector / Weaviate / Qdrant + an embedding model
  │
  ├──► Sparse (BM25)    — Postgres FTS / Elasticsearch / OpenSearch
  │
  ├──► Optional filters — date range, source, entity tags
  │
  └──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                                │
                                                                ▼
                                                Citation-grounded generation

Why each piece matters

Vector is excellent at semantic similarity — finding documents that are about the same topic in different words. It is bad at named entities — exact terms, IDs, dates.
BM25 is the opposite — excellent at named entities, weaker on semantic similarity.
Filters — when the question is bounded (“just look at 2024 reports about Boeing 737”), filters dramatically reduce the candidate set before ranking.
Merge — Reciprocal Rank Fusion (RRF) is a clean default. Weighted merges work with calibrated scores.
Cross-encoder re-rank — sees the query and the candidate document together and scores them jointly. More expensive than bi-encoder vector search, but the precision improvement on the top-K is large enough to pay for itself.
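The merge step is small enough to show in full. This is a sketch of Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the commonly used default, and the document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["d3", "d1", "d2"]    # hypothetical vector-search results
sparse = ["d1", "d4", "d3"]   # hypothetical BM25 results
print(reciprocal_rank_fusion([dense, sparse]))
```

RRF uses only ranks, so it needs no score normalization between the dense and sparse retrievers. That is what makes it a clean default before any weighting is calibrated.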

What changes when you do this right

Hallucination rate drops. The model has better evidence to ground in.
Citation precision goes up. The cited documents actually support the claim.
Edge cases (rare entity queries, exact-quote queries) work properly.
Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.

Common mistakes

No re-ranker. Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.
No filtering. Filtering before retrieval is essentially free if your data is properly indexed.
No evaluation. Without a golden Q&A dataset and grounding scoring, you have no way to compare retrieval architectures.

Closing

Pure vector retrieval is the most common production-grade RAG mistake. Hybrid retrieval — vector + sparse + filters + re-rank — is the boring, reliable, production answer. Every Zorost RAG system runs this architecture.


Why Calibration Matters More Than Accuracy: an ECE 0.012 Story

Zorost Intelligence — Tue, 10 Feb 2026 09:00:00 +0000

Pull-quote: “When the model says 70%, it should be right 70% of the time. That’s calibration. Anything less is dishonest.”

Why this matters

“Our model is 92% accurate” is a marketing line. It tells you almost nothing about whether you should trust the model with a decision. The real question is: when the model says it is 70% confident, is it actually right 70% of the time?

That is calibration. The metric is Expected Calibration Error (ECE).

The metric, briefly

Group predictions by their stated probability. For each bin, compare the average predicted probability to the actual observed frequency. The average of the absolute differences, weighted by the share of predictions in each bin, is the ECE. Lower is better. Below 0.02 is very good in production. Below 0.01 is excellent.
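The definition above fits in a few lines. This sketch uses equal-width bins over the predicted positive-class probability of a binary classifier. The bin count and the sample data are assumptions, and it is not a description of how AeroFarr computes its figure.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)    # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        frequency = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - frequency)
    return ece

# Hypothetical predictions: the model says 75% four times.
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))   # right three times in four
print(expected_calibration_error([0.75] * 4, [1, 0, 0, 0]))   # right once in four
```

The first call is perfectly calibrated on this toy slice. The second is overconfident by 50 points even though both make the same predictions.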

AeroFarr’s gate classifier achieves ECE 0.012 on 581,316 held-out flights. That means the predicted probabilities track the actual observed frequencies very tightly across the full probability range — not just at the mean.

How we got there

Three ingredients:

A multi-head stacked architecture — separate heads for gate / severity / regression / quantile, each tuned on the loss most appropriate for its job, then combined under a non-linear meta-learner. The meta sees the heads’ outputs and learns how to combine them. Calibration is enforced at each head and at the meta.
Loss functions chosen for calibration, not accuracy. Cross-entropy with label smoothing for classifiers; quantile loss for the quantile heads.
Post-hoc calibration on a holdout slice. Platt scaling and isotonic regression are applied as a final stage on a slice of data the heads never saw.

Calibration has to be designed in from the start. Bolting it on at the end as a band-aid does not work for high-stakes operational use.
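For the post-hoc stage, isotonic regression can be sketched with the pool-adjacent-violators algorithm. It is fitted on the holdout slice and applied to raw scores at inference. This is a generic step-function version with hypothetical data, not AeroFarr's implementation, and Platt scaling is not shown.

```python
def isotonic_fit(scores, labels):
    """Pool adjacent violators; return non-decreasing (upper_score, calibrated_value) steps."""
    blocks = []   # each block holds [label_sum, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the latest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            label_sum, n, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += n
            blocks[-1][2] = upper
    return [(upper, label_sum / n) for label_sum, n, upper in blocks]

def isotonic_predict(steps, score):
    """Map a raw score to the calibrated value of the first step that covers it."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]   # scores above the fitted range take the top value

# Hypothetical holdout slice of raw scores and outcomes.
steps = isotonic_fit([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1])
print(steps, isotonic_predict(steps, 0.35))
```

The fit never reorders cases. It only flattens stretches where the observed frequency runs against the score, so the ranking quality of the underlying model is preserved.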

Why it matters operationally

If a planner is making a “should we keep this aircraft on the gate?” decision and the model says 30% chance of cancellation, the planner’s mental model is: roughly one in three. If the model is poorly calibrated and 30% is actually 60%, the planner’s prior is wrong, and every decision downstream is wrong.

Calibrated probabilities preserve the planner’s intuition. Uncalibrated probabilities corrupt it.

Conformal prediction on top

Calibration tells you about average behavior. Conformal prediction tells you about individual uncertainty. We use Locally Adaptive Conformal Prediction (LACP) to produce distribution-free prediction intervals, meaning that when AeroFarr says “delay between 18 and 47 minutes with 90% coverage,” the actual delay falls in such intervals 90% of the time across flights, regardless of underlying distribution shape, as long as production data stays exchangeable with the calibration data.

This is the second ingredient of honesty in a production model. Calibration says the model’s stated probabilities mean what they say. Conformal prediction says the model’s stated intervals mean what they say.

Closing

Headline accuracy is a misleading metric for high-stakes decisions. Calibration and conformal prediction are the real ones. ECE 0.012 is what we ship. We don’t quote accuracy without calibration, and we don’t quote intervals without coverage.
