Aviation Intelligence Archives - Zorost Intelligence | AI, Cloud & Data Experts

Why Calibration Matters More Than Accuracy: an ECE 0.012 Story

Zorost Intelligence — Tue, 10 Feb 2026 09:00:00 +0000

Pull-quote: “When the model says 70%, it should be right 70% of the time. That’s calibration. Anything less is dishonest.”

Why this matters

“Our model is 92% accurate” is a marketing line. It tells you almost nothing about whether you should trust the model with a decision. The real question is: when the model says it is 70% confident, is it actually right 70% of the time?

That is calibration. The metric is Expected Calibration Error (ECE).

The metric, briefly

Group predictions into bins by their stated probability. For each bin, compare the average predicted probability to the observed frequency of the event. The ECE is the average of the absolute differences, weighted by the share of predictions that fall in each bin. Lower is better. Below 0.02 is very good in production. Below 0.01 is excellent.
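
The binning procedure above can be sketched in a few lines of Python. This is a minimal illustration, not AeroFarr's evaluation code: the ten equal-width bins, the function name, and the toy data are all assumptions made for the example.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE for a binary classifier (assumed binning scheme)."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    # Each bin accumulates [count, sum of predicted probabilities, sum of labels].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += p
        bins[idx][2] += y
    total = len(probs)
    ece = 0.0
    for count, prob_sum, label_sum in bins:
        if count == 0:
            continue
        avg_confidence = prob_sum / count
        observed_rate = label_sum / count
        ece += (count / total) * abs(avg_confidence - observed_rate)
    return ece


# Toy data: "70%" predictions that come true 7 times in 10, and "20%"
# predictions that come true 2 times in 10, are well calibrated.
calibrated = expected_calibration_error(
    [0.7] * 10 + [0.2] * 10,
    [1] * 7 + [0] * 3 + [1] * 2 + [0] * 8,
)
# "90%" predictions that come true only 6 times in 10 are overconfident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
```

In the toy data the first model scores an ECE of essentially zero, while the second is off by about 0.3, even though both could report similar headline accuracy.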

AeroFarr’s gate classifier achieves ECE 0.012 on 581,316 held-out flights. That means the predicted probabilities track the actual observed frequencies very tightly across the full probability range — not just at the mean.

How we got there

Three ingredients:

A multi-head stacked architecture — separate heads for gate / severity / regression / quantile, each tuned on the loss most appropriate for its job, then combined under a non-linear meta-learner. The meta sees the heads’ outputs and learns how to combine them. Calibration is enforced at each head and at the meta.
Loss functions chosen for calibration, not accuracy. Cross-entropy with label smoothing for classifiers; quantile loss for the quantile heads.
Post-hoc calibration on a holdout slice. Platt scaling and isotonic regression are applied as a final stage on a slice of data the heads never saw.
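
To make the post-hoc stage concrete, here is a rough sketch of isotonic regression fitted with the pool-adjacent-violators algorithm on a holdout slice. It is a hand-rolled illustration using only the standard library, not the production pipeline; the function names and the six-point holdout set are assumptions.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_isotonic(
    scores: Sequence[float], labels: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators fit. Returns (block upper bounds, block values)."""
    # Each block is [label_sum, count, max_score_in_block].
    blocks: List[List[float]] = []
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1.0, score])
        # Merge backwards while the previous block's mean exceeds the new one.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, max_score = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = max_score
    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def apply_isotonic(
    score: float, thresholds: Sequence[float], values: Sequence[float]
) -> float:
    """Step-function lookup: first block whose upper bound covers the score."""
    idx = min(bisect_left(thresholds, score), len(values) - 1)
    return values[idx]


# Hypothetical holdout slice: raw head scores and observed outcomes.
holdout_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
holdout_labels = [0, 1, 0, 1, 1, 1]
thresholds, values = fit_isotonic(holdout_scores, holdout_labels)
calibrated_mid = apply_isotonic(0.25, thresholds, values)
```

The out-of-order pair at scores 0.2 and 0.3 is pooled into a single block with value 0.5, so the fitted mapping is monotone: a higher raw score never yields a lower calibrated probability.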

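Calibration has to be designed in from the start: in the architecture, in the loss functions, and only then in a post-hoc stage. Post-hoc calibration alone, bolted on at the end as a band-aid, does not work for high-stakes operational use.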

Why it matters operationally

If a planner is making a “should we keep this aircraft on the gate?” decision and the model says 30% chance of cancellation, the planner’s mental model is: roughly one in three. If the model is poorly calibrated and 30% is actually 60%, the planner’s prior is wrong, and every decision downstream is wrong.

Calibrated probabilities preserve the planner’s intuition. Uncalibrated probabilities corrupt it.

Conformal prediction on top

Calibration tells you about average behavior across many predictions. Conformal prediction attaches an uncertainty interval to each individual prediction. We use Locally Adaptive Conformal Prediction (LACP) to produce distribution-free prediction intervals. When AeroFarr says “delay between 18 and 47 minutes with 90% coverage,” intervals built this way contain the actual delay at least 90% of the time across flights, regardless of the shape of the underlying distribution, provided new flights are exchangeable with the calibration data.

This is the second ingredient of honesty in a production model. Calibration says the model’s stated probabilities mean what they say. Conformal prediction says the model’s stated intervals mean what they say.
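
A minimal sketch of how a locally adaptive interval can be built, assuming the common split-conformal recipe with normalized residuals: score each calibration flight by its absolute error divided by a per-flight spread estimate, take a finite-sample-corrected quantile of those scores, and scale it by the spread of the new flight. The function names, the spread estimates, and the numbers are illustrative assumptions, not AeroFarr internals.

```python
import math
from typing import List, Sequence, Tuple


def normalized_scores(
    y_true: Sequence[float], y_pred: Sequence[float], spread: Sequence[float]
) -> List[float]:
    """Nonconformity score: absolute error scaled by a local spread estimate."""
    return [abs(y - p) / s for y, p, s in zip(y_true, y_pred, spread)]


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round((n + 1) * (1 - alpha), 9))  # 1-indexed rank
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[rank - 1]


def interval(pred: float, spread: float, q: float) -> Tuple[float, float]:
    """Interval whose width adapts to the local spread estimate."""
    return pred - q * spread, pred + q * spread


# Hypothetical calibration slice of 19 flights (delays in minutes).
cal_pred = [30.0] * 19
cal_true = [30.0 + (i + 1) for i in range(19)]
cal_spread = [2.0] * 19

q = conformal_quantile(normalized_scores(cal_true, cal_pred, cal_spread), alpha=0.1)
hard_flight = interval(pred=32.5, spread=1.5, q=q)  # high local uncertainty
easy_flight = interval(pred=32.5, spread=0.5, q=q)  # low local uncertainty
```

The same quantile produces a wide interval where the spread estimate is large and a narrow one where it is small. The coverage guarantee is marginal: it holds on average over flights, not for every individual flight.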

Closing

Headline accuracy is a misleading metric for high-stakes decisions. Calibration and conformal prediction are the real ones. ECE 0.012 is what we ship. We don’t quote accuracy without calibration, and we don’t quote intervals without coverage.

The post Why Calibration Matters More Than Accuracy: an ECE 0.012 Story appeared first on Zorost Intelligence | AI, Cloud & Data Experts.

A Retrieval Engine over the World’s Aviation Safety Corpus

Zorost Intelligence — Tue, 13 Jan 2026 09:00:00 +0000

Pull-quote: “Vector search alone is not retrieval. It is one signal among several.”

Why this matters

Aviation safety knowledge sits in two enormous public-domain corpora: the U.S. NTSB accident reports and the NASA ASRS voluntary safety reports. Together, that’s 247,000+ documents of structured incident narratives. Pilots, controllers, and operations engineers have written them under the assumption that they would be searched, cross-referenced, and learned from.

Most platforms reduce this to keyword search. Better platforms add full-text search. The frontier is citation-grounded retrieval-augmented generation — the assistant retrieves, the model writes, every claim links back to the source documents.

Why hybrid retrieval

The naive approach to a RAG system is “embed everything and run a vector search.” It does not work in production. Vector search is excellent at finding semantically similar documents and bad at finding specifically named entities. BM25 is the opposite. Production retrieval needs both.

Our retrieval pipeline:

Question
   │
   ├──► dense (pgvector + BGE-large) ──► top 50
   ├──► sparse (BM25)                  ──► top 50
   │
   └──► merge + cross-encoder re-rank   ──► top 8
                            │
                            ▼
                Citation-grounded generation
                (Gemini 2.5 Flash for fast answers,
                 Claude / GPT for detailed analysis)
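
The merge and re-rank steps in the diagram can be sketched as follows. The post does not say how the dense and sparse lists are merged, so reciprocal rank fusion is used here as an assumed stand-in, and a word-overlap scorer stands in for the cross-encoder. Document IDs and texts are made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def merge_rankings(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    fused: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


def rerank(
    query: str,
    doc_ids: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 8,
) -> List[str]:
    """Re-score each candidate jointly with the query and keep the best top_n."""
    # sorted() is stable, so equal scores keep their merged order.
    return sorted(doc_ids, key=lambda doc_id: -score_fn(query, doc_id))[:top_n]


# Hypothetical retrieval results (the real pipeline takes the top 50 from each).
dense_hits = ["asrs-101", "ntsb-7", "asrs-55"]
sparse_hits = ["ntsb-7", "asrs-900", "asrs-101"]
texts = {
    "ntsb-7": "runway incursion in low visibility",
    "asrs-101": "hydraulic failure during flap retraction",
    "asrs-900": "runway incursion regional jet low visibility",
    "asrs-55": "crew fatigue report",
}


def overlap_score(query: str, doc_id: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared words."""
    return float(len(set(query.split()) & set(texts[doc_id].split())))


merged = merge_rankings([dense_hits, sparse_hits])
top_docs = rerank("runway incursion low visibility", merged, overlap_score, top_n=2)
```

Documents found by both retrievers rise to the top of the merged list, and the re-ranker then has the final say on which candidates reach generation.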

Why a re-ranker

The re-ranker (a cross-encoder, not a bi-encoder) sees the query and the candidate document together and scores them jointly. This is more expensive per call than vector search, but the precision improvement on the top-8 is large enough that it pays for itself — fewer retrievals, fewer hallucinations, better answers.

Why citation grounding

The default mode of an LLM is to fabricate plausible-sounding answers. The fix is structural: the model is constrained to write its answer with bracketed citation tokens, and the citation tokens must reference documents that actually exist in the retrieval set. Generation is post-processed to validate the citations and reject any answer that fails validation.
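
A minimal sketch of the validation step, assuming citations appear as bracketed document IDs such as [ntsb-7]. The token format, the function name, and the IDs are assumptions for illustration; the post does not specify them.

```python
import re
from typing import Iterable, List

# Assumed token format: a document ID in square brackets, e.g. [asrs-101].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(answer: str, retrieved_ids: Iterable[str]) -> List[str]:
    """Return a list of problems. An empty list means the answer passes."""
    allowed = set(retrieved_ids)
    cited = CITATION_PATTERN.findall(answer)
    problems: List[str] = []
    if not cited:
        problems.append("answer contains no citations")
    for doc_id in sorted(set(cited) - allowed):
        problems.append(f"citation [{doc_id}] is not in the retrieval set")
    return problems


retrieval_set = ["ntsb-7", "asrs-900"]
good = validate_citations(
    "Low visibility was a recurring factor [ntsb-7][asrs-900].", retrieval_set
)
bad = validate_citations(
    "Crew fatigue was the main cause [asrs-55].", retrieval_set
)
```

An answer with a non-empty problem list is rejected rather than shown, which is what makes the constraint structural instead of advisory.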

This is a small structural change with a large operational impact. It moves the system from “talking to a model that has ingested aviation knowledge” to “asking a model to summarize specific source documents.”

What this is good at

“What are the leading causes of runway incursions for regional jets in low-visibility conditions?”
“Show me ASRS reports that match the pattern of sudden hydraulic failure during flap retraction.”
“What are the recurring training gaps that show up in cargo operations CRM reports?”

What it is not good at: real-time operational queries that need current schedule data — those go to the predictive and causal layers.

Closing

Vector search alone is not retrieval. It is one signal. Production-grade RAG over a regulated safety corpus requires hybrid retrieval, a real re-ranker, and structural citation grounding. The result is an assistant analysts trust enough to use — which is the only metric that matters.

Causal AI for Aviation Operations: from Correlation to Cause

Zorost Intelligence — Tue, 09 Dec 2025 09:00:00 +0000

Pull-quote: “Saying ‘weather correlates with delays’ is not an operational claim. Saying ‘an upstream weather event caused 32 ± 6 minutes of average delay through a specific ATC mechanism, with an E-value of 1.9’ is.”

Why this matters

Aviation operations centers run on correlations. Weather correlates with delay. Connecting traffic correlates with delay. Crew availability correlates with delay. Every dashboard in the industry shows you which inputs associate with disruption.

But operational decisions are causal decisions. If we cancel three flights at this hub now, what will the cascade look like in three hours? That is not a correlation question. It is a counterfactual question. To answer it credibly, you need a structural model — not a regression dashboard.

What we built

AeroFarr’s causal layer is built on DoWhy (Microsoft Research) and EconML. It produces three classes of output for any operational question:

Average Treatment Effect (ATE) and Conditional Average Treatment Effect (CATE): the average causal effect of an intervention, optionally conditional on subgroup features
Counterfactual estimates via do-calculus: what would happen if we intervened to set a specific variable, with upstream factors held fixed and downstream effects allowed to follow
Sensitivity analysis: E-values, Austen plots, and Rosenbaum bounds quantifying how much unmeasured confounding would be needed to overturn the conclusion
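
To illustrate why an adjusted ATE differs from a raw comparison, here is a hand-rolled backdoor adjustment by stratification on a toy dataset. It is a standard-library sketch of the idea, not the DoWhy or EconML API, and the variables (weather as the confounder, a ground stop as the treatment, delay minutes as the outcome) and all numbers are invented for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

# Each row is (confounder, treated, outcome): (bad_weather, ground_stop, delay_min).
Row = Tuple[int, int, float]


def naive_difference(rows: Sequence[Row]) -> float:
    """Raw difference in mean outcome between treated and untreated rows."""
    treated = [outcome for _, t, outcome in rows if t == 1]
    untreated = [outcome for _, t, outcome in rows if t == 0]
    return mean(treated) - mean(untreated)


def backdoor_ate(rows: Sequence[Row]) -> float:
    """ATE by stratifying on the confounder and weighting by stratum size."""
    strata: Dict[int, Dict[int, List[float]]] = defaultdict(lambda: {0: [], 1: []})
    for confounder, treated, outcome in rows:
        strata[confounder][treated].append(outcome)
    ate = 0.0
    for groups in strata.values():
        if not groups[0] or not groups[1]:
            raise ValueError("a stratum lacks treated or untreated rows")
        weight = (len(groups[0]) + len(groups[1])) / len(rows)
        ate += weight * (mean(groups[1]) - mean(groups[0]))
    return ate


# Bad weather makes ground stops more likely AND raises delays on its own.
rows: List[Row] = (
    [(1, 1, 50.0)] * 3 + [(1, 0, 30.0)]      # bad weather
    + [(0, 1, 25.0)] + [(0, 0, 5.0)] * 3     # good weather
)
naive = naive_difference(rows)
adjusted = backdoor_ate(rows)
```

In this toy data the raw comparison suggests 32.5 minutes of delay per ground stop, while the adjusted estimate is 20 minutes. The gap is the part of the association that belongs to the weather, not to the intervention.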

The headline architectural decision is to keep the causal model separate from the prediction model. The prediction core (a multi-head stacked ensemble) tells you what is likely to happen. The causal layer tells you why. Different problems, different methodologies, deliberately decoupled.

Why sensitivity analysis is the heart of it

A causal claim without sensitivity analysis is a marketing claim. The classic critique is: “What if there’s an unmeasured confounder?” Sensitivity analysis answers that critique numerically. An E-value of 1.9 says: an unmeasured confounder would need to be associated with both the treatment and the outcome by a risk ratio of at least 1.9 each, beyond the measured covariates, to fully explain away the estimated effect. Operational stakeholders can decide whether that is plausible in their environment.
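
For reference, the E-value of a point estimate on the risk-ratio scale is RR + sqrt(RR × (RR − 1)), taking the reciprocal first when RR is below 1. A small sketch follows; the risk ratio of 1.29 is back-computed here to match the 1.9 in the example and is not a figure from AeroFarr.

```python
import math


def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate expressed as a risk ratio."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (RR < 1) are inverted so the formula applies.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# Illustrative only: a risk ratio of about 1.29 gives an E-value of about 1.9.
example = e_value(1.2893)
```

Seen this way, an E-value of 1.9 attached to a modest effect is a statement about how strong a hidden confounder would have to be, not about how large the effect is.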

This is the same standard you would expect from a peer-reviewed epidemiological paper. We hold our operational claims to it.

The operational pattern

A typical operational session uses the causal layer in three steps:

Identify the question. “Why did the disruption at hub X spread north today?”
Identify the candidate causal mechanism. “Was it weather acting through ATC ground-stops, or was it crew positioning?”
Run the analysis. AeroFarr returns the estimated effect, its confidence interval, and the sensitivity analysis, along with the safety reports from the RAG layer that match the pattern.

Operations leaders get an answer with a confidence band, a stated mechanism, and a sensitivity result. That is the standard operational decision-support should meet.

What this is not

Causal AI is not a substitute for prediction. AeroFarr’s ensemble — gate / severity / regression trio / quantile / non-linear meta — does the prediction work. Causal AI is a complement: it explains and quantifies the why that the prediction model cannot articulate.

It is also not a free lunch. Identification (what’s actually identifiable from the data) and assumptions (a correct DAG and no unmeasured confounders, i.e. ignorability) are all live questions. We address them with explicit DAGs, sensitivity analysis, and documented limitations.

Closing

Operations decisions are causal decisions. Treating them with correlation tools and headline accuracy numbers is a category error. The decade in front of us is the decade of operational causal AI — and aviation is one of the domains best suited to it, because the data exists in volume and the questions are unambiguous.

Modeling Delay Cascades with Spatial-Temporal GNNs

Zorost Intelligence — Tue, 18 Nov 2025 09:00:00 +0000

Pull-quote: “A cascade is not a sequence of events; it is a graph. Treating it as anything else loses the structure.”

Why this matters

A weather event at one hub doesn’t just cause local delays. Within hours, it ripples through dozens of downstream airports — and the ripple does not follow great-circle distance. It follows the graph of operations: which airline operates which crews where, which gates feed which routes, which cargo terminals connect to which hubs. The graph is non-obvious and non-stationary.

Treating cascade prediction as a tabular regression problem misses the structure. Treating it as a sequence model misses the spatial pattern. We use a spatial-temporal graph neural network.

Architecture

The model is a Spatial-Temporal Graph Dual-Attention Network (SGDAN) built on PyTorch Geometric. Three things are happening at once:

Spatial attention over edges in the airport graph — which connections carry disruption load right now
Temporal attention over the recent history — which past time slices are most predictive of the next
Dual heads — one for short-horizon (0–60 min) cascade probability, one for medium-horizon (1–6 h)

The graph is built from operational adjacency, not great-circle distance. Edges carry weights — operational throughput, recent congestion signals, and route-criticality measures.
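
As a rough picture of what spatial attention over edges means, the sketch below softmax-normalizes raw edge scores across the incoming edges of each airport. In the real model those scores would come from learned layers over the edge features; here they are hard-coded, and the node names and function name are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source airport, destination airport)


def edge_attention(edge_scores: Dict[Edge, float]) -> Dict[Edge, float]:
    """Softmax raw edge scores over the incoming edges of each destination."""
    incoming: Dict[str, List[Tuple[Edge, float]]] = {}
    for edge, score in edge_scores.items():
        incoming.setdefault(edge[1], []).append((edge, score))
    attention: Dict[Edge, float] = {}
    for items in incoming.values():
        max_score = max(score for _, score in items)  # for numerical stability
        exps = [(edge, math.exp(score - max_score)) for edge, score in items]
        denom = sum(value for _, value in exps)
        for edge, value in exps:
            attention[edge] = value / denom
    return attention


# Hypothetical raw scores, e.g. produced from throughput and congestion features.
raw_scores = {
    ("HUB_A", "HUB_C"): 2.0,
    ("HUB_B", "HUB_C"): 0.0,
    ("HUB_A", "HUB_B"): 1.0,
}
weights = edge_attention(raw_scores)
```

The weights into each airport sum to one, so they read as the share of attention each inbound connection receives at that moment.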

Training data

29.6 million public-domain flight records spanning 2022–2025. Records are aligned to a temporal graph snapshot — for every hour, the network state is captured as a graph with weighted edges and node-level features.
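
One way to picture the alignment step: bucket flight records by hour and aggregate them into per-edge statistics, giving one weighted graph per hour. The record layout and the two aggregates below are assumptions made for this sketch; the real feature set is richer.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]
# Assumed record layout: (origin, destination, scheduled departure, delay minutes).
Flight = Tuple[str, str, datetime, float]


def build_hourly_snapshots(
    flights: Iterable[Flight],
) -> Dict[datetime, Dict[Edge, Dict[str, float]]]:
    """Group flights into hourly graphs keyed by (origin, destination) edges."""
    snapshots: Dict[datetime, Dict[Edge, Dict[str, float]]] = defaultdict(
        lambda: defaultdict(lambda: {"flights": 0, "total_delay": 0.0})
    )
    for origin, destination, departure, delay in flights:
        hour = departure.replace(minute=0, second=0, microsecond=0)
        edge_stats = snapshots[hour][(origin, destination)]
        edge_stats["flights"] += 1
        edge_stats["total_delay"] += delay
    return snapshots


# Hypothetical records for two consecutive hours.
flights = [
    ("HUB_A", "HUB_B", datetime(2024, 1, 5, 14, 10), 12.0),
    ("HUB_A", "HUB_B", datetime(2024, 1, 5, 14, 50), 30.0),
    ("HUB_B", "HUB_C", datetime(2024, 1, 5, 15, 5), 0.0),
]
snapshots = build_hourly_snapshots(flights)
```

Each hourly snapshot can then be handed to the graph model as one time slice, with the aggregates serving as edge weights.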

What the attention weights tell us

The most useful by-product of this architecture is the interpretability of the attention weights. After a cascade, you can ask the model: which paths through the network were responsible for the propagation? It returns the top-K edges with attention weights — letting an analyst trace the actual mechanism, not just observe the symptom.
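
The top-K query itself is simple once attention weights are available per edge. A small sketch, with hypothetical airport names and weights:

```python
import heapq
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def top_k_edges(attention: Dict[Edge, float], k: int = 3) -> List[Tuple[Edge, float]]:
    """Return the k edges with the largest attention weights, largest first."""
    return heapq.nlargest(k, attention.items(), key=lambda item: item[1])


# Hypothetical attention weights captured during a disruption.
attention = {
    ("HUB_A", "HUB_B"): 0.42,
    ("HUB_A", "HUB_C"): 0.08,
    ("HUB_B", "HUB_D"): 0.31,
    ("HUB_C", "HUB_D"): 0.19,
}
leading_paths = top_k_edges(attention, k=2)
```

Here the two leading edges chain from HUB_A through HUB_B to HUB_D, which is the kind of propagation path an analyst would then check against the operational record.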

This matters for operational reviews. After a major disruption day, you can reconstruct the propagation path. After a minor one, you can spot patterns that are accumulating into a major disruption.

Calibration

GNN outputs are calibrated alongside the rest of the AeroFarr stack. The cascade probabilities pass through the same calibration pipeline (Platt scaling on a holdout slice) so that the cascade head is on the same probability scale as the gate / severity heads.
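
Platt scaling fits a two-parameter sigmoid that maps raw scores to probabilities on the holdout slice. The sketch below fits it with plain gradient descent on log loss; the learning rate, epoch count, function names, and toy scores are assumptions for illustration, and a production fit would typically use a library optimizer.

```python
import math
from typing import Sequence, Tuple


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(
    scores: Sequence[float],
    labels: Sequence[int],
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = _sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability."""
    return _sigmoid(a * score + b)


# Hypothetical holdout slice: raw scores of +/-4 look very confident,
# but each side is right only 3 times out of 4.
holdout_scores = [-4.0] * 4 + [4.0] * 4
holdout_labels = [0, 0, 0, 1, 1, 1, 1, 0]
a, b = fit_platt(holdout_scores, holdout_labels)
calibrated_high = apply_platt(4.0, a, b)
calibrated_low = apply_platt(-4.0, a, b)
```

On this toy slice a raw score of 4 would naively map to about 98%, and the fitted sigmoid pulls it back to roughly 75%, the rate actually observed on the holdout data.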

Closing

A cascade is a graph problem. We treat it as a graph problem. The result is a model whose outputs are not just predictions but explanations — and whose explanations are usable for operational debriefs.
