Evaluation Archives - Zorost Intelligence | AI, Cloud & Data Experts

Hybrid Retrieval: Why Vector Alone Isn’t Enough

Zorost Intelligence — Tue, 17 Feb 2026 09:00:00 +0000

Pull-quote: “Pure vector retrieval is the most common production RAG mistake. Pure BM25 is the second most common.”

Why this matters

A pattern repeats in every RAG project that goes wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is an architectural change: hybrid retrieval.

The components

Query
  │
  ├──► Dense (vector)   — pgvector / Weaviate / Qdrant + an embedding model
  │
  ├──► Sparse (BM25)    — Postgres FTS / Elasticsearch / OpenSearch
  │
  ├──► Optional filters — date range, source, entity tags
  │
  └──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                                │
                                                                ▼
                                                Citation-grounded generation

Why each piece matters

Vector is excellent at semantic similarity: finding documents that are about the same topic in different words. It is weak at exact matching: named entities, exact terms, IDs, dates.
BM25 is the opposite: excellent at exact matching, weaker on semantic similarity.
Filters — when the question is bounded (“just look at 2024 reports about Boeing 737”), filters dramatically reduce the candidate set before ranking.
Merge — Reciprocal Rank Fusion (RRF) is a clean default. Weighted merges work with calibrated scores.
Cross-encoder re-rank — sees the query and the candidate document together and scores them jointly. More expensive than bi-encoder vector search, but the precision improvement on the top-K is large enough to pay for itself.
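
The merge step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not Zorost's implementation: the function name, the document IDs, and the constant k=60 (a commonly used default) are assumptions.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]

# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
merged = rrf_merge([dense_hits, sparse_hits])
```

RRF uses only rank positions, which is why it needs no score calibration between the two retrievers. Documents found by both lists (doc_a, doc_c) rise above documents found by only one.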

What changes when you do this right

Hallucination rate drops. The model has better evidence to ground in.
Citation precision goes up. The cited documents actually support the claim.
Edge cases (rare entity queries, exact-quote queries) work properly.
Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.

Common mistakes

No re-ranker. Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.
No filtering. Filtering before retrieval is essentially free if your data is properly indexed.
No evaluation. Without a golden Q&A dataset and grounding scoring, you have no way to compare retrieval architectures.

Closing

Pure vector retrieval is the most common production RAG mistake. Hybrid retrieval (vector + sparse + filters + re-rank) is the boring, reliable production answer. Every Zorost RAG system runs this architecture.


Why Calibration Matters More Than Accuracy: an ECE 0.012 Story

Zorost Intelligence — Tue, 10 Feb 2026 09:00:00 +0000

Pull-quote: “When the model says 70%, it should be right 70% of the time. That’s calibration. Anything less is dishonest.”

Why this matters

“Our model is 92% accurate” is a marketing line. It tells you almost nothing about whether you should trust the model with a decision. The real question is: when the model says it is 70% confident, is it actually right 70% of the time?

That is calibration. The metric is Expected Calibration Error (ECE).

The metric, briefly

Group predictions into bins by their stated probability. For each bin, compare the average predicted probability to the actual observed frequency. The average of the absolute differences, weighted by the number of predictions in each bin, is the ECE. Lower is better. Below 0.02 is very good in production. Below 0.01 is excellent.
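
The binning procedure can be written out directly. The sketch below assumes a binary classifier and ten equal-width bins; the post does not state which binning scheme AeroFarr uses, and the function name and toy data are illustrative.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for a binary classifier.

    probs:  predicted probability of the positive class, each in [0, 1]
    labels: observed outcomes, each 0 or 1
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's gap by its share of all predictions.
        ece += (len(bucket) / total) * abs(mean_conf - observed)
    return ece

# Toy data: ten predictions at 0.7, seven of which came true (well calibrated).
calibrated_ece = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# Toy data: ten predictions at 0.9, only five of which came true (overconfident).
overconfident_ece = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

The first toy case yields an ECE of essentially zero; the second yields roughly 0.4, the gap between the stated 0.9 and the observed 0.5.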

AeroFarr’s gate classifier achieves ECE 0.012 on 581,316 held-out flights. That means the predicted probabilities track the actual observed frequencies very tightly across the full probability range — not just at the mean.

How we got there

Three ingredients:

A multi-head stacked architecture — separate heads for gate / severity / regression / quantile, each tuned on the loss most appropriate for its job, then combined under a non-linear meta-learner. The meta sees the heads’ outputs and learns how to combine them. Calibration is enforced at each head and at the meta.
Loss functions chosen for calibration, not accuracy. Cross-entropy with label smoothing for classifiers; quantile loss for the quantile heads.
Post-hoc calibration on a holdout slice. Platt scaling and isotonic regression are applied as a final stage on a slice of data the heads never saw.
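
The post-hoc stage can be illustrated with isotonic regression alone. The sketch below is a plain pool-adjacent-violators fit in standard-library Python; it is an assumption about the general technique, not AeroFarr's code, and the function names and holdout values are hypothetical.

```python
import bisect

def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function with pool-adjacent-violators (PAVA).

    Returns a list of (upper_score, calibrated_probability) steps.
    """
    blocks = []  # each block: [label_sum, count, highest_score_in_block]
    for score, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == score:
            # Same raw score as the previous point: pool them directly.
            blocks[-1][0] += label
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, score])
        # Merge backwards while a block's mean exceeds its successor's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, count, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, label_sum / count) for label_sum, count, upper in blocks]

def calibrate(steps, score):
    """Map a raw score to the calibrated value of the step that covers it."""
    uppers = [upper for upper, _ in steps]
    idx = min(bisect.bisect_left(uppers, score), len(steps) - 1)
    return steps[idx][1]

# Hypothetical holdout slice: raw head scores and observed outcomes.
holdout_scores = [0.1, 0.2, 0.3, 0.4]
holdout_labels = [0, 1, 0, 1]
steps = fit_isotonic(holdout_scores, holdout_labels)
```

The out-of-order pair at 0.2 and 0.3 is pooled into one step with value 0.5, so the fitted mapping never decreases as the raw score rises. The fit must use data the heads never trained on, as the text says, or the correction will be optimistic.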

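Calibration has to be designed in from the start. The post-hoc stage is a final correction on heads that were already trained for calibration. Bolting it on at the end as a band-aid, with nothing upstream, does not work for high-stakes operational use.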

Why it matters operationally

If a planner is making a “should we keep this aircraft on the gate?” decision and the model says 30% chance of cancellation, the planner’s mental model is: roughly one in three. If the model is poorly calibrated and 30% is actually 60%, the planner’s prior is wrong, and every decision downstream is wrong.

Calibrated probabilities preserve the planner’s intuition. Uncalibrated probabilities corrupt it.

Conformal prediction on top

Calibration tells you about average behavior. Conformal prediction attaches an interval to each individual prediction. We use Locally Adaptive Conformal Prediction (LACP) to produce distribution-free prediction intervals: when AeroFarr says “delay between 18 and 47 minutes with 90% coverage,” the actual delay falls inside such intervals at least 90% of the time on average, regardless of the shape of the underlying distribution, provided new flights are exchangeable with the calibration data.

This is the second ingredient of honesty in a production model. Calibration says the model’s stated probabilities mean what they say. Conformal prediction says the model’s stated intervals mean what they say.
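
The coverage mechanism is easiest to see in plain split conformal prediction, which is what the sketch below implements. It is deliberately not LACP: every interval here has the same width, whereas a locally adaptive variant would scale the width by a per-example difficulty estimate. The residual values and function names are made up for illustration.

```python
import math

def conformal_quantile(residuals, alpha=0.1):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        # Too few calibration points to support this coverage level.
        return math.inf
    return sorted(residuals)[rank - 1]

def prediction_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return (point_prediction - q, point_prediction + q)

# Hypothetical calibration set: |actual - predicted| delay, in minutes,
# on 19 held-out flights the model never trained on.
calibration_residuals = [2, 5, 3, 8, 12, 4, 6, 9, 7, 10,
                         1, 15, 5, 3, 6, 4, 11, 2, 7]
q = conformal_quantile(calibration_residuals, alpha=0.1)
low, high = prediction_interval(30.0, q)
```

With 19 calibration residuals and 90% target coverage, the 18th smallest residual (12 minutes) becomes the half-width, so a 30-minute point prediction yields the interval 18 to 42 minutes.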

Closing

Headline accuracy is a misleading metric for high-stakes decisions. Calibration error and interval coverage are the ones that matter. ECE 0.012 is what we ship. We don’t quote accuracy without calibration, and we don’t quote intervals without coverage.


Production-Grade RAG on the Lakehouse with Mosaic AI Vector Search

Zorost Intelligence — Tue, 03 Feb 2026 09:00:00 +0000

Pull-quote: “RAG works in demos. RAG that works in production requires hybrid retrieval, a re-ranker, citation grounding, and an evaluation harness.”

Why this matters

Most RAG projects pilot well and disappoint in production. The pattern is the same: embed the corpus, run vector search, ship. Production-grade RAG requires more.

The production RAG architecture

                     ┌────────────────────┐
        Question ───►│  AI Gateway        │  ← key mgmt, routing, observability
                     └──────────┬─────────┘
                                ▼
        ┌────────────────────────────────────────────┐
        │                 Retrieval                  │
        │  ┌────────────────┐  ┌────────────────┐    │
        │  │ Mosaic AI      │  │ BM25 (lexical) │    │
        │  │ Vector Search  │  │ on Delta SQL   │    │
        │  │ (Delta-synced) │  │                │    │
        │  └───────┬────────┘  └────────┬───────┘    │
        │          └──── merge (RRF) ───┘            │
        │                │                           │
        │          cross-encoder                     │
        │             re-rank                        │
        └────────────────┬───────────────────────────┘
                         ▼
              top-K (typically 6–10)
                         │
                         ▼
              Citation-grounded generation
              (Mosaic AI Model Serving)
                         │
                         ▼
              Validated answer with source links

Why Mosaic AI Vector Search specifically

Mosaic AI Vector Search synchronizes with Delta tables. Update the source table, the index updates. No orchestration glue. Tagging, ACLs, and lineage flow through Unity Catalog. For RAG over enterprise data that changes, this matters more than people initially appreciate.

Hybrid retrieval is the pattern

Pure vector search is the most common production RAG mistake. Pure BM25 is the second most common. Hybrid — vector + BM25 + filters + re-rank — is the answer that actually works.

Citation grounding as a structural fix

Constrain the model to write with bracketed citation tokens. Validate every citation against the retrieval set. Reject answers that fail validation. This is a small structural change with a large operational impact.
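
A minimal sketch of the validation step, assuming citations are written as bracketed integers such as [3] that index passages in the retrieval set. The citation format, the function name, and the example answers are assumptions, not the format used in production.

```python
import re

# Assumed citation format: "[3]" refers to passage number 3 in the retrieval set.
CITATION = re.compile(r"\[(\d+)\]")

def validate_citations(answer, retrieved_ids):
    """Return (is_valid, unknown_ids) for a generated answer."""
    cited = {int(num) for num in CITATION.findall(answer)}
    unknown = sorted(cited - set(retrieved_ids))
    # An answer with no citations at all also fails validation.
    is_valid = bool(cited) and not unknown
    return is_valid, unknown

retrieved = [1, 2, 3]
good = validate_citations("Deliveries rose in Q3 [1][3].", retrieved)
bad = validate_citations("Deliveries rose in Q3 [1] and fell in Q4 [7].", retrieved)
```

This check only confirms that each cited ID exists in the retrieval set. Whether the cited passage actually supports the claim is a separate question, which the grounding metrics in the evaluation harness address.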

Evaluation harness — non-negotiable

A production RAG system without an evaluation harness is a guess. The harness has three components:

Golden Q&A dataset — questions paired with the documents that should ground the answers
Grounding rate — what fraction of generated claims are supported by retrieved documents
Hallucination detection — flagging unsupported claims

The harness runs as a Databricks Job on every model or retrieval change. Regressions are caught before deployment.
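
As a rough sketch of what such a harness computes, the code below scores retrieval recall against a golden set and a grounding rate over pre-labelled claims. The metric choices, the threshold, the data layout, and the stub retriever are all assumptions; deciding whether a claim is supported would need a judge model or human labels, which is out of scope here.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold documents that appear in the top-k retrieved IDs."""
    if not gold_ids:
        raise ValueError("gold_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(set(gold_ids))

def grounding_rate(claim_support):
    """Fraction of generated claims marked as supported (True)."""
    return sum(claim_support) / len(claim_support) if claim_support else 0.0

def run_harness(golden_set, retrieve, k=8, min_recall=0.8):
    """Average recall@k over the golden set; return (mean_recall, passed)."""
    recalls = [
        recall_at_k(retrieve(item["question"]), item["gold_ids"], k)
        for item in golden_set
    ]
    mean_recall = sum(recalls) / len(recalls)
    return mean_recall, mean_recall >= min_recall

# Hypothetical golden set and a stub retriever standing in for the pipeline.
golden = [
    {"question": "q1", "gold_ids": ["d1", "d2"]},
    {"question": "q2", "gold_ids": ["d3"]},
]
stub_index = {"q1": ["d1", "d9", "d2"], "q2": ["d4", "d5"]}
mean_recall, passed = run_harness(golden, lambda q: stub_index[q])
```

In this toy run the retriever finds both gold documents for q1 and none for q2, so mean recall is 0.5 and the run fails the 0.8 threshold. A scheduled job would block deployment on that result.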

Closing

Production RAG on the Lakehouse with Mosaic AI is straightforward when you adopt the architecture: hybrid retrieval, re-ranker, citation grounding, evaluation harness. The result is a RAG system analysts trust enough to use.


Multi-Agent OSINT with a Critic and a Referee

Zorost Intelligence — Tue, 20 Jan 2026 09:00:00 +0000

Pull-quote: “Speed of agents matters less than honesty of agents. Critic and referee are how you build honesty into the swarm.”

Why this matters

The first wave of multi-agent OSINT systems was a swarm: ten agents reading the same inputs and producing summaries, which were then averaged. The result was confident-sounding mediocrity. The agents reinforced each other’s biases. The aggregator could not tell whether the consensus was real or echo.

The second wave adds structure to the swarm. Specifically, two roles that are missing in the naive design:

Critic — adversarial review. The critic’s job is to find the weakest link in the analysts’ reasoning and challenge it.
Referee — adjudicates when analysts disagree. The referee’s job is to apply explicit decision criteria and produce a final answer with explicit reasoning.

This is not a UI improvement. It is a structural change in what the system is.

Aquil’s swarm

Aquil runs a structured OSINT swarm with four roles:

Sourcers — discover and ingest open-source signals (news, public data, leaks, public records, satellite imagery sources where licensed)
Analysts — produce hypotheses, summarize evidence, and propose causal explanations
Critic — reviews analyst output for unsupported claims, missing evidence, plausible alternative explanations, and reasoning gaps
Referee — adjudicates when the analysts and the critic disagree, with explicit criteria

The critic is structurally different from the analysts: it does not propose new claims. Its only function is to challenge existing ones. The referee is structurally different again: it does not propose or challenge. It decides, with explicit reasoning that goes into the audit trail.

Causal-graph synthesis

On top of the swarm, Aquil produces a causal graph of the assessed situation — events as nodes, hypothesized causal relationships as edges, with confidence weights. The graph is the team’s shared mental model. It is updateable, queryable, and exportable.

A causal graph is not just a visualization. It is a structured commitment to what we think is going on. New evidence updates the graph; missing evidence flags weak edges; alternative hypotheses are visible as competing edges.
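
One way to picture the structure is as a list of weighted edges with attached evidence. The sketch below is hypothetical: the class names, the 0.5 weakness threshold, and the example events are illustrative and do not describe Aquil's internal data model.

```python
from dataclasses import dataclass, field

@dataclass
class CausalEdge:
    cause: str
    effect: str
    confidence: float                              # weight in [0, 1]
    evidence: list = field(default_factory=list)   # source identifiers

class CausalGraph:
    def __init__(self):
        self.edges = []

    def add_edge(self, cause, effect, confidence, evidence=()):
        self.edges.append(CausalEdge(cause, effect, confidence, list(evidence)))

    def weak_edges(self, min_confidence=0.5):
        """Edges with low confidence or no supporting evidence."""
        return [e for e in self.edges
                if e.confidence < min_confidence or not e.evidence]

    def competing_causes(self, effect):
        """All hypothesized causes of one event, strongest first."""
        rivals = [e for e in self.edges if e.effect == effect]
        return sorted(rivals, key=lambda e: e.confidence, reverse=True)

graph = CausalGraph()
graph.add_edge("port strike", "shipping delays", 0.8, ["news-114"])
graph.add_edge("storm", "shipping delays", 0.3, ["sat-07"])
graph.add_edge("shipping delays", "price spike", 0.6)
weak = graph.weak_edges()
rivals = graph.competing_causes("shipping delays")
```

Here the storm hypothesis is flagged for low confidence and the price-spike edge for having no evidence, while the two explanations of the shipping delays remain visible side by side.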

Why this works

The naive swarm fails because mediocre answers can hide behind a chorus. The structured swarm makes the chorus disagree on purpose, and then makes a referee adjudicate. The agents’ weaknesses are surfaced rather than averaged. The team gets a more honest answer.

Closing

Speed of agents matters less than honesty of agents. The critic and the referee are how you build honesty into the swarm. Aquil is structured around that thesis.


The Agent Factory: Planner, Executor, Critic, Referee

Zorost Intelligence — Tue, 23 Dec 2025 09:00:00 +0000

Pull-quote: “The four-role pattern is not an opinion. It’s the architecture every production multi-agent system converges on once it survives the first round of real users.”

Why this matters

Multi-agent AI starts as a clever idea (let agents talk to each other!) and dies in production as an unreliable mess (agents hallucinate to each other, disagreements never resolve, the audit trail is unreadable). The fix is structural: four roles, typed contracts, deterministic logs.

The four roles

Planner — decomposes the high-level goal into sub-goals and decides the sequence. Reads the task, the available tools, and the agent’s memory; emits a structured plan.
Executor(s) — carries out sub-goals. Calls tools. Returns structured outputs. Knows nothing about the high-level plan; just executes its assigned sub-goal honestly.
Critic — reviews each executor output adversarially. Looks for unsupported claims, broken citations, missed evidence, alternative interpretations. Does not propose new actions; only critiques.
Referee — adjudicates when the critic disagrees with the executor. Has explicit criteria. Produces the final decision with explicit reasoning.

Why this works

Planner / executor separation prevents the planner from drifting into execution and getting confused by tool errors.
Critic separation prevents the executors from grading their own work, which is a category error.
Referee separation prevents endless executor-vs-critic loops.
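
The separation of roles can be sketched as a control loop over four callables. Everything below is a simplified assumption: the stub functions stand in for model and tool calls, and the verdict strings are invented for the example.

```python
def run_task(goal, plan, execute, critique, referee):
    """Run one goal through planner, executor, critic, and referee callables."""
    decisions = []
    for sub_goal in plan(goal):
        output = execute(sub_goal)
        objection = critique(sub_goal, output)  # None means no objection
        if objection is None:
            verdict = "accepted"
        else:
            # The referee decides once, so critic and executor cannot loop.
            verdict = referee(sub_goal, output, objection)
        decisions.append({"sub_goal": sub_goal, "output": output,
                          "objection": objection, "verdict": verdict})
    return decisions

# Stub roles, all hypothetical; real ones would wrap model and tool calls.
def plan(goal):
    return [f"{goal}: step {i}" for i in (1, 2)]

def execute(sub_goal):
    return f"result for {sub_goal}"

def critique(sub_goal, output):
    return "no source cited" if sub_goal.endswith("step 2") else None

def referee(sub_goal, output, objection):
    return "rejected: " + objection

log = run_task("summarize incident", plan, execute, critique, referee)
```

Note that the executor only ever sees its own sub-goal, the critic returns an objection rather than a new action, and the referee is the only role that produces a verdict on a disputed output.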

Common variations

Single executor vs. multi-executor (parallelism). Parallel executors for independent sub-goals; serial for dependent ones.
Critic per executor or shared critic. Per-executor for specialized critique; shared for consistency across the run.
Hierarchical planning. A meta-planner produces a plan that includes “now plan this sub-task in detail” steps.

What we standardize

We standardize three things across every production agentic system:

Typed tool contracts — every tool has explicit input/output schemas. No improvisation.
Deterministic logs — every call (planner → executor, executor → tool, critic → executor) is logged with timestamps and parameters.
Evaluation harnesses — every system ships with a golden dataset, a regression suite, hallucination detection, and grounding scoring. New versions are evaluated before promotion.
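
The first two items can be illustrated together. In this sketch the weather tool, its schemas, and the log format are hypothetical. The timestamp is passed in by the caller so that a replayed run produces the same log record.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class WeatherRequest:          # hypothetical tool input schema
    airport: str
    hours_ahead: int

@dataclass(frozen=True)
class WeatherResponse:         # hypothetical tool output schema
    airport: str
    wind_knots: float

def weather_tool(request: WeatherRequest) -> WeatherResponse:
    # Stub body; a real tool would call an external service.
    return WeatherResponse(airport=request.airport, wind_knots=12.0)

def call_tool(tool, request, expected_type, log, timestamp):
    """Check the request type, call the tool, and append one log record."""
    if not isinstance(request, expected_type):
        raise TypeError(f"{tool.__name__} expects {expected_type.__name__}")
    response = tool(request)
    log.append(json.dumps({
        "ts": timestamp,
        "tool": tool.__name__,
        "request": asdict(request),
        "response": asdict(response),
    }, sort_keys=True))
    return response

call_log = []
reply = call_tool(weather_tool, WeatherRequest("SEA", 6), WeatherRequest,
                  call_log, timestamp="2026-01-01T00:00:00Z")
```

Passing a plain dict instead of a WeatherRequest raises a TypeError before the tool runs, which is the "no improvisation" rule in practice.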

Where we run this pattern

AeroFarr — multi-tool aviation analyst (planner / executor / critic over the prediction core, the cascade GNN, the causal engine, and the RAG corpus)
EvidAI — 4-model consensus screening with explicit critic and referee
FreightCortex — 16-tool AI freight analyst with planner / executor and a critic on report quality
Aquil — sourcers / analysts / critic / referee for OSINT
SPCio (with a manufacturing intelligence partner) — 8 specialized agents with a meta-coordinator

Closing

The four-role pattern is not an opinion. It is the architecture every production multi-agent system converges on once it survives the first round of real users. Skipping it is a tax you pay later.


Living Systematic Reviews: Evidence That Stays Current

Zorost Intelligence — Tue, 16 Dec 2025 09:00:00 +0000

Pull-quote: “A review that is six months out of date is not a review. It is a historical artifact.”

Why this matters

The fundamental flaw of the traditional systematic review is that it is a snapshot. A team works on it for six months, freezes the literature search at a date, and publishes a result that becomes outdated the moment the next paper appears. In rapidly evolving fields — oncology, infectious disease, AI/ML methodology, certain rare-disease indications — that lag is unacceptable.

The fix is a living systematic review — a review that is continuously refreshed as new evidence appears.

What “living” actually requires

Living reviews are not just “running the search again every quarter.” They require:

Protocol stability — the inclusion / exclusion criteria do not change between updates
Federated search at scheduled cadence across the full database set
Delta detection — what’s new since the last update
Consistent screening — the same multi-agent consensus applied to new papers
Risk-of-bias and GRADE re-assessment — if a new high-quality study changes the certainty of evidence, that needs to surface
Versioned reporting — each refresh produces a versioned report with a clear changelog
Subscriber notification — stakeholders are alerted when something material changes

This is not a research methodology improvement. It is an engineering problem: how to do high-rigor evidence synthesis on a recurring schedule, with reproducibility and auditability preserved.

Architecture

EvidAI’s living review architecture:

Protocol (versioned) ──► Federated search (11 databases, scheduled)
                                          │
                                          ▼
                              Delta detection
                                          │
                          New papers since last refresh
                                          │
                                          ▼
                       Multi-agent consensus screening
                                          │
                          Included papers (new)
                                          │
                                          ▼
                  Risk-of-bias (RoB 2 / ROBINS-I / NOS)
                                          │
                                          ▼
                  GRADE re-assessment per outcome
                                          │
                                          ▼
                  Living report (versioned, with changelog)
                                          │
                                          ▼
                  Subscriber notifications
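
The delta-detection step in the diagram can be sketched as a set difference over normalized identifiers. Keying on DOI is an assumption made for this example, as are the function names and sample records; a real pipeline would also need a fallback for records without a DOI.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver prefix so duplicates match."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def detect_delta(previously_seen, search_results):
    """Return records whose DOI was not seen in any earlier refresh."""
    seen = {normalize_doi(d) for d in previously_seen}
    new_records = []
    for record in search_results:
        key = normalize_doi(record["doi"])
        if key not in seen:
            seen.add(key)  # also de-duplicates the same paper across databases
            new_records.append(record)
    return new_records

seen_last_refresh = ["10.1000/abc"]
results = [
    {"doi": "https://doi.org/10.1000/ABC", "source": "db1"},
    {"doi": "10.1000/xyz", "source": "db1"},
    {"doi": "doi:10.1000/xyz", "source": "db2"},
]
delta = detect_delta(seen_last_refresh, results)
```

Only the first occurrence of the new paper survives: the already-screened paper is dropped despite its different DOI formatting, and the second database's copy of the new paper is treated as a duplicate.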

What changes for the team

The team’s role shifts from “run a six-month review every two years” to “monitor a continuously updated review and adjudicate the small fraction of decisions the AI escalated.” That is a fundamentally different work pattern, and it scales.

Closing

A review that is six months out of date is not a review. Living reviews are an engineering solution to a research methodology problem — and they are now operationally feasible.


Multi-Agent Consensus for Systematic Literature Review

Zorost Intelligence — Tue, 04 Nov 2025 09:00:00 +0000

Pull-quote: “If four independent reasoners agree, the inclusion decision is high-confidence. If they disagree, the question goes to a human. That’s the design contract.”

Why this matters

Systematic literature reviews underpin regulatory submissions, clinical practice guidelines, and HTA decisions. Doing them well is expensive and slow — typically 4–6 months and a six-figure investment for a single review. Doing them badly is dangerous.

The first wave of LLM-assisted screening was a single model judging each title/abstract against the inclusion criteria. It was faster than manual review. It was no more accurate. In some cases, it was less accurate, because a single model has systematic biases that a human reviewer doesn’t share.

What multi-agent consensus does

EvidAI runs every screening decision through four independent LLMs, each with a structured prompt that includes the protocol’s inclusion and exclusion criteria, a brief excerpt from the abstract, and a request for explicit reasoning.

The four models vote. The outcomes fall into five patterns:

Pattern                      Frequency   Action
4–0 unanimous include        ~78%        Auto-include
4–0 unanimous exclude        ~13%        Auto-exclude
3–1 majority                 ~6%         Flag for human reviewer with explanations
2–2 split                    ~2%         Mandatory human reviewer with adjudication
Disagreement on reasoning    varies      Flag for human reviewer regardless of outcome

(Frequencies are typical for a well-designed protocol; they vary with topic.)
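
The routing rules in the table translate into a small function. This is a sketch: the vote labels, the action names, and the boolean flag for reasoning agreement are assumptions. How agreement on reasoning is actually judged is not specified in the table.

```python
def route(votes, reasoning_consistent=True):
    """Map four include/exclude votes to an action from the table."""
    if len(votes) != 4 or any(v not in ("include", "exclude") for v in votes):
        raise ValueError("expected exactly four include/exclude votes")
    includes = votes.count("include")
    if includes == 2:
        # 2-2 split: a human must adjudicate.
        return "mandatory_human_adjudication"
    if includes in (1, 3) or not reasoning_consistent:
        # 3-1 majority, or a unanimous vote reached for conflicting reasons.
        return "flag_for_human"
    return "auto_include" if includes == 4 else "auto_exclude"

unanimous = route(["include"] * 4)
majority = route(["include", "include", "exclude", "include"])
split = route(["include", "exclude", "include", "exclude"])
odd_reasons = route(["exclude"] * 4, reasoning_consistent=False)
```

Only unanimous votes with consistent reasoning are automated; every other outcome reaches a human.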

Why the design works

The key insight is that errors made by independently built models are only weakly correlated. Different LLMs have different systematic biases: different training data, different RLHF preferences, different prompt sensitivities. When four such reasoners agree, the probability that all of them are wrong drops sharply. When they disagree, the expected behavior is that they reproduce the disagreement human reviewers would have had, which is exactly what should be escalated.
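
As a rough worked example, assume each model makes the wrong call on a given paper independently and with the same probability p. This is an idealization: real LLM errors are partly correlated, so the true rate of unanimous error will be higher than the figure below.

```latex
P(\text{all four wrong}) = p^{4},
\qquad p = 0.1 \;\Rightarrow\; P = 0.1^{4} = 10^{-4}
```

Under that assumption, a single model with a 10% error rate is wrong on one paper in ten, while a unanimous wrong vote occurs on about one paper in ten thousand.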

Single-model screening hides disagreement. Multi-agent consensus surfaces it.

Auditability

Every screening decision is stored as a row with: paper ID, protocol version, model identifiers, raw model outputs, parsed decisions, the reason for inclusion/exclusion in each model’s words, the consensus result, and (if applicable) the human reviewer’s adjudication. The complete chain is replayable by an auditor and reproducible by a successor team.
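
The row described above maps naturally onto an immutable record type. The field names below follow the list in the text, but the class itself, the JSON serialization, and the sample values are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class ScreeningRecord:
    paper_id: str
    protocol_version: str
    model_ids: tuple
    raw_outputs: tuple
    parsed_decisions: tuple
    reasons: tuple                      # one reason per model, in its own words
    consensus: str
    human_adjudication: Optional[str] = None

    def to_json(self):
        """Serialize with sorted keys so identical records give identical text."""
        return json.dumps(asdict(self), sort_keys=True)

record = ScreeningRecord(
    paper_id="paper-001",
    protocol_version="v3",
    model_ids=("model-a", "model-b", "model-c", "model-d"),
    raw_outputs=("INCLUDE ...", "INCLUDE ...", "EXCLUDE ...", "INCLUDE ..."),
    parsed_decisions=("include", "include", "exclude", "include"),
    reasons=("meets criteria", "meets criteria", "wrong population", "meets criteria"),
    consensus="flag_for_human",
)
restored = json.loads(record.to_json())
```

Freezing the dataclass means a stored decision cannot be edited in place; a human adjudication would be written as a new record rather than a mutation of the old one.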

This is the difference between an AI tool that merely speeds up the SLR process and one that also preserves the audit standard the process requires.

Closing

The multi-agent consensus pattern is the right answer for any high-stakes screening problem where accountability and auditability matter. EvidAI applies it to systematic reviews. The same pattern transfers cleanly to compliance screening, regulatory document review, due diligence, and grant assessment.
