<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Evaluation Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<atom:link href="https://zorost.com/tag/evaluation/feed/" rel="self" type="application/rss+xml" />
	<link>https://zorost.com/tag/evaluation/</link>
	<description>Production AI systems for aviation, manufacturing, pharma, government, finance, freight, and geopolitical intelligence.</description>
	<lastBuildDate>Wed, 20 May 2026 18:52:40 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://zorost.com/wp-content/uploads/2025/08/ZOROST-Intel-Logo3_512-150x150.png</url>
	<title>Evaluation Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<link>https://zorost.com/tag/evaluation/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">81719879</site>	<item>
		<title>Hybrid Retrieval: Why Vector Alone Isn&#8217;t Enough</title>
		<link>https://zorost.com/hybrid-retrieval-vector-alone-not-enough/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 17 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Agentic AI Engineering]]></category>
		<category><![CDATA[BM25]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Hybrid Retrieval]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Vector Search]]></category>
		<guid isPermaLink="false">https://zorost.com/hybrid-retrieval-vector-alone-not-enough/</guid>

					<description><![CDATA[<p>Vector search is excellent at semantic similarity and bad at named entities. BM25 is the opposite. Production-grade retrieval is hybrid — and the architecture decisions matter.</p>
<p>The post <a href="https://zorost.com/hybrid-retrieval-vector-alone-not-enough/">Hybrid Retrieval: Why Vector Alone Isn&#8217;t Enough</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Pure vector retrieval is the most common production RAG mistake. Pure BM25 is the second most common.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>A pattern repeats in every RAG project that goes wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is an architectural change: <strong>hybrid retrieval</strong>.</p>
<h4>The components</h4>
<pre><code>Query
  │
  ├──► Dense (vector)   — pgvector / Weaviate / Qdrant + an embedding model
  │
  ├──► Sparse (BM25)    — Postgres FTS / Elasticsearch / OpenSearch
  │
  ├──► Optional filters — date range, source, entity tags
  │
  └──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                                │
                                                                ▼
                                                Citation-grounded generation</code></pre>
<h4>Why each piece matters</h4>
<ul>
<li><strong>Vector</strong> is excellent at <em>semantic similarity</em> — finding documents that are about the same topic in different words. It is bad at <em>named entities</em> — exact terms, IDs, dates.</li>
<li><strong>BM25</strong> is the opposite — excellent at named entities, weaker on semantic similarity.</li>
<li><strong>Filters</strong> — when the question is bounded (&#8220;just look at 2024 reports about Boeing 737&#8221;), filters dramatically reduce the candidate set before ranking.</li>
<li><strong>Merge</strong> — Reciprocal Rank Fusion (RRF) is a clean default. Weighted merges work with calibrated scores.</li>
<li><strong>Cross-encoder re-rank</strong> — sees the query and the candidate document together and scores them jointly. More expensive than bi-encoder vector search, but the precision improvement on the top-K is large enough to pay for itself.</li>
</ul>
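<p>The merge step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion, not the production implementation: the function name, the document IDs, and the two hit lists are hypothetical, and <code>k=60</code> is the commonly used default constant rather than a tuned value.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical candidate lists from the dense and sparse retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

<p>RRF uses only rank positions, so it needs no score calibration between the dense and sparse retrievers. Documents that both retrievers return rise to the top of the fused list.</p>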
<h4>What changes when you do this right</h4>
<ul>
<li>Hallucination rate drops. The model has better evidence to ground in.</li>
<li>Citation precision goes up. The cited documents actually support the claim.</li>
<li>Edge cases (rare entity queries, exact-quote queries) work properly.</li>
<li>Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.</li>
</ul>
<h4>Common mistakes</h4>
<ul>
<li><strong>No re-ranker.</strong> Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.</li>
<li><strong>No filtering.</strong> Filtering before retrieval is essentially free if your data is properly indexed.</li>
<li><strong>No evaluation.</strong> Without a golden Q&amp;A dataset and grounding scoring, you have no way to compare retrieval architectures.</li>
</ul>
<h4>Closing</h4>
<p>Pure vector retrieval is the most common production RAG mistake. Hybrid retrieval (vector + sparse + filters + re-rank) is the boring, reliable, production answer. Every Zorost RAG system runs this architecture.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24296</post-id>	</item>
		<item>
		<title>Why Calibration Matters More Than Accuracy: an ECE 0.012 Story</title>
		<link>https://zorost.com/calibration-matters-more-than-accuracy/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Aviation Intelligence]]></category>
		<category><![CDATA[AeroFarr]]></category>
		<category><![CDATA[Calibration]]></category>
		<category><![CDATA[Conformal Prediction]]></category>
		<category><![CDATA[ECE]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[LACP]]></category>
		<guid isPermaLink="false">https://zorost.com/calibration-matters-more-than-accuracy/</guid>

					<description><![CDATA[<p>Headline accuracy is a misleading metric for high-stakes decisions. Calibration is the real one. Here is what ECE 0.012 means and how we got there.</p>
<p>The post <a href="https://zorost.com/calibration-matters-more-than-accuracy/">Why Calibration Matters More Than Accuracy: an ECE 0.012 Story</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;When the model says 70%, it should be right 70% of the time. That&#8217;s calibration. Anything less is dishonest.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>&#8220;Our model is 92% accurate&#8221; is a marketing line. It tells you almost nothing about whether you should trust the model with a decision. The real question is: <strong>when the model says it is 70% confident, is it actually right 70% of the time?</strong></p>
<p>That is <strong>calibration</strong>. The metric is <strong>Expected Calibration Error (ECE)</strong>.</p>
<h4>The metric, briefly</h4>
<p>Group predictions into bins by their stated probability. For each bin, compare the average predicted probability to the observed frequency of the event. ECE is the average of the absolute differences, weighted by the number of predictions in each bin. Lower is better. As a rough guide, below 0.02 is very good in production, and below 0.01 is excellent.</p>
<p>AeroFarr&#8217;s gate classifier achieves <strong>ECE 0.012 on 581,316 held-out flights</strong>. The weighted average gap between predicted probability and observed frequency is about 1.2 percentage points, so the predicted probabilities track the observed frequencies tightly across the probability range, not just in aggregate.</p>
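<p>The binning procedure can be written down directly. The sketch below assumes binary labels, ten equal-width bins, and made-up toy data; it illustrates the metric and is not AeroFarr&#8217;s evaluation code.</p>

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary predictions with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        obs_freq = sum(y for _, y in members) / len(members)
        # Weight each bin's gap by its share of all predictions.
        ece += (len(members) / total) * abs(avg_conf - obs_freq)
    return ece


# Toy data: ten predictions at 0.7 with seven positives (well calibrated),
# and ten predictions at 0.9 with only six positives (overconfident).
calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
print(round(calibrated, 3), round(overconfident, 3))
```

<p>The first value is approximately 0 and the second approximately 0.3: a model that says 90% and is right 60% of the time carries a 30-point calibration gap, whatever its accuracy.</p>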
<h4>How we got there</h4>
<p>Three ingredients:</p>
<ol>
<li><strong>A multi-head stacked architecture</strong> — separate heads for gate / severity / regression / quantile, each tuned on the loss most appropriate for its job, then combined under a non-linear meta-learner. The meta sees the heads&#8217; outputs and learns how to combine them. Calibration is enforced at each head and at the meta.</li>
<li><strong>Loss functions chosen for calibration, not accuracy.</strong> Cross-entropy with label smoothing for classifiers; quantile loss for the quantile heads.</li>
<li><strong>Post-hoc calibration on a holdout slice.</strong> Platt scaling and isotonic regression are applied as a final stage on a slice of data the heads never saw.</li>
</ol>
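<p>To make the second ingredient concrete, here is a minimal sketch of the quantile (pinball) loss in plain Python. The function name and the numbers are illustrative assumptions; the post does not describe how the quantile heads are actually trained.</p>

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for a target quantile q in (0, 1)."""
    total = 0.0
    for y, pred in zip(y_true, y_pred):
        diff = y - pred
        # Under-prediction costs q per unit; over-prediction costs (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


# For the 0.9 quantile, predicting 5 minutes too low costs far more
# than predicting 5 minutes too high.
under = pinball_loss([20.0], [15.0], q=0.9)
over = pinball_loss([20.0], [25.0], q=0.9)
print(under, over)
```

<p>The asymmetry is the point: minimizing this loss pushes the prediction toward the value that the true outcome falls below about 90% of the time, which is what a 0.9-quantile head is supposed to output.</p>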
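<p>Calibration has to be designed in from the start. Post-hoc calibration works as a final stage on heads that were already trained for calibration. Bolted on alone at the end as a band-aid, it does not hold up in high-stakes operational use.</p>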
<h4>Why it matters operationally</h4>
<p>If a planner is making a &#8220;should we keep this aircraft on the gate?&#8221; decision and the model says 30% chance of cancellation, the planner&#8217;s mental model is: <em>roughly one in three.</em> If the model is poorly calibrated and 30% is actually 60%, the planner&#8217;s prior is wrong, and every decision downstream is wrong.</p>
<p>Calibrated probabilities preserve the planner&#8217;s intuition. Uncalibrated probabilities corrupt it.</p>
<h4>Conformal prediction on top</h4>
<p>Calibration tells you about average behavior. <strong>Conformal prediction</strong> quantifies uncertainty for <em>individual</em> predictions. We use <strong>Locally Adaptive Conformal Prediction (LACP)</strong> to produce distribution-free prediction intervals whose width adapts to how hard each case is. When AeroFarr says &#8220;delay between 18 and 47 minutes with 90% coverage,&#8221; intervals built this way contain the actual delay at least 90% of the time across flights, regardless of the shape of the underlying distribution, provided new flights are exchangeable with the calibration data.</p>
<p>This is the second ingredient of honesty in a production model. Calibration says the model&#8217;s stated probabilities mean what they say. Conformal prediction says the model&#8217;s stated intervals mean what they say.</p>
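<p>A minimal sketch of the idea, assuming a split-conformal setup with normalized residuals as the locally adaptive score. The calibration tuples, the <code>spread</code> estimate, and the function names are hypothetical; this is one common way to build adaptive intervals, not a description of AeroFarr&#8217;s LACP implementation.</p>

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def adaptive_interval(pred, spread, qhat):
    """Interval whose width scales with a per-example difficulty estimate."""
    return pred - qhat * spread, pred + qhat * spread


# Made-up calibration set: (true delay, predicted delay, predicted spread).
calibration = [(30, 25, 5), (12, 15, 3), (50, 40, 8), (22, 20, 4), (35, 37, 6),
               (18, 16, 2), (41, 45, 7), (27, 30, 5), (33, 31, 4), (24, 21, 4)]
# Normalized residual: absolute error divided by the predicted spread.
scores = [abs(y - pred) / spread for y, pred, spread in calibration]
qhat = conformal_quantile(scores, alpha=0.2)
low, high = adaptive_interval(pred=32.0, spread=6.0, qhat=qhat)
print(qhat, low, high)
```

<p>Dividing each residual by a predicted spread is what makes the interval locally adaptive: hard cases get wide intervals and easy cases get narrow ones, while the coverage guarantee still comes from the calibration quantile.</p>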
<h4>Closing</h4>
<p>Headline accuracy is a misleading metric for high-stakes decisions. Calibration and conformal prediction are the real ones. ECE 0.012 is what we ship. We don&#8217;t quote accuracy without calibration, and we don&#8217;t quote intervals without coverage.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24285</post-id>	</item>
		<item>
		<title>Production-Grade RAG on the Lakehouse with Mosaic AI Vector Search</title>
		<link>https://zorost.com/production-rag-mosaic-ai-vector-search/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 03 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Databricks Modernization]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Hybrid Retrieval]]></category>
		<category><![CDATA[Mosaic AI]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Vector Search]]></category>
		<guid isPermaLink="false">https://zorost.com/production-rag-mosaic-ai-vector-search/</guid>

					<description><![CDATA[<p>How to design, build, and evaluate a production RAG system on Databricks using Mosaic AI Vector Search, hybrid retrieval, and a real evaluation harness.</p>
<p>The post <a href="https://zorost.com/production-rag-mosaic-ai-vector-search/">Production-Grade RAG on the Lakehouse with Mosaic AI Vector Search</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;RAG works in demos. RAG that works in production requires hybrid retrieval, a re-ranker, citation grounding, and an evaluation harness.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Most RAG projects pilot well and disappoint in production. The pattern is the same: embed the corpus, run vector search, ship. Production-grade RAG requires more.</p>
<h4>The production RAG architecture</h4>
<pre><code>                     ┌────────────────────┐
        Question ───►│  AI Gateway        │  ← key mgmt, routing, observability
                     └──────────┬─────────┘
                                ▼
        ┌────────────────────────────────────────────┐
        │                Retrieval                    │
        │  ┌────────────────┐  ┌────────────────┐   │
        │  │ Mosaic AI      │  │ BM25 (lexical) │   │
        │  │ Vector Search  │  │ on Delta SQL   │   │
        │  │ (Delta-synced) │  │                │   │
        │  └───────┬────────┘  └────────┬───────┘   │
        │          └──── merge (RRF) ───┘           │
        │                  │                          │
        │              cross-encoder                  │
        │              re-rank                        │
        └────────────────┬─────────────────────────────┘
                         ▼
              top-K (typically 6–10)
                         │
                         ▼
              Citation-grounded generation
              (Mosaic AI Model Serving)
                         │
                         ▼
              Validated answer with source links</code></pre>
<h4>Why Mosaic AI Vector Search specifically</h4>
<p>Mosaic AI Vector Search <strong>synchronizes with Delta tables</strong>. Update the source table, the index updates. No orchestration glue. Tagging, ACLs, and lineage flow through Unity Catalog. For RAG over enterprise data that changes, this matters more than people initially appreciate.</p>
<h4>Hybrid retrieval is the pattern</h4>
<p>Pure vector search is the most common production RAG mistake. Pure BM25 is the second most common. Hybrid — vector + BM25 + filters + re-rank — is the answer that actually works.</p>
<h4>Citation grounding as a structural fix</h4>
<p>Constrain the model to write with bracketed citation tokens. Validate every citation against the retrieval set. Reject answers that fail validation. This is a small structural change with a large operational impact.</p>
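<p>The validation step can be sketched as follows. The bracketed-number token format (<code>[1]</code>, <code>[2]</code>), the function name, and the sample answers are assumptions for illustration; the post does not specify the token syntax.</p>

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")  # assumed token format: [1], [2], ...


def validate_citations(answer, retrieved_ids):
    """Return (is_valid, cited_ids, unknown_ids) for a generated answer.

    An answer passes only if it cites at least one document and every
    cited ID belongs to the retrieval set.
    """
    cited = {int(m) for m in CITATION_PATTERN.findall(answer)}
    unknown = cited - set(retrieved_ids)
    is_valid = bool(cited) and not unknown
    return is_valid, sorted(cited), sorted(unknown)


retrieved = [1, 2, 3]
good = "Inspections were paused in March [2] and resumed in April [3]."
bad = "Inspections were paused in March [2], as an audit confirmed [7]."
print(validate_citations(good, retrieved))
print(validate_citations(bad, retrieved))
```

<p>This check only confirms that each citation points at a retrieved document. Whether the document actually supports the claim is a separate question, which the grounding-rate measurement in the evaluation harness is meant to answer.</p>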
<h4>Evaluation harness — non-negotiable</h4>
<p>A production RAG system without an evaluation harness is a guess. The harness has three components:</p>
<ol>
<li><strong>Golden Q&amp;A dataset</strong> — questions paired with the documents that should ground the answers</li>
<li><strong>Grounding rate</strong> — what fraction of generated claims are supported by retrieved documents</li>
<li><strong>Hallucination detection</strong> — flagging unsupported claims</li>
</ol>
<p>The harness runs as a Databricks Job on every model or retrieval change. Regressions are caught before deployment.</p>
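<p>Two of the harness measurements can be sketched in plain Python. The golden-set layout (<code>question</code>, <code>expected_doc_ids</code>), the stand-in retriever, and the list of per-claim judgements are hypothetical; how claims are judged as supported (by a human or by a model) is left open here.</p>

```python
def recall_at_k(golden, retrieve, k=10):
    """Average fraction of expected documents found in the top-k results."""
    per_question = []
    for item in golden:
        retrieved = set(retrieve(item["question"])[:k])
        expected = set(item["expected_doc_ids"])
        per_question.append(len(expected.intersection(retrieved)) / len(expected))
    return sum(per_question) / len(per_question)


def grounding_rate(claim_judgements):
    """Fraction of generated claims judged as supported by retrieved documents."""
    supported = sum(1 for ok in claim_judgements if ok)
    return supported / len(claim_judgements)


# Hypothetical golden set and a stand-in retriever backed by a lookup table.
golden = [
    {"question": "q1", "expected_doc_ids": ["d1", "d2"]},
    {"question": "q2", "expected_doc_ids": ["d4"]},
]
fake_index = {"q1": ["d1", "d3"], "q2": ["d4", "d5"]}
recall = recall_at_k(golden, lambda q: fake_index[q], k=2)
grounded = grounding_rate([True, True, False, True])
print(recall, grounded)
```

<p>Running the same golden set against each retrieval variant yields comparable numbers, which is what lets a scheduled job flag a regression before deployment.</p>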
<h4>Closing</h4>
<p>Production RAG on the Lakehouse with Mosaic AI is straightforward when you adopt the architecture: hybrid retrieval, re-ranker, citation grounding, evaluation harness. The result is a RAG system analysts trust enough to use.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24305</post-id>	</item>
		<item>
		<title>Multi-Agent OSINT with a Critic and a Referee</title>
		<link>https://zorost.com/multi-agent-osint-critic-referee/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 20 Jan 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Geopolitical Intelligence]]></category>
		<category><![CDATA[Aquil]]></category>
		<category><![CDATA[Causal Inference]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Multi-Agent]]></category>
		<category><![CDATA[OSINT]]></category>
		<guid isPermaLink="false">https://zorost.com/multi-agent-osint-critic-referee/</guid>

					<description><![CDATA[<p>A swarm of agents producing summaries is not analysis. Adding a critic and a referee changes what the system is. Here is how Aquil's OSINT architecture is structured.</p>
<p>The post <a href="https://zorost.com/multi-agent-osint-critic-referee/">Multi-Agent OSINT with a Critic and a Referee</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Speed of agents matters less than honesty of agents. Critic and referee are how you build honesty into the swarm.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>The first wave of multi-agent OSINT systems was a swarm: ten agents reading the same inputs and producing summaries, which were then averaged. The result was confident-sounding mediocrity. The agents reinforced each other&#8217;s biases. The aggregator could not tell whether the consensus was real or echo.</p>
<p>The second wave adds <strong>structure</strong> to the swarm. Specifically, two roles that are missing in the naive design:</p>
<ul>
<li><strong>Critic</strong> — adversarial review. The critic&#8217;s job is to find the weakest link in the analysts&#8217; reasoning and challenge it.</li>
<li><strong>Referee</strong> — adjudicates when analysts disagree. The referee&#8217;s job is to apply explicit decision criteria and produce a final answer with explicit reasoning.</li>
</ul>
<p>This is not a UI improvement. It is a structural change in what the system is.</p>
<h4>Aquil&#8217;s swarm</h4>
<p>Aquil runs a structured OSINT swarm with four roles:</p>
<ol>
<li><strong>Sourcers</strong> — discover and ingest open-source signals (news, public data, leaks, public records, satellite imagery sources where licensed)</li>
<li><strong>Analysts</strong> — produce hypotheses, summarize evidence, and propose causal explanations</li>
<li><strong>Critic</strong> — reviews analyst output for unsupported claims, missing evidence, plausible alternative explanations, and reasoning gaps</li>
<li><strong>Referee</strong> — adjudicates when the analysts and the critic disagree, with explicit criteria</li>
</ol>
<p>The critic is structurally different from the analysts: it does not propose new claims. Its only function is to challenge existing ones. The referee is structurally different again: it does not propose or challenge. It decides, with explicit reasoning that goes into the audit trail.</p>
<h4>Causal-graph synthesis</h4>
<p>On top of the swarm, Aquil produces a <strong>causal graph</strong> of the assessed situation — events as nodes, hypothesized causal relationships as edges, with confidence weights. The graph is the team&#8217;s shared mental model. It is updateable, queryable, and exportable.</p>
<p>A causal graph is not just a visualization. It is a structured commitment to <em>what we think is going on</em>. New evidence updates the graph; missing evidence flags weak edges; alternative hypotheses are visible as competing edges.</p>
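<p>As a data structure, such a graph can be quite small. The sketch below is an assumption about shape only: the class names, the 0.5 threshold, and the example events are invented for illustration and are not Aquil&#8217;s schema.</p>

```python
from dataclasses import dataclass, field


@dataclass
class CausalEdge:
    cause: str
    effect: str
    confidence: float  # weight in the range 0 to 1
    evidence: list = field(default_factory=list)  # source references


@dataclass
class CausalGraph:
    edges: list = field(default_factory=list)

    def add_edge(self, cause, effect, confidence, evidence=()):
        self.edges.append(CausalEdge(cause, effect, confidence, list(evidence)))

    def weak_edges(self, threshold=0.5):
        """Edges with no evidence attached or confidence below the threshold."""
        return [e for e in self.edges if not e.evidence or threshold > e.confidence]

    def competing_causes(self, effect):
        """All hypothesized causes of one event, strongest first."""
        candidates = [e for e in self.edges if e.effect == effect]
        return sorted(candidates, key=lambda e: -e.confidence)


graph = CausalGraph()
graph.add_edge("port strike", "shipping delays", 0.8, ["report-12"])
graph.add_edge("storm", "shipping delays", 0.3, ["bulletin-4"])
graph.add_edge("shipping delays", "price spike", 0.6)
print([e.cause for e in graph.competing_causes("shipping delays")])
print([(e.cause, e.effect) for e in graph.weak_edges()])
```

<p>Keeping evidence on the edge is what lets the graph flag weak links: an edge with no sources, or with low confidence, is visible as a gap rather than buried in a summary.</p>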
<h4>Why this works</h4>
<p>The naive swarm fails because mediocre answers can hide behind a chorus. The structured swarm makes the chorus disagree on purpose, and then makes a referee adjudicate. The agents&#8217; weaknesses are surfaced rather than averaged. The team gets a more honest answer.</p>
<h4>Closing</h4>
<p>Speed of agents matters less than honesty of agents. The critic and the referee are how you build honesty into the swarm. Aquil is structured around that thesis.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24293</post-id>	</item>
		<item>
		<title>The Agent Factory: Planner, Executor, Critic, Referee</title>
		<link>https://zorost.com/agent-factory-planner-executor-critic-referee/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 23 Dec 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Agentic AI Engineering]]></category>
		<category><![CDATA[Agentic AI]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Governance]]></category>
		<category><![CDATA[LangGraph]]></category>
		<category><![CDATA[Multi-Agent]]></category>
		<guid isPermaLink="false">https://zorost.com/agent-factory-planner-executor-critic-referee/</guid>

					<description><![CDATA[<p>Most production agentic systems converge on the same architecture: a planner, an executor, a critic, and a referee. Here is the pattern, why it works, and how we apply it across industries.</p>
<p>The post <a href="https://zorost.com/agent-factory-planner-executor-critic-referee/">The Agent Factory: Planner, Executor, Critic, Referee</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;The four-role pattern is not an opinion. It&#8217;s the architecture every production multi-agent system converges on once it survives the first round of real users.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Multi-agent AI starts as a clever idea (let agents talk to each other!) and dies in production as an unreliable mess (agents hallucinate to each other, disagreements never resolve, the audit trail is unreadable). The fix is structural: four roles, typed contracts, deterministic logs.</p>
<h4>The four roles</h4>
<ol>
<li><strong>Planner</strong> — decomposes the high-level goal into sub-goals and decides the sequence. Reads the task, the available tools, and the agent&#8217;s memory; emits a structured plan.</li>
<li><strong>Executor(s)</strong> — carries out sub-goals. Calls tools. Returns structured outputs. Knows nothing about the high-level plan; just executes its assigned sub-goal honestly.</li>
<li><strong>Critic</strong> — reviews each executor output adversarially. Looks for unsupported claims, broken citations, missed evidence, alternative interpretations. Does not propose new actions; only critiques.</li>
<li><strong>Referee</strong> — adjudicates when the critic disagrees with the executor. Has explicit criteria. Produces the final decision with explicit reasoning.</li>
</ol>
<h4>Why this works</h4>
<ul>
<li><strong>Planner / executor separation</strong> prevents the planner from drifting into execution and getting confused by tool errors.</li>
<li><strong>Critic separation</strong> prevents the executors from grading their own work, which is a category error.</li>
<li><strong>Referee separation</strong> prevents endless executor-vs-critic loops.</li>
</ul>
<h4>Common variations</h4>
<ul>
<li><strong>Single executor vs. multi-executor (parallelism).</strong> Parallel executors for independent sub-goals; serial for dependent ones.</li>
<li><strong>Critic per executor or shared critic.</strong> Per-executor for specialized critique; shared for consistency across the run.</li>
<li><strong>Hierarchical planning.</strong> A meta-planner produces a plan that includes &#8220;now plan this sub-task in detail&#8221; steps.</li>
</ul>
<h4>What we standardize</h4>
<p>We standardize three things across every production agentic system:</p>
<ol>
<li><strong>Typed tool contracts</strong> — every tool has explicit input/output schemas. No improvisation.</li>
<li><strong>Deterministic logs</strong> — every call (planner → executor, executor → tool, critic → executor) is logged with timestamps and parameters.</li>
<li><strong>Evaluation harnesses</strong> — every system ships with a golden dataset, a regression suite, hallucination detection, and grounding scoring. New versions are evaluated before promotion.</li>
</ol>
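<p>The first two items can be illustrated together. In this sketch the tool, its schemas, the flight ID, and the log record layout are all hypothetical; it shows one way to enforce a typed contract and write one structured log line per call, using only the standard library.</p>

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class LookupInput:
    flight_id: str


@dataclass(frozen=True)
class LookupOutput:
    flight_id: str
    delay_minutes: int


def lookup_delay(request: LookupInput) -> LookupOutput:
    # Placeholder tool body; a real tool would query a data source.
    return LookupOutput(flight_id=request.flight_id, delay_minutes=0)


def call_tool(caller, tool, request, input_type, output_type, log):
    """Enforce the tool's contract and append one structured log record."""
    if not isinstance(request, input_type):
        raise TypeError(f"{tool.__name__} expects {input_type.__name__}")
    result = tool(request)
    if not isinstance(result, output_type):
        raise TypeError(f"{tool.__name__} must return {output_type.__name__}")
    log.append(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "caller": caller,
        "tool": tool.__name__,
        "input": asdict(request),
        "output": asdict(result),
    }, sort_keys=True))
    return result


log = []
result = call_tool("executor", lookup_delay, LookupInput("ZZ100"),
                   LookupInput, LookupOutput, log)
print(result, len(log))
```

<p>Routing every call through one wrapper keeps the log complete by construction, and a request that does not match the schema fails before the tool runs.</p>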
<h4>Where we run this pattern</h4>
<ul>
<li><strong>AeroFarr</strong> — multi-tool aviation analyst (planner / executor / critic over the prediction core, the cascade GNN, the causal engine, and the RAG corpus)</li>
<li><strong>EvidAI</strong> — 4-model consensus screening with explicit critic and referee</li>
<li><strong>FreightCortex</strong> — 16-tool AI freight analyst with planner / executor and a critic on report quality</li>
<li><strong>Aquil</strong> — sourcers / analysts / critic / referee for OSINT</li>
<li><strong>SPCio</strong> (with a manufacturing intelligence partner) — 8 specialized agents with a meta-coordinator</li>
</ul>
<h4>Closing</h4>
<p>The four-role pattern is not an opinion. It is the architecture every production multi-agent system converges on once it survives the first round of real users. Skipping it is a tax you pay later.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24295</post-id>	</item>
		<item>
		<title>Living Systematic Reviews: Evidence That Stays Current</title>
		<link>https://zorost.com/living-systematic-reviews-evidence-current/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 16 Dec 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Pharmaceutical Research]]></category>
		<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[EvidAI]]></category>
		<category><![CDATA[PRISMA 2020]]></category>
		<category><![CDATA[RAG]]></category>
		<guid isPermaLink="false">https://zorost.com/living-systematic-reviews-evidence-current/</guid>

					<description><![CDATA[<p>A traditional systematic review is a snapshot, frozen at the search date. A living review is a stream, refreshed as new evidence appears. Here is the architecture that makes living reviews operationally feasible.</p>
<p>The post <a href="https://zorost.com/living-systematic-reviews-evidence-current/">Living Systematic Reviews: Evidence That Stays Current</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;A review that is six months out of date is not a review. It is a historical artifact.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>The fundamental flaw of the traditional systematic review is that it is a <strong>snapshot</strong>. A team works on it for six months, freezes the literature search at a date, and publishes a result that becomes outdated the moment the next paper appears. In rapidly evolving fields — oncology, infectious disease, AI/ML methodology, certain rare-disease indications — that lag is unacceptable.</p>
<p>The fix is a <strong>living systematic review</strong> — a review that is continuously refreshed as new evidence appears.</p>
<h4>What &#8220;living&#8221; actually requires</h4>
<p>Living reviews are not just &#8220;running the search again every quarter.&#8221; They require:</p>
<ol>
<li><strong>Protocol stability</strong> — the inclusion / exclusion criteria do not change between updates</li>
<li><strong>Federated search at scheduled cadence</strong> across the full database set</li>
<li><strong>Delta detection</strong> — what&#8217;s new since the last update</li>
<li><strong>Consistent screening</strong> — the same multi-agent consensus applied to new papers</li>
<li><strong>Risk-of-bias and GRADE re-assessment</strong> — if a new high-quality study changes the certainty of evidence, that needs to surface</li>
<li><strong>Versioned reporting</strong> — each refresh produces a versioned report with a clear changelog</li>
<li><strong>Subscriber notification</strong> — stakeholders are alerted when something material changes</li>
</ol>
<p>This is not a research methodology improvement. It is an engineering problem: how to do high-rigor evidence synthesis on a recurring schedule, with reproducibility and auditability preserved.</p>
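<p>Delta detection, for example, reduces to comparing identifiers across refreshes. The sketch below assumes each record carries a DOI and uses placeholder DOIs; real pipelines usually also need fallback matching for records without one, which is not shown.</p>

```python
def detect_delta(previous_ids, current_results):
    """Return records whose identifier was not seen in the last refresh."""
    seen = {pid.strip().lower() for pid in previous_ids}
    new_records = []
    for record in current_results:
        key = record["doi"].strip().lower()  # normalize before comparing
        if key not in seen:
            seen.add(key)  # also de-duplicates hits returned by several databases
            new_records.append(record)
    return new_records


previous = ["10.0000/example.1", "10.0000/example.2"]
current = [
    {"doi": "10.0000/EXAMPLE.2", "title": "Already screened"},
    {"doi": "10.0000/example.3", "title": "New trial"},
    {"doi": "10.0000/example.3 ", "title": "New trial (duplicate hit)"},
]
new_papers = detect_delta(previous, current)
print([p["title"] for p in new_papers])
```

<p>Only the new records go on to screening, so the cost of each refresh scales with what changed rather than with the size of the whole corpus.</p>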
<h4>Architecture</h4>
<p>EvidAI&#8217;s living review architecture:</p>
<pre><code>Protocol (versioned) ──► Federated search (11 databases, scheduled)
                                          │
                                          ▼
                              Delta detection
                                          │
                          New papers since last refresh
                                          │
                                          ▼
                       Multi-agent consensus screening
                                          │
                          Included papers (new)
                                          │
                                          ▼
                  Risk-of-bias (RoB 2 / ROBINS-I / NOS)
                                          │
                                          ▼
                  GRADE re-assessment per outcome
                                          │
                                          ▼
                  Living report (versioned, with changelog)
                                          │
                                          ▼
                  Subscriber notifications</code></pre>
<h4>What changes for the team</h4>
<p>The team&#8217;s role shifts from &#8220;run a six-month review every two years&#8221; to &#8220;monitor a continuously updated review and adjudicate the small fraction of decisions the AI escalated.&#8221; That is a fundamentally different work pattern, and it scales.</p>
<h4>Closing</h4>
<p>A review that is six months out of date is not a review. Living reviews are an engineering solution to a research methodology problem — and they are now operationally feasible.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24287</post-id>	</item>
		<item>
		<title>Multi-Agent Consensus for Systematic Literature Review</title>
		<link>https://zorost.com/multi-agent-consensus-systematic-review/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 04 Nov 2025 09:00:00 +0000</pubDate>
				<category><![CDATA[Pharmaceutical Research]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[EvidAI]]></category>
		<category><![CDATA[Multi-Agent]]></category>
		<category><![CDATA[PRISMA 2020]]></category>
		<category><![CDATA[Risk of Bias]]></category>
		<category><![CDATA[ROBINS-I]]></category>
		<guid isPermaLink="false">https://zorost.com/multi-agent-consensus-systematic-review/</guid>

					<description><![CDATA[<p>Single-LLM screening makes the SLR process faster but no more accurate. Multi-agent consensus screening — with four models, explanations, and disagreement detection — preserves PRISMA 2020 rigor.</p>
<p>The post <a href="https://zorost.com/multi-agent-consensus-systematic-review/">Multi-Agent Consensus for Systematic Literature Review</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;If four independent reasoners agree, the inclusion decision is high-confidence. If they disagree, the question goes to a human. That&#8217;s the design contract.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Systematic literature reviews underpin regulatory submissions, clinical practice guidelines, and HTA decisions. Doing them well is expensive and slow — typically 4–6 months and a six-figure investment for a single review. Doing them badly is dangerous.</p>
<p>The first wave of LLM-assisted screening was a single model judging each title/abstract against the inclusion criteria. It was faster than manual review. It was no more accurate. In some cases, it was less accurate, because a single model has systematic biases that a human reviewer doesn&#8217;t share.</p>
<h4>What multi-agent consensus does</h4>
<p>EvidAI runs every screening decision through <strong>four independent LLMs</strong>, each with a structured prompt that includes the protocol&#8217;s inclusion and exclusion criteria, a brief excerpt from the abstract, and a request for explicit reasoning.</p>
<p>The four models vote. The outcomes fall into five patterns:</p>
<table>
<thead>
<tr>
<th>Pattern</th>
<th>Frequency</th>
<th>Action</th>
</tr>
</thead>
<tbody>
<tr>
<td>4–0 unanimous include</td>
<td>~78%</td>
<td>Auto-include</td>
</tr>
<tr>
<td>4–0 unanimous exclude</td>
<td>~13%</td>
<td>Auto-exclude</td>
</tr>
<tr>
<td>3–1 majority</td>
<td>~6%</td>
<td>Flag for human reviewer with explanations</td>
</tr>
<tr>
<td>2–2 split</td>
<td>~2%</td>
<td>Mandatory human reviewer with adjudication</td>
</tr>
<tr>
<td>Disagreement on reasoning</td>
<td>varies</td>
<td>Flag for human reviewer regardless of outcome</td>
</tr>
</tbody>
</table>
<p>(Frequencies are typical for a well-designed protocol; they vary with topic.)</p>
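<p>The routing in the table can be expressed as a small function. The vote labels, action names, and the <code>reasoning_conflict</code> flag are hypothetical; how a reasoning disagreement is detected is not specified in this post and is taken as an input here.</p>

```python
def consensus_action(votes, reasoning_conflict=False):
    """Map four include/exclude votes to a screening action."""
    if len(votes) != 4 or any(v not in ("include", "exclude") for v in votes):
        raise ValueError("expected exactly four include/exclude votes")
    includes = sum(1 for v in votes if v == "include")
    if includes == 2:
        return "mandatory_human_adjudication"  # 2-2 split
    if reasoning_conflict:
        return "flag_for_human"  # escalate regardless of the vote
    if includes == 4:
        return "auto_include"
    if includes == 0:
        return "auto_exclude"
    return "flag_for_human"  # 3-1 majority in either direction


print(consensus_action(["include"] * 4))
print(consensus_action(["include", "include", "include", "exclude"]))
print(consensus_action(["include"] * 4, reasoning_conflict=True))
```

<p>Only unanimous votes with consistent reasoning are decided automatically. Every other pattern reaches a human reviewer.</p>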
<h4>Why the design works</h4>
<p>The key insight is that <strong>errors from different models are only weakly correlated</strong>. Different LLMs have different systematic biases: different training data, different RLHF preferences, different prompt sensitivities. When four reasoners with largely independent errors agree, the probability that all of them are wrong drops sharply. When they disagree, the split usually marks a paper that human reviewers would also have disagreed on, which is exactly what should be escalated.</p>
<p>Single-model screening hides disagreement. Multi-agent consensus surfaces it.</p>
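<p>A short calculation shows the size of the effect under an idealized assumption. The 10% per-model error rate is an illustrative figure, not a measured one, and full independence is a best case: real models share some biases, so the true unanimous-error rate will be higher than this.</p>

```python
from math import comb


def wrong_vote_distribution(p_error, n_models=4):
    """P(exactly k of n models are wrong), assuming independent errors."""
    return [
        comb(n_models, k) * p_error**k * (1 - p_error) ** (n_models - k)
        for k in range(n_models + 1)
    ]


dist = wrong_vote_distribution(0.10)  # assumed per-model error rate
for k, prob in enumerate(dist):
    print(k, round(prob, 4))
```

<p>Under these assumptions, all four models are wrong together about once in 10,000 decisions, while the far more common case of one or two wrong models shows up as a split vote and is escalated instead of being silently accepted.</p>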
<h4>Auditability</h4>
<p>Every screening decision is stored as a row with: paper ID, protocol version, model identifiers, raw model outputs, parsed decisions, the reason for inclusion/exclusion in each model&#8217;s words, the consensus result, and (if applicable) the human reviewer&#8217;s adjudication. The complete chain is replayable by an auditor and reproducible by a successor team.</p>
<p>This is the difference between an AI tool that <em>speeds up</em> the SLR process and one that <em>preserves the audit standard</em> it requires.</p>
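<p>A record of that kind might look like the sketch below. The field names follow the list above, but the class, the placeholder model identifiers, and the JSON serialization are assumptions for illustration rather than EvidAI&#8217;s storage schema.</p>

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ScreeningRecord:
    paper_id: str
    protocol_version: str
    model_ids: tuple
    raw_outputs: tuple
    parsed_decisions: tuple
    reasons: tuple
    consensus: str
    human_adjudication: Optional[str] = None  # set only when escalated

    def to_json(self):
        # Sorted keys give a stable serialization for diffing and replay.
        return json.dumps(asdict(self), sort_keys=True)


record = ScreeningRecord(
    paper_id="paper-001",
    protocol_version="v1",
    model_ids=("model-a", "model-b", "model-c", "model-d"),
    raw_outputs=("INCLUDE: ...", "INCLUDE: ...", "INCLUDE: ...", "EXCLUDE: ..."),
    parsed_decisions=("include", "include", "include", "exclude"),
    reasons=("meets criteria", "meets criteria", "meets criteria", "wrong population"),
    consensus="flag_for_human",
)
print(record.to_json())
```

<p>Making the record immutable and storing the raw outputs next to the parsed decisions means an auditor can re-parse and re-derive the consensus without calling any model again.</p>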
<h4>Closing</h4>
<p>The multi-agent consensus pattern is the right answer for any high-stakes screening problem where accountability and auditability matter. EvidAI applies it to systematic reviews. The same pattern transfers cleanly to compliance screening, regulatory document review, due diligence, and grant assessment.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24286</post-id>	</item>
	</channel>
</rss>
