Pull-quote: “If four independent reasoners agree, the inclusion decision is high-confidence. If they disagree, the question goes to a human. That’s the design contract.”

Why this matters

Systematic literature reviews underpin regulatory submissions, clinical practice guidelines, and HTA decisions. Doing them well is expensive and slow — typically 4–6 months and a six-figure investment for a single review. Doing them badly is dangerous.

The first wave of LLM-assisted screening was a single model judging each title/abstract against the inclusion criteria. It was faster than manual review. It was no more accurate. In some cases, it was less accurate, because a single model has systematic biases that a human reviewer doesn’t share.

What multi-agent consensus does

EvidAI runs every screening decision through four independent LLMs, each with a structured prompt that includes the protocol’s inclusion and exclusion criteria, a brief excerpt from the abstract, and a request for explicit reasoning.
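
As a rough illustration of what such a structured prompt might look like, the sketch below assembles the three ingredients named above into one string. The function name `build_screening_prompt`, the section labels, and the `DECISION:` / `REASONING:` answer format are all assumptions made for this example, not EvidAI's actual prompt.

```python
def build_screening_prompt(inclusion, exclusion, abstract_excerpt):
    """Assemble one screening prompt. Layout and wording are illustrative only."""
    lines = [
        "You are screening a paper for a systematic literature review.",
        "",
        "Inclusion criteria:",
    ]
    lines += [f"- {criterion}" for criterion in inclusion]
    lines += ["", "Exclusion criteria:"]
    lines += [f"- {criterion}" for criterion in exclusion]
    lines += [
        "",
        "Abstract excerpt:",
        abstract_excerpt.strip(),
        "",
        # Asking for explicit reasoning keeps each model's rationale on record.
        "Answer with 'DECISION: include' or 'DECISION: exclude',",
        "then 'REASONING:' followed by a short explanation citing the criteria.",
    ]
    return "\n".join(lines)


# Hypothetical criteria and abstract text, for illustration only.
prompt = build_screening_prompt(
    inclusion=["Randomized controlled trial", "Adult participants"],
    exclusion=["Animal studies", "Conference abstracts only"],
    abstract_excerpt="  We conducted a randomized trial in 240 adults...  ",
)
```

Sending the same prompt text to each of the four models keeps the comparison fair: any difference in the votes then comes from the models, not from the wording.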

The four models vote. The outcomes fall into the following patterns:

Pattern                    | Frequency | Action
4–0 unanimous include      | ~78%      | Auto-include
4–0 unanimous exclude      | ~13%      | Auto-exclude
3–1 majority               | ~6%       | Flag for human reviewer with explanations
2–2 split                  | ~2%       | Mandatory human reviewer with adjudication
Disagreement on reasoning  | varies    | Flag for human reviewer regardless of outcome

(Frequencies are typical for a well-designed protocol; they vary with topic.)
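
The routing rules in the table can be sketched in a few lines. This is a minimal illustration, not EvidAI's implementation: the names `Action` and `route` are hypothetical, the `reasoning_consistent` flag is assumed to come from some separate comparison of the models' explanations, and the precedence (a 2–2 split outranks a reasoning flag) is an assumption the table does not spell out.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    AUTO_INCLUDE = "auto-include"
    AUTO_EXCLUDE = "auto-exclude"
    FLAG_FOR_REVIEW = "flag for human reviewer"
    MANDATORY_ADJUDICATION = "mandatory human adjudication"


def route(votes, reasoning_consistent=True):
    """Map four 'include'/'exclude' votes to a routing action."""
    if len(votes) != 4 or any(v not in ("include", "exclude") for v in votes):
        raise ValueError("expected exactly four 'include'/'exclude' votes")

    includes = Counter(votes)["include"]

    # Assumed precedence: a 2-2 split is the strongest escalation.
    if includes == 2:
        return Action.MANDATORY_ADJUDICATION
    # Models agree on the outcome but not on why: escalate regardless.
    if not reasoning_consistent:
        return Action.FLAG_FOR_REVIEW
    if includes == 4:
        return Action.AUTO_INCLUDE
    if includes == 0:
        return Action.AUTO_EXCLUDE
    # Remaining cases are 3-1 majorities in either direction.
    return Action.FLAG_FOR_REVIEW
```

Only the two unanimous outcomes with consistent reasoning bypass a human; every other path ends with a reviewer.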

Why the design works

The key insight is that errors made by genuinely independent reasoners are largely uncorrelated. Different LLMs have different systematic biases: different training data, different RLHF preferences, different prompt sensitivities. When four such reasoners agree, the probability that all of them are wrong in the same way drops sharply. When they disagree, the expectation is that they are reproducing the disagreement human reviewers would have had, which is exactly what should be escalated.
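
A small calculation makes the argument concrete. The sketch below assumes an idealized case: each model errs on a binary include/exclude decision with the same probability `p`, and errors are fully independent. Both the function names and the 10% error rate are hypothetical, chosen only for illustration. Real models share some training data, so their errors are partly correlated and the actual gain will be smaller than this idealized figure.

```python
def prob_all_wrong(p, n=4):
    """Probability that all n independent models err on the same paper."""
    return p ** n


def prob_wrong_given_unanimous(p, n=4):
    """Probability a unanimous vote is wrong, given that the vote was unanimous.

    With a binary decision, a unanimous vote means either all n models
    are right or all n are wrong.
    """
    all_wrong = p ** n
    all_right = (1 - p) ** n
    return all_wrong / (all_wrong + all_right)


p = 0.10  # hypothetical per-model error rate, not a measured value

single_model_error = p
unanimous_error = prob_all_wrong(p)
conditional_error = prob_wrong_given_unanimous(p)
```

Under these assumptions a single model is wrong about 1 time in 10, while all four are wrong together about 1 time in 10,000.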

Single-model screening hides disagreement. Multi-agent consensus surfaces it.

Auditability

Every screening decision is stored as a row with: paper ID, protocol version, model identifiers, raw model outputs, parsed decisions, the reason for inclusion/exclusion in each model’s words, the consensus result, and (if applicable) the human reviewer’s adjudication. The complete chain is replayable by an auditor and reproducible by a successor team.
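
One way to picture such a row is as an immutable record that serializes to JSON. The sketch below is an assumption about shape only: the class name `ScreeningRecord`, the field names, and the sample values are hypothetical, and are simply mapped one-to-one from the fields listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a stored decision should not be mutated later
class ScreeningRecord:
    paper_id: str
    protocol_version: str
    model_ids: tuple[str, ...]
    raw_outputs: tuple[str, ...]
    parsed_decisions: tuple[str, ...]
    reasons: tuple[str, ...]
    consensus: str
    human_adjudication: Optional[str] = None  # set only when a reviewer steps in

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records give identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical sample values, for illustration only.
record = ScreeningRecord(
    paper_id="paper-0001",
    protocol_version="v1.2",
    model_ids=("model-a", "model-b", "model-c", "model-d"),
    raw_outputs=("DECISION: include ...",) * 4,
    parsed_decisions=("include",) * 4,
    reasons=("Meets the population and design criteria.",) * 4,
    consensus="auto-include",
)
row = json.loads(record.to_json())
```

Keeping the raw outputs next to the parsed decisions is what makes the chain replayable: an auditor can re-run the parsing step and check that it yields the same consensus.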

This is the difference between an AI tool that merely speeds up the systematic literature review (SLR) process and one that also preserves the audit standard that process requires.

Closing

The multi-agent consensus pattern is a strong fit for any high-stakes screening problem where accountability and auditability matter. EvidAI applies it to systematic reviews. The same pattern transfers to compliance screening, regulatory document review, due diligence, and grant assessment.