A Retrieval Engine over the World's Aviation Safety Corpus - Zorost Intelligence

Pull-quote: “Vector search alone is not retrieval. It is one signal among several.”

Why this matters

Aviation safety knowledge sits in two enormous public-domain corpora: the U.S. NTSB accident reports and the NASA ASRS voluntary safety reports. Together, that’s 247,000+ documents of structured incident narratives. Pilots, controllers, and operations engineers have written them under the assumption that they would be searched, cross-referenced, and learned from.

Most platforms reduce this to keyword search. Better platforms add full-text search. The frontier is citation-grounded retrieval-augmented generation: the assistant retrieves, the model writes, and every claim links back to the source documents.

Why hybrid retrieval

The naive approach to a RAG system is “embed everything and run a vector search.” On its own, that tends to break down in production. Vector search is excellent at finding semantically similar documents and weak at matching specific named entities, such as an aircraft type or an airport identifier. BM25, a lexical ranking function, is the opposite: strong on exact terms, weak on paraphrase. Production retrieval needs both.
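To make the sparse side concrete, here is a minimal sketch of BM25 scoring in plain Python. It is an illustration, not the production index: the whitespace tokenizer, the parameter defaults (k1 = 1.5, b = 0.75), the sample narratives, and the registration "N123AB" are all assumptions made up for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase whitespace tokenizer (assumption: a real system would
    # use a proper analyzer with stemming and stop-word handling).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # exact-match only: unseen terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical narratives; "N123AB" is an invented registration.
docs = [
    "hydraulic pressure loss during flap retraction on aircraft N123AB",
    "crew reported loss of hydraulic fluid after takeoff",
    "runway incursion in low visibility at a regional airport",
]
scores = bm25_scores("N123AB hydraulic", docs)
```

The first narrative scores highest because it contains the rare exact token, the second matches only the common term, and the third scores zero. That exact-token behavior is what the sparse branch contributes and what an embedding model may blur.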

Our retrieval pipeline:

Question
   │
   ├──► dense (pgvector + BGE-large) ──► top 50
   ├──► sparse (BM25)                  ──► top 50
   │
   └──► merge + cross-encoder re-rank   ──► top 8
                            │
                            ▼
                Citation-grounded generation
                (Gemini 2.5 Flash for fast answers,
                 Claude / GPT for detailed analysis)
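
The flow in the diagram can be sketched roughly as follows. This is a sketch under stated assumptions, not the actual implementation: the function names (merge_candidates, retrieve), the callable signatures, and the stub searchers and document IDs in the demo are all hypothetical. The merge step here is a plain de-duplicating union, since the diagram does not specify how the two lists are combined.

```python
from typing import Callable

Search = Callable[[str, int], list[str]]      # (question, k) -> ranked doc IDs
PairScore = Callable[[str, str], float]       # (question, doc_id) -> relevance


def merge_candidates(dense: list[str], sparse: list[str]) -> list[str]:
    """Union of two ranked ID lists, de-duplicated, in order of first appearance."""
    seen: set[str] = set()
    merged: list[str] = []
    for doc_id in dense + sparse:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


def retrieve(question: str, dense_search: Search, sparse_search: Search,
             rerank_score: PairScore, k_each: int = 50,
             k_final: int = 8) -> list[str]:
    """Top-50 dense + top-50 sparse -> merged pool -> re-ranked top 8."""
    dense = dense_search(question, k_each)
    sparse = sparse_search(question, k_each)
    pool = merge_candidates(dense, sparse)
    ranked = sorted(pool, key=lambda d: rerank_score(question, d), reverse=True)
    return ranked[:k_final]


# Demo with stub searchers and made-up IDs and scores (all hypothetical).
def fake_dense(question: str, k: int) -> list[str]:
    return ["ASRS-3", "NTSB-1", "ASRS-7"][:k]


def fake_sparse(question: str, k: int) -> list[str]:
    return ["NTSB-1", "NTSB-9"][:k]


fake_scores = {"ASRS-3": 0.2, "NTSB-1": 0.9, "ASRS-7": 0.4, "NTSB-9": 0.7}
top = retrieve("hydraulic failure", fake_dense, fake_sparse,
               lambda q, d: fake_scores[d], k_final=2)
print(top)  # ['NTSB-1', 'NTSB-9']
```

Note that NTSB-9 reaches the final list even though only the sparse branch found it, which is the practical argument for running both retrievers before re-ranking.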

Why a re-ranker

The re-ranker (a cross-encoder, not a bi-encoder) sees the query and the candidate document together and scores them jointly. This is more expensive per call than vector search, which is why it runs only on the merged candidates rather than the full corpus. The precision improvement on the top 8 is large enough that it pays for itself: fewer retrievals, fewer hallucinations, better answers.
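
The interface difference is easiest to see in code: a cross-encoder takes (query, document) pairs and returns one joint score per pair, rather than comparing two independently computed vectors. The sketch below assumes a batch scorer with that shape. The name rerank, the sample narratives, and especially toy_predict are illustrative only; toy_predict is a token-overlap stand-in and not a real cross-encoder model.

```python
from typing import Callable, Sequence

# A batch scorer: list of (query, document_text) pairs -> one score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: dict[str, str], predict: PairScorer,
           top_k: int = 8) -> list[tuple[str, float]]:
    """Score every candidate jointly with the query and keep the best top_k."""
    ids = list(candidates)
    pairs = [(query, candidates[doc_id]) for doc_id in ids]
    scores = predict(pairs)  # the model sees query and document together
    ranked = sorted(zip(ids, scores), key=lambda item: item[1], reverse=True)
    return [(doc_id, float(score)) for doc_id, score in ranked[:top_k]]


def toy_predict(pairs: Sequence[tuple[str, str]]) -> list[float]:
    # Stand-in scorer (assumption): fraction of query tokens found in the
    # document. A real deployment would call a trained cross-encoder here.
    out = []
    for query, text in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(text.lower().split())
        out.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return out


# Hypothetical candidate pool, e.g. the merged dense + sparse results.
candidates = {
    "ASRS-1": "hydraulic failure during flap retraction",
    "NTSB-2": "bird strike on final approach",
    "ASRS-3": "flap retraction delayed after takeoff",
}
top = rerank("hydraulic failure flap retraction", candidates, toy_predict, top_k=2)
print(top)  # [('ASRS-1', 1.0), ('ASRS-3', 0.5)]
```

Because the scorer is injected, the same function works with any model that accepts a list of pairs. The cost grows linearly with the number of candidates, which is why the pool is capped before this step.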

Why citation grounding

Left unconstrained, an LLM will produce plausible-sounding answers whether or not the sources support them. The fix is structural: the model is constrained to write its answer with bracketed citation tokens, and each token must reference a document that actually exists in the retrieval set. The generated answer is then post-processed to validate the citations, and any answer that fails validation is rejected.

This is a small structural change with a large operational impact. It moves the system from “talking to a model that has ingested aviation knowledge” to “asking a model to summarize specific source documents.”
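
A minimal sketch of the validation step might look like this. The token format (an ID such as [ASRS-101] in square brackets), the function name validate_citations, and the rule that an answer with no citations at all is rejected are assumptions for illustration; the post only specifies bracketed tokens that must point into the retrieval set.

```python
import re

# Assumed token format: [ID], where ID is letters, digits, '-' or '_'.
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(answer: str,
                       retrieved_ids: set[str]) -> tuple[bool, list[str]]:
    """Return (is_valid, unknown_ids) for a generated answer.

    Valid means: at least one citation is present, and every cited ID
    belongs to the set of documents that were actually retrieved.
    """
    cited = CITATION.findall(answer)
    unknown = [doc_id for doc_id in cited if doc_id not in retrieved_ids]
    is_valid = bool(cited) and not unknown
    return is_valid, unknown


# Hypothetical retrieval set and answers.
retrieved = {"ASRS-101", "NTSB-202"}

grounded = ("Hydraulic pressure loss preceded the event [ASRS-101], "
            "and the final report confirms it [NTSB-202].")
fabricated = "The probable cause was crew fatigue [NTSB-999]."
uncited = "Runway incursions are more common in low visibility."

print(validate_citations(grounded, retrieved))    # (True, [])
print(validate_citations(fabricated, retrieved))  # (False, ['NTSB-999'])
print(validate_citations(uncited, retrieved))     # (False, [])
```

This check only confirms that each cited document exists in the retrieval set. It does not verify that the cited document actually supports the sentence it is attached to, which would need a separate entailment or spot-check step.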

What this is good at

“What are the leading causes of runway incursions for regional jets in low-visibility conditions?”
“Show me ASRS reports that match the pattern of sudden hydraulic failure during flap retraction.”
“What are the recurring training gaps that show up in cargo operations CRM reports?”

What it is not good at: real-time operational queries that need current schedule data — those go to the predictive and causal layers.

Closing

Vector search alone is not retrieval. It is one signal. Production-grade RAG over a regulated safety corpus requires hybrid retrieval, a real re-ranker, and structural citation grounding. The result is an assistant analysts trust enough to use — which is the only metric that matters.
