<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BM25 Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<atom:link href="https://zorost.com/tag/bm25/feed/" rel="self" type="application/rss+xml" />
	<link>https://zorost.com/tag/bm25/</link>
	<description>Production AI systems for aviation, manufacturing, pharma, government, finance, freight, and geopolitical intelligence.</description>
	<lastBuildDate>Wed, 20 May 2026 18:52:39 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=7.0</generator>

<image>
	<url>https://zorost.com/wp-content/uploads/2025/08/ZOROST-Intel-Logo3_512-150x150.png</url>
	<title>BM25 Archives - Zorost Intelligence | AI, Cloud &amp; Data Experts</title>
	<link>https://zorost.com/tag/bm25/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">81719879</site>	<item>
		<title>Hybrid Retrieval: Why Vector Alone Isn&#8217;t Enough</title>
		<link>https://zorost.com/hybrid-retrieval-vector-alone-not-enough/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 17 Feb 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Agentic AI Engineering]]></category>
		<category><![CDATA[BM25]]></category>
		<category><![CDATA[Evaluation]]></category>
		<category><![CDATA[Hybrid Retrieval]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Vector Search]]></category>
		<guid isPermaLink="false">https://zorost.com/hybrid-retrieval-vector-alone-not-enough/</guid>

					<description><![CDATA[<p>Vector search is excellent at semantic similarity and bad at named entities. BM25 is the opposite. Production-grade retrieval is hybrid — and the architecture decisions matter.</p>
<p>The post <a href="https://zorost.com/hybrid-retrieval-vector-alone-not-enough/">Hybrid Retrieval: Why Vector Alone Isn&#8217;t Enough</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Pure vector retrieval is the most common production-grade RAG mistake. Pure BM25 is the second most common.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>A pattern repeats in every RAG project that goes wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is an architectural change: <strong>hybrid retrieval</strong>.</p>
<h4>The components</h4>
<pre><code>Query
  │
  ├──► Dense (vector)   — pgvector / Weaviate / Qdrant + an embedding model
  │
  ├──► Sparse (BM25)    — Postgres FTS / Elasticsearch / OpenSearch
  │
  ├──► Optional filters — date range, source, entity tags
  │
  └──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                                │
                                                                ▼
                                                Citation-grounded generation</code></pre>
<h4>Why each piece matters</h4>
<ul>
<li><strong>Vector</strong> is excellent at <em>semantic similarity</em> — finding documents that are about the same topic in different words. It is bad at <em>named entities</em> — exact terms, IDs, dates.</li>
<li><strong>BM25</strong> is the opposite — excellent at named entities, weaker on semantic similarity.</li>
<li><strong>Filters</strong> — when the question is bounded (&#8220;just look at 2024 reports about Boeing 737&#8221;), filters dramatically reduce the candidate set before ranking.</li>
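<li><strong>Merge</strong> — Reciprocal Rank Fusion (RRF) is a clean default because it combines ranks rather than raw scores. Weighted merges work only when the dense and sparse scores are calibrated to a comparable scale.</li>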
<li><strong>Cross-encoder re-rank</strong> — sees the query and the candidate document together and scores them jointly. More expensive than bi-encoder vector search, but the precision improvement on the top-K is large enough to pay for itself.</li>
</ul>
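<p>To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of document IDs; the function name, the sample IDs, and the constant <code>k=60</code> (a commonly used default) are illustrative assumptions, not values taken from a specific system.</p>

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, never raw scores.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

<p>Documents that rank well in both lists rise to the top, and the dense and sparse scores never need to be put on a common scale.</p>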
<h4>What changes when you do this right</h4>
<ul>
<li>Hallucination rate drops. The model has better evidence to ground in.</li>
<li>Citation precision goes up. The cited documents actually support the claim.</li>
<li>Edge cases (rare entity queries, exact-quote queries) work properly.</li>
<li>Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.</li>
</ul>
<h4>Common mistakes</h4>
<ul>
<li><strong>No re-ranker.</strong> Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.</li>
<li><strong>No filtering.</strong> Filtering before retrieval is essentially free if your data is properly indexed.</li>
<li><strong>No evaluation.</strong> Without a golden Q&amp;A dataset and grounding scoring, you have no way to compare retrieval architectures.</li>
</ul>
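<p>The evaluation point can be sketched in a few lines. The example below assumes a golden dataset of questions paired with the IDs of the documents that answer them, and compares retrievers on recall@k and mean reciprocal rank. The dataset, the document IDs, and the canned retriever are hypothetical placeholders, and the sketch covers retrieval quality only, not grounding scoring of generated answers.</p>

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k.intersection(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(retriever, golden_set, k=8):
    """Average recall@k and MRR of a retriever over a golden dataset."""
    if not golden_set:
        raise ValueError("golden_set must not be empty")
    recalls, rrs = [], []
    for question, relevant in golden_set:
        retrieved = retriever(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    n = len(golden_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(rrs) / n}


# Hypothetical golden set: (question, set of document IDs that answer it).
golden_set = [
    ("2024 reports about Boeing 737", {"r1", "r7"}),
    ("exact part number lookup", {"r3"}),
]

# Stand-in retriever returning canned rankings; swap in a real pipeline here.
canned = {
    "2024 reports about Boeing 737": ["r7", "r2", "r1"],
    "exact part number lookup": ["r5", "r3", "r9"],
}
metrics = evaluate(canned.get, golden_set, k=2)
print(metrics)
```

<p>Running the same golden set against vector-only, BM25-only, and hybrid retrievers gives a like-for-like comparison instead of an impression from demos.</p>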
<h4>Closing</h4>
<p>Pure vector retrieval is the most common production-grade RAG mistake. Hybrid retrieval — vector + sparse + filters + re-rank — is the boring, reliable, production answer. Every Zorost RAG system runs this architecture.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24296</post-id>	</item>
		<item>
		<title>A Retrieval Engine over the World&#8217;s Aviation Safety Corpus</title>
		<link>https://zorost.com/retrieval-engine-aviation-safety-corpus/</link>
		
		<dc:creator><![CDATA[Zorost Intelligence]]></dc:creator>
		<pubDate>Tue, 13 Jan 2026 09:00:00 +0000</pubDate>
				<category><![CDATA[Aviation Intelligence]]></category>
		<category><![CDATA[AeroFarr]]></category>
		<category><![CDATA[BM25]]></category>
		<category><![CDATA[Hybrid Retrieval]]></category>
		<category><![CDATA[RAG]]></category>
		<category><![CDATA[Safety]]></category>
		<category><![CDATA[Vector Search]]></category>
		<guid isPermaLink="false">https://zorost.com/retrieval-engine-aviation-safety-corpus/</guid>

					<description><![CDATA[<p>247,000 public-domain aviation safety reports — indexed with hybrid retrieval, re-ranking, and citation-grounded generation. Here is what we learned designing it for production.</p>
<p>The post <a href="https://zorost.com/retrieval-engine-aviation-safety-corpus/">A Retrieval Engine over the World&#8217;s Aviation Safety Corpus</a> appeared first on <a href="https://zorost.com">Zorost Intelligence | AI, Cloud &amp; Data Experts</a>.</p>
]]></description>
										<content:encoded><![CDATA[<blockquote>
<p><strong>Pull-quote:</strong> &#8220;Vector search alone is not retrieval. It is one signal among several.&#8221;</p>
</blockquote>
<h4>Why this matters</h4>
<p>Aviation safety knowledge sits in two enormous public-domain corpora: the U.S. NTSB accident reports and the NASA ASRS voluntary safety reports. Together, that&#8217;s <strong>247,000+ documents</strong> of structured incident narratives. Pilots, controllers, and operations engineers have written them under the assumption that they would be searched, cross-referenced, and learned from.</p>
<p>Most platforms reduce this to keyword search. Better platforms add full-text search. The frontier is <strong>citation-grounded retrieval-augmented generation</strong> — the assistant retrieves, the model writes, every claim links back to the source documents.</p>
<h4>Why hybrid retrieval</h4>
<p>The naive approach to a RAG system is &#8220;embed everything and run a vector search.&#8221; It does not work in production. Vector search is excellent at finding <em>semantically similar</em> documents and bad at finding <em>specifically named</em> entities. BM25 is the opposite. Production retrieval needs both.</p>
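<p>To show why the sparse side handles specifically named entities well, here is a minimal sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the common defaults <code>k1=1.5</code> and <code>b=0.75</code>, and a non-negative IDF variant; the sample narratives and the tail number are invented for illustration and are not drawn from the corpus.</p>

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Uses whitespace tokenization and the non-negative IDF variant
    ln(1 + (N - n + 0.5) / (n + 0.5)); both are simplifying assumptions.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Invented narratives; "N512ZX" is a hypothetical tail number.
docs = [
    "hydraulic pressure loss reported after takeoff",
    "crew reported N512ZX diverted after hydraulic warning",
    "fatigue and scheduling pressure in regional operations",
]
scores = bm25_scores("N512ZX hydraulic", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

<p>The rare exact token dominates the score because its IDF is high, which is the behavior an embedding model tends to blur.</p>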
<p>Our retrieval pipeline:</p>
<pre><code>Question
   │
   ├──► dense (pgvector + BGE-large) ──► top 50
   ├──► sparse (BM25)                  ──► top 50
   │
   └──► merge + cross-encoder re-rank   ──► top 8
                            │
                            ▼
                Citation-grounded generation
                (Gemini 2.5 Flash for fast answers,
                 Claude / GPT for detailed analysis)</code></pre>
<h4>Why a re-ranker</h4>
<p>The re-ranker (a cross-encoder, not a bi-encoder) sees the query and the candidate document together and scores them jointly. This is more expensive per call than vector search, but the precision improvement on the top-8 is large enough that it pays for itself — fewer retrievals, fewer hallucinations, better answers.</p>
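<p>The shape of the re-rank step can be sketched as follows. The scoring function here is a toy token-overlap placeholder, not a cross-encoder; it exists only so the example runs without a model. In a real system <code>score_pair</code> would call a cross-encoder that reads the query and the document together. All names and sample texts are hypothetical.</p>

```python
def rerank(query, candidates, score_pair, top_n=8):
    """Re-order candidates by a joint (query, document) score and keep top_n.

    score_pair stands in for a cross-encoder: it receives the query and the
    document text together and returns a single relevance score.
    """
    scored = [(score_pair(query, text), doc_id) for doc_id, text in candidates.items()]
    # Highest score first; ties broken by doc ID so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


def toy_overlap_score(query, text):
    """Placeholder scorer: fraction of query tokens present in the text.

    NOT a cross-encoder; it only makes the sketch runnable without a model.
    """
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens.intersection(text_tokens)) / len(query_tokens)


# Hypothetical merged candidate pool (dense and sparse results, deduplicated).
candidates = {
    "r1": "hydraulic failure during flap retraction on climb",
    "r2": "bird strike on final approach",
    "r3": "flap asymmetry warning after hydraulic leak",
}
top = rerank("hydraulic failure flap retraction", candidates, toy_overlap_score, top_n=2)
print(top)
```

<p>Because the expensive scorer only runs on the merged candidate pool rather than the whole corpus, its per-call cost stays bounded.</p>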
<h4>Why citation grounding</h4>
<p>The default mode of an LLM is to <strong>fabricate plausible-sounding answers</strong>. The fix is structural: the model is constrained to write its answer with bracketed citation tokens, and the citation tokens must reference documents that actually exist in the retrieval set. The generated answer is then post-processed to validate the citations, and any answer that fails validation is rejected.</p>
<p>This is a small structural change with a large operational impact. It moves the system from &#8220;talking to a model that has ingested aviation knowledge&#8221; to &#8220;asking a model to summarize specific source documents.&#8221;</p>
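<p>A minimal sketch of that validation step, assuming citations appear as bracketed document IDs such as <code>[ASRS-001]</code>. The ID format, the function name, and the rule that an answer must cite at least one document are assumptions made for illustration.</p>

```python
import re

# Assumed citation format: a bracketed ID made of letters, digits, "_" or "-".
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def validate_citations(answer, retrieved_ids):
    """Return (is_valid, cited_ids, unknown_ids) for a generated answer.

    An answer passes only if it cites at least one document and every
    bracketed citation refers to a document in the retrieval set.
    """
    cited = CITATION_PATTERN.findall(answer)
    unknown = [doc_id for doc_id in cited if doc_id not in retrieved_ids]
    is_valid = bool(cited) and not unknown
    return is_valid, cited, unknown


retrieved_ids = {"ASRS-001", "NTSB-042"}  # hypothetical IDs from the retrieval set

good = "Flap retraction preceded the pressure loss [ASRS-001][NTSB-042]."
bad = "The crew ignored a known defect [NTSB-999]."

accepted = validate_citations(good, retrieved_ids)
rejected = validate_citations(bad, retrieved_ids)
print(accepted)
print(rejected)
```

<p>This check only confirms that cited documents exist in the retrieval set. Whether a cited document actually supports the claim is a separate question that needs grounding evaluation.</p>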
<h4>What this is good at</h4>
<ul>
<li>&#8220;What are the leading causes of runway incursions for regional jets in low-visibility conditions?&#8221;</li>
<li>&#8220;Show me ASRS reports that match the pattern of sudden hydraulic failure during flap retraction.&#8221;</li>
<li>&#8220;What are the recurring training gaps that show up in cargo operations CRM reports?&#8221;</li>
</ul>
<p>What it is <em>not</em> good at: real-time operational queries that need current schedule data — those go to the predictive and causal layers.</p>
<h4>Closing</h4>
<p>Vector search alone is not retrieval. It is one signal. Production-grade RAG over a regulated safety corpus requires hybrid retrieval, a real re-ranker, and structural citation grounding. The result is an assistant analysts trust enough to use — which is the only metric that matters.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">24284</post-id>	</item>
	</channel>
</rss>
