Pull-quote: “Pure vector retrieval is the most common production-grade RAG mistake. Pure BM25 is the second most common.”

Why this matters

A pattern repeats in every RAG project that goes wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is architectural: hybrid retrieval.

The components

Query
  │
  ├──► Dense (vector)   — pgvector / Weaviate / Qdrant + an embedding model
  │
  ├──► Sparse (BM25)    — Postgres FTS / Elasticsearch / OpenSearch
  │
  ├──► Optional filters — date range, source, entity tags
  │
  └──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                                │
                                                                ▼
                                                Citation-grounded generation
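
The flow in the diagram can be sketched as a small orchestration function. This is a minimal illustration, not a reference implementation: `Doc`, `hybrid_retrieve`, and the `toy_*` stages are hypothetical names, and the toy stages use plain word overlap as a stand-in for a real vector store, BM25 index, merge strategy, and cross-encoder.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    year: int


# Every stage is injected, so any store or model can sit behind it.
Retriever = Callable[[str, Sequence[Doc], int], list[str]]  # ranked doc ids
Merger = Callable[[list[list[str]]], list[str]]             # fused doc ids
Reranker = Callable[[str, list[Doc]], list[Doc]]            # best first


def hybrid_retrieve(
    query: str,
    corpus: Sequence[Doc],
    *,
    keep: Callable[[Doc], bool],
    dense: Retriever,
    sparse: Retriever,
    merge: Merger,
    rerank: Reranker,
    fanout: int = 50,
    top_k: int = 8,
) -> list[Doc]:
    # 1. Filter first, so both retrievers only see the bounded candidate set.
    candidates = [d for d in corpus if keep(d)]
    by_id = {d.doc_id: d for d in candidates}
    # 2. Run dense and sparse retrieval independently.
    dense_ids = dense(query, candidates, fanout)
    sparse_ids = sparse(query, candidates, fanout)
    # 3. Merge the two rankings, then re-rank the merged pool jointly with the query.
    merged_ids = merge([dense_ids, sparse_ids])
    reranked = rerank(query, [by_id[i] for i in merged_ids])
    # 4. Only the top-K reach the generator.
    return reranked[:top_k]


# --- Toy stand-ins (assumptions for illustration only) ---
def word_overlap(query: str, doc: Doc) -> int:
    return len(set(query.lower().split()) & set(doc.text.lower().split()))


def toy_dense(query: str, docs: Sequence[Doc], n: int) -> list[str]:
    # No real embeddings here: returns candidates in corpus order.
    return [d.doc_id for d in docs][:n]


def toy_sparse(query: str, docs: Sequence[Doc], n: int) -> list[str]:
    ranked = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)
    return [d.doc_id for d in ranked if word_overlap(query, d) > 0][:n]


def toy_merge(rankings: list[list[str]]) -> list[str]:
    # Order-preserving union; a real system would use RRF or a weighted merge.
    return list(dict.fromkeys(i for ranking in rankings for i in ranking))


def toy_rerank(query: str, docs: list[Doc]) -> list[Doc]:
    return sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)


corpus = [
    Doc("a", "Boeing 737 MAX delivery report", 2024),
    Doc("b", "Airbus A320 production outlook", 2024),
    Doc("c", "Boeing 737 incident summary", 2019),
    Doc("d", "Narrow-body jet deliveries rose", 2024),
]
results = hybrid_retrieve(
    "boeing 737 report", corpus,
    keep=lambda d: d.year == 2024,
    dense=toy_dense, sparse=toy_sparse, merge=toy_merge, rerank=toy_rerank,
    top_k=2,
)
print([d.doc_id for d in results])
```

Passing each stage in as a callable keeps the orchestration independent of any particular database or model, which makes it easier to swap one component and measure the effect.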

Why each piece matters

  • Vector is excellent at semantic similarity, finding documents that are about the same topic in different words. It is weak at exact matches such as named entities, specific terms, IDs, and dates.
  • BM25 is the opposite: excellent at exact terms and named entities, weaker on semantic similarity.
  • Filters: when the question is bounded (“just look at 2024 reports about Boeing 737”), filters dramatically reduce the candidate set before ranking.
  • Merge: Reciprocal Rank Fusion (RRF) is a clean default because it combines ranks rather than raw scores, so the two retrievers do not need to share a score scale. Weighted merges work only when the scores are calibrated to a comparable scale.
  • Cross-encoder re-rank: sees the query and the candidate document together and scores them jointly. More expensive than bi-encoder vector search, but the precision improvement on the top-K is large enough to pay for itself.
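
To make the merge step concrete, here is a minimal sketch of both options. The function names are hypothetical; `k=60` is the constant conventionally used with RRF, and min-max normalization is only a crude stand-in for real score calibration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]. An assumption: a simple proxy for calibration."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_merge(
    dense_scores: dict[str, float],
    sparse_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[str]:
    """Combine normalized scores: alpha * dense + (1 - alpha) * sparse."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    combined = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in set(dense_n) | set(sparse_n)
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


dense_ranking = ["d1", "d2", "d3"]
sparse_ranking = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

RRF never looks at the raw scores, so a cosine similarity near 0.9 and a BM25 score near 12 can be fused without any rescaling. The weighted variant has to normalize first, and its result is only as good as that normalization.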

What changes when you do this right

  • Hallucination rate drops. The model has better evidence to ground in.
  • Citation precision goes up. The cited documents actually support the claim.
  • Edge cases (rare entity queries, exact-quote queries) work properly.
  • Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.

Common mistakes

  • No re-ranker. Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.
  • No filtering. Filtering before retrieval is essentially free if your data is properly indexed.
  • No evaluation. Without a golden Q&A dataset and a grounding score, you have no way to compare retrieval architectures.
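
A retrieval comparison on a golden set can start very small. The sketch below is one possible harness, under stated assumptions: the golden set, the two fake retrievers, and all function names are invented for illustration (they are not measured results), and it scores retrieval only with recall@k and mean reciprocal rank. Grounding of the generated answer would need a separate check.

```python
from statistics import mean
from typing import Callable

# Hypothetical golden set: (question, ids of documents that answer it).
GoldenSet = list[tuple[str, set[str]]]


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    retriever: Callable[[str], list[str]], golden: GoldenSet, k: int = 8
) -> dict[str, float]:
    recalls, rrs = [], []
    for question, relevant in golden:
        retrieved = retriever(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rrs.append(reciprocal_rank(retrieved, relevant))
    return {"recall@k": mean(recalls), "mrr": mean(rrs)}


# Toy data, invented to exercise the harness.
golden: GoldenSet = [("q1", {"a", "b"}), ("q2", {"c"})]
fake_vector_only = {"q1": ["a", "x", "y"], "q2": ["x", "y", "z"]}
fake_hybrid = {"q1": ["a", "b", "x"], "q2": ["x", "c", "y"]}

vector_report = evaluate(fake_vector_only.__getitem__, golden, k=3)
hybrid_report = evaluate(fake_hybrid.__getitem__, golden, k=3)
print("vector-only:", vector_report)
print("hybrid:     ", hybrid_report)
```

Running the same golden set through every candidate architecture gives a like-for-like comparison, which is the point of building the dataset before tuning the retriever.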

Closing

Pure vector retrieval is the most common production-grade RAG mistake. Hybrid retrieval — vector + sparse + filters + re-rank — is the boring, reliable, production answer. Every Zorost RAG system runs this architecture.