Pull-quote: “Pure vector retrieval is the most common production-grade RAG mistake. Pure BM25 is the second most common.”
Why this matters
A pattern repeats across RAG projects that go wrong: someone embeds the corpus, runs vector search, and ships. The system works in demos and disappoints in production. The fix is architectural: hybrid retrieval.
The components
Query
│
├──► Dense (vector) — pgvector / Weaviate / Qdrant + an embedding model
│
├──► Sparse (BM25) — Postgres FTS / Elasticsearch / OpenSearch
│
├──► Optional filters — date range, source, entity tags
│
└──► Merge (RRF or weighted) ──► Cross-encoder re-rank ──► Top-K
                                                             │
                                                             ▼
                                               Citation-grounded generation
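The stage order in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: `Doc`, `hybrid_retrieve`, and the stage signatures are hypothetical names, every stage is injected as a plain callable, and the toy stages at the bottom stand in for a real vector store, a BM25 index, a proper merge such as RRF, and a cross-encoder model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    year: int
    source: str


# Hypothetical stage signatures. Real stages would wrap a vector store,
# a BM25 index, and a cross-encoder model.
Retriever = Callable[[str, Sequence[Doc], int], list[Doc]]
Merger = Callable[[list[list[Doc]]], list[Doc]]
PairScorer = Callable[[str, Doc], float]


def hybrid_retrieve(
    query: str,
    corpus: Sequence[Doc],
    keep: Callable[[Doc], bool],
    dense: Retriever,
    sparse: Retriever,
    merge: Merger,
    score_pair: PairScorer,
    fetch_k: int = 50,
    top_k: int = 8,
) -> list[Doc]:
    # 1. Filters shrink the candidate set before any ranking happens.
    candidates = [doc for doc in corpus if keep(doc)]
    # 2. Dense and sparse retrieval each return their own ranked list.
    ranked_lists = [
        dense(query, candidates, fetch_k),
        sparse(query, candidates, fetch_k),
    ]
    # 3. Merge the lists into one deduplicated candidate list.
    merged = merge(ranked_lists)
    # 4. Re-rank by scoring the query and each candidate together.
    reranked = sorted(merged, key=lambda doc: score_pair(query, doc), reverse=True)
    # 5. Only the top-K reach generation.
    return reranked[:top_k]


# Toy stand-ins (assumptions for illustration) so the sketch runs end to end.
def word_overlap(query: str, doc: Doc) -> float:
    query_words = set(query.lower().split())
    return float(len(query_words & set(doc.text.lower().split())))


def toy_retriever(query: str, docs: Sequence[Doc], k: int) -> list[Doc]:
    return sorted(docs, key=lambda doc: word_overlap(query, doc), reverse=True)[:k]


def concat_dedupe(ranked_lists: list[list[Doc]]) -> list[Doc]:
    # Placeholder merge: concatenate and drop duplicates, keeping first occurrence.
    return list(dict.fromkeys(doc for ranking in ranked_lists for doc in ranking))


corpus = [
    Doc("a", "Boeing 737 inspection report", 2024, "faa"),
    Doc("b", "Boeing 737 delivery figures", 2019, "press"),
    Doc("c", "Airbus A320 inspection report", 2024, "faa"),
]
results = hybrid_retrieve(
    "Boeing 737 report",
    corpus,
    keep=lambda doc: doc.year == 2024,
    dense=toy_retriever,
    sparse=toy_retriever,
    merge=concat_dedupe,
    score_pair=word_overlap,
)
print([doc.doc_id for doc in results])  # ['a', 'c']
```

In a real deployment the filter would normally be pushed down into the index query rather than applied as a Python list comprehension; the sketch only fixes the order of the stages.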
Why each piece matters
- Vector is excellent at semantic similarity: finding documents that are about the same topic in different words. It is weak at exact matching: named entities, exact terms, IDs, dates.
- BM25 is the opposite: excellent at exact terms and named entities, weaker when the query and the document use different words for the same thing.
- Filters: when the question is bounded (“just look at 2024 reports about Boeing 737”), filters dramatically reduce the candidate set before ranking.
- Merge: Reciprocal Rank Fusion (RRF) is a clean default because it uses only ranks, so dense and sparse scores never need to share a scale. Weighted merges work only when the two score distributions are calibrated or normalized first, since raw cosine similarities and BM25 scores are not directly comparable.
- Cross-encoder re-rank: sees the query and the candidate document together and scores them jointly. It is more expensive per candidate than bi-encoder vector search, so it runs only on the merged candidate list, and the precision improvement on the top-K is large enough to pay for itself.
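To make the merge step concrete, here is a small sketch of both options in Python. `rrf_merge` scores each document as the sum of 1 / (k + rank) over the lists it appears in; k = 60 is a commonly used default, not something the article prescribes. `weighted_merge` is one possible reading of "calibrated scores": it assumes min-max normalization per retriever is good enough, which is an illustrative assumption rather than a recommendation. The function names and sample scores are hypothetical.

```python
from collections import defaultdict


def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion over lists of document ids (best first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so two retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_merge(
    dense_scores: dict[str, float],
    sparse_scores: dict[str, float],
    alpha: float = 0.5,
) -> list[str]:
    """Weighted sum of normalized scores; alpha is the weight on dense."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Hypothetical retrieval results for one query.
dense_ranking = ["d1", "d2", "d3"]
sparse_ranking = ["d3", "d1", "d4"]
print(rrf_merge([dense_ranking, sparse_ranking]))  # ['d1', 'd3', 'd2', 'd4']

dense_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.5}    # cosine-like scale
sparse_scores = {"d3": 12.0, "d1": 7.0, "d4": 2.0}  # BM25-like scale
print(weighted_merge(dense_scores, sparse_scores))  # ['d1', 'd3', 'd2', 'd4']
```

RRF needs no tuning beyond k, which is why it makes a reasonable starting point. The weighted variant adds `alpha` as a knob, and it is only as good as the normalization behind it.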
What changes when you do this right
- Hallucination rate drops. The model has better evidence to ground in.
- Citation precision goes up. The cited documents actually support the claim.
- Edge cases (rare entity queries, exact-quote queries) work properly.
- Generation latency stays low because the model only sees the top-K (typically 6–10), not the top-100.
Common mistakes
- No re-ranker. Top-50 from vector + top-50 from BM25 with RRF is a starting point, but without a re-ranker the top-K still contains noise.
- No filtering. Filtering before retrieval is essentially free if the metadata fields you filter on are indexed.
- No evaluation. Without a golden Q&A dataset and grounding scoring, you have no way to compare retrieval architectures.
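A golden dataset can be very small and still be useful for comparing retrieval architectures. The sketch below assumes a hypothetical format where each golden item pairs a query with the set of document ids that should be retrieved, and it computes two standard retrieval metrics, recall@k and mean reciprocal rank. It covers the retrieval side only; grounding scoring of generated answers is out of scope here. All names and the sample data are illustrative.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    retrieve: Callable[[str], list[str]],
    golden: list[tuple[str, set[str]]],
    k: int = 10,
) -> dict[str, float]:
    """Average recall@k and MRR of one retriever over the golden set."""
    if not golden:
        raise ValueError("golden dataset is empty")
    recalls, rranks = [], []
    for query, relevant in golden:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rranks.append(reciprocal_rank(retrieved, relevant))
    return {"recall@k": sum(recalls) / len(golden), "mrr": sum(rranks) / len(golden)}


# Hypothetical golden set and two canned retrievers to compare.
golden = [("q1", {"a"}), ("q2", {"b", "c"})]
vector_only = {"q1": ["a", "x"], "q2": ["x", "b"]}
hybrid = {"q1": ["a", "x"], "q2": ["b", "c"]}

print(evaluate(lambda q: vector_only[q], golden, k=2))  # {'recall@k': 0.75, 'mrr': 0.75}
print(evaluate(lambda q: hybrid[q], golden, k=2))       # {'recall@k': 1.0, 'mrr': 1.0}
```

Running the same golden set against each candidate architecture turns "hybrid feels better" into a number that can be tracked across changes.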
Closing
Pure vector retrieval is the most common production-grade RAG mistake. Hybrid retrieval — vector + sparse + filters + re-rank — is the boring, reliable, production answer. Every Zorost RAG system runs this architecture.


