Skip the Vector Database: Why Elasticsearch Powers Better RAG Systems

A challenge to current RAG stack trends: at small scale, traditional search engines suffice, and specialized vector databases are unnecessary overhead.

The modern RAG stack has become a game of architectural Jenga. Everyone’s so busy adding specialized vector databases, embedding pipelines, and GPU-powered inference nodes that they’ve forgotten the first rule of systems design: simple scales, complex fails. If your document corpus fits on a laptop and your queries don’t require multi-hop reasoning across millions of nodes, that shiny new vector database isn’t solving problems; it’s creating them.

Elasticsearch and OpenSearch have been handling retrieval at scale since before “vector search” was a VC pitch deck buzzword. Yet there’s a growing realization among data engineers that the gap between traditional search and modern RAG isn’t as wide as the vendor marketing suggests. For most production RAG applications (those handling fewer than 10,000 documents with decent variance), you’re likely better off with BM25 and a small BERT model running on CPU than with a managed vector service burning through your cloud budget.

The Vector Database Tax

We’ve seen this movie before. A new technology emerges, vendors slap “AI-native” on the tin, and suddenly everyone believes they need specialized infrastructure to compete. It’s the same logic that convinces enterprises to drop $60,000 on H200 rigs just to experiment with local models they’ll never actually deploy.

Vector databases like Pinecone, Weaviate, or Milvus aren’t bad technologies. They’re simply overkill for the majority of RAG implementations currently being built. The typical corporate knowledge base (Confluence docs, PDFs, support tickets, and Slack archives) rarely exceeds the scale where traditional search breaks down. Yet developers instinctively reach for pgvector or dedicated vector stores because that’s what the tutorials show, not because their use case demands it.

The hidden costs stack up quickly. You need embedding models at index time and query time. You need GPU or API credits to generate those embeddings. You need to manage another database cluster with its own replication, backup, and monitoring concerns. And for what? To search a corpus that would fit comfortably in Elasticsearch’s cold storage?

BM25 Isn’t Dead, It Just Got Ignored

BM25 (Best Matching 25) has been the backbone of search engines like Elasticsearch and Lucene for decades. It works by scoring documents based on term frequency saturation (controlled by parameter k₁, typically 1.2-2.0) and length normalization (parameter b, typically 0.75). Rare terms get higher weights through Inverse Document Frequency (IDF), while common words like “the” get ignored.
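The scoring formula described above can be sketched in a few lines of Python (values are illustrative; the parameter defaults match Lucene’s k₁ = 1.2, b = 0.75):

```python
import math

def bm25_score(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    """Score one term in one document with the BM25 formula."""
    # rare terms get large IDF weights; ubiquitous terms approach zero
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # term-frequency saturation: repeated occurrences add less and less
    saturation = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * saturation

# A rare term (in 2 of 1,000 docs) outscores a common one (in 800 of 1,000)
rare = bm25_score(tf=3, df=2, doc_len=120, avg_len=100, n_docs=1000)
common = bm25_score(tf=3, df=800, doc_len=120, avg_len=100, n_docs=1000)
print(rare > common)  # True
```

Note that doubling the term frequency less than doubles the score; that saturation behaviour is exactly what k₁ controls.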

The algorithm has a fundamental limitation: it only matches the words you type, not what you meant. Search for “cardiac arrest” and BM25 won’t return documents about “heart failure” unless those exact words appear. This is exactly the gap vector embeddings were designed to fill: semantic similarity matching in high-dimensional vector space.

But here’s the uncomfortable truth: BM25 and vector search fail in opposite directions, which is why production systems increasingly use hybrid approaches. BM25 is fast, deterministic, requires no model inference, and runs on commodity hardware. Vector search requires GPU acceleration, adds latency, and produces scores that are harder to interpret. When you combine both using Reciprocal Rank Fusion (RRF), which OpenSearch and Elasticsearch support natively, you get the precision of keyword search with the recall of semantic search.
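RRF itself is simple enough to sketch in a few lines. A minimal Python version (k = 60 is the commonly used constant, and Elasticsearch’s default rank_constant; the document IDs are made up):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes
    1 / (k + rank) per document, and the sums decide the final order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_c", "doc_b"]    # lexical ranking
vector_hits = ["doc_b", "doc_a", "doc_d"]  # semantic ranking
fused = rrf_fuse([bm25_hits, vector_hits])
# doc_a and doc_b, ranked well by both lists, rise to the top
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.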

How BM25 and RAG Retrieve Information Differently

The difference is fundamental: BM25 matches surface form while vector search matches meaning. For RAG systems, you typically need both: exact matches for specific terminology and semantic matches for conceptual similarity.

The 10K Document Reality Check

A data engineer recently noted what many in the LLM space have missed: if your document set is relatively small (under ~10K documents) and has good variance, you can skip embeddings entirely and use TF-IDF or BM25. Even if you need deeper semantic similarity, a small BERT model (around 100MB in FP32) running on CPU within Elastic or OpenSearch handles the task perfectly well.
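For a corpus that size, the retrieval step can be sketched with nothing but the standard library. A toy TF-IDF retriever, assuming an illustrative three-document “knowledge base” (real systems would add stemming and better tokenization):

```python
import math
from collections import Counter

def build_tfidf(docs):
    """Tiny TF-IDF retriever: enough for a small, high-variance corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def vectorize(tokens):
        tf = Counter(tokens)
        # smoothed IDF so unseen query terms don't divide by zero
        return {t: tf[t] * math.log(n / df.get(t, 1) + 1) for t in tf}

    return [vectorize(tokens) for tokens in tokenized], vectorize

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "restart the payment service after a failed deployment",
    "rotate tls certificates on the ingress gateway",
    "payment retries fail when the service restarts mid-transaction",
]
doc_vectors, vectorize = build_tfidf(docs)
query = vectorize("payment service restart".split())
best = max(range(len(docs)), key=lambda i: cosine(query, doc_vectors[i]))
print(docs[best])  # the deployment-restart doc wins on exact term overlap
```

Notice the failure mode, too: “restarts” in the third document doesn’t match the query term “restart” at all, which is the semantic gap a small BERT model would close.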

This isn’t theoretical. Elasticsearch has supported text embedding since version 8.x, allowing you to run inference directly on the cluster. You don’t need a separate vector database; you need a search engine that understands both lexical and semantic retrieval. The distinction between “search engine” and “vector database” is largely a marketing construct at smaller scales: both are doing similarity search, just with different mathematical foundations.

The scale argument matters because vector databases shine when you’re dealing with billions of vectors or require complex multi-hop reasoning across dense knowledge graphs. For the typical enterprise RAG use case (“search our 5,000 technical docs and 20,000 support tickets”), you’re paying vector database prices for problems Elasticsearch solved years ago.

The “just use a vector database” crowd often overlooks a critical limitation: vector search struggles with scalar and relational data. If you need to query for “documents authored by John Doe in 2025” or filter by specific metadata criteria, pure vector search is inefficient. It requires hybrid architectures that combine vector similarity with keyword-based filtering.

Elasticsearch excels here because it was built for exactly this hybrid world. You can store structured metadata alongside vector embeddings in the same index, filtering by date ranges, author IDs, or document types before applying semantic similarity. Systems like ScyllaDB and Milvus have adopted similar dual-layer architectures, but Elastic had this figured out before the current AI boom.
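As a sketch of what that looks like in practice, here is how a pre-filtered kNN query body might be built for Elasticsearch 8.x (the index layout and the `author` and `published` field names are hypothetical; the filter is applied before the vector phase, so semantic search only runs over matching documents):

```python
def filtered_knn_query(query_vector, author, since, k=10):
    """Build an Elasticsearch 8.x kNN search body with structured
    pre-filters on hypothetical metadata fields."""
    return {
        "knn": {
            "field": "embedding",             # dense_vector field
            "query_vector": query_vector,     # from your embedding model
            "k": k,
            "num_candidates": max(100, k * 10),
            "filter": {
                "bool": {
                    "filter": [
                        {"term": {"author": author}},
                        {"range": {"published": {"gte": since}}},
                    ]
                }
            },
        }
    }

# With a live cluster you would pass this to the client, e.g.:
# es.search(index="docs", **filtered_knn_query(vec, "john.doe", "2025-01-01"))
body = filtered_knn_query([0.1, 0.2, 0.3], "john.doe", "2025-01-01")
```

The point is that “documents authored by John Doe in 2025, ranked by semantic similarity” is a single query here, not a two-system join.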

This aligns with the broader pattern of pragmatic AI implementation where the goal is solving business problems, not accumulating impressive-sounding infrastructure. If your RAG system needs to respect document permissions, filter by date ranges, or handle structured queries alongside semantic ones, a pure vector database forces you into complex workarounds that Elastic handles natively.

The Implementation: Elastic + Small BERT

Setting up semantic search in Elasticsearch doesn’t require a PhD in information retrieval. For local development, you can spin up a single node with Docker and enable the machine learning features:

docker run -p 9200:9200 -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.0

Upload a small BERT model (like sentence-transformers/all-MiniLM-L6-v2, roughly 80MB) directly to the cluster using the Eland library:

import tempfile

from elasticsearch import Elasticsearch
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel

es = Elasticsearch("http://localhost:9200")

# Trace the Hugging Face model and upload it to Elasticsearch
tm = TransformerModel(model_id="sentence-transformers/all-MiniLM-L6-v2",
                      task_type="text_embedding")
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path, config, vocab_path = tm.save(tmp_dir)
    ptm = PyTorchModel(es, "my_bert_model")
    ptm.import_model(model_path=model_path, config_path=None,
                     vocab_path=vocab_path, config=config)

Now you have vector search capabilities (dense embeddings, cosine similarity, the works) without managing a separate database. For production, you scale horizontally the same way you’ve always scaled Elastic: add nodes, let the cluster rebalance, and tune your shard counts.
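To put the uploaded model to work, you would define a mapping with a dense_vector field and an ingest pipeline that runs the model at index time. A sketch of the two request bodies (field, index, and pipeline names are illustrative; 384 dimensions matches all-MiniLM-L6-v2, and the commented client calls assume the local cluster from above):

```python
# Index mapping: BM25-searchable text and its embedding in the same index
mapping = {
    "properties": {
        "text": {"type": "text"},  # keeps lexical/BM25 search available
        "text_embedding": {
            "type": "dense_vector",
            "dims": 384,            # all-MiniLM-L6-v2 output size
            "index": True,
            "similarity": "cosine",
        },
    }
}

# Ingest pipeline: the uploaded model embeds documents as they're indexed
embed_pipeline = {
    "processors": [{
        "inference": {
            "model_id": "my_bert_model",  # ID used in the eland upload above
            "input_output": [
                {"input_field": "text", "output_field": "text_embedding"}
            ],
        }
    }]
}

# Against a running cluster, these would be applied roughly as:
# es.indices.create(index="kb-docs", mappings=mapping)
# es.ingest.put_pipeline(id="bert-embed", processors=embed_pipeline["processors"])
```

From there, every document indexed through the pipeline gets both a BM25-searchable text field and a query-ready embedding, with no external embedding service in the loop.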

Cost, Complexity, and the “Just Postgres” Alternative

The complexity tax of adding another database to your stack is real. Every new component is a potential point of failure, a security surface to audit, and a bill to pay. This mirrors the data warehouse cost optimization problem where organizations discover they’re paying premium prices for idle resources and unnecessary abstraction layers.

Some engineers advocate for pgvector as the middle ground, keeping vectors in PostgreSQL alongside your application data. This works until it doesn’t. Postgres plugins are convenient for prototyping, but when compared to native search solutions like Elastic or OpenSearch, they often hit performance walls under load. As one engineer noted, “Postgres plugins are normally shit tier in terms of perf when compared to native solutions.”

If you’re already running Elastic or OpenSearch for log aggregation or application search, adding RAG capabilities is a configuration change, not a new infrastructure project. You’re not “building a vector database”; you’re enabling a feature in a system you already maintain.

When You Actually Need a Specialized Vector DB

To be clear, there are legitimate use cases for dedicated vector databases. If you’re building a recommendation engine across 100 million products, doing real-time semantic search across billions of documents, or need specialized indexing like HNSW (Hierarchical Navigable Small World) with specific tuning parameters that Elastic doesn’t expose, then Pinecone, Milvus, or Weaviate make sense.

Similarly, if you’re implementing Graph RAG, where you need to traverse complex entity relationships across a knowledge graph with multi-hop reasoning, a graph database like Neo4j combined with vector search provides capabilities that pure Elastic can’t match. Systems like Microsoft’s GraphRAG or Neo4j’s implementation specifically address these limitations.

But for the 80% of RAG applications that are essentially “smart document search”, you’re solving a solved problem with unnecessarily complex tools.

The Verdict

The RAG stack doesn’t need to be a Frankenstein’s monster of specialized databases, embedding services, and orchestration layers. Elasticsearch and OpenSearch provide BM25 keyword search, vector similarity search, hybrid RRF ranking, and structured filtering in a single, battle-tested platform.

Before you spin up that managed vector database and start piping embeddings across your network, ask yourself: do you have more than 10,000 documents? Do you need pure semantic matching without any keyword precision? Do you have the operational budget to monitor another distributed system?

If the answer to those questions is no, then you don’t have a vector database problem. You have a search configuration problem. And that’s something Elastic solved years ago.
