AI Development

How We Built a Production RAG System for an Enterprise Knowledge Base

14 Min Read

A deep-dive into the architecture, tradeoffs, and lessons learned from deploying a Retrieval-Augmented Generation system that handles thousands of internal queries daily.

Introduction

When a mid-sized enterprise comes to us with 50,000 internal documents (policies, SOPs, product manuals, HR guides) and asks, 'Can our employees just ask questions and get answers?', the architecture decision you make in week one shapes everything that follows.

This post walks through exactly how we designed and shipped a production RAG (Retrieval-Augmented Generation) system for an enterprise client. We cover the choices we made, the ones we regret, and what we'd do differently on the next project.

Why RAG Over Fine-Tuning?

The first question we always address with clients is: 'Should we fine-tune a model on our data instead?' Fine-tuning is appealing because it feels permanent: the knowledge is baked into the model. But for enterprise knowledge bases, it has serious drawbacks.

First, enterprise documents change constantly. Policy updates, product revisions, and org changes mean fine-tuning cannot keep up with a living knowledge base without expensive retraining cycles. Second, fine-tuned models hallucinate with confidence: they interpolate between training examples and produce plausible-sounding but wrong answers with no citation.

RAG solves both problems: retrieval is always live against your latest document store, and every answer can be grounded with a source reference the user can verify. For compliance-sensitive environments, that auditability is non-negotiable.

A further evolution is the shift from simple RAG to agentic RAG. In 2026, modern systems don't just retrieve once and generate: they reason autonomously and, if the first search fails or returns insufficient context, reformulate the query and retry. This makes them far more robust for complex, multi-step questions where a single retrieval pass would miss the mark.
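The retrieve-judge-retry loop above can be sketched as a small control function. This is a hedged illustration, not our production code: the `retrieve`, `judge`, and `reformulate` callables are hypothetical stand-ins (in practice the judge and reformulator would be LLM calls).

```python
# Sketch of an agentic retrieval loop: retrieve, judge sufficiency,
# and reformulate the query on failure. max_retries bounds latency/cost.
def agentic_retrieve(query, retrieve, judge, reformulate, max_retries=2):
    """retrieve/judge/reformulate are caller-supplied callables."""
    current = query
    for _ in range(max_retries + 1):
        chunks = retrieve(current)
        if judge(current, chunks):              # e.g. LLM "is this enough?" check
            return chunks, current
        current = reformulate(current, chunks)  # e.g. LLM rewrites the query
    return chunks, current                      # fall back to the last attempt
```

The key design point is the bounded retry budget: an unbounded loop turns a single bad query into runaway LLM spend.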

💡When Fine-Tuning Makes Sense

Fine-tuning is the right call when you need to change the model's behavior or tone, not when you need it to know specific facts. Think: teaching a model to respond in your brand voice, not teaching it your product catalog.

System Architecture

[Figure: system architecture diagram]

Our production architecture uses three decoupled microservices: an Ingestion Service, an Embeddings Service, and a Query Orchestration Service. All three are containerized and deployed on AWS ECS.

The Ingestion Service handles document upload, format normalization (PDF, DOCX, HTML, Confluence exports), metadata extraction, and chunking before publishing chunks to an SQS queue. The Embeddings Service consumes from that queue, calls the embeddings model, and upserts vectors into Pinecone with namespace isolation per department.

The Query Orchestration Service receives user queries via a REST API, embeds the query, retrieves the top-k relevant chunks, constructs a prompt with context, calls the LLM, and returns a structured response with source citations. All services are stateless; session history is managed by the client application.

query_orchestrator.py
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone


def handle_query(query: str, namespace: str, user_role: str) -> dict:
    # Connect to the existing index; the embedding model must match the one
    # used at ingestion time (text-embedding-3-large in our pipeline).
    vectorstore = Pinecone.from_existing_index(
        index_name="enterprise-kb",
        embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
        namespace=namespace,
    )
    # Restrict retrieval to chunks the caller is allowed to see.
    retriever = vectorstore.as_retriever(
        search_kwargs={"k": 6, "filter": {"access_level": user_role}}
    )
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
    )
    result = qa_chain({"query": query})
    return {
        "answer": result["result"],
        "sources": [doc.metadata["source"] for doc in result["source_documents"]],
    }

Chunking & Embedding Strategy

Chunking is where most RAG projects get into trouble. Arbitrary character-based splits (e.g., 'every 500 characters') destroy semantic coherence. A sentence that begins in chunk 12 and ends in chunk 13 becomes useless in retrieval.

We use a hierarchical chunking strategy: documents are first split at natural semantic boundaries (headings, paragraphs, list items) using a custom parser. Chunks are then sized to approximately 300–400 tokens with a 50-token overlap to preserve context at boundaries. For dense technical documents like API references, we use smaller chunks (150 tokens) with higher overlap.
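The two-pass strategy can be illustrated with a minimal sketch: split at natural boundaries first, then pack into token-sized windows with overlap. This is a simplified stand-in for our custom parser; it uses blank lines as the semantic boundary and whitespace word count as a proxy for a real tokenizer.

```python
# Illustrative two-pass chunker: semantic boundaries first, then
# fixed-size windows with overlap to preserve context at the seams.
def chunk_document(text, max_tokens=350, overlap=50):
    # Pass 1: split at blank lines (stand-in for headings/paragraphs).
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks, window = [], []
    for block in blocks:
        window.extend(block.split())
        # Pass 2: emit a chunk whenever the window reaches the size cap,
        # carrying the last `overlap` tokens into the next chunk.
        while len(window) >= max_tokens:
            chunks.append(" ".join(window[:max_tokens]))
            window = window[max_tokens - overlap:]
    if window:
        chunks.append(" ".join(window))
    return chunks
```

In production you would count tokens with the embedding model's actual tokenizer rather than whitespace words, since the two can differ by 30% or more on technical text.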

For embeddings, we use OpenAI's text-embedding-3-large for English content. Where multilingual support is needed, we switch to a multilingual-e5-large model. All embeddings are stored with rich metadata: document title, section heading, last-modified date, author, and access level.
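The metadata attached to each vector ends up looking roughly like the record below. The field names here are illustrative assumptions about our schema (the shape follows Pinecone's id/values/metadata upsert format, but the exact keys will vary by project).

```python
# Illustrative shape of a vector record carrying the metadata fields
# listed above; field names are assumptions, not a fixed schema.
def make_vector_record(chunk_id, embedding, doc):
    return {
        "id": chunk_id,
        "values": embedding,  # the embedding vector for this chunk
        "metadata": {
            "title": doc["title"],
            "section": doc["section"],
            "last_modified": doc["last_modified"],
            "author": doc["author"],
            "access_level": doc["access_level"],  # used for retrieval filtering
        },
    }
```

Carrying `access_level` in metadata is what makes the role-based retrieval filter in the query orchestrator possible without a separate permissions lookup.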

💡Pro-Tip: Chunk at Meaning, Not Length

Before building your chunker, spend two hours manually reading 20 documents from your corpus and writing down where YOU would split them. Those natural breakpoints (topic shifts, section changes, list endings) are where your chunker should cut too.

Retrieval Quality Tuning

Shipping a RAG system is easy. Shipping one that retrieves the right document 95% of the time is hard. Our tuning process goes through three phases: baseline measurement, failure analysis, and targeted fixes.

For baseline measurement, we build a golden evaluation set of 100 question-answer pairs with known source documents. We measure recall@k (did the right chunk appear in the top k results?) and answer faithfulness (does the LLM's answer match the retrieved source?). On first deployment, recall@5 typically sits around 72–78% for enterprise knowledge bases.
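A recall@k harness over a golden set is only a few lines. The sketch below assumes a `(query, expected_chunk_id)` format for the golden set and a `retrieve` callable returning ranked chunks with an `"id"` field; both are illustrative, not our production interface.

```python
# Minimal recall@k: fraction of golden queries whose known-correct
# chunk appears in the top-k retrieved results.
def recall_at_k(golden_set, retrieve, k=5):
    hits = 0
    for query, expected_id in golden_set:
        top_ids = [chunk["id"] for chunk in retrieve(query)[:k]]
        if expected_id in top_ids:
            hits += 1
    return hits / len(golden_set)
```

Run this after every change to chunking, embeddings, or retrieval parameters; without a fixed golden set, "it seems better" is the only signal you have.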

The two most common failure modes are: query-document vocabulary mismatch (user asks 'vacation days', document says 'PTO policy') and dense jargon retrieval failures. We address the first with HyDE (Hypothetical Document Embeddings): ask the LLM to generate a hypothetical answer first, then embed that answer for retrieval. For jargon, we add a keyword-based BM25 retrieval layer and merge results with RRF (Reciprocal Rank Fusion).
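The RRF merge step is simple enough to show in full. This is a generic sketch of the standard formula (score contribution 1/(k + rank) per ranking, with k=60 as the commonly used smoothing constant), not our exact production code.

```python
# Reciprocal Rank Fusion: merge multiple ranked lists of document ids.
# A document's fused score is the sum of 1/(k + rank) over every
# ranking it appears in; k=60 is the conventional smoothing constant.
def rrf_merge(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between the BM25 and vector retrievers, which is exactly why it works well for fusing heterogeneous rankings.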

Lessons from Production

After six months in production, three lessons stand out. First: cache aggressively. Over 40% of queries are near-duplicates, and a semantic cache (using cosine similarity on query embeddings to find cached responses) cuts LLM costs dramatically. Second: log everything. User queries are a goldmine for finding gaps in your document corpus: if 30 users asked the same question and got poor answers, you know exactly which document to add.
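The semantic cache idea fits in a short class. This is a toy in-memory sketch: the `embed` callable and the 0.95 similarity threshold are illustrative assumptions (production would use a vector store with TTLs, and the threshold needs tuning against false-hit rates).

```python
import math

# Toy semantic cache: reuse a cached answer when a new query's embedding
# is close enough (cosine similarity) to a previously answered one.
class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # callable: str -> list[float]
        self.threshold = threshold
        self.entries = []         # (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        q = self.embed(query)
        for emb, answer in self.entries:
            if self._cosine(q, emb) >= self.threshold:
                return answer     # near-duplicate: skip the LLM call entirely
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

Set the threshold too low and users get stale or subtly wrong answers; too high and the cache never hits. Measuring the false-hit rate on logged queries is the only reliable way to tune it.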

Third, and most importantly: retrieval quality degrades as your document store grows if you don't maintain it. Stale documents, duplicate content, and contradictory policies all show up as degraded answer quality. Build a document freshness pipeline from day one.

Conclusion

A well-architected RAG system can genuinely transform how an organization accesses its institutional knowledge. The key is treating retrieval quality as a first-class engineering concern, not an afterthought. Start with a small, high-quality document corpus, instrument everything from day one, and expand incrementally.

If you're building something similar or want to explore what a knowledge assistant could look like for your team, our Agentic Knowledge Assistant is a great starting point. Reach out to discuss.

#RAG #LLM #Enterprise #VectorDB


Ready to Harness the Power of AI?

Whether you're optimizing operations, enhancing customer experiences, or exploring automation, our team at TechiZen is ready to bring your vision to life with 20+ years of software excellence. Let's start building your AI advantage today.