Combine semantic search with LLM generation to build an app that answers using your data.
Now that you've ingested and chunked your documents, it's time to connect the dots: take a user query, search for relevant content, and pass that into a language model for a final, grounded answer.
This chapter focuses on the heart of RAG: the Retriever-Generator pipeline.
You'll build a fully functional pipeline that follows this flow:
User query
→ Embed query
→ Retrieve top-k matching chunks from the vector store
→ Insert into prompt template
→ Generate LLM response
→ Return grounded answer
You can build this pipeline with LangChain or LlamaIndex, both of which support modular components.
If you used Chroma for ingestion, load your persisted store:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Use the same embedding model you used at ingestion time,
# or the query vectors won't line up with the stored chunk vectors.
embedding = OpenAIEmbeddings()

vectorstore = Chroma(
    persist_directory="./chroma_store",
    embedding_function=embedding,
)
You can also initialize from a new set of documents, if needed:
vectorstore = Chroma.from_documents(documents, embedding)
A retriever is a wrapper around the vector store that handles searching.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
You can tweak k to control how many chunks are retrieved. Some retrievers also support hybrid (keyword + vector) search.
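For example, here is a minimal sketch of both options, assuming the vectorstore and documents objects from earlier are still in scope (BM25Retriever also requires the rank_bm25 package):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# MMR (maximal marginal relevance) trades a little relevance for more diverse chunks
mmr_retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

# Hybrid search: blend keyword (BM25) scores with vector similarity
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 3
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, mmr_retriever],
    weights=[0.5, 0.5],
)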
To test:
results = retriever.get_relevant_documents("What's the return policy?")

for doc in results:
    print(doc.page_content)
Use LangChain's built-in chain system to wrap retrieval + response:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# temperature=0 keeps answers deterministic and close to the retrieved text
llm = OpenAI(temperature=0)

# "stuff" puts every retrieved chunk into a single prompt
qa_chain = load_qa_chain(llm, chain_type="stuff")

response = qa_chain.run(
    input_documents=results,
    question="What's the return policy?",
)
print(response)
This is the simplest possible RAG setup — and already quite powerful.
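If you'd rather not pass the retrieved documents in manually, LangChain's RetrievalQA chain wires the retriever and the QA chain together for you. A minimal sketch, reusing the llm and retriever objects defined above:

from langchain.chains import RetrievalQA

# One object handles retrieval + prompting + generation
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

print(rag_chain.run("What's the return policy?"))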
In LangChain, there are multiple chain types: stuff (put all retrieved chunks into one prompt), map_reduce (answer each chunk separately, then combine), refine (iteratively improve an answer chunk by chunk), and map_rerank (score each chunk's answer and keep the best).
Start with stuff, and test the others for long documents or when you need more consistent answers.
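Switching is a one-argument change. For example, a map_reduce variant of the chain above:

# map_reduce answers each chunk separately and then combines the partial answers.
# Slower and more token-hungry, but it won't overflow the context window on long documents.
qa_chain_long_docs = load_qa_chain(llm, chain_type="map_reduce")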
For more control over the output, define a prompt template yourself:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
template = """Use the context below to answer the question.
If the answer isn't in the context, say "I don't know."
Context:
{context}
Question:
{question}
Answer:"""
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

chain = LLMChain(llm=llm, prompt=prompt)

# Join the retrieved chunks into a single context string
context = "\n\n".join(doc.page_content for doc in results)

response = chain.run(context=context, question="What's the return policy?")
print(response)
This gives you total control over how information is presented to the model.
You can log retrieval scores for ranking or evaluation. The basic retriever doesn't attach similarity scores to the documents, so query the vector store directly with similarity_search_with_score:

results_with_scores = vectorstore.similarity_search_with_score(
    "What's the return policy?", k=3
)

for doc, score in results_with_scores:
    print(f"{doc.metadata.get('source')}: {score:.4f}")
Consider saving the query, the retrieved chunks with their sources and scores, and the model's final answer.
This helps you improve performance and analyze LLM behavior later.
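One lightweight approach is an append-only JSONL log. A sketch, where log_interaction, the file name, and the field names are only suggestions:

import json
import time

def log_interaction(query, docs_with_scores, answer, path="rag_log.jsonl"):
    """Append one JSON record per request so retrieval quality can be reviewed later."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "chunks": [
            {"source": doc.metadata.get("source"), "score": score}
            for doc, score in docs_with_scores
        ],
        "answer": answer,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction("What's the return policy?", results_with_scores, response)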
LlamaIndex lets you create a QueryEngine that wraps all logic into a single interface:
import chromadb
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.vector_stores import ChromaVectorStore

# LlamaIndex needs its own vector-store adapter; point it at the persisted
# Chroma collection (use the collection name from your ingestion step).
chroma_client = chromadb.PersistentClient(path="./chroma_store")
chroma_collection = chroma_client.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

service_context = ServiceContext.from_defaults(llm=OpenAI())
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
query_engine = index.as_query_engine()

response = query_engine.query("What is the return policy?")
print(response)
This abstracts away prompt logic and works well if you're fully in the LlamaIndex ecosystem.
To improve results, tune the retriever first: adjust k, try hybrid or MMR search, re-rank the retrieved chunks, or revisit your chunk size and overlap.
Your retriever is the brain of your RAG system; if it fails to find the right content, the model will hallucinate.
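Another easy lever is metadata filtering, so the retriever only searches documents that could plausibly contain the answer. A sketch for a Chroma-backed store; "faq.md" is just an example key-value pair, use whatever metadata you attached at ingestion:

# Retrieve more candidates, but only from a specific source document
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"source": "faq.md"}},
)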
You can plug in any LLM supported by LangChain: OpenAI's GPT models, Google's Gemini, Anthropic's Claude, or locally hosted models.
Each model performs differently in RAG. Some (like GPT-4) are more precise but slower, while others (like Gemini Flash) excel in speed and context compression.
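Swapping models is usually a one-line change. For example, using a chat model through LangChain's wrapper (a sketch, assuming your OpenAI API key is configured):

from langchain.chat_models import ChatOpenAI

# Any LangChain chat-model wrapper with the same interface can be dropped in here
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_chain = load_qa_chain(llm, chain_type="stuff")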
Now that you've built a complete retrieval and answering pipeline, it's time to turn it into a production-ready service.
In Chapter 4, we'll walk through deploying your RAG system behind a real API using FastAPI or LangServe, covering endpoint design, scaling, and security best practices.
👉 Continue to Chapter 4: Deploying RAG Systems — APIs, Scaling, and Security Best Practices