Combine semantic search with LLM generation to build an app that answers using your data.
Now that you've ingested and chunked your documents, it's time to connect the dots: take a user query, search for relevant content, and pass that into a language model for a final, grounded answer.
This chapter focuses on the heart of RAG: the Retriever-Generator pipeline.
You'll build a fully functional pipeline that follows this flow:
User query
→ Embed query
→ Retrieve top-k matching chunks from the vector store
→ Insert into prompt template
→ Generate LLM response
→ Return grounded answer
You can build this pipeline with LangChain or LlamaIndex, both of which support modular components.
If you used Chroma for ingestion, load your persisted store:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Use the same embedding model you used at ingestion time,
# or the query vectors won't line up with the stored chunk vectors.
embedding = OpenAIEmbeddings()

vectorstore = Chroma(
    persist_directory="./chroma_store",
    embedding_function=embedding,
)
You can also initialize from a new set of documents, if needed:
vectorstore = Chroma.from_documents(documents, embedding)
A retriever is a wrapper around the vector store that handles searching.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
You can tweak k to control how many chunks are retrieved. Some retrievers also support hybrid (keyword + vector) search.
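For example, here is a minimal sketch of both options, assuming the vectorstore and documents objects from earlier are still in scope (BM25Retriever also requires the rank_bm25 package):

from langchain.retrievers import BM25Retriever, EnsembleRetriever

# MMR (maximal marginal relevance) trades a little relevance for more diverse chunks
mmr_retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

# Hybrid search: blend keyword (BM25) scores with vector similarity
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 3
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, mmr_retriever],
    weights=[0.5, 0.5],
)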
To test:
results = retriever.get_relevant_documents("What's the return policy?")

for doc in results:
    print(doc.page_content)
Use LangChain's built-in chain system to wrap retrieval + response:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# temperature=0 keeps answers deterministic and close to the retrieved text
llm = OpenAI(temperature=0)

# "stuff" puts every retrieved chunk into a single prompt
qa_chain = load_qa_chain(llm, chain_type="stuff")

response = qa_chain.run(
    input_documents=results,
    question="What's the return policy?",
)
print(response)
This is the simplest possible RAG setup — and already quite powerful.
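If you'd rather not pass the retrieved documents in manually, LangChain's RetrievalQA chain wires the retriever and the QA chain together for you. A minimal sketch, reusing the llm and retriever objects defined above:

from langchain.chains import RetrievalQA

# One object handles retrieval + prompting + generation
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)

print(rag_chain.run("What's the return policy?"))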
In LangChain, there are multiple chain types: stuff (put all retrieved chunks into one prompt), map_reduce (answer each chunk separately, then combine), refine (iteratively improve an answer chunk by chunk), and map_rerank (score each chunk's answer and keep the best).
Start with stuff, and test the others for long documents or when you need more consistent answers.
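Switching is a one-argument change. For example, a map_reduce variant of the chain above:

# map_reduce answers each chunk separately and then combines the partial answers.
# Slower and more token-hungry, but it won't overflow the context window on long documents.
qa_chain_long_docs = load_qa_chain(llm, chain_type="map_reduce")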
For more control over the output, define a prompt template yourself:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
template = """Use the context below to answer the question.
If the answer isn't in the context, say "I don't know."
Context:
{context}
Question:
{question}
Answer:"""
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)

chain = LLMChain(llm=llm, prompt=prompt)

# Join the retrieved chunks into a single context string
context = "\n\n".join(doc.page_content for doc in results)

response = chain.run(context=context, question="What's the return policy?")
print(response)
This gives you total control over how information is presented to the model.
You can log retrieval scores for ranking or evaluation. The basic retriever doesn't attach similarity scores to the documents, so query the vector store directly with similarity_search_with_score:

results_with_scores = vectorstore.similarity_search_with_score(
    "What's the return policy?", k=3
)

for doc, score in results_with_scores:
    print(f"{doc.metadata.get('source')}: {score:.4f}")
Consider saving the query, the retrieved chunks with their sources and scores, and the model's final answer.
This helps you improve performance and analyze LLM behavior later.
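One lightweight approach is an append-only JSONL log. A sketch, where log_interaction, the file name, and the field names are only suggestions:

import json
import time

def log_interaction(query, docs_with_scores, answer, path="rag_log.jsonl"):
    """Append one JSON record per request so retrieval quality can be reviewed later."""
    record = {
        "timestamp": time.time(),
        "query": query,
        "chunks": [
            {"source": doc.metadata.get("source"), "score": score}
            for doc, score in docs_with_scores
        ],
        "answer": answer,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction("What's the return policy?", results_with_scores, response)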
LlamaIndex lets you create a QueryEngine that wraps all logic into a single interface:
import chromadb
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.vector_stores import ChromaVectorStore

# LlamaIndex needs its own vector-store adapter; point it at the persisted
# Chroma collection (use the collection name from your ingestion step).
chroma_client = chromadb.PersistentClient(path="./chroma_store")
chroma_collection = chroma_client.get_or_create_collection("docs")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

service_context = ServiceContext.from_defaults(llm=OpenAI())
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
query_engine = index.as_query_engine()

response = query_engine.query("What is the return policy?")
print(response)
This abstracts away prompt logic and works well if you're fully in the LlamaIndex ecosystem.
To improve results, tune the retriever first: adjust k, try hybrid or MMR search, re-rank the retrieved chunks, or revisit your chunk size and overlap.
Your retriever is the brain of your RAG system; if it fails to find the right content, the model will hallucinate.
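Another easy lever is metadata filtering, so the retriever only searches documents that could plausibly contain the answer. A sketch for a Chroma-backed store; "faq.md" is just an example key-value pair, use whatever metadata you attached at ingestion:

# Retrieve more candidates, but only from a specific source document
filtered_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"source": "faq.md"}},
)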
You can plug in any LLM supported by LangChain: OpenAI's GPT models, Google's Gemini, Anthropic's Claude, or locally hosted models.
Each model performs differently in RAG. Some (like GPT-4) are more precise but slower, while others (like Gemini Flash) excel in speed and context compression.
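Swapping models is usually a one-line change. For example, using a chat model through LangChain's wrapper (a sketch, assuming your OpenAI API key is configured):

from langchain.chat_models import ChatOpenAI

# Any LangChain chat-model wrapper with the same interface can be dropped in here
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_chain = load_qa_chain(llm, chain_type="stuff")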
Now that you've built a complete retrieval and answering pipeline, it's time to turn it into a production-ready service.
In Chapter 4, we'll walk through deploying your RAG system behind a real API using FastAPI or LangServe, covering endpoint design, scaling, and security best practices.
👉 Continue to Chapter 4: Deploying RAG Systems — APIs, Scaling, and Security Best Practices