Host your RAG pipeline with confidence — handle auth, load, cost, and privacy at scale.
You've built a functioning Retrieval-Augmented Generation (RAG) pipeline: it ingests documents, retrieves relevant chunks, and feeds them into an LLM for a grounded answer.
Now it's time to deploy it — so your frontend, internal team, or clients can use it via a real API.
In this chapter, you'll learn how to wrap your pipeline in a FastAPI service, protect it with authentication, log queries and token usage, keep secrets out of your repo, choose a deployment path (including LangServe), and plan for scale and cost.
A production-ready RAG backend typically includes an HTTP API layer, a retriever backed by a persistent vector store, an LLM chain, authentication, logging and monitoring, and careful secrets management.
Let's start with the simplest form of deployment: wrapping your RAG logic in a FastAPI server.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
app = FastAPI()
# Define request schema
class Query(BaseModel):
    question: str
# Load vectorstore and retriever
embedding = OpenAIEmbeddings()
vectorstore = Chroma(
    persist_directory="./chroma_store",
    embedding_function=embedding
)
retriever = vectorstore.as_retriever()
# Load LLM + QA Chain
llm = OpenAI()
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff"
)
@app.post("/rag")
def answer(query: Query):
result = qa_chain.run(query.question)
return {"answer": result}
Start your API with:
uvicorn main:app --reload
This gives you a /rag endpoint that accepts POST requests with a question and returns the LLM's response based on your vector search.
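To sanity-check the endpoint, call it from any HTTP client. Here is a minimal sketch using the requests library, assuming the server is running locally on uvicorn's default port:

import requests

# Hypothetical local test; adjust host/port to wherever uvicorn is running
response = requests.post(
    "http://localhost:8000/rag",
    json={"question": "What does our refund policy say?"}
)
print(response.json()["answer"])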
Don't expose your endpoint without protection. At minimum, require an API key or token in a request header.
from fastapi import Header, HTTPException
@app.post("/rag")
def answer(query: Query, api_key: str = Header(...)):
if api_key != "your-secret-key":
raise HTTPException(status_code=401, detail="Invalid API Key")
result = qa_chain.run(query.question)
return {"answer": result}
Better still, load the key from an environment variable, wrap the check in a reusable dependency (FastAPI's APIKeyHeader helper works well here), and move to OAuth2 or an API gateway once multiple clients are involved.
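A sketch of that dependency-based approach, assuming the key lives in an environment variable (RAG_API_KEY is just an example name):

import os
import secrets

from fastapi import Depends, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Depends(api_key_header)):
    # Compare against an env var in constant time instead of a hardcoded string
    expected = os.environ.get("RAG_API_KEY", "")
    if not expected or not secrets.compare_digest(api_key, expected):
        raise HTTPException(status_code=401, detail="Invalid API Key")

@app.post("/rag", dependencies=[Depends(verify_api_key)])
def answer(query: Query):
    return {"answer": qa_chain.run(query.question)}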
Track what users are asking, which chunks were retrieved, and how many tokens were used.
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.post("/rag")
def answer(query: Query, api_key: str = Header(...)):
logger.info(f"Query received: {query.question}")
result = qa_chain.run(query.question)
logger.info(f"Response: {result}")
return {"answer": result}
Optional: Log similarity scores, model latency, and token counts using OpenAI's API logs or LangChain's CallbackManager.
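For example, LangChain's OpenAI callback can report token counts and an estimated cost per request. A sketch, assuming you're calling an OpenAI model (the exact import path can vary between LangChain versions):

import time
from langchain.callbacks import get_openai_callback

@app.post("/rag")
def answer(query: Query, api_key: str = Header(...)):
    start = time.perf_counter()
    with get_openai_callback() as cb:  # collects token usage for OpenAI calls in this block
        result = qa_chain.run(query.question)
    logger.info(
        f"tokens={cb.total_tokens} cost=${cb.total_cost:.4f} "
        f"latency={time.perf_counter() - start:.2f}s"
    )
    return {"answer": result}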
If you're using hosted services such as OpenAI or a managed vector database, their API keys and connection strings belong in environment variables or a secrets manager.
Never commit credentials or embedding vectors to Git.
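A common pattern is to keep keys in a gitignored .env file and load them at startup. A minimal sketch using python-dotenv (an extra dependency, not something the code above requires):

import os
from dotenv import load_dotenv

# Reads key=value pairs from a local .env file; the file itself stays out of Git
load_dotenv()

openai_key = os.environ["OPENAI_API_KEY"]  # OpenAIEmbeddings and OpenAI read this automatically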
Depending on your scale and team setup, your options range from a managed hosting platform, to containers you operate yourself, to a thin wrapper like LangServe (covered next).
If you're using LangChain v0.1+, langserve is a no-boilerplate way to expose chains as REST APIs.
Install it:
pip install langserve
Then wrap your chain like this:
from langserve import add_routes
app = FastAPI()
add_routes(app, qa_chain, path="/qa")
This generates /qa/invoke, /qa/batch, and /qa/stream endpoints, plus an interactive playground at /qa/playground.
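Calling the generated route looks like this. A sketch assuming RetrievalQA's default input key, query, and LangServe's standard invoke payload (check the playground if your chain expects a different shape):

import requests

response = requests.post(
    "http://localhost:8000/qa/invoke",
    json={"input": {"query": "What does our refund policy say?"}}
)
# The "output" field mirrors whatever the wrapped chain returns
print(response.json()["output"])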
LangServe is ideal for teams looking to ship fast and iterate with minimal backend overhead.
As usage grows, prepare for rate limiting, bursty concurrent traffic, and runaway or abusive clients.
Tools like FastAPI-Limiter, Cloudflare Workers, or PostHog can help monitor and throttle as needed.
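For example, fastapi-limiter enforces per-route limits backed by Redis. A sketch assuming a local Redis instance (check the library's docs for the current initialization API):

import redis.asyncio as redis
from fastapi import Depends
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter

@app.on_event("startup")
async def init_rate_limiter():
    # Assumes Redis is reachable at localhost:6379
    connection = redis.from_url("redis://localhost:6379", encoding="utf-8", decode_responses=True)
    await FastAPILimiter.init(connection)

# Allow at most 10 RAG calls per client per minute
@app.post("/rag", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
def answer(query: Query):
    return {"answer": qa_chain.run(query.question)}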
Track request latency, which chunks are retrieved and their similarity scores, token usage, and cost per query.
For deeper observability, integrate with a tracing or analytics layer such as LangSmith for chain-level traces, or an analytics tool like PostHog.
RAG systems can scale fast, and so can your OpenAI or Gemini bill. Watch context size, cap output tokens, and cache repeated questions before costs surprise you.
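One cheap lever is caching answers to repeated questions so you don't pay for the same completion twice. A minimal in-process sketch (exact-match only, no eviction, purely illustrative):

# Simple in-memory cache keyed on the normalized question text
answer_cache: dict[str, str] = {}

@app.post("/rag")
def answer(query: Query):
    key = query.question.strip().lower()
    if key in answer_cache:
        return {"answer": answer_cache[key]}
    result = qa_chain.run(query.question)
    answer_cache[key] = result
    return {"answer": result}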
You now have a fully deployed, secure, and scalable RAG backend.