Process PDFs, markdown, and HTML into smart chunks ready for vector search and fast retrieval.
The quality of a RAG system begins long before any LLM is called.
It starts with how you ingest and prepare your content — turning PDFs, Word docs, Notion pages, HTML files, and emails into searchable, meaningful units of information.
In this chapter, you'll learn how to load raw documents from common formats, split them into clean, semantically meaningful chunks, attach metadata for source tracing, and automate the whole process so it's reproducible.
Let's start building the retrieval brain behind your RAG app.
In RAG, ingestion = document loading + preprocessing + chunking + embedding preparation.
But for now, we'll focus only on the first two stages — document loading and chunking — which set the foundation for everything that follows.
You can ingest almost any content source into a RAG system. The most common include PDFs, Word docs, Notion pages, HTML files, Markdown, and emails.
The goal is to normalize and chunk this data into clean, readable segments for semantic search.
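What "normalize" means varies by source, but a minimal sketch of a cleanup step might look like the function below. It isn't part of LlamaIndex or LangChain; it simply strips control characters and collapses excess whitespace before chunking.

import re

def normalize_text(raw: str) -> str:
    # Remove stray control characters that often survive PDF extraction.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", raw)
    # Collapse runs of spaces/tabs and excessive blank lines so chunk
    # boundaries reflect content, not formatting artifacts.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()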
LlamaIndex has one of the cleanest ingestion and chunking pipelines. Here's how to use it:
pip install llama-index openai
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("./docs").load_data()
This reads every supported file in ./docs and parses each format automatically.
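By default the reader stays in the top-level folder and loads every file type it recognizes. If your docs live in nested folders or you only want certain formats, recursive and required_exts (both real SimpleDirectoryReader parameters) handle that; the extensions below are just an example:

# Walk subdirectories and only pick up PDF, Markdown, and HTML files.
documents = SimpleDirectoryReader(
    "./docs",
    recursive=True,
    required_exts=[".pdf", ".md", ".html"],
).load_data()

With the documents loaded, the next step is chunking.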
from llama_index.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
This chunking preserves semantic meaning by breaking at sentence boundaries — ideal for QA systems.
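Before indexing anything, it's worth eyeballing a few nodes to confirm the chunk size and overlap look sensible. A quick, purely illustrative preview:

# Print the length and first ~80 characters of the first few chunks.
for node in nodes[:3]:
    preview = node.text[:80].replace("\n", " ")
    print(f"{len(node.text)} chars | {preview}...")

Once the chunks look right, attach metadata to each node: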
for node in nodes:
    # Merge with (rather than overwrite) the metadata the reader already attached.
    node.metadata.update({
        "source": "employee-handbook.pdf",
        "section": "Leave Policy",
    })
Adding metadata improves traceability and lets your LLM cite sources later.
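Hard-coding a single source string works for one file, but when you load a whole directory you'll usually want per-file metadata instead. SimpleDirectoryReader accepts a file_metadata callable for exactly this; the doc_type value below is a placeholder you'd replace with your own logic.

from pathlib import Path

def describe_file(path: str) -> dict:
    # Every node produced from this file inherits these fields.
    return {"source": Path(path).name, "doc_type": "policy"}

documents = SimpleDirectoryReader("./docs", file_metadata=describe_file).load_data()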
LangChain provides an alternative ingestion path, especially if you're already building LangChain-native chains.
pip install langchain pypdf
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/handbook.pdf")
documents = loader.load()
Each document will include page_content and optional metadata like page number.
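A quick print makes that structure visible; this is just an inspection snippet, not a required step:

# PyPDFLoader yields one Document per page.
first = documents[0]
print(first.metadata)            # e.g. {'source': 'docs/handbook.pdf', 'page': 0}
print(first.page_content[:200])  # first 200 characters of the page text

Now split the pages into chunks: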
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
This splitter handles mixed content well (e.g., code alongside prose): it tries paragraph breaks first, then sentence and word boundaries, falling back to raw character counts only when needed, and the separator hierarchy is fully configurable.
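If some of your sources are mostly code, the same splitter can be tuned with language-aware separators. from_language and the Language enum are part of LangChain's text_splitter module; the chunk sizes below are arbitrary, and for the handbook example we stick with the default configuration above.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Prefers Python-specific boundaries (class and def blocks) before falling
# back to blank lines, single lines, and finally raw characters.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50,
)

As with LlamaIndex, you can then tag each chunk with metadata: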
for chunk in chunks:
    chunk.metadata["source"] = "employee-handbook.pdf"
    chunk.metadata["doc_type"] = "policy"
Use LlamaIndex if you want the cleanest out-of-the-box ingestion and chunking pipeline and your app is primarily focused on loading, indexing, and querying documents.
Use LangChain if you're already building LangChain-native chains or agents and want loaders and splitters that plug straight into them.
Both are fully compatible with vector DBs like Chroma, Weaviate, and Pinecone. You can also convert between formats if needed.
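Conversion is mostly a matter of copying text and metadata across. As a minimal sketch, LangChain chunks can be wrapped as LlamaIndex documents like this (assuming a pre-0.10 llama_index release where Document takes text and metadata keyword arguments):

from llama_index import Document

# Wrap each LangChain chunk as a LlamaIndex Document, carrying its metadata over.
llama_docs = [
    Document(text=chunk.page_content, metadata=chunk.metadata)
    for chunk in chunks
]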
Automate the above steps with a single CLI ingestion script:
python ingest_docs.py --input_dir ./docs --output_dir ./indexed --chunk_size 500
Logging each run's inputs, chunk counts, and output locations makes ingestion reproducible, especially in production pipelines.
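The script itself doesn't need to be fancy. Here is a minimal sketch of what ingest_docs.py could look like, built on the LangChain loader and splitter from above; the flag names mirror the command shown, and everything else (JSON output format, overlap, PDF-only globbing) is an assumption you'd adapt.

import argparse
import json
from pathlib import Path

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def main():
    parser = argparse.ArgumentParser(description="Chunk documents for RAG ingestion.")
    parser.add_argument("--input_dir", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--chunk_size", type=int, default=500)
    args = parser.parse_args()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=args.chunk_size, chunk_overlap=50
    )
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for pdf_path in sorted(Path(args.input_dir).glob("*.pdf")):
        # Load one PDF, split it, and write the chunks plus metadata to JSON.
        chunks = splitter.split_documents(PyPDFLoader(str(pdf_path)).load())
        records = [
            {"text": c.page_content, "metadata": {**c.metadata, "source": pdf_path.name}}
            for c in chunks
        ]
        out_file = out_dir / f"{pdf_path.stem}.json"
        out_file.write_text(json.dumps(records, indent=2))
        print(f"{pdf_path.name}: {len(records)} chunks -> {out_file}")

if __name__ == "__main__":
    main()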
In Chapter 3, we'll build your RAG pipeline — connecting the chunks you created to a vector store and LLM to answer user questions dynamically.
It's time to go from prepared data → real-time retrieval → grounded generation.
👉 Continue to Chapter 3: Building a RAG Pipeline — Vector Search + LLM Answering