Process PDFs, markdown, and HTML into smart chunks ready for vector search and fast retrieval.
The quality of a RAG system begins long before any LLM is called.
It starts with how you ingest and prepare your content — turning PDFs, Word docs, Notion pages, HTML files, and emails into searchable, meaningful units of information.
In this chapter, you'll learn how to load raw documents from common formats, split them into clean, semantically meaningful chunks, attach metadata for source tracing, and automate the whole process so it's reproducible.
Let's start building the retrieval brain behind your RAG app.
In RAG, ingestion = document loading + preprocessing + chunking + embedding preparation.
But for now, we'll focus only on the first two stages — document loading and chunking — which set the foundation for everything that follows.
You can ingest almost any content source into a RAG system. The most common include PDFs, Word docs, Notion pages, HTML files, Markdown, and emails.
The goal is to normalize and chunk this data into clean, readable segments for semantic search.
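What "normalize" means varies by source, but a minimal sketch of a cleanup step might look like the function below. It isn't part of LlamaIndex or LangChain; it simply strips control characters and collapses excess whitespace before chunking.

import re

def normalize_text(raw: str) -> str:
    # Remove stray control characters that often survive PDF extraction.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", raw)
    # Collapse runs of spaces/tabs and excessive blank lines so chunk
    # boundaries reflect content, not formatting artifacts.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()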
LlamaIndex has one of the cleanest ingestion and chunking pipelines. Here's how to use it:
pip install llama-index openai
from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("./docs").load_data()
This reads every supported file in ./docs and parses each format automatically.
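By default the reader stays in the top-level folder and loads every file type it recognizes. If your docs live in nested folders or you only want certain formats, recursive and required_exts (both real SimpleDirectoryReader parameters) handle that; the extensions below are just an example:

# Walk subdirectories and only pick up PDF, Markdown, and HTML files.
documents = SimpleDirectoryReader(
    "./docs",
    recursive=True,
    required_exts=[".pdf", ".md", ".html"],
).load_data()

With the documents loaded, the next step is chunking.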
from llama_index.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
This chunking preserves semantic meaning by breaking at sentence boundaries — ideal for QA systems.
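Before indexing anything, it's worth eyeballing a few nodes to confirm the chunk size and overlap look sensible. A quick, purely illustrative preview:

# Print the length and first ~80 characters of the first few chunks.
for node in nodes[:3]:
    preview = node.text[:80].replace("\n", " ")
    print(f"{len(node.text)} chars | {preview}...")

Once the chunks look right, attach metadata to each node: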
for node in nodes:
    # Merge with (rather than overwrite) the metadata the reader already attached.
    node.metadata.update({
        "source": "employee-handbook.pdf",
        "section": "Leave Policy",
    })
Adding metadata improves traceability and lets your LLM cite sources later.
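Hard-coding a single source string works for one file, but when you load a whole directory you'll usually want per-file metadata instead. SimpleDirectoryReader accepts a file_metadata callable for exactly this; the doc_type value below is a placeholder you'd replace with your own logic.

from pathlib import Path

def describe_file(path: str) -> dict:
    # Every node produced from this file inherits these fields.
    return {"source": Path(path).name, "doc_type": "policy"}

documents = SimpleDirectoryReader("./docs", file_metadata=describe_file).load_data()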
LangChain provides an alternative ingestion path, especially if you're already building LangChain-native chains.
pip install langchain pypdf
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/handbook.pdf")
documents = loader.load()
Each document will include page_content and optional metadata like page number.
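A quick print makes that structure visible; this is just an inspection snippet, not a required step:

# PyPDFLoader yields one Document per page.
first = documents[0]
print(first.metadata)            # e.g. {'source': 'docs/handbook.pdf', 'page': 0}
print(first.page_content[:200])  # first 200 characters of the page text

Now split the pages into chunks: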
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)
This splitter handles mixed content well (e.g., code alongside prose): it tries paragraph breaks first, then sentence and word boundaries, falling back to raw character counts only when needed, and the separator hierarchy is fully configurable.
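If some of your sources are mostly code, the same splitter can be tuned with language-aware separators. from_language and the Language enum are part of LangChain's text_splitter module; the chunk sizes below are arbitrary, and for the handbook example we stick with the default configuration above.

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Prefers Python-specific boundaries (class and def blocks) before falling
# back to blank lines, single lines, and finally raw characters.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=500,
    chunk_overlap=50,
)

As with LlamaIndex, you can then tag each chunk with metadata: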
for chunk in chunks:
    chunk.metadata["source"] = "employee-handbook.pdf"
    chunk.metadata["doc_type"] = "policy"
Use LlamaIndex if you want the cleanest out-of-the-box ingestion and chunking pipeline and your app is primarily focused on loading, indexing, and querying documents.
Use LangChain if you're already building LangChain-native chains or agents and want loaders and splitters that plug straight into them.
Both are fully compatible with vector DBs like Chroma, Weaviate, and Pinecone. You can also convert between formats if needed.
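Conversion is mostly a matter of copying text and metadata across. As a minimal sketch, LangChain chunks can be wrapped as LlamaIndex documents like this (assuming a pre-0.10 llama_index release where Document takes text and metadata keyword arguments):

from llama_index import Document

# Wrap each LangChain chunk as a LlamaIndex Document, carrying its metadata over.
llama_docs = [
    Document(text=chunk.page_content, metadata=chunk.metadata)
    for chunk in chunks
]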
Automate the above steps with a single CLI ingestion script:
python ingest_docs.py --input_dir ./docs --output_dir ./indexed --chunk_size 500
Logging each run's inputs, chunk counts, and output locations makes ingestion reproducible, especially in production pipelines.
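The script itself doesn't need to be fancy. Here is a minimal sketch of what ingest_docs.py could look like, built on the LangChain loader and splitter from above; the flag names mirror the command shown, and everything else (JSON output format, overlap, PDF-only globbing) is an assumption you'd adapt.

import argparse
import json
from pathlib import Path

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def main():
    parser = argparse.ArgumentParser(description="Chunk documents for RAG ingestion.")
    parser.add_argument("--input_dir", required=True)
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--chunk_size", type=int, default=500)
    args = parser.parse_args()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=args.chunk_size, chunk_overlap=50
    )
    out_dir = Path(args.output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    for pdf_path in sorted(Path(args.input_dir).glob("*.pdf")):
        # Load one PDF, split it, and write the chunks plus metadata to JSON.
        chunks = splitter.split_documents(PyPDFLoader(str(pdf_path)).load())
        records = [
            {"text": c.page_content, "metadata": {**c.metadata, "source": pdf_path.name}}
            for c in chunks
        ]
        out_file = out_dir / f"{pdf_path.stem}.json"
        out_file.write_text(json.dumps(records, indent=2))
        print(f"{pdf_path.name}: {len(records)} chunks -> {out_file}")

if __name__ == "__main__":
    main()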
In Chapter 3, we'll build your RAG pipeline — connecting the chunks you created to a vector store and LLM to answer user questions dynamically.
It's time to go from prepared data → real-time retrieval → grounded generation.
👉 Continue to Chapter 3: Building a RAG Pipeline — Vector Search + LLM Answering