Salt Technologies AI
Architecture Patterns

Chunking

Chunking is the process of splitting documents into smaller, semantically meaningful segments for storage in a vector database and retrieval in a RAG pipeline. The chunk size, overlap, and splitting strategy directly impact retrieval quality and LLM answer accuracy. Poor chunking is the most common cause of underwhelming RAG performance.

On this page
  1. What Is Chunking?
  2. Real-World Use Cases
  3. Common Misconceptions
  4. Why Chunking Matters
  5. How Salt Technologies AI Uses Chunking
  6. FAQ

What Is Chunking?

Chunking sits at the foundation of every RAG system, yet it receives far less attention than model selection or prompt engineering. The goal is to split documents into pieces that are small enough to be relevant to specific queries but large enough to retain sufficient context for the LLM to generate useful answers. Get this balance wrong, and your RAG system will either retrieve overly broad passages (diluting relevance) or snippets too small to be meaningful (losing context).

Common chunking strategies include fixed-size chunking (splitting every N tokens with overlap), recursive character splitting (splitting on paragraphs, then sentences, then words until chunks fit the size limit), semantic chunking (using embedding similarity to detect topic boundaries), and document-structure-aware chunking (splitting on headings, sections, or page breaks). Fixed-size chunking is the simplest and often good enough for uniform documents. Semantic chunking produces higher quality chunks but is more complex and slower. Document-structure-aware chunking is essential for structured content like PDFs, legal documents, and technical manuals.
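To make the recursive strategy concrete, here is a minimal sketch of a recursive character splitter in plain Python. The function name, separator order, and character-based size limit are illustrative choices, not a specific library's API; production systems typically use a library implementation such as LlamaIndex's node parsers.

```python
def recursive_split(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing into finer
    separators only for pieces that still exceed max_chars."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # Nothing left to split on: hard cut at the character limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) > max_chars:
            # This piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are tried before sentences and sentences before words, chunks tend to end on natural boundaries rather than mid-thought.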

Chunk size is typically measured in tokens. For most use cases, 256 to 512 tokens per chunk works well, with 50 to 100 tokens of overlap between adjacent chunks. Larger chunks (1,000+ tokens) work better for questions requiring broad context, while smaller chunks (128 tokens) work better for precise factual lookups. Many production systems use a parent-child chunking strategy: store small chunks for retrieval but return the surrounding larger section as context to the LLM.
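The fixed-size-with-overlap scheme described above reduces to a simple sliding window. The sketch below operates on a pre-tokenized list; in practice you would tokenize with the same tokenizer your embedding model uses (for example, tiktoken for OpenAI models) rather than a naive whitespace split.

```python
def chunk_tokens(tokens, size=256, overlap=64):
    """Sliding windows of `size` tokens; each window starts size - overlap
    tokens after the previous one, so adjacent chunks share `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Example: chunk a whitespace-tokenized document.
words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(words, size=4, overlap=1)
```

The overlap means a sentence falling near a chunk boundary is likely to appear whole in at least one chunk, at the cost of some duplicated storage.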

Salt Technologies AI evaluates chunking strategies empirically rather than relying on rules of thumb. We create test query sets, run retrieval with multiple chunking configurations, and measure which strategy retrieves the most relevant passages. In our experience, switching from naive fixed-size chunking to semantic or structure-aware chunking improves retrieval precision by 15-30% for complex document collections.
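The empirical evaluation described above can be as simple as precision@k over a labeled query set. This sketch assumes you already have retrieval results (ranked chunk IDs per query) for each chunking configuration and a hand-labeled gold set of relevant chunk IDs; the function names are illustrative.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunk IDs that appear in the gold set."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k

def mean_precision(results, gold, k=5):
    """Average precision@k across a test query set for one chunking config."""
    return sum(precision_at_k(results[q], gold[q], k) for q in gold) / len(gold)

# Run this once per chunking configuration and compare the averages.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
results = {"q1": ["a", "x", "b", "y", "z"], "q2": ["c", "x", "y", "z", "w"]}
score = mean_precision(results, gold)  # 0.3
```

Comparing this score across configurations is what turns chunk size from a rule of thumb into a measured decision.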

Real-World Use Cases

1. Legal Contract Analysis

A legal AI platform uses structure-aware chunking to split contracts by clause, preserving the logical boundaries between terms, conditions, and definitions. This approach ensures each chunk represents a complete legal concept, enabling precise retrieval when attorneys search for specific contractual provisions.

2. Technical Documentation RAG

A developer platform chunks API documentation using heading-based splitting with parent-child relationships. Code examples stay attached to their explanations, and small definition chunks link back to their full section. This strategy delivers accurate answers for both "what does this function do?" and "show me an example" queries.
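A minimal sketch of that heading-based, parent-child approach, assuming markdown-style documentation with `## ` section headings (the structure and field names here are illustrative, not the platform's actual pipeline):

```python
import re

def split_by_headings(doc):
    """Split a markdown document on '## ' headings. Each section becomes a
    parent; each paragraph within it becomes a child chunk that records its
    parent's id, so retrieval can hand the LLM the full section."""
    sections = [s.strip() for s in re.split(r"(?m)^## ", doc) if s.strip()]
    parents, children = [], []
    for i, section in enumerate(sections):
        parents.append({"id": i, "text": section})
        for para in section.split("\n\n"):
            children.append({"parent_id": i, "text": para.strip()})
    return parents, children
```

Because a code example and its explanation live in the same section, retrieving either paragraph pulls the whole section into context.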

3. Research Paper Knowledge Base

A research institution chunks academic papers by sections (abstract, methods, results, discussion) with semantic sub-chunking within each section. This strategy allows researchers to retrieve specific methodological details without losing the broader context of the paper's findings.

Common Misconceptions

There is a universally optimal chunk size.

Optimal chunk size depends on your document types, query patterns, and use case. Support articles might work best at 256 tokens, while legal documents perform better at 512 to 1,000 tokens. Always test multiple configurations against your specific data and queries.

Chunking is a one-time setup task.

Chunking strategy should evolve as your data and query patterns change. New document types, new user questions, and changes to your embedding model may all warrant re-evaluating your chunking approach. Treat chunking as an ongoing optimization, not a set-and-forget configuration.

Smaller chunks always produce more precise retrieval.

Chunks that are too small lose context and can be ambiguous. A sentence fragment about "the system" is meaningless without knowing which system. Parent-child chunking addresses this: use small chunks for matching but return larger surrounding context to the LLM.
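The parent-child pattern can be sketched in a few lines: match the query against the small child chunks, then return the larger parent sections they belong to. The cosine function and data layout here are illustrative stand-ins for a real vector index.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def retrieve_with_parents(query_vec, children, parents, top_k=3):
    """Rank small child chunks by similarity, then return the larger
    parent sections they belong to, deduplicated and best-match first."""
    ranked = sorted(children, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    seen, context = set(), []
    for child in ranked[:top_k]:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            context.append(parents[child["parent_id"]])
    return context
```

Matching on small chunks keeps retrieval precise, while returning parents keeps the LLM's context unambiguous.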

Why Chunking Matters for Your Business

Chunking is the single most underinvested area in RAG development, yet it has the largest impact on retrieval quality. Companies that optimize their chunking strategy see immediate improvements in answer accuracy, relevance, and user satisfaction. Poor chunking forces teams to compensate with expensive reranking or larger context windows, increasing both latency and cost. A modest upfront investment in chunking evaluation can save months of debugging downstream RAG quality issues.

How Salt Technologies AI Uses Chunking

Salt Technologies AI treats chunking as a first-class engineering decision in every RAG project. We evaluate 3-5 chunking strategies against client data using automated retrieval benchmarks before committing to a production configuration. Our toolkit includes LlamaIndex's node parsers for recursive and semantic chunking, Unstructured for document-structure-aware parsing, and custom splitters for domain-specific formats. We typically achieve a 15-30% improvement over naive chunking through systematic evaluation.

Related Terms

Architecture Patterns
RAG Pipeline

A RAG pipeline is an architecture that augments large language model responses by retrieving relevant documents from an external knowledge base before generating answers. It combines retrieval (typically vector search) with generation, grounding LLM output in verified, up-to-date information. This pattern dramatically reduces hallucinations and enables domain-specific accuracy without retraining the model.

Architecture Patterns
Retrieval Pipeline

A retrieval pipeline is the sequence of steps that finds and ranks the most relevant documents or data chunks in response to a user query. It typically includes query processing, embedding generation, vector search, optional keyword search, reranking, and filtering. The quality of your retrieval pipeline directly determines the quality of your RAG system's answers.

Architecture Patterns
Document Ingestion Pipeline

A document ingestion pipeline is the automated workflow that converts raw documents (PDFs, web pages, Word files, spreadsheets) into structured, chunked, and embedded content ready for storage in a vector database. It handles parsing, cleaning, metadata extraction, chunking, embedding generation, and loading. This pipeline determines the quality of your entire downstream AI system.

Architecture Patterns
Vector Indexing

Vector indexing is the process of organizing high-dimensional vectors in data structures optimized for fast approximate nearest neighbor (ANN) search. Algorithms like HNSW, IVF, and Product Quantization enable sub-millisecond similarity searches across millions of vectors. The choice of index type directly affects search speed, memory usage, and recall accuracy.

Core AI Concepts
Embeddings

Embeddings are numerical vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space. Similar concepts produce similar vectors, enabling machines to measure meaning-based similarity between documents, sentences, or words. Embeddings are the mathematical backbone of semantic search, RAG systems, recommendation engines, and clustering applications.

Architecture Patterns
Semantic Search

Semantic search uses vector embeddings to find documents based on meaning rather than keyword matching. It converts queries and documents into high-dimensional vectors, then finds the closest matches using distance metrics like cosine similarity. This approach understands synonyms, paraphrases, and conceptual relationships that keyword search completely misses.

Chunking: Frequently Asked Questions

What chunk size should I use for my RAG system?
Start with 256 to 512 tokens with 50 to 100 tokens of overlap for general-purpose RAG. For precise factual lookups, try 128 to 256 tokens. For broad contextual answers, try 512 to 1,024 tokens. Always benchmark multiple sizes against your actual queries and documents to find the optimum.
What is parent-child chunking?
Parent-child chunking stores small chunks (children) for retrieval matching but returns larger surrounding sections (parents) as context to the LLM. This gives you the precision of small chunks with the context richness of large ones. LlamaIndex supports this pattern natively through its auto-merging retriever.
How does chunking affect embedding costs?
More chunks (from smaller chunk sizes) mean more embedding API calls and more vectors to store. A 10,000-page knowledge base with 256-token chunks produces roughly 80,000 vectors; with 512-token chunks, roughly 40,000. The cost difference in embedding generation is modest ($5 to $20), but storage and search costs scale linearly with vector count.
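The vector-count arithmetic above can be checked directly. This sketch assumes roughly 2,048 tokens per page and no overlap, which is the assumption implied by the 80,000 and 40,000 figures; adding overlap increases the count.

```python
def vector_count(pages, chunk_size, tokens_per_page=2048, overlap=0):
    """Estimated vectors = total tokens / tokens consumed per new chunk.
    Each new chunk advances chunk_size - overlap tokens past the last."""
    step = chunk_size - overlap
    return (pages * tokens_per_page) // step

print(vector_count(10_000, 256))  # 80000
print(vector_count(10_000, 512))  # 40000
```

Overlap is the hidden multiplier here: 64 tokens of overlap on 256-token chunks raises the vector count by about a third.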

14+ years of experience · 800+ projects delivered · 100+ engineers · 4.9★ Clutch rating

Need help implementing this?

Start with a $3,000 AI Readiness Audit. Get a clear roadmap in 1-2 weeks.