Chunking
Chunking is the process of splitting documents into smaller, semantically meaningful segments for storage in a vector database and retrieval in a RAG pipeline. Chunk size, overlap, and splitting strategy directly affect retrieval quality and LLM answer accuracy. Poor chunking is one of the most common causes of underwhelming RAG performance.
What Is Chunking?
Chunking sits at the foundation of every RAG system, yet it receives far less attention than model selection or prompt engineering. The goal is to split documents into pieces that are small enough to be relevant to specific queries but large enough to retain sufficient context for the LLM to generate useful answers. Get this balance wrong, and your RAG system will either retrieve overly broad passages (diluting relevance) or snippets too small to be meaningful (losing context).
Common chunking strategies include fixed-size chunking (splitting every N tokens with overlap), recursive character splitting (splitting on paragraphs, then sentences, then words until chunks fit the size limit), semantic chunking (using embedding similarity to detect topic boundaries), and document-structure-aware chunking (splitting on headings, sections, or page breaks). Fixed-size chunking is the simplest and often good enough for uniform documents. Semantic chunking produces higher quality chunks but is more complex and slower. Document-structure-aware chunking is essential for structured content like PDFs, legal documents, and technical manuals.
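Fixed-size chunking with overlap can be sketched in a few lines. This is an illustrative implementation, not a reference to any particular library; it uses whitespace-separated words as a stand-in for tokens, whereas a production pipeline would count tokens with the embedding model's actual tokenizer.

```python
def chunk_fixed_size(text, chunk_size=256, overlap=50):
    """Split text into fixed-size chunks with overlap.

    Words stand in for tokens here; a real pipeline would count
    tokens with the embedding model's tokenizer instead.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # slide window forward by this much
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary appears intact in at least one chunk, at the cost of some duplicated storage.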
Chunk size is typically measured in tokens. For most use cases, 256 to 512 tokens per chunk works well, with 50 to 100 tokens of overlap between adjacent chunks. Larger chunks (1,000+ tokens) work better for questions requiring broad context, while smaller chunks (128 tokens) work better for precise factual lookups. Many production systems use a parent-child chunking strategy: store small chunks for retrieval but return the surrounding larger section as context to the LLM.
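The parent-child pattern described above can be sketched as follows. The function name and data layout are illustrative assumptions, not an established API: small child chunks are what gets embedded and matched, and each one points back to the full parent section that is handed to the LLM.

```python
def build_parent_child_index(sections, child_size=128):
    """Index small child chunks for retrieval, each mapped back to
    its full parent section, which is returned to the LLM as context.
    """
    children = []  # (child_text, parent_id) pairs to embed and search
    parents = {}   # parent_id -> full section text
    for parent_id, section in enumerate(sections):
        parents[parent_id] = section
        words = section.split()  # words as a proxy for tokens
        for start in range(0, len(words), child_size):
            child = " ".join(words[start:start + child_size])
            children.append((child, parent_id))
    return children, parents
```

At query time, vector search runs over `children`, but the pipeline returns `parents[parent_id]` for the best-matching child, giving the LLM the surrounding context the small chunk lacks.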
Salt Technologies AI evaluates chunking strategies empirically rather than relying on rules of thumb. We create test query sets, run retrieval with multiple chunking configurations, and measure which strategy retrieves the most relevant passages. In our experience, switching from naive fixed-size chunking to semantic or structure-aware chunking improves retrieval precision by 15-30% for complex document collections.
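A minimal harness for this kind of empirical comparison might look like the sketch below. It assumes you already have ranked retrieval results per strategy and a labeled set of relevant chunk ids per query; the metric shown is precision@k, though production evaluations often add recall and MRR.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved chunk ids that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k


def compare_strategies(results_by_strategy, relevant_by_query, k=5):
    """Average precision@k per chunking strategy over a test query set.

    results_by_strategy: {strategy: {query: [ranked chunk ids]}}
    relevant_by_query:   {query: set of relevant chunk ids}
    """
    scores = {}
    for strategy, results in results_by_strategy.items():
        per_query = [
            precision_at_k(results[query], relevant, k)
            for query, relevant in relevant_by_query.items()
        ]
        scores[strategy] = sum(per_query) / len(per_query)
    return scores
```

Running every candidate configuration through the same query set makes the choice of strategy a measurement rather than a guess.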
Real-World Use Cases
Legal Contract Analysis
A legal AI platform uses structure-aware chunking to split contracts by clause, preserving the logical boundaries between terms, conditions, and definitions. This approach ensures each chunk represents a complete legal concept, enabling precise retrieval when attorneys search for specific contractual provisions.
Technical Documentation RAG
A developer platform chunks API documentation using heading-based splitting with parent-child relationships. Code examples stay attached to their explanations, and small definition chunks link back to their full section. This strategy delivers accurate answers for both "what does this function do?" and "show me an example" queries.
Research Paper Knowledge Base
A research institution chunks academic papers by sections (abstract, methods, results, discussion) with semantic sub-chunking within each section. This strategy allows researchers to retrieve specific methodological details without losing the broader context of the paper's findings.
Common Misconceptions
There is a universally optimal chunk size.
Optimal chunk size depends on your document types, query patterns, and use case. Support articles might work best at 256 tokens, while legal documents perform better at 512 to 1,000 tokens. Always test multiple configurations against your specific data and queries.
Chunking is a one-time setup task.
Chunking strategy should evolve as your data and query patterns change. New document types, new user questions, and changes to your embedding model may all warrant re-evaluating your chunking approach. Treat chunking as an ongoing optimization, not a set-and-forget configuration.
Smaller chunks always produce more precise retrieval.
Chunks that are too small lose context and can be ambiguous. A sentence fragment about "the system" is meaningless without knowing which system. Parent-child chunking addresses this: use small chunks for matching but return larger surrounding context to the LLM.
Why Chunking Matters for Your Business
Chunking is frequently the most underinvested area in RAG development relative to its impact on retrieval quality. Companies that optimize their chunking strategy typically see improvements in answer accuracy, relevance, and user satisfaction. Poor chunking forces teams to compensate with expensive reranking or larger context windows, increasing both latency and cost. A modest upfront investment in chunking evaluation can prevent lengthy debugging of downstream RAG quality issues.
How Salt Technologies AI Uses Chunking
Salt Technologies AI treats chunking as a first-class engineering decision in every RAG project. We evaluate 3-5 chunking strategies against client data using automated retrieval benchmarks before committing to a production configuration. Our toolkit includes LlamaIndex's node parsers for recursive and semantic chunking, Unstructured for document-structure-aware parsing, and custom splitters for domain-specific formats. We typically achieve a 15-30% improvement over naive chunking through systematic evaluation.
Further Reading
- RAG vs Fine-Tuning: When to Use Each (Salt Technologies AI Blog)
- Vector Database Performance Benchmark 2026 (Salt Technologies AI Datasets)
Related Terms
RAG Pipeline
A RAG pipeline is an architecture that augments large language model responses by retrieving relevant documents from an external knowledge base before generating answers. It combines retrieval (typically vector search) with generation, grounding LLM output in verified, up-to-date information. This pattern dramatically reduces hallucinations and enables domain-specific accuracy without retraining the model.
Retrieval Pipeline
A retrieval pipeline is the sequence of steps that finds and ranks the most relevant documents or data chunks in response to a user query. It typically includes query processing, embedding generation, vector search, optional keyword search, reranking, and filtering. The quality of your retrieval pipeline directly determines the quality of your RAG system's answers.
Document Ingestion Pipeline
A document ingestion pipeline is the automated workflow that converts raw documents (PDFs, web pages, Word files, spreadsheets) into structured, chunked, and embedded content ready for storage in a vector database. It handles parsing, cleaning, metadata extraction, chunking, embedding generation, and loading. This pipeline determines the quality of your entire downstream AI system.
Vector Indexing
Vector indexing is the process of organizing high-dimensional vectors in data structures optimized for fast approximate nearest neighbor (ANN) search. Algorithms like HNSW, IVF, and Product Quantization enable sub-millisecond similarity searches across millions of vectors. The choice of index type directly affects search speed, memory usage, and recall accuracy.
Embeddings
Embeddings are numerical vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space. Similar concepts produce similar vectors, enabling machines to measure meaning-based similarity between documents, sentences, or words. Embeddings are the mathematical backbone of semantic search, RAG systems, recommendation engines, and clustering applications.
Semantic Search
Semantic search uses vector embeddings to find documents based on meaning rather than keyword matching. It converts queries and documents into high-dimensional vectors, then finds the closest matches using distance metrics like cosine similarity. This approach understands synonyms, paraphrases, and conceptual relationships that keyword search often misses.
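The core ranking step can be illustrated with pure Python. The tiny three-dimensional vectors and document names below are toy assumptions; real embeddings have hundreds or thousands of dimensions and come from an embedding model, but the cosine-similarity ranking logic is the same.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

At scale, this exhaustive comparison is replaced by an approximate nearest neighbor index (see Vector Indexing above), but the similarity metric is unchanged.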