Unstructured
Unstructured is an open-source library and managed service for extracting and transforming data from unstructured documents (PDFs, Word files, emails, HTML, images) into clean, chunked, LLM-ready formats. It is one of the most widely used tools for the document ingestion stage of RAG and data processing pipelines.
What Is Unstructured?
Unstructured, founded by Brian Raymond in 2022, tackles one of the most underappreciated challenges in AI applications: getting data out of real-world documents. Enterprise data lives in PDFs, PowerPoints, Word documents, HTML pages, emails, and scanned images. These formats contain tables, headers, footers, sidebars, images, and complex layouts that simple text extraction tools butcher. Unstructured provides intelligent parsers that understand document structure and extract content with layout awareness, preserving the semantic meaning of headings, lists, tables, and narrative text.
The library supports over 20 document types, including PDF, DOCX, PPTX, XLSX, HTML, Markdown, RTF, EML (email), and images (via OCR). For each format, it applies format-specific parsing strategies. PDF parsing, for example, combines PDFMiner for text extraction with a Detectron2-based deep learning model for layout analysis. This dual approach correctly identifies titles, section headers, narrative text, tables, and figures, even in complex multi-column layouts.
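The format-specific dispatch idea can be sketched in a few lines of plain Python. This is an illustrative simplification, not Unstructured's actual internals: the parser stand-ins and the `partition_any` name are invented for the example (the real library exposes this through its `partition()` entry point, which also performs file-type detection).

```python
from pathlib import Path

# Illustrative sketch: route each file to a format-specific parser,
# the way a partition() entry point dispatches on detected file type.
# These parser functions are stand-ins, not real Unstructured APIs.

def parse_pdf(path):
    # Real PDF parsing combines text extraction with layout analysis.
    return [{"type": "Title", "text": f"(parsed PDF: {path})"}]

def parse_html(path):
    return [{"type": "NarrativeText", "text": f"(parsed HTML: {path})"}]

PARSERS = {".pdf": parse_pdf, ".html": parse_html, ".htm": parse_html}

def partition_any(path):
    """Pick a parser by file extension and return a list of elements."""
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"No parser registered for {path}")
    return parser(path)
```

The key design point is that every parser, regardless of input format, emits the same element shape, so everything downstream (chunking, metadata, embedding) is format-agnostic.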
After extraction, Unstructured provides a chunking pipeline that splits documents into semantically meaningful chunks. Unlike naive text splitters that cut at character counts, Unstructured's chunking respects document structure: it keeps table rows together, avoids splitting mid-sentence, and attaches metadata (source file, page number, element type, section heading) to each chunk. This metadata is critical for RAG systems that need to cite sources and filter by document attributes.
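Structure-aware chunking with attached metadata can be sketched as follows. This is a simplified illustration, not Unstructured's implementation: the element dicts and the `chunk_by_heading` and `_make_chunk` names are invented for the example (the real library provides chunking strategies such as `chunk_by_title`).

```python
# Sketch: group elements under their most recent section heading and
# attach metadata (section, pages) to each chunk, rather than cutting
# blindly at a character count.

def chunk_by_heading(elements, max_chars=500):
    chunks, current, heading = [], [], None
    for el in elements:
        if el["type"] == "Title":
            if current:  # flush the previous section
                chunks.append(_make_chunk(current, heading))
            current, heading = [], el["text"]
        else:
            current.append(el)
            if sum(len(e["text"]) for e in current) > max_chars:
                chunks.append(_make_chunk(current, heading))
                current = []
    if current:
        chunks.append(_make_chunk(current, heading))
    return chunks

def _make_chunk(elements, heading):
    return {
        "text": "\n".join(e["text"] for e in elements),
        "metadata": {
            "section": heading,
            "pages": sorted({e["page"] for e in elements}),
        },
    }

elements = [
    {"type": "Title", "text": "Results", "page": 3},
    {"type": "NarrativeText", "text": "Revenue grew 12% year over year.", "page": 3},
    {"type": "NarrativeText", "text": "Margins held steady at 41%.", "page": 4},
]
chunks = chunk_by_heading(elements)
```

Because each chunk carries its section heading and page numbers, a downstream RAG system can cite "Results, pp. 3-4" rather than an anonymous blob of text.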
The Unstructured Platform (their managed service) adds production-grade features: API-based document processing, batch ingestion from cloud storage (S3, Azure Blob, Google Cloud Storage), and pre-built connectors for destination systems (Pinecone, Weaviate, Qdrant, Elasticsearch). The Platform handles scaling, GPU provisioning for layout analysis, and document format detection automatically.
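A source-to-destination pipeline like the ones the Platform runs can be described declaratively. The configuration below is hypothetical: the field names and values are invented for illustration and do not reflect the Platform's actual schema (though `hi_res` is a real partition-strategy name in the open-source library).

```python
# Hypothetical pipeline configuration, expressed as a plain dict.
# Field names are illustrative only, not the Platform's real schema.
pipeline = {
    "source": {"type": "s3", "bucket": "s3://corp-docs/quarterly-reports/"},
    "processing": {"strategy": "hi_res", "chunking": "by_title"},
    "destination": {"type": "pinecone", "index": "earnings-kb"},
}
```

The point of the shape is the separation of concerns: swap the `source` or `destination` block and the processing stage in the middle stays unchanged.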
For teams building RAG systems, Unstructured fills a critical gap. Frameworks like LangChain and LlamaIndex provide document loaders, but these often use basic text extraction that loses structural information. Unstructured's layout-aware parsing produces significantly cleaner chunks, which translates directly to better retrieval quality and more accurate LLM responses.
Real-World Use Cases
Enterprise document ingestion for RAG
A financial services firm uses Unstructured to process 200,000 quarterly earnings reports, analyst notes, and regulatory filings into a RAG knowledge base. Unstructured's layout analysis correctly handles complex PDF tables containing financial data, enabling analysts to ask questions like "What was Company X's EBITDA margin in Q3 2025?" and receive accurate, cited answers.
Email and attachment processing pipeline
A legal discovery team processes 500,000 emails and their attachments using Unstructured. The library handles EML parsing, extracts text from attached PDFs and Word documents, and produces structured chunks with metadata (sender, date, subject, attachment name). The processed data feeds into a search system for case review, reducing manual review time by 75%.
Multi-format knowledge base migration
A company migrating from a legacy document management system uses Unstructured to process their entire document library: 50,000 PDFs, 30,000 Word docs, and 20,000 HTML pages. Unstructured normalizes all formats into a consistent chunked structure with uniform metadata, enabling a unified search experience across historically siloed document collections.
Common Misconceptions
Simple text extraction (like PyPDF) is good enough for RAG.
Basic text extraction tools ignore document structure, producing flat text where tables become garbled strings, headers are mixed with body text, and multi-column layouts merge incorrectly. This garbage-in-garbage-out problem is a leading cause of poor RAG quality. Layout-aware parsing with Unstructured produces dramatically cleaner chunks.
Unstructured is only useful for PDFs.
Unstructured supports 20+ document formats including Word, PowerPoint, Excel, HTML, Markdown, email, and images with OCR. It is a general-purpose document processing library, not a PDF-only tool. Its strength is normalizing all these formats into a consistent, chunked output.
Document parsing is a solved problem that does not need a dedicated tool.
Document parsing is deceptively difficult. Real-world PDFs contain scanned images, complex tables, multi-column layouts, headers/footers, watermarks, and embedded charts. Each of these requires specialized handling. Unstructured invests significant engineering effort into handling these edge cases, which is why it has become a popular choice for production document ingestion.
Why Unstructured Matters for Your Business
Unstructured matters because the quality of a RAG system is limited by the quality of its input data. If document parsing produces garbled text, missing tables, or lost structure, no amount of prompt engineering or model capability can compensate. Unstructured ensures that the data entering your RAG pipeline is clean, well-structured, and enriched with metadata. For enterprises sitting on millions of unstructured documents, Unstructured is the bridge between raw files and AI-accessible knowledge.
How Salt Technologies AI Uses Unstructured
Salt Technologies AI uses Unstructured as the first stage of our RAG Knowledge Base pipelines. We deploy it for enterprise clients with complex document collections containing PDFs with tables, multi-column layouts, and scanned images. Our standard pipeline runs Unstructured for layout-aware extraction, then feeds the structured output into LlamaIndex or LangChain for embedding and indexing. For particularly complex PDFs (scientific papers, regulatory filings), we combine Unstructured with LlamaParse for maximum extraction fidelity.
Further Reading
- RAG vs. Fine-Tuning: Choosing the Right Approach (Salt Technologies AI Blog)
- AI Development Cost Benchmark 2026 (Salt Technologies AI Datasets)
- Unstructured Official Documentation (Unstructured)
Related Terms
Document Ingestion Pipeline
A document ingestion pipeline is the automated workflow that converts raw documents (PDFs, web pages, Word files, spreadsheets) into structured, chunked, and embedded content ready for storage in a vector database. It handles parsing, cleaning, metadata extraction, chunking, embedding generation, and loading. This pipeline determines the quality of your entire downstream AI system.
Chunking
Chunking is the process of splitting documents into smaller, semantically meaningful segments for storage in a vector database and retrieval in a RAG pipeline. The chunk size, overlap, and splitting strategy directly impact retrieval quality and LLM answer accuracy. Poor chunking is the most common cause of underwhelming RAG performance.
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on the model's training data, RAG systems search vector databases, document stores, or APIs to inject fresh, factual context into each prompt. This dramatically reduces hallucinations and enables LLMs to answer questions about private, proprietary, or real-time data.
RAG Pipeline
A RAG pipeline is an architecture that augments large language model responses by retrieving relevant documents from an external knowledge base before generating answers. It combines retrieval (typically vector search) with generation, grounding LLM output in verified, up-to-date information. This pattern sharply reduces hallucinations and enables domain-specific accuracy without retraining the model.
LlamaParse
LlamaParse is a managed document parsing service built by LlamaIndex that uses AI models to extract high-fidelity structured content from complex documents, particularly PDFs with tables, charts, and multi-column layouts. It is designed specifically as the ingestion layer for RAG and LLM applications.
LlamaIndex
LlamaIndex is an open-source data framework purpose-built for connecting large language models to private, structured, and unstructured data sources. It excels at data ingestion, indexing, and retrieval, making it the go-to choice for building production RAG pipelines.