Salt Technologies AI
Core AI Concepts

Inference

Inference is the process of using a trained AI model to generate predictions or outputs from new input data. In the context of LLMs, inference is every API call where you send a prompt and receive a generated response. Inference is the runtime phase of AI (as opposed to training) and accounts for the majority of ongoing costs, latency considerations, and scaling challenges in production AI systems.

On this page
  1. What Is Inference?
  2. Real-World Use Cases
  3. Common Misconceptions
  4. Why It Matters
  5. How We Use It
  6. FAQ

What Is Inference?

AI models go through two distinct phases: training and inference. Training is the computationally expensive process of learning patterns from data, which happens once (or periodically). Inference is the ongoing process of applying the trained model to new inputs, which happens on every user request. When you send a message to ChatGPT or call the OpenAI API, you are performing inference. The model's weights are fixed; it is simply computing an output from your input.

Inference costs dominate the economics of production AI. Training GPT-4 cost an estimated $100 million, but that is a one-time investment amortized across billions of inference requests. For businesses using LLM APIs, inference is the recurring cost that scales with usage. At $2.50 per million input tokens, a system handling 10,000 queries per day with 2,000 tokens per query costs roughly $1,500 per month in inference alone. Optimizing inference efficiency (through caching, model selection, prompt compression, and batching) directly reduces operational costs.
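The arithmetic above can be sketched as a small cost model. The prices and volumes below are the article's example figures, not current vendor rates:

```python
# Cost model using the example figures above: per-token prices and query
# volumes are illustrative, not current vendor rates.

def monthly_inference_cost(queries_per_day: int,
                           input_tokens: int,
                           output_tokens: int,
                           input_price_per_m: float,
                           output_price_per_m: float = 0.0,
                           days: int = 30) -> float:
    """Estimated monthly inference spend in dollars."""
    per_query = (input_tokens * input_price_per_m +
                 output_tokens * output_price_per_m) / 1_000_000
    return per_query * queries_per_day * days

# 10,000 queries/day at 2,000 input tokens each, $2.50 per million input
# tokens (output tokens ignored, as in the example above):
cost = monthly_inference_cost(10_000, 2_000, 0, 2.50)
print(f"${cost:,.0f}/month")  # → $1,500/month
```

Extending the same function with output-token pricing and realistic response lengths gives a quick first-pass budget for any proposed deployment.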

Inference latency is the time between sending a request and receiving a response. For LLMs, this is typically 500ms to 5 seconds depending on model size, input length, output length, and server load. Streaming responses (sending tokens as they are generated rather than waiting for the complete response) improves perceived latency and user experience. For real-time applications, inference latency is a critical design constraint that influences model selection, hosting decisions, and architecture.
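The perceived-latency benefit of streaming can be illustrated with a simulation. This is not a real model call: the token generator below is a stand-in that assumes roughly 20 ms per token.

```python
import time

# Simulation only (no real model): each token takes ~20 ms to produce.
# Streaming surfaces the first token after one step; a blocking response
# waits for every token before showing anything.

def generate_tokens(n_tokens: int, ms_per_token: float = 20.0):
    for i in range(n_tokens):
        time.sleep(ms_per_token / 1000)
        yield f"tok{i}"

# Blocking: nothing is shown until the whole response exists.
start = time.monotonic()
full_response = " ".join(generate_tokens(25))
blocking_wait = time.monotonic() - start        # ~0.5 s

# Streaming: the first token arrives almost immediately.
start = time.monotonic()
stream = generate_tokens(25)
first_token = next(stream)
first_token_wait = time.monotonic() - start     # ~0.02 s
print(f"blocking {blocking_wait:.2f}s vs first token {first_token_wait:.3f}s")
```

Total generation time is identical in both cases; streaming only changes when the user starts seeing output, which is why it improves perceived rather than actual latency.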

Self-hosted inference gives organizations full control over latency, throughput, and data privacy, but requires significant GPU infrastructure investment. A single Llama 3 70B deployment requires at least 2 A100 GPUs ($4,000 to $8,000 per month in cloud costs). Inference optimization techniques like quantization (reducing model precision from 16-bit to 4-bit), speculative decoding, and continuous batching can double throughput on the same hardware. Salt Technologies AI evaluates hosted vs. self-hosted inference for every project based on volume, latency, cost, and data privacy requirements.
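The GPU sizing claim can be checked with weights-only arithmetic. This rough sketch ignores KV-cache and activation memory, which add meaningful overhead in practice:

```python
# Weights-only VRAM sketch backing the GPU sizing above; ignores KV cache
# and activation memory, which add real overhead on top of these numbers.

def weights_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Decimal GB needed to hold the model weights alone."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weights_gb(70, 16))  # 140.0 GB at 16-bit -> needs 2x 80 GB A100s
print(weights_gb(70, 4))   # 35.0 GB at 4-bit  -> fits a single large GPU
```

This is why 4-bit quantization is often the difference between a multi-GPU deployment and a single-GPU one for 70B-class models.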

Real-World Use Cases

1. Real-Time Customer Support

Optimizing inference latency for customer-facing chatbots where response time directly impacts satisfaction. Using streaming responses, model caching, and edge deployment to achieve sub-1-second first-token latency. Fast inference keeps conversations feeling natural and reduces abandonment rates.

2. Batch Document Processing

Processing thousands of documents overnight using batch inference APIs (like OpenAI Batch API at 50% cost reduction). Insurance claims, legal discovery, and medical records analysis run as batch jobs where latency is less important than cost efficiency.
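The batch-vs-realtime trade-off reduces to simple arithmetic. The 50% discount matches the figure above; the document counts and the $2.50-per-million-token price are assumptions for illustration:

```python
# Illustrative batch-vs-realtime cost for a document job. The 50% batch
# discount matches the text above; document counts and the $2.50/M token
# price are assumptions.

def job_cost(docs: int, tokens_per_doc: int, price_per_m: float,
             batch_discount: float = 0.0) -> float:
    """Total cost in dollars for processing a document set."""
    total_tokens = docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m * (1 - batch_discount)

realtime = job_cost(10_000, 5_000, 2.50)                     # $125.00
batched = job_cost(10_000, 5_000, 2.50, batch_discount=0.5)  # $62.50
print(realtime, batched)
```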

3. Edge AI for Low-Latency Applications

Deploying quantized models on edge devices or local servers for applications requiring sub-100ms inference (real-time translation, factory quality inspection, autonomous vehicle perception). Edge inference eliminates network latency and works offline.

Common Misconceptions

Inference is cheap and not worth optimizing.

Inference costs scale linearly with usage and can quickly become the largest line item in an AI project's operating budget. A medium-volume application (50,000 queries/day) on GPT-4o costs $7,000+ per month. Optimization techniques (caching, model routing, prompt compression) can reduce this by 40 to 70%, saving tens of thousands annually.

Faster inference always means lower quality.

Modern optimization techniques like quantization, speculative decoding, and model distillation maintain 95-99% of output quality while cutting latency by 50-80% and reducing costs proportionally. Quality loss is measurable and often negligible for production use cases. Always benchmark optimized models against baselines on your specific tasks.

Self-hosted inference is always cheaper than API-based inference.

Self-hosting is only cheaper at high volumes (typically 100,000+ queries per day). Below that threshold, the fixed costs of GPU infrastructure, DevOps staffing, and model management exceed API costs. The break-even point depends on model size, query volume, and required uptime. Many organizations use a hybrid approach: API-based for burst capacity and self-hosted for baseline load.
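The break-even point described above can be sketched as a one-line formula. The $30,000 all-in monthly fixed cost (GPUs, DevOps staffing, on-call) and the $0.01-per-query API price are placeholder figures, not quotes:

```python
# Hedged break-even sketch for API vs self-hosted inference. The $30,000
# all-in monthly fixed cost and $0.01 per API query are placeholders.

def breakeven_queries_per_day(fixed_monthly: float,
                              cost_per_query: float,
                              days: int = 30) -> float:
    """Daily volume at which monthly API spend equals the fixed hosting cost."""
    return fixed_monthly / (cost_per_query * days)

volume = breakeven_queries_per_day(30_000, 0.01)
print(f"{volume:,.0f} queries/day")  # ≈ 100,000 queries/day
```

Below the computed volume, the API is cheaper; above it, self-hosting wins on marginal cost, which is what makes the hybrid (API for burst, self-hosted for baseline) pattern attractive.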

Why Inference Matters for Your Business

Inference is where AI delivers value to users, and it is also where costs accumulate. Every customer interaction, every document processed, and every recommendation generated requires inference. Understanding inference economics (cost per query, latency requirements, throughput limits) is essential for building sustainable AI products. Organizations that optimize inference achieve better user experiences, lower costs, and more scalable systems.

How Salt Technologies AI Uses Inference

Salt Technologies AI designs inference architectures optimized for each client's specific requirements. We evaluate API-based, self-hosted, and hybrid inference strategies based on query volume, latency targets, budget, and data privacy needs. Our implementations include intelligent caching (reducing redundant API calls by 15-30%), model routing (using cheaper models for simple queries), and streaming responses for user-facing applications. We provide clients with inference cost dashboards that track spending by feature, model, and query type.
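A minimal sketch of the model-routing idea mentioned above. The model names are placeholders and the keyword heuristic is made up; production routers usually use a small trained classifier instead:

```python
# Minimal model-routing sketch: short, simple questions go to a cheap
# model, everything else to a stronger one. Model names are placeholders
# and the keyword heuristic is made up for illustration.

CHEAP_MODEL = "small-model"       # hypothetical name
STRONG_MODEL = "frontier-model"   # hypothetical name
COMPLEX_HINTS = ("analyze", "compare", "summarize", "write")

def route(query: str) -> str:
    short = len(query.split()) < 20
    simple = short and not any(kw in query.lower() for kw in COMPLEX_HINTS)
    return CHEAP_MODEL if simple else STRONG_MODEL

print(route("What are your support hours?"))               # small-model
print(route("Analyze this contract for liability risks"))  # frontier-model
```

Even a crude router like this captures the core economics: if most traffic is simple, most traffic gets the cheap model's price.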

Further Reading

Related Terms

Core AI Concepts
Large Language Model (LLM)

A large language model (LLM) is a deep neural network trained on massive text datasets to understand, generate, and reason about human language. Models like GPT-4, Claude, Llama 3, and Gemini contain billions of parameters that encode linguistic patterns, world knowledge, and reasoning capabilities. LLMs form the foundation of modern AI applications, from chatbots to code generation to enterprise automation.

Core AI Concepts
Tokens

Tokens are the fundamental units of text that LLMs process. A token can be a word, a subword, a character, or a punctuation mark, depending on the model's tokenizer. Understanding tokens is essential for managing LLM costs, fitting content within context windows, and optimizing prompt design. One token is roughly 3/4 of an English word, so 1,000 tokens equal approximately 750 words.
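The 3/4-word heuristic above translates into a quick estimator. Real tokenizers (e.g. OpenAI's tiktoken) give exact counts; this is only a rough planning guide:

```python
# Rule-of-thumb token estimate from the ~3/4-word-per-token heuristic.
# Real tokenizers give exact counts; this is only a rough planning guide.

def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

sample = " ".join(["word"] * 750)
print(estimate_tokens(sample))  # → 1000
```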

Core AI Concepts
Temperature

Temperature is a parameter that controls the randomness and creativity of an LLM's output. A temperature of 0 makes the model deterministic, always choosing the most probable next token. Higher temperatures (0.7 to 1.0) increase randomness, producing more creative and varied responses. Temperature tuning is a critical configuration choice that affects the reliability, creativity, and consistency of AI outputs.
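Mechanically, temperature divides the model's logits before the softmax. The logits below are made up for illustration; T = 0 is handled separately in practice as greedy argmax decoding:

```python
import math

# How temperature reshapes next-token probabilities: logits are divided
# by T before the softmax. Logits here are invented for illustration;
# T = 0 corresponds to greedy argmax and is handled as a special case.

def softmax_with_temperature(logits, temp):
    scaled = [l / temp for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)   # sharp: top token dominates
high = softmax_with_temperature(logits, 1.5)  # flat: more varied sampling
print(round(low[0], 3), round(high[0], 3))
```

Low temperature concentrates probability on the top token (more deterministic); high temperature flattens the distribution (more varied sampling).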

Core AI Concepts
Context Window

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request, including the prompt, system instructions, retrieved context, conversation history, and the generated response. Context window size determines how much information the model can "see" at once. Current frontier models support 128K to 1M+ tokens, but effective utilization decreases with length.

Architecture Patterns
Streaming Response

Streaming response is the technique of delivering LLM-generated text to the user token by token as the model produces it, rather than waiting for the complete response before displaying anything. Using Server-Sent Events (SSE) or WebSocket connections, streaming reduces perceived latency from seconds to milliseconds, creating a real-time conversational experience. Streaming is the standard delivery mechanism for production AI chat interfaces.

Business & Strategy
Total Cost of Ownership (AI)

Total cost of ownership (TCO) for AI captures every expense associated with an AI system over its entire lifecycle: initial development, infrastructure, API costs, data management, monitoring, maintenance, retraining, and team upskilling. Most organizations underestimate AI TCO by 40% to 60% because they budget only for development and ignore operational costs.

Inference: Frequently Asked Questions

How much does LLM inference cost per query?
A typical query costs $0.002 to $0.05 depending on the model and prompt length. GPT-4o with a 2,000-token prompt and 500-token response costs approximately $0.01 per query. Claude 3.5 Sonnet costs roughly $0.013 for the same. GPT-4o-mini costs $0.0006, making it ideal for high-volume, lower-complexity tasks. Multiply per-query cost by daily volume to project monthly expenses.
What is the latency of LLM inference?
API-based LLM inference typically takes 500ms to 3 seconds for the complete response, depending on model size and output length. Streaming reduces perceived latency to 200 to 500ms for the first token. Self-hosted inference on optimized hardware can achieve 100 to 300ms first-token latency. Edge-deployed small models can run inference in under 50ms.
When should I self-host instead of using an LLM API?
Self-host when you have strict data privacy requirements (data cannot leave your infrastructure), need consistent sub-200ms latency, process over 100,000 queries per day (cost break-even point), or require offline capability. For most small-to-medium deployments, API-based inference is simpler and more cost-effective.

14+ Years of Experience · 800+ Projects Delivered · 100+ Engineers · 4.9★ Clutch Rating

Need help implementing this?

Start with a $3,000 AI Readiness Audit. Get a clear roadmap in 1-2 weeks.