ChatGPT API Pricing: How OpenAI Bills in 2026 (Tokens, Batch, Caching)

ChatGPT API Pricing: How OpenAI Bills in 2026 (Tokens, Batch, Caching)

OpenAI’s ChatGPT API is billed by tokens: every input you send and every response the model generates counts toward your usage, and you are charged a per-token rate that varies by model. In 2026 the pricing structure also includes batch processing discounts, prompt caching discounts, and tiered enterprise rates that did not exist a year or two ago. The mechanics matter for any team building on the API, because token usage at scale becomes a meaningful line item quickly.

Learn More About Moesif Grow Your API Business with Moesif 14 day free trial. No credit card required. Try for Free

This guide walks through how OpenAI prices the API in 2026, the model tiers worth knowing about, how to estimate cost ahead of time, the discount mechanisms that materially affect the bill, and how teams typically re-bill or charge back LLM consumption to their own customers. Specific prices change frequently; this post focuses on the structure rather than dollar amounts (check openai.com/api/pricing for the current numbers).

What is the ChatGPT API?

The ChatGPT API is OpenAI’s developer-facing interface to the GPT family of language models. Where the consumer ChatGPT product is a finished application, the API exposes the underlying models for direct integration: you send a prompt, the model returns a response, your application does whatever it needs to with the output.

Most developer use cases fall into a few patterns: chat-style applications, retrieval-augmented generation (RAG) over private data, classification and extraction over structured documents, agent workflows that call APIs as part of a multi-step task. The underlying API behavior is the same across patterns; the integration shape and the token budget vary.

How OpenAI prices the ChatGPT API

The fundamental unit is the token. A token is roughly a word or sub-word piece of text; “tokenization” is the process by which text is split into the units the model actually processes. As a rough rule of thumb, 750 words is around 1,000 tokens in English. The exact ratio depends on the language and content (code, structured data, and non-English text tokenize differently).

OpenAI charges per million tokens, with separate prices for input tokens (the prompt you send) and output tokens (what the model generates back). Output tokens are typically priced higher than input tokens, because generation is more compute-intensive than reading the prompt.

The bill at the end of the month is roughly: sum across calls of (input tokens × input rate) + (output tokens × output rate), with adjustments for any discounts that applied.

The current model tiers

OpenAI’s 2026 lineup includes several model families, each with its own pricing tier. The pattern: larger, more capable models cost more per token; smaller, faster models cost less. The structure (rather than exact numbers, since prices move):

  • Flagship general-purpose models sit at the top of the pricing table. Used when output quality is critical.
  • Mid-tier models trade a small amount of quality for substantially lower per-token cost. The right choice for most production traffic that does not need the absolute highest quality.
  • Small/fast models are aggressively priced for high-volume, low-complexity tasks: classification, simple extraction, lightweight chat. Some are an order of magnitude cheaper than the flagship.
  • Specialized models. OpenAI ships dedicated models for chat-style consumer workloads and code, priced separately.
  • Realtime, image, video, and transcription models. Each has its own rate card. Realtime audio is priced per token by modality; image generation is priced per token with separate input/cached/output rates; video generation (Sora-family) is priced per second; transcription is priced per token plus a per-minute equivalent.
  • Embedding models are priced separately, on a much smaller per-million-token scale. Used for semantic search and RAG.

The right model for a given task is rarely the most expensive one. Production teams that haven’t reviewed their model selection in six months are almost always over-paying.

Fine-tuning availability shifts with each model generation. OpenAI continues to support fine-tuning on its current GPT family, but availability and pricing vary by model and have changed multiple times since 2023. If your roadmap depends on a specific fine-tuning capability, verify the current support status on OpenAI’s documentation before committing.

Input tokens vs output tokens

A practical detail that catches teams by surprise: the input side of a chat call is usually larger than people expect. In a multi-turn conversation, the API is stateless. Every call you make has to include the full conversation history in the prompt, and that history is billed as input tokens on every turn.

By turn 20 of a conversation with 100-token user messages and 200-token assistant responses, each new call carries roughly 6,000 input tokens of accumulated history, since the entire prior conversation is re-sent with every request. Production chat applications spend a significant share of their budget on input tokens for this reason.

The implications:

  • Trim conversation history aggressively. Drop or summarize old turns rather than carrying them forever.
  • System prompts add up. A 500-token system prompt sent on every call is 500 input tokens × every call.
  • Few-shot examples in the prompt are not free. They are billed as input tokens on every call.

The model’s output side is typically capped by the max_tokens parameter, so output cost has a natural ceiling per call. Input cost has no ceiling unless you impose one in your client.

2026 discount mechanisms: Batch API and prompt caching

Two cost-reduction features OpenAI introduced over the last 18 months meaningfully change the cost equation for many production workloads.

Batch API. For workloads that do not need real-time responses (overnight enrichment jobs, async classification, large-scale extraction), the Batch API accepts a file of requests and returns results within a stated SLA (typically up to 24 hours). The trade-off is a substantial per-token discount: Anthropic publishes the Batch API as a flat 50% off both input and output tokens, and OpenAI’s pricing page exposes a comparable Batch column on every flagship model. If your workload is mostly batch-shaped, this is the largest single cost reduction available without changing models.

Prompt caching. When the same prefix appears in many of your prompts (a long system prompt, a stable instruction header, a RAG context that doesn’t change between turns), the provider caches the prefix server-side and charges a discounted rate for cached input tokens on subsequent calls. Cache hits run at a small fraction of the standard input rate: Anthropic publishes a 0.1x multiplier (90% off) for cache reads, and OpenAI’s pricing table shows a similar order-of-magnitude discount for “cached input” across OpenAI’s current model family. Cache TTLs are short (Anthropic offers 5-minute and 1-hour write durations; OpenAI does not publish a single TTL), so the discount is most useful for high-frequency calls.

The combination matters: a RAG application with a stable instruction prefix and overnight batch jobs can routinely cut its bill by 50-70% versus a naive implementation that ignores both features.

How to estimate your cost: a worked example

Suppose you are building a customer-support assistant that handles 10,000 conversations per day, with each conversation averaging 6 turns. The system prompt is 400 tokens, user messages average 80 tokens per turn, and assistant responses average 250 tokens per turn.

Per conversation, the input bills compound across turns because the prior conversation history is re-sent on every call:

  • System prompt re-sent on each turn: 400 × 6 = 2,400 input tokens
  • User messages, accumulating across turns: 80 × (1+2+3+4+5+6) = 1,680 input tokens
  • Assistant responses, re-sent as input on subsequent turns: 250 × (0+1+2+3+4+5) = 3,750 input tokens
  • Assistant responses generated this conversation: 250 × 6 = 1,500 output tokens

Total per conversation: roughly 7,800 input tokens and 1,500 output tokens.

Across 10,000 conversations/day: about 78M input tokens and 15M output tokens per day, or roughly 2.3B input and 450M output per month.

At current flagship-model rates that runs into thousands of dollars per month. Switching to a mid-tier model or applying prompt caching to the stable system prompt can cut that by half or more without changing the integration shape. Switching to the Batch API does not apply here because the use case is real-time, but a parallel batch job for nightly conversation summarization or analytics would qualify.

The shape of this exercise is what matters. Estimate before you ship, instrument once you ship, and revisit when usage doubles.

ChatGPT API vs ChatGPT Plus subscription

A frequent confusion: ChatGPT Plus is OpenAI’s consumer subscription (around $20/month at launch pricing) that gives priority access and feature unlocks for the chat.openai.com web product. It is not API access. API access is billed separately at the per-token rates above.

Teams that need both (developers using the consumer ChatGPT product for their own work plus production API integration) end up paying for both separately. Some enterprise tiers bundle, but the default assumption is that the two products are independent SKUs.

How ChatGPT API pricing compares to Anthropic Claude and Google Gemini

OpenAI is not the only frontier-model API in 2026. The competitive landscape worth knowing:

  • Anthropic Claude API. Per-token pricing structurally similar to OpenAI, with a flat 50% Batch API discount and explicit prompt-caching multipliers (5-minute or 1-hour write windows, 0.1x cache-read multiplier). The current lineup at the time of writing spans Opus, Sonnet, and Haiku tiers with several recent versions of each in active service alongside Anthropic’s published deprecation schedule. Verify the latest model names and rates on Anthropic’s pricing page before committing to a specific model.
  • Google Gemini API. Per-token pricing structurally similar, with the Gemini family of models. Cost-competitive at the lower end; bundled deals available for organizations on Google Cloud.

The frontier-model market in 2026 is competitive enough that all three providers offer broadly similar pricing structures, with the specific dollar rates moving frequently as model generations roll out. Multi-provider deployment (routing different workloads to different providers based on price-performance) is increasingly common at meaningful scale, and this is where a dedicated AI gateway becomes important: the WSO2 AI Gateway routes outbound calls across providers, applies guardrails, and reports token cost per customer through Moesif’s metering layer.

For an evaluation of provider choice that goes beyond pricing, see our API pricing strategy guide. The decision is rarely just about per-token rates.

Rebilling and chargeback for LLM consumption

A pattern that has become standard at companies building products on top of the ChatGPT API: re-billing or charging back LLM consumption to the end customer. The customer’s usage of your product drives LLM API calls, and you pass through (often with a margin) the cost of those calls.

This is where Moesif shows up in the conversation. Moesif’s usage-based billing records every customer’s API consumption, attributes the underlying LLM costs back to that customer, and syncs the usage to billing providers (Stripe, Recurly, Chargebee). The work of measuring “customer X consumed Y tokens this month” becomes operational telemetry rather than a quarterly finance project.

Two common patterns:

  • Pass-through pricing with a markup. Charge customers a fixed multiplier (1.5x-3x) of the underlying LLM cost. Works when the value to the customer is proportional to the LLM consumption.
  • Tiered subscriptions with token allowances. Bundle a monthly token allowance into each subscription tier, charge overage at the per-token rate. Works when usage is reasonably predictable and customers value cost certainty.

Whichever pattern fits your product, the prerequisite is per-customer LLM cost attribution. Without it, neither pricing model can be executed cleanly.

Next steps

The ChatGPT API pricing model is straightforward in structure (per token, per model) but operationally complex once you account for Batch, caching, multi-turn conversation cost accumulation, and per-customer attribution. The teams that handle it cleanly are the teams that instrument early and revisit model selection every six months.

If you are building a product that meters or re-bills LLM consumption, start a 14-day Moesif free trial to see per-customer attribution across your ChatGPT API usage. No credit card required.

Frequently asked questions

How does ChatGPT API pricing work? Per token. You pay separately for input tokens (the prompt) and output tokens (the response), at a per-million-token rate that varies by model. Discounts apply for the Batch API and prompt caching.

What is a token in the ChatGPT API? A unit of text the model processes. Roughly a word or sub-word piece. For English text, ~750 words is around 1,000 tokens.

Is the ChatGPT API expensive? Cost depends on volume and model choice. A handful of calls per day costs cents; production workloads at meaningful scale run into thousands of dollars per month. The cost equation depends heavily on model selection, Batch API usage, and prompt caching.

How is the ChatGPT API different from ChatGPT Plus? ChatGPT Plus is OpenAI’s consumer subscription for the chat.openai.com product (around $20/month at launch pricing). The API is billed separately by token. Different products, different SKUs.

Does OpenAI offer volume discounts? Effectively yes through the Batch API (lower rates for async workloads) and prompt caching (lower rates for repeated prompt prefixes). Direct enterprise-tier pricing is available for high-volume customers via OpenAI’s sales team.

Can I track per-customer LLM costs in my product? Yes, by attributing the API calls your application makes to the customer that triggered them and accumulating the token counts per customer. Moesif’s usage-based billing layer automates this for teams building products on top of the ChatGPT API.

Learn More About Moesif Deep API Observability with Moesif 14 day free trial. No credit card required. Try for Free
Grow Your API Business with Moesif Grow Your API Business with Moesif

Grow Your API Business with Moesif

Learn More