Does it model prompt caching and batch discounts?

Yes. Set the share of input tokens served from cache and the calculator discounts that portion (cached input is billed at roughly a quarter of the normal rate, a cross-provider average). Toggle Batch mode for asynchronous jobs to apply the standard 50% discount to both input and output. Caching and batch stack, so a heavily cached batch workload can land near a quarter of list price.

Can it show costs in my local currency?

Yes. Pick your country at the top of the page and every figure — per-million prices, per-request cost, and daily, monthly or yearly spend — is converted to your local currency at an approximate rate. Providers still bill in USD, so treat the converted figures as planning estimates.

LLM API Cost Calculator — compare 36 models in seconds

Paste a prompt to count tokens, dial in your traffic, and see exactly what GPT, Claude, Gemini, DeepSeek, Llama, Mistral and more will cost you per month — with a sortable table, value scores, a context-window fit finder, shareable scenarios and CSV export.

🌍 Show all costs in your local currency. Providers bill in USD; local amounts are approximate estimates.

1 · Estimate your prompt & set your traffic

Token counts use a client-side BPE-approximate estimator (±10–15%). Nothing you paste ever leaves your browser.


Requests / day

Avg input tokens / request

Avg output tokens / request

Prompt caching — 0% of input cached

Cached input is billed at roughly 25% of the normal rate (a cross-provider average). Stacks with batch.

Batch API mode: ~50% off async jobs

Project totals: requests / month, input tokens / month, output tokens / month, cheapest monthly

2 · Cheapest model that fits your context window

Your context need = input + output tokens per request (plus headroom for history or RAG chunks). Models that don't fit are dimmed in the table below.

Minimum context needed, in tokens (filled automatically from the sliders).

3 · Side-by-side model comparison

Click any column to sort. Prices are standard API list rates in USD per 1M tokens — estimates as of June 2026.


Model	Tier	Input /1M	Output /1M	Context	Quality*	Monthly	Value

✎ Spotted a stale price? Suggest a correction

Monthly spend by provider

Cheapest model vs. highest-quality model per provider at your current traffic (log scale).

PRO

Unlock the cost-modeling toolkit

A blended multi-model router calculator, saved scenarios with side-by-side compare, and one-click PDF + live-formula Excel export. One-time unlock — runs offline on this device.

Try it instantly with demo code AV-TOKEN-TALLY-DEMO · Get a license →

Cut your AI bill

Affiliate

TraceStack

Teams report 15–30% savings

LLM observability that surfaces your ten most expensive prompts and flags silent token bloat in production.

Try TraceStack free →

Affiliate

CacheWarp

Cache hits cost ~$0

Semantic caching proxy: serve repeat and near-duplicate questions from cache instead of paying for fresh generations.

Start caching →

Affiliate

ModelMux

Route 60% of traffic to cheap models

A smart router that sends easy requests to budget models and reserves frontier models for the hard ones.

Route smarter →

📬 Monthly AI pricing digest

One email a month: every price change across 13 providers, plus the updated CSV. No spam, unsubscribe anytime.

Get the digest →

How LLM API pricing actually works

Every major LLM provider bills the same way: you pay separately for input tokens (everything you send — system prompt, conversation history, retrieved documents) and output tokens (everything the model writes back). Output tokens are typically 3–8× more expensive than input tokens, which is why a chatty model with long answers can quietly cost several times more than a terse one at identical request volume. A token is roughly four characters of English text, or about three-quarters of a word; code, non-Latin scripts and unusual punctuation tokenize less efficiently, so budget extra headroom for those workloads.
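The two rules of thumb above (about four characters per token, about three-quarters of a word per token) can be turned into a quick back-of-envelope estimate. This is a minimal sketch that averages the two heuristics; `estimate_tokens` is a hypothetical helper, not the BPE-approximate estimator this page uses, and it will undercount code and non-Latin text.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate for English prose; not a real tokenizer."""
    if not text.strip():
        return 0
    by_chars = len(text) / 4              # ~4 characters per token
    by_words = len(text.split()) / 0.75   # ~0.75 words per token
    # Average the two heuristics to smooth out very short or very long words.
    return round((by_chars + by_words) / 2)


print(estimate_tokens("hello world"))  # 3
```

For billing-critical numbers, use the provider's own token counter instead of a heuristic like this.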

Monthly cost is simple arithmetic once you know three numbers: requests per day, average input tokens per request, and average output tokens per request. TokenTally multiplies those out over a 30-day month against each model's per-million-token rates. The biggest savings levers, in rough order of impact: shorten your system prompt (it's resent on every request), cap output length, use prompt caching for repeated prefixes (most providers discount cached input 50–90%), batch non-urgent jobs (typically 50% off), and route easy requests to a cheaper tier instead of sending everything to a frontier model.
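The arithmetic described above can be sketched in a few lines. The function name `monthly_cost` and the example prices ($2.00 input and $8.00 output per million tokens) are illustrative assumptions, not figures from the comparison table.

```python
def monthly_cost(
    requests_per_day: float,
    input_tokens: float,
    output_tokens: float,
    input_price_per_m: float,
    output_price_per_m: float,
    days: int = 30,
) -> float:
    """List-price monthly spend in USD, before caching or batch discounts."""
    requests = requests_per_day * days
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# 1,000 requests/day, 1,500 input and 400 output tokens per request,
# at hypothetical prices of $2.00 (input) and $8.00 (output) per 1M tokens.
print(monthly_cost(1000, 1500, 400, 2.00, 8.00))  # 186.0
```

In this example the 12M output tokens cost slightly more than the 45M input tokens ($96 vs. $90), which is the output-price effect described earlier.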

Frequently asked questions

How accurate is the token counter?

It's a BPE-approximate estimator that runs entirely in your browser — it mimics how modern tokenizers split words, numbers, punctuation and CJK characters, and is typically within ±10–15% of the real count for English prose. Each provider uses a slightly different tokenizer (o200k, Claude's tokenizer, SentencePiece variants), so even "exact" counts differ between models. For billing-critical work, use the provider's official token-counting endpoint.

Where do the prices come from, and how fresh are they?

Prices are standard pay-as-you-go API list rates in USD per million tokens, compiled from public provider pricing pages and dated in the badge at the top of the page. They are estimates: providers change rates frequently, and the table excludes batch discounts, cached-input rates, long-context surcharges, and negotiated enterprise pricing. If you spot a stale number, use the "suggest a correction" link under the table.

What is the value score?

Value = our editorial quality estimate (0–100, weighted heavily) divided by your per-request cost at the current slider settings, normalized so the best model in view scores 100. It rewards capable-but-cheap models and updates live as you move the sliders — a model that wins at 400 output tokens may lose at 4,000. Quality estimates are editorial judgments blending public benchmarks and community evals, not an official benchmark.
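As a rough illustration of that definition, the sketch below divides quality by per-request cost and rescales so the best model scores 100. The model names, quality numbers and prices are made up, and the sketch ignores whatever extra weighting the page applies to quality, so treat it as the general shape rather than the exact formula.

```python
def value_scores(models: dict, input_tokens: int, output_tokens: int) -> dict:
    """models maps name -> (quality 0-100, input $/1M, output $/1M)."""
    raw = {}
    for name, (quality, in_price, out_price) in models.items():
        # Per-request cost in USD; assumes at least one token so cost > 0.
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        raw[name] = quality / cost
    best = max(raw.values())
    # Normalize so the best model in view scores 100.
    return {name: round(100 * v / best, 1) for name, v in raw.items()}


# Hypothetical models: A has cheap input but pricey output, B the reverse.
models = {"A": (80, 0.5, 4.0), "B": (85, 1.5, 2.0)}
short_answers = value_scores(models, 1000, 400)   # A scores 100 here
long_answers = value_scores(models, 1000, 4000)   # B scores 100 here
```

With the same two models, the winner flips when output length grows from 400 to 4,000 tokens, which is why the score depends on your slider settings.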

Why do output tokens cost so much more than input tokens?

Generating output is sequential: each new token needs its own forward pass through the model, one token at a time, while input tokens are processed in parallel in a single pass. Output therefore ties up hardware far longer per token, and pricing reflects that. Practical upshot: setting a sensible max_tokens and asking for concise answers is one of the highest-leverage cost optimizations available.

How big a context window do I actually need?

Add up: system prompt + conversation history you keep + retrieved documents + the user's message + the maximum response you allow. For a typical chatbot that's 4K–32K tokens; for RAG over long documents, 64K–200K; for whole-codebase or book-length work, 500K+. Note that many models get slower and slightly less accurate near their context limit, and some providers charge premium rates above a threshold — so "fits" isn't the same as "optimal". The finder above highlights the cheapest model that clears your requirement.
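The sum above, and the "cheapest model that clears it" lookup, can be sketched as follows. Both function names, the model entries and their monthly costs are hypothetical; the real finder works from the comparison table and your slider settings.

```python
from typing import Optional


def required_context(system: int, history: int, retrieved: int,
                     user_message: int, max_response: int) -> int:
    """Total tokens one request must fit into the context window."""
    return system + history + retrieved + user_message + max_response


def cheapest_model_that_fits(models: list[dict], need: int) -> Optional[dict]:
    """Return the lowest-cost model whose context window covers `need`."""
    fitting = [m for m in models if m["context"] >= need]
    return min(fitting, key=lambda m: m["monthly_cost"]) if fitting else None


# Hypothetical models with made-up context windows and monthly costs (USD).
models = [
    {"name": "small", "context": 16_000, "monthly_cost": 40.0},
    {"name": "mid", "context": 128_000, "monthly_cost": 120.0},
    {"name": "large", "context": 1_000_000, "monthly_cost": 900.0},
]

need = required_context(800, 6_000, 20_000, 200, 1_000)  # 28,000 tokens
print(cheapest_model_that_fits(models, need)["name"])    # mid
```

The sketch treats "fits" as a hard cutoff; in practice you may want extra headroom, for the reasons given above.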

How can I cut my LLM bill without changing models?

Five proven levers: (1) trim your system prompt — at 1,000 requests/day, every 100 tokens removed saves ~3M input tokens a month; (2) enable prompt caching for stable prefixes; (3) use batch APIs for anything that can wait an hour (usually 50% off); (4) cap and compress outputs — ask for bullet points, not essays; (5) add a router or cascade so cheap models handle the easy 60–80% of traffic. Observability tools (see sidebar) help you find which prompts are actually burning the budget.
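The system-prompt figure in lever (1) is easy to check. A minimal sketch, assuming a 30-day month and a hypothetical input price of $2.00 per million tokens (`tokens_saved_per_month` is an illustrative name):

```python
def tokens_saved_per_month(requests_per_day: int, tokens_removed: int,
                           days: int = 30) -> int:
    """Input tokens no longer sent after trimming the system prompt."""
    return requests_per_day * tokens_removed * days


saved = tokens_saved_per_month(1000, 100)
print(saved)                      # 3000000
print(saved / 1_000_000 * 2.00)   # 6.0 (USD per month at the assumed price)
```

The dollar figure scales linearly with traffic and with your model's input price, so the same trim is worth far more on a high-volume frontier-model workload.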

Is my pasted prompt sent anywhere?

No. TokenTally is a fully static page — the tokenizer, calculator, chart and CSV export all run client-side in your browser. There is no backend, no analytics on your prompt text, and nothing is transmitted when you type or paste.

How do prompt caching and batch discounts change the numbers?

Prompt caching lets you re-use a stable prefix (system prompt, retrieved documents, few-shot examples) so you are not billed full price for the same input on every call — cached input typically costs around a quarter of the normal rate. The Batch API runs non-urgent jobs asynchronously for roughly half price on both input and output. They stack: a heavily-cached batch workload can land near 25% of list price. Set the cached-input share and toggle Batch mode in step 1, and every monthly figure, the value scores and the chart update live.
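To see how the two discounts might combine, here is a minimal sketch. It assumes the discounts multiply (cached input in a batch job is billed at 0.25 × 0.5 of list), which is one plausible reading of "stack" rather than a confirmed description of the calculator; `effective_cost_fraction` and `input_cost_share` are hypothetical names.

```python
def effective_cost_fraction(input_cost_share: float, cached_share: float,
                            batch: bool, cached_rate: float = 0.25,
                            batch_discount: float = 0.5) -> float:
    """Discounted cost as a fraction of list price.

    input_cost_share: portion of the list-price bill that comes from input.
    cached_share: portion of input tokens served from cache.
    """
    # Uncached input pays full rate; cached input pays cached_rate.
    input_multiplier = (1 - cached_share) + cached_share * cached_rate
    # Assumption: batch applies to input and output alike, after caching.
    batch_multiplier = (1 - batch_discount) if batch else 1.0
    output_cost_share = 1 - input_cost_share
    return batch_multiplier * (input_cost_share * input_multiplier
                               + output_cost_share)


# Input-heavy workload: 80% of list cost is input, 90% of input is cached.
fraction = effective_cost_fraction(0.8, 0.9, batch=True)
print(round(fraction, 3))  # 0.23
```

Under these assumptions the workload lands at about 23% of list price. An output-heavy workload would save much less from caching, since caching only touches input.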

Can I model a router that splits traffic across several models?

Yes — that is the blended multi-model router in the Pro toolkit. Assign a percentage of traffic to each of up to three models (for example 70% to a budget model, 25% to a mid model, 5% to a frontier model) and TokenTally returns the blended monthly cost, the blended cost per request, and the saving versus sending 100% of traffic to your most capable model. It is the fastest way to size the payoff of a cascade or router before you build one.
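The blended figure is a weighted average of what each model would cost at full traffic. A minimal sketch, assuming each model's cost scales linearly with its share of requests; the function name, model names and dollar amounts are hypothetical and this is not the Pro toolkit's actual code.

```python
def blended_monthly_cost(split: dict, cost_at_full_traffic: dict) -> float:
    """split maps model -> traffic share (shares must sum to 1).
    cost_at_full_traffic maps model -> monthly USD if it took 100% of traffic.
    """
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_at_full_traffic[model]
               for model, share in split.items())


# Hypothetical monthly costs if each model handled all traffic.
full = {"budget": 50.0, "mid": 300.0, "frontier": 1500.0}
split = {"budget": 0.70, "mid": 0.25, "frontier": 0.05}

blended = blended_monthly_cost(split, full)
saving = full["frontier"] - blended        # vs. sending everything to frontier
per_request = blended / (1000 * 30)        # at an assumed 1,000 requests/day
print(round(blended, 2), round(saving, 2))  # 185.0 1315.0
```

The linear-scaling assumption breaks down if easy and hard requests differ a lot in length, so treat the result as a first-pass estimate.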

More free AI & builder tools

TokenTally is part of a fleet of fast, private, no-login tools. If you are pricing an LLM project you may also want these:

Estimates only. All prices, quality scores and projections on this page are editorial estimates for planning purposes — not quotes, not financial advice, and not affiliated with any model provider. Verify current pricing on each provider's official page before committing to a budget. Affiliate links may earn us a commission at no cost to you.
