OpenKitMule Guide · 2026-07-03
Estimating LLM API Costs Before You Ship: A Practical Guide for GPT-4o, Claude 3.5, Gemini and DeepSeek
You have a prompt that works. Now you need to answer a boring but critical question before your CTO asks it: "What is this going to cost at 10,000 calls per day?"
The answer is not "cheap" or "we will figure it out." It is a number, and you can compute it in about sixty seconds. This guide walks through the exact math, current pricing for the four model families most teams actually ship with, and a zero-dependency CLI you can use to skip the spreadsheet entirely.
1. What actually gets billed: tokens, not characters
LLM providers bill by tokens, not by characters or words. A token is roughly three to four English characters, or one to two Chinese characters. Every request has two token counts:
- Input tokens - everything you send: system prompt, user message, prior conversation history, tool definitions, retrieved context.
- Output tokens - what the model generates in its reply.
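Output tokens are almost always priced higher than input tokens, often three to five times higher. That single fact drives most cost decisions: token for token, trimming the output saves three to five times as much as trimming the input, so shortening replies is usually the first lever to pull unless your prompts are far longer than your outputs.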
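For a first pass before you reach for a real tokenizer, the "three to four characters per token" rule can be sketched in a few lines. This is only a rough heuristic for English text; the function name `estimate_tokens`, the default of four characters per token, and the sample prompts are illustrative assumptions, not any provider's tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by chars_per_token, rounded up.

    chars_per_token=4.0 is an assumed average for English text; real
    tokenizers will give different counts.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


# Input tokens cover everything you send, so sum every part of the request.
system_prompt = "You are a support ticket classifier. Reply with a JSON label."
user_message = "My invoice shows a charge I do not recognize."
input_estimate = estimate_tokens(system_prompt) + estimate_tokens(user_message)
print(f"estimated input tokens: {input_estimate}")
```

Passing a smaller `chars_per_token` (for example 3.0) gives a more conservative, higher estimate.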
2. Current pricing (per 1M tokens, USD)
These are the models most production teams use right now. All numbers reflect publicly listed API rates at the time of writing; always verify on the provider's own pricing page before signing anything.
| Model | Input | Output | Best for |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | General reasoning, multimodal. |
| GPT-4o mini | $0.15 | $0.60 | Cheapest OpenAI tier; great default for high-volume tasks. |
| GPT-4 Turbo | $10.00 | $30.00 | Legacy tier; usually worth swapping for 4o. |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long-form writing, careful reasoning, code. |
| Claude 3 Haiku | $0.25 | $1.25 | Fastest Claude; cheap enough for classification pipelines. |
| Gemini 1.5 Pro | $1.25 | $5.00 | Very long context (up to 2M tokens). |
| Gemini 1.5 Flash | $0.075 | $0.30 | Cheapest tier in this table. |
| DeepSeek V3 | $0.27 | $1.10 | Strong price/performance for reasoning-heavy work. |
| DeepSeek R1 | $0.55 | $2.19 | Reasoning model; strong on math and code. |
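To see how the output-to-input price gap plays out across this table, a small sketch can compute the ratio per model. The prices are copied from the table above; the dictionary layout and the function name `output_to_input_ratio` are illustrative choices.

```python
# (input, output) prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
    "DeepSeek V3": (0.27, 1.10),
    "DeepSeek R1": (0.55, 2.19),
}


def output_to_input_ratio(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES[model]
    return output_price / input_price


for model in PRICES:
    print(f"{model:<20} {output_to_input_ratio(model):.1f}x")
```

Every model in the table lands between 3x and 5x.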
3. The sixty-second cost estimate
The formula is simple:
cost = (input_tokens / 1,000,000) * input_price
     + (output_tokens / 1,000,000) * output_price
Worked example: you are building a customer-support classifier. Each call sends about 800 input tokens (system prompt plus the user's message) and generates about 60 output tokens (a JSON label). You expect 10,000 calls per day.
| Model | Per call | Per day (10K calls) | Per month (30 days) |
|---|---|---|---|
| Gemini 1.5 Flash | $0.000078 | $0.78 | $23 |
| GPT-4o mini | $0.000156 | $1.56 | $47 |
| Claude 3 Haiku | $0.000275 | $2.75 | $83 |
| DeepSeek V3 | $0.000282 | $2.82 | $85 |
| Claude 3.5 Sonnet | $0.003300 | $33.00 | $990 |
The same task on Gemini 1.5 Flash is about 42 times cheaper than on Claude 3.5 Sonnet. Picking the right tier for the task is by far the highest-leverage cost decision on an LLM-powered feature.
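The table above can be reproduced with a short script, which is handy when you want to swap in your own token counts or call volume. This is a minimal sketch: the function names are illustrative, the prices come from section 2, and the 30-day month is an assumption that matches the table's figures.

```python
# (input, output) prices in USD per 1M tokens, from the pricing table in section 2.
PRICES = {
    "Gemini 1.5 Flash": (0.075, 0.30),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3 Haiku": (0.25, 1.25),
    "DeepSeek V3": (0.27, 1.10),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}


def cost_per_call(model: str, input_tokens: int, output_tokens: int) -> float:
    """Apply the formula: tokens / 1M * price, summed over input and output."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price


def project(model, input_tokens, output_tokens, calls_per_day, days_per_month=30):
    """Return (per_call, per_day, per_month) in USD; 30-day month is an assumption."""
    per_call = cost_per_call(model, input_tokens, output_tokens)
    per_day = per_call * calls_per_day
    return per_call, per_day, per_day * days_per_month


# The support-classifier example: 800 input tokens, 60 output tokens, 10,000 calls/day.
for model in PRICES:
    per_call, per_day, per_month = project(model, 800, 60, 10_000)
    print(f"{model:<20} ${per_call:.6f}  ${per_day:>6.2f}  ${per_month:>7.2f}")
```

Changing `800`, `60`, or `10_000` to your own measured values gives the same three columns for your workload.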
4. Where estimates go wrong in practice
- Output tokens are wildly underestimated. Developers plan for a 200-token JSON reply, then the model writes a friendly 800-token explanation. Sample real outputs from at least fifty representative inputs before committing to a budget.
- Retries multiply cost. If 5% of requests are retried once (rate limits, malformed JSON, tool errors), your real bill is about 1.05 times the estimate, and more if a request loops through several attempts. The effect is worse for reasoning models, which generate long hidden chains of thought before finishing.
- Retrieval context is invisible until you log it. RAG pipelines silently append thousands of retrieved tokens per call. Log the actual prompt size in production, not the template size.
- Tokenizers differ between providers. The same English sentence tokenizes to slightly different counts on GPT vs. Claude vs. Gemini. For serious planning, use each provider's own tokenizer; for rough estimation, "characters divided by four" is within about 10% for typical English text.
- Prompt caching changes the math. Anthropic and OpenAI now bill cached input tokens at a large discount. If your system prompt is stable across calls, design for it.
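Two of the pitfalls above, retries and prompt caching, are easy to fold into the basic formula. The sketch below is one possible way to do that, under stated assumptions: each retried request is billed once more in full, cached input tokens are billed at a fraction of the normal input price, and any surcharge for writing to the cache is ignored. The parameter names and the 0.1 cache factor in the example are hypothetical; check your provider's pricing page for the actual discount.

```python
def adjusted_cost_per_call(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # USD per 1M input tokens
    output_price: float,  # USD per 1M output tokens
    *,
    retry_rate: float = 0.0,           # fraction of calls retried once (assumption)
    cached_input_tokens: int = 0,      # input tokens served from the prompt cache
    cached_price_factor: float = 1.0,  # cached price as a fraction of input_price
) -> float:
    """Expected USD cost per logical call, including retries and cached input."""
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_input_tokens
    # Cached tokens count as a discounted fraction of a normal input token.
    billable_input = fresh_tokens + cached_input_tokens * cached_price_factor
    base = (billable_input / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price
    # Each retried call is assumed to be billed once more in full.
    return base * (1 + retry_rate)


# Claude 3.5 Sonnet rates from section 2, with the 800/60-token classifier call.
plain = adjusted_cost_per_call(800, 60, 3.00, 15.00)
with_retries = adjusted_cost_per_call(800, 60, 3.00, 15.00, retry_rate=0.05)
with_cache = adjusted_cost_per_call(
    800, 60, 3.00, 15.00,
    retry_rate=0.05,
    cached_input_tokens=600,   # hypothetical: a 600-token stable system prompt
    cached_price_factor=0.1,   # hypothetical discount; verify with your provider
)
print(f"plain: ${plain:.6f}  retries: ${with_retries:.6f}  cached: ${with_cache:.6f}")
```

Feeding in token counts logged from production, rather than template sizes, also covers the retrieval-context pitfall.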
5. Skip the spreadsheet: a free CLI
OpenKitMule maintains a small tool called PromptForge that does this math from your terminal - token counting, cost estimation, template rendering, and A/B comparison between two prompts. It is a single Python file with no third-party dependencies (Python 3.9+ stdlib only).
Typical usage after downloading the kit:
python main.py tokens "Summarize this article: ..." --model gpt-4o-mini
python main.py cost "Summarize this article: {article}" --model gpt-4o-mini --count 10000
python main.py compare "You are a helpful assistant." "You are a senior Python expert."
Output is structured JSON or Markdown, so you can pipe it into a CI job to catch cost regressions before a prompt change ships.
6. Related reading
- How to hand a kit to an AI agent instead of running it yourself - the workflow OpenKitMule kits are designed for.
- PromptForge - the CLI referenced above.
- Agent Scraper Kit - an LLM-driven scraping framework where cost estimation matters heavily at scale.
- News Refiner Kit - a local-LLM alternative for the classification/summary use case, with zero API cost.
Try the estimator
PromptForge is free, offline-first, and takes about a minute to download and run. Go from "we should probably know" to "here is the monthly number."