AI Cost Curves Don’t Behave: Why Classical Scaling Assumptions Break

June 2026 · nokhiz.github.io

TL;DR — 6 Central Insights ⚡

#	Insight
1	Classical workloads scale on requests — AI scales on tokens, context length, and model tier
2	Context window costs grow super-linearly — a longer conversation costs disproportionately more per turn
3	Auto-scaling intuition breaks — model load latency makes reactive scale-down actively expensive
4	The unit of cost attribution must shift from requests/second to tokens/conversation
5	Prompt discipline — shorter, more precise inputs — is the highest-leverage cost lever available
6	Semantic caching and async batching are the two structural optimizations worth engineering properly

1. The Mental Model That Breaks 🏗️

💡 Key Message: Every cost intuition engineers built over the last twenty years was trained on a specific kind of workload — one that AI systems fundamentally are not.

Classical systems have a reassuringly predictable relationship with cost. More requests arrive, more compute is needed, bills go up in proportion. Add a cache, cut the database load in half. Add a CDN, cut egress costs by 80%. The patterns are well-understood because the unit of work is well-understood: a request, a query, a transaction.

AI systems do not work this way. The unit of work is not the request — it is the token. And tokens behave differently from requests in ways that quietly break three assumptions that most engineers treat as facts of life.

1. Assumption 📦 “Requests Drive Cost”

In a classical API, two requests to the same endpoint cost roughly the same. The compute is bounded by the operation — fetch a record, apply logic, return JSON.

In an AI system, two requests to the same endpoint can differ in cost by 50× or more, depending on one variable: how many tokens are involved. A user who sends a one-line query costs almost nothing. A user who pastes a 20-page document and asks for a summary consumes as much compute as thousands of simple queries.

📌 Example: A support chatbot handling both “where is my order?” and “analyze this contract and flag all termination clauses” will show wildly unpredictable per-request costs — not because anything went wrong, but because token volume drives cost, not request count.
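A minimal sketch of that spread, assuming made-up prices and token counts (the constants `INPUT_PRICE_PER_M` and `OUTPUT_PRICE_PER_M` and both example requests are illustrative, not taken from any provider's price list):

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; real rates vary by provider and model.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00


@dataclass
class Request:
    label: str
    input_tokens: int
    output_tokens: int


def request_cost(req: Request) -> float:
    """Cost of one call: token volume, not request count, sets the price."""
    return (req.input_tokens * INPUT_PRICE_PER_M
            + req.output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Assumed token counts for the two support-chatbot requests.
order_status = Request("where is my order?", input_tokens=200, output_tokens=50)
contract_review = Request("analyze this contract", input_tokens=15_000, output_tokens=1_200)

ratio = request_cost(contract_review) / request_cost(order_status)
print(f"{order_status.label}: ${request_cost(order_status):.5f}")
print(f"{contract_review.label}: ${request_cost(contract_review):.5f}")
print(f"cost ratio: {ratio:.0f}x")
```

Both calls hit the same endpoint and count as one request each in a classical dashboard, yet under these assumptions one costs dozens of times more than the other.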

2. Assumption 📈 “Scaling is Linear”

Classical workloads trend toward linear scaling: double the users, roughly double the cost. AI costs are non-linear in at least two dimensions.

First, token pricing compounds with context. Each new message in a conversation does not just add its own tokens; it forces the model to process the entire conversation history again. By the tenth message of a conversation, each turn costs significantly more than the first one did.

Second, model selection creates discrete cost jumps. There is no smooth curve between a small model and a large one. Switching from a mid-tier model to a frontier model may improve quality by 20% while increasing cost by 400%. These are not linear trade-offs.

Workload Type	Cost Growth Pattern
Classical API	Linear with request volume
AI — short, uniform prompts	Near-linear with request volume
AI — variable prompt length	Non-linear, dominated by token distribution
AI — long conversations	Super-linear — each turn processes full history
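The long-conversation row can be illustrated with a short simulation. This is a sketch under simplifying assumptions: every user message and every model reply has the same length, and the full history is resent on each turn with no truncation, summarization, or caching. The function name and parameters are hypothetical.

```python
def input_tokens_per_turn(turns: int, tokens_per_message: int,
                          system_prompt_tokens: int = 0) -> list[int]:
    """Input tokens billed on each turn when the full history is resent."""
    per_turn = []
    history = 0
    for _ in range(turns):
        # This turn's input = system prompt + prior history + the new user message.
        per_turn.append(system_prompt_tokens + history + tokens_per_message)
        # Afterwards the history grows by the user message and the model's reply.
        history += 2 * tokens_per_message
    return per_turn


per_turn = input_tokens_per_turn(turns=10, tokens_per_message=200)
print("first turn:", per_turn[0])
print("tenth turn:", per_turn[-1])
print("total input tokens:", sum(per_turn))
```

With these numbers the tenth turn sends 19 times as many input tokens as the first, even though the user typed a message of the same length each time.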

3. Assumption ⚡ “Auto-Scaling Works”

Scale down at night, scale up under load: this is well-established infrastructure practice. Applied naively to AI workloads, it gets expensive.

Large language models have cold-start costs. Loading a model onto a GPU takes time — sometimes tens of seconds. If auto-scaling aggressively terminates idle GPU instances and then receives a burst of requests, the cold-start penalty can exceed the savings from scaling down. Platforms that don’t account for this end up paying more than they would have by keeping instances warm.

✏️ Key Rule: Apply auto-scaling to stateless CPU-bound services around AI systems. Treat the model-serving layer as a semi-persistent resource with warm and cold state — not as a stateless compute unit.
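One way to reason about the warm-versus-cold decision is a simple break-even check. The sketch below is an assumption-laden model, not a scheduler: the GPU rate, the cold-start time, and the per-request delay cost are all hypothetical inputs you would have to measure or estimate yourself.

```python
def scale_down_saves_money(
    expected_idle_s: float,
    cold_start_s: float,
    gpu_cost_per_s: float,
    delayed_requests: int = 0,
    cost_per_delayed_request: float = 0.0,
) -> bool:
    """Return True if terminating an idle GPU instance is expected to be cheaper.

    Savings: GPU time not paid for while the instance is gone.
    Penalty: GPU time billed while the model reloads (no useful work is done),
    plus an assumed business cost for each request that waits on the reload.
    """
    savings = expected_idle_s * gpu_cost_per_s
    penalty = (cold_start_s * gpu_cost_per_s
               + delayed_requests * cost_per_delayed_request)
    return savings > penalty


GPU_COST_PER_S = 4.00 / 3600  # assumed $4/hour GPU instance

short_gap = scale_down_saves_money(
    expected_idle_s=30, cold_start_s=45, gpu_cost_per_s=GPU_COST_PER_S)
overnight = scale_down_saves_money(
    expected_idle_s=6 * 3600, cold_start_s=45, gpu_cost_per_s=GPU_COST_PER_S)
print("scale down for a 30 s gap:", short_gap)
print("scale down overnight:", overnight)
```

The point of the sketch is the shape of the comparison: a short idle gap rarely pays for a reload, and a burst of delayed requests can erase the savings from even a long one.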

2. What Actually Drives AI Costs 💰

Three variables dominate. Understanding them replaces the broken intuitions above.

4. Driver 🎯 Tokens, Not Requests

The atomic unit of cost is the token, roughly 0.75 English words. Input tokens (what goes in) and output tokens (what comes out) are priced separately, with output typically costing more. Everything that enters the context window (system prompts, conversation history, injected documents, tool outputs) adds to the input token count.

💡 Insight: A bloated system prompt that runs 2,000 tokens gets paid for on every single request. At scale, prompt fat is not a nuisance — it is a recurring operational cost.
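The recurring nature of that cost is easy to put a number on. The figures below are assumptions for illustration: a hypothetical input price, 100,000 requests per day, and flat per-token pricing with no provider-side discount for repeated prompt prefixes.

```python
def monthly_prompt_cost(prompt_tokens: int, requests_per_day: int,
                        price_per_m_input: float, days: int = 30) -> float:
    """Recurring cost of a system prompt that is resent on every request."""
    total_tokens = prompt_tokens * requests_per_day * days
    return total_tokens * price_per_m_input / 1_000_000


PRICE = 3.00  # hypothetical USD per million input tokens

bloated = monthly_prompt_cost(2_000, 100_000, PRICE)
trimmed = monthly_prompt_cost(1_200, 100_000, PRICE)
print(f"2,000-token prompt: ${bloated:,.0f}/month")
print(f"1,200-token prompt: ${trimmed:,.0f}/month")
print(f"saved by trimming:  ${bloated - trimmed:,.0f}/month")
```

Under these assumptions, trimming 800 tokens from the system prompt changes nothing about the product and still removes a five-figure line from the yearly bill.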

5. Driver 📐 Context Window Quadratics

Transformer-based models have attention mechanisms that scale quadratically with input length. Providers abstract this away in per-token pricing, but the computational reality shows up in latency and throughput limits at scale.

More practically: every token added to the context is processed again on every subsequent turn. A conversation that grows to 100,000 tokens is not 10× more expensive than one that stops at 10,000; the growth is steeper, and it compounds turn by turn.
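A back-of-envelope derivation makes the steeper-than-10× claim concrete. It is a simplification that assumes every turn adds the same number of tokens $m$ to the context, that the full history is resent as input on every turn, and that nothing is truncated, summarized, or cached. The symbols $m$ (tokens added per turn), $n$ (number of turns), $T$ (final context length), and $C$ (cumulative input tokens processed) are introduced here for illustration.

```latex
\begin{align*}
\text{input on turn } k &= k\,m \\
C(n) &= \sum_{k=1}^{n} k\,m = \frac{m\,n(n+1)}{2} \\
T = n\,m \;\Rightarrow\; C &= \frac{T^{2}}{2m} + \frac{T}{2}
\end{align*}
```

With $m = 1{,}000$, a conversation that reaches 10,000 tokens has processed 55,000 input tokens in total, while one that reaches 100,000 has processed 5,050,000: roughly 92× the cumulative input for 10× the final length, under these assumptions.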

6. Driver 🔢 Model Tier Selection

Not every task requires a frontier model. This is perhaps the most immediately actionable insight — and the most commonly ignored.

Task Type	Appropriate Tier	Relative Cost
Intent classification, routing	Small / fast model	1×
RAG retrieval, summarization	Mid-tier model	5–15×
Complex reasoning, generation	Frontier model	50–200×

Sending classification tasks to a frontier model because it is “already integrated” is the AI equivalent of running a static website on a 64-core server. The capability is there, but the cost-to-task ratio is not defensible.
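A tier router can be as small as a lookup table. The sketch below uses hypothetical task labels and relative costs picked from within the ranges in the table above; it compares routed spend against sending everything to the frontier tier.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


# Illustrative relative costs chosen within the ranges in the table above.
RELATIVE_COST = {Tier.SMALL: 1, Tier.MID: 10, Tier.FRONTIER: 100}

# Hypothetical task labels; a real system would define its own taxonomy.
TASK_TIER = {
    "intent_classification": Tier.SMALL,
    "routing": Tier.SMALL,
    "rag_answer": Tier.MID,
    "summarization": Tier.MID,
    "complex_reasoning": Tier.FRONTIER,
}


def pick_tier(task_type: str) -> Tier:
    # Unknown tasks fall back to the mid tier in this sketch, so the most
    # expensive model is never the silent default.
    return TASK_TIER.get(task_type, Tier.MID)


def relative_spend(task_counts: dict, force_tier: Optional[Tier] = None) -> int:
    """Total relative cost of a workload, routed per task or forced to one tier."""
    return sum(
        count * RELATIVE_COST[force_tier or pick_tier(task)]
        for task, count in task_counts.items()
    )


workload = {"intent_classification": 900, "summarization": 80, "complex_reasoning": 20}
routed = relative_spend(workload)
all_frontier = relative_spend(workload, force_tier=Tier.FRONTIER)
print(f"routed: {routed}, all-frontier: {all_frontier}")
```

The fallback choice is a design decision worth making explicitly: defaulting unknown tasks to the frontier tier is how "already integrated" turns into the standing bill.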

3. Keeping an Overview 🔍

💡 Key Message: The metrics that tell you whether a classical system is healthy — latency, error rate, throughput — are necessary but not sufficient for AI systems. You need a second instrument cluster.

7. Metric 📊 Tokens-per-Conversation

Track median and P95 token consumption per user session, broken down by feature area. This surfaces which parts of a product are consuming disproportionate budget — usually before the monthly invoice makes it obvious.

A P95 that is 10× the median means a small percentage of users or misuse scenarios are driving a large fraction of costs. That is an architectural signal, not just a finance signal.
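Computing these two numbers needs nothing beyond the Python standard library. The sketch assumes session totals are already logged as `(feature_area, total_tokens)` pairs; that record shape and the sample data are hypothetical.

```python
import statistics
from collections import defaultdict


def token_stats_by_feature(sessions):
    """sessions: iterable of (feature_area, total_tokens) pairs, one per session."""
    by_feature = defaultdict(list)
    for feature, tokens in sessions:
        by_feature[feature].append(tokens)

    stats = {}
    for feature, values in by_feature.items():
        median = statistics.median(values)
        if len(values) >= 2:
            # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
            p95 = statistics.quantiles(values, n=20, method="inclusive")[-1]
        else:
            p95 = values[0]
        stats[feature] = {"median": median, "p95": p95, "p95_to_median": p95 / median}
    return stats


# Made-up data: 19 ordinary chat sessions and two very long ones.
sessions = [("chat", 1_000)] * 19 + [("chat", 12_000), ("chat", 15_000)]
stats = token_stats_by_feature(sessions)
print(stats["chat"])
```

Tracking the `p95_to_median` ratio per feature area, rather than a single global average, is what makes the heavy tail visible.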

8. Metric 🔬 Cost-per-Task

Define what constitutes a task for your system — a support ticket resolved, a document analyzed, a code review completed. Then compute average cost per completed task and track it over time.

This metric is what allows a meaningful conversation about AI value without resorting to abstractions. It also makes regressions visible: if a prompt change silently doubles average token consumption, cost-per-task catches it before the next billing cycle.
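A regression check on this metric can be a few lines. The sketch below assumes weekly totals for spend and completed tasks are available; the 20% tolerance and the sample numbers are arbitrary illustrations.

```python
def cost_per_task(total_cost_usd: float, completed_tasks: int) -> float:
    """Average model spend per completed task (ticket, document, review...)."""
    if completed_tasks <= 0:
        raise ValueError("need at least one completed task")
    return total_cost_usd / completed_tasks


def is_regression(baseline: float, current: float, tolerance: float = 0.20) -> bool:
    """Flag the current period if cost-per-task rose more than `tolerance`."""
    return current > baseline * (1 + tolerance)


# Made-up weekly figures: spend in USD and resolved support tickets.
last_week = cost_per_task(total_cost_usd=420.0, completed_tasks=1_200)
this_week = cost_per_task(total_cost_usd=790.0, completed_tasks=1_150)

print(f"last week: ${last_week:.3f}/ticket, this week: ${this_week:.3f}/ticket")
print("regression:", is_regression(last_week, this_week))
```

Request volume barely moved between the two weeks in this example, which is exactly why a request-based dashboard would miss the change.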

4. The Levers That Work 🛠️

9. Lever ✂️ Prompt Discipline

System prompts and context injection are operational costs, not one-time engineering decisions. Every token in a system prompt is multiplied across every request that uses it. Audit them regularly. Remove examples, chain-of-thought scaffolding, and instructions that no longer improve output quality: they pay rent on every call.

🎯 Core Function: Treat your system prompt like a compiled binary: it should contain exactly what is needed, nothing more. The savings compound with request volume.

10. Lever 🧠 Semantic Caching

Classical caches return identical responses for identical inputs. Semantic caching returns cached responses for semantically equivalent inputs — questions that mean the same thing even if worded differently.

Implementations store responses alongside their embedding representations. When a new query arrives, it is compared against cached embeddings. If similarity exceeds a threshold, the cached response is returned without calling the model at all.

Hit rates of 30–50% are achievable in domains with repetitive queries — support, documentation search, FAQ handling. At those hit rates, the cost reduction is substantial and the latency improvement is a bonus.
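The lookup flow described above can be sketched in a few dozen lines. Note the loud assumption: `toy_embed` is a bag-of-words stand-in so the example runs without dependencies, and it only matches overlapping words. A real semantic cache would swap in an actual embedding model and usually a vector index; the class name, the 0.8 threshold, and `call_model` are all hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        """Return the best cached response above the threshold, else None."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:  # linear scan; fine for a sketch
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache, call_model):
    """Return (response, cache_hit). The model is called only on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_model(query)
    cache.put(query, response)
    return response, False
```

The threshold is the knob that matters: set too low, users get answers to questions they did not ask; set too high, the hit rate collapses. It is worth tuning against labeled query pairs from your own domain.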

11. Lever 📦 Async Batching

Real-time inference is expensive because it requires immediate GPU availability. Many AI tasks — document processing, batch summarization, offline analysis — do not require real-time responses.

Routing these tasks to batch inference endpoints (available on Azure OpenAI, AWS Bedrock, and Google Vertex AI) typically reduces cost by 40–60%, at the expense of turnaround measured in minutes or hours rather than seconds. For non-interactive workflows, this trade-off is worth making unconditionally.

📋 Note: Batch endpoints are underused because they require architectural discipline — separating interactive from async workloads. Most teams default to real-time because it is simpler. The cost difference is what makes the separation worth engineering.
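The separation itself can start as a single flag on each job. The sketch below is provider-agnostic and does not call any real batch API: `Job`, `Dispatcher`, and the 50% discount (a value inside the 40–60% range quoted above) are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    interactive: bool    # is a user waiting on the answer?
    est_cost_usd: float  # estimated cost at real-time prices


@dataclass
class Dispatcher:
    batch_discount: float = 0.5  # assumed discount for batch endpoints
    realtime: list = field(default_factory=list)
    batch: list = field(default_factory=list)

    def submit(self, job: Job) -> str:
        """Queue the job on the real-time or batch path and report which."""
        if job.interactive:
            self.realtime.append(job)
            return "realtime"
        self.batch.append(job)
        return "batch"

    def projected_cost(self) -> float:
        realtime_cost = sum(j.est_cost_usd for j in self.realtime)
        batch_cost = sum(j.est_cost_usd for j in self.batch) * (1 - self.batch_discount)
        return realtime_cost + batch_cost


dispatcher = Dispatcher()
jobs = [
    Job("chat reply", interactive=True, est_cost_usd=0.02),
    Job("nightly ticket summaries", interactive=False, est_cost_usd=40.0),
    Job("offline contract analysis", interactive=False, est_cost_usd=10.0),
]
routes = [dispatcher.submit(job) for job in jobs]
print(routes)
print(f"projected cost: ${dispatcher.projected_cost():.2f}")
```

Making `interactive` a required field forces every new feature to declare which path it belongs on, which is most of the architectural discipline the note refers to.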

5. The Shift Required 🚀

💡 Key Message: Running AI in production does not just require new infrastructure — it requires a new mental model for what drives cost, what constitutes waste, and which levers are actually available.

The engineers who navigate this well are not the ones who know the billing dashboards best. They are the ones who have internalized token economics the way they once internalized requests-per-second and memory footprint.

The intuitions transfer, but only partly. Request count, linear scaling, and auto-scaling are still part of the picture — but they are no longer sufficient to describe it. The cost curve bent. The thinking has to bend with it.

Concretely, this means instrumenting token consumption before you need to, choosing model tiers deliberately rather than by default, and treating the context window as a scarce resource with a real price per token.

The bill will teach you eventually. Building the mental model first is cheaper.

Tags: ai · architecture · cost · production