
Token Limits Explained — Why Your AI App Forgets and How to Fix It

A visual guide to LLM token limits: tokenization, context windows, budget allocation, and strategies for managing conversation memory in production AI apps.

“Why did the AI forget what I said 10 messages ago?” Because you hit the token limit. Every LLM has a fixed context window — a maximum number of tokens it can process at once. When your conversation exceeds this, older messages get dropped. The model literally can’t see them anymore.

Understanding tokens isn’t just an API detail. It’s the fundamental constraint that shapes how you architect AI applications.

1. Tokens Are Not Words

First misconception: “128K tokens means 128K words.” No. Tokens are subword chunks created by the tokenizer. Common words are one token. Uncommon words get split into pieces. Code and special characters are particularly token-hungry.

What Tokens Actually Are

Tokens aren't words. They're chunks the model's tokenizer creates. One word can be multiple tokens.

hello → 1 token
Kubernetes → 3 tokens
indistinguishable → 5 tokens
{"key": "value"} → 6 tokens
Rule of thumb: 1 token ≈ 4 characters in English. 100 tokens ≈ 75 words. Code and non-English text use MORE tokens per word.

This matters because it’s easy to blow your budget with code-heavy prompts. That JSON payload you’re stuffing into context? It could be 3-5x more tokens than the equivalent English description. Structured data is expensive.
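
If you want exact counts rather than the 4-characters heuristic, you can run a tokenizer yourself. Here is a minimal sketch using OpenAI's tiktoken library; cl100k_base is the encoding used by GPT-4-era models, and exact counts will vary by model and tokenizer, so they may differ slightly from the figures above.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo era models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello", "Kubernetes", "indistinguishable", '{"key": "value"}']:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {tokens}")
```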

2. Context Windows — The Hard Ceiling

The context window is everything the model can see at once. System prompt, entire conversation history, retrieved documents, the user’s current message, AND the space reserved for the response — all must fit within this window.

Context Window Sizes — 2020 to 2025

GPT-3 (2020): 4K tokens
GPT-4 (2023): 8K / 32K tokens
Claude 3 (2024): 200K tokens
Gemini 1.5 (2024): 1M tokens
Gemini 2.0 (2025): 2M tokens
What fits in 200K tokens?
~500 pages of text, ~150K lines of code, an entire codebase, or 3 full novels.

Bigger isn’t always better. A 2M token window means you CAN stuff more in — but the model’s attention degrades in the middle of long contexts (“lost in the middle” problem). Just because you have 200K tokens available doesn’t mean you should use all of them. Focused, relevant context beats comprehensive context.

3. Budget Your Tokens

Think of the context window as a fixed budget. Every component of your prompt competes for space. If your system prompt is 2000 tokens and your RAG context is 50K tokens, you’ve already used most of a 128K window before the user says anything.

Token Budget Allocation

Your context window has a fixed size. Everything — system prompt, history, retrieval docs, user query, AND the response — must fit inside it.

System Prompt: 10%
Conversation History: 25%
Retrieved Context (RAG): 35%
User Query: 10%
Response (max_tokens): 20%
system + history + retrieval + query + max_tokens ≤ context_window
If you exceed the limit: the API either truncates (cuts off older messages) or returns an error. Neither is good. Budget proactively.
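
One way to budget proactively is a pre-flight check before every request. The sketch below counts tokens with tiktoken as above and trims the oldest history first; the 128K window, the 4K response reserve, and the decision to fold retrieval into the user message are illustrative assumptions, not the only way to do it.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 128_000   # illustrative; use your model's actual limit
MAX_TOKENS = 4_000         # space reserved for the response

def count_tokens(messages):
    # Rough count: tokens in each message's content, ignoring per-message overhead
    return sum(len(enc.encode(m["content"])) for m in messages)

def fit_to_budget(system_prompt, history, retrieval, user_query):
    """Drop the oldest history messages until everything fits the window."""
    fixed = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": retrieval + "\n\n" + user_query},
    ]
    budget = CONTEXT_WINDOW - MAX_TOKENS - count_tokens(fixed)
    trimmed = list(history)
    while trimmed and count_tokens(trimmed) > budget:
        trimmed.pop(0)  # oldest message goes first
    return [fixed[0], *trimmed, fixed[1]]
```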

The mistake I see in production: teams build elaborate system prompts (3000+ tokens) with instructions the model doesn’t follow anyway. Then they wonder why their RAG context gets truncated. Shorter system prompts leave more room for the information that actually helps the model answer correctly.

4. When You Hit the Limit

Every long-running AI application eventually hits the token limit. The conversation grows, context accumulates, and suddenly you’re over budget. How you handle this determines whether your app gracefully degrades or silently loses important context.

Strategies When You Hit the Limit

1. Sliding Window: Keep the last N messages, drop the oldest. Simple but loses early context. Works for chatbots where recent context matters most. (Simple; see the sketch after this list.)
2. Summarize + Compress: Periodically summarize older messages into a compact summary, then keep the summary plus recent messages. Roughly 80% context retention at 20% of the token cost. (Balanced)
3. RAG Instead of Stuffing: Don't put everything in the context. Retrieve only the relevant chunks per query; 10 relevant chunks beat 100 pages of everything. (Best for knowledge)
4. Prompt Caching: Cache the system prompt and static context, and only send the dynamic parts fresh. Saves tokens AND reduces latency by 80%+ on cached portions. (Cost efficient)
5. Multi-turn with Memory: Store facts in external memory (a database) and inject only the relevant memories per turn. Like human working memory: selective recall, not total recall. (Scalable)
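
Strategy 1 is the easiest to get right. A minimal sliding-window sketch follows; the keep_last=20 cutoff is an arbitrary assumption, and in practice you would usually trim by token count, as in the budget example above.

```python
def sliding_window(messages, keep_last=20):
    """Keep the system prompt plus the most recent messages; drop the rest."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```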

The approach I recommend for most production apps: summarize + RAG hybrid. Periodically compress old conversation into a summary paragraph (saves 90% of tokens while retaining key facts). For knowledge-heavy queries, retrieve relevant chunks instead of stuffing everything into context. This scales to infinite conversations.
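
Here is a rough sketch of the summarize half of that hybrid, assuming an OpenAI-style chat client. The threshold, the keep_recent count, the summarization prompt, and the choice of gpt-4o-mini are illustrative assumptions, not fixed values.

```python
from openai import OpenAI

client = OpenAI()

def compress_history(messages, summarize_threshold=30, keep_recent=10):
    """When history grows past a threshold, fold older messages into one summary message."""
    if len(messages) <= summarize_threshold:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.chat.completions.create(
        model="gpt-4o-mini",  # a cheap model is usually fine for summarization
        messages=[{
            "role": "user",
            "content": "Summarize the key facts, decisions, and open questions "
                       "from this conversation in under 200 words:\n\n" + transcript,
        }],
        max_tokens=300,
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent
```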

5. The Cost Dimension

Tokens aren’t just a size constraint — they’re a cost multiplier. Every token processed costs money. At scale (millions of conversations/month), token efficiency directly impacts your margin. Output tokens cost 3-5x more than input tokens.

Token Pricing — The Math That Matters

Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost per 1K conversations
GPT-4o | $2.50 | $10.00 | ~$8
GPT-4o-mini | $0.15 | $0.60 | ~$0.50
Claude 3.5 Sonnet | $3.00 | $15.00 | ~$12
Claude 3.5 Haiku | $0.25 | $1.25 | ~$1
Gemini 1.5 Flash | $0.075 | $0.30 | ~$0.25
Output tokens cost 3-5x more than input tokens. This means verbose responses are expensive. Set max_tokens appropriately. A concise 200-token answer costs 5x less than a rambling 1000-token answer.
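
To make the "cost per 1K conversations" column concrete, here is the back-of-envelope math. The per-conversation sizes (roughly 2,000 input and 300 output tokens) are an assumed profile that approximately reproduces the figures in the table above; your own traffic will differ.

```python
def cost_per_1k_conversations(input_price, output_price,
                              input_tokens=2_000, output_tokens=300):
    """Prices in $ per 1M tokens; per-conversation token counts are assumed."""
    per_conv = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return round(per_conv * 1_000, 2)

print(cost_per_1k_conversations(2.50, 10.00))   # GPT-4o       -> 8.0
print(cost_per_1k_conversations(0.15, 0.60))    # GPT-4o-mini  -> 0.48
print(cost_per_1k_conversations(0.075, 0.30))   # Gemini Flash -> 0.24
```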

The optimization that saves the most money: model routing. Use expensive models (GPT-4, Claude Sonnet) only for complex queries. Route simple questions to cheap models (GPT-4o-mini, Haiku, Flash). A good router cuts token costs by 70% with minimal quality loss. Most questions don’t need the strongest model.
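
A deliberately naive router sketch shows the shape of the idea. Real routers typically use a trained classifier or a cheap LLM call rather than the keyword-and-length heuristic assumed here, and the model names are simply the ones from the table above.

```python
def pick_model(query: str, has_code: bool = False) -> str:
    """Route simple queries to a cheap model, complex ones to a strong model."""
    complex_signals = has_code or len(query) > 500 or any(
        kw in query.lower() for kw in ("analyze", "debug", "architecture", "prove")
    )
    return "gpt-4o" if complex_signals else "gpt-4o-mini"
```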