Tokens, Context & Cost
The currency of AI — understanding tokens is understanding costs.
What Are Tokens?
Everything AI processes is broken into tokens:
- 1 token ≈ ¾ of a word (or ~4 characters)
- "Hello, how are you?" = 6 tokens
- Code typically uses more tokens than prose of similar length — symbols, whitespace, and identifiers tokenize poorly
- Different languages have different token efficiency
Examples:
"AI" → 1 token
"Artificial" → 1 token
"Intelligence" → 1 token
"künstliche" → 2 tokens (German is less efficient)
"人工智能" → 2 tokens (Chinese)
Why it matters: You pay per token. Shorter prompts = cheaper. But too short = worse results.
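The ¾-word rule above is enough for back-of-the-envelope budgeting. A minimal sketch of a token estimator, assuming the ~4-characters-per-token heuristic from this section (real tokenizers give exact, model-specific counts):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic.

    Real BPE tokenizers give exact counts; this is only for quick
    sizing of prompts and budgets before you send anything.
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, how are you?"))  # 19 chars → ~5 tokens
```

For production use, run the model vendor's actual tokenizer instead — heuristics drift badly on code and non-English text.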
Context Window
The "memory" of a conversation — how much text the model can see at once:
| Model | Context Window | ≈ Pages |
|---|---|---|
| GPT-3.5 (2023) | 4,000 tokens | ~6 pages |
| GPT-4 Turbo (2023) | 128,000 tokens | ~200 pages |
| Claude 3.5 (2024) | 200,000 tokens | ~300 pages |
| Gemini 1.5 (2024) | 1,000,000 tokens | ~1,500 pages |
| Claude Opus 4.6 (2026) | 200,000 tokens | ~300 pages |
Critical: When the context is full, older messages are dropped or compressed. The AI literally "forgets" the beginning of your conversation.
Context Compression
What happens when you hit the limit?
Strategy 1: Truncation
- Drop oldest messages, keep recent ones
- Simple but loses important context
Strategy 2: Summarization
- Summarize old messages into a shorter version
- Preserves key points but lossy
Strategy 3: RAG (Retrieval Augmented Generation)
- Store context in a vector database
- Retrieve only relevant parts when needed
- Most sophisticated, best for large knowledge bases
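Strategy 1 (truncation) is simple enough to sketch. Assuming messages are role/content dicts and token counts come from a rough length heuristic (a placeholder, not a real tokenizer), a minimal version that protects the system prompt:

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda m: len(m["content"]) // 4):
    """Keep the most recent messages that fit within the token budget.

    Drops from the oldest end first; the system prompt (index 0) is
    always preserved so the model never loses its instructions.
    """
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):  # walk newest → oldest
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Note the trade-off the section describes: this is cheap and deterministic, but anything older than the cutoff is gone — summarization or RAG is needed if early context must survive.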
Pro tip: Models attend most strongly to the beginning and end of the context (the "lost in the middle" effect). Put your most important information there — never buried in the middle of a long prompt.
The Cost Equation
Every AI request has a cost:
Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Example — asking GPT-5 to review 100 lines of code (illustrative prices):
- Input: ~2,000 tokens × $10/1M = $0.02
- Output: ~500 tokens × $40/1M = $0.02
- Total: $0.04 per request
At scale:
- 100 developers × 50 requests/day = 5,000 requests
- 5,000 × $0.04 = $200/day = $6,000/month
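The cost equation and the worked example above can be verified in a few lines. The $10/$40-per-million prices are the illustrative figures from the example, not quoted vendor pricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost = (input tokens × input price) + (output tokens × output price),
    with prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

per_request = request_cost(2_000, 500, 10, 40)
print(per_request)             # 0.04
print(100 * 50 * per_request)  # 200.0 per day across the team
```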
This is why model routing matters — not every task needs GPT-5. A simple question can use a $0.001 model.
Cost Optimization Strategies
How to cut AI costs by 60-80% without losing quality:
- Model routing — use cheap models for simple tasks, expensive ones for complex tasks
- Prompt optimization — shorter prompts = fewer input tokens
- Caching — identical prompts get cached responses
- Batch processing — group similar requests for volume discounts
- Self-hosting — run open-source models for high-volume workloads
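Strategy #1, model routing, can be sketched as a rule-based dispatcher. The model names, prices implied, and the length/keyword heuristic below are all placeholder assumptions — production routers classify each request with a small model rather than string matching:

```python
# Hypothetical model tiers — names are illustrative, not real model IDs.
CHEAP, MID, EXPENSIVE = "small-model", "mid-model", "large-model"

def route(prompt: str) -> str:
    """Pick a model tier from crude signals in the request.

    Sketch only: uses prompt length and task keywords as a stand-in
    for a real complexity classifier.
    """
    complex_markers = ("refactor", "prove", "architecture", "debug")
    if any(word in prompt.lower() for word in complex_markers):
        return EXPENSIVE
    if len(prompt) > 500:
        return MID
    return CHEAP

print(route("What does HTTP 404 mean?"))      # small-model
print(route("Debug this race condition..."))  # large-model
```

Even this crude split captures the economics: if most traffic is simple questions, only the minority of hard requests pays frontier-model prices.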
Model Prism by Ohara Systems automates strategy #1 — it classifies each request and routes it to the optimal model automatically.
---quiz question: Approximately how many tokens is one English word? options:
- { text: "Exactly 1 token per word", correct: false }
- { text: "About ¾ of a word per token (1 token ≈ 4 characters)", correct: true }
- { text: "1 token = 1 sentence", correct: false } feedback: One token is approximately ¾ of a word, or about 4 characters. This means "artificial intelligence" is 2 tokens, not 1.
---quiz question: What happens when a conversation exceeds the context window? options:
- { text: "The model crashes with an error", correct: false }
- { text: "Older messages are dropped or compressed — the model 'forgets'", correct: true }
- { text: "The model automatically upgrades to a larger window", correct: false } feedback: When the context window is full, older messages are truncated or summarized. The model literally loses access to earlier parts of the conversation.