Tokens, Context & Cost
The currency of AI — understanding tokens is understanding costs.
What Are Tokens?
Everything AI processes is broken into tokens:
- 1 token ≈ ¾ of a word (or ~4 characters)
- "Hello, how are you?" = 6 tokens
- Code typically uses more tokens than prose of similar length — symbols, whitespace, and identifiers tokenize poorly
- Different languages have different token efficiency
Examples:
"AI" → 1 token
"Artificial" → 1 token
"Intelligence" → 1 token
"künstliche" → 2 tokens (German is less efficient)
"人工智能" → 2 tokens (Chinese)
Why it matters: You pay per token. Shorter prompts = cheaper. But too short = worse results.
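The ¾-word rule above is enough for back-of-the-envelope budgeting. A minimal sketch of a token estimator, assuming the ~4-characters-per-token heuristic from this section (real tokenizers give exact, model-specific counts):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic.

    Real BPE tokenizers give exact counts; this is only for quick
    sizing of prompts and budgets before you send anything.
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("Hello, how are you?"))  # 19 chars → ~5 tokens
```

For production use, run the model vendor's actual tokenizer instead — heuristics drift badly on code and non-English text.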
Context Window
The "memory" of a conversation — how much text the model can see at once:
| Model | Context Window | ≈ Pages |
|---|---|---|
| GPT-3.5 (2023) | 4,000 tokens | ~6 pages |
| GPT-4 Turbo (2023) | 128,000 tokens | ~200 pages |
| Claude 3.5 (2024) | 200,000 tokens | ~300 pages |
| Gemini 1.5 (2024) | 1,000,000 tokens | ~1,500 pages |
| Claude Opus 4.6 (2026) | 200,000 tokens | ~300 pages |
Critical: When the context is full, older messages are dropped or compressed. The AI literally "forgets" the beginning of your conversation.
Context Compression
What happens when you hit the limit?
Strategy 1: Truncation
- Drop oldest messages, keep recent ones
- Simple but loses important context
Strategy 2: Summarization
- Summarize old messages into a shorter version
- Preserves key points but lossy
Strategy 3: RAG (Retrieval Augmented Generation)
- Store context in a vector database
- Retrieve only relevant parts when needed
- Most sophisticated, best for large knowledge bases
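Strategy 1 (truncation) is simple enough to sketch. Assuming messages are role/content dicts and token counts come from a rough length heuristic (a placeholder, not a real tokenizer), a minimal version that protects the system prompt:

```python
def truncate_history(messages, max_tokens,
                     count_tokens=lambda m: len(m["content"]) // 4):
    """Keep the most recent messages that fit within the token budget.

    Drops from the oldest end first; the system prompt (index 0) is
    always preserved so the model never loses its instructions.
    """
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):  # walk newest → oldest
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Note the trade-off the section describes: this is cheap and deterministic, but anything older than the cutoff is gone — summarization or RAG is needed if early context must survive.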
Pro tip: Models attend most strongly to the beginning and end of the context (the "lost in the middle" effect). Put your most important information there — never buried in the middle of a long prompt.
The Cost Equation
Every AI request has a cost:
Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Example — asking GPT-5 to review 100 lines of code (illustrative prices):
- Input: ~2,000 tokens × $10/1M = $0.02
- Output: ~500 tokens × $40/1M = $0.02
- Total: $0.04 per request
At scale:
- 100 developers × 50 requests/day = 5,000 requests
- 5,000 × $0.04 = $200/day = $6,000/month
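The cost equation and the worked example above can be verified in a few lines. The $10/$40-per-million prices are the illustrative figures from the example, not quoted vendor pricing:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost = (input tokens × input price) + (output tokens × output price),
    with prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

per_request = request_cost(2_000, 500, 10, 40)
print(per_request)             # 0.04
print(100 * 50 * per_request)  # 200.0 per day across the team
```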
This is why model routing matters — not every task needs GPT-5. A simple question can use a $0.001 model.
Cost Optimization Strategies
How to cut AI costs by 60-80% without losing quality:
- Model routing — use cheap models for simple tasks, expensive ones for complex tasks
- Prompt optimization — shorter prompts = fewer input tokens
- Caching — identical prompts get cached responses
- Batch processing — group similar requests for volume discounts
- Self-hosting — run open-source models for high-volume workloads
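Strategy #1, model routing, can be sketched as a rule-based dispatcher. The model names, prices implied, and the length/keyword heuristic below are all placeholder assumptions — production routers classify each request with a small model rather than string matching:

```python
# Hypothetical model tiers — names are illustrative, not real model IDs.
CHEAP, MID, EXPENSIVE = "small-model", "mid-model", "large-model"

def route(prompt: str) -> str:
    """Pick a model tier from crude signals in the request.

    Sketch only: uses prompt length and task keywords as a stand-in
    for a real complexity classifier.
    """
    complex_markers = ("refactor", "prove", "architecture", "debug")
    if any(word in prompt.lower() for word in complex_markers):
        return EXPENSIVE
    if len(prompt) > 500:
        return MID
    return CHEAP

print(route("What does HTTP 404 mean?"))      # small-model
print(route("Debug this race condition..."))  # large-model
```

Even this crude split captures the economics: if most traffic is simple questions, only the minority of hard requests pays frontier-model prices.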
Model Prism by Ohara Systems automates strategy #1 — it classifies each request and routes it to the optimal model automatically.
---quiz question: Approximately how many tokens is one English word? options:
- { text: "Exactly 1 token per word", correct: false }
- { text: "About ¾ of a word per token (1 token ≈ 4 characters)", correct: true }
- { text: "1 token = 1 sentence", correct: false } feedback: One token is approximately ¾ of a word, or about 4 characters. This means "artificial intelligence" is 2 tokens, not 1.
---quiz question: What happens when a conversation exceeds the context window? options:
- { text: "The model crashes with an error", correct: false }
- { text: "Older messages are dropped or compressed — the model 'forgets'", correct: true }
- { text: "The model automatically upgrades to a larger window", correct: false } feedback: When the context window is full, older messages are truncated or summarized. The model literally loses access to earlier parts of the conversation.