How LLMs Actually Work
They don't "think" — they predict. Understanding this changes everything.
The Token Prediction Loop
An LLM does one thing, billions of times:
Given all previous tokens → predict the most likely next token.
Input: "The capital of France is"
Step 1: P("Paris") = 0.92, P("Lyon") = 0.03, P("a") = 0.01 ...
Step 2: Select "Paris"
Step 3: "The capital of France is Paris"
Step 4: Predict next token after "Paris" → "." (0.87)
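That four-step loop can be sketched in a few lines of Python. The probability table is hard-coded from the example above; a real model computes these probabilities from billions of weights:

```python
# Hard-coded next-token probabilities standing in for a real model.
# The numbers come from the example above and are illustrative only.
NEXT_TOKEN_PROBS = {
    "The capital of France is": {"Paris": 0.92, "Lyon": 0.03, "a": 0.01},
    "The capital of France is Paris": {".": 0.87, ",": 0.05},
}

def predict_next(context):
    """Greedy decoding: always pick the single most likely next token."""
    probs = NEXT_TOKEN_PROBS[context]
    return max(probs, key=probs.get)

context = "The capital of France is"
token = predict_next(context)        # "Paris"
context = f"{context} {token}"       # "The capital of France is Paris"
end = predict_next(context)          # "."
```

Real models usually sample from the distribution instead of always taking the top token, which is why the same prompt can produce different completions.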
This is similar to your phone's autocomplete — but with 175 billion+ parameters instead of a small dictionary.
Key insight: The AI doesn't "know" that Paris is the capital of France. It has learned that the token "Paris" very frequently follows "capital of France is" in its training data.
How Neural Networks Learn
Think of a spam filter — the simplest neural network:
Features → Weights → Decision

Email contains "FREE MONEY" → weight: +0.9
Email from known contact    → weight: -0.7
Email has attachment        → weight: +0.3

Spam score: 0.73 → SPAM
Training = adjusting weights until predictions match reality.
An LLM has billions of these weights, processing text through dozens of stacked layers. The principle is identical, just at a mind-boggling scale.
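The spam filter above can be sketched in Python. The weights match the table; the sigmoid squashing and the 0.5 decision threshold are common conventions assumed here, so the resulting score differs slightly from the illustrative 0.73:

```python
import math

# Feature weights from the table above (illustrative, not trained).
WEIGHTS = {
    "contains_free_money": +0.9,
    "from_known_contact":  -0.7,
    "has_attachment":      +0.3,
}

def sigmoid(x):
    """Squash any real number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def spam_score(features):
    """Weighted sum of the active features, squashed into a probability-like score."""
    total = sum(WEIGHTS[name] for name, present in features.items() if present)
    return sigmoid(total)

email = {"contains_free_money": True, "from_known_contact": False, "has_attachment": True}
score = spam_score(email)   # sigmoid(0.9 + 0.3) ≈ 0.77 → SPAM (score > 0.5)
```

Training would repeatedly nudge each weight up or down whenever a prediction disagrees with the true label; that adjustment loop is all "learning" means here.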
Tokenization & Embeddings
Before the AI can process text, it converts words to numbers:
Tokenization — splitting text into tokens:
"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
→ [15339, 11, 1268, 527, 499, 30]
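A toy greedy tokenizer shows the text → tokens → ids pipeline. Real tokenizers (e.g. byte-pair encoding) learn their vocabulary from data, and the ids in this sketch are invented, not the real ones shown above:

```python
# Hand-made vocabulary; the ids are invented for this sketch.
VOCAB = {"Hello": 0, ",": 1, " how": 2, " are": 3, " you": 4, "?": 5}

def tokenize(text):
    """Greedily match the longest vocabulary entry at the front of the text."""
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

tokens = tokenize("Hello, how are you?")   # ["Hello", ",", " how", " are", " you", "?"]
ids = [VOCAB[t] for t in tokens]           # [0, 1, 2, 3, 4, 5]
```

Note that the leading spaces are part of the tokens themselves, exactly as in the example above.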
Embeddings — mapping tokens to high-dimensional vectors:
"king" → [0.2, 0.8, -0.1, 0.5, ...] (768+ dimensions)
"queen" → [0.2, 0.8, -0.1, 0.9, ...] (similar but different!)
Famous result: king - man + woman ≈ queen
These vector relationships capture meaning — words used in similar contexts end up close together in vector space.
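The famous analogy can be reproduced with tiny made-up vectors. Real embeddings have hundreds of learned dimensions; these 4-d values are illustrative only, constructed so the arithmetic works out:

```python
import math

# Tiny made-up 4-d embeddings (illustrative values, not learned).
VECS = {
    "king":  [0.9, 0.8, 0.1, 0.7],
    "man":   [0.9, 0.1, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.1, 0.8, 0.9, 0.7],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(VECS["king"], VECS["man"], VECS["woman"])]

# Which known word points in the most similar direction?
best = max(VECS, key=lambda word: cosine(VECS[word], target))  # "queen"
```

With real learned embeddings the match is approximate rather than exact, but "queen" still comes out closest.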
The Attention Mechanism
The breakthrough that made modern AI possible, introduced in the 2017 paper "Attention Is All You Need":
Self-attention lets each token look at ALL other tokens to understand context:
"The bank by the river was steep"
→ "bank" attends to "river" → meaning: riverbank (not financial)

"I went to the bank to deposit money"
→ "bank" attends to "deposit", "money" → meaning: financial institution
Without attention, the model would treat "bank" the same in both sentences. With attention, it understands context — the key to language understanding.
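The contrast can be sketched with a toy version of attention: dot-product scores between made-up 2-d word vectors, softmaxed into weights. Real transformers use learned query/key/value projections and many attention heads, all of which this sketch deliberately omits:

```python
import math

# Made-up 2-d vectors: first dimension leans "nature", second leans
# "finance". Both the values and the dimensions are illustrative.
EMB = {
    "bank":    [0.6, 0.4],
    "river":   [0.9, 0.1],
    "deposit": [0.1, 0.9],
}

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(token, sentence):
    """Re-represent `token` as an attention-weighted mix of the sentence's vectors."""
    scores = [sum(q * k for q, k in zip(EMB[token], EMB[other])) for other in sentence]
    weights = softmax(scores)
    return [sum(w * EMB[t][i] for w, t in zip(weights, sentence))
            for i in range(len(EMB[token]))]

river_bank = contextualize("bank", ["bank", "river"])
money_bank = contextualize("bank", ["bank", "deposit"])
# Same word, two contexts, two different vectors: river_bank is pulled
# toward "river", money_bank toward "deposit".
```

That final point is the whole trick: after attention, "bank" no longer has one fixed vector, it has a context-dependent one.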
The Common Sense Problem
AI can score highly on IQ-style tests but fails at basic common sense:
The car wash problem:
"I took my car to the car wash. After washing, my car was clean."
AI: ✓ Understands perfectly.

"I took my cat to the car wash. After washing, my cat was..."
AI: "...clean." (correct token prediction)
Human: "...terrified, soaking wet, and trying to escape."
The AI predicts the statistically likely next word. It doesn't simulate reality. It has no world model — just pattern matching at incredible scale.
The Blackbox Problem
Why can't we debug AI decisions?
Traditional code: if (x > 5) return "big" — fully traceable.
LLM: 175,000,000,000 weights → somehow → "Paris" — untraceable.
You can't ask "WHY did you output this?" and get a meaningful answer. The model doesn't know why — it's a statistical computation, not a reasoning chain.
This is why prompt engineering matters: You can't fix the model, but you can improve the input to get better output.
---quiz question: What does an LLM actually do at each step? options:
- { text: "Searches the internet for answers", correct: false }
- { text: "Predicts the most likely next token based on all previous tokens", correct: true }
- { text: "Runs a database query against its training data", correct: false }
- { text: "Thinks about the question and formulates an answer", correct: false } feedback: LLMs predict the next token using probability. They don't search, query, or think — they compute the most statistically likely continuation.
---quiz question: What was the key innovation in the 2017 "Attention Is All You Need" paper? options:
- { text: "Bigger training datasets", correct: false }
- { text: "Self-attention — letting each token consider all other tokens for context", correct: true }
- { text: "Faster GPUs for training", correct: false } feedback: Self-attention allows the model to understand context by letting each token attend to every other token in the sequence. This is what makes transformers so powerful at language understanding.
---quiz question: Why is AI sometimes called a "blackbox"? options:
- { text: "Because the code is proprietary and closed-source", correct: false }
- { text: "Because billions of weights produce output that can't be traced to specific reasoning steps", correct: true }
- { text: "Because it only works in dark mode", correct: false } feedback: With 175B+ parameters, there's no way to trace WHY a specific output was produced. The computation is correct but unexplainable — unlike traditional deterministic code.