Scroll down to dive beneath the surface of AI, exploring the hidden
layers of large language models — from the applications you see to the
vast ocean of training data you don't.
Scroll to explore
The Big Picture
The AI Iceberg
Most people interact with AI through polished applications like ChatGPT,
Claude, or Gemini. But like an iceberg, what you see is only a tiny fraction of what's really
there. The vast majority — the training data, the model architecture, the
hidden processes — lies beneath the surface.
Applications: ChatGPT, Claude, Gemini & more (RLHF + system messages)
The LLM: tokenization, weighting, attention & transformers
The Dataset: ~13 trillion tokens of training data (Common Crawl, Wikipedia, GitHub, books, web pages)
The Ocean & The Sharks: the internet, source of the training data (bias, misinformation, and explicit content)
Layer 3: The Application
Picture a carefully sculpted snowman sitting on top of the iceberg.
This represents an application like ChatGPT, Claude, or Gemini —
built on top of a general LLM and fine-tuned specifically for
conversational tasks.
The chatbot layer includes system messages (rules
governing every interaction) and RLHF
(reinforcement learning from human feedback), where human reviewers
judge outputs as helpful, harmful, correct, or incorrect — shaping
how the model responds.
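As a rough sketch, the application layer wraps every exchange in a message list that starts with a hidden system message. The wording below is invented for illustration; real system prompts are proprietary and far longer.

```python
# Illustrative only: how a chatbot application frames a conversation for the LLM.
messages = [
    {"role": "system",                      # hidden rules governing every interaction
     "content": "You are a helpful assistant. Refuse harmful requests and "
                "admit uncertainty rather than guessing."},
    {"role": "user",                        # the text the person actually typed
     "content": "Explain tokenization in one sentence."},
]
# The application sends this list to the underlying model; the system message
# shapes the reply, and RLHF has already shaped the model's default behaviour.
```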
Layer 2: The LLM
The visible iceberg above the waterline is the Large Language Model
itself — the result of the training process fuelled by the vast
dataset beneath. This is where three key technical processes happen.
Tokenization breaks text into machine-readable
pieces. Weighting assigns probability values to
connections between tokens. And the
transformer architecture with its
attention mechanism — introduced in 2017 by Google
researchers — allows the model to consider relationships across
entire passages of text.
LLMs are often called "black boxes" because their internal
connections are so massively complex that no human or team of humans
could possibly unravel everything going on inside.
Layer 1: The Dataset
The bulk of the iceberg — hidden underwater — represents the vast
dataset on which the LLM is trained. GPT-4 was reportedly trained on roughly
13 trillion tokens — about 1,600 times the
population of Earth.
Known data sources include Common Crawl, The Pile, Wikipedia,
GitHub, and social media. Much of the data remains proprietary.
This layer carries plenty of side effects, including the kind of
bias and discrimination central to AI ethics discussions.
The Ocean & The Sharks
The ocean surrounding the iceberg represents the internet at large —
the vast environment from which the dataset is sourced.
And the sharks? They symbolize threats lurking in the data:
misinformation, bias,
toxic content, and data privacy
issues that can influence the dataset and, subsequently,
the behaviour of the LLM.
“A language model is a stochastic parrot — it
haphazardly stitches together sequences of linguistic forms
from its training data, without any reference to meaning.”
— Emily Bender et al., “On the Dangers of Stochastic Parrots” (2021)
Beneath the Surface
How It Actually Works
Let's open the black box. The following sections explore the key
processes that make large language models tick — from breaking text
into tokens to predicting the next word.
The Core Insight
The World's Most Sophisticated Autocomplete
Here's the single most important thing to understand about how LLMs work:
they build text one word at a time, each time predicting
the most likely next token based on everything that came before.
Think of the autocomplete on your phone. You type "I'm on my" and it
suggests "way." An LLM does the same thing — but with vastly more
context, vastly more data, and vastly more sophistication. It doesn't
"understand" your message. It predicts what comes next.
As AI researcher Gary Marcus puts it, large language models are
“glorified autocomplete” — impressive in their output,
but fundamentally just predicting the next word.
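To make that concrete, here is a minimal sketch of the prediction loop using the small, public GPT-2 model through the Hugging Face transformers library (chosen only because it is freely runnable; ChatGPT-class models do the same thing at vastly larger scale):

```python
# Sketch: "autocomplete" as a loop — predict the next token, append it, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("I'm on my", return_tensors="pt").input_ids
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits            # a score for every token in the vocabulary
    next_id = logits[0, -1].argmax()          # greedily pick the single most likely token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))               # e.g. "I'm on my way ..." — continuation varies
```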
Process 1
Tokenization
Before an LLM can process language, text must be broken into small
pieces called tokens — like Scrabble tiles the model can work with.
Scroll to see how text is broken into tokens
Original text
The animal didn't cross the street because it was too tired
Tokenizer (BPE)
Tokens
The (464) · animal (5765) · didn (1422) · 't (956) · cross (3108) · the (279) · street (8080) · because (1606) · it (433) · was (574) · too (2288) · tired (12544)
Most of these tokens are full words; “didn” and “’t” are subword pieces.
Tokens: The Scrabble Tiles of AI
Before an LLM can process text, it must break it into tokens —
small pieces it can work with, like Scrabble tiles.
Tokens are not always whole words. The tokenizer uses Byte-Pair Encoding (BPE)
to split rare or compound words into smaller, more common pieces. Notice how
“didn’t” becomes two tokens: didn and ’t.
Each token is then mapped to a numerical ID — the only thing the model
actually sees. GPT-4’s vocabulary contains roughly 100,000 possible tokens.
As Andrej Karpathy explains, “a lot of weird behaviors and problems of LLMs
actually trace back to tokenization.”
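You can reproduce this kind of split with OpenAI's tiktoken library. The exact IDs depend on which tokenizer you load, so they may not match the numbers shown above.

```python
# Sketch: BPE tokenization with tiktoken (token IDs depend on the encoding chosen).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding family used by GPT-4-era models
text = "The animal didn't cross the street because it was too tired"

ids = enc.encode(text)
pieces = [enc.decode([i]) for i in ids]
print(list(zip(pieces, ids)))
# Words like "didn't" split into more than one token; common words stay whole.
```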
Process 2
Embeddings
Each token becomes a numerical vector — essentially GPS coordinates in
"meaning space," where similar concepts naturally cluster together.
These positions are learned during training, not hand-coded.
Scroll to explore meaning space
Meaning Space
Every word becomes a point in this space. As the model trains on billions
of sentences, it learns to place words that appear in similar contexts near
each other. The result is a rich geometric landscape where relationships
between words are captured as directions and
distances.
Example clusters in meaning space: places, animals, emotions, science.
Paris − France + Japan ≈ Tokyo
The direction from France to Paris encodes the concept "capital of."
Apply that same direction starting from Japan and you land near Tokyo. The model
discovers these geometric relationships entirely on its own, by predicting which
words appear in similar contexts across billions of sentences.
First demonstrated by Mikolov et al. (2013) with Word2Vec. Modern LLMs use
contextual embeddings where a word's position shifts depending on
the sentence around it — "bank" moves toward finance or rivers depending on context.
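For readers who want to try the classic analogy themselves, here is a sketch using static pre-trained vectors via the gensim library (the 2013-style approach the text cites, not the contextual embeddings inside a modern LLM):

```python
# Sketch: "Paris - France + Japan ≈ Tokyo" with static word vectors (Word2Vec/GloVe style).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # small pre-trained embedding model

result = vectors.most_similar(positive=["paris", "japan"], negative=["france"], topn=3)
print(result)   # "tokyo" is typically at or near the top of the list
```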
Process 3
Attention
The breakthrough that made modern AI possible. The attention mechanism
lets each word look at every other word to understand context — resolving
ambiguity in ways previous architectures couldn't.
Scroll to see attention in action
Self-Attention: Resolving “it”
When the model encounters the word “it”, attention lets it look at
every other word and decide which ones are relevant. The thicker the line, the more attention
“it” pays to that word.
The animal didn't cross the street because it was too tired
Query, Key, Value — A Library Analogy
Query (Q): “What am I looking for?” The word “it” asks: “What noun do I refer to?”
Key (K): “What do I contain?” Each word advertises its identity, like a library catalogue entry.
Value (V): “Here’s my actual information.” The content retrieved once Query matches a Key.
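In code, the Query/Key/Value story is just a few matrix multiplications. A minimal single-head version in NumPy, with masking and multiple heads omitted for clarity:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over token embeddings X (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each token asks (Q), advertises (K), offers (V)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how well each Query matches each Key
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax: attention weights for each token
    return weights @ V                               # each output is a weighted mix of Value vectors
```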
Contextual Attention: The “Mole” Problem
The word “mole” means completely different things depending on context.
Attention lets the model figure out which meaning is intended by attending to surrounding words.
Animal: “The American shrew mole burrows underground.”
Chemistry: “One mole of CO₂ weighs 44 grams.”
Medical: “Biopsy of the skin mole was benign.”
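A rough sketch of that effect, using a small open model (BERT) through the Hugging Face transformers library. It assumes “mole” stays a single token under that tokenizer, and simply shows that the same word lands in different places in meaning space depending on the sentence:

```python
# Sketch: the same word "mole" gets a different contextual vector in each sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mole_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index("mole")                       # assumes "mole" is one token here
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    return hidden[0, idx]

animal = mole_vector("the american shrew mole burrows underground")
chemistry = mole_vector("one mole of carbon dioxide weighs 44 grams")

similarity = torch.nn.functional.cosine_similarity(animal, chemistry, dim=0)
print(similarity)   # noticeably below 1.0: same word, different positions in meaning space
```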
Putting It Together
The Transformer Architecture
Tokenization, embeddings, and attention combine inside a
transformer — the architecture introduced in Google's landmark 2017
paper "Attention Is All You Need" that powers every modern LLM.
Scroll to open the black box
The LLM: a black box?
GPT-4 has an estimated 1.8 trillion parameters — numerical
values learned during training. That’s why it’s called a “black box”:
no human could unravel all those connections. But we can
understand the architecture. Let’s open the box.
Architecture Key
Input Tokens: text broken into pieces the model can process
Embedding: each token becomes a vector of ~12,000 numbers
Self-Attention: every word looks at every other word for context
Feed-Forward (MLP): dense neural network layers that transform representations
Softmax Output: a probability distribution over every possible next token
GPT-4: ~1.8 trillion parameters across 96 transformer layers, each with billions of learned numerical connections. This is why it’s a “black box”: the sheer scale makes individual connections impossible to trace.
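A simplified PyTorch sketch of one such layer; the sizes here are small, arbitrary values (GPT-4's true dimensions are not public), and causal masking is omitted for brevity:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One simplified decoder-style layer: self-attention followed by a feed-forward MLP."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                      # x: (batch, seq_len, d_model) embedded tokens
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)       # every position attends to every other position
        x = x + attn_out                       # residual connection
        return x + self.mlp(self.norm2(x))     # feed-forward transform, another residual
```

A full model stacks many of these layers (96 in the estimate above) and finishes with a softmax over the whole vocabulary.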
The Final Step
RLHF: Teaching the Model Manners
A raw LLM is powerful but unruly — it might generate toxic content,
confidently state falsehoods, or ignore your actual question. That's
where RLHF comes in.
Reinforcement Learning from Human Feedback
After the base model is trained on trillions of tokens, human reviewers
step in. They're shown pairs of model outputs and asked: which
response is more helpful, more accurate, less harmful? These
preferences are used to train a reward model — a
separate system that learns to score outputs the way humans would.
The LLM is then fine-tuned using reinforcement learning to maximise
that reward score. Over many iterations, the model learns to produce
responses that are helpful, honest, and safe — not because it
understands these concepts, but because it's been optimised to match
human preferences. This is what turns a raw text predictor into a
usable assistant like ChatGPT or Claude.
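In practice the reward model is usually trained with a pairwise objective over those human judgements. A minimal sketch, assuming PyTorch and a reward_model stand-in that maps a response to a single score:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, preferred, rejected):
    """Pairwise preference loss: push the human-preferred response's score above the other."""
    r_preferred = reward_model(preferred)     # scalar score for the response reviewers liked
    r_rejected = reward_model(rejected)       # scalar score for the response they rejected
    return -F.logsigmoid(r_preferred - r_rejected).mean()
# The reinforcement-learning stage then nudges the LLM toward outputs this model scores highly.
```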
RLHF is the "snowman sculptor" — the process that shapes the raw
iceberg into the polished application sitting on top.
But there’s a hidden cost. The humans doing this work — many
employed through outsourcing firms in Kenya, the Philippines, and other
countries — are often paid less than $2 per hour to review and classify
content, including violent, sexual, and deeply disturbing material. A 2023
Time investigation revealed that workers labelling data for ChatGPT
described the work as “torture.” This is the invisible human
labour that makes AI “alignment” possible.
“Artificial intelligence is neither artificial nor intelligent.
It is both embodied and material, made from natural resources, fuel,
human labor, infrastructures, logistics, histories, and
classifications.”
— Kate Crawford, Atlas of AI (2021)
Which response is better? Reviewers pick between Response A and Response B, and with each judgement an alignment meter climbs from “raw model” towards “aligned”.
Through thousands of comparisons like these, human feedback shapes the
model's behaviour — making it more helpful, safer, and more accurate.
Beyond Text
How AI Generates Images
Image generators like DALL-E, Stable Diffusion, and Midjourney use a
different technique called diffusion. Imagine ink diffusing in a glass
of water until it's uniformly mixed — then learning to run the process
in reverse, starting from pure noise and gradually "un-mixing" it into
a coherent image, guided by a text prompt.
Scroll to watch noise become an image
How image generation works
Diffusion models like DALL-E and Stable Diffusion learn to create images by
reversing a noise process. Imagine ink dropped in water, diffusing until uniformly
mixed. These models learn to un-mix the ink, step by step, guided by a
text prompt.
>"A snowman face on a blue background"
Step 0: Pure noise
Noise → Image
Starting from pure random noise — no discernible pattern. This is the
equivalent of fully diffused ink in water.
The model begins its first denoising pass. It has learned
statistical patterns from millions of images during training.
Broad shapes begin to emerge. The text prompt “snowman face on blue background”
guides the model via cross-attention with a CLIP text encoder.
Colors start separating — whites clustering where the face should be,
blues filling the background region.
The pattern becomes clearly recognizable. Each denoising step refines
the prediction from the previous step.
Almost there. Fine details resolve. In practice, diffusion models
run 20-50 of these denoising steps.
The final denoised image — a snowman face on a blue background, generated
entirely from noise, guided only by text.
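For the technically curious, here is a heavily simplified sketch of that reverse process. The noise_predictor, the prompt embedding, and the update rule below are stand-ins; real samplers such as DDPM or DDIM follow a learned noise schedule.

```python
import torch

@torch.no_grad()
def generate_image(noise_predictor, prompt_embedding, steps=50, shape=(1, 3, 64, 64)):
    """Illustrative reverse diffusion: start from pure noise, remove predicted noise step by step."""
    x = torch.randn(shape)                                        # step 0: pure random noise
    for t in reversed(range(steps)):
        predicted_noise = noise_predictor(x, t, prompt_embedding) # guided by the text prompt
        x = x - predicted_noise / steps                           # crude stand-in for a real noise schedule
    return x                                                      # a denoised image emerges
```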
A Critical Reminder
Chatbots Don’t Make Sense, They Make Words
The terminology we use for AI — “learns,” “thinks,”
“understands” — is what researcher Melanie Mitchell calls
wishful mnemonics. These words describe what the system
appears to do, not what it actually does. An LLM is a sophisticated
statistical prediction engine, not a mind.
That’s why Leon Furze chose the iceberg metaphor: “I’m
deliberately avoiding any kind of analogy that represents the AI as magical,
mythical, human, or godlike — we’ve seen enough of them.”
The iceberg emphasises hidden infrastructure, not hidden intelligence.
What about “reasoning” models like OpenAI’s o1 or DeepSeek-R1?
These models are trained to generate a chain of thought —
working through a problem step by step before answering. But the mechanism is
identical: they’re still predicting the next token. The “reasoning”
is just more tokens — the model has learned that writing out intermediate
steps leads to better final answers. It’s autocomplete that’s been
taught to show its working.
Coming Q4 2027
Practical AI Strategies 2
The Critical Guide to GenAI in Education
Stay ahead of the curve. Subscribe for articles, resources, and updates
on understanding generative AI — delivered straight to your inbox.