Status: Ready. Type a question, click Ask, or scroll manually. --:--
Prompt → Inference
Type a question and click Ask for an
auto-walkthrough (pause anytime), or scroll yourself—from tokens and
transformers down to the fab, then back to your screen.
Ask = auto tour · Scroll = manual · Best in landscape
Section 1: Application layer
[ The Application Layer ]
Ask AI Assistant
Shortcut: Ctrl+Enter (or
⌘+Enter on Mac)
Behind the scenes
Your text becomes tokens (integer IDs the
model actually sees), wrapped in
HTTPS (TLS) so it is encrypted on the wire,
and sent as packets from your device toward a remote data
center—often thousands of miles away. The server will
usually apply a chat template (special
tokens and separators) so the model knows what is system
instruction, user message, or prior assistant turns.
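The chat-template step can be sketched like this. The <|...|> markers and the exact layout are illustrative, not any vendor's actual special tokens; real templates are model-specific.

```python
# Sketch of a chat template: turn role-tagged messages into one string the
# tokenizer will consume. The <|...|> markers here are made up for
# illustration; every model family defines its own.
def apply_chat_template(messages):
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}\n<|end|>")
    parts.append("<|assistant|>\n")  # cue the model to answer next
    return "\n".join(parts)

prompt = apply_chat_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is the sky blue?"},
])
```

The trailing assistant marker is the whole trick: the model continues the text from there, so "answering" is just next-token prediction after the cue.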
Expert: client ↔ API
Request payload — JSON fields such as
model id, messages or prompt string,
max_tokens/max_completion_tokens,
temperature, top‑p, stop sequences, stream=true for SSE
chunks.
Streaming — server emits partial tokens
(often as SSE); the UI can render incrementally while the
forward pass is still running decode steps.
Tool / function calling — model emits
structured “calls” the runtime executes; results are fed
back as messages—another loop on top of raw text
generation.
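A request body along these lines is what actually crosses the wire. Field names mirror common chat-completions APIs but are illustrative, not tied to any specific vendor.

```python
import json

# Illustrative request body for a chat-completions-style API. The model id
# is hypothetical; field names follow common conventions, not one vendor.
payload = {
    "model": "example-model",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "stop": ["\n\nUser:"],
    "stream": True,   # ask the server for SSE-style partial-token chunks
}
body = json.dumps(payload)   # this JSON rides inside the TLS record
```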
Processing…
8%
Packet preview: TLS_RECORD · tokens pending…
Idle—waiting for your keystrokes.
MSDN — Text pipeline (LLM edition)
[ Expert: from characters to token IDs ]
Tokenization and API
LLMs do not read UTF‑8 strings;
they read sequences of token IDs. A
tokenizer (BPE, Unigram LM, or similar) merges
bytes/subwords into a fixed vocabulary (often
32k–256k entries). Same text can tokenize differently across
vendors—always count tokens on the tokenizer you ship.
BPE / SentencePiece
Subword splitting is a trade-off: fewer OOV issues, but multi-byte
characters and whitespace can cost extra tokens vs “naive”
splitting—billing and context limits are token-based.
Special tokens
<|bos|>, <|eos|>, tool
markers, etc.—carry structural meaning; the chat template inserts
them in a model-specific order.
Context window
Hard cap on positions the model can attend to at once; prompt +
completion must fit (or use retrieval / summarization
outside the model).
Logprobs
Some APIs return per-token log-probabilities—useful for
confidence scoring, contrastive decoding research, and debugging.
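The BPE idea above fits in a few lines: repeatedly merge the most frequent adjacent pair of symbols. A toy sketch on one string (real tokenizers learn tens of thousands of merges over huge corpora):

```python
from collections import Counter

# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
# Two merges on "lowlowlow" collapse characters into word-level tokens.
def bpe_merges(symbols, num_merges):
    symbols = list(symbols)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)   # apply the winning merge
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

tokens = bpe_merges("lowlowlow", 2)   # -> ["low", "low", "low"]
```

This is also why the same text can tokenize differently across vendors: each vocabulary encodes a different learned merge history.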
Tip: “prompt engineering” is really “latent space steering via tokens.”
Network — Dial-up to Fiber Highways
[ The Network Layer ]
Network layer
Connecting…
Copper gave way to glass: your prompt hops from Wi‑Fi to a
router, then rides long-haul fiber at roughly
two-thirds the speed of light—sometimes via ground stations, sometimes with a satellite
detour.
POTS / last mile
→
Metro fiber
→
Submarine & backbone
Hundreds of thousands of miles of undersea cable stitch continents
together—your bytes may cross an ocean before they reach a GPU
hall.
Latency cheat sheet
Last hop Wi‑Fi: often single-digit to tens of ms to the nearest
router.
Long-haul fiber: geography dominates—coast-to-coast adds real
milliseconds even at ludicrous speed.
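The "geography dominates" claim is easy to check on the back of an envelope, assuming light in fiber travels at roughly two-thirds of c (the 4,000 km distance below is an illustrative coast-to-coast figure):

```python
# Back-of-envelope one-way propagation delay through optical fiber.
C = 299_792.458       # speed of light in vacuum, km/s
GLASS_FACTOR = 2 / 3  # typical propagation speed in fiber vs vacuum

def one_way_ms(distance_km):
    return distance_km / (C * GLASS_FACTOR) * 1000

print(f"{one_way_ms(4000):.1f} ms one way")  # ~20 ms for 4,000 km
```

That is pure propagation; queuing, routing hops, and the TLS handshake all add on top.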
Packets queued for the backbone…
System Properties — Regional Compute Site
[ The Data Center ]
Data center
General · Cooling · Power
From your beige box to a warehouse of servers: a modern AI site
is a power plant dressed as a building—
thousands of accelerators, liquid cooling loops,
and tens of megawatts drawn from the grid like
a small city district.
Scale
Football-field footprints, redundant feeds, and fiber trunks
fanning out under the parking lot.
Submarine cable
Globally, undersea systems total on the order of
hundreds of thousands of miles
of glass on the seabed feeding coastal landing stations.
Your prompt
Routed to a cluster scheduler, then pinned to a slice of GPUs
reserved for inference.
Airflow: CRAC · PUE: “it depends” · Coffee: not included
Help — About the neural network file
[ The Model — not magic, just weights ]
Model checkpoint
What you call “the AI” in the cloud is usually a
checkpoint: multi-gigabyte blobs of
learned parameters (billions of numbers), plus a
config describing layer shapes and a
tokenizer that maps text ↔ tokens. Training
fits those weights once; inference
(this whole tour) replays the math with those weights held
fixed.
Where it lives before your prompt hits silicon
Object store / parallel filesystem — durable
copies of shards vendors and researchers can version.
Host DRAM → PCIe → GPU HBM — loaders stream
tensors into accelerator memory; the first token cannot run
until the working set (or layer stream) is resident.
Quantization — many deployments store
8-bit/4-bit approximations to shrink bandwidth and RAM
footprint; the stack still executes matrix math, just on
narrower numbers.
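The "narrower numbers" trick can be sketched as symmetric int8 quantization: store integers in [-127, 127] plus one scale factor, and dequantize back for the math. A minimal per-tensor version (real deployments use per-channel or per-group scales):

```python
# Sketch of symmetric per-tensor int8 quantization on a plain Python list:
# one float scale plus small integers replaces full-precision weights.
def quantize_int8(w):
    scale = max(abs(x) for x in w) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)   # close to w, at a quarter of float32 storage
```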
Expert: parameters & tensors
Weights vs activations — weights are static at
inference; activations are per forward pass and scale with
batch, sequence length, and width.
Sharding formats — SafeTensors, GGUF, custom
checkpoints; each implies different load and mmap behavior on
CPU vs GPU.
MoE (mixture-of-experts) — only a subset of
“expert” FFNs fire per token; saves compute but complicates
routing, load balancing, and memory placement.
Training burns the budget; inference burns the power bill every query.
Training Wizard — How the checkpoint was born
[ Expert: from random init to chatty assistant ]
Training and alignment
Pretraining minimizes next-token (or masked-token)
cross-entropy on massive text/code corpora—pure self-supervised
compression. That yields a base model that predicts a distribution
over the vocabulary, not yet a polite assistant.
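The loss pretraining minimizes is just next-token cross-entropy, averaged over billions of positions. For one position (toy 3-token vocabulary):

```python
import math

# Next-token cross-entropy at one position: the negative log-probability
# the model assigned to the token that actually came next.
def cross_entropy(probs, target_id):
    return -math.log(probs[target_id])

probs = [0.1, 0.7, 0.2]          # model's softmax output
low = cross_entropy(probs, 1)    # confident and right -> low loss
high = cross_entropy(probs, 0)   # confident and wrong -> high loss
```

Gradient descent pushes probability mass toward observed continuations; everything else in the wizard below is about doing that at scale without diverging.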
Supervised fine-tuning (SFT) pairs instructions
with demonstrations. Preference optimization
(RLHF, DPO, IPO, …) nudges the policy toward human or AI judge
rankings—reducing toxicity and improving instruction-following.
Scaling laws
Loss often improves predictably with compute, data, and model
size (Chinchilla-style trade-offs: don’t under-train a huge
model on tiny data).
Optimizer & precision
AdamW, learning-rate schedules, gradient clipping, ZeRO/FSDP
sharding—training is a distributed systems problem as much as a
math one.
Evals
Perplexity on held-out text; downstream benchmarks (MMLU, GSM8K,
coding suites); red-teaming for safety—no single number captures
“quality.”
Inference vs training
Training needs full activations for the backward pass, plus optimizer
state; inference only needs the forward pass (plus KV cache during
decode)—hence different memory and kernel recipes.
Alignment: teaching models what humans want—without perfect labels.
Device Manager — Accelerators
[ The GPU Rack ]
GPU rack
A cluster acts like one huge brain: your prompt is
partitioned across many GPUs, stitched together by
an interconnect moving terabytes per second inside
the rack—far faster than anything your desk PC sees on PCIe alone.
Each forward pass of a large transformer is a choreography of
matrix multiplies and nonlinearities
repeated for every layer—think enormous spreadsheets updated in
lockstep, not a single “if” statement.
Computer
├ Inference rack
│ ├ NVLink / fabric
│ └ GPU modules
└ SAN / object store
Why it matters
Big models do not fit in one GPU’s memory map; the system trades
network-on-node bandwidth for the illusion of a single giant
tensor engine. A long prompt widens activation tensors; a long
answer means many decode steps, each one another
full-ish pass through the stack—latency adds up fast.
IRQ not shown · IRQ not missed · IRQ definitely legacy joke
Tensor parallelism splits each giant matrix across
devices: every layer’s multiply is sliced, partial results are
all-gathered or reduced across
the fabric. Pipeline parallelism assigns whole
layers to different GPUs and streams micro-batches through like a
factory line—higher throughput, trickier bubble scheduling.
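Tensor parallelism in miniature: split the weight matrix's input columns across two "devices", let each compute a partial result, then sum the partials. That sum is the all-reduce. A toy matvec with Python lists standing in for GPUs:

```python
# Toy tensor parallelism: each "device" holds a slice of the input
# dimension; summing the partial matvecs (the all-reduce) recovers the
# full result.
def matvec(W, x):   # y_i = sum_j W[i][j] * x[j]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]          # 2 outputs, 4 inputs
x = [1, 1, 1, 1]
full = matvec(W, x)          # single-device reference: [10, 26]

# "Device 0" holds input columns 0-1, "device 1" holds columns 2-3.
W0, x0 = [row[:2] for row in W], x[:2]
W1, x1 = [row[2:] for row in W], x[2:]
partial0 = matvec(W0, x0)
partial1 = matvec(W1, x1)
allreduced = [a + b for a, b in zip(partial0, partial1)]  # == full
```

On real hardware the interesting part is that the sum happens over NVLink or InfiniBand, which is why collective latency shows up directly in token latency.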
Collectives
All-reduce, all-to-all, and variants synchronize hidden states;
NVLink/InfiniBand latency determines how “tight” the virtual
machine feels.
One user, many GPUs
Even a single chat turn may occupy shards on multiple cards
because parameter and activation memory exceed one package’s
HBM—economics and physics, not vanity.
Collective ops: the group project nobody asked for, everyone depends on.
READ_ME.TXT — Notepad
[ Inside one transformer layer (inference) ]
Transformer mechanics
Tokens → embedding lookup → stack of L identical-ish blocks:
• Multi-head self-attention: each position attends to every other
position (causal mask during decode). Cost grows ~quadratically
with context length — long prompts are expensive.
• Feed-forward network (FFN): two linear layers with a nonlinearity
in the middle, per position — most FLOPs often live here.
• Residual connections + layer normalization: stabilize depth so
hundreds of blocks can stack.
Autoregressive decode: output logits → sample/argmax one token →
append to context → repeat until stop.
• Softmax over logits ⇒ categorical distribution; temperature
scales logits before softmax (T>1 = flatter / more random).
• Causal mask: position i may attend only to j≤i so the model
cannot “peek” at future tokens during training or decode.
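The temperature knob from the bullets above, as a minimal sketch: divide logits by T before the softmax, so T > 1 flattens the distribution and T < 1 sharpens it.

```python
import math

# Temperature-scaled softmax over a toy 3-token logit vector.
def softmax_T(logits, T=1.0):
    scaled = [l / T for l in logits]
    m = max(scaled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_T(logits, T=0.5)   # mass concentrates on the top logit
flat = softmax_T(logits, T=2.0)    # mass spreads across the vocabulary
```

The causal mask is applied one step earlier, on attention scores rather than output logits, but by the same mechanism: disallowed positions get -inf before their softmax.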
Attention
Compares every token to every other (within mask) to mix
information—parallelizable, memory-hungry.
FFN
Per-token MLP; width often 4× the model hidden size—dominant
multiply count.
Decode loop
Each new token reruns the stack; a KV cache stores past
keys/values so you do not recompute the entire prefix every step.
Notepad · Word wrap: off · Spelling: your problem
Technical Reference — Transformer internals
[ Expert: what changed after “Attention is All You Need” ]
Deep architecture
Production LLMs stack dozens to hundreds of identical blocks. The
devil is in efficient attention, stable normalization, and
activation functions that play well with low-precision tensor
cores.
Positional encoding
RoPE (rotary) bakes relative position into Q/K;
extrapolates to longer contexts better than absolute sinusoidal tables. ALiBi
biases attention by distance—another extrapolation strategy.
Normalization
Pre-norm (norm before sublayers) trains deeper
stacks more easily. RMSNorm drops mean centering
vs LayerNorm—cheaper, common in LLaMA-style models.
FFN activations
SwiGLU gated linear units (an up-projection multiplied elementwise by a
second, Swish-activated projection, then a down-projection) often replace
the ReLU FFN—better quality at higher parameter cost.
Multi-Query / Grouped-Query Attention
Share K/V heads across Q heads—shrinks KV cache bandwidth and
memory, critical for long contexts and batched serving.
FlashAttention-style kernels
Tiled softmax in SRAM reduces HBM round-trips; IO-aware
algorithms matter as much as FLOP counts.
Reading papers helps; profiling GPUs convinces.
untitled — Paint
[ GPU Die + High Bandwidth Memory ]
GPU die and HBM
Billions of switches flipping each clock: a single accelerator die
can pack hundreds of billions of transistors,
flanked by stacked HBM—DRAM skyscrapers feeding
the compute array at multiple terabytes per second.
Tensor cores (and similar units) are hard-wired
matrix engines; they chew through the matmuls that attention and
FFN are made of—often in BF16/FP16 or narrower
formats to save bandwidth and power.
Calculator — rough intuition (not exact)
KV cache: grows with layers × heads × context × precision
Bandwidth bound? If bytes moved ÷ FLOPs is high, feed the beast (HBM).
Mixed precision: BF16/FP16 + FP32 accum — faster, still stable enough.
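The KV-cache row above turns into a one-line estimator: two tensors (K and V) per layer, each shaped [kv_heads, context, head_dim] at the chosen precision. The config numbers below are illustrative, not any specific model's:

```python
# Rough KV-cache size per sequence: 2 tensors (K and V) per layer,
# each kv_heads x context x head_dim elements at bytes_per_elem each.
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    total = 2 * layers * kv_heads * head_dim * context * bytes_per_elem
    return total / 2**30

# e.g. 32 layers, 8 grouped KV heads, head_dim 128, 32k context, FP16:
print(f"{kv_cache_gib(32, 8, 128, 32_768):.2f} GiB per sequence")  # 4.00
```

This is why grouped-query attention (fewer kv_heads) and narrower precisions matter so much for long contexts: the cache competes with the weights for HBM.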
Inference dataflow (one step): HBM streams weights
and cached K/V into on-chip SRAM → tensor cores matmul →
accumulators → activations written back—repeat for every layer,
every token.
IIS Manager — Inference stack (simulated)
[ Expert: from logits to user-visible text ]
Decoding and serving
Each decode step ends with a logit vector over
the vocabulary. How you turn logits into the next token is a
policy choice: greedy argmax, temperature-scaled softmax sampling,
top‑k / top‑p (nucleus) truncation, repetition penalties, grammar
or JSON constraints—each changes latency, diversity, and failure
modes.
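Top‑p (nucleus) truncation, one of the policies listed above, as a minimal sketch: keep the smallest set of tokens whose cumulative probability reaches p, renormalize, and sample only from that set.

```python
# Nucleus (top-p) filtering over a toy 4-token probability vector.
def top_p_filter(probs, p=0.9):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:      # smallest set reaching the nucleus mass
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}   # renormalized distribution

nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)  # drops the 0.05 tail
```

Unlike a fixed top‑k, the kept set shrinks when the model is confident and grows when it is uncertain.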
Throughput vs latency
Continuous batching (iteration-level scheduling)
packs many requests; paged attention manages KV
cache in non-contiguous blocks—higher GPU utilization.
Speculative decoding
A small “draft” model proposes multiple tokens; the big model
verifies in parallel—can cut wall-clock per token when acceptance
rates are good.
Quantization (PTQ / QAT)
Post-training quantization (GPTQ, AWQ, …) fits deployment;
quantization-aware training integrates fake-quant during
finetune. Formats like GGUF bundle metadata for CPU/GPU loaders.
Structured output
Constrained decoding (FSMs, grammars) forces valid JSON/SQL—
critical for agents; overlaps with compiler-style graph
traversal over token masks.
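The token-mask mechanism is simple even when the grammar machinery above it is not: before sampling, set the logits of every token the current grammar state disallows to -inf. The token ids and "allowed" set below are made up for illustration:

```python
import math

# Constrained decoding primitive: -inf logits get zero probability after
# softmax, so only grammar-legal tokens can ever be sampled.
def mask_logits(logits, allowed_ids):
    return [l if i in allowed_ids else -math.inf
            for i, l in enumerate(logits)]

logits = [1.2, 0.3, 2.5, 0.0]
masked = mask_logits(logits, allowed_ids={0, 3})  # grammar allows ids 0, 3
```

A JSON grammar, for instance, would allow only the tokens that can legally continue the current partial document, advancing its state machine after each accepted token.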
Failure modes experts watch
Exposure bias — train on gold prefixes, infer on model’s own drifted text.
Context rot — quality drops as context fills with stale or contradictory text.
Jailbreaks / prompt injection — untrusted data in prompts treated as instructions.
Production LLM = model file + scheduler + tokenizer + safety filters + accountants.
Fab Setup Wizard
[ The Semiconductor Fab ]
Semiconductor fab
Welcome to the cleanest room on Earth
Fabs cost tens of billions of dollars, take
years to build, and run like a hospital OR for
silicon: particle counts measured per cubic foot, not “eh, looks
fine.”
FOUPs, robots, and choreography
Wafers travel in sealed pods; overhead robots shuffle them tool
to tool. A single misalignment or contamination event can scrap
expensive batches—so everything is instrumented.
Deposition, etch, implant, repeat
Circuits are printed in dozens to a hundred-plus layers: grow
film, carve trenches, dope atoms, polish flat, inspect, repeat.
Metrology catches defects before they become bad dice.
Why you should care
The chips that run inference begin here. No fab, no GPU; no GPU,
no tokens—just a very fancy Notepad window staring back at you.
Tip: Press Next for a guided slideshow (keyboard friendly).
euv_schematic.bmp — Paint
[ EUV Lithography ]
EUV lithography
Laser, tin, and mirrors at the limit of physics: a
CO₂ laser blasts molten tin droplets to make EUV light. That light
bounces through the flattest mirrors humanity can polish, through a
reticle, and exposes photoresist on a wafer—printing features
smaller than visible light could ever resolve.
Only a few dozen of these tools ship per year
(exact numbers vary by year), each priced in the
hundreds of millions of dollars—a deliberate,
scarce bottleneck at the frontier of patterning.
Now the response races back up through every layer
07 Wafer patterned
EUV and companion steps print dozens to a hundred-plus stacked
layers, each aligned to sub-nanometer precision—a
few atoms of misregistration show up as dead yield.
06 Chip packaged
The wafer is diced, probed, binned, and assembled with
HBM stacks and a substrate into the module you
would recognize on a server board.
05 Memory feeds the model
Terabytes per second of bandwidth stream your context from stacked
DRAM into the compute die—memory bandwidth, not FLOPS alone, often
sets the pace.
04 GPUs generate tokens
Coordinated accelerators turn your question into massive parallel
math, emitting tokens—the next words of the
answer—under tight latency budgets.
03 Data center sends
Completed tokens leave as packets again—TLS on the wire—headed for
the same fiber highways, just reversed.
02 Fiber returns
At roughly two-thirds c in glass, bits sprint
across continents and under oceans until they pop out at your ISP
and home router.
01 Rendered on your screen
Your browser or app decrypts the stream, lays out glyphs, and the
answer appears—character by character if the UI streams—thanks to
every layer below.
About — Prompt → Inference
[ Thank you for visiting ]
“Why is the sky blue?”
Every answer depends on networks,
data centers, GPUs,
memory, giant learned neural nets,
fabs, and EUV machines working in
perfect sync—not magic, just engineering stacked on engineering.
If you read the expert panels along the way, you
now have the full stack story: tokenization & APIs,
training and alignment,
transformer architecture,
parallelism & silicon, and
decoding / serving / quantization—enough context
to dig into papers, benchmarks, and production systems without
drowning in hype.