Status: Ready. Type a question, click Ask, or scroll manually.

Prompt → Inference

Type a question and click Ask for an auto-walkthrough (pause anytime), or scroll yourself—from tokens and transformers down to the fab, then back to your screen.

Section 1: Application layer

Ask AI Assistant

Shortcut: Ctrl+Enter (or ⌘+Enter on Mac)

Behind the scenes

Your text becomes tokens (integer IDs the model actually sees), wrapped in HTTPS (TLS) so it is encrypted on the wire, and sent as packets from your device toward a remote data center—often thousands of miles away. The server will usually apply a chat template (special tokens and separators) so the model knows what is system instruction, user message, or prior assistant turns.
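A minimal sketch of what a chat template does, with made-up special tokens (the names `<|bos|>`, `<|end|>`, and the role markers here are illustrative; real templates and token names are model-specific):

```python
# Hypothetical chat template. Real templates are model-specific;
# the special tokens below are illustrative, not any vendor's.
def apply_chat_template(messages):
    parts = ["<|bos|>"]
    for m in messages:
        parts.append(f"<|{m['role']}|>{m['content']}<|end|>")
    parts.append("<|assistant|>")  # cue the model to start answering
    return "".join(parts)

prompt = apply_chat_template([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Why is the sky blue?"},
])
```

The wrapped string, not your raw text, is what gets tokenized and sent through the stack.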

Expert: client ↔ API
  • Request payload — JSON fields such as model id, messages or prompt string, max_tokens/max_completion_tokens, temperature, top‑p, stop sequences, stream=true for SSE chunks.
  • Streaming — server emits partial tokens (often as SSE); the UI can render incrementally while the forward pass is still running decode steps.
  • Tool / function calling — model emits structured “calls” the runtime executes; results are fed back as messages—another loop on top of raw text generation.
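A sketch of such a request body in Python; the fields follow the list above, but exact names vary by vendor (e.g. max_tokens vs max_completion_tokens) and the model id here is a placeholder:

```python
import json

# Sketch of a chat-completion request body; field names vary by vendor.
payload = {
    "model": "example-model",  # hypothetical model id
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.95,
    "stop": ["\n\n"],
    "stream": True,            # ask the server for SSE chunks
}
body = json.dumps(payload)     # this JSON rides inside the TLS records
```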
Packet preview: TLS_RECORD · tokens pending…
Idle—waiting for your keystrokes.
MSDN — Text pipeline (LLM edition)

Tokenization and API

LLMs do not read UTF‑8 strings; they read sequences of token IDs. A tokenizer (BPE, Unigram LM, or similar) merges bytes/subwords into a fixed vocabulary (often 32k–256k entries). The same text can tokenize differently across vendors—always count tokens on the tokenizer you ship.

BPE / SentencePiece
Subword splitting is a trade-off: fewer OOV issues, but multi-byte characters and whitespace can cost extra tokens vs “naive” splitting—and billing and context limits are token-based.
Special tokens
<|bos|>, <|eos|>, tool markers, etc.—carry structural meaning; the chat template inserts them in a model-specific order.
Context window
Hard cap on positions the model can attend to at once; prompt + completion must fit (or use retrieval / summarization outside the model).
Logprobs
Some APIs return per-token log-probabilities—useful for confidence scoring, contrastive decoding research, and debugging.
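The merge mechanism behind BPE, shrunk to a toy (real tokenizers learn thousands of merge rules from byte-level data; this shows one merge step only):

```python
from collections import Counter

# Toy BPE: merge the most frequent adjacent pair into one token.
def bpe_merge_once(tokens):
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)   # fuse the pair into a new vocab entry
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toks = bpe_merge_once(list("banana"))   # "an" is the most frequent pair
```

Run the merge loop to a fixed vocabulary size and you have (roughly) the tokenizer that decides what your prompt costs.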
Tip: “prompt engineering” is really “latent space steering via tokens.”
Network — Dial-up to Fiber Highways

Network layer

Connecting…

Copper gave way to glass: your prompt hops from Wi‑Fi to a router, then rides long-haul fiber at roughly two-thirds the speed of light—sometimes via ground stations, sometimes with a satellite detour.

Latency cheat sheet
  • Last hop Wi‑Fi: often single-digit to tens of ms to the nearest router.
  • Long-haul fiber: geography dominates—coast-to-coast adds real milliseconds even at ludicrous speed.
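The geography point as arithmetic, assuming light in glass at roughly two-thirds c and a round-number route distance (illustrative, not a route survey):

```python
# Back-of-envelope one-way fiber latency: light in glass ≈ (2/3) c.
C_VACUUM_KM_S = 299_792            # speed of light in vacuum, km/s
v_fiber = (2 / 3) * C_VACUUM_KM_S  # ≈ 200,000 km/s in glass

def one_way_ms(km):
    return km / v_fiber * 1000

# Illustrative coast-to-coast distance, not a measured fiber path.
nyc_to_sf = one_way_ms(4_700)
```

Round trips double it, and queuing/routing add more—physics sets the floor, not the ceiling.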
Packets queued for the backbone…
System Properties — Regional Compute Site

Data center

General Cooling Power

From your beige box to a warehouse of servers: a modern AI site is a power plant dressed as a building—thousands of accelerators, liquid cooling loops, and tens of megawatts drawn from the grid like a small city district.

Scale
Football-field footprints, redundant feeds, and fiber trunks fanning out under the parking lot.
Submarine cable
Globally, undersea systems total on the order of hundreds of thousands of miles of glass on the seabed feeding coastal landing stations.
Your prompt
Routed to a cluster scheduler, then pinned to a slice of GPUs reserved for inference.
Airflow: CRAC · PUE: “it depends” · Coffee: not included
Help — About the neural network file

Model checkpoint

What you call “the AI” in the cloud is usually a checkpoint: multi-gigabyte blobs of learned parameters (billions of numbers), plus a config describing layer shapes and a tokenizer that maps text ↔ tokens. Training fits those weights once; inference (this whole tour) replays the math with those weights held fixed.

Where it lives before your prompt hits silicon
  • Object store / parallel filesystem — durable copies of shards vendors and researchers can version.
  • Host DRAM → PCIe → GPU HBM — loaders stream tensors into accelerator memory; the first token cannot run until the working set (or layer stream) is resident.
  • Quantization — many deployments store 8-bit/4-bit approximations to shrink bandwidth and RAM footprint; the stack still executes matrix math, just on narrower numbers.
Expert: parameters & tensors
  • Weights vs activations — weights are static at inference; activations are per forward pass and scale with batch, sequence length, and width.
  • Sharding formats — SafeTensors, GGUF, custom checkpoints; each implies different load and mmap behavior on CPU vs GPU.
  • MoE (mixture-of-experts) — only a subset of “expert” FFNs fire per token; saves compute but complicates routing, load balancing, and memory placement.
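A toy top-k router to show the MoE idea; the gate scores and stand-in expert functions below are made up, not learned:

```python
# Minimal MoE routing sketch: pick the top-2 experts per token by gate
# score and run only those FFNs. Scores and experts are toy values.
def route(gate_scores, k=2):
    # indices of the k largest gate scores
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

experts = [lambda x, s=s: x * s for s in (0.5, 1.0, 2.0, 4.0)]  # stand-in FFNs
scores = [0.1, 0.7, 0.05, 0.15]

chosen = route(scores)   # only these experts fire for this token
y = sum(scores[i] * experts[i](3.0) for i in chosen)  # gate-weighted mix
```

Compute scales with k, not with the total expert count—hence the memory-placement and load-balancing headaches.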
Training burns the budget; inference burns the power bill every query.
Training Wizard — How the checkpoint was born

Training and alignment

Pretraining minimizes next-token (or masked-token) cross-entropy on massive text/code corpora—pure self-supervised compression. That yields a base model that predicts a distribution over the vocabulary, not yet a polite assistant. Supervised fine-tuning (SFT) pairs instructions with demonstrations. Preference optimization (RLHF, DPO, IPO, …) nudges the policy toward human or AI judge rankings—reducing toxicity and improving instruction-following.

Scaling laws
Loss often improves predictably with compute, data, and model size (Chinchilla-style trade-offs: don’t under-train a huge model on tiny data).
Optimizer & precision
AdamW, learning-rate schedules, gradient clipping, ZeRO/FSDP sharding—training is a distributed systems problem as much as a math one.
Evals
Perplexity on held-out text; downstream benchmarks (MMLU, GSM8K, coding suites); red-teaming for safety—no single number captures “quality.”
Inference vs training
Training needs full activations for the backward pass, plus optimizer state; inference only needs the forward pass (plus the KV cache during decode)—hence different memory and kernel recipes.
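The pretraining objective in miniature: cross-entropy of the correct next token under a softmax, on toy logits (a real model scores a 32k–256k vocabulary per position):

```python
import math

# Next-token cross-entropy for one position: -log p(correct token).
def cross_entropy(logits, target):
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    p = exps[target] / sum(exps)              # softmax probability of target
    return -math.log(p)

# Toy 3-token vocabulary; the model already favors token 0, so loss is low.
loss = cross_entropy([2.0, 0.5, -1.0], target=0)
```

Averaged over trillions of positions, driving this number down is the entire pretraining budget.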
Alignment: teaching models what humans want—without perfect labels.
Device Manager — Accelerators

GPU rack

A cluster acts like one huge brain: your prompt is partitioned across many GPUs, stitched together by an interconnect moving terabytes per second inside the rack—far faster than anything your desk PC sees on PCIe alone. Each forward pass of a large transformer is a choreography of matrix multiplies and nonlinearities repeated for every layer—think enormous spreadsheets updated in lockstep, not a single “if” statement.

Why it matters

Big models do not fit in one GPU’s memory map; the system spends on-node network bandwidth to create the illusion of a single giant tensor engine. A long prompt widens activation tensors; a long answer means many decode steps, each one another full-ish pass through the stack—latency adds up fast.

IRQ not shown · IRQ not missed · IRQ definitely legacy joke
Cluster Job — inference_task.exe

Parallel inference

Job ID: 0xDEC0DE · Scheduler: “fair share” · Priority: your curiosity

Tensor parallelism splits each giant matrix across devices: every layer’s multiply is sliced, partial results are all-gathered or reduced across the fabric. Pipeline parallelism assigns whole layers to different GPUs and streams micro-batches through like a factory line—higher throughput, trickier bubble scheduling.

Collectives
All-reduce, all-to-all, and variants synchronize hidden states; NVLink/InfiniBand latency determines how “tight” the virtual machine feels.
One user, many GPUs
Even a single chat turn may occupy shards on multiple cards because parameter and activation memory exceed one package’s HBM—economics and physics, not vanity.
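Tensor parallelism shrunk to a mat-vec: two simulated “devices” each hold half of the matrix’s columns, compute a partial product, and an all-reduce (here just an elementwise sum) combines them—toy numbers, where a real stack runs NCCL collectives over NVLink/InfiniBand:

```python
# Tensor-parallel mat-vec in miniature: shard columns, sum partials.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]

def partial_matvec(W, x, lo, hi):
    # Each "device" sees only columns [lo, hi) of W and that slice of x.
    return [sum(row[j] * x[j] for j in range(lo, hi)) for row in W]

shard0 = partial_matvec(W, x, 0, 2)            # device 0: columns 0-1
shard1 = partial_matvec(W, x, 2, 4)            # device 1: columns 2-3
y = [a + b for a, b in zip(shard0, shard1)]    # all-reduce: sum partials
```

Every layer's multiply pays this synchronization cost, which is why interconnect latency decides how “tight” the virtual machine feels.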
Collective ops: the group project nobody asked for, everyone depends on.
READ_ME.TXT — Notepad

Transformer mechanics

Tokens → embedding lookup → stack of L identical-ish blocks:
  • Multi-head self-attention: each position attends to every other position (causal mask during decode). Cost grows ~quadratically with context length, so long prompts are expensive.
  • Feed-forward network (FFN): two linear layers with a nonlinearity in the middle, per position; most FLOPs often live here.
  • Residual connections + layer normalization: stabilize depth so hundreds of blocks can stack.
Autoregressive decode: output logits → sample/argmax one token → append to context → repeat until stop.
  • Softmax over logits ⇒ categorical distribution; temperature scales logits before softmax (T>1 = flatter / more random).
  • Causal mask: position i may attend only to j≤i so the model cannot “peek” at future tokens during training or decode.
Tokens → Embed → Self-attention (QKV, softmax, V) → FFN + residual + norm → Logits → token
Attention
Compares every token to every other (within mask) to mix information—parallelizable, memory-hungry.
FFN
Per-token MLP; width often 4× the model hidden size—dominant multiply count.
Decode loop
Each new token reruns the stack; the KV cache stores past keys/values so you do not recompute the entire prefix every step.
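A single decode step against a KV cache, with 1-D toy keys and values (a real cache holds per-layer, per-head vectors):

```python
import math

# One decode step with a KV cache: the new query attends to all cached
# keys/values; nothing from earlier steps is recomputed.
def attend(q, kv_cache):
    scores = [q * k for k, _ in kv_cache]        # dot products (dim-1 toy)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]  # unnormalized softmax
    z = sum(weights)
    return sum(w / z * v for w, (_, v) in zip(weights, kv_cache))

kv_cache = [(1.0, 10.0), (2.0, 20.0)]  # (key, value) from earlier tokens
out = attend(3.0, kv_cache)            # new token's query mixes the past
kv_cache.append((3.0, out))            # cache grows by one entry per step
```

The cache turns each decode step from O(prefix²) back into O(prefix)—at the price of the memory growth the Calculator panel below tallies.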
Notepad · Word wrap: off · Spelling: your problem
Technical Reference — Transformer internals

Deep architecture

Production LLMs stack dozens to hundreds of identical blocks. The devil is in efficient attention, stable normalization, and activation functions that play well with low-precision tensor cores.

Positional encoding
RoPE (rotary) bakes relative position into Q/K; extends better than absolute sinusoidal tables. ALiBi biases attention by distance—another extrapolation strategy.
Normalization
Pre-norm (norm before sublayers) trains deeper stacks more easily. RMSNorm drops mean centering vs LayerNorm—cheaper, common in LLaMA-style models.
FFN activations
SwiGLU and related gated linear units (one projection gated by a Swish/SiLU-activated second, then a down-projection) often replace the ReLU FFN—better quality at higher parameter cost.
Multi-Query / Grouped-Query Attention
Share K/V heads across Q heads—shrinks KV cache bandwidth and memory, critical for long contexts and batched serving.
FlashAttention-style kernels
Tiled softmax in SRAM reduces HBM round-trips; IO-aware algorithms matter as much as FLOP counts.
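RMSNorm is small enough to write out: scale by the root-mean-square and skip the mean-centering LayerNorm does (the learned per-channel gain is omitted here, i.e. treated as 1.0):

```python
import math

# RMSNorm: normalize by root-mean-square, no mean subtraction.
def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

y = rms_norm([3.0, 4.0])   # rms = sqrt((9 + 16) / 2) ≈ 3.5355
```

One fewer reduction per token than LayerNorm—cheap individually, meaningful across hundreds of blocks.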
Reading papers helps; profiling GPUs convinces.
untitled — Paint

GPU die and HBM

Billions of switches flipping each clock: a single accelerator die can pack hundreds of billions of transistors, flanked by stacked HBM—DRAM skyscrapers feeding the compute array at multiple terabytes per second. Tensor cores (and similar units) are hard-wired matrix engines; they chew through the matmuls that attention and FFN are made of—often in BF16/FP16 or narrower formats to save bandwidth and power.

Calculator — rough intuition (not exact)
KV cache Grows with layers × heads × context × precision
Bandwidth bound? If bytes moved ÷ FLOPs is high, feed the beast (HBM).
Compute bound? If matmuls dominate and SRAM hits, tensor cores stay hot.
Mixed precision BF16/FP16 + FP32 accum — faster, still stable enough.
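The KV-cache row above as an actual calculator, using illustrative numbers (not any specific model’s config); the leading 2 counts K and V:

```python
# KV-cache size ≈ 2 (K and V) × layers × kv_heads × head_dim
#                 × context × bytes/element × batch.
def kv_cache_bytes(layers, kv_heads, head_dim, context,
                   bytes_per_elem, batch=1):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem * batch

# Illustrative config: 32 layers, 8 KV heads (GQA), head_dim 128,
# 8k context, FP16/BF16 (2 bytes).
gib = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                     context=8192, bytes_per_elem=2) / 2**30
```

Per request, on top of the weights—which is exactly why grouped-query attention and paged caches exist.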
Package substrate (schematic): HBM · GPU die · HBM. Arrows: TB/s-class memory buses (conceptual).

Inference dataflow (one step): HBM streams weights and cached K/V into on-chip SRAM → tensor cores matmul → accumulators → activations written back—repeat for every layer, every token.

On-die dataflow (schematic): HBM → SRAM → tensor cores → activations
IIS Manager — Inference stack (simulated)

Decoding and serving

Each decode step ends with a logit vector over the vocabulary. How you turn logits into the next token is a policy choice: greedy argmax, temperature-scaled softmax sampling, top‑k / top‑p (nucleus) truncation, repetition penalties, grammar or JSON constraints—each changes latency, diversity, and failure modes.
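A sketch of that logits-to-token policy on a toy 5-token vocabulary: temperature scaling, then top-p (nucleus) truncation, then sampling from what survives:

```python
import math, random

# Logits → next token: temperature, then top-p truncation, then sample.
def sample(logits, temperature=1.0, top_p=0.9, rng=random):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    # Sort token probabilities descending, keep the smallest prefix
    # whose cumulative mass reaches top_p.
    ranked = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for p, _ in kept)
    r = rng.random() * z          # sample within the kept mass
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

# One logit dominates here, so low temperature + top-p collapses to greedy.
token = sample([4.0, 1.0, 0.5, 0.2, -1.0], temperature=0.7, top_p=0.9)
```

Greedy argmax is the top_p→0 limit; raising temperature flattens the distribution and widens the nucleus.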

Throughput vs latency
Continuous batching (iteration-level scheduling) packs many requests; paged attention manages KV cache in non-contiguous blocks—higher GPU utilization.
Speculative decoding
A small “draft” model proposes multiple tokens; the big model verifies in parallel—can cut wall-clock per token when acceptance rates are good.
Quantization (PTQ / QAT)
Post-training quantization (GPTQ, AWQ, …) fits deployment; quantization-aware training integrates fake-quant during finetune. Formats like GGUF bundle metadata for CPU/GPU loaders.
Structured output
Constrained decoding (FSMs, grammars) forces valid JSON/SQL—critical for agents; overlaps with compiler-style graph traversal over token masks.
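Symmetric per-tensor int8 quantization, the simplest member of the family named above (GPTQ and AWQ are far smarter about outliers and per-group scales):

```python
# Symmetric int8 PTQ, minimal version: one scale per tensor,
# round-to-nearest. Weights below are toy values.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127   # map max |w| to ±127
    q = [round(w / scale) for w in weights]      # int8 codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.02, -0.5, 0.31, 1.27]
q, s = quantize(w)
w_hat = dequantize(q, s)   # close to w, at a quarter of the FP32 footprint
```

The matmuls still run; they just move (and multiply) narrower numbers—which is the whole point when HBM bandwidth sets the pace.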
Failure modes experts watch
  • Exposure bias — train on gold prefixes, infer on model’s own drifted text.
  • Context rot — quality drops as context fills with stale or contradictory text.
  • Jailbreaks / prompt injection — untrusted data in prompts treated as instructions.
Production LLM = model file + scheduler + tokenizer + safety filters + accountants.
Fab Setup Wizard

Semiconductor fab

Welcome to the cleanest room on Earth

Fabs cost tens of billions of dollars, take years to build, and run like a hospital OR for silicon: particle counts measured per cubic foot, not “eh, looks fine.”

FOUPs, robots, and choreography

Wafers travel in sealed pods; overhead robots shuffle them tool to tool. A single misalignment or contamination event can scrap expensive batches—so everything is instrumented.

Deposition, etch, implant, repeat

Circuits are printed in dozens to a hundred-plus layers: grow film, carve trenches, dope atoms, polish flat, inspect, repeat. Metrology catches defects before they become bad dice.

Why you should care

The chips that run inference begin here. No fab, no GPU; no GPU, no tokens—just a very fancy Notepad window staring back at you.

Tip: Press Next for a guided slideshow (keyboard friendly).
euv_schematic.bmp — Paint

EUV lithography

Laser, tin, and mirrors at the limit of physics: a CO₂ laser blasts molten tin droplets to make EUV light. That light bounces through the flattest mirrors humanity can polish, through a reticle, and exposes photoresist on a wafer—printing features smaller than visible light could ever resolve.

Only a few dozen of these tools ship per year (exact numbers vary by year), each priced in the hundreds of millions of dollars—a deliberate, scarce bottleneck at the frontier of patterning.

EUV scanner (not to scale, obviously): laser → Sn droplet → EUV light → mirror train → wafer stage

Now the response races back up through every layer

07 Wafer patterned

EUV and companion steps print dozens to a hundred-plus stacked layers, each aligned to sub-nanometer precision—a few atoms of misregistration show up as dead yield.

06 Chip packaged

The wafer is diced, probed, binned, and assembled with HBM stacks and a substrate into the module you would recognize on a server board.

05 Memory feeds the model

Terabytes per second of bandwidth stream your context from stacked DRAM into the compute die—memory bandwidth, not FLOPS alone, often sets the pace.

04 GPUs generate tokens

Coordinated accelerators turn your question into massive parallel math, emitting tokens—the next words of the answer—under tight latency budgets.

03 Data center sends

Completed tokens leave as packets again—TLS on the wire—headed for the same fiber highways, just reversed.

02 Fiber returns

At roughly two-thirds c in glass, bits sprint across continents and under oceans until they pop out at your ISP and home router.

01 Rendered on your screen

Your browser or app decrypts the stream, lays out glyphs, and the answer appears—character by character if the UI streams—thanks to every layer below.

About — Prompt → Inference

Why is the sky blue?

Every answer depends on networks, data centers, GPUs, memory, giant learned neural nets, fabs, and EUV machines working in perfect sync—not magic, just engineering stacked on engineering.

If you read the expert panels along the way, you now have the full stack story: tokenization & APIs, training and alignment, transformer architecture, parallelism & silicon, and decoding / serving / quantization—enough context to dig into papers, benchmarks, and production systems without drowning in hype.