Large Language Model Architecture¶
Summary¶
This chapter explores the technical foundations of large language models, explaining how these powerful AI systems work under the hood. Students will learn about transformer architecture, attention mechanisms, and the training processes that enable LLMs to generate human-like text. Understanding these concepts is essential for effectively working with and evaluating AI platforms.
Concepts Covered¶
This chapter covers the following 16 concepts from the learning graph:
- Large Language Models
- Transformer Architecture
- Attention Mechanism
- Self-Attention
- Multi-Head Attention
- Pre-Training
- Fine-Tuning
- RLHF
- Token
- Tokenization
- Context Window
- Model Parameters
- Inference
- Latency
- Throughput
- Embeddings
Prerequisites¶
This chapter builds on concepts from:
Learning Objectives¶
After completing this chapter, students will be able to:
- Explain how large language models generate text through next-token prediction
- Describe the transformer architecture and role of attention mechanisms
- Understand the training process including pre-training, fine-tuning, and RLHF
- Explain tokens, context windows, and their business implications
- Interpret model parameters and their effects on performance
Introduction¶
The remarkable capabilities of modern AI assistants—their ability to write poetry, explain complex concepts, generate code, and engage in nuanced conversation—all derive from a common architectural foundation: the large language model (LLM). These systems represent the culmination of decades of research in natural language processing, neural network design, and distributed computing. Yet despite their sophisticated capabilities, LLMs operate according to a deceptively simple objective: predicting the next word in a sequence.
This chapter demystifies the technical machinery underlying LLMs. While business professionals need not understand every mathematical detail, a working knowledge of how these systems function—their architecture, training processes, and operational characteristics—is essential for making informed decisions about AI adoption, evaluating platform capabilities, and anticipating both the possibilities and limitations of generative AI.
Understanding Large Language Models¶
What Are Large Language Models?¶
Large Language Models (LLMs) are neural networks trained on massive text corpora to understand and generate human language. The term "large" refers to the number of parameters—the learnable weights that the model adjusts during training.
Neural Network Foundation¶
Before diving into LLM specifics, the following visualization shows how data flows through a basic neural network. Understanding this forward propagation process is essential for grasping how LLMs process tokens through their many layers.
Explore the Neural Network MicroSim →
Modern LLMs range from billions to trillions of parameters:
| Model | Organization | Parameters | Release Year |
|---|---|---|---|
| GPT-3 | OpenAI | 175 billion | 2020 |
| GPT-4 | OpenAI | ~1.8 trillion (estimated) | 2023 |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 2024 |
| Gemini Ultra | Google | Undisclosed | 2024 |
| Llama 3.1 | Meta | 405 billion | 2024 |
| Mixtral 8x22B | Mistral | 141 billion (sparse MoE, 39B active) | 2024 |
At their core, LLMs perform a single task: given a sequence of text, predict the most likely next token. This next-token prediction objective, when applied at sufficient scale with appropriate training data, yields systems capable of remarkably sophisticated linguistic behavior.
The fundamental insight is that predicting the next word well requires understanding context, grammar, facts about the world, reasoning patterns, and stylistic conventions. A model that excels at prediction must implicitly learn vast amounts of knowledge about language and the world.
The Next-Token Prediction Paradigm¶
Consider how an LLM generates a response to "The capital of France is":
- The model receives the input tokens
- It processes them through multiple layers of neural network computations
- It outputs a probability distribution over its entire vocabulary
- The token "Paris" receives high probability
- "Paris" is selected and appended to the sequence
- The process repeats with "The capital of France is Paris" as input
This autoregressive process continues until the model generates a stop token or reaches a maximum length. Each generation step considers all previous context, enabling coherent multi-paragraph outputs.
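This loop can be expressed as a short sketch. The following is a minimal, framework-agnostic illustration in Python; the `model` and `tokenizer` objects and their methods (`next_token_probs`, `encode`, `decode`, `eos_token_id`) are hypothetical placeholders, not any specific library's API:

```python
def generate(model, tokenizer, prompt, max_new_tokens=50):
    """Sketch of autoregressive decoding: one token per step."""
    tokens = tokenizer.encode(prompt)          # e.g. "The capital of France is"
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens) # distribution over the entire vocabulary
        next_token = max(probs, key=probs.get) # greedy: pick the most likely token
        if next_token == tokenizer.eos_token_id:
            break                              # stop token ends generation
        tokens.append(next_token)              # append and feed back as input
    return tokenizer.decode(tokens)
```

Each iteration appends one token and re-runs the model on the lengthened sequence, which is why long responses take proportionally longer to generate.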
Autoregressive Generation Visualization¶
The following interactive simulation demonstrates how tokens flow through neural network layers during autoregressive generation. Watch as information is compressed through hidden layers to predict the next token, which then becomes part of the input for the next prediction cycle.
Explore the Autoregressive MicroSim →
Temperature and Sampling
The selection of the next token need not be deterministic. The temperature parameter controls randomness: temperature=0 always selects the highest-probability token, while higher temperatures flatten the probability distribution so that lower-probability tokens are sampled more often. This is why the same prompt can yield different responses.
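As an illustration, temperature scaling is applied to the model's raw scores (logits) before sampling. A minimal sketch with NumPy, using made-up logits for three candidate tokens:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    """Scale logits by temperature, apply softmax, then sample one token index."""
    if temperature == 0:
        return int(np.argmax(logits))          # deterministic: highest-probability token
    scaled = np.array(logits) / temperature    # <1 sharpens, >1 flattens the distribution
    probs = np.exp(scaled - np.max(scaled))    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [4.0, 2.0, 1.0]                       # toy scores for three candidate tokens
print(sample_with_temperature(logits, temperature=0))    # always index 0
print(sample_with_temperature(logits, temperature=1.5))  # may vary between runs
```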
Tokens and Tokenization¶
Understanding Tokens¶
A token is the fundamental unit of text that LLMs process. Contrary to intuition, tokens are neither words nor characters—they are subword units determined by the model's tokenizer. Common words typically map to single tokens, while rare words are split into multiple tokens.
Examples of tokenization (GPT-style):
| Text | Tokens | Token Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "artificial" | ["art", "ificial"] | 2 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
| "antidisestablishmentarianism" | ["ant", "id", "is", "establish", "ment", "arian", "ism"] | 7 |
Tokenization is the process of converting raw text into token sequences. Different models use different tokenization schemes:
- Byte Pair Encoding (BPE): Used by GPT models. Iteratively merges frequent character pairs to build vocabulary.
- WordPiece: Used by BERT. Similar to BPE but uses likelihood-based merging.
- SentencePiece: Used by Llama and others. Language-agnostic tokenization that works directly on raw text.
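For a hands-on check of BPE tokenization, OpenAI's open-source `tiktoken` library exposes GPT-style encodings. A brief sketch follows; exact token splits vary by encoding, so the output may differ from the table above:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-era models

for text in ["Hello", "artificial", "ChatGPT", "antidisestablishmentarianism"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each token id back to its text piece
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```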
Interactive Tokenizer¶
Try the interactive tokenizer below to see how different text inputs are broken into tokens. Notice how common words typically become single tokens while rare or compound words are split into multiple subword pieces.
Explore the Tokenizer MicroSim →
Diagram: Tokenization Process Visualization¶
Tokenization Process Visualization
Type: microsim
Purpose: Interactive demonstration of how text is converted to tokens and the implications for context window usage
Bloom Taxonomy: Understand (L2) - Explain how tokenization works and affects model behavior
Learning Objective: Students should be able to estimate token counts for different text inputs and understand tokenization implications
Canvas layout (responsive, minimum 800x400px):
- Top section: Text input area
- Middle section: Token visualization
- Bottom section: Statistics and metrics
Visual elements:
- Input text area with character counter
- Token display showing each token as a colored chip
- Token IDs displayed below each chip
- Progress bar showing context window usage
Interactive controls:
- Text input field (multi-line)
- Dropdown: Tokenizer selection (GPT-4, Claude, Llama)
- Button: "Tokenize"
- Toggle: Show/hide token IDs
- Slider: Context window size (4K, 8K, 32K, 128K, 200K)
Display metrics:
- Character count
- Token count
- Tokens per character ratio
- Context window percentage used
- Estimated cost (based on typical pricing)
Behavior:
- As user types, real-time token count updates
- Tokens colored by type (word, subword, punctuation, special)
- Hover over token shows: token text, token ID, frequency in training
- Warning when approaching context limit
Sample texts:
- "Hello, world!"
- Technical paragraph with jargon
- Code snippet
- Non-English text (to show multilingual tokenization differences)
Implementation: p5.js or HTML/JavaScript
Business Implications of Tokenization¶
Understanding tokens has direct business relevance:
- Cost calculation: API pricing is typically per-token (input + output). Efficient prompts reduce costs.
- Context limits: Models have maximum token limits. Long documents may require chunking or summarization.
- Language efficiency: Tokenizers trained primarily on English may require more tokens for other languages, increasing costs.
- Code considerations: Programming languages tokenize differently than natural language, often requiring more tokens.
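To make the cost point above concrete, here is a back-of-the-envelope estimate in Python. The per-token prices are illustrative placeholders, not any vendor's actual rates:

```python
# Hypothetical prices per 1,000 tokens (check your provider's pricing page).
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def estimate_cost(input_tokens, output_tokens):
    """Rough API cost for one request: input and output tokens are billed separately."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# A 2,000-token prompt with a 500-token response:
print(f"${estimate_cost(2_000, 500):.4f} per request")
print(f"${estimate_cost(2_000, 500) * 10_000:.2f} for 10,000 requests per day")
```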
The Transformer Architecture¶
Why Transformers Revolutionized NLP¶
The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," replaced the sequential processing of recurrent neural networks with parallel attention-based mechanisms. This innovation enabled:
- Parallel training: All positions in a sequence can be processed simultaneously
- Long-range dependencies: Direct connections between any two positions regardless of distance
- Scalability: Efficient training on massive datasets using modern GPU clusters
- Transfer learning: Pre-trained transformers adapt effectively to diverse downstream tasks
Prior architectures—RNNs and LSTMs—processed sequences one element at a time, creating information bottlenecks and gradient flow problems for long sequences. Transformers eliminated these limitations through the attention mechanism.
Architecture Overview¶
A transformer model consists of stacked layers, each containing two primary components:
- Multi-Head Self-Attention: Allows each position to attend to all other positions
- Feed-Forward Network: Processes each position independently through dense layers
The full architecture includes:
- Embedding layer: Converts tokens to dense vector representations
- Positional encoding: Injects sequence position information
- N transformer layers: Each with attention and feed-forward sublayers
- Layer normalization: Stabilizes training
- Output projection: Maps final representations to vocabulary probabilities
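The layer structure in the list above maps closely onto a few lines of PyTorch. The following is a simplified sketch (a pre-norm variant, no dropout) meant to show the shape of one layer, not a production implementation:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One transformer layer: self-attention + feed-forward, each with residual + norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                 # position-wise feed-forward network
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask=None):
        # Self-attention sublayer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        # Feed-forward sublayer with residual connection
        x = x + self.ffn(self.norm2(x))
        return x

# Toy usage: batch of 2 sequences, 10 tokens each, 512-dimensional embeddings
x = torch.randn(2, 10, 512)
print(TransformerLayer()(x).shape)  # torch.Size([2, 10, 512])
```

A full model stacks N such layers between the embedding layer and the output projection shown in the diagram below.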
Diagram: Transformer Architecture¶
The following diagram illustrates the complete transformer architecture, showing how information flows from input tokens through multiple layers to produce output predictions.
flowchart TB
subgraph Output["Output Layer"]
direction TB
SOFT["Softmax<br/>Probability Distribution"]
LINEAR["Linear Projection<br/>to Vocabulary"]
end
subgraph TL["Transformer Layers (×N)"]
direction TB
subgraph Layer["Single Transformer Layer"]
direction TB
NORM2["Layer Normalization"]
FFN["Feed-Forward Network<br/>Linear → ReLU → Linear"]
ADD2["Add (Residual)"]
NORM1["Layer Normalization"]
ATTN["Multi-Head Self-Attention<br/>h parallel attention heads"]
ADD1["Add (Residual)"]
end
end
subgraph Input["Input Processing"]
direction TB
POS["Positional Encoding<br/>Position information"]
EMB["Token Embeddings<br/>Vocabulary → Vectors"]
TOK["Input Tokens"]
end
TOK --> EMB
EMB --> POS
POS --> ADD1
ADD1 --> ATTN
ATTN --> NORM1
NORM1 --> ADD2
ADD2 --> FFN
FFN --> NORM2
NORM2 --> LINEAR
LINEAR --> SOFT
%% Residual connections
POS -.->|"Residual"| ADD1
NORM1 -.->|"Residual"| ADD2
style Output fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px
style TL fill:#C8E6C9,stroke:#388E3C,stroke-width:2px
style Input fill:#BBDEFB,stroke:#1976D2,stroke-width:2px
style ATTN fill:#A5D6A7,stroke:#388E3C
style FFN fill:#FFCC80,stroke:#F57C00
| Component | Function | Key Innovation |
|---|---|---|
| Token Embeddings | Convert tokens to dense vectors | Learned representations capture meaning |
| Positional Encoding | Add position information | Enables parallel processing of sequences |
| Multi-Head Attention | Relate all positions to each other | Captures long-range dependencies |
| Feed-Forward Network | Transform representations | Adds non-linear processing capacity |
| Residual Connections | Bypass around sublayers | Enables training of deep networks |
| Layer Normalization | Stabilize activations | Improves training dynamics |
Decoder-Only Architecture
Modern LLMs like GPT use a decoder-only variant where each position can only attend to earlier positions (causal masking). This enables autoregressive generation: predicting one token at a time based on all previous tokens.
The Attention Mechanism¶
Attention is the core innovation enabling transformers to model relationships between all positions in a sequence. The mechanism computes a weighted combination of values, where weights reflect the relevance of each position to every other position.
The attention computation follows these steps:
- Project inputs: Transform each position into Query (Q), Key (K), and Value (V) vectors
- Compute attention scores: Calculate dot product of queries with all keys
- Scale: Divide by square root of dimension to stabilize gradients
- Apply softmax: Convert scores to probability distribution
- Weight values: Multiply values by attention weights and sum
The mathematical formulation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where \(d_k\) is the dimension of the key vectors.
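The same computation fits in a few lines of NumPy. This is a sketch for a single attention head with toy dimensions:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of every position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

# Toy example: 4 tokens, d_k = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```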
Self-Attention: Tokens Attending to Tokens¶
Self-attention refers to attention where queries, keys, and values all derive from the same sequence. This enables each token to "look at" all other tokens in the input and gather relevant information.
Consider the sentence: "The animal didn't cross the street because it was too tired."
When processing "it," self-attention allows the model to:
- Attend strongly to "animal" (to resolve the pronoun reference)
- Attend to "tired" (which semantically connects to animals, not streets)
- Attend to contextual words that disambiguate meaning
This mechanism enables sophisticated contextual understanding without explicit programming of linguistic rules.
Interactive Self-Attention Visualization¶
Explore the self-attention mechanism with this interactive visualization. Select different sentences and click on tokens to see how they attend to other tokens in the sequence. Notice how pronouns strongly attend to their referents and how verbs connect to their subjects.
Explore the Self-Attention MicroSim →
Diagram: Self-Attention Visualization¶
Self-Attention Visualization
Type: microsim
Purpose: Interactive visualization of how tokens attend to other tokens in self-attention
Bloom Taxonomy: Analyze (L4) - Examine attention patterns and their linguistic significance
Learning Objective: Students should be able to interpret attention patterns and understand how context influences token relationships
Canvas layout (responsive, minimum 800x500px):
- Top: Input sentence display with clickable tokens
- Middle: Attention matrix visualization (heatmap)
- Bottom: Selected attention pattern explanation
Visual elements:
- Input tokens as clickable chips arranged horizontally
- Attention matrix as grid with color-coded cells (darker = higher attention)
- Highlighted attention lines connecting selected token to attended tokens
- Attention weight values displayed on hover
Interactive controls:
- Text input field for custom sentences
- Dropdown: Select attention head (1-12)
- Dropdown: Select layer (1-24)
- Toggle: Show attention from → to / to → from
- Button: "Analyze"
Behavior:
- Click any token to highlight its attention pattern
- Attention weights shown as line thickness connecting tokens
- Matrix cells show numerical values on hover
- Side panel explains what the selected head appears to focus on
Sample sentences:
- "The animal didn't cross the street because it was too tired."
- "The bank was closed so I couldn't deposit my check at the river bank."
- "She gave him her book and he gave her his."
Annotations:
- Highlight pronoun resolution patterns
- Show positional attention (adjacent tokens)
- Indicate syntactic attention (subject-verb agreement)
Implementation: p5.js with matrix visualization
Multi-Head Attention¶
Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows the model to attend to information from different representation subspaces at different positions.
The multi-head mechanism:
- Projects Q, K, V into h different subspaces (h = number of heads)
- Computes attention in each subspace independently
- Concatenates the results
- Projects back to original dimension
Formally:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^{O}$$

where each head computes:

$$\text{head}_i = \text{Attention}(QW_i^{Q},\; KW_i^{K},\; VW_i^{V})$$
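A sketch of this split-compute-concatenate flow in NumPy follows, with toy dimensions and randomly initialized projection matrices standing in for learned weights:

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention (same as the earlier sketch)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(X, n_heads, W_q, W_k, W_v, W_o):
    """Project X into h subspaces, attend in each independently, concatenate, project back."""
    d_head = X.shape[-1] // n_heads
    heads = []
    for i in range(n_heads):
        sl = slice(i * d_head, (i + 1) * d_head)       # this head's slice of the projections
        heads.append(attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_o        # concat all heads, project back

# Toy usage: 4 tokens, d_model = 16, 4 heads
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, 4, W_q, W_k, W_v, W_o).shape)  # (4, 16)
```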
Different heads learn to focus on different types of relationships:
| Head Type | Typical Focus | Example |
|---|---|---|
| Syntactic heads | Subject-verb agreement | "The dogs [run]" |
| Positional heads | Adjacent tokens | Local context |
| Semantic heads | Related concepts | "doctor" ↔ "patient" |
| Coreference heads | Pronoun resolution | "she" ↔ "Maria" |
Diagram: Multi-Head Attention Comparison¶
Multi-Head Attention Comparison Visualization
Type: microsim
Purpose: Interactive visualization showing how different attention heads within a transformer focus on different aspects of the input simultaneously
Bloom Taxonomy: Analyze (L4) - Compare and contrast what different attention heads learn to focus on
Learning Objective: Students should be able to explain why multi-head attention is more powerful than single-head attention and identify what different head types capture
Canvas layout (responsive, minimum 900x600px):
- Top: Input sentence with selectable tokens
- Middle: Grid of 8 attention head visualizations (2x4 layout)
- Bottom: Legend and explanation panel
Visual elements:
- Input sentence displayed as horizontal token sequence
- Each attention head shown as a mini attention matrix or connection diagram
- Color-coded attention weights (heat map from light to dark)
- Head labels indicating learned focus type (e.g., "Positional", "Syntactic", "Semantic")
- Aggregate attention view combining all heads
Interactive controls:
- Text input field for custom sentences
- Number of heads selector (4, 8, 12)
- Layer selector slider (1-12)
- Toggle: Individual heads / Aggregated view
- Button: "Show Head Specialization"
Behavior:
- Click on any attention head to expand it to full view
- Hover over attention matrix cells to see exact weight values
- Side panel explains what linguistic phenomenon each head appears to capture
- Animation mode: step through heads showing different perspectives
Sample sentences:
- "The cat sat on the mat because it was tired."
- "The programmer fixed the bug that was causing crashes."
- "She told him that he should go to the bank."
Annotations:
- Identify head patterns: local attention, global attention, syntactic structure
- Show how heads specialize for different tasks
- Highlight complementary information captured by different heads
Implementation: p5.js with grid layout for multiple attention visualizations
Embeddings: Representing Meaning as Numbers¶
What Are Embeddings?¶
Embeddings are dense vector representations of tokens (or other discrete entities) in a continuous high-dimensional space. Rather than treating words as arbitrary symbols, embeddings encode semantic relationships through geometric proximity—similar concepts cluster together.
Key properties of embeddings:
- Dimensionality: Typically 768 to 4096 dimensions in modern LLMs
- Learned representations: Trained from data, not hand-crafted
- Semantic similarity: Cosine similarity measures conceptual relatedness
- Compositionality: Sentence meanings emerge from token embedding combinations
The famous demonstration of embedding arithmetic:

$$\vec{v}_{\text{king}} - \vec{v}_{\text{man}} + \vec{v}_{\text{woman}} \approx \vec{v}_{\text{queen}}$$
This suggests embeddings capture semantic relationships that support analogical reasoning.
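A minimal sketch of cosine similarity and this vector arithmetic, using tiny hand-written 3-dimensional vectors purely for illustration (real LLM embeddings are learned and have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = same direction (similar meaning), 0 = unrelated, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors; real embeddings are learned from data, not hand-written.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
}

analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max(embeddings, key=lambda w: cosine_similarity(analogy, embeddings[w]))
print(best)  # "queen" -- the analogy vector lands closest to "queen"
```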
Vector Similarity Visualization¶
The following interactive visualization demonstrates how word embeddings cluster semantically related words together. Click on different words to compare their cosine similarity scores and observe how the geometric distance reflects semantic relatedness.
Explore the Vector Similarity MicroSim →
Types of Embeddings in LLMs¶
| Embedding Type | Purpose | When Used |
|---|---|---|
| Token embeddings | Initial word representation | Input layer |
| Position embeddings | Encode sequence order | Added to token embeddings |
| Segment embeddings | Distinguish text segments | Some architectures (BERT) |
| Output embeddings | Final token representations | Before generation |
Business Applications of Embeddings
Embeddings enable powerful applications beyond text generation: semantic search (finding similar documents), clustering (grouping related items), classification (categorizing content), and recommendation systems (suggesting related content). Many organizations extract embeddings from LLMs for these downstream applications.
Diagram: Embedding Dimensions Explorer¶
Embedding Dimensions Explorer
Type: microsim
Purpose: Interactive visualization showing how embedding dimensions capture different semantic features and how dimensionality affects representation quality
Bloom Taxonomy: Understand (L2) - Explain how embedding dimensions encode meaning
Learning Objective: Students should be able to describe how embedding dimensions capture semantic features and understand the tradeoffs between different embedding sizes
Canvas layout (responsive, minimum 800x500px):
- Left panel: Word input and embedding vector display
- Center: 3D projection of embedding space (reducible to 2D)
- Right panel: Dimension feature analysis
Visual elements:
- 3D scatter plot of word embeddings (PCA/t-SNE reduced)
- Color-coded clusters by semantic category
- Selected word's raw embedding vector as bar chart
- "Feature activation" view showing which dimensions activate for concepts
- Dimension labels showing learned features (gender, size, animate, etc.)
Interactive controls:
- Text input: Enter word to visualize
- Slider: Embedding dimensions (50, 100, 300, 768, 1536)
- Toggle: 2D / 3D projection view
- Dropdown: Projection method (PCA, t-SNE, UMAP)
- Button: "Show Similar Words"
- Button: "Analyze Dimension Features"
Display metrics:
- Embedding dimension count
- Sparsity percentage
- Nearest neighbors in embedding space
- Dimension-by-dimension breakdown
Behavior:
- Entering a word shows its position in projected space
- Hover over dimensions to see what feature they might represent
- Click two words to see vector difference (analogies)
- Animation: word "morphing" through embedding space
Educational demonstrations:
- Show how "king - man + woman ≈ queen" works geometrically
- Display how synonyms cluster together
- Illustrate how adding dimensions improves discrimination
Sample word sets:
- Animals: cat, dog, lion, fish, bird
- Professions: doctor, nurse, teacher, engineer
- Actions: run, walk, sprint, jog
- Concepts: love, hate, fear, joy
Implementation: p5.js with 3D rendering or three.js
Training Large Language Models¶
Pre-Training: Learning from the Internet¶
Pre-training is the initial, computationally intensive phase where the model learns from massive text corpora. The objective is typically next-token prediction (for autoregressive models like GPT) or masked language modeling (for bidirectional models like BERT).
Pre-training characteristics:
- Data scale: Trillions of tokens from web text, books, code, academic papers
- Compute requirements: Thousands of GPUs for weeks or months
- Cost: Millions of dollars for frontier models
- Learning: Grammar, facts, reasoning patterns, world knowledge
The pre-training corpus significantly influences model capabilities. Models trained heavily on code excel at programming tasks; those with extensive scientific literature perform better on technical queries.
| Training Data Source | What the Model Learns |
|---|---|
| Web pages | General knowledge, diverse topics |
| Books | Long-form reasoning, narrative structure |
| Wikipedia | Factual information, structured knowledge |
| Academic papers | Technical concepts, citation patterns |
| Code repositories | Programming syntax, algorithms |
| Social media | Conversational patterns, colloquialisms |
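The pre-training objective itself is compact: maximize the probability of each actual next token, which is equivalent to minimizing a cross-entropy loss over shifted targets. A simplified PyTorch sketch, with a random logits tensor standing in for a real model's output:

```python
import torch
import torch.nn.functional as F

# Toy batch: 1 sequence of 6 token ids drawn from a 100-word vocabulary.
vocab_size, tokens = 100, torch.tensor([[5, 17, 42, 8, 99, 3]])

# Pretend these are model outputs: one score per vocabulary entry at each position.
logits = torch.randn(1, tokens.shape[1], vocab_size)

# Next-token prediction: position t must predict the token at position t+1,
# so predictions at positions :-1 are scored against targets at positions 1:.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)
print(loss)  # the quantity pre-training drives down over trillions of tokens
```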
Fine-Tuning: Specializing for Tasks¶
Fine-tuning adapts a pre-trained model to specific tasks or domains by training on smaller, targeted datasets. This transfer learning approach leverages the general capabilities acquired during pre-training while specializing for particular applications.
Fine-tuning approaches include:
- Full fine-tuning: Update all model parameters (resource-intensive)
- Parameter-efficient fine-tuning (PEFT): Update only a subset of parameters
- LoRA (Low-Rank Adaptation): Add small trainable matrices to existing weights
- Prompt tuning: Learn only soft prompt embeddings, freeze model weights
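To illustrate the LoRA idea from the list above: rather than updating a large weight matrix \(W\), train a small low-rank correction \(BA\) and add it to the frozen weights. The following is a minimal PyTorch sketch of the concept, not the `peft` library's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = base(x) + x(BA)^T * scale."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")  # roughly 0.4% of the layer
```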
Fine-tuning enables:
- Domain adaptation (legal, medical, financial language)
- Task specialization (summarization, translation, Q&A)
- Style alignment (formal vs. casual, brand voice)
- Safety improvements (reducing harmful outputs)
RLHF: Aligning with Human Preferences¶
Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns model outputs with human values and preferences. This technique proved crucial for making ChatGPT helpful, harmless, and honest—transforming a raw language model into an effective AI assistant.
The RLHF process:
- Supervised fine-tuning: Train on high-quality human demonstrations of desired behavior
- Reward model training: Train a separate model to predict human preference rankings
- Policy optimization: Use reinforcement learning (typically PPO) to maximize the reward model's score while maintaining output diversity
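The reward-model stage (step 2) boils down to a pairwise preference loss: the reward of the human-preferred response should exceed the reward of the rejected one. A minimal sketch of that loss in PyTorch, with scalar rewards standing in for a real reward model's outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry-style loss: push the chosen response's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy rewards for 3 comparison pairs (in practice these come from the reward model).
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, -1.0])
print(preference_loss(chosen, rejected))  # lower when chosen rewards exceed rejected rewards
```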
Diagram: RLHF Training Pipeline¶
flowchart LR
subgraph Stage1["Stage 1: Supervised Fine-Tuning"]
A[Pre-trained LLM] --> B[SFT Training]
C[Human Demonstrations] --> B
B --> D[SFT Model]
end
subgraph Stage2["Stage 2: Reward Model Training"]
D --> E[Generate Outputs]
E --> F[Human Rankings]
F --> G[Train Reward Model]
G --> H[Reward Model]
end
subgraph Stage3["Stage 3: Policy Optimization"]
D --> I[PPO Training]
H --> I
I --> J[RLHF Model]
end
J -.->|Iterative Improvement| C
style Stage1 fill:#e3f2fd
style Stage2 fill:#fff3e0
style Stage3 fill:#e8f5e9
RLHF Stage Summary:
| Stage | Human Involvement | Input | Output |
|---|---|---|---|
| 1. SFT | High (write ideal responses) | Pre-trained LLM + demos | SFT Model |
| 2. Reward | Medium (rank outputs) | SFT outputs + rankings | Reward Model |
| 3. PPO | None (automated) | SFT Model + Reward Model | Aligned Model |
Key Insight
Human annotation is the bottleneck. Stage 3 (PPO) runs automatically once the reward model is trained, allowing continuous improvement without additional human labeling.
RLHF addresses limitations of pure pre-training:
- Helpfulness: Models learn to provide useful, actionable responses
- Honesty: Models learn to acknowledge uncertainty rather than confabulate
- Harmlessness: Models learn to refuse harmful requests
- Format compliance: Models learn to follow instructions about output format
Model Parameters and Their Effects¶
What Are Parameters?¶
Model parameters are the learnable numerical values (weights and biases) that define a neural network's behavior. During training, these values are adjusted to minimize prediction error. During inference, they remain fixed and determine how inputs are transformed to outputs.
Parameter count correlates roughly with model capability, but the relationship is not linear:
- More parameters → more capacity to store knowledge and patterns
- More parameters → better generalization to novel inputs
- More parameters → higher computational cost for training and inference
- More parameters → greater risk of memorizing training data rather than generalizing, when training data is limited
The scaling laws observed in LLM research suggest that performance improves predictably with increases in parameters, training data, and compute, though with diminishing returns at the frontier.
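As a rough illustration of the Chinchilla result referenced in the MicroSim below, a widely cited rule of thumb is about 20 training tokens per parameter for compute-optimal training, with training compute approximated as \(6ND\) FLOPs for \(N\) parameters and \(D\) tokens. A back-of-the-envelope sketch using those approximate constants:

```python
def chinchilla_estimate(n_params):
    """Rough compute-optimal training budget: ~20 tokens per parameter, FLOPs ≈ 6 * N * D."""
    optimal_tokens = 20 * n_params
    train_flops = 6 * n_params * optimal_tokens
    return optimal_tokens, train_flops

for n in (7e9, 70e9, 405e9):                     # 7B, 70B, 405B parameter models
    tokens, flops = chinchilla_estimate(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e12:.1f}T tokens, ~{flops:.2e} training FLOPs")
```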
Diagram: LLM Scaling Laws Visualization¶
LLM Scaling Laws Visualization
Type: microsim
Purpose: Interactive demonstration of how model performance scales with parameters, training data, and compute according to Chinchilla and other scaling law research
Bloom Taxonomy: Apply (L3) - Use scaling laws to predict model performance tradeoffs
Learning Objective: Students should be able to interpret scaling law curves and understand the relationship between model size, training data, and expected performance
Canvas layout (responsive, minimum 800x500px):
- Main area: Log-log plot showing scaling curves
- Right panel: Model configuration inputs
- Bottom: Results summary and cost estimates
Visual elements:
- X-axis: Training compute (FLOPs) or model parameters (log scale)
- Y-axis: Loss or benchmark score (log scale)
- Multiple curves showing: Parameters only, Data only, Compute-optimal allocation
- Reference points for known models (GPT-3, GPT-4, Claude, Llama)
- Chinchilla optimal frontier line
- Isocompute curves showing different parameter/data tradeoffs
Interactive controls:
- Slider: Model parameters (1B - 1T, log scale)
- Slider: Training tokens (10B - 10T, log scale)
- Dropdown: Benchmark metric (Loss, MMLU, HumanEval)
- Toggle: Show/hide Chinchilla optimal line
- Toggle: Show/hide real model markers
- Button: "Calculate Optimal Allocation"
Display metrics:
- Predicted loss/performance
- Training FLOPs required
- Estimated training cost ($)
- Compute-optimal ratio (tokens per parameter)
- Distance from Chinchilla optimal
Behavior:
- Dragging model parameter slider updates predicted performance
- Show "undertrained" or "overtrained" region shading
- Highlight when configuration deviates significantly from optimal
- Click model markers to see actual vs predicted performance
- Animation: trace training progress over time
Scaling law equations displayed:
- Loss ≈ C / N^α (parameter scaling)
- Loss ≈ D / T^β (data scaling)
- Chinchilla: N_opt ∝ C^0.5, T_opt ∝ C^0.5
Implementation: Chart.js or p5.js with logarithmic axes
Inference: Running the Model¶
Inference is the process of using a trained model to generate outputs for new inputs. Unlike training (which updates parameters), inference uses fixed parameters to transform inputs through the network.
Inference considerations include:
- Latency: Time from input to first output token
- Throughput: Tokens generated per second
- Memory footprint: GPU memory required to load the model
- Cost: Computational expense per token
The autoregressive nature of LLM inference means each output token requires a full forward pass through the network. This creates a fundamental tension between response length and speed.
| Factor | Effect on Latency | Effect on Throughput |
|---|---|---|
| More parameters | ↑ Higher | ↓ Lower |
| Longer context | ↑ Higher | ↓ Lower |
| Longer output | ↑ Higher (cumulative) | Unchanged per token |
| Batch size | Minimal per request | ↑ Higher aggregate |
| Quantization | ↓ Lower | ↑ Higher |
Latency and Throughput Optimization¶
Latency—the time to generate a response—matters for interactive applications. Users perceive delays beyond 100-200ms as sluggish. LLM latency has multiple components:
- Time to first token (TTFT): Processing the input and generating the first output token
- Inter-token latency: Time between subsequent tokens
- Total response time: Cumulative time for complete response
Throughput—tokens generated per unit time—matters for batch processing and cost optimization. Techniques for improving throughput include:
- Batching: Processing multiple requests simultaneously
- KV-caching: Storing key-value computations to avoid recomputation
- Speculative decoding: Generating multiple tokens in parallel with verification
- Model parallelism: Distributing the model across multiple GPUs
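The latency components above combine simply: total response time is roughly the time to first token plus the inter-token latency for every remaining token. A quick sketch with illustrative numbers (not measurements of any particular model or GPU):

```python
def response_time_ms(output_tokens, ttft_ms=400, inter_token_ms=30):
    """Total latency ≈ time to first token + per-token latency for the remaining tokens."""
    return ttft_ms + (output_tokens - 1) * inter_token_ms

def throughput_tokens_per_sec(output_tokens, batch_size=1, **kwargs):
    """Aggregate tokens/second when requests are batched.

    Simplified: in practice, larger batches also increase per-token latency somewhat.
    """
    return batch_size * output_tokens / (response_time_ms(output_tokens, **kwargs) / 1000)

print(f"{response_time_ms(200):.0f} ms for a 200-token reply")           # ~6.4 seconds
print(f"{throughput_tokens_per_sec(200, batch_size=16):.0f} tokens/sec with a batch of 16")
```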
Diagram: Inference Performance Tradeoffs¶
Inference Latency and Throughput Visualization
Type: microsim
Purpose: Interactive exploration of how model size, batch size, context length, and optimization techniques affect inference latency and throughput
Bloom Taxonomy: Apply (L3) - Calculate and compare inference metrics for different configurations
Learning Objective: Students should be able to estimate latency and throughput for different LLM deployment configurations and understand key tradeoffs
Canvas layout (responsive, minimum 800x500px):
- Top: Configuration panel with sliders
- Middle: Dual-axis chart (latency + throughput)
- Bottom: Cost and capacity estimates
Visual elements:
- Bar/line chart showing Time to First Token (TTFT) and Inter-Token Latency
- Throughput gauge (tokens/second)
- GPU memory utilization bar
- Cost-per-token indicator
- Visual timeline of a sample request lifecycle
Interactive controls:
- Dropdown: Model size (7B, 13B, 70B, 405B)
- Slider: Batch size (1-64)
- Slider: Input context length (256 - 128K tokens)
- Slider: Output length (100 - 4K tokens)
- Dropdown: Quantization (FP16, INT8, INT4)
- Toggle: KV-Cache enabled
- Toggle: Speculative decoding
Display metrics:
- Time to First Token (TTFT) in milliseconds
- Inter-Token Latency (ITL) in milliseconds
- Total response time
- Throughput (tokens/second/GPU)
- GPU memory usage (GB)
- Cost per 1K tokens ($)
Behavior:
- Real-time updates as sliders change
- Show "bottleneck" indicator (memory-bound vs compute-bound)
- Warning when configuration exceeds typical GPU memory
- Comparison mode: side-by-side configs
- Animation: visualize token generation timeline
Reference configurations:
- "Interactive chat" (low latency, small batch)
- "Batch processing" (high throughput, large batch)
- "Long document" (large context, optimized caching)
Educational annotations:
- Explain why larger batch sizes improve throughput but increase latency
- Show KV-cache memory scaling with context length
- Demonstrate quantization speedup vs quality tradeoff
Implementation: Chart.js with dynamic updates, or p5.js with custom visualization
The Context Window¶
What Is a Context Window?¶
The context window is the maximum number of tokens a model can process in a single forward pass—including both input and output tokens. This limit is architectural, determined by positional encoding schemes and attention computation memory requirements.
Context window sizes have grown dramatically:
| Model Generation | Typical Context Window |
|---|---|
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8,192 / 32,768 tokens |
| Claude 3 (2024) | 200,000 tokens |
| Gemini 1.5 Pro (2024) | 1,000,000 tokens |
A context window of 200,000 tokens accommodates approximately:
- 150,000 words of English text
- A 500-page book
- Multiple lengthy documents for comparison
- Extended multi-turn conversations
Business Implications of Context Windows¶
Context window size directly affects application architecture:
| Use Case | Context Requirement | Architectural Approach |
|---|---|---|
| Simple Q&A | ~1,000 tokens | Direct prompting |
| Document summarization | 10,000-50,000 tokens | Single-pass with large context |
| Book analysis | 100,000+ tokens | Large context or chunking + synthesis |
| Knowledge base queries | Variable | RAG (Retrieval-Augmented Generation) |
| Extended conversations | Cumulative | Context management, summarization |
Context Window Costs
Larger context windows increase computational cost. Processing 100,000 tokens costs substantially more than processing 1,000 tokens. Design applications to use appropriate context sizes, not maximum available.
Diagram: Context Window Management¶
The following diagram compares four strategies for managing context window limitations, each suited to different scenarios and requirements.
flowchart LR
subgraph S1["📝 Strategy 1: Direct Prompting"]
direction TB
D1A["Short Query"] --> D1B["LLM"] --> D1C["Response"]
end
subgraph S2["✂️ Strategy 2: Chunking + Synthesis"]
direction TB
D2A["Long Doc"]
D2B["Chunk 1"]
D2C["Chunk 2"]
D2D["Chunk 3"]
D2E["Synthesize"]
D2A --> D2B & D2C & D2D
D2B & D2C & D2D --> D2E
end
subgraph S3["🔍 Strategy 3: RAG"]
direction TB
D3A["Query"] --> D3B["Retrieve"]
D3B --> D3C["Top K Chunks"]
D3C --> D3D["Generate"]
end
subgraph S4["📚 Strategy 4: Full Context"]
direction TB
D4A["Entire Document<br/>in Context"] --> D4B["LLM"] --> D4C["Response"]
end
style S1 fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
style S2 fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
style S3 fill:#FFF3E0,stroke:#F57C00,stroke-width:2px
style S4 fill:#F3E5F5,stroke:#7B1FA2,stroke-width:2px
| Strategy | Context Usage | Best For | Pros | Cons |
|---|---|---|---|---|
| Direct Prompting | Low (100s of tokens) | Simple queries, short conversations | Fast, cheap, simple | Limited context, no external knowledge |
| Chunking + Synthesis | Medium per chunk | Long documents exceeding context | Handles any length | May lose cross-chunk relationships |
| RAG | Moderate | Large knowledge bases, specific queries | Scalable, current info, efficient | Retrieval quality critical |
| Full Context | High | Complete document understanding | No information loss, holistic | Expensive, slower, still has limits |
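Of these, the chunking strategy is the easiest to sketch in code. A naive version that splits on fixed token counts follows; real systems usually split on semantic boundaries (paragraphs, sections) and tune the overlap:

```python
def chunk_by_tokens(token_ids, max_tokens=8000, overlap=200):
    """Split a long token sequence into overlapping chunks that each fit the context window."""
    chunks, start = [], 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_tokens])
        start += max_tokens - overlap            # overlap preserves some cross-chunk context
    return chunks

# A 100,000-token document split for an 8K-token context window:
doc = list(range(100_000))                       # stand-in for real token ids
chunks = chunk_by_tokens(doc)
print(len(chunks), "chunks; each can be summarized, then the summaries synthesized")
```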
Decision Tree for Strategy Selection:
flowchart TD
Q1{"Does content fit<br/>in context window?"}
Q2{"Need complete<br/>document understanding?"}
Q3{"Large knowledge base<br/>with specific queries?"}
Q4{"Document too long<br/>for context?"}
Q1 -->|Yes| Q2
Q1 -->|No| Q4
Q2 -->|Yes| A1["Full Context"]
Q2 -->|No| Q3
Q3 -->|Yes| A2["RAG"]
Q3 -->|No| A3["Direct Prompting"]
Q4 -->|Yes| A4["Chunking + Synthesis"]
Q4 -->|No| A2
style A1 fill:#F3E5F5,stroke:#7B1FA2
style A2 fill:#FFF3E0,stroke:#F57C00
style A3 fill:#E3F2FD,stroke:#1565C0
style A4 fill:#E8F5E9,stroke:#388E3C
Cost Optimization
Start with the simplest strategy that meets your needs. Direct prompting costs pennies; full context on a 100K document can cost dollars per request. Match strategy complexity to actual requirements.
Putting It All Together¶
The architectural components covered in this chapter work together to enable LLM capabilities:
- Tokenization converts text to numerical representations the model can process
- Embeddings map tokens to dense vectors capturing semantic meaning
- Self-attention enables each position to gather information from all other positions
- Multi-head attention allows simultaneous focus on different relationship types
- Transformer layers stack to build progressively abstract representations
- Pre-training instills broad language knowledge and world understanding
- Fine-tuning specializes the model for particular tasks or domains
- RLHF aligns outputs with human preferences and values
- Inference applies the trained parameters to generate responses
- Context windows bound the information available for each generation
Understanding these components enables informed evaluation of LLM capabilities, realistic expectation setting, and effective application design.
Key Takeaways¶
- LLMs predict the next token based on all preceding context, with sophisticated language capabilities emerging from this simple objective at scale
- Tokens are subword units determined by the tokenizer; understanding tokenization is essential for cost estimation and context management
- The transformer architecture replaced sequential processing with parallel attention, enabling efficient training on massive datasets
- Self-attention allows each token to attend to all other tokens, capturing long-range dependencies and contextual relationships
- Multi-head attention enables simultaneous focus on different relationship types (syntactic, semantic, positional)
- Embeddings represent tokens as dense vectors where semantic similarity corresponds to geometric proximity
- Pre-training teaches broad language knowledge; fine-tuning specializes for tasks; RLHF aligns with human preferences
- Context windows limit the tokens available for processing; larger windows enable more comprehensive understanding but increase cost
- Inference characteristics (latency, throughput) depend on model size, context length, and optimization techniques
Review Questions¶
Explain why the attention mechanism was a breakthrough for NLP compared to recurrent architectures.
Recurrent architectures (RNNs, LSTMs) process sequences one element at a time, creating information bottlenecks as context must pass through each step sequentially. This causes: (1) Gradient vanishing/exploding over long sequences, (2) Difficulty capturing long-range dependencies, (3) Sequential processing preventing parallelization. Attention mechanisms allow direct connections between any two positions regardless of distance, enabling: (1) Parallel processing of all positions simultaneously, (2) Direct modeling of long-range relationships, (3) Efficient training on modern GPU clusters. These advantages enabled training on much larger datasets and longer sequences.
How does RLHF differ from standard supervised fine-tuning, and why is it necessary?
Supervised fine-tuning trains on human-written examples, teaching the model to imitate demonstrations. RLHF goes further by: (1) Training a reward model to predict human preferences between outputs, (2) Using reinforcement learning to optimize for reward model scores. This is necessary because: (1) Writing perfect demonstrations is expensive and limited, (2) Humans are better at comparing outputs than generating ideal ones, (3) RLHF can optimize for implicit preferences difficult to demonstrate (helpfulness, safety, format). The result is models that better align with what users actually want.
A 100,000-token document needs analysis. Compare using a large-context model versus RAG approach.
Large-context approach: Load entire document into context window; model has complete information for holistic analysis; best for tasks requiring understanding relationships across the full document; higher cost per query; works well when full context is consistently needed.
RAG approach: Index document, retrieve relevant chunks per query; efficient for specific questions; lower per-query cost; scales to unlimited document sizes; may miss cross-section relationships; requires quality retrieval system; better when queries target specific information rather than full-document synthesis.
Choose large-context for comprehensive analysis (summarization, theme extraction); choose RAG for question-answering over large corpora or when cost matters for many queries.