Large Language Model Architecture

Summary

This chapter explores the technical foundations of large language models, explaining how these powerful AI systems work under the hood. Students will learn about transformer architecture, attention mechanisms, and the training processes that enable LLMs to generate human-like text. Understanding these concepts is essential for effectively working with and evaluating AI platforms.

Concepts Covered

This chapter covers the following concepts from the learning graph:

  1. Large Language Models
  2. Transformer Architecture
  3. Attention Mechanism
  4. Self-Attention
  5. Multi-Head Attention
  6. Pre-Training
  7. Fine-Tuning
  8. RLHF
  9. Token
  10. Tokenization
  11. Context Window
  12. Model Parameters
  13. Inference
  14. Latency
  15. Throughput
  16. Embeddings

Prerequisites

This chapter builds on concepts from:

Learning Objectives

After completing this chapter, students will be able to:

  • Explain how large language models generate text through next-token prediction
  • Describe the transformer architecture and role of attention mechanisms
  • Understand the training process including pre-training, fine-tuning, and RLHF
  • Explain tokens, context windows, and their business implications
  • Interpret model parameters and their effects on performance

Introduction

The remarkable capabilities of modern AI assistants—their ability to write poetry, explain complex concepts, generate code, and engage in nuanced conversation—all derive from a common architectural foundation: the large language model (LLM). These systems represent the culmination of decades of research in natural language processing, neural network design, and distributed computing. Yet despite their sophisticated capabilities, LLMs operate according to a deceptively simple objective: predicting the next word in a sequence.

This chapter demystifies the technical machinery underlying LLMs. While business professionals need not understand every mathematical detail, a working knowledge of how these systems function—their architecture, training processes, and operational characteristics—is essential for making informed decisions about AI adoption, evaluating platform capabilities, and anticipating both the possibilities and limitations of generative AI.

Understanding Large Language Models

What Are Large Language Models?

Large Language Models (LLMs) are neural networks trained on massive text corpora to understand and generate human language. The term "large" refers to the number of parameters—the learnable weights that the model adjusts during training.

Neural Network Foundation

Before diving into LLM specifics, the following visualization shows how data flows through a basic neural network. Understanding this forward propagation process is essential for grasping how LLMs process tokens through their many layers.

Explore the Neural Network MicroSim →

Modern LLMs range from billions to trillions of parameters:

| Model | Organization | Parameters | Release Year |
|---|---|---|---|
| GPT-3 | OpenAI | 175 billion | 2020 |
| GPT-4 | OpenAI | ~1.8 trillion (estimated) | 2023 |
| Claude 3.5 Sonnet | Anthropic | Undisclosed | 2024 |
| Gemini Ultra | Google | Undisclosed | 2024 |
| Llama 3.1 | Meta | 405 billion | 2024 |
| Mixtral 8x22B | Mistral | 176 billion (sparse) | 2024 |

At their core, LLMs perform a single task: given a sequence of text, predict the most likely next token. This next-token prediction objective, when applied at sufficient scale with appropriate training data, yields systems capable of remarkably sophisticated linguistic behavior.

The fundamental insight is that predicting the next word well requires understanding context, grammar, facts about the world, reasoning patterns, and stylistic conventions. A model that excels at prediction must implicitly learn vast amounts of knowledge about language and the world.

The Next-Token Prediction Paradigm

Consider how an LLM generates a response to "The capital of France is":

  1. The model receives the input tokens
  2. It processes them through multiple layers of neural network computations
  3. It outputs a probability distribution over its entire vocabulary
  4. The token "Paris" receives high probability
  5. "Paris" is selected and appended to the sequence
  6. The process repeats with "The capital of France is Paris" as input

This autoregressive process continues until the model generates a stop token or reaches a maximum length. Each generation step considers all previous context, enabling coherent multi-paragraph outputs.
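
The following Python sketch mirrors this loop with a toy, hand-written "model" that assigns made-up probabilities to a tiny vocabulary. A real LLM computes the distribution with billions of learned parameters, but the control flow is the same.

```python
# Minimal sketch of autoregressive decoding with greedy selection.
# `toy_model` is a hypothetical stand-in for a real LLM's forward pass.

VOCAB = ["The", "capital", "of", "France", "is", "Paris", ".", "<stop>"]

def toy_model(tokens: list[str]) -> dict[str, float]:
    """Return a made-up probability for every vocabulary token."""
    if tokens[-1] == "is":
        return {t: (0.9 if t == "Paris" else 0.1 / (len(VOCAB) - 1)) for t in VOCAB}
    if tokens[-1] == "Paris":
        return {t: (0.8 if t == "." else 0.2 / (len(VOCAB) - 1)) for t in VOCAB}
    return {t: (0.7 if t == "<stop>" else 0.3 / (len(VOCAB) - 1)) for t in VOCAB}

def generate(prompt: list[str], max_new_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)               # steps 2-3: forward pass -> distribution
        next_token = max(probs, key=probs.get)  # steps 4-5: pick the most likely token
        if next_token == "<stop>":
            break
        tokens.append(next_token)               # step 6: feed it back as input
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```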

Autoregressive Generation Visualization

The following interactive simulation demonstrates how tokens flow through neural network layers during autoregressive generation. Watch as information is compressed through hidden layers to predict the next token, which then becomes part of the input for the next prediction cycle.

Explore the Autoregressive MicroSim →

Temperature and Sampling

The selection of the next token need not be deterministic. The temperature parameter controls randomness: temperature=0 always selects the highest-probability token, while higher temperatures introduce diversity by making the selection more random. This is why the same prompt can yield different responses.
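
A minimal sketch of temperature sampling, using illustrative logit values rather than real model outputs:

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Sample the next token from temperature-scaled softmax probabilities."""
    if temperature == 0:
        return max(logits, key=logits.get)              # deterministic: argmax
    scaled = {t: l / temperature for t, l in logits.items()}
    max_l = max(scaled.values())                        # subtract max for numerical stability
    exps = {t: math.exp(l - max_l) for t, l in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

logits = {"Paris": 5.0, "Lyon": 2.0, "London": 1.0}     # hypothetical next-token scores
print(sample_with_temperature(logits, 0))               # always "Paris"
print(sample_with_temperature(logits, 1.5))             # occasionally picks an alternative
```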

Tokens and Tokenization

Understanding Tokens

A token is the fundamental unit of text that LLMs process. Contrary to intuition, tokens are neither words nor characters—they are subword units determined by the model's tokenizer. Common words typically map to single tokens, while rare words are split into multiple tokens.

Examples of tokenization (GPT-style):

| Text | Tokens | Token Count |
|---|---|---|
| "Hello" | ["Hello"] | 1 |
| "artificial" | ["art", "ificial"] | 2 |
| "ChatGPT" | ["Chat", "G", "PT"] | 3 |
| "antidisestablishmentarianism" | ["ant", "id", "is", "establish", "ment", "arian", "ism"] | 7 |

Tokenization is the process of converting raw text into token sequences. Different models use different tokenization schemes:

  • Byte Pair Encoding (BPE): Used by GPT models. Iteratively merges frequent character pairs to build vocabulary.
  • WordPiece: Used by BERT. Similar to BPE but uses likelihood-based merging.
  • SentencePiece: Used by Llama and others. Language-agnostic tokenization that works directly on raw text.
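
For hands-on exploration, the open-source `tiktoken` package exposes the BPE vocabularies used by recent OpenAI models. A short sketch, assuming `tiktoken` is installed; exact splits depend on the vocabulary chosen, so counts may differ slightly from the table above:

```python
# Sketch using the open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # BPE vocabulary used by GPT-4-era models

for text in ["Hello", "artificial", "antidisestablishmentarianism"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")

# Common words typically map to a single token, while rare words
# are split into several subword pieces.
```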

Interactive Tokenizer

Try the interactive tokenizer below to see how different text inputs are broken into tokens. Notice how common words typically become single tokens while rare or compound words are split into multiple subword pieces.

Explore the Tokenizer MicroSim →

Diagram: Tokenization Process Visualization

Tokenization Process Visualization

Type: microsim

Purpose: Interactive demonstration of how text is converted to tokens and the implications for context window usage

Bloom Taxonomy: Understand (L2) - Explain how tokenization works and affects model behavior

Learning Objective: Students should be able to estimate token counts for different text inputs and understand tokenization implications

Canvas layout (responsive, minimum 800x400px):

  • Top section: Text input area
  • Middle section: Token visualization
  • Bottom section: Statistics and metrics

Visual elements:

  • Input text area with character counter
  • Token display showing each token as a colored chip
  • Token IDs displayed below each chip
  • Progress bar showing context window usage

Interactive controls:

  • Text input field (multi-line)
  • Dropdown: Tokenizer selection (GPT-4, Claude, Llama)
  • Button: "Tokenize"
  • Toggle: Show/hide token IDs
  • Slider: Context window size (4K, 8K, 32K, 128K, 200K)

Display metrics:

  • Character count
  • Token count
  • Tokens per character ratio
  • Context window percentage used
  • Estimated cost (based on typical pricing)

Behavior:

  • As user types, real-time token count updates
  • Tokens colored by type (word, subword, punctuation, special)
  • Hover over token shows: token text, token ID, frequency in training
  • Warning when approaching context limit

Sample texts:

  • "Hello, world!"
  • Technical paragraph with jargon
  • Code snippet
  • Non-English text (to show multilingual tokenization differences)

Implementation: p5.js or HTML/JavaScript

Business Implications of Tokenization

Understanding tokens has direct business relevance:

  • Cost calculation: API pricing is typically per-token (input + output). Efficient prompts reduce costs.
  • Context limits: Models have maximum token limits. Long documents may require chunking or summarization.
  • Language efficiency: Tokenizers trained primarily on English may require more tokens for other languages, increasing costs.
  • Code considerations: Programming languages tokenize differently than natural language, often requiring more tokens.
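
Because pricing is per token, a rough cost estimate is simple arithmetic. The per-million-token prices below are placeholders, not current rates for any particular provider:

```python
# Back-of-the-envelope API cost estimate with placeholder prices;
# substitute your provider's current per-token rates.
PRICE_PER_M_INPUT = 3.00    # hypothetical $ per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # hypothetical $ per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    cost_in = input_tokens / 1_000_000 * PRICE_PER_M_INPUT
    cost_out = output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    return cost_in + cost_out

# e.g. a 2,000-token prompt with a 500-token answer:
per_request = estimate_cost(2_000, 500)
print(f"${per_request:.4f} per request")                 # ~$0.0135
print(f"${per_request * 100_000:,.0f} for 100,000 requests")  # scale is what matters
```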

The Transformer Architecture

Why Transformers Revolutionized NLP

The Transformer architecture, introduced in the 2017 paper "Attention Is All You Need," replaced the sequential processing of recurrent neural networks with parallel attention-based mechanisms. This innovation enabled:

  • Parallel training: All positions in a sequence can be processed simultaneously
  • Long-range dependencies: Direct connections between any two positions regardless of distance
  • Scalability: Efficient training on massive datasets using modern GPU clusters
  • Transfer learning: Pre-trained transformers adapt effectively to diverse downstream tasks

Prior architectures—RNNs and LSTMs—processed sequences one element at a time, creating information bottlenecks and gradient flow problems for long sequences. Transformers eliminated these limitations through the attention mechanism.

Architecture Overview

A transformer model consists of stacked layers, each containing two primary components:

  1. Multi-Head Self-Attention: Allows each position to attend to all other positions
  2. Feed-Forward Network: Processes each position independently through dense layers

The full architecture includes:

  • Embedding layer: Converts tokens to dense vector representations
  • Positional encoding: Injects sequence position information
  • N transformer layers: Each with attention and feed-forward sublayers
  • Layer normalization: Stabilizes training
  • Output projection: Maps final representations to vocabulary probabilities

Diagram: Transformer Architecture

The following diagram illustrates the complete transformer architecture, showing how information flows from input tokens through multiple layers to produce output predictions.

flowchart TB
    subgraph Output["Output Layer"]
        direction TB
        SOFT["Softmax<br/>Probability Distribution"]
        LINEAR["Linear Projection<br/>to Vocabulary"]
    end

    subgraph TL["Transformer Layers (×N)"]
        direction TB
        subgraph Layer["Single Transformer Layer"]
            direction TB
            NORM2["Layer Normalization"]
            FFN["Feed-Forward Network<br/>Linear → ReLU → Linear"]
            ADD2["Add (Residual)"]
            NORM1["Layer Normalization"]
            ATTN["Multi-Head Self-Attention<br/>h parallel attention heads"]
            ADD1["Add (Residual)"]
        end
    end

    subgraph Input["Input Processing"]
        direction TB
        POS["Positional Encoding<br/>Position information"]
        EMB["Token Embeddings<br/>Vocabulary → Vectors"]
        TOK["Input Tokens"]
    end

    TOK --> EMB
    EMB --> POS
    POS --> ATTN
    ATTN --> ADD1
    ADD1 --> NORM1
    NORM1 --> FFN
    FFN --> ADD2
    ADD2 --> NORM2
    NORM2 --> LINEAR
    LINEAR --> SOFT

    %% Residual connections
    POS -.->|"Residual"| ADD1
    NORM1 -.->|"Residual"| ADD2

    style Output fill:#E1BEE7,stroke:#7B1FA2,stroke-width:2px
    style TL fill:#C8E6C9,stroke:#388E3C,stroke-width:2px
    style Input fill:#BBDEFB,stroke:#1976D2,stroke-width:2px
    style ATTN fill:#A5D6A7,stroke:#388E3C
    style FFN fill:#FFCC80,stroke:#F57C00

| Component | Function | Key Innovation |
|---|---|---|
| Token Embeddings | Convert tokens to dense vectors | Learned representations capture meaning |
| Positional Encoding | Add position information | Enables parallel processing of sequences |
| Multi-Head Attention | Relate all positions to each other | Captures long-range dependencies |
| Feed-Forward Network | Transform representations | Adds non-linear processing capacity |
| Residual Connections | Bypass around sublayers | Enables training of deep networks |
| Layer Normalization | Stabilize activations | Improves training dynamics |

Decoder-Only Architecture

Modern LLMs like GPT use a decoder-only variant where each position can only attend to earlier positions (causal masking). This enables autoregressive generation: predicting one token at a time based on all previous tokens.

The Attention Mechanism

Attention is the core innovation enabling transformers to model relationships between all positions in a sequence. The mechanism computes a weighted combination of values, where weights reflect the relevance of each position to every other position.

The attention computation follows these steps:

  1. Project inputs: Transform each position into Query (Q), Key (K), and Value (V) vectors
  2. Compute attention scores: Calculate dot product of queries with all keys
  3. Scale: Divide by square root of dimension to stabilize gradients
  4. Apply softmax: Convert scores to probability distribution
  5. Weight values: Multiply values by attention weights and sum

The mathematical formulation:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where \(d_k\) is the dimension of the key vectors.
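
A minimal NumPy sketch of this formula, using random vectors in place of learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K of shape (seq_len, d_k); V of shape (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 2-3: dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 4: softmax per query
    return weights @ V, weights                      # step 5: weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8                                  # 4 tokens, 8-dimensional projections
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)                      # (4, 8) (4, 4)
print(attn.sum(axis=-1))                             # every row sums to 1.0
```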

Self-Attention: Tokens Attending to Tokens

Self-attention refers to attention where queries, keys, and values all derive from the same sequence. This enables each token to "look at" all other tokens in the input and gather relevant information.

Consider the sentence: "The animal didn't cross the street because it was too tired."

When processing "it," self-attention allows the model to:

  • Attend strongly to "animal" (to resolve the pronoun reference)
  • Attend to "tired" (which semantically connects to animals, not streets)
  • Attend to contextual words that disambiguate meaning

This mechanism enables sophisticated contextual understanding without explicit programming of linguistic rules.

Interactive Self-Attention Visualization

Explore the self-attention mechanism with this interactive visualization. Select different sentences and click on tokens to see how they attend to other tokens in the sequence. Notice how pronouns strongly attend to their referents and how verbs connect to their subjects.

Explore the Self-Attention MicroSim →

Diagram: Self-Attention Visualization

Self-Attention Visualization

Type: microsim

Purpose: Interactive visualization of how tokens attend to other tokens in self-attention

Bloom Taxonomy: Analyze (L4) - Examine attention patterns and their linguistic significance

Learning Objective: Students should be able to interpret attention patterns and understand how context influences token relationships

Canvas layout (responsive, minimum 800x500px):

  • Top: Input sentence display with clickable tokens
  • Middle: Attention matrix visualization (heatmap)
  • Bottom: Selected attention pattern explanation

Visual elements:

  • Input tokens as clickable chips arranged horizontally
  • Attention matrix as grid with color-coded cells (darker = higher attention)
  • Highlighted attention lines connecting selected token to attended tokens
  • Attention weight values displayed on hover

Interactive controls:

  • Text input field for custom sentences
  • Dropdown: Select attention head (1-12)
  • Dropdown: Select layer (1-24)
  • Toggle: Show attention from → to / to → from
  • Button: "Analyze"

Behavior:

  • Click any token to highlight its attention pattern
  • Attention weights shown as line thickness connecting tokens
  • Matrix cells show numerical values on hover
  • Side panel explains what the selected head appears to focus on

Sample sentences:

  • "The animal didn't cross the street because it was too tired."
  • "The bank was closed so I couldn't deposit my check at the river bank."
  • "She gave him her book and he gave her his."

Annotations:

  • Highlight pronoun resolution patterns
  • Show positional attention (adjacent tokens)
  • Indicate syntactic attention (subject-verb agreement)

Implementation: p5.js with matrix visualization

Multi-Head Attention

Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This allows the model to attend to information from different representation subspaces at different positions.

The multi-head mechanism:

  1. Projects Q, K, V into h different subspaces (h = number of heads)
  2. Computes attention in each subspace independently
  3. Concatenates the results
  4. Projects back to original dimension

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Where each head computes:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
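
Extending the NumPy sketch above, multi-head attention simply runs the single-head computation h times with different projection matrices (random here, learned in a real model) and concatenates the results:

```python
import numpy as np

def attend(Q, K, V):
    """Single-head scaled dot-product attention, as in the earlier sketch."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True) @ V

def multi_head_attention(X, num_heads, rng):
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # These projections are learned in a real model; random purely for illustration.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attend(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_o      # Concat(head_1..head_h) W^O

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 32))                         # 4 token representations, d_model = 32
print(multi_head_attention(X, num_heads=4, rng=rng).shape)   # (4, 32)
```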

Different heads learn to focus on different types of relationships:

| Head Type | Typical Focus | Example |
|---|---|---|
| Syntactic heads | Subject-verb agreement | "The dogs [run]" |
| Positional heads | Adjacent tokens | Local context |
| Semantic heads | Related concepts | "doctor" ↔ "patient" |
| Coreference heads | Pronoun resolution | "she" ↔ "Maria" |

Diagram: Multi-Head Attention Comparison

Multi-Head Attention Comparison Visualization

Type: microsim

Purpose: Interactive visualization showing how different attention heads within a transformer focus on different aspects of the input simultaneously

Bloom Taxonomy: Analyze (L4) - Compare and contrast what different attention heads learn to focus on

Learning Objective: Students should be able to explain why multi-head attention is more powerful than single-head attention and identify what different head types capture

Canvas layout (responsive, minimum 900x600px):

  • Top: Input sentence with selectable tokens
  • Middle: Grid of 8 attention head visualizations (2x4 layout)
  • Bottom: Legend and explanation panel

Visual elements:

  • Input sentence displayed as horizontal token sequence
  • Each attention head shown as a mini attention matrix or connection diagram
  • Color-coded attention weights (heat map from light to dark)
  • Head labels indicating learned focus type (e.g., "Positional", "Syntactic", "Semantic")
  • Aggregate attention view combining all heads

Interactive controls:

  • Text input field for custom sentences
  • Number of heads selector (4, 8, 12)
  • Layer selector slider (1-12)
  • Toggle: Individual heads / Aggregated view
  • Button: "Show Head Specialization"

Behavior:

  • Click on any attention head to expand it to full view
  • Hover over attention matrix cells to see exact weight values
  • Side panel explains what linguistic phenomenon each head appears to capture
  • Animation mode: step through heads showing different perspectives

Sample sentences:

  • "The cat sat on the mat because it was tired."
  • "The programmer fixed the bug that was causing crashes."
  • "She told him that he should go to the bank."

Annotations:

  • Identify head patterns: local attention, global attention, syntactic structure
  • Show how heads specialize for different tasks
  • Highlight complementary information captured by different heads

Implementation: p5.js with grid layout for multiple attention visualizations

Embeddings: Representing Meaning as Numbers

What Are Embeddings?

Embeddings are dense vector representations of tokens (or other discrete entities) in a continuous high-dimensional space. Rather than treating words as arbitrary symbols, embeddings encode semantic relationships through geometric proximity—similar concepts cluster together.

Key properties of embeddings:

  • Dimensionality: Typically 768 to 4096 dimensions in modern LLMs
  • Learned representations: Trained from data, not hand-crafted
  • Semantic similarity: Cosine similarity measures conceptual relatedness
  • Compositionality: Sentence meanings emerge from token embedding combinations

The famous demonstration of embedding arithmetic:

\[\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\]

This suggests embeddings capture semantic relationships that support analogical reasoning.
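
The sketch below illustrates the pattern with made-up four-dimensional vectors; real embeddings have hundreds or thousands of learned dimensions, but the cosine-similarity arithmetic is identical:

```python
# Cosine similarity and the analogy pattern with toy, hand-written vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: dimension 0 loosely encodes "royalty",
# dimension 1 loosely encodes "gender" (illustrative only).
emb = {
    "king":  np.array([0.9,  0.8, 0.1, 0.2]),
    "queen": np.array([0.9, -0.8, 0.1, 0.2]),
    "man":   np.array([0.1,  0.8, 0.3, 0.1]),
    "woman": np.array([0.1, -0.8, 0.3, 0.1]),
}

analogy = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine_similarity(analogy, emb[w]))
print(best)  # "queen" -- the nearest vector to king - man + woman
```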

Vector Similarity Visualization

The following interactive visualization demonstrates how word embeddings cluster semantically related words together. Click on different words to compare their cosine similarity scores and observe how the geometric distance reflects semantic relatedness.

Explore the Vector Similarity MicroSim →

Types of Embeddings in LLMs

| Embedding Type | Purpose | When Used |
|---|---|---|
| Token embeddings | Initial word representation | Input layer |
| Position embeddings | Encode sequence order | Added to token embeddings |
| Segment embeddings | Distinguish text segments | Some architectures (BERT) |
| Output embeddings | Final token representations | Before generation |

Business Applications of Embeddings

Embeddings enable powerful applications beyond text generation: semantic search (finding similar documents), clustering (grouping related items), classification (categorizing content), and recommendation systems (suggesting related content). Many organizations extract embeddings from LLMs for these downstream applications.

Diagram: Embedding Dimensions Explorer

Embedding Dimensions Explorer

Type: microsim

Purpose: Interactive visualization showing how embedding dimensions capture different semantic features and how dimensionality affects representation quality

Bloom Taxonomy: Understand (L2) - Explain how embedding dimensions encode meaning

Learning Objective: Students should be able to describe how embedding dimensions capture semantic features and understand the tradeoffs between different embedding sizes

Canvas layout (responsive, minimum 800x500px):

  • Left panel: Word input and embedding vector display
  • Center: 3D projection of embedding space (reducible to 2D)
  • Right panel: Dimension feature analysis

Visual elements:

  • 3D scatter plot of word embeddings (PCA/t-SNE reduced)
  • Color-coded clusters by semantic category
  • Selected word's raw embedding vector as bar chart
  • "Feature activation" view showing which dimensions activate for concepts
  • Dimension labels showing learned features (gender, size, animate, etc.)

Interactive controls:

  • Text input: Enter word to visualize
  • Slider: Embedding dimensions (50, 100, 300, 768, 1536)
  • Toggle: 2D / 3D projection view
  • Dropdown: Projection method (PCA, t-SNE, UMAP)
  • Button: "Show Similar Words"
  • Button: "Analyze Dimension Features"

Display metrics:

  • Embedding dimension count
  • Sparsity percentage
  • Nearest neighbors in embedding space
  • Dimension-by-dimension breakdown

Behavior:

  • Entering a word shows its position in projected space
  • Hover over dimensions to see what feature they might represent
  • Click two words to see vector difference (analogies)
  • Animation: word "morphing" through embedding space

Educational demonstrations:

  • Show how "king - man + woman ≈ queen" works geometrically
  • Display how synonyms cluster together
  • Illustrate how adding dimensions improves discrimination

Sample word sets:

  • Animals: cat, dog, lion, fish, bird
  • Professions: doctor, nurse, teacher, engineer
  • Actions: run, walk, sprint, jog
  • Concepts: love, hate, fear, joy

Implementation: p5.js with 3D rendering or three.js

Training Large Language Models

Pre-Training: Learning from the Internet

Pre-training is the initial, computationally intensive phase where the model learns from massive text corpora. The objective is typically next-token prediction (for autoregressive models like GPT) or masked language modeling (for bidirectional models like BERT).

Pre-training characteristics:

  • Data scale: Trillions of tokens from web text, books, code, academic papers
  • Compute requirements: Thousands of GPUs for weeks or months
  • Cost: Millions of dollars for frontier models
  • Learning: Grammar, facts, reasoning patterns, world knowledge

The pre-training corpus significantly influences model capabilities. Models trained heavily on code excel at programming tasks; those with extensive scientific literature perform better on technical queries.

| Training Data Source | What the Model Learns |
|---|---|
| Web pages | General knowledge, diverse topics |
| Books | Long-form reasoning, narrative structure |
| Wikipedia | Factual information, structured knowledge |
| Academic papers | Technical concepts, citation patterns |
| Code repositories | Programming syntax, algorithms |
| Social media | Conversational patterns, colloquialisms |

Fine-Tuning: Specializing for Tasks

Fine-tuning adapts a pre-trained model to specific tasks or domains by training on smaller, targeted datasets. This transfer learning approach leverages the general capabilities acquired during pre-training while specializing for particular applications.

Fine-tuning approaches include:

  • Full fine-tuning: Update all model parameters (resource-intensive)
  • Parameter-efficient fine-tuning (PEFT): Update only a subset of parameters
  • LoRA (Low-Rank Adaptation): Add small trainable matrices to existing weights
  • Prompt tuning: Learn only soft prompt embeddings, freeze model weights
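
The NumPy sketch below illustrates the LoRA idea conceptually (it is not tied to any particular library): the pre-trained weight stays frozen while a small low-rank correction is trained and added in the forward pass.

```python
# Conceptual LoRA sketch: freeze W, train only the low-rank matrices A and B.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 1024, 1024, 8            # rank r is much smaller than d

W = rng.normal(size=(d_out, d_in))           # pre-trained weight: frozen
A = rng.normal(size=(rank, d_in)) * 0.01     # trainable, initialized small
B = np.zeros((d_out, rank))                  # trainable, starts at zero

def lora_forward(x):
    return W @ x + B @ (A @ x)               # original path + low-rank update

x = rng.normal(size=d_in)
print(lora_forward(x).shape)                 # (1024,)

full_params = W.size                         # 1,048,576 parameters in W
lora_params = A.size + B.size                # 16,384 trainable parameters
print(f"trainable fraction: {lora_params / full_params:.2%}")   # ~1.56%
```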

Fine-tuning enables:

  • Domain adaptation (legal, medical, financial language)
  • Task specialization (summarization, translation, Q&A)
  • Style alignment (formal vs. casual, brand voice)
  • Safety improvements (reducing harmful outputs)

RLHF: Aligning with Human Preferences

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that aligns model outputs with human values and preferences. This technique proved crucial for making ChatGPT helpful, harmless, and honest—transforming a raw language model into an effective AI assistant.

The RLHF process:

  1. Supervised fine-tuning: Train on high-quality human demonstrations of desired behavior
  2. Reward model training: Train a separate model to predict human preference rankings
  3. Policy optimization: Use reinforcement learning (typically PPO) to maximize the reward model's score while maintaining output diversity

Diagram: RLHF Training Pipeline

flowchart LR
    subgraph Stage1["Stage 1: Supervised Fine-Tuning"]
        A[Pre-trained LLM] --> B[SFT Training]
        C[Human Demonstrations] --> B
        B --> D[SFT Model]
    end

    subgraph Stage2["Stage 2: Reward Model Training"]
        D --> E[Generate Outputs]
        E --> F[Human Rankings]
        F --> G[Train Reward Model]
        G --> H[Reward Model]
    end

    subgraph Stage3["Stage 3: Policy Optimization"]
        D --> I[PPO Training]
        H --> I
        I --> J[RLHF Model]
    end

    J -.->|Iterative Improvement| C

    style Stage1 fill:#e3f2fd
    style Stage2 fill:#fff3e0
    style Stage3 fill:#e8f5e9

RLHF Stage Summary:

| Stage | Human Involvement | Input | Output |
|---|---|---|---|
| 1. SFT | High (write ideal responses) | Pre-trained LLM + demos | SFT Model |
| 2. Reward | Medium (rank outputs) | SFT outputs + rankings | Reward Model |
| 3. PPO | None (automated) | SFT Model + Reward Model | Aligned Model |

Key Insight

Human annotation is the bottleneck. Stage 3 (PPO) runs automatically once the reward model is trained, allowing continuous improvement without additional human labeling.

RLHF addresses limitations of pure pre-training:

  • Helpfulness: Models learn to provide useful, actionable responses
  • Honesty: Models learn to acknowledge uncertainty rather than confabulate
  • Harmlessness: Models learn to refuse harmful requests
  • Format compliance: Models learn to follow instructions about output format

Model Parameters and Their Effects

What Are Parameters?

Model parameters are the learnable numerical values (weights and biases) that define a neural network's behavior. During training, these values are adjusted to minimize prediction error. During inference, they remain fixed and determine how inputs are transformed to outputs.

Parameter count correlates roughly with model capability, but the relationship is not linear:

  • More parameters → more capacity to store knowledge and patterns
  • More parameters → better generalization to novel inputs
  • More parameters → higher computational cost for training and inference
  • More parameters → greater risk of memorization vs. generalization

The scaling laws observed in LLM research suggest that performance improves predictably with increases in parameters, training data, and compute, though with diminishing returns at the frontier.
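
A common rule of thumb from the Chinchilla work is roughly 20 training tokens per parameter, combined with the standard approximation that training compute is about 6 × parameters × tokens (in FLOPs). The sketch below applies these heuristics; treat the outputs as order-of-magnitude estimates only.

```python
# Rule-of-thumb compute-optimal estimates (Chinchilla: ~20 tokens per parameter;
# training compute ≈ 6 * N * D FLOPs). Order-of-magnitude only.
TOKENS_PER_PARAM = 20

def chinchilla_estimate(params: float) -> dict:
    tokens = params * TOKENS_PER_PARAM
    flops = 6 * params * tokens
    return {"params": params, "tokens": tokens, "train_flops": flops}

for n in (7e9, 70e9, 405e9):
    est = chinchilla_estimate(n)
    print(f"{n / 1e9:.0f}B params -> ~{est['tokens'] / 1e12:.1f}T tokens, "
          f"~{est['train_flops']:.1e} training FLOPs")
# 7B   -> ~0.1T tokens, ~5.9e+21 FLOPs
# 70B  -> ~1.4T tokens, ~5.9e+23 FLOPs
# 405B -> ~8.1T tokens, ~2.0e+25 FLOPs
```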

Diagram: LLM Scaling Laws Visualization

LLM Scaling Laws Visualization

Type: microsim

Purpose: Interactive demonstration of how model performance scales with parameters, training data, and compute according to Chinchilla and other scaling law research

Bloom Taxonomy: Apply (L3) - Use scaling laws to predict model performance tradeoffs

Learning Objective: Students should be able to interpret scaling law curves and understand the relationship between model size, training data, and expected performance

Canvas layout (responsive, minimum 800x500px):

  • Main area: Log-log plot showing scaling curves
  • Right panel: Model configuration inputs
  • Bottom: Results summary and cost estimates

Visual elements:

  • X-axis: Training compute (FLOPs) or model parameters (log scale)
  • Y-axis: Loss or benchmark score (log scale)
  • Multiple curves showing: Parameters only, Data only, Compute-optimal allocation
  • Reference points for known models (GPT-3, GPT-4, Claude, Llama)
  • Chinchilla optimal frontier line
  • Isocompute curves showing different parameter/data tradeoffs

Interactive controls:

  • Slider: Model parameters (1B - 1T, log scale)
  • Slider: Training tokens (10B - 10T, log scale)
  • Dropdown: Benchmark metric (Loss, MMLU, HumanEval)
  • Toggle: Show/hide Chinchilla optimal line
  • Toggle: Show/hide real model markers
  • Button: "Calculate Optimal Allocation"

Display metrics:

  • Predicted loss/performance
  • Training FLOPs required
  • Estimated training cost ($)
  • Compute-optimal ratio (tokens per parameter)
  • Distance from Chinchilla optimal

Behavior:

  • Dragging model parameter slider updates predicted performance
  • Show "undertrained" or "overtrained" region shading
  • Highlight when configuration deviates significantly from optimal
  • Click model markers to see actual vs predicted performance
  • Animation: trace training progress over time

Scaling law equations displayed:

  • Loss ≈ A / N^α (parameter scaling, N = model parameters)
  • Loss ≈ B / D^β (data scaling, D = training tokens)
  • Chinchilla compute-optimal: N_opt ∝ C^0.5, D_opt ∝ C^0.5 (C = training compute)

Implementation: Chart.js or p5.js with logarithmic axes

Inference: Running the Model

Inference is the process of using a trained model to generate outputs for new inputs. Unlike training (which updates parameters), inference uses fixed parameters to transform inputs through the network.

Inference considerations include:

  • Latency: Time from input to first output token
  • Throughput: Tokens generated per second
  • Memory footprint: GPU memory required to load the model
  • Cost: Computational expense per token

The autoregressive nature of LLM inference means each output token requires a full forward pass through the network. This creates a fundamental tension between response length and speed.

| Factor | Effect on Latency | Effect on Throughput |
|---|---|---|
| More parameters | ↑ Higher | ↓ Lower |
| Longer context | ↑ Higher | ↓ Lower |
| Longer output | ↑ Higher (cumulative) | Unchanged per token |
| Batch size | Minimal per request | ↑ Higher aggregate |
| Quantization | ↓ Lower | ↑ Higher |

Latency and Throughput Optimization

Latency—the time to generate a response—matters for interactive applications. Users perceive delays beyond 100-200ms as sluggish. LLM latency has multiple components:

  • Time to first token (TTFT): Processing the input and generating the first output token
  • Inter-token latency: Time between subsequent tokens
  • Total response time: Cumulative time for complete response

Throughput—tokens generated per unit time—matters for batch processing and cost optimization. Techniques for improving throughput include:

  • Batching: Processing multiple requests simultaneously
  • KV-caching: Storing key-value computations to avoid recomputation
  • Speculative decoding: Generating multiple tokens in parallel with verification
  • Model parallelism: Distributing the model across multiple GPUs
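
A back-of-the-envelope latency model helps make these terms concrete: total response time is roughly TTFT plus output length times inter-token latency. The millisecond figures below are illustrative placeholders, not benchmarks of any specific model.

```python
# Simple latency model: total time ≈ TTFT + output_tokens * inter-token latency.
# All millisecond values are illustrative placeholders.

def response_time_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    return ttft_ms + itl_ms * output_tokens

def single_stream_throughput(itl_ms: float) -> float:
    return 1000.0 / itl_ms           # tokens per second for one request

ttft, itl = 400.0, 30.0              # hypothetical: 400 ms to first token, 30 ms/token after
for n_out in (50, 500):
    total = response_time_ms(ttft, itl, n_out) / 1000
    print(f"{n_out} output tokens -> {total:.1f} s total")
# 50 tokens -> 1.9 s ; 500 tokens -> 15.4 s

print(f"~{single_stream_throughput(itl):.0f} tokens/s per stream; "
      "batching many requests raises aggregate throughput at some latency cost")
```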

Diagram: Inference Performance Tradeoffs

Inference Latency and Throughput Visualization

Type: microsim

Purpose: Interactive exploration of how model size, batch size, context length, and optimization techniques affect inference latency and throughput

Bloom Taxonomy: Apply (L3) - Calculate and compare inference metrics for different configurations

Learning Objective: Students should be able to estimate latency and throughput for different LLM deployment configurations and understand key tradeoffs

Canvas layout (responsive, minimum 800x500px):

  • Top: Configuration panel with sliders
  • Middle: Dual-axis chart (latency + throughput)
  • Bottom: Cost and capacity estimates

Visual elements:

  • Bar/line chart showing Time to First Token (TTFT) and Inter-Token Latency
  • Throughput gauge (tokens/second)
  • GPU memory utilization bar
  • Cost-per-token indicator
  • Visual timeline of a sample request lifecycle

Interactive controls:

  • Dropdown: Model size (7B, 13B, 70B, 405B)
  • Slider: Batch size (1-64)
  • Slider: Input context length (256 - 128K tokens)
  • Slider: Output length (100 - 4K tokens)
  • Dropdown: Quantization (FP16, INT8, INT4)
  • Toggle: KV-Cache enabled
  • Toggle: Speculative decoding

Display metrics:

  • Time to First Token (TTFT) in milliseconds
  • Inter-Token Latency (ITL) in milliseconds
  • Total response time
  • Throughput (tokens/second/GPU)
  • GPU memory usage (GB)
  • Cost per 1K tokens ($)

Behavior:

  • Real-time updates as sliders change
  • Show "bottleneck" indicator (memory-bound vs compute-bound)
  • Warning when configuration exceeds typical GPU memory
  • Comparison mode: side-by-side configs
  • Animation: visualize token generation timeline

Reference configurations:

  • "Interactive chat" (low latency, small batch)
  • "Batch processing" (high throughput, large batch)
  • "Long document" (large context, optimized caching)

Educational annotations:

  • Explain why larger batch sizes improve throughput but increase latency
  • Show KV-cache memory scaling with context length
  • Demonstrate quantization speedup vs quality tradeoff

Implementation: Chart.js with dynamic updates, or p5.js with custom visualization

The Context Window

What Is a Context Window?

The context window is the maximum number of tokens a model can process in a single forward pass—including both input and output tokens. This limit is architectural, determined by positional encoding schemes and attention computation memory requirements.

Context window sizes have grown dramatically:

| Model Generation | Typical Context Window |
|---|---|
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8,192 / 32,768 tokens |
| Claude 3 (2024) | 200,000 tokens |
| Gemini 1.5 Pro (2024) | 1,000,000 tokens |

A context window of 200,000 tokens accommodates approximately:

  • 150,000 words of English text
  • A 500-page book
  • Multiple lengthy documents for comparison
  • Extended multi-turn conversations

Business Implications of Context Windows

Context window size directly affects application architecture:

| Use Case | Context Requirement | Architectural Approach |
|---|---|---|
| Simple Q&A | ~1,000 tokens | Direct prompting |
| Document summarization | 10,000-50,000 tokens | Single-pass with large context |
| Book analysis | 100,000+ tokens | Large context or chunking + synthesis |
| Knowledge base queries | Variable | RAG (Retrieval-Augmented Generation) |
| Extended conversations | Cumulative | Context management, summarization |

Context Window Costs

Larger context windows increase computational cost. Processing 100,000 tokens costs substantially more than processing 1,000 tokens. Design applications to use appropriate context sizes, not maximum available.

Diagram: Context Window Management

The following diagram compares four strategies for managing context window limitations, each suited to different scenarios and requirements.

flowchart LR
    subgraph S1["📝 Strategy 1: Direct Prompting"]
        direction TB
        D1A["Short Query"] --> D1B["LLM"] --> D1C["Response"]
    end

    subgraph S2["✂️ Strategy 2: Chunking + Synthesis"]
        direction TB
        D2A["Long Doc"]
        D2B["Chunk 1"]
        D2C["Chunk 2"]
        D2D["Chunk 3"]
        D2E["Synthesize"]
        D2A --> D2B & D2C & D2D
        D2B & D2C & D2D --> D2E
    end

    subgraph S3["🔍 Strategy 3: RAG"]
        direction TB
        D3A["Query"] --> D3B["Retrieve"]
        D3B --> D3C["Top K Chunks"]
        D3C --> D3D["Generate"]
    end

    subgraph S4["📚 Strategy 4: Full Context"]
        direction TB
        D4A["Entire Document<br/>in Context"] --> D4B["LLM"] --> D4C["Response"]
    end

    style S1 fill:#E3F2FD,stroke:#1565C0,stroke-width:2px
    style S2 fill:#E8F5E9,stroke:#388E3C,stroke-width:2px
    style S3 fill:#FFF3E0,stroke:#F57C00,stroke-width:2px
    style S4 fill:#F3E5F5,stroke:#7B1FA2,stroke-width:2px

| Strategy | Context Usage | Best For | Pros | Cons |
|---|---|---|---|---|
| Direct Prompting | Low (100s of tokens) | Simple queries, short conversations | Fast, cheap, simple | Limited context, no external knowledge |
| Chunking + Synthesis | Medium per chunk | Long documents exceeding context | Handles any length | May lose cross-chunk relationships |
| RAG | Moderate | Large knowledge bases, specific queries | Scalable, current info, efficient | Retrieval quality critical |
| Full Context | High | Complete document understanding | No information loss, holistic | Expensive, slower, still has limits |
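
The sketch below shows the core of Strategy 2 (and of RAG preprocessing): splitting a long document into overlapping chunks. Word count stands in for token count here; a production pipeline would measure chunks with the model's own tokenizer.

```python
# Minimal chunking sketch for documents that exceed the context window.
# Word count is a crude proxy for token count (illustrative only).

def chunk_text(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap        # overlap preserves context across chunk boundaries
    return chunks

doc = "lorem ipsum " * 2500          # stand-in for a long document (~5,000 words)
chunks = chunk_text(doc, max_tokens=1000, overlap=100)
print(len(chunks), "chunks")         # each chunk is then summarized or embedded separately
```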

Decision Tree for Strategy Selection:

flowchart TD
    Q1{"Does content fit<br/>in context window?"}
    Q2{"Need complete<br/>document understanding?"}
    Q3{"Large knowledge base<br/>with specific queries?"}
    Q4{"Document too long<br/>for context?"}

    Q1 -->|Yes| Q2
    Q1 -->|No| Q4
    Q2 -->|Yes| A1["Full Context"]
    Q2 -->|No| Q3
    Q3 -->|Yes| A2["RAG"]
    Q3 -->|No| A3["Direct Prompting"]
    Q4 -->|Yes| A4["Chunking + Synthesis"]
    Q4 -->|No| A2

    style A1 fill:#F3E5F5,stroke:#7B1FA2
    style A2 fill:#FFF3E0,stroke:#F57C00
    style A3 fill:#E3F2FD,stroke:#1565C0
    style A4 fill:#E8F5E9,stroke:#388E3C

Cost Optimization

Start with the simplest strategy that meets your needs. Direct prompting costs pennies; full context on a 100K document can cost dollars per request. Match strategy complexity to actual requirements.

Putting It All Together

The architectural components covered in this chapter work together to enable LLM capabilities:

  1. Tokenization converts text to numerical representations the model can process
  2. Embeddings map tokens to dense vectors capturing semantic meaning
  3. Self-attention enables each position to gather information from all other positions
  4. Multi-head attention allows simultaneous focus on different relationship types
  5. Transformer layers stack to build progressively abstract representations
  6. Pre-training instills broad language knowledge and world understanding
  7. Fine-tuning specializes the model for particular tasks or domains
  8. RLHF aligns outputs with human preferences and values
  9. Inference applies the trained parameters to generate responses
  10. Context windows bound the information available for each generation

Understanding these components enables informed evaluation of LLM capabilities, realistic expectation setting, and effective application design.

Key Takeaways

  • LLMs predict the next token based on all preceding context, with sophisticated language capabilities emerging from this simple objective at scale
  • Tokens are subword units determined by the tokenizer; understanding tokenization is essential for cost estimation and context management
  • The transformer architecture replaced sequential processing with parallel attention, enabling efficient training on massive datasets
  • Self-attention allows each token to attend to all other tokens, capturing long-range dependencies and contextual relationships
  • Multi-head attention enables simultaneous focus on different relationship types (syntactic, semantic, positional)
  • Embeddings represent tokens as dense vectors where semantic similarity corresponds to geometric proximity
  • Pre-training teaches broad language knowledge; fine-tuning specializes for tasks; RLHF aligns with human preferences
  • Context windows limit the tokens available for processing; larger windows enable more comprehensive understanding but increase cost
  • Inference characteristics (latency, throughput) depend on model size, context length, and optimization techniques

Review Questions

Explain why the attention mechanism was a breakthrough for NLP compared to recurrent architectures.

Recurrent architectures (RNNs, LSTMs) process sequences one element at a time, creating information bottlenecks as context must pass through each step sequentially. This causes: (1) Gradient vanishing/exploding over long sequences, (2) Difficulty capturing long-range dependencies, (3) Sequential processing preventing parallelization. Attention mechanisms allow direct connections between any two positions regardless of distance, enabling: (1) Parallel processing of all positions simultaneously, (2) Direct modeling of long-range relationships, (3) Efficient training on modern GPU clusters. These advantages enabled training on much larger datasets and longer sequences.

How does RLHF differ from standard supervised fine-tuning, and why is it necessary?

Supervised fine-tuning trains on human-written examples, teaching the model to imitate demonstrations. RLHF goes further by: (1) Training a reward model to predict human preferences between outputs, (2) Using reinforcement learning to optimize for reward model scores. This is necessary because: (1) Writing perfect demonstrations is expensive and limited, (2) Humans are better at comparing outputs than generating ideal ones, (3) RLHF can optimize for implicit preferences difficult to demonstrate (helpfulness, safety, format). The result is models that better align with what users actually want.

A 100,000-token document needs analysis. Compare using a large-context model versus RAG approach.

Large-context approach: Load entire document into context window; model has complete information for holistic analysis; best for tasks requiring understanding relationships across the full document; higher cost per query; works well when full context is consistently needed.

RAG approach: Index document, retrieve relevant chunks per query; efficient for specific questions; lower per-query cost; scales to unlimited document sizes; may miss cross-section relationships; requires quality retrieval system; better when queries target specific information rather than full-document synthesis.

Choose large-context for comprehensive analysis (summarization, theme extraction); choose RAG for question-answering over large corpora or when cost matters for many queries.