Self-Attention Visualization¶
Run the Self-Attention Visualization Fullscreen
About This MicroSim¶
This interactive visualization demonstrates how the self-attention mechanism in transformers works. Self-attention is the key innovation that allows LLMs to understand context and relationships between words in a sentence.
Iframe Embedding¶
```html
<iframe src="https://dmccreary.github.io/Digital-Transformation-with-AI-Spring-2026/sims/self-attention-visualization/main.html"
        height="652px"
        width="100%"
        scrolling="no">
</iframe>
```
How to Use¶
- Select a Sentence: Choose from different example sentences to see various attention patterns
- Click a Token: Click on any token in the row to see which other tokens it attends to
- Read the Matrix: The attention matrix shows the strength of attention from each token (row) to each token (column)
- Observe Patterns: Notice how certain linguistic patterns create strong attention connections
Key Attention Patterns¶
| Pattern Type | Description | Example |
|---|---|---|
| Pronoun Resolution | Pronouns attend strongly to their referents | "it" → "cat" |
| Subject-Verb Agreement | Verbs attend to their subjects | "passed" → "students" |
| Adjective-Noun | Adjectives attend to nouns they modify | "quick" → "fox" |
| Positional | Nearby tokens generally have higher attention | Local context matters |
Understanding the Attention Matrix¶
The attention matrix is a square grid where:
- Rows represent the "from" token (the one doing the attending)
- Columns represent the "to" token (the one being attended to)
- Cell color indicates attention strength (darker = stronger)
- Each row sums to 1 (softmax normalization)
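The row normalization described above can be sketched in a few lines of NumPy. This is an illustrative example, not the visualization's actual code: the raw score values are arbitrary, and only the softmax step mirrors what the matrix displays.

```python
import numpy as np

def softmax_rows(scores):
    """Normalize each row of a score matrix so it sums to 1."""
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Raw (pre-softmax) scores for a 3-token sentence; values chosen arbitrarily.
raw_scores = np.array([
    [2.0, 0.5, 0.1],   # token 0's affinity for tokens 0, 1, 2
    [0.3, 1.5, 0.2],
    [0.1, 2.2, 0.9],
])

attention = softmax_rows(raw_scores)
print(attention.round(3))
print(attention.sum(axis=1))  # each row sums to 1 (up to float rounding)
```

Because every row is a probability distribution, a dark cell in one row always comes at the expense of the other cells in that same row.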
Self-Attention Mechanism¶
The self-attention mechanism computes attention scores using three learned projections:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
The attention score between token \(i\) and token \(j\) is:

$$a_{ij} = \operatorname{softmax}_j\left(\frac{q_i \cdot k_j}{\sqrt{d_k}}\right)$$

where \(q_i\) is the query vector of token \(i\), \(k_j\) is the key vector of token \(j\), and \(d_k\) is the dimension of the key vectors. Dividing by \(\sqrt{d_k}\) keeps the dot products in a range where the softmax produces useful gradients.
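The full computation can be sketched in NumPy. This is a minimal illustration under toy assumptions: the embeddings and the three projection matrices are random here, whereas in a real transformer they are learned during training, and real models use many attention heads rather than one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 8  # e.g. a 4-token sentence

# Toy token embeddings and projection matrices (random stand-ins for learned weights).
X = rng.normal(size=(n_tokens, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# Project every token into Query, Key, and Value representations.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) applied row-wise.
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1

# Each output row is a weighted mix of all tokens' value vectors.
output = weights @ V
```

The `weights` matrix here is exactly what the visualization renders: row \(i\), column \(j\) holds \(a_{ij}\), how much token \(i\) attends to token \(j\).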
Learning Objectives¶
After using this tool, students should be able to:
- Understand (Bloom's L2): Explain how self-attention captures token relationships
- Analyze (Bloom's L4): Interpret attention patterns and their linguistic significance
- Evaluate (Bloom's L5): Assess why certain patterns emerge in attention distributions
Lesson Plan¶
Activity 1: Pattern Discovery (10 minutes)¶
- Select "Pronoun Reference" sentence
- Click on "it" and observe what it attends to
- Explain why "cat" has high attention
Activity 2: Linguistic Analysis (15 minutes)¶
- For each sentence type, identify the primary attention pattern
- Document which token pairs have strong connections
- Hypothesize why these patterns help language understanding
Discussion Questions¶
- Why does the "it" token need to attend to "cat" to generate correct text?
- How does attention help models understand long-range dependencies?
- What happens when multiple valid referents exist for a pronoun?
Related Concepts¶
- Chapter 2: Large Language Model Architecture
- Transformer Architecture
- Multi-Head Attention
- Query-Key-Value Mechanism
References¶
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
- Clark, K., et al. (2019). What Does BERT Look At? ACL Workshop BlackboxNLP.
- Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. ACL Demo.
Self-Assessment Quiz¶
Test your understanding of the self-attention mechanism.
Question 1: What is the primary purpose of self-attention in transformer models?
- To reduce the size of the model
- To capture relationships and context between all tokens in a sequence
- To make the model run faster
- To save memory during training
Answer
B) To capture relationships and context between all tokens in a sequence - Self-attention allows each token to "attend to" all other tokens, learning which words are most relevant for understanding each position in the text.
Question 2: In the attention matrix visualization, what do the rows and columns represent?
- Rows are inputs, columns are outputs
- Rows are the "from" tokens (doing the attending), columns are the "to" tokens (being attended)
- Rows are layers, columns are neurons
- Rows are words, columns are letters
Answer
B) Rows are the "from" tokens (doing the attending), columns are the "to" tokens (being attended) - Each cell shows how much attention one token pays to another, with darker colors indicating stronger attention.
Question 3: What are the three learned projections used in self-attention?
- Input, Output, and Hidden
- Query (Q), Key (K), and Value (V)
- Forward, Backward, and Lateral
- Beginning, Middle, and End
Answer
B) Query (Q), Key (K), and Value (V) - Each token is projected into Query ("what am I looking for?"), Key ("what do I contain?"), and Value ("what information do I provide?") representations.
Question 4: Why does the pronoun "it" typically show high attention to its referent (like "cat" in "The cat sat because it was tired")?
- Random chance
- The model needs to understand what "it" refers to in order to generate contextually appropriate text
- Pronouns always attend to the first noun
- Attention is alphabetical
Answer
B) The model needs to understand what "it" refers to in order to generate contextually appropriate text - Self-attention learns to connect pronouns with their referents because this relationship is crucial for understanding and generating coherent language.
Question 5: Why is self-attention considered a breakthrough for processing sequences?
- It is cheaper than previous methods
- It allows direct connections between any positions, solving the long-range dependency problem
- It requires less training data
- It only works on English text
Answer
B) It allows direct connections between any positions, solving the long-range dependency problem - Unlike recurrent networks that must pass information step-by-step, self-attention creates direct connections between all positions, enabling effective modeling of relationships across long sequences.