Self-Attention Visualization¶

Run the Self-Attention Visualization Fullscreen

About This MicroSim¶

This interactive visualization demonstrates how the self-attention mechanism in transformers works. Self-attention is the mechanism that lets LLMs relate each word in a sentence to every other word, which is how they pick up context and relationships between words.

Iframe Embedding¶

<iframe src="https://dmccreary.github.io/Digital-Transformation-with-AI-Spring-2026/sims/self-attention-visualization/main.html"
        height="652px"
        width="100%"
        scrolling="no">
</iframe>

How to Use¶

Select a Sentence: Choose from different example sentences to see various attention patterns
Click a Token: Click on any token in the row to see which other tokens it attends to
Read the Matrix: The attention matrix shows the strength of attention from each token (row) to each token (column)
Observe Patterns: Notice how certain linguistic patterns create strong attention connections

Key Attention Patterns¶

Pattern Type	Description	Example
Pronoun Resolution	Pronouns attend strongly to their referents	"it" → "cat"
Subject-Verb Agreement	Verbs attend to their subjects	"passed" → "students"
Adjective-Noun	Adjectives attend to nouns they modify	"quick" → "fox"
Positional	Nearby tokens generally have higher attention	Local context matters

Understanding the Attention Matrix¶

The attention matrix is a square grid where:

Rows represent the "from" token (the one doing the attending)
Columns represent the "to" token (the one being attended to)
Cell color indicates attention strength (darker = stronger)
Each row sums to 1 (softmax normalization)
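
The row and column conventions above can be sketched in a few lines of plain Python. The tokens and raw scores below are invented for illustration and are not taken from the MicroSim. The sketch applies softmax to each row, checks that every row sums to 1, and looks up which column token a given row token attends to most.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tokens and raw scores (assumed values, illustration only).
tokens = ["the", "cat", "sat", "it"]
raw_scores = [
    [2.0, 0.5, 0.1, 0.0],  # row for "the"
    [0.3, 2.5, 0.8, 0.2],  # row for "cat"
    [0.1, 1.5, 2.0, 0.3],  # row for "sat"
    [0.2, 3.0, 0.5, 1.0],  # row for "it"
]

# Each row is normalized independently: row = "from" token, column = "to" token.
attention = [softmax(row) for row in raw_scores]

def strongest_target(from_token):
    """Return the column token that the given row token attends to most."""
    row = attention[tokens.index(from_token)]
    return tokens[row.index(max(row))]

for token, row in zip(tokens, attention):
    print(token, [round(w, 2) for w in row], "row sum:", round(sum(row), 6))

print(strongest_target("it"))  # prints "cat" for these assumed scores
```

Because softmax is applied per row, the weights in a row are comparable with each other, but a column does not need to sum to 1.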

Self-Attention Mechanism¶

The self-attention mechanism computes attention scores using three learned projections:

Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I provide?
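
As a rough sketch of the three projections, the example below multiplies a token embedding by three separate weight matrices. The embeddings, the matrices `W_q`, `W_k`, `W_v`, and the `project` helper are all hypothetical values chosen so the arithmetic is easy to follow; in a trained model the matrices are learned and much larger.

```python
def project(vector, weights):
    """Multiply a row vector (length d_in) by a d_in x d_out weight matrix."""
    return [
        sum(v * w_row[j] for v, w_row in zip(vector, weights))
        for j in range(len(weights[0]))
    ]

# Hypothetical 3-dimensional token embeddings (assumed values).
embeddings = {
    "cat": [1.0, 0.0, 2.0],
    "it": [0.5, 1.0, 0.0],
}

# Hypothetical 3 x 2 projection matrices; a real model learns these in training.
W_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_v = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

q_it = project(embeddings["it"], W_q)    # what "it" is looking for
k_cat = project(embeddings["cat"], W_k)  # what "cat" contains
v_cat = project(embeddings["cat"], W_v)  # what "cat" provides if attended to

# The raw (unscaled) score from "it" to "cat" is the query-key dot product.
raw_score = sum(q * k for q, k in zip(q_it, k_cat))

print("q_it =", q_it)
print("k_cat =", k_cat)
print("v_cat =", v_cat)
print("raw score it -> cat =", raw_score)
```

The same embedding produces three different vectors because each projection has its own weights, which is what lets a token ask for one thing (query) while advertising another (key).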

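The raw score between token \(i\) and token \(j\) is the dot product of query \(i\) with key \(j\), divided by \(\sqrt{d_k}\), where \(d_k\) is the dimension of the key vectors. A row-wise softmax turns these scores into the attention matrix, which then weights the value vectors. In matrix form, for all tokens at once: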

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
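
The formula can be traced step by step in a minimal pure-Python sketch. The `Q`, `K`, and `V` matrices below are small made-up examples (three tokens, \(d_k = 2\)), and the function name is an illustrative choice, not code from the MicroSim.

```python
import math

def softmax(scores):
    """Row-wise softmax: non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in K
        ]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights

# Hypothetical inputs: 3 tokens, d_k = 2 (assumed values, illustration only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = scaled_dot_product_attention(Q, K, V)

for i, row in enumerate(weights):
    print("token", i, "weights:", [round(w, 3) for w in row])
for i, row in enumerate(output):
    print("token", i, "output:", [round(x, 3) for x in row])
```

The `weights` list is the attention matrix that the visualization draws; `output` is what the layer passes on. Dividing by \(\sqrt{d_k}\) keeps the scores from growing with the vector dimension, which would otherwise push the softmax toward a single dominant weight.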

Learning Objectives¶

After using this tool, students should be able to:

Understand (Bloom's L2): Explain how self-attention captures token relationships
Analyze (Bloom's L4): Interpret attention patterns and their linguistic significance
Evaluate (Bloom's L5): Assess why certain patterns emerge in attention distributions

Lesson Plan¶

Activity 1: Pattern Discovery (10 minutes)¶

Select "Pronoun Reference" sentence
Click on "it" and observe what it attends to
Explain why "cat" has high attention

Activity 2: Linguistic Analysis (15 minutes)¶

For each sentence type, identify the primary attention pattern
Document which token pairs have strong connections
Hypothesize why these patterns help language understanding

Discussion Questions¶

Why does the "it" token need to attend to "cat" to generate correct text?
How does attention help models understand long-range dependencies?
What happens when multiple valid referents exist for a pronoun?

Related Concepts¶

Chapter 2: Large Language Model Architecture
Transformer Architecture
Multi-Head Attention
Query-Key-Value Mechanism

References¶

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
Clark, K., et al. (2019). What Does BERT Look At? ACL Workshop BlackboxNLP.
Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. ACL Demo.

Self-Assessment Quiz¶

Test your understanding of the self-attention mechanism.

Question 1: What is the primary purpose of self-attention in transformer models?

A) To reduce the size of the model
B) To capture relationships and context between all tokens in a sequence
C) To make the model run faster
D) To save memory during training

Answer

B) To capture relationships and context between all tokens in a sequence - Self-attention allows each token to "attend to" all other tokens, learning which words are most relevant for understanding each position in the text.

Question 2: In the attention matrix visualization, what do the rows and columns represent?

A) Rows are inputs, columns are outputs
B) Rows are the "from" tokens (doing the attending), columns are the "to" tokens (being attended)
C) Rows are layers, columns are neurons
D) Rows are words, columns are letters

Answer

B) Rows are the "from" tokens (doing the attending), columns are the "to" tokens (being attended) - Each cell shows how much attention one token pays to another, with darker colors indicating stronger attention.

Question 3: What are the three learned projections used in self-attention?

A) Input, Output, and Hidden
B) Query (Q), Key (K), and Value (V)
C) Forward, Backward, and Lateral
D) Beginning, Middle, and End

Answer

B) Query (Q), Key (K), and Value (V) - Each token is projected into Query ("what am I looking for?"), Key ("what do I contain?"), and Value ("what information do I provide?") representations.

Question 4: Why does the pronoun "it" typically show high attention to its referent (like "cat" in "The cat sat because it was tired")?

A) Random chance
B) The model needs to understand what "it" refers to in order to generate contextually appropriate text
C) Pronouns always attend to the first noun
D) Attention is alphabetical

Answer

B) The model needs to understand what "it" refers to in order to generate contextually appropriate text - Self-attention learns to connect pronouns with their referents because this relationship is crucial for understanding and generating coherent language.

Question 5: Why is self-attention considered a breakthrough for processing sequences?

A) It is cheaper than previous methods
B) It allows direct connections between any positions, solving the long-range dependency problem
C) It requires less training data
D) It only works on English text

Answer

B) It allows direct connections between any positions, solving the long-range dependency problem - Unlike recurrent networks that must pass information step-by-step, self-attention creates direct connections between all positions, enabling effective modeling of relationships across long sequences.
