MMLU Timeline MicroSim

<iframe src="https://dmccreary.github.io/tracking-ai-course/sims/mmlu-timeline/main.html"
  height="550" scrolling="no" style="overflow: hidden;"></iframe>

Run the MMLU Timeline MicroSim

Edit the MMLU MicroSim

MMLU Timeline with Chart.js Version 3

MMLU Timeline with Chart.js Version 2

MMLU Timeline with Chart.js

React Timeline Chart

React Chart

Notes

Based on my research, I've updated the MMLU benchmark data with the most recent results from various AI models through April 2025. Here's what I found:

Latest MMLU Benchmark Scores (as of May 2025)

The interactive chart now includes the most recent benchmark scores from the leading AI models:

  1. GPT-4.1: Achieves 90.2% on MMLU, as reported by OpenAI in April 2025; OpenAI highlights GPT-4.1's "strong 90.2% score on the Massive Multitask Language Understanding (MMLU) benchmark."

  2. OpenAI o1 (o1-high): Scored around 91.5% as of February 2025; this is OpenAI's reasoning model.

  3. Claude 3.7 Sonnet: Reported scores vary; some sources place it around 87-88%, while others show results in the 80-85% range, with one source citing "a MMLU score of 0.803" (80.3%).

  4. Claude 3.5 Sonnet: Achieved 88.7% in June 2024; one report notes that it "proves that it can complete natural language tasks with better quality than its predecessors and competitors such as GPT-4, with a score of 88.7% in the MMLU benchmark."

  5. Gemini 2.5 Pro: Reaches approximately 89.8% as of April 2025, making it highly competitive with the best models from OpenAI.

  6. Llama 4 (Maverick): Achieves 85.5%; for comparison, GPT-4o's rumored 87-88% range exceeds Maverick's 85.5%.

  7. Grok 3: Reportedly achieves an impressive 92.7% on MMLU; one analysis claims it "tops several academic benchmarks (92.7% MMLU, ~89% GSM8K for math) with a massive 2.7-trillion-parameter design."

The chart now includes data from 2020 through April 2025, showing the dramatic improvement in model performance over this period, from the early GPT-3 scores in the 40% range to today's frontier models exceeding 90%.
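For context, here is a minimal sketch of how the data points behind such a timeline might be represented in JavaScript. The scores are taken from the list above; the field names and the exact release dates are illustrative assumptions, not the MicroSim's actual data file.

```javascript
// Illustrative data array for an MMLU timeline. Scores come from the notes
// above; field names and exact dates are assumptions and may differ from
// the MicroSim's actual source.
const mmluScores = [
  { model: "GPT-3",             date: "2020-06-01", score: 43.9 }, // "40% range"
  { model: "Claude 3.5 Sonnet", date: "2024-06-01", score: 88.7 },
  { model: "o1-high",           date: "2025-02-01", score: 91.5 },
  { model: "Grok 3",            date: "2025-02-01", score: 92.7 },
  { model: "Llama 4 Maverick",  date: "2025-04-01", score: 85.5 },
  { model: "GPT-4.1",           date: "2025-04-01", score: 90.2 },
  { model: "Gemini 2.5 Pro",    date: "2025-04-01", score: 89.8 }
];
```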

Key observations:

  1. The past year has seen intense competition, with scores improving from the mid-80s to over 90%.

  2. Different measurement methodologies can affect scores, making direct comparisons challenging; model creators have reported MMLU scores using non-standard prompting techniques (a sketch after this list illustrates the idea).

  3. The gap between leading models has narrowed significantly, with several top models now performing within a few percentage points of each other.
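To make the methodology caveat in item 2 concrete, here is a hedged sketch of how a few-shot MMLU-style prompt might be assembled. The exact question-and-choices layout varies from lab to lab; this format is an illustrative assumption, which is precisely why reported scores can differ.

```javascript
// Illustrative few-shot MMLU-style prompt builder. The layout below is an
// assumption; each lab formats prompts differently (zero-shot, 5-shot,
// chain-of-thought), which is one reason scores are hard to compare.
function buildPrompt(examples, question) {
  const format = (q) =>
    `Question: ${q.text}\n` +
    q.choices.map((c, i) => `${"ABCD"[i]}) ${c}`).join("\n") +
    `\nAnswer: ${q.answer ?? ""}`;
  // Zero-shot passes an empty examples array; 5-shot prepends five
  // worked examples before the question being scored.
  return [...examples.map(format), format(question)].join("\n\n");
}
```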

This updated visualization should help track the rapid advancement in AI capabilities as measured by the MMLU benchmark, which tests knowledge across 57 subjects ranging from STEM fields to humanities.
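For readers who want to build a similar chart, here is a minimal Chart.js (version 3+) sketch that plots points like those above as a timeline. The canvas id, axis ranges, and tooltip format are assumptions for illustration, not the MicroSim's actual source.

```javascript
// Illustrative only -- assumes Chart.js 3+ plus a date adapter (e.g.,
// chartjs-adapter-date-fns) are loaded, and that the page contains
// <canvas id="mmluChart"></canvas>. Reuses mmluScores from the sketch above.
const ctx = document.getElementById("mmluChart").getContext("2d");
new Chart(ctx, {
  type: "scatter",
  data: {
    datasets: [{
      label: "MMLU score (%)",
      data: mmluScores.map(d => ({ x: d.date, y: d.score })),
      pointRadius: 5
    }]
  },
  options: {
    scales: {
      x: { type: "time", time: { unit: "year" },
           title: { display: true, text: "Release date" } },
      y: { min: 40, max: 100,
           title: { display: true, text: "MMLU score (%)" } }
    },
    plugins: {
      tooltip: {
        callbacks: {
          // Show the model name alongside its score.
          label: (item) => `${mmluScores[item.dataIndex].model}: ${item.parsed.y}%`
        }
      }
    }
  }
});
```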

References

Data Is Beautiful Visualization

Self-Assessment Quiz

Test your understanding of the MMLU benchmark and AI progress.

Question 1: What does MMLU stand for?

  A) Machine Model Language Understanding
  B) Massive Multitask Language Understanding
  C) Multiple Machine Learning Units
  D) Modern Model Learning Utility
Answer

B) Massive Multitask Language Understanding - MMLU is a benchmark that tests AI models across 57 subjects ranging from STEM fields to humanities.

Question 2: Approximately what MMLU score range do the best AI models achieve as of 2025?

  A) 40-50%
  B) 60-70%
  C) Over 90%
  D) Under 30%
Answer

C) Over 90% - Frontier models like Grok 3, GPT-4.1, and OpenAI's o1 achieve scores above 90% on MMLU, compared to earlier models that scored in the 40-50% range.

Question 3: What trend does the MMLU timeline demonstrate?

  A) AI performance has remained constant
  B) AI performance has dramatically improved from the mid-40s to over 90% in a few years
  C) AI performance has declined
  D) Only one model has ever been tested
Answer

B) AI performance has dramatically improved from the mid-40s to over 90% in a few years - The timeline shows rapid improvement in AI capabilities, with scores improving by roughly 50 percentage points in just a few years.

Question 4: Why might direct comparison of MMLU scores between models be challenging?

  A) Scores are measured in different units
  B) Different measurement methodologies and prompting techniques can affect scores
  C) MMLU tests change every day
  D) Scores are not made public
Answer

B) Different measurement methodologies and prompting techniques can affect scores - Model creators may use different prompting approaches or evaluation methods, making direct comparisons difficult without standardized testing conditions.

Question 5: What does it mean when AI models surpass average human performance on MMLU?

  A) AI is generally smarter than all humans
  B) AI can now answer academic knowledge questions at college-educated human levels
  C) MMLU is no longer a valid benchmark
  D) Human education has failed
Answer

B) AI can now answer academic knowledge questions at college-educated human levels - Surpassing human average performance on MMLU indicates AI can demonstrate knowledge across academic subjects comparable to educated humans, though this doesn't mean general superiority in all tasks.