AI Task Completion Time Horizons¶

An interactive visualization showing how long different AI models can work on tasks before failing, based on METR's research on measuring AI ability to complete long tasks.

Run the MicroSim Fullscreen

To use this interactive chart on any website simply copy this line of HTML into your web page:

<iframe src="https://dmccreary.github.io/tracking-ai-course/sims/ai-task-horizons/main.html" width="100%" height="588"></iframe>

This MicroSim is based on the wonderful chart at the METR.org blob post

Overview¶

This MicroSim visualizes the task-completion time horizons for various AI language models, showing the relationship between model capabilities and their ability to sustain work on extended tasks. The data represents predictions about how long AI agents can operate before encountering failures.

The visualization demonstrates that:

Frontier models (shown in green) generally achieve longer task-completion horizons
Success probability significantly impacts time horizons - models perform better at 50% success thresholds than 80%
There is substantial variation among models, even within the same category
The trend shows exponential growth in AI capabilities over time

Key Findings¶

According to METR's research, the length of tasks models can complete is well predicted by an exponential trend with a doubling time of around 7 months. This suggests rapid improvement in AI's ability to handle longer, more complex tasks.

Interactive Features¶

Scale Toggle¶

Linear Scale: Shows absolute differences between models
Log Scale: Better visualizes the exponential growth pattern and compresses the range for easier comparison

Success Probability Toggle¶

50% Success: The time horizon where a model has a 50% chance of successfully completing a task
80% Success: The more conservative metric requiring 80% success probability (typically shows shorter horizons)

Hover Tooltips¶

Hover over any data point to see: - Model name - Time horizon in minutes and hours - Release date - Success probability level

Color Coding¶

Green points: Frontier models (state-of-the-art)
Gray points: Non-frontier models

Model Data¶

Frontier Models¶

GPT-5: 8,239 min / 137 hrs (50%) / 1,583 min / 26 hrs (80%)
Grok-4: 6,605 min / 110 hrs (50%) / 895 min / 15 hrs (80%)
o3: 5,531 min / 92 hrs (50%) / 1,223 min / 20 hrs (80%)
Claude 3.7 Sonnet: 3,254 min / 54 hrs (50%) / 909 min / 15 hrs (80%)
o1-elicited: 2,354 min / 39 hrs (50%) / 358 min / 6 hrs (80%)
Claude 3.5 Sonnet 20241022: 1,739 min / 29 hrs (50%) / 278 min / 5 hrs (80%)
o1-preview: 1,326 min / 22 hrs (50%) / 277 min / 5 hrs (80%)
Claude 3.5 Sonnet: 1,093 min / 18 hrs (50%) / 192 min / 3 hrs (80%)

Non-Frontier Models¶

Claude Sonnet 4.5: 6,798 min / 113 hrs (50%) / 1,209 min / 20 hrs (80%)
Claude 4.1 Opus: 6,330 min / 106 hrs (50%) / 1,265 min / 21 hrs (80%)
Claude 4 Opus: 4,792 min / 80 hrs (50%) / 1,218 min / 20 hrs (80%)
Claude 4 Sonnet: 4,063 min / 68 hrs (50%) / 1,000 min / 17 hrs (80%)
Claude 3 Opus: 385 min / 6 hrs (50%) / 68 min / 1 hr (80%)

Customization Guide¶

Modifying the Data¶

To update the visualization with new models or data, edit the modelData object in main.html:

const modelData = {
    frontier: [
        {
            name: 'Model Name',
            horizon50: 100.0,  // 50% success time in minutes
            horizon80: 20.0,   // 80% success time in minutes
            date: '2025-MM-DD' // Release date
        }
    ],
    nonFrontier: [
        // Similar structure for non-frontier models
    ]
};

Changing Colors¶

Modify the background and border colors for each dataset:

datasets: [
    {
        label: 'Frontier Models',
        backgroundColor: 'rgba(39, 174, 96, 0.8)',  // Green
        borderColor: 'rgba(39, 174, 96, 1)',
        // ...
    },
    {
        label: 'Non-Frontier Models',
        backgroundColor: 'rgba(149, 165, 166, 0.8)',  // Gray
        borderColor: 'rgba(149, 165, 166, 1)',
        // ...
    }
]

Adjusting Chart Dimensions¶

For iframe embedding, modify the height in index.md:

<iframe src="./main.html" width="100%" height="650" frameborder="0"></iframe>

Or adjust the chart container height in the CSS within main.html:

.chart-container {
    height: 500px;  /* Adjust as needed */
}

Technical Details¶

Library: Chart.js 4.4.0
Chart Type: Scatter plot
Data Points: 30 AI models (14 frontier, 16 non-frontier)
Time Range: From 0.3 minutes (GPT-2) to 8,239 minutes / 137 hours (GPT-5)
Data Source: METR Research (March 2025)
Interactive Controls: Scale toggle (linear/log), Success probability toggle (50%/80%)
Browser Compatibility: All modern browsers (Chrome, Firefox, Safari, Edge)
Dependencies: Chart.js (loaded from CDN)
Responsive: Yes, adapts to container width with mobile optimizations

Interpretation Guide¶

Reading the Chart¶

X-axis: Models are arranged chronologically by release date
Y-axis: Time horizon in minutes (how long the model can work on tasks)
Higher points: Indicate models can work on longer tasks
Color: Distinguishes frontier (green) vs non-frontier (gray) models

Understanding Success Probabilities¶

50% horizon of 100 minutes: The model has a 50% chance of successfully completing a 100-minute task
80% horizon of 20 minutes: The model has an 80% chance of successfully completing a 20-minute task
Higher success probabilities generally correspond to shorter time horizons (more conservative estimates)

Using Log Scale¶

The log scale is particularly useful for: - Visualizing exponential growth trends - Comparing models with widely varying capabilities - Identifying patterns in capability improvements over time

Use Cases¶

This visualization is useful for:

Understanding AI progress: Track how model capabilities improve over time
Comparing models: Evaluate different AI systems' ability to handle extended tasks
Planning AI deployment: Choose appropriate models based on task duration requirements
Educational content: Demonstrate exponential growth in AI capabilities
Research presentations: Show empirical data on AI task-completion abilities

References¶

About This MicroSim¶

This MicroSim replicates the interactive chart from METR's research on AI task-completion capabilities. It demonstrates the exponential growth trend in AI systems' ability to handle longer, more complex tasks over time, with a doubling period of approximately 7 months.

The visualization helps students and researchers understand both current AI capabilities and the rapid pace of improvement in this critical dimension of AI performance.

Self-Assessment Quiz¶

Test your understanding of AI task completion time horizons.

Question 1: What does "task horizon" measure in the context of AI performance?

How far into the future AI can see
The length of tasks AI can complete autonomously with a specified success rate
The maximum number of users AI can serve
The geographical reach of AI systems

Answer

B) The length of tasks AI can complete autonomously with a specified success rate - Task horizon indicates how long (in human work time) a task can be while still having a specified probability of successful AI completion.

Question 2: What is the difference between "frontier" and "non-frontier" models in this visualization?

Frontier models are older; non-frontier are newer
Frontier models represent state-of-the-art at their release; non-frontier are other capable models
Frontier models work on simple tasks; non-frontier on complex tasks
There is no difference; it's just random categorization

Answer

B) Frontier models represent state-of-the-art at their release; non-frontier are other capable models - Frontier models (shown in green) were the most capable models available at their time of release, while non-frontier models (gray) are other capable but not leading-edge models.

Question 3: Why is the logarithmic scale option useful for viewing this data?

It makes the chart more colorful
It better visualizes exponential growth patterns across a wide range of values
It is required by scientific standards
It hides poor-performing models

Answer

B) It better visualizes exponential growth patterns across a wide range of values - The log scale compresses the range to show the consistent doubling pattern and allows comparison of models that vary by orders of magnitude in capability.

Question 4: What does the gap between 50% and 80% success metrics indicate about a model?

The model is broken
The model's consistency and reliability across different task lengths
The model's cost difference
The model's age

Answer

B) The model's consistency and reliability across different task lengths - A large gap indicates the model is less consistent, while a smaller gap suggests more reliable performance across the capability range.

Question 5: Based on the trend data, what type of tasks might AI reliably handle by 2027-2028?

Only simple 5-minute tasks
Complex multi-day to week-long tasks
Tasks requiring physical labor
Tasks that have not been invented yet

Answer

B) Complex multi-day to week-long tasks - If the 7-month doubling trend continues, projections suggest AI could handle tasks equivalent to multiple days or even weeks of human work by 2027-2028.