SkillsBench

A benchmark of 86 verifiable, expert-created tasks measuring how reusable skills improve AI coding agents across 11 professional domains.

skillsbench.ai · Feb 2026
Verifiable · Human-Created · Expert-Verified · Real Data (not synthetic) · Tasks Humans Get Paid to Do
Skills ≈ Applications
Harness ≈ Operating Systems
Models ≈ CPUs
Skills Layer
Domain-specific capabilities and workflows that extend agent functionality. Like applications on an OS, skills provide specialized knowledge and tools for particular tasks.
Agent Harness Layer
The execution environment that orchestrates agents, manages tool access, and handles I/O. Analogous to an operating system that mediates between applications and hardware.
Models Layer
The foundational AI models that power reasoning and generation. Like CPUs, they provide the raw computational capability that upper layers build upon.
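The three layers can be pictured as a small object model. The sketch below is purely illustrative: the class names (`Skill`, `Harness`, `EchoModel`), the `generate` and `run` methods, and the idea of prepending skill instructions to the prompt are all assumptions made for this example, not SkillsBench's or any real harness's API.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Model(Protocol):
    """Models layer: raw reasoning and generation (the 'CPU')."""

    def generate(self, prompt: str) -> str: ...


@dataclass
class Skill:
    """Skills layer: packaged domain knowledge (the 'application')."""

    name: str
    instructions: str


@dataclass
class Harness:
    """Agent harness layer: mediates between skills and the model (the 'OS')."""

    model: Model
    skills: list[Skill] = field(default_factory=list)

    def run(self, task: str) -> str:
        # Hypothetical policy: prepend every installed skill's instructions.
        context = "\n".join(f"[{s.name}] {s.instructions}" for s in self.skills)
        prompt = f"{context}\n{task}" if context else task
        return self.model.generate(prompt)


class EchoModel:
    """Stand-in model that only reports how much prompt text it received."""

    def generate(self, prompt: str) -> str:
        return f"received {len(prompt)} chars"


bare = Harness(EchoModel())
skilled = Harness(EchoModel(), [Skill("xlsx", "use openpyxl")])
print(bare.run("abc"))
print(skilled.run("abc"))
```

The same model sits under both harnesses; only the skills layer differs, which is the comparison the benchmark is built around.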
86 Expert-Level Tasks Across 11 Domains
Office & White Collar: Excel, Word, PowerPoint, PDF, Gmail
Natural Science: Physics, Astronomy, Chemistry, Earth Sciences
Finance: Macrofinance, Economics, Portfolio Mgmt
Healthcare: Clinical Lab, Cancer Proteomics
Manufacturing: Codebook, Equipment, Job-Shop Scheduling
Cybersecurity: CVE Patching, CTF, Network Security
Energy: Power Grid, Optimal Power Flow
Mathematics: Formal Proof, Game Optimization
Robotics: PDDL Planning, Control Systems
Media & Content: Video Editing, Audio, Dubbing, TTS
Software Eng.: ML Repro, DevOps, Bug Fix, Migration
Difficulty
Core: 17 tasks (20%) · <60 min
Extended: 43 tasks (50%) · 1-4 hours
Extreme: 26 tasks (30%) · >4 hours
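As a quick consistency check, the rounded percentages follow from the task counts. This is a minimal sketch that uses only the counts listed above; the variable names are chosen for illustration.

```python
# Task counts per difficulty tier, as listed above.
counts = {"Core": 17, "Extended": 43, "Extreme": 26}

total = sum(counts.values())

# Share of each tier in percent, rounded to one decimal place.
shares = {tier: round(100 * n / total, 1) for tier, n in counts.items()}

print(total)
print(shares)
```

The counts sum to the 86 tasks in the benchmark, and the shares round to the 20% / 50% / 30% split.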
Community Momentum
800+ Community Members
180+ Contributors
80%+ PhDs or Senior Professionals
Agent Performance: All Conditions (5 trials × 84 tasks × 7 models)

#  Agent                   Without Skills  With Skills (range)  Self-Generated  Norm. Gain
1  Gemini CLI Flash        31.3±3.0        48.7±3.1             -               25.3%
2  Claude Code Opus 4.5    22.0±2.8        45.3±2.5             21.6            29.9%
3  Codex GPT-5.2           30.6±3.1        44.7±3.0             25.0            20.3%
4  Claude Code Opus 4.6    30.6±2.6        44.5±3.1             32.0            20.0%
5  Gemini CLI Pro          27.6±3.0        41.2±3.1             -               18.8%
6  Claude Code Sonnet 4.5  17.3±2.5        31.8±2.9             15.2            17.5%
7  Claude Code Haiku 4.5   11.0±2.1        27.7±2.9             11.0            18.8%
Legend: Claude Code · Gemini CLI · Codex
Normalized Gain = (with − without) / (100 − without) · Method D · 95% CIs
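The normalized-gain formula can be checked against the reported scores. The sketch below assumes scores are on a 0-100 scale, as the formula implies, and uses only the point estimates; the ± intervals and "Method D" are not modeled. The function and variable names are illustrative.

```python
def normalized_gain(without: float, with_skills: float) -> float:
    """Fraction of the remaining headroom (100 - without) that skills close."""
    if without >= 100:
        raise ValueError("no headroom left when the baseline is already 100")
    return (with_skills - without) / (100 - without)


# (without skills, with skills) point estimates from the results above.
scores = {
    "Gemini CLI Flash": (31.3, 48.7),
    "Claude Code Opus 4.5": (22.0, 45.3),
    "Codex GPT-5.2": (30.6, 44.7),
    "Claude Code Opus 4.6": (30.6, 44.5),
    "Gemini CLI Pro": (27.6, 41.2),
    "Claude Code Sonnet 4.5": (17.3, 31.8),
    "Claude Code Haiku 4.5": (11.0, 27.7),
}

# Normalized gain in percent, rounded to one decimal place.
gains = {
    name: round(100 * normalized_gain(base, skilled), 1)
    for name, (base, skilled) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain}%")
```

Dividing by the headroom rather than the baseline keeps a model with a low starting score from looking better merely because it had more room to improve in absolute terms.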
Skills Substitute for Model Scale
Figure: score (axis 0-50) by Claude Code model, ordered smaller to larger.

Model       Without Skills  With Skills
Haiku 4.5   11.0            27.7
Sonnet 4.5  17.3            31.8
Opus 4.5    22.0            45.3
Opus 4.6    30.6            44.5

Haiku + Skills (27.7) > Opus 4.5 (22.0)
Sonnet + Skills (31.8) > Opus 4.6 (30.6)
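The two highlighted comparisons can be reproduced by scanning for every case where a smaller model with skills outscores a larger model without them. This is a minimal sketch that compares point estimates only and ignores the ± intervals reported with the results; the model ordering follows the figure's smaller-to-larger axis, and the `substitutions` helper is hypothetical.

```python
# (model, without skills, with skills), ordered smaller to larger as in the figure.
claude = [
    ("Haiku 4.5", 11.0, 27.7),
    ("Sonnet 4.5", 17.3, 31.8),
    ("Opus 4.5", 22.0, 45.3),
    ("Opus 4.6", 30.6, 44.5),
]


def substitutions(rows):
    """Return (smaller, larger) pairs where smaller+skills beats larger alone."""
    found = []
    for i, (small, _, small_with) in enumerate(rows):
        # Only compare against models that sit further right on the size axis.
        for large, large_without, _ in rows[i + 1:]:
            if small_with > large_without:
                found.append((small, large))
    return found


pairs = substitutions(claude)
for small, large in pairs:
    print(f"{small} + Skills > {large}")
```

On these numbers the only pair that fails the check is Haiku 4.5 with skills (27.7) against Opus 4.6 without (30.6).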
Open Source · github.com/benchflow-ai/skillsbench