High-signal environments
for agents
Curated by human experts across diverse, high-value professional domains. Every task is verifiable, expert-reviewed, and built on real data: work humans would get paid to do.
Our Work
What we've built

SkillsBench
The first evaluation framework measuring how skills (custom instructions) affect AI agent performance: 84 expert-curated tasks across diverse, high-GDP-value domains, and the first dataset measuring how effectively models use skills.

PokemonGym
The first open-source harness that lets any LLM or agent play Pokemon Red and Blue. It tests vision, reasoning, planning, memory, and sequential decision-making. Featured in the Gemini model launch.

BenchFlow Hub & Runtime
The first protocol for unifying agents and benchmarks: a Hugging Face for benchmarks and RL environments. One-line setup for 60+ benchmarks spanning NLP, web agents, code, medical AI, and more.
Backed by
Jeff Dean
Arash Ferdowsi (Dropbox)
Eugene Yan (Amazon)
Founders, Inc.
a16z Scout Fund