BenchFlow builds the environments AI agents learn in.
A frontier environment lab for AI agents. We ship SkillsBench, ClawsBench, and the BenchFlow runtime.
What we ship
01 · Benchmark · SkillsBench
The first benchmark to test whether procedural skills (instructions, scripts, and references an agent loads on demand) make agents better at real work. 86 tasks across 11 domains.
02 · Environment · ClawsBench
Five mock workplaces — Gmail, Calendar, Drive, Docs, Slack — wire-compatible with the upstream `gws` and Slack APIs. Production agents and skills run unchanged against a safety-evaluable replica.
03 · Runtime · BenchFlow
The agent simulation runtime. One Scene-based lifecycle for single-agent, multi-agent, and multi-round evals. Sandboxed, hardened against reward hacking, full trajectory capture.
Thesis
Data is the bottleneck. Environments are the new data.
AI data went from labels to post-training trajectories to environments. Models in 2026 don’t get better from more static prompts — they get better from running through realistic environments and being judged on the whole workflow.
- 1.0
Labels
Image tags, span annotations, yes/no labels.
- 2.0
Post-training
SFT, preferences, reward labels, short trajectories.
- 3.0 · we’re here
Environments
Stateful workplaces with services, files, tools, verifiers, replay.
Ecosystem
- May 26 · CAIS · San Jose
Agent Skills ’26 workshop
First workshop on agent skills. Speakers: Dawn Song, Ross Taylor, Kanav Garg (DeepMind), Yu Su. Live SkillsBench design challenge.
agentskills-workshop.org
- May 27 · Google DeepMind · Mountain View
BenchFlow / SkillsBench / ClawsEnv 1.0
Co-launch with Kaggle and Google DeepMind. Kaggle competition with a $50k prize pool already secured.
Backed by
Angels
Jeff Dean
Chief Scientist, Google DeepMind
Logan Kilpatrick
MTS, Google DeepMind · Gemini Developer Lead
Eugene Yan
MTS, Anthropic
Bill Chambers
PM, LMArena
Arash Ferdowsi
Co-founder / CTO, Dropbox
Firms
+ more across Anthropic, Google DeepMind, and the frontier AI ecosystem.