All Benchmarks

Browse and discover all benchmarks from our repository and the wider community. Filter by category and search to find the benchmarks you need.

  • All
  • Agent
  • Code
  • Cybersecurity
  • Embedding
  • General
  • Knowledge
  • Mathematics
  • Performance
  • Reasoning
  • Safeguards
  • Vision

Recently Added

SWE-bench

Available

A benchmark for software engineering tasks.

PAWS: Paraphrase Adversaries from Word Scrambling

Available

Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.

BIRD-SQL

Available

A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs

Available

Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.

WebArena

Available

A realistic web environment for developing autonomous agents. GPT-4 agent achieves 14.41% success rate vs 78.24% human performance.

MLE-bench

Request

A benchmark for measuring how well AI agents perform at machine learning engineering.

SWE-bench

Available

A benchmark for software engineering tasks.

SWE-bench Multimodal

Available

A benchmark for evaluating AI systems on visual software engineering tasks with JavaScript.

AgentBench

Request

A comprehensive benchmark to evaluate LLMs as agents (ICLR 2024).

Tau (τ)-Bench

Request

A benchmark for evaluating AI agents' performance in real-world settings with dynamic interaction.

BIRD-SQL

Available

A pioneering, cross-domain dataset for evaluating text-to-SQL models on large-scale databases.

LegalBench

Request

A collaboratively built benchmark for measuring legal reasoning in large language models.

STS (Semantic Textual Similarity)

Request

A benchmark for evaluating semantic equivalence between text snippets.

MS MARCO

Request

A large-scale dataset for benchmarking information retrieval systems.

Stanford HELM

Request

A comprehensive framework for evaluating language models across various scenarios.

API-bank

Request

A benchmark for evaluating tool-augmented LLMs with 73 API tools and 314 dialogues.

ARC (AI2 Reasoning Challenge)

Request

A dataset of 7,787 grade-school level, multiple-choice science questions for advanced QA research.

HellaSwag

Available

A benchmark for testing physical situation reasoning with harder endings and longer contexts.

HumanEval

Available

Evaluates functional correctness of code generation based on docstrings.

MMLU (Massive Multitask Language Understanding)

Request

A comprehensive benchmark evaluating models on multiple choice questions across 57 subjects.

SuperGLUE

Request

A rigorous benchmark for language understanding with eight challenging tasks.

TruthfulQA

Available

A benchmark for evaluating model truthfulness with 817 questions across 38 categories.

EQ Bench

Request

A benchmark for evaluating emotional intelligence in Large Language Models.

CyberSecEval

Request

A benchmark for evaluating cybersecurity risks and capabilities in Large Language Models.

Spec-Bench

Request

A comprehensive benchmark for evaluating Speculative Decoding methods across diverse scenarios (The Hong Kong Polytechnic University, Peking University, Microsoft Research Asia, Alibaba Group).

MobileAIBench

Request

A comprehensive benchmark for evaluating LLMs & LMMs performance and resource consumption on mobile devices.

MEGA-Bench

Request

A comprehensive multimodal evaluation suite with 505 real-world tasks and diverse output formats.

Alexa Arena

Request

A user-centric interactive platform for embodied AI and robotic task completion in simulated environments.

BIG-bench

Request

A diverse set of tasks designed to measure AI's general capabilities across reasoning, common sense, creativity, and many other areas.

CodeXGLUE

Request

A benchmark for evaluating models on various programming and software engineering tasks such as code completion, bug detection, and code translation.

BEIR

Request

A heterogeneous benchmark suite for evaluating the performance of retrieval models across various datasets and domains.

MMOCR

Request

A benchmark for optical character recognition tasks, evaluating models on text detection, recognition, and end-to-end reading comprehension.

HotpotQA

Request

A dataset of 113K Wikipedia-based questions requiring multi-hop reasoning and supporting fact identification.

TriviaQA

Request

A large-scale QA dataset with 950K question-answer pairs from 662K Wikipedia and web documents (Allen Institute for Artificial Intelligence).

Berkeley Function Calling Leaderboard (BFCL)

Request

A comprehensive benchmark for evaluating function/tool calling capabilities of language models across single-turn, multi-turn and multi-step scenarios.
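
Scoring a single-turn function-calling example can be sketched as comparing the model's emitted call against a gold call. This is a simplification under stated assumptions: BFCL itself uses AST-based matching with per-parameter type checks, and the function name and JSON call format below are illustrative, not the leaderboard's actual API.

```python
import json

def score_call(model_output: str, expected: dict) -> bool:
    """Parse the model's JSON tool call and compare name and arguments.

    Hypothetical exact-match rule; the real benchmark matches more loosely
    (e.g. tolerating equivalent parameter orderings and value types).
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return (call.get("name") == expected["name"]
            and call.get("arguments") == expected["arguments"])

# A toy gold call plus a correct and an incorrect model output.
expected = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
good = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
bad = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
```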

NexusBench

Request

A comprehensive benchmark suite for evaluating function calling, tool use, and agent capabilities of language models.

HaluBench

Request

A comprehensive hallucination evaluation benchmark with 15K samples from real-world domains including finance and medicine.

RouterBench

Request

A comprehensive benchmark for evaluating multi-LLM routing systems with 405k+ inference outcomes.

StableToolBench

Request

A stable benchmark for evaluating LLMs' tool learning capabilities with virtual API system and solvable queries.

TaskBench

Request

A comprehensive framework for evaluating LLMs in task automation across decomposition, tool selection, and parameter prediction.

MMGenBench

Request

A benchmark evaluating LMMs' image understanding through text-to-image generation.

InfiniteBench

Request

A benchmark for evaluating LLMs' ability to process and reason over super long contexts (100K+ tokens).

BABILong

Request

A benchmark for evaluating LLMs' ability to process and reason over super long contexts using a needle-in-a-haystack approach.

LOFT

Request

A comprehensive benchmark for evaluating long-context language models across retrieval, RAG, SQL, and more.

HELMET

Request

A comprehensive benchmark for evaluating long-context language models across seven diverse categories.

APPS: Automated Programming Progress Standard

Available

APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels: 1,000 introductory, 3,000 interview, and 1,000 competition-level problems. An additional 5,000 training samples bring the total to 10,000. We evaluate on questions from the test split, which consists of programming problems commonly found in coding interviews.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Available

Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.

CORE-Bench

Available

Evaluates how well an LLM agent can computationally reproduce the results of a set of scientific papers.

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Available

Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. The study shows that LLMs perform worse on class-level tasks compared to method-level tasks.

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Available

Code generation benchmark with a thousand data science problems spanning seven Python libraries.

HumanEval: Evaluating Large Language Models Trained on Code

Available

Measures the functional correctness of synthesizing programs from docstrings.
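
The scoring idea can be sketched in a few lines: append the model's completion to the prompt's function signature and run the task's unit tests. This is an illustrative simplification; the toy task, the `check_candidate` name, and the unsandboxed `exec` are assumptions, not the benchmark's actual harness (which executes candidates in an isolated sandbox).

```python
def check_candidate(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt + completion passes the task's unit tests."""
    program = prompt + completion + "\n" + test_code
    namespace: dict = {}
    try:
        exec(program, namespace)  # the real harness runs this sandboxed
        return True
    except Exception:
        return False

# A toy task in the HumanEval shape: signature + docstring, a model
# completion, and hidden assert-based unit tests.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
```

A pass@k metric is then just the fraction of tasks for which at least one of k sampled completions passes.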

MBPP: Mostly Basic Python Problems

Available

Measures the ability to synthesize short Python programs from natural language descriptions.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Available

Machine learning tasks drawn from 75 Kaggle competitions.

SWE-bench Verified: Resolving Real-World GitHub Issues

Available

Software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories.

SciCode: A Research Coding Benchmark Curated by Scientists

Available

SciCode tests the ability of language models to generate code to solve scientific research problems. It assesses models on 65 problems from mathematics, physics, chemistry, biology, and materials science.

USACO: USA Computing Olympiad

Available

Challenging programming problems from USA Computing Olympiad competitions, spread across four difficulty levels.

AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Available

Tests whether AI agents can perform real-world time-consuming tasks on the web.

GAIA: A Benchmark for General AI Assistants

Available

Proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.

Mind2Web: Towards a Generalist Agent for the Web

Available

A dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website.

OSWorld

Available

Benchmark for multimodal agents for open-ended tasks in real computer environments.

Sycophancy Eval

Available

Evaluate sycophancy of language models across a variety of free-form text-generation tasks.

A StrongREJECT for Empty Jailbreaks

Available

A benchmark that evaluates the susceptibility of LLMs to various jailbreak prompts.

CYBERSECEVAL 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models

Request

Evaluates Large Language Models for cybersecurity risk to third parties, application developers and end users.

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Available

40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge

Request

Datasets containing 80, 500, 2,000, and 10,000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity.

CyberSecEval_2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Request

Evaluates Large Language Models for risky capabilities in cybersecurity.

GDM Dangerous Capabilities: Capture the Flag

Request

CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.

InterCode: Capture the Flag

Request

Measures expertise in coding, cryptography, binary exploitation, forensics, reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.

SEvenLLM: A benchmark to elicit and improve cybersecurity incident analysis and response abilities in LLMs for Security Events.

Request

Designed for analyzing cybersecurity incidents, comprising two primary task categories, understanding and generation, broken down further into 28 subcategories of tasks.

SecQA: A Concise Question-Answering Dataset for Evaluating Large Language Models in Computer Security

Request

"Security Question Answering" dataset to assess LLMs' understanding and application of security principles. SecQA has "v1" and "v2" datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Available

Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Request

Tests the abilities of LLMs and LLM-augmented agents to answer questions on scientific research workflows in domains such as chemistry, biology, and materials science, as well as more general science tasks.

WMDP: Measuring and Reducing Malicious Use With Unlearning

Request

A dataset of 3,668 multiple-choice questions developed by a consortium of academics and technical consultants that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.

AIME 2024: Problems from the American Invitational Mathematics Examination

Available

A benchmark for evaluating AI's ability to solve challenging mathematics problems from AIME - a prestigious high school mathematics competition.

GSM8K: Grade School Math Word Problems

Available

Measures how effectively language models solve realistic, linguistically rich math word problems suitable for grade-school-level mathematics.

MATH: Measuring Mathematical Problem Solving

Available

Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers.

MGSM: Multilingual Grade School Math

Available

Extends the original GSM8K dataset by translating 250 of its problems into 10 typologically diverse languages.

MathVista: Evaluating Mathematical Reasoning in Visual Contexts

Available

Diverse mathematical and visual tasks that require fine-grained, deep visual understanding and compositional reasoning. Demonstrates multimodal inputs and custom scorers.

ARC: AI2 Reasoning Challenge

Request

Dataset of natural, grade-school science multiple-choice questions (authored for human tests).

BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Available

Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human-rater.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Available

Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Available

Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).

DocVQA: A Dataset for VQA on Document Images

Available

DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.

IFEval: Instruction-Following Evaluation for Large Language Models

Available

Evaluates the ability to follow a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Demonstrates custom scoring.
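
Instructions like these are "verifiable" because each can be checked mechanically, with no judge model. A minimal sketch, where the function names and the whitespace tokenization rule are illustrative assumptions, not IFEval's actual checkers:

```python
import re

def at_least_n_words(text: str, n: int) -> bool:
    """Check 'write in more than N words' via simple whitespace tokenization."""
    return len(text.split()) > n

def mentions_keyword(text: str, keyword: str, times: int) -> bool:
    """Check 'mention the keyword X at least N times' (case-insensitive)."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE)) >= times

# A short model response scored against both toy instructions.
response = "AI systems are improving. Many researchers study AI. AI is everywhere."
```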

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark

Request

Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

Available

Evaluating models on multistep soft reasoning tasks in the form of free text narratives.

Needle in a Haystack (NIAH): In-Context Retrieval Benchmark for Long Context LLMs

Available

NIAH evaluates in-context retrieval ability of long context LLMs by testing a model's ability to extract factual information from long-context inputs.
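
The setup can be sketched as follows; the filler text, the depth parameter, and the substring-match scoring rule are simplifying assumptions, not the benchmark's exact implementation:

```python
def build_haystack(filler: str, needle: str, depth: float, length: int) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end)
    inside `length` characters of repeated filler text."""
    words = (filler + " ") * (length // (len(filler) + 1) + 1)
    context = words[:length]
    pos = int(len(context) * depth)
    return context[:pos] + " " + needle + " " + context[pos:]

def score_retrieval(model_answer: str, expected: str) -> bool:
    """Credit the model if the expected fact appears in its answer."""
    return expected.lower() in model_answer.lower()

# One cell of the usual NIAH grid: a fixed needle at 50% depth in a
# 2,000-character context. The full benchmark sweeps depth x context length.
needle = "The secret number is 7421."
haystack = build_haystack("lorem ipsum dolor sit amet", needle, 0.5, 2000)
```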

PAWS: Paraphrase Adversaries from Word Scrambling

Available

Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.

PIQA: Reasoning about Physical Commonsense in Natural Language

Available

Measure physical commonsense reasoning (e.g. "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?")

RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models

Available

Reading comprehension tasks collected from English exams for Chinese middle and high school students aged 12 to 18.

SQuAD: A Reading Comprehension Benchmark requiring reasoning over Wikipedia articles

Available

Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.

V*Bench: A Visual QA Benchmark with Detailed High-resolution Images

Request

V*Bench is a visual question & answer benchmark that evaluates MLLMs in their ability to process high-resolution and visually crowded images to find and focus on small details.

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

Available

A large-scale dataset of 44k pronoun resolution problems, inspired by the original Winograd Schema Challenge design (a set of 273 expert-crafted problems intended to be unsolvable for statistical models that rely on selectional preferences or word associations) but adjusted to improve both scale and hardness.

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens

Request

LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Request

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.

AIR Bench: AI Risk Benchmark

Available

A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Available

Measure question answering with commonsense prior knowledge.

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Request

Challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry (experts at PhD level in the corresponding domains reach 65% accuracy).

Humanity's Last Exam

Available

Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. It consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Available

An enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.

O-NET: A high-school student knowledge test

Request

Questions and answers from the Ordinary National Educational Test (O-NET), administered annually by the National Institute of Educational Testing Service to Matthayom 6 (Grade 12 / ISCED 3) students in Thailand. The exam covers five subjects: English language, math, science, social knowledge, and Thai language. Questions are multiple-choice or true/false, and may be in either English or Thai.

PubMedQA: A Dataset for Biomedical Research Question Answering

Available

Biomedical question answering (QA) dataset collected from PubMed abstracts.

SimpleQA: Measuring short-form factuality in large language models

Available

A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Available

Measure whether a language model is truthful in generating answers to questions using questions that some humans would answer falsely due to a false belief or misconception.

XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs

Available

Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.