# Frequently Asked Questions

## General Questions

### What is SWE-bench?

SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It consists of GitHub issues and their corresponding fixes, allowing LLMs to be evaluated on their ability to generate patches that resolve these issues.

### Which datasets are available?

SWE-bench offers four main datasets:
- **SWE-bench**: The full benchmark with 2,294 instances
- **SWE-bench Lite**: A smaller subset of 300 instances
- **SWE-bench Verified**: 500 instances verified by engineers as solvable
- **SWE-bench Multimodal**: 100 _development_ instances with screenshots and UI elements (evaluation on the test split is available via [the SWE-bench API](https://www.swe-bench.com/sb-cli))
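
For example, you can load any of these datasets from the Hugging Face Hub. This is a minimal sketch, assuming the datasets are published under the `princeton-nlp` organization (dataset IDs may change over time):

```python
from datasets import load_dataset

# Load the SWE-bench Lite test split from the Hugging Face Hub.
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

print(len(lite))                # number of instances
print(lite[0]["instance_id"])   # e.g. "<owner>__<repo>-<number>"
```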

### How does the evaluation work?

The evaluation process:
1. Sets up a Docker environment for a repository
2. Applies the model's generated patch
3. Runs the repository's test suite
4. Determines if the patch successfully resolves the issue
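
Conceptually, each instance goes through something like the sketch below. This is only an illustration driven through the Docker CLI; the real harness builds dedicated per-instance images and checks the instance's specific `FAIL_TO_PASS` and `PASS_TO_PASS` tests rather than the container's overall exit code. The image tag and paths here are placeholders.

```python
import subprocess

# Placeholder image tag, not the harness's actual naming scheme.
IMAGE = "swebench-instance-image:latest"

def evaluate(patch_path: str) -> bool:
    # 1. Start a container for the instance's pre-built environment
    cid = subprocess.run(
        ["docker", "run", "-d", IMAGE, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        # 2. Copy in and apply the model's patch
        subprocess.run(["docker", "cp", patch_path, f"{cid}:/tmp/patch.diff"], check=True)
        subprocess.run(
            ["docker", "exec", cid, "git", "-C", "/testbed", "apply", "/tmp/patch.diff"],
            check=True,
        )
        # 3. Run the repository's test suite
        tests = subprocess.run(
            ["docker", "exec", cid, "python", "-m", "pytest"],
            capture_output=True, text=True,
        )
        # 4. Decide whether the issue is resolved (the real harness inspects
        #    specific tests, not just the overall return code)
        return tests.returncode == 0
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=True)
```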

### What metrics are reported?

Key metrics include:
- **Total instances**: Number of instances in the dataset
- **Instances submitted**: Number of instances the model attempted
- **Instances completed**: Number of instances that completed the evaluation process
- **Instances resolved**: Number of instances where the model's patch fixed the issue
- **Instances unresolved**: Number of instances where the evaluation completed but the issue wasn't fixed
- **Instances with empty patches**: Number of instances where the model returned an empty patch
- **Instances with errors**: Number of instances where the evaluation encountered errors
- **Resolution rate**: The percentage of submitted instances that were successfully resolved
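
As a quick sketch of how these fit together, the resolution rate is resolved instances divided by submitted instances. Assuming your run writes a JSON report whose count fields mirror the metric names above (the filename and key names here are assumptions, not a guaranteed schema):

```python
import json

# Hypothetical report path; adjust to wherever your run writes its report.
with open("my-model-name.report.json") as f:
    report = json.load(f)

# Key names are assumed to mirror the metric names listed above.
resolution_rate = report["resolved_instances"] / report["submitted_instances"]
print(f"Resolution rate: {resolution_rate:.1%}")
```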


## Installation and Setup

### How can I save disk space when using Docker?

Use these commands to clean up Docker resources:
```bash
# Remove unused containers
docker container prune

# Remove unused images
docker image prune

# Remove all unused Docker objects
docker system prune
```

You can also set `--cache_level=env` and `--clean=True` when running `swebench.harness.run_evaluation` so that instance images are removed as soon as they are no longer needed. Each run will take longer, but it will use less disk space.

## Using the Benchmark

### How do I run evaluations on my own model?

Generate predictions in the required format and use the evaluation harness:
```python
from swebench.harness.run_evaluation import run_evaluation

predictions = [
    {
        "instance_id": "repo_owner_name__repo_name-issue_number",
        "model": "my-model-name",
        "prediction": "code patch here"
    }
]

results = run_evaluation(
    predictions=predictions,
    dataset_name="SWE-bench_Lite",
)
```

In practice, it is usually easier to save the predictions to a JSONL file and run the evaluation harness on that file:
```bash
python -m swebench.harness.run_evaluation --predictions_path predictions.jsonl
```
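
Here is a minimal sketch of producing that `predictions.jsonl` file, with one JSON object per line and the same fields as the example above (all values are placeholders):

```python
import json

predictions = [
    {
        "instance_id": "repo_owner_name__repo_name-issue_number",
        "model_name_or_path": "my-model-name",
        "model_patch": "diff --git a/file.py b/file.py\n...",
    },
]

# Write one JSON object per line (JSONL) for --predictions_path.
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```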

### Can I run evaluations without using Docker?

No. Docker is required so that every instance is evaluated in a consistent environment, which keeps results reproducible across different systems.

### How can I speed up evaluations?

- Use SWE-bench Lite instead of the full benchmark
- Increase parallelism with the `num_workers` parameter
- Use Modal for cloud-based evaluations

### What format should my model's output be in?

Your model should produce a patch that can be applied to the repository's original code. We recommend the unified diff format produced by `git diff`.
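
For example, if your system edits files in a local checkout of the instance's repository, you can capture the patch with `git diff` (a minimal sketch; the checkout path is a placeholder):

```python
import subprocess

# Capture all uncommitted changes in the checkout as a unified diff.
patch = subprocess.run(
    ["git", "diff"],
    cwd="/path/to/instance/checkout",  # placeholder path
    capture_output=True,
    text=True,
    check=True,
).stdout

print(patch[:500])  # this string becomes the patch field of your prediction
```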

## Troubleshooting

### My evaluation is stuck or taking too long

Try:
- Reducing the number of parallel workers
- Checking Docker resource limits
- Making sure you have enough disk space

### Docker build is failing with network errors

Make sure your Docker network is properly configured and that you have internet access. You can inspect the Docker networks on your machine with:
```bash
docker network ls
docker network inspect bridge
```

## Advanced Usage

### How do I use cloud-based evaluation?

Use Modal for cloud-based evaluations:
```python
from swebench.harness.modal_eval.run_modal import run_modal_evaluation

results = run_modal_evaluation(
    predictions=predictions,
    dataset_name="SWE-bench_Lite",
    parallelism=10
)
```

### How do I contribute to SWE-bench?

You can contribute by:
- Reporting issues or bugs
- Debugging installation and testing issues as you run into them
- Suggesting improvements to the benchmark or documentation

### I want to build a custom dataset. How do I do that?

Email us at `support@swebench.com` and we can discuss how to help you build a custom dataset.