# Frequently Asked Questions
## General Questions
### What is SWE-bench?
SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It consists of GitHub issues and their corresponding fixes, allowing LLMs to be evaluated on their ability to generate patches that resolve these issues.
### Which datasets are available?
SWE-bench offers four main datasets:
- **SWE-bench**: The full benchmark with 2,294 instances
- **SWE-bench Lite**: A smaller subset with 300 instances
- **SWE-bench Verified**: 500 instances verified by engineers as solvable
- **SWE-bench Multimodal**: 100 _development_ instances with screenshots and UI elements (test-split evaluation runs through [the SWE-bench API](https://www.swe-bench.com/sb-cli))
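All of these are hosted on Hugging Face under the `princeton-nlp` organization, so you can load them directly; for example:
```python
from datasets import load_dataset

# Load the Lite test split from Hugging Face
lite = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
print(len(lite), lite[0]["instance_id"])
```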
### How does the evaluation work?
The evaluation process:
1. Sets up a Docker environment for a repository
2. Applies the model's generated patch
3. Runs the repository's test suite
4. Determines if the patch successfully resolves the issue
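Conceptually, each instance runs through a flow like the sketch below. This is an illustration of the steps above, not the harness's actual code; the helper signature and the `/testbed` working directory are assumptions:
```python
import subprocess

def evaluate_instance(image: str, patch_path: str, test_cmd: list[str]) -> str:
    """Illustrative per-instance flow. The image name, patch file, and test
    command would come from the dataset instance."""
    # Step 1: start a container from the instance's environment image
    cid = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        # Step 2: copy in and apply the model's patch
        subprocess.run(["docker", "cp", patch_path, f"{cid}:/tmp/patch.diff"], check=True)
        apply = subprocess.run(
            ["docker", "exec", "-w", "/testbed", cid, "git", "apply", "/tmp/patch.diff"]
        )
        if apply.returncode != 0:
            return "error"  # patch did not apply cleanly
        # Step 3: run the repository's test suite
        tests = subprocess.run(["docker", "exec", "-w", "/testbed", cid] + test_cmd)
        # Step 4: resolved only if the previously failing tests now pass
        return "resolved" if tests.returncode == 0 else "unresolved"
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=False)
```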
### What metrics are reported?
Key metrics include:
- **Total instances**: Number of instances in the dataset
- **Instances submitted**: Number of instances the model attempted
- **Instances completed**: Number of instances that completed the evaluation process
- **Instances resolved**: Number of instances where the model's patch fixed the issue
- **Instances unresolved**: Number of instances where the evaluation completed but the issue wasn't fixed
- **Instances with empty patches**: Number of instances where the model returned an empty patch
- **Instances with errors**: Number of instances where the evaluation encountered errors
- **Resolution rate**: The percentage of submitted instances that were successfully resolved
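The resolution rate follows directly from these counts. A small sketch with made-up numbers (the key names mirror the metrics above, not necessarily the exact report schema):
```python
# Hypothetical report counts for a run on SWE-bench Lite
report = {
    "submitted_instances": 300,
    "completed_instances": 292,
    "resolved_instances": 126,
    "empty_patch_instances": 5,
    "error_instances": 3,
}

# Resolution rate = resolved / submitted, as a percentage
resolution_rate = 100 * report["resolved_instances"] / report["submitted_instances"]
print(f"Resolution rate: {resolution_rate:.1f}%")  # Resolution rate: 42.0%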
## Installation and Setup
### How can I save disk space when using Docker?
Use these commands to clean up Docker resources:
```bash
# Remove unused containers
docker container prune
# Remove unused images
docker image prune
# Remove all unused Docker objects
docker system prune
```
You can also set `--cache_level env` and `--clean True` when running `swebench.harness.run_evaluation` so that instance images are removed as soon as they are no longer needed. Each run will take longer, but will use less disk space.
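For example, a full invocation with those flags might look like this (assuming your predictions are saved in `predictions.jsonl`; see the next section for the prediction format):
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id low_disk_run \
    --cache_level env \
    --clean True
```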
## Using the Benchmark
### How do I run evaluations on my own model?
Generate predictions in the required format, write them to a JSONL file, and run the evaluation harness on that file:
```python
import json

# One prediction per instance. The harness expects the keys
# "instance_id", "model_name_or_path", and "model_patch".
predictions = [
    {
        "instance_id": "repo_owner__repo_name-issue_number",  # e.g. "django__django-11099"
        "model_name_or_path": "my-model-name",
        "model_patch": "<unified diff produced by your model>",
    }
]

# Save the predictions as JSONL (one JSON object per line)
with open("predictions.jsonl", "w") as f:
    for pred in predictions:
        f.write(json.dumps(pred) + "\n")
```
Then point the evaluation harness at that file:
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.jsonl \
    --run_id my_eval_run
```
### Can I run evaluations without using Docker?
No. Docker is required for consistent evaluation environments. This ensures that the evaluation is reproducible across different systems.
### How can I speed up evaluations?
- Use SWE-bench Lite instead of the full benchmark
- Increase parallelism with the `--max_workers` flag (see the example below)
- Use Modal for cloud-based evaluations
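For example, the same evaluation command with more parallel workers (how high you can safely go depends on your CPU, memory, and disk):
```bash
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.jsonl \
    --max_workers 12 \
    --run_id fast_run
```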
### What format should my model's output be in?
Your model should produce a patch that can be applied to the original repository with standard tools. We recommend the unified diff format generated by `git diff`.
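For illustration, a minimal (entirely hypothetical) patch in that format looks like:
```diff
diff --git a/src/utils.py b/src/utils.py
--- a/src/utils.py
+++ b/src/utils.py
@@ -12 +12 @@
-    return a / b
+    return a / b if b != 0 else None
```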
## Troubleshooting
### My evaluation is stuck or taking too long
Try:
- Reducing the number of parallel workers
- Checking Docker resource limits
- Making sure you have enough disk space
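To inspect Docker's resource usage while an evaluation runs:
```bash
# Disk space used by images, containers, and volumes
docker system df

# Live CPU and memory usage of running containers
docker stats
```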
### Docker build is failing with network errors
Make sure your Docker network is properly configured and you have internet access. You can try:
```bash
# List available Docker networks
docker network ls
# Inspect the default bridge network's configuration
docker network inspect bridge
```
## Advanced Usage
### How do I use cloud-based evaluation?
SWE-bench supports running evaluations on [Modal](https://modal.com/) instead of local Docker. Pass the `--modal true` flag to the evaluation harness:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id modal_run \
    --modal true
```
### How do I contribute to SWE-bench?
You can contribute by:
- Reporting issues or bugs
- Debugging installation and testing issues as you run into them
- Suggesting improvements to the benchmark or documentation
### I want to build a custom dataset. How do I do that?
Email us at `support@swebench.com` and we can talk about helping you build a custom dataset.