# Frequently Asked Questions

## General Questions

### What is SWE-bench?

SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It consists of GitHub issues and their corresponding fixes, allowing LLMs to be evaluated on their ability to generate patches that resolve these issues.

### Which datasets are available?

SWE-bench offers four main datasets:

- **SWE-bench**: The full benchmark with 2,294 instances
- **SWE-bench Lite**: A smaller subset with 300 instances
- **SWE-bench Verified**: 500 instances verified by engineers as solvable
- **SWE-bench Multimodal**: 100 _development_ instances with screenshots and UI elements (test-set evaluation runs on [the SWE-bench API](https://www.swe-bench.com/sb-cli))

### How does the evaluation work?

The evaluation process:

1. Sets up a Docker environment for the repository
2. Applies the model's generated patch
3. Runs the repository's test suite
4. Determines whether the patch successfully resolves the issue

### What metrics are reported?

Key metrics include:

- **Total instances**: Number of instances in the dataset
- **Instances submitted**: Number of instances the model attempted
- **Instances completed**: Number of instances that completed the evaluation process
- **Instances resolved**: Number of instances where the model's patch fixed the issue
- **Instances unresolved**: Number of instances where the evaluation completed but the issue wasn't fixed
- **Instances with empty patches**: Number of instances where the model returned an empty patch
- **Instances with errors**: Number of instances where the evaluation encountered errors
- **Resolution rate**: The percentage of submitted instances that were successfully resolved

## Installation and Setup

### How can I save disk space when using Docker?

Use these commands to clean up Docker resources:

```bash
# Remove unused containers
docker container prune

# Remove unused images
docker image prune

# Remove all unused Docker objects
docker system prune
```

You can also set `--cache_level=env` and `--clean=True` when running `swebench.harness.run_evaluation` to remove instance images after each use. This makes each run take longer, but uses less disk space.

## Using the Benchmark

### How do I run evaluations on my own model?

Generate predictions in the required format and use the evaluation harness:

```python
from swebench.harness.run_evaluation import run_evaluation

predictions = [
    {
        "instance_id": "repo_owner__repo_name-issue_number",
        "model_name_or_path": "my-model-name",
        "model_patch": "code patch here",
    }
]

results = run_evaluation(
    predictions=predictions,
    dataset_name="SWE-bench_Lite",
)
```

More practically, you can save the predictions to a file and then run the evaluation harness on that file:

```bash
python -m swebench.harness.run_evaluation --predictions_path predictions.jsonl
```

### Can I run evaluations without using Docker?

No. Docker is required for consistent evaluation environments, which ensures that evaluations are reproducible across different systems.

### How can I speed up evaluations?

- Use SWE-bench Lite instead of the full benchmark
- Increase parallelism with the `num_workers` parameter
- Use Modal for cloud-based evaluations

### What format should my model's output be in?

Your model should produce a diff or patch that can be applied to the original code. The exact format depends on the instance, but we typically recommend the diff generated by `git diff`.
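As an illustration, here is a hedged sketch of one complete prediction written to `predictions.jsonl`. Every concrete value below (the instance ID, model name, file path, and hunk contents) is hypothetical; only the overall shape, with `model_patch` holding a unified diff as `git diff` emits it, reflects the required format:

```python
import json

# Hypothetical patch string in the unified-diff format produced by `git diff`:
# a `diff --git` header, `---`/`+++` file markers, and an `@@` hunk header.
model_patch = """\
diff --git a/example/utils.py b/example/utils.py
--- a/example/utils.py
+++ b/example/utils.py
@@ -1,2 +1,2 @@
 def add(a, b):
-    return a - b
+    return a + b
"""

# One prediction per line in the JSONL file; all values here are placeholders.
prediction = {
    "instance_id": "repo_owner__repo_name-123",  # hypothetical instance ID
    "model_name_or_path": "my-model-name",       # your model's name
    "model_patch": model_patch,                  # the unified diff above
}

with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")
```

The resulting `predictions.jsonl` can then be passed straight to the harness with the CLI command shown above.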
## Troubleshooting

### My evaluation is stuck or taking too long

Try:

- Reducing the number of parallel workers
- Checking Docker resource limits
- Making sure you have enough disk space

### Docker build is failing with network errors

Make sure your Docker network is properly configured and that you have internet access. You can try:

```bash
# List available Docker networks
docker network ls

# Inspect the default bridge network
docker network inspect bridge
```

## Advanced Usage

### How do I use cloud-based evaluation?

Use Modal for cloud-based evaluations (a fuller end-to-end sketch appears at the end of this page):

```python
from swebench.harness.modal_eval.run_modal import run_modal_evaluation

results = run_modal_evaluation(
    predictions=predictions,
    dataset_name="SWE-bench_Lite",
    parallelism=10,
)
```

### How do I contribute to SWE-bench?

You can contribute by:

- Reporting issues or bugs
- Debugging installation and testing issues as you run into them
- Suggesting improvements to the benchmark or documentation

### I want to build a custom dataset. How do I do that?

Email us at `support@swebench.com` and we can talk about helping you build a custom dataset.
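Returning to the cloud-evaluation question above, here is a minimal end-to-end sketch. It assumes your predictions live in a `predictions.jsonl` file (one JSON object per line, in the format shown earlier); the `run_modal_evaluation` call reuses the names from the Modal example, and the file-loading step is workflow scaffolding rather than part of the SWE-bench API:

```python
import json

from swebench.harness.modal_eval.run_modal import run_modal_evaluation

# Load previously generated predictions, one JSON object per line
# (assumed workflow step, not part of the SWE-bench API).
with open("predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]

# Run the evaluation in the cloud, mirroring the Modal example above.
results = run_modal_evaluation(
    predictions=predictions,
    dataset_name="SWE-bench_Lite",
    parallelism=10,
)
```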