Flamra + DeepSeek-V4-Pro on SWE-bench Lite

Technical report · 2026-06-14

61.33%
184 / 300 instances resolved on SWE-bench Lite, graded with the official SWE-bench harness (local Docker, no custom grading). This page is the technical report accompanying Flamra's SWE-bench Lite leaderboard submission.

System

Flamra is a Windows-first, bring-your-own-key AI coding agent — a single Rust core shared by a CLI, a Tauri desktop app, and a web UI. The same flamra --agent tool-use loop that ships in the product solved each instance.

Model: DeepSeek-V4-Pro (open-weight, MIT), via the provider's OpenAI-compatible API.
Mode: pass@1 — one autonomous agent run per instance, no instance-specific tuning, no human in the loop.
Tools given to the agent: read_file, write_file, edit_file, grep, glob, bash, todo_write.
Withheld: web_fetch and any other network retrieval. The agent sees only the repository at base_commit and the issue text (no internet, no lookup of SWE-bench solutions), and makes no use of FAIL_TO_PASS, PASS_TO_PASS, or hints.
Cost: ≈ $15 for the full 300-instance run (>95% prompt-cache hit rate), or roughly $0.05 per instance.
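
The per-instance figure follows directly from the totals. The short sketch below redoes that arithmetic and also derives a cost per resolved instance, which the report does not state. It treats the approximate $15 total as exact, so both outputs are rough estimates.

```python
# Figures taken from this report; the total cost is approximate ("≈ $15").
TOTAL_COST_USD = 15.0
INSTANCES = 300
RESOLVED = 184

# Average spend per attempted instance.
cost_per_instance = TOTAL_COST_USD / INSTANCES

# Average spend per resolved instance (a derived figure, not reported above).
cost_per_resolved = TOTAL_COST_USD / RESOLVED

print(f"per instance: ${cost_per_instance:.2f}")
print(f"per resolved: ${cost_per_resolved:.3f}")
```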

Results by repository

Repository	Resolved	Rate
django/django	80 / 114	70%
scikit-learn/scikit-learn	16 / 23	70%
matplotlib/matplotlib	13 / 23	57%
sympy/sympy	43 / 77	56%
pytest-dev/pytest	9 / 17	53%
sphinx-doc/sphinx	9 / 16	56%
astropy/astropy	3 / 6	50%
psf/requests	3 / 6	50%
pylint-dev/pylint	2 / 6	33%
pydata/xarray	2 / 5	40%
mwaskom/seaborn	4 / 4	100%
pallets/flask	0 / 3	0%
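
As a consistency check, the per-repository counts can be summed and compared against the headline figure. The sketch below copies the counts from the table, recomputes each rate, and confirms that the rows add up to 184 / 300. The dictionary name and the helper function are illustrative and are not part of Flamra or the SWE-bench harness.

```python
# (resolved, total) per repository, copied from the table above.
RESULTS = {
    "django/django": (80, 114),
    "scikit-learn/scikit-learn": (16, 23),
    "matplotlib/matplotlib": (13, 23),
    "sympy/sympy": (43, 77),
    "pytest-dev/pytest": (9, 17),
    "sphinx-doc/sphinx": (9, 16),
    "astropy/astropy": (3, 6),
    "psf/requests": (3, 6),
    "pylint-dev/pylint": (2, 6),
    "pydata/xarray": (2, 5),
    "mwaskom/seaborn": (4, 4),
    "pallets/flask": (0, 3),
}


def rate(resolved: int, total: int) -> float:
    """Resolve rate as a percentage."""
    return 100.0 * resolved / total


total_resolved = sum(r for r, _ in RESULTS.values())
total_instances = sum(t for _, t in RESULTS.values())
overall = rate(total_resolved, total_instances)

# Print rows ordered by instance count, largest repository first.
for repo, (r, t) in sorted(RESULTS.items(), key=lambda kv: -kv[1][1]):
    print(f"{repo:<28} {r:>3} / {t:<3} {rate(r, t):5.1f}%")
print(f"{'overall':<28} {total_resolved:>3} / {total_instances:<3} {overall:5.2f}%")
```

Two repositories, django/django and sympy/sympy, contribute 191 of the 300 instances, so the overall rate is dominated by performance on those two.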

Evaluation & reproduction

Predictions were generated in one isolated worktree per instance, then graded with the unmodified official harness: python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Lite --split test --predictions_path all_preds.jsonl --run_id flamra-lite-300. Per-instance verdicts were cross-checked against a second (cloud-Docker) run, and the two runs matched. Compliance: pass@1, no SWE-bench test knowledge, no web browsing.
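
For readers assembling their own predictions file, the sketch below writes a JSON Lines file with one record per instance. It assumes the prediction format described in the SWE-bench documentation (the keys instance_id, model_name_or_path, and model_patch). The function name, the model label, the instance id, and the patch text are all hypothetical placeholders, and this is not Flamra's actual export code.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(patches: dict[str, str], model_name: str, path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in sorted(patches.items()):
            record = {
                "instance_id": instance_id,        # benchmark instance identifier
                "model_name_or_path": model_name,  # free-form label for the system
                "model_patch": patch,              # unified diff produced by the agent
            }
            fh.write(json.dumps(record) + "\n")
    return len(patches)


# Hypothetical placeholder data: not a real SWE-bench instance or patch.
example_patches = {
    "example__repo-1": "diff --git a/pkg/mod.py b/pkg/mod.py\n--- a/pkg/mod.py\n+++ b/pkg/mod.py\n",
}

out_path = Path(tempfile.mkdtemp()) / "all_preds.jsonl"
written = write_predictions(example_patches, "flamra-example", out_path)
print(f"wrote {written} prediction(s) to {out_path}")
```

The resulting file is what --predictions_path points at in the harness command quoted above.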

Limitations

Ten of the 300 instances produced an empty patch (the agent generated no fix); these count as unresolved and are the clearest near-term headroom. SWE-bench Lite is a mature, widely published benchmark, so, as with all high-scoring systems, these results should be read with the standard training-data contamination caveats. We make no claim of novelty beyond the harness + model pairing.
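
The headroom from empty patches can be bounded with simple arithmetic. The sketch below shows one way to find empty patches in a list of prediction records and computes the score that would result if every one of the ten empty-patch instances were resolved. That figure is an optimistic ceiling, not a forecast. The helper name and the sample records are hypothetical, and the model_patch key assumes the SWE-bench prediction format.

```python
def empty_patch_ids(predictions: list[dict]) -> list[str]:
    """Return instance ids whose patch is missing, empty, or whitespace-only."""
    return [
        p["instance_id"]
        for p in predictions
        if not (p.get("model_patch") or "").strip()
    ]


# Hypothetical sample records, for illustration only.
sample = [
    {"instance_id": "example__repo-1", "model_patch": "diff --git a/x.py b/x.py\n"},
    {"instance_id": "example__repo-2", "model_patch": ""},
    {"instance_id": "example__repo-3", "model_patch": "  \n"},
]
print(empty_patch_ids(sample))

# Figures from this report.
RESOLVED, TOTAL, EMPTY = 184, 300, 10

# Upper bound: every empty-patch instance becomes a resolved one.
ceiling = 100.0 * (RESOLVED + EMPTY) / TOTAL
print(f"ceiling if all empty patches were fixed: {ceiling:.2f}%")
```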

Reproduction artifacts (predictions, per-instance reasoning traces, and official harness logs) are part of the SWE-bench leaderboard submission.