Technical report · 2026-06-14
61.33%
184 / 300 instances resolved on SWE-bench Lite, graded with the official SWE-bench harness (local Docker, no custom grading). This page is the technical report accompanying Flamra's SWE-bench Lite leaderboard submission.
Flamra is a Windows-first, bring-your-own-key AI coding agent — a single Rust core shared by a CLI, a Tauri desktop app, and a web UI. The same flamra --agent tool-use loop that ships in the product solved each instance.
Tools available to the agent: read_file, write_file, edit_file, grep, glob, bash, todo_write.
Not available: web_fetch or any other network retrieval. The agent sees only the repository at base_commit and the issue text (no internet, no lookup of SWE-bench solutions). No use of FAIL_TO_PASS / PASS_TO_PASS / hints.

| Repository | Resolved | Rate |
|---|---|---|
| django/django | 80 / 114 | 70% |
| scikit-learn/scikit-learn | 16 / 23 | 70% |
| matplotlib/matplotlib | 13 / 23 | 57% |
| sympy/sympy | 43 / 77 | 56% |
| pytest-dev/pytest | 9 / 17 | 53% |
| sphinx-doc/sphinx | 9 / 16 | 56% |
| astropy/astropy | 3 / 6 | 50% |
| psf/requests | 3 / 6 | 50% |
| pylint-dev/pylint | 2 / 6 | 33% |
| pydata/xarray | 2 / 5 | 40% |
| mwaskom/seaborn | 4 / 4 | 100% |
| pallets/flask | 0 / 3 | 0% |
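
As a quick consistency check, the per-repository counts in the table can be re-aggregated to confirm the headline figure. The sketch below hard-codes the table rows; the names RESULTS and rate are illustrative only and are not part of Flamra or the SWE-bench harness.

```python
# Per-repository (resolved, total) counts copied from the table above.
RESULTS = {
    "django/django": (80, 114),
    "scikit-learn/scikit-learn": (16, 23),
    "matplotlib/matplotlib": (13, 23),
    "sympy/sympy": (43, 77),
    "pytest-dev/pytest": (9, 17),
    "sphinx-doc/sphinx": (9, 16),
    "astropy/astropy": (3, 6),
    "psf/requests": (3, 6),
    "pylint-dev/pylint": (2, 6),
    "pydata/xarray": (2, 5),
    "mwaskom/seaborn": (4, 4),
    "pallets/flask": (0, 3),
}


def rate(resolved: int, total: int) -> float:
    """Return the resolve rate as a percentage."""
    return 100.0 * resolved / total


# Sum both columns to recover the aggregate score.
total_resolved = sum(resolved for resolved, _ in RESULTS.values())
total_instances = sum(total for _, total in RESULTS.values())
overall = rate(total_resolved, total_instances)

for repo, (resolved, total) in RESULTS.items():
    print(f"{repo:<28} {resolved:>3} / {total:<3} {rate(resolved, total):5.1f}%")

print(f"overall: {total_resolved} / {total_instances} = {overall:.2f}%")
# prints "overall: 184 / 300 = 61.33%"
```

The column sums (184 and 300) match the figures reported at the top of this page.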
Predictions were generated in one isolated worktree per instance, then graded with the unmodified official harness: python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Lite --split test --predictions_path all_preds.jsonl --run_id flamra-lite-300. Per-instance verdicts were cross-checked against a second run on cloud Docker, and the two sets of verdicts matched. Compliance: pass@1, no SWE-bench test knowledge, no web browsing.
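
For readers unfamiliar with the predictions file, here is a minimal sketch of writing one in the JSONL shape the SWE-bench harness reads: one object per line with instance_id, model_name_or_path, and model_patch. The helper names, the model label "flamra-example", and the sample instance IDs and patch text are all hypothetical; this is not Flamra's actual export code.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(patches: dict[str, str], path: Path, model_name: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    with path.open("w", encoding="utf-8") as fh:
        for instance_id, patch in sorted(patches.items()):
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,  # unified diff text; may be empty
            }
            fh.write(json.dumps(record) + "\n")
    return len(patches)


def read_predictions(path: Path) -> list[dict]:
    """Load a JSONL predictions file, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical sample data: these IDs and patches are placeholders, not real results.
sample_patches = {
    "example__project-2": "",
    "example__project-1": "--- a/pkg/mod.py\n+++ b/pkg/mod.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n",
}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "all_preds.jsonl"
    written = write_predictions(sample_patches, out_path, model_name="flamra-example")
    loaded = read_predictions(out_path)

print(f"wrote {written} predictions; first id: {loaded[0]['instance_id']}")
```

Records are written in sorted order so that two runs over the same instances produce files that are easy to diff.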
Ten of the 300 instances produced an empty patch (the agent generated no fix); these count as unresolved and are the clearest near-term headroom. SWE-bench Lite is a mature, widely published benchmark, so these results, like those of any high-scoring system, should be read with the standard training-data contamination caveats. We make no claim of novelty beyond the harness and model pairing.
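
Empty patches can be spotted before grading by scanning the predictions. The sketch below is one way to do that, assuming each prediction is a dict with instance_id and model_patch keys; the function name and sample records are hypothetical, and treating whitespace-only patches as empty is our assumption rather than documented harness behavior.

```python
def find_empty_patches(predictions: list[dict]) -> list[str]:
    """Return the instance IDs whose patch is missing, empty, or whitespace-only."""
    return [
        pred["instance_id"]
        for pred in predictions
        if not (pred.get("model_patch") or "").strip()
    ]


# Hypothetical sample records, not taken from the actual submission.
sample_predictions = [
    {"instance_id": "example__project-1", "model_patch": "--- a/f.py\n+++ b/f.py\n"},
    {"instance_id": "example__project-2", "model_patch": ""},
    {"instance_id": "example__project-3", "model_patch": None},
    {"instance_id": "example__project-4", "model_patch": "  \n"},
]

empty_ids = find_empty_patches(sample_predictions)
print(f"{len(empty_ids)} of {len(sample_predictions)} predictions have no patch: {empty_ids}")
```

Listing these IDs separately makes it easier to tell instances where the agent gave up from instances where it produced a wrong fix.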
Reproduction artifacts (predictions, per-instance reasoning traces, and official harness logs) are part of the SWE-bench leaderboard submission.