Flamra + DeepSeek-V4-Pro on SWE-bench Lite

Technical report · 2026-06-14

61.33%
184 / 300 instances resolved on SWE-bench Lite, graded with the official SWE-bench harness (local Docker, no custom grading). This page is the technical report accompanying Flamra's SWE-bench Lite leaderboard submission.

System

Flamra is a Windows-first, bring-your-own-key AI coding agent: a single Rust core shared by a CLI, a Tauri desktop app, and a web UI. Each instance was solved by the same flamra --agent tool-use loop that ships in the product.

Results by repository

Repository                    Resolved    Rate
django/django                 80 / 114     70%
scikit-learn/scikit-learn     16 / 23      70%
matplotlib/matplotlib         13 / 23      57%
sympy/sympy                   43 / 77      56%
pytest-dev/pytest              9 / 17      53%
sphinx-doc/sphinx              9 / 16      56%
astropy/astropy                3 / 6       50%
psf/requests                   3 / 6       50%
pylint-dev/pylint              2 / 6       33%
pydata/xarray                  2 / 5       40%
mwaskom/seaborn                4 / 4      100%
pallets/flask                  0 / 3        0%
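
As a consistency check, the per-repository counts can be summed and compared with the headline figure. The sketch below only restates the numbers from the table; the names RESULTS and rate are illustrative and are not part of Flamra or the SWE-bench harness.

```python
# Per-repository (resolved, total) counts copied from the table above.
RESULTS = {
    "django/django": (80, 114),
    "scikit-learn/scikit-learn": (16, 23),
    "matplotlib/matplotlib": (13, 23),
    "sympy/sympy": (43, 77),
    "pytest-dev/pytest": (9, 17),
    "sphinx-doc/sphinx": (9, 16),
    "astropy/astropy": (3, 6),
    "psf/requests": (3, 6),
    "pylint-dev/pylint": (2, 6),
    "pydata/xarray": (2, 5),
    "mwaskom/seaborn": (4, 4),
    "pallets/flask": (0, 3),
}


def rate(resolved, total):
    """Return the resolved share as a percentage."""
    return 100.0 * resolved / total


total_resolved = sum(r for r, _ in RESULTS.values())
total_instances = sum(t for _, t in RESULTS.values())
overall = rate(total_resolved, total_instances)

# Print one row per repository, then the aggregate.
for repo, (r, t) in RESULTS.items():
    print(f"{repo:<28}{r:>3} / {t:<4}{rate(r, t):>4.0f}%")
print(f"overall: {total_resolved} / {total_instances} = {overall:.2f}%")
```

The totals come out to 184 resolved of 300 instances, which matches the 61.33% headline.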

Evaluation & reproduction

Predictions were generated in one isolated worktree per instance, then graded with the unmodified official harness:

python -m swebench.harness.run_evaluation --dataset_name SWE-bench/SWE-bench_Lite --split test --predictions_path all_preds.jsonl --run_id flamra-lite-300

Per-instance verdicts were cross-checked against a second (cloud-Docker) run, and the two runs matched. Compliance: pass@1, no SWE-bench test knowledge, no web browsing.
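
A cross-check between two grading runs could look roughly like the sketch below. It is not the script used for this submission. It assumes each run leaves a JSON report containing a list of resolved instance IDs under a key named resolved_ids; that key name, the helper names, and the instance IDs in the usage note are assumptions for illustration.

```python
import json


def load_resolved(report_path, key="resolved_ids"):
    """Read the set of resolved instance IDs from a JSON report.

    The key name is an assumption about the report layout; adjust it
    to whatever the report file actually contains.
    """
    with open(report_path, encoding="utf-8") as fh:
        return set(json.load(fh)[key])


def diff_verdicts(local, cloud):
    """Return the instance IDs resolved in only one of the two runs."""
    return {
        "only_local": sorted(local - cloud),
        "only_cloud": sorted(cloud - local),
    }
```

For example, diff_verdicts({"a-1", "b-2"}, {"b-2", "c-3"}) reports "a-1" as resolved only locally and "c-3" as resolved only in the cloud run. Two runs agree when both lists are empty.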

Limitations

10 of 300 instances produced an empty patch (the agent generated no fix); these count as unresolved and are the clearest near-term headroom. SWE-bench Lite is a mature, widely published benchmark, so these results, like those of any high-scoring system, should be read with the standard training-data contamination caveats. We claim no novelty beyond pairing the Flamra agent harness with DeepSeek-V4-Pro.
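
Empty patches can be counted directly from a predictions file. The sketch below assumes the usual SWE-bench predictions layout, one JSON object per line with instance_id and model_patch fields; the function name split_by_patch is hypothetical and the code is not taken from Flamra.

```python
import json


def split_by_patch(lines):
    """Partition prediction records into empty and non-empty patches.

    `lines` is any iterable of JSONL strings, for example an open file.
    A patch counts as empty when model_patch is missing, null, or
    contains only whitespace.
    """
    empty, nonempty = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        patch = record.get("model_patch") or ""
        target = nonempty if patch.strip() else empty
        target.append(record["instance_id"])
    return empty, nonempty
```

Passing an open handle on all_preds.jsonl to split_by_patch returns the two ID lists; for this submission, the first list would be expected to hold the 10 instances described above.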

Reproduction artifacts (predictions, per-instance reasoning traces, and official harness logs) are part of the SWE-bench leaderboard submission.