Local agent benchmarks - which model survives a real loop
Question: which local model runs a non-trivial agent loop (multi-step tool use, file edits, test runs) without falling apart?
Date: April 2026
Hardware: RTX 4090, 64 GB RAM, WSL2
The task
Fixed prompt: “Add input validation to user.py, write a test, and run the test suite.”
→ Tools available: file-read, file-write, shell (schemas sketched below).
→ Success = test passes AND nothing else breaks.
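For context, a minimal sketch of how the three tools could be exposed to each model, assuming an OpenAI-style function-calling schema. The tool names match the harness above; the parameter shapes and descriptions are illustrative assumptions, not the published code.

```python
# Illustrative tool schemas (OpenAI-style function calling).
# Tool names match the harness; parameter shapes are assumptions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "file-read",
            "description": "Read a file and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "file-write",
            "description": "Write content to a file, creating or overwriting it.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["path", "content"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "shell",
            "description": "Run a shell command and return stdout/stderr.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    },
]
```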
Results
| Model | Quant | Success | Tool-call errors | Time |
|---|---|---|---|---|
| Qwen 3.5 14B | Q4_K_M | ✅ | 0 | 0:42 |
| Qwen 3.5 7B | Q5_K_M | ✅ | 1 | 0:31 |
| Llama 4 8B | Q4_K_M | ⚠️ | 3 | 1:10 |
| Mistral Small 3 | Q4_K_M | ❌ | 5 | — |
| Gemma 3 12B | Q4_K_M | ❌ | 4 | — |
Legend: ✅ = completed correctly, ⚠️ = completed after retry, ❌ = gave up.
Findings
→ Tool-calling reliability > raw reasoning. Qwen 3.5 was trained with tool use in its core instruction data → zero malformed JSON, zero imaginary tool names (see the validation sketch below).
→ WARNING: quantization below Q4 is a trap for agents. Q3 variants degraded tool-call formatting even when general chat felt fine.
→ 7B can be enough. Qwen 3.5 7B handled the task with one retried tool call. For a cheap background agent: acceptable.
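A rough sketch of the kind of check behind the tool-call-error counts: parse the call, verify the tool name exists, verify the arguments are an object. Function and field names here are assumptions, not the actual scoring script.

```python
import json

KNOWN_TOOLS = {"file-read", "file-write", "shell"}

def classify_tool_call(raw: str) -> str:
    """Classify one raw tool-call string (sketch, not the real scorer)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed-json"          # model broke the JSON format
    if not isinstance(call, dict):
        return "malformed-json"          # valid JSON, but not an object
    if call.get("name") not in KNOWN_TOOLS:
        return "unknown-tool"            # model invented a tool name
    if not isinstance(call.get("arguments"), dict):
        return "bad-arguments"           # arguments missing or not an object
    return "ok"
```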
Caveats
! One task, one machine, one day.
! Vibe check, not a paper.
! Your mileage will differ.
Code
All prompts, raw transcripts, scoring script → github.com/patrickgawron/local-agent-benchmarks (placeholder - not published yet).
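Until that repo is up, here is a hypothetical version of the success check described under “The task” (test passes and nothing else breaks), assuming pytest as the test runner; the real scoring script may differ.

```python
import subprocess

def run_passed(workdir: str) -> bool:
    """Success criterion sketch: the full suite passes, so the new test
    works and nothing else broke. Assumes pytest; real script may differ."""
    try:
        result = subprocess.run(
            ["pytest", "-q"],        # run the whole suite, not just the new test
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=600,             # treat a hang as a failure
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0    # pytest exits non-zero on any failure
```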
Lesson: pick the model that was trained for tool use, not the one that scores highest on chat benchmarks. An agent that can’t emit valid JSON is a chatbot in a trench coat.