Local agent benchmarks - which model survives a real loop

Question: which local model runs a non-trivial agent loop (multi-step tool use, file edits, test runs) without falling apart?

Date: April 2026
Hardware: RTX 4090, 64 GB RAM, WSL2

The task

→ Fixed prompt: “Add input validation to user.py, write a test, and run the test suite.”
→ Tools available: file-read, file-write, shell (schemas sketched below).
→ Success = the new test passes AND nothing else breaks.
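For concreteness, here is roughly what those three tools look like when exposed as OpenAI-style function-calling schemas, plus a dispatcher that executes them. The names, parameters, and timeout here are my assumptions for illustration; the actual harness is in the (not yet published) repo.

```python
# Sketch of the three tools exposed to the model, in OpenAI-style
# function-calling schema. Tool names, parameter shapes, and the
# 120-second shell timeout are assumptions, not the exact harness.
import subprocess
from pathlib import Path

TOOLS = [
    {"type": "function", "function": {
        "name": "file_read",
        "description": "Return the contents of a file in the repo",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "file_write",
        "description": "Overwrite a file with new contents",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
    {"type": "function", "function": {
        "name": "shell",
        "description": "Run a shell command and return stdout + stderr",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]

def dispatch(name: str, args: dict) -> str:
    """Execute one tool call from the model and return its result as text."""
    if name == "file_read":
        return Path(args["path"]).read_text()
    if name == "file_write":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "shell":
        proc = subprocess.run(args["command"], shell=True,
                              capture_output=True, text=True, timeout=120)
        return proc.stdout + proc.stderr
    # an unknown (hallucinated) tool name counts as a tool-call error
    raise ValueError(f"unknown tool: {name}")
```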

Results

| Model           | Quant  | Success | Tool-call errors | Time |
|-----------------|--------|---------|------------------|------|
| Qwen 3.5 14B    | Q4_K_M | ✅      | 0                | 0:42 |
| Qwen 3.5 7B     | Q5_K_M | ⚠️      | 1                | 0:31 |
| Llama 4 8B      | Q4_K_M | ⚠️      | 3                | 1:10 |
| Mistral Small 3 | Q4_K_M | ❌      | 5                | n/a  |
| Gemma 3 12B     | Q4_K_M | ❌      | 4                | n/a  |

Legend: ✅ = completed correctly, ⚠️ = completed after retry, ❌ = gave up.

Findings

→ Tool-calling reliability > raw reasoning. Qwen 3.5 was trained with tool use in its core instruction data → zero malformed JSON, zero imaginary tool names (how those errors are counted is sketched below).
→ WARNING: quantization below Q4 is a trap for agents. Q3 variants degraded tool-call formatting even when general chat felt fine.
→ 7B can be enough. Qwen 3.5 7B handled the task with one retry. For a cheap background agent: acceptable.
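A minimal sketch of how a malformed tool call gets counted and retried, assuming the harness hands over the model's tool name and raw JSON arguments as strings (function and variable names here are hypothetical):

```python
# Malformed JSON and hallucinated tool names both increment the error
# list that feeds the "Tool-call errors" column. Names are assumptions.
import json

KNOWN_TOOLS = {"file_read", "file_write", "shell"}
MAX_RETRIES = 2  # assumed retry budget per tool call

def parse_tool_call(raw_name: str, raw_args: str, errors: list[str]):
    """Validate one tool call emitted by the model; return (name, args) or None."""
    if raw_name not in KNOWN_TOOLS:
        errors.append(f"imaginary tool name: {raw_name}")
        return None
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        errors.append(f"malformed JSON arguments: {exc}")
        return None
    return raw_name, args
```

On a None result, the error message goes back to the model as the tool output and the turn is retried, up to MAX_RETRIES times before the run is marked ❌. At least, that is how I would wire it; the exact retry policy is an assumption.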

Caveats

! One task, one machine, one day.
! Vibe check, not a paper.
! Your mileage will differ.

Code

All prompts, raw transcripts, scoring script → github.com/patrickgawron/local-agent-benchmarks (placeholder - not published yet).
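Until the repo is up, this is the rough shape of the success check, under two assumptions of mine: “test passes” means pytest exits 0 on the full suite, and “nothing else breaks” means the git diff touches only user.py and the tests directory.

```python
# Hypothetical scoring script: success = full suite passes AND the
# agent's changes stayed inside user.py and tests/. Paths and the
# containment rule are assumptions, not the published scorer.
import subprocess

def suite_passes(repo_dir: str) -> bool:
    """Run the whole test suite; exit code 0 means nothing is broken."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def diff_is_contained(repo_dir: str) -> bool:
    """Reject runs where the agent edited anything beyond user.py and tests/."""
    diff = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                          cwd=repo_dir, capture_output=True, text=True)
    changed = [p for p in diff.stdout.splitlines() if p]
    return all(p == "user.py" or p.startswith("tests/") for p in changed)

def score(repo_dir: str) -> bool:
    return suite_passes(repo_dir) and diff_is_contained(repo_dir)
```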

Lesson: pick the model that was trained for tool use, not the one that scores highest on chat benchmarks. An agent that can’t emit valid JSON is a chatbot in a trench coat.