Local agent benchmarks - which model survives a real loop

Question: which local model runs a non-trivial agent loop (multi-step tool use, file edits, test runs) without falling apart?

Date: April 2026
Hardware: RTX 4090, 64 GB RAM, WSL2

The task

→ Fixed prompt: “Add input validation to user.py, write a test, and run the test suite.”
→ Tools available: file-read, file-write, shell (schemas sketched below).
→ Success = the new test passes AND nothing else breaks.
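For concreteness, here is roughly what those three tools look like when exposed as OpenAI-style function-calling schemas, plus a dispatcher that executes them. The names, parameters, and timeout here are my assumptions for illustration; the actual harness is in the (not yet published) repo.

```python
# Sketch of the three tools exposed to the model, in OpenAI-style
# function-calling schema. Tool names, parameter shapes, and the
# 120-second shell timeout are assumptions, not the exact harness.
import subprocess
from pathlib import Path

TOOLS = [
    {"type": "function", "function": {
        "name": "file_read",
        "description": "Return the contents of a file in the repo",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "file_write",
        "description": "Overwrite a file with new contents",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "content": {"type": "string"}},
                       "required": ["path", "content"]}}},
    {"type": "function", "function": {
        "name": "shell",
        "description": "Run a shell command and return stdout + stderr",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]

def dispatch(name: str, args: dict) -> str:
    """Execute one tool call from the model and return its result as text."""
    if name == "file_read":
        return Path(args["path"]).read_text()
    if name == "file_write":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "shell":
        proc = subprocess.run(args["command"], shell=True,
                              capture_output=True, text=True, timeout=120)
        return proc.stdout + proc.stderr
    # an unknown (hallucinated) tool name counts as a tool-call error
    raise ValueError(f"unknown tool: {name}")
```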

Results

| Model           | Quant  | Success | Tool-call errors | Time |
|-----------------|--------|---------|------------------|------|
| Qwen 3.5 14B    | Q4_K_M | ✅      | 0                | 0:42 |
| Qwen 3.5 7B     | Q5_K_M | ⚠️      | 1                | 0:31 |
| Llama 4 8B      | Q4_K_M | ⚠️      | 3                | 1:10 |
| Mistral Small 3 | Q4_K_M | ❌      | 5                | n/a  |
| Gemma 3 12B     | Q4_K_M | ❌      | 4                | n/a  |

Legend: ✅ = completed correctly, ⚠️ = completed after retry, ❌ = gave up.

Findings

→ Tool-calling reliability > raw reasoning. Qwen 3.5 was trained with tool use in its core instruction data → zero malformed JSON, zero imaginary tool names (how those errors are counted is sketched below).
→ WARNING: quantization below Q4 is a trap for agents. Q3 variants degraded tool-call formatting even when general chat felt fine.
→ 7B can be enough. Qwen 3.5 7B handled the task with one retry. For a cheap background agent: acceptable.
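A minimal sketch of how a malformed tool call gets counted and retried, assuming the harness hands over the model's tool name and raw JSON arguments as strings (function and variable names here are hypothetical):

```python
# Malformed JSON and hallucinated tool names both increment the error
# list that feeds the "Tool-call errors" column. Names are assumptions.
import json

KNOWN_TOOLS = {"file_read", "file_write", "shell"}
MAX_RETRIES = 2  # assumed retry budget per tool call

def parse_tool_call(raw_name: str, raw_args: str, errors: list[str]):
    """Validate one tool call emitted by the model; return (name, args) or None."""
    if raw_name not in KNOWN_TOOLS:
        errors.append(f"imaginary tool name: {raw_name}")
        return None
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        errors.append(f"malformed JSON arguments: {exc}")
        return None
    return raw_name, args
```

On a None result, the error message goes back to the model as the tool output and the turn is retried, up to MAX_RETRIES times before the run is marked ❌. At least, that is how I would wire it; the exact retry policy is an assumption.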

Caveats

! One task, one machine, one day.
! Vibe check, not a paper.
! Your mileage will differ.

Code

All prompts, raw transcripts, scoring script → github.com/patrickgawron/local-agent-benchmarks (placeholder - not published yet).
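Until the repo is up, this is the rough shape of the success check, under two assumptions of mine: “test passes” means pytest exits 0 on the full suite, and “nothing else breaks” means the git diff touches only user.py and the tests directory.

```python
# Hypothetical scoring script: success = full suite passes AND the
# agent's changes stayed inside user.py and tests/. Paths and the
# containment rule are assumptions, not the published scorer.
import subprocess

def suite_passes(repo_dir: str) -> bool:
    """Run the whole test suite; exit code 0 means nothing is broken."""
    result = subprocess.run(["python", "-m", "pytest", "-q"],
                            cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def diff_is_contained(repo_dir: str) -> bool:
    """Reject runs where the agent edited anything beyond user.py and tests/."""
    diff = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                          cwd=repo_dir, capture_output=True, text=True)
    changed = [p for p in diff.stdout.splitlines() if p]
    return all(p == "user.py" or p.startswith("tests/") for p in changed)

def score(repo_dir: str) -> bool:
    return suite_passes(repo_dir) and diff_is_contained(repo_dir)
```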

Lesson: pick the model that was trained for tool use, not the one that scores highest on chat benchmarks. An agent that can’t emit valid JSON is a chatbot in a trench coat.