Tuning Qwen3.6-35B-NVFP4 on a single RTX 5090 — vLLM sweet spot
Short version of today’s vLLM tuning run on the RTX 5090 with RedHatAI/Qwen3.6-35B-A3B-NVFP4:
- 22 configs, 6 iterations, ~4 hours of wall-clock time; 22/22 eventually passed.
- Sweet spot: 96k context · 2 parallel · fp8 KV · mem-util 0.94 → 160.5 tok/s, TTFT 0.56 s (full launch command sketched after the list).
- Counter-intuitive: fp8 KV cache is faster than bf16 at the same shape (+12%). Halving the bytes per cached token lets roughly twice as many tokens stay resident → more prefix-cache hits → better utilization (back-of-envelope math after the list).
- Also, a killer bug: hybrid-Mamba models need --max-num-batched-tokens ≥ 2096 when --kv-cache-dtype fp8 is on. The default of 2048 is 48 tokens short, and the engine dies with a cryptic assertion (the fix is baked into the sketch below).
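
For the copy-paste crowd, here is a minimal launch sketch for the sweet spot. The flag mapping is my reading of the shorthand, not gospel: 96k context → --max-model-len 98304 (96·1024), 2 parallel → --max-num-seqs 2, mem-util 0.94 → --gpu-memory-utilization 0.94. The 2096 floor from the bug above is baked in; any value ≥ 2096 should do.

```bash
# Sweet spot: 96k context, 2 concurrent sequences, fp8 KV, mem-util 0.94.
# --max-num-batched-tokens must be >= 2096 here, or the hybrid-Mamba
# fp8 assertion fires (the default 2048 is 48 tokens short).
vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --max-model-len 98304 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.94 \
  --max-num-batched-tokens 2096
```

Comments sit above the command on purpose: anything after a line-continuation backslash breaks it.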
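
And the back-of-envelope math behind the fp8 paradox. The shapes below (48 attention layers, 8 KV heads, head_dim 128) are illustrative placeholders, not the real Qwen3.6-35B-A3B config; a hybrid model only caches KV for its attention layers, which scales both sides equally, so only the 2× ratio matters.

```bash
# Hypothetical shapes and an arbitrary 8 GiB KV budget.
KV_BUDGET=$(( 8 * 1024 * 1024 * 1024 ))
PER_TOK_BF16=$(( 2 * 48 * 8 * 128 * 2 ))  # K+V * layers * kv_heads * head_dim * 2 bytes
PER_TOK_FP8=$(( 2 * 48 * 8 * 128 * 1 ))   # same shape at 1 byte per element
echo "bf16: $(( KV_BUDGET / PER_TOK_BF16 )) resident tokens"
echo "fp8:  $(( KV_BUDGET / PER_TOK_FP8 )) resident tokens"
```

Twice the resident tokens at the same memory budget is where the extra prefix-cache hits come from.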
The deep dive — three acts of debugging (bash-backslash bug, fp8 paradox, full matrix), eight counter-intuitive findings, three ready-to-copy configs — lives in the article: Tuning Qwen3.6-35B-NVFP4 for vLLM on an RTX 5090.