How well does an RTX 5090 run Qwen3.6-35B? (caveman version)
105 words a second solo. 160 words a second at the sweet spot. 0.56 s wait. 31.8 GB used out of 32.
The card runs this AI very well.
Setup
- AI: Qwen3.6-35B-A3B-NVFP4. Mixture-of-experts. 17 GB on disk.
- Card: RTX 5090, Blackwell, 32 GB.
- Software: vLLM 0.19.1. Free, open source. PagedAttention scheduler.
- Quantisation: NVFP4 (Blackwell-native 4-bit float).
- Context: 96k at the sweet spot, 128k single-stream.
The numbers
| What | Result |
|---|---|
| Solo (1 chat) | 105 words/sec, 1.12 s wait |
| Sweet spot (2 chats, 96k, fp8 cache) | 160 words/sec, 0.56 s wait |
| Max context (1 chat, 128k) | 135 words/sec |
| Memory used | 31.8 GB / 32 |
Sweet-spot recipe
Three flags:
--kv-cache-dtype fp8— smaller cache, more prefix-cache hits, +12 % speed.--max-num-seqs 2— the wide tensor cores need at least 2 streams to stay busy.--gpu-memory-utilization 0.94— 94 % of the card. Above breaks. Below wastes.
Config
vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
--served-model-name qwen36 \
--host 0.0.0.0 --port 10002 \
--max-model-len 98304 \
--max-num-seqs 2 \
--gpu-memory-utilization 0.94 \
--kv-cache-dtype fp8 \
--max-num-batched-tokens 8192 \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3
128k mode: --max-model-len 131072 --max-num-seqs 1 → 135 words/sec.
Three surprises
- fp8 cache is faster than bf16. +12 %. Smaller cache, more hits.
- One chat wastes 35 % of the card. The 5090 needs at least 2 streams.
- 96k beats 64k and 128k. The fastest context is in the middle, not at either end.
Two bugs that cost me 90 minutes
- Dropped backslash in a multi-line shell script. Default flags ran. Out-of-memory crash. Fix:
cat -Ayour launcher. - Mamba assertion. Hybrid Mamba-attention models need
--max-num-batched-tokens 8192when fp8 cache is on. Default 2048 is 48 too small.
What to do
- Confirm Blackwell card with NVFP4 support.
- Install vLLM 0.19.1+.
- Pull
RedHatAI/Qwen3.6-35B-A3B-NVFP4. - Use the sweet-spot config. Confirm 160 words/sec.
- Always set
--max-num-batched-tokens 8192when fp8 cache is on. - Cap
--gpu-memory-utilizationat 0.94. Never higher. - Never
--enforce-eageroutside of debugging. 5 times slower.
Recap
- 5090 + Qwen3.6-35B on vLLM: 105 solo, 160 at the sweet spot.
- Sweet spot: 96k, fp8 cache, 2 chats, 0.94 memory.
- Smaller cache faster, one chat wastes a third, middle context wins.
- The card runs this AI very well.