How well does an RTX 5090 run Qwen3.6-35B? (caveman version)

Contents

Setup
The numbers
Sweet-spot recipe
Config
Three surprises
Two bugs that cost me 90 minutes
What to do
Recap

105 words a second solo. 160 words a second at the sweet spot. 0.56 s wait. 31.8 GB used out of 32.

The card runs this AI very well.

Setup

AI: Qwen3.6-35B-A3B-NVFP4. Mixture-of-experts. 17 GB on disk.
Card: RTX 5090, Blackwell, 32 GB.
Software: vLLM 0.19.1. Free, open source. PagedAttention scheduler.
Quantisation: NVFP4 (Blackwell-native 4-bit float).
Context: 96k at the sweet spot, 128k single-stream.

The numbers

What	Result
Solo (1 chat)	105 words/sec, 1.12 s wait
Sweet spot (2 chats, 96k, fp8 cache)	160 words/sec, 0.56 s wait
Max context (1 chat, 128k)	135 words/sec
Memory used	31.8 GB / 32

Sweet-spot recipe

Three flags:

--kv-cache-dtype fp8 — smaller cache, more prefix-cache hits, +12 % speed.
--max-num-seqs 2 — the wide tensor cores need at least 2 streams to stay busy.
--gpu-memory-utilization 0.94 — 94 % of the card. Above breaks. Below wastes.

Config

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name qwen36 \
  --host 0.0.0.0 --port 10002 \
  --max-model-len 98304 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.94 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

128k mode: --max-model-len 131072 --max-num-seqs 1 → 135 words/sec.

Three surprises

fp8 cache is faster than bf16. +12 %. Smaller cache, more hits.
One chat wastes 35 % of the card. The 5090 needs at least 2 streams.
96k beats 64k and 128k. The fastest context is in the middle, not at either end.

Two bugs that cost me 90 minutes

Dropped backslash in a multi-line shell script. Default flags ran. Out-of-memory crash. Fix: cat -A your launcher.
Mamba assertion. Hybrid Mamba-attention models need --max-num-batched-tokens 8192 when fp8 cache is on. Default 2048 is 48 too small.

What to do

Confirm Blackwell card with NVFP4 support.
Install vLLM 0.19.1+.
Pull RedHatAI/Qwen3.6-35B-A3B-NVFP4.
Use the sweet-spot config. Confirm 160 words/sec.
Always set --max-num-batched-tokens 8192 when fp8 cache is on.
Cap --gpu-memory-utilization at 0.94. Never higher.
Never --enforce-eager outside of debugging. 5 times slower.

Recap

5090 + Qwen3.6-35B on vLLM: 105 solo, 160 at the sweet spot.
Sweet spot: 96k, fp8 cache, 2 chats, 0.94 memory.
Smaller cache faster, one chat wastes a third, middle context wins.
The card runs this AI very well.