How well does an RTX 5090 run Qwen3.6-35B? (caveman version)

105 words a second solo. 160 words a second at the sweet spot. 0.56 s wait. 31.8 GB used out of 32.

The card runs this AI very well.

Setup

  • AI: Qwen3.6-35B-A3B-NVFP4. Mixture-of-experts. 17 GB on disk.
  • Card: RTX 5090, Blackwell, 32 GB.
  • Software: vLLM 0.19.1. Free, open source. PagedAttention scheduler.
  • Quantisation: NVFP4 (Blackwell-native 4-bit float).
  • Context: 96k at the sweet spot, 128k single-stream.

The numbers

WhatResult
Solo (1 chat)105 words/sec, 1.12 s wait
Sweet spot (2 chats, 96k, fp8 cache)160 words/sec, 0.56 s wait
Max context (1 chat, 128k)135 words/sec
Memory used31.8 GB / 32

Sweet-spot recipe

Three flags:

  1. --kv-cache-dtype fp8 — smaller cache, more prefix-cache hits, +12 % speed.
  2. --max-num-seqs 2 — the wide tensor cores need at least 2 streams to stay busy.
  3. --gpu-memory-utilization 0.94 — 94 % of the card. Above breaks. Below wastes.

Config

vllm serve RedHatAI/Qwen3.6-35B-A3B-NVFP4 \
  --served-model-name qwen36 \
  --host 0.0.0.0 --port 10002 \
  --max-model-len 98304 \
  --max-num-seqs 2 \
  --gpu-memory-utilization 0.94 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3

128k mode: --max-model-len 131072 --max-num-seqs 1 → 135 words/sec.

Three surprises

  • fp8 cache is faster than bf16. +12 %. Smaller cache, more hits.
  • One chat wastes 35 % of the card. The 5090 needs at least 2 streams.
  • 96k beats 64k and 128k. The fastest context is in the middle, not at either end.

Two bugs that cost me 90 minutes

  • Dropped backslash in a multi-line shell script. Default flags ran. Out-of-memory crash. Fix: cat -A your launcher.
  • Mamba assertion. Hybrid Mamba-attention models need --max-num-batched-tokens 8192 when fp8 cache is on. Default 2048 is 48 too small.

What to do

  1. Confirm Blackwell card with NVFP4 support.
  2. Install vLLM 0.19.1+.
  3. Pull RedHatAI/Qwen3.6-35B-A3B-NVFP4.
  4. Use the sweet-spot config. Confirm 160 words/sec.
  5. Always set --max-num-batched-tokens 8192 when fp8 cache is on.
  6. Cap --gpu-memory-utilization at 0.94. Never higher.
  7. Never --enforce-eager outside of debugging. 5 times slower.

Recap

  • 5090 + Qwen3.6-35B on vLLM: 105 solo, 160 at the sweet spot.
  • Sweet spot: 96k, fp8 cache, 2 chats, 0.94 memory.
  • Smaller cache faster, one chat wastes a third, middle context wins.
  • The card runs this AI very well.