How well does an RTX 4090 run Qwen3.5-27B? (caveman version)

44 words/sec. 0.26 s wait before the first word. 19 of 20 coding tests right. 18.4 GB used out of 24.

The card runs this AI very well. And it holds a 128k window with 4 chats at the same time.

Setup

  • AI: Qwen3.5-27B, distilled, Q4_K_M (~16.5 GB on disk).
  • Card: RTX 4090, 24 GB.
  • Software: llama.cpp / llama-server. Free, open.
  • Flags: --flash-attn on, q8_0 cache, reasoning on.
  • Test: 20 HumanEval coding problems.
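
Put together, the setup above maps onto one llama-server command. A minimal launch sketch, assuming the GGUF files sit in the current directory, GPU index 0, and the default port; file names here are placeholders for whatever you downloaded.

# example launch; paths, GPU index, and port are assumptions
# (drop --mmproj if you only need text)
  CUDA_VISIBLE_DEVICES=0 llama-server \
    -m ./Qwen3.5-27B-Q4_K_M.gguf \
    --mmproj ./mmproj-f16.gguf \
    --n-gpu-layers 99 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --ctx-size 32768 --parallel 1 \
    --port 8080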

The Hybrid Free-Lunch Rule

Only 16 of 64 layers carry KV cache. The other 48 are state-space layers. So 8× context costs only +4 GB and zero speed loss: 16k × 1 runs at 17.9 GB and 44 words/sec, 128k × 1 runs at 21.7 GB and the same 44 words/sec.
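
A quick sanity check on that claim, using only the measured 1-chat column of the memory table further down; the arithmetic is the whole point, awk just does it.

# context cost from the measured numbers (1-chat column of the memory table)
  awk 'BEGIN { print 21.7 - 17.9 }'                        # 3.8 GB extra for 8x the window (16k -> 128k)
  awk 'BEGIN { print (21.7 - 17.9) * 16 / (128 - 16) }'    # ~0.54 GB per additional 16k of context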

Basic numbers

What               Result
Speed              44 words/sec
Wait               0.26 s
Correct (of 20)    19
Memory             18.4 GB / 24
File size          16.5 GB

Full matrix — 15 configs, all ran

Aggregate words/sec:

ctx ↓ \ chats →      1     2     4
16k                 44    74   111
32k                 44    73    96
64k                 44    74   100
96k                 44    74   118
128k                44    73   118

Memory (GB out of 24):

ctx ↓ \ chats →       1      2      4
16k                17.9   18.1   18.4
64k                19.5   19.7   20.0
96k                20.6   20.8   21.1
128k               21.7   21.9   22.2

Quality stays 90-100% across all 15 configs. Peak: 96k × 4 → 118 words/sec.

Configs

# default
  --ctx-size 32768 --parallel 1

# big context
  --ctx-size 131072 --parallel 1

# peak throughput
  --ctx-size 98304 --parallel 4
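
The 118 words/sec aggregate only shows up when requests really arrive in parallel. A rough sketch against llama-server's OpenAI-compatible endpoint, assuming the server runs with the peak-throughput flags on localhost:8080; the prompt and token budget are placeholders.

# fire 4 chats at the server at once (assumes localhost:8080)
  for i in 1 2 3 4; do
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}], "max_tokens": 256}' &
  done
  # wait for all four; llama-server serves the single loaded model, so no model field is needed
  wait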

Action steps

  1. Pull Qwen3.5-27B...Q4_K_M.gguf + mmproj-f16.gguf.
  2. Pin to 4090: CUDA_VISIBLE_DEVICES=<idx>.
  3. Start with 32k × 1. Confirm 44 words/sec (speed check shown after this list).
  4. Need long context? Switch to 128k × 1. Same speed.
  5. Need throughput? 96k × 4. Get 118 words/sec.
  6. Don’t go past 4 chats. Card hits its math ceiling.
  7. Always --flash-attn on and q8_0 cache.
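
For step 3, the server can report its own speed. A minimal check, assuming the default port and jq installed; the timings block comes from llama-server's native /completion endpoint.

# confirm generation speed straight from the server's timings (should land near 44)
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a function that checks if a number is prime.", "n_predict": 200}' \
    | jq '.timings.predicted_per_second'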

Recap

  • 44 words/sec solo. 118 peak at 96k × 4.
  • 8× context = +4 GB and zero speed loss.
  • Quality stays 90-100% across all 15 tested setups.
  • 128k × 4 fits at 22.2 GB.
  • The card holds the whole map.