How well does an RTX 4090 run Qwen3.5-27B? (caveman version)
44 words/sec. 0.26 s wait. 19 of 20 coding tests right. 18.4 GB used out of 24.
The card runs this AI very well, and it holds a 128k window with 4 chats at the same time.
Setup
- AI: Qwen3.5-27B, distilled, Q4_K_M (~16.5 GB on disk).
- Card: RTX 4090, 24 GB.
- Software: llama.cpp / llama-server. Free, open.
- Flags: `--flash-attn on`, `q8_0` cache, reasoning on.
- Test: 20 HumanEval coding problems.
The Hybrid Free-Lunch Rule
Only 16 of 64 layers carry KV. The other 48 are state-space. So 8× context costs only +4 GB and zero speed loss.
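Rough math, as a sketch. The 16 KV layers and the +3.8 GB memory jump come from the numbers in this post; the 8 KV heads, 128 head dim, and ~1 byte per q8_0 element are assumptions, not published specs for this model.

```python
# Back-of-envelope KV-cache size. Sketch only; head counts are assumptions.
ATTN_LAYERS = 16        # from the post: 16 of 64 layers keep a KV cache
KV_HEADS = 8            # assumption
HEAD_DIM = 128          # assumption
BYTES_PER_ELEM = 1      # q8_0 cache ~ 1 byte per element (scales ignored)

def kv_gb(ctx_tokens: int) -> float:
    """Approximate KV-cache size in GB for a given total context."""
    per_token = 2 * ATTN_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM  # K + V
    return ctx_tokens * per_token / 1024**3

print(kv_gb(16_384))    # ~0.5 GB
print(kv_gb(131_072))   # ~4.0 GB -> close to the measured +3.8 GB jump
```

If all 64 layers carried KV, multiply by four: ~16 GB of cache at 128k. That would not fit next to the 16.5 GB of weights. That is the free lunch.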
Basic numbers
| What | Result |
|---|---|
| Speed | 44 words/sec |
| Wait | 0.26 s |
| Correct (of 20) | 19 |
| Memory | 18.4 GB / 24 |
| File size | 16.5 GB |
Full matrix — 15 configs, all ran
Aggregate words/sec:
| ctx ↓ \ chats → | 1 | 2 | 4 |
|---|---|---|---|
| 16k | 44 | 74 | 111 |
| 32k | 44 | 73 | 96 |
| 64k | 44 | 74 | 100 |
| 96k | 44 | 74 | 118 |
| 128k | 44 | 73 | 118 |
Memory (GB out of 24):
| ctx ↓ \ chats → | 1 | 2 | 4 |
|---|---|---|---|
| 16k | 17.9 | 18.1 | 18.4 |
| 64k | 19.5 | 19.7 | 20.0 |
| 96k | 20.6 | 20.8 | 21.1 |
| 128k | 21.7 | 21.9 | 22.2 |
Quality stays 90-100% across all 15. Peak: 96k × 4 → 118 words/sec aggregate, about 30 per chat.
Configs
```
# default
--ctx-size 32768 --parallel 1

# big context
--ctx-size 131072 --parallel 1

# peak throughput
--ctx-size 98304 --parallel 4
```
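To script a launch, something like the sketch below works. The model filename and port are assumptions layered on the post's flags; the long-form `--cache-type-k/-v q8_0` flags are how the `q8_0` cache is set in llama.cpp.

```python
# Launch the peak-throughput config (96k ctx x 4 chats) as a subprocess.
# Sketch only: model filename and port are assumptions, adjust for your setup.
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.5-27B-Q4_K_M.gguf",   # assumed local filename
    "--ctx-size", "98304",
    "--parallel", "4",
    "--flash-attn", "on",
    "--cache-type-k", "q8_0",          # the post's "q8_0 cache"
    "--cache-type-v", "q8_0",
    "--port", "8080",
]
subprocess.run(cmd, check=True)        # blocks while the server runs
```

The `mmproj-f16.gguf` file from the action steps is the vision projector; the coding test does not need it.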
Action steps
- Pull `Qwen3.5-27B...Q4_K_M.gguf` + `mmproj-f16.gguf`.
- Pin to the 4090: `CUDA_VISIBLE_DEVICES=<idx>`.
- Start with 32k × 1. Confirm 44 words/sec.
- Need long context? Switch to 128k × 1. Same speed.
- Need throughput? 96k × 4. Get 118 words/sec. Client sketch below.
- Don't go past 4 chats. Card hits its math ceiling.
- Always `--flash-attn on` and `q8_0` cache.
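The 4-chat numbers mean four requests hitting the server at once. A minimal client sketch using the OpenAI-compatible endpoint that llama-server exposes; the port, prompt, and token count are assumptions.

```python
# Fire 4 chats at llama-server at once and measure aggregate words/sec.
# Sketch only: port 8080 and the prompt are assumptions.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # llama-server OpenAI-style endpoint

def one_chat(i: int) -> int:
    """Send one chat request and return how many tokens came back."""
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": f"Write Python function #{i} that reverses a string."}],
        "max_tokens": 256,
    }, timeout=600)
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    tokens = sum(pool.map(one_chat, range(4)))
print(f"aggregate: {tokens / (time.time() - start):.0f} tok/s")
```

Aggregate goes up with more chats; each single chat gets slower than solo.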
Recap
- 44 words/sec solo. 118 peak at 96k × 4.
- 8× context = +4 GB and zero speed loss.
- Quality stays 90-100 % across all 15 tested setups.
- 128k × 4 fits at 22.2 GB.
- The card holds the whole map.