How well does an RTX 4090 run Qwen3.6-27B? (caveman version)
43 words/sec. 0.29 s wait. 20 of 20 coding tests right. 19.5 GB used out of 24.
The card runs this AI very well. And it holds 128k window with 4 chats at the same time.
Setup
- AI: Qwen3.6-27B, Unsloth UD-Q4_K_XL (~17 GB on disk).
- Card: RTX 4090, 24 GB.
- Software: llama.cpp / llama-server. Free, open.
- Flags: `--flash-attn on`, `q8_0` KV cache, reasoning on.
- Test: 20 HumanEval coding problems.
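The setup above maps to a single llama-server launch. A minimal sketch; the model filename and device index are assumptions, and the flags follow current llama.cpp builds:

```shell
# Pin to the 4090 and start one 32k chat slot.
# Model path and GPU index are assumptions -- adjust to your machine.
CUDA_VISIBLE_DEVICES=0 ./llama-server \
  -m Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 32768 --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0
```

Reasoning is switched on by the model's chat template, not a server flag.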
The Hybrid Free-Lunch Rule
Only 16 of the 64 layers carry a KV cache. So 8× more context costs only +4 GB and no speed loss.
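The rule checks out with caveman arithmetic. A sketch, assuming 8 KV heads of dim 128 per KV-carrying layer and q8_0 at 17 bytes per 16 values (the head counts are assumptions, not from the post):

```shell
# K + V across 16 KV layers, 8 heads, head_dim 128 -> values stored per token
VALS_PER_TOKEN=$(( 2 * 16 * 8 * 128 ))
# q8_0 packs roughly 17 bytes per 16 values
BYTES_PER_TOKEN=$(( VALS_PER_TOKEN * 17 / 16 ))
# going from 16k context to 128k context
EXTRA_TOKENS=$(( 131072 - 16384 ))
echo "$(( EXTRA_TOKENS * BYTES_PER_TOKEN / 1024 / 1024 )) MiB extra"
```

That lands right on the +3.8 GB the memory table below shows for 16k → 128k at one chat.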
Basic numbers
| What | Result |
|---|---|
| Speed | 43 words/sec |
| Wait | 0.29 s |
| Correct (of 20) | 20 |
| Memory | 19.5 GB / 24 |
Full matrix: 15 configs, all ran
Aggregate words/sec:
| ctx ↓ \ chats → | 1 | 2 | 4 |
|---|---|---|---|
| 16k | 43 | 72 | 122 |
| 32k | 43 | 74 | 97 |
| 64k | 43 | 73 | 100 |
| 96k | 43 | 74 | 122 |
| 128k | 43 | 75 | 97 |
Memory (GB out of 24):
| ctx ↓ \ chats → | 1 | 2 | 4 |
|---|---|---|---|
| 16k | 18.9 | 19.1 | 19.4 |
| 64k | 20.6 | 20.7 | 21.0 |
| 96k | 21.6 | 21.8 | 22.1 |
| 128k | 22.7 | 22.9 | 23.2 |
Quality stays 90-100% across all 15 configs. Peak: 96k × 4 (or 16k × 4) → 122 words/sec.
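Those aggregate numbers hide the per-chat view. A quick check of the peak cell, numbers straight from the table:

```shell
# 122 words/sec total across 4 chats, vs 43 words/sec for one chat
awk 'BEGIN { printf "%.1fx aggregate speedup, %.1f words/sec per chat\n", 122/43, 122/4 }'
```

Each chat gets slower on its own, but the card as a whole does almost 3× the work.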
Configs
```
# default
--ctx-size 32768 --parallel 1

# big context
--ctx-size 131072 --parallel 1

# peak throughput
--ctx-size 98304 --parallel 4
```
Action steps
- Pull `Qwen3.6-27B-UD-Q4_K_XL.gguf` + `mmproj-F16.gguf` from Unsloth.
- Pin to the 4090: `CUDA_VISIBLE_DEVICES=<idx>`.
- Start with 32k × 1. Confirm 43 words/sec.
- Need long context? 128k × 1. Same speed.
- Need throughput? 96k × 4. Get 122 words/sec.
- Don’t go past 4 chats. Card hits its math ceiling.
- Always `--flash-attn on` and `q8_0` cache. Reasoning on for code.
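Before going past 96k or adding chats, watch headroom. A diagnostic one-liner using standard nvidia-smi query flags (needs an NVIDIA driver; output varies by machine):

```shell
# Print used / total VRAM per GPU as plain CSV
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader
```

If used memory is already near 23 GB, don't bump context or chats further.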
Recap
- 43 words/sec solo. 122 peak at 96k × 4.
- 8× context = +4 GB and zero speed loss.
- Perfect score (20/20) on default config.
- 128k × 4 fits at 23.2 GB.
- The card holds the whole map.