How well does an RTX 4090 run Qwen3.6-27B? (caveman version)

43 words/sec. 0.29 s wait. 20 of 20 coding tests right. 19.5 GB used out of 24.

The card runs this AI very well. And it holds a 128k window with 4 chats at the same time.

Setup

  • AI: Qwen3.6-27B, Unsloth UD-Q4_K_XL (~17 GB on disk).
  • Card: RTX 4090, 24 GB.
  • Software: llama.cpp / llama-server. Free, open.
  • Flags: --flash-attn on, q8_0 cache, reasoning on.
  • Test: 20 HumanEval coding problems.
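
Want to poke the setup yourself? llama-server speaks the OpenAI chat API. Small sketch below, assuming the server from the Configs section is up on its default port 8080; the prompt is a stand-in, not a real HumanEval item.

  # One coding request against llama-server (default port 8080).
  # "model" can be any name here; the server has one model loaded.
  import json
  import urllib.request

  payload = {
      "model": "qwen",
      "messages": [{"role": "user",
                    "content": "Write a Python function that reverses a string."}],
      "max_tokens": 512,
  }
  req = urllib.request.Request(
      "http://localhost:8080/v1/chat/completions",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["choices"][0]["message"]["content"])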

The Hybrid Free-Lunch Rule

Only 16 of 64 layers carry a KV cache. So 8× context (16k → 128k) costs only +4 GB and zero speed loss: 43 words/sec either way.
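
Quick math check below. The 16-of-64 number comes from this post. The KV width (8 heads × 128 dims) and the q8_0 size are guesses for illustration, not published specs.

  # Rough KV-cache math behind the free-lunch rule.
  # From the post: 16 of 64 layers carry KV.
  # Guessed for illustration: 8 KV heads x 128 head_dim, q8_0 cache.
  KV_LAYERS = 16
  KV_DIM = 8 * 128          # assumed kv_heads * head_dim
  BYTES_PER_ELEM = 8.5 / 8  # q8_0 is ~8.5 bits per element

  def kv_gb(ctx: int) -> float:
      # K and V each hold KV_DIM values per token per KV layer.
      return KV_LAYERS * 2 * ctx * KV_DIM * BYTES_PER_ELEM / 1e9

  print(f"{kv_gb(16_384):.2f} GB at 16k")    # ~0.57 GB
  print(f"{kv_gb(131_072):.2f} GB at 128k")  # ~4.56 GB -> +4 GB for 8x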

Basic numbers

What               Result
Speed              43 words/sec
Wait               0.29 s
Correct (of 20)    20
Memory             19.5 GB / 24

Full matrix — 15 configs, all ran

Aggregate words/sec:

ctx ↓ \ chats →     1     2     4
16k                43    72   122
32k                43    74    97
64k                43    73   100
96k                43    74   122
128k               43    75    97

Memory (GB out of 24):

ctx ↓ \ chats →     1     2     4
16k              18.9  19.1  19.4
64k              20.6  20.7  21.0
96k              21.6  21.8  22.1
128k             22.7  22.9  23.2

Quality stays 90-100% across all 15 configs. Peak: 96k × 4 (or 16k × 4) → 122 words/sec.
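
Want to watch the memory column yourself while the server runs? One stdlib call to nvidia-smi does it. A sketch; assumes nvidia-smi is on PATH.

  # Print VRAM use while the server runs (compare to the table above).
  import subprocess
  print(subprocess.run(
      ["nvidia-smi", "--query-gpu=memory.used,memory.total",
       "--format=csv,noheader"],
      capture_output=True, text=True).stdout)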

Configs

# default
llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 32768 --parallel 1

# big context
llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 131072 --parallel 1

# peak throughput
llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 98304 --parallel 4
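
After launch, confirm the speed (step 3 below) without a stopwatch. The native /completion endpoint reports timings in its reply. A sketch; field names are from current llama.cpp and may shift between versions.

  # Speed check: one request, read tokens/sec from the reply.
  import json
  import urllib.request

  payload = {"prompt": "Write hello world in C.", "n_predict": 256}
  req = urllib.request.Request(
      "http://localhost:8080/completion",
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      t = json.load(resp)["timings"]
  print(f"{t['predicted_per_second']:.1f} tokens/sec")  # expect ~43 solo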

Action steps

  1. Pull Qwen3.6-27B-UD-Q4_K_XL.gguf + mmproj-F16.gguf from Unsloth.
  2. Pin to 4090: CUDA_VISIBLE_DEVICES=<idx>.
  3. Start with 32k × 1. Confirm 43 words/sec.
  4. Need long context? 128k × 1. Same speed.
  5. Need throughput? 96k × 4. Get 122 words/sec. Check it with the sketch below.
  6. Don’t go past 4 chats. Card hits its math ceiling.
  7. Always --flash-attn on and q8_0 cache. Reasoning on for code.
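
The throughput claim in step 5, checked the blunt way: fire 4 chats at once, divide tokens by wall time. Same assumed endpoint and timing fields as the sketch above.

  # Aggregate throughput: 4 chats at once against --parallel 4.
  import json
  import time
  import urllib.request
  from concurrent.futures import ThreadPoolExecutor

  def one_chat(i: int) -> int:
      payload = {"prompt": f"Task {i}: write a sorting function.",
                 "n_predict": 256}
      req = urllib.request.Request(
          "http://localhost:8080/completion",
          data=json.dumps(payload).encode(),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)["timings"]["predicted_n"]  # tokens generated

  start = time.time()
  with ThreadPoolExecutor(max_workers=4) as pool:
      total = sum(pool.map(one_chat, range(4)))
  print(f"{total / (time.time() - start):.0f} tokens/sec aggregate")  # ~122 at 96k x 4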

Recap

  • 43 words/sec solo. 122 peak at 96k × 4.
  • 8× context = +4 GB and zero speed loss.
  • Perfect score (20/20) on default config.
  • 128k × 4 fits at 23.2 GB.
  • The card holds the whole map.