How well does an RTX 4090 run Qwen3.5-27B? (caveman version)

44 words/sec. 0.26 s wait before the first word. 19 of 20 coding tests right. 18.4 GB used out of 24.

The card runs this AI very well. And it holds a 128k window with 4 chats at the same time.

Setup

  • AI: Qwen3.5-27B, distilled, Q4_K_M (~16.5 GB on disk).
  • Card: RTX 4090, 24 GB.
  • Software: llama.cpp / llama-server. Free, open.
  • Flags: --flash-attn on, q8_0 cache, reasoning on.
  • Test: 20 HumanEval coding problems.
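
Put together, the setup above maps onto one llama-server command. A minimal launch sketch, assuming the GGUF files sit in the current directory, GPU index 0, and the default port; file names here are placeholders for whatever you downloaded.

# example launch; paths, GPU index, and port are assumptions
# (drop --mmproj if you only need text)
  CUDA_VISIBLE_DEVICES=0 llama-server \
    -m ./Qwen3.5-27B-Q4_K_M.gguf \
    --mmproj ./mmproj-f16.gguf \
    --n-gpu-layers 99 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --ctx-size 32768 --parallel 1 \
    --port 8080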

The Hybrid Free-Lunch Rule

Only 16 of 64 layers carry KV cache. The other 48 are state-space layers. So 8× context costs only +4 GB and zero speed loss: 16k × 1 runs at 17.9 GB and 44 words/sec, 128k × 1 runs at 21.7 GB and the same 44 words/sec.
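
A quick sanity check on that claim, using only the measured 1-chat column of the memory table further down; the arithmetic is the whole point, awk just does it.

# context cost from the measured numbers (1-chat column of the memory table)
  awk 'BEGIN { print 21.7 - 17.9 }'                        # 3.8 GB extra for 8x the window (16k -> 128k)
  awk 'BEGIN { print (21.7 - 17.9) * 16 / (128 - 16) }'    # ~0.54 GB per additional 16k of context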

Basic numbers

What               Result
Speed              44 words/sec
Wait               0.26 s
Correct (of 20)    19
Memory             18.4 GB / 24
File size          16.5 GB

Full matrix — 15 configs, all ran

Aggregate words/sec:

ctx ↓ \ chats →      1     2     4
16k                 44    74   111
32k                 44    73    96
64k                 44    74   100
96k                 44    74   118
128k                44    73   118

Memory (GB out of 24):

ctx ↓ \ chats →       1      2      4
16k                17.9   18.1   18.4
64k                19.5   19.7   20.0
96k                20.6   20.8   21.1
128k               21.7   21.9   22.2

Quality stays 90-100% across all 15 configs. Peak: 96k × 4 → 118 words/sec.

Configs

# default
  --ctx-size 32768 --parallel 1

# big context
  --ctx-size 131072 --parallel 1

# peak throughput
  --ctx-size 98304 --parallel 4
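
The 118 words/sec aggregate only shows up when requests really arrive in parallel. A rough sketch against llama-server's OpenAI-compatible endpoint, assuming the server runs with the peak-throughput flags on localhost:8080; the prompt and token budget are placeholders.

# fire 4 chats at the server at once (assumes localhost:8080)
  for i in 1 2 3 4; do
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}], "max_tokens": 256}' &
  done
  # wait for all four; llama-server serves the single loaded model, so no model field is needed
  wait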

Action steps

  1. Pull Qwen3.5-27B...Q4_K_M.gguf + mmproj-f16.gguf.
  2. Pin to 4090: CUDA_VISIBLE_DEVICES=<idx>.
  3. Start with 32k × 1. Confirm 44 words/sec (speed check shown after this list).
  4. Need long context? Switch to 128k × 1. Same speed.
  5. Need throughput? 96k × 4. Get 118 words/sec.
  6. Don’t go past 4 chats. Card hits its math ceiling.
  7. Always --flash-attn on and q8_0 cache.
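
For step 3, the server can report its own speed. A minimal check, assuming the default port and jq installed; the timings block comes from llama-server's native /completion endpoint.

# confirm generation speed straight from the server's timings (should land near 44)
  curl -s http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Write a function that checks if a number is prime.", "n_predict": 200}' \
    | jq '.timings.predicted_per_second'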

Recap

  • 44 words/sec solo. 118 peak at 96k × 4.
  • 8× context = +4 GB and zero speed loss.
  • Quality stays 90-100% across all 15 tested setups.
  • 128k × 4 fits at 22.2 GB.
  • The card holds the whole map.