How well does an RTX 4090 run Qwen3.6-27B?

RTX 4090 runs Qwen3.6-27B at 43 words a second. 0.29 second wait. 20 of 20 coding tasks correct. 19.5 GB used out of 24. Plus a 5×3 matrix showing it holds 128k context with 4 parallel chats.
Visual summary. Source: caveman version.

I asked one question: how well does an RTX 4090 run Qwen3.6-27B?

43 words a second. About a third of a second to wait. 20 of 20 coding problems solved. 19.5 GB of memory used out of 24.

The card runs this AI very well.

The Hybrid Free-Lunch Rule

Most AIs slow down hard when you give them more to read at once. This one does not.

I tested 5 reading-window sizes (16k up to 128k tokens) and 3 numbers of chats at the same time (1, 2, 4). All 15 setups ran. The biggest setup — 128k window with 4 parallel chats — fit on the 24 GB card with 0.8 GB to spare. Speed stayed the same.

Qwen3.6-27B uses a hybrid design where only 16 of its 64 layers carry a memory of past tokens. The other 48 are state-space layers that don’t need that memory. Growing the window grows a small part of the model, not the whole thing.

That’s the rule: on a hybrid model, long context is almost free.

What the words mean

  • The AI: Qwen3.6-27B. The newest free, open AI in this size class. Reasoning-tuned. Strong at code.
  • The card: RTX 4090. 24 GB of fast memory.
  • The software: llama.cpp / llama-server. Free, open. The numbers in this article come from that server. Other software (vLLM, Ollama, LM Studio) gives different numbers.
  • Quantisation: UD-Q4_K_XL from Unsloth. A “dynamic” 4-bit format. About 1 GB bigger than plain Q4_K_M. Buys back a bit of accuracy.
  • Context window: how much the AI can read at once.
  • Parallel chats: how many people can talk to it at the same time.
  • Tokens per second: how fast the AI types. 43 tokens a second is about 30 English words a second.
  • TTFT: the wait before the first letter shows up.
  • pass@1: the AI writes code, the test passes on the first try. 20 of 20 = perfect.

The basic numbers

WhatResult
Speed (single chat)43 words/sec
Wait before first letter0.29 s
Correct answers (out of 20)20
Card memory used19.5 GB out of 24
AI file size on disk~17 GB

20 questions came from a public coding test (HumanEval). Reasoning mode is on, so the AI thinks before answering. That makes its answers longer but more accurate. The score of 20 of 20 confirms it.

The full matrix — context × parallel chats

5 context sizes × 3 parallelism settings = 15 setups. Every single one ran. None crashed.

Words per second (all chats added together):

ctx ↓ \ chats →1 chat2 chats4 chats
16k4372122
32k437497
64k4373100
96k4374122
128k437597

Memory used out of 24 GB:

ctx ↓ \ chats →1 chat2 chats4 chats
16k18.9 GB19.1 GB19.4 GB
32k19.5 GB19.6 GB19.9 GB
64k20.6 GB20.7 GB21.0 GB
96k21.6 GB21.8 GB22.1 GB
128k22.7 GB22.9 GB23.2 GB

Three things that surprise people

1. Eight times more context costs almost nothing

Going from 16k to 128k is eight times more reading. The cost: about 4 GB extra memory and zero speed loss.

2. The card runs out of math, not memory

At 128k with 4 chats, the card uses 23.2 GB. It still has 0.8 GB free — tight, but inside the budget. The bottleneck is the math part, not the memory part. Adding a fifth chat would split the same throughput pie thinner without making it bigger.

3. Quality stays high across all 15 setups

pass@1 lands between 0.90 and 1.00 across the matrix. No slow drop with longer context. The variation is sampling noise from the model’s temperature, not a context problem.

What to do

  1. One person, normal use: 32k context, 1 chat. Cheap, fast, easy.
  2. Hermes-agent / voice agent / long reasoning loop: 64k single chat.
  3. Whole codebase or long research: 128k single chat. Same speed.
  4. Multi-agent fan-out or batch jobs: 96k × 4 chats. Peaks at 122 words a second.
  5. Always turn on --flash-attn on and q8_0 cache.
  6. Keep reasoning on for code work. The score climbs from “good” to “perfect.”

Configs you can copy

llama-server \
  --model ~/.cache/hf-direct/qwen3.6-27b/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --mmproj ~/.cache/hf-direct/qwen3.6-27b/mmproj-F16.gguf \
  --alias qwen36 \
  --host 0.0.0.0 --port 10003 \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --parallel 1 \
  --jinja \
  --reasoning on --reasoning-format none \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --repeat-penalty 1.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --metrics

Big-context: change --ctx-size 32768 to --ctx-size 131072. Multi-agent: also add --parallel 4.

Maxim

On a hybrid AI, long context is almost free. The card holds the whole map.

Recap

  • An RTX 4090 runs Qwen3.6-27B at 43 words a second.
  • 0.29 seconds before the first letter.
  • 20 of 20 coding questions correct.
  • 19.5 GB used at 32k. 23.2 GB at 128k × 4 chats.
  • Hybrid Mamba/Attention design makes long context cheap.
  • Quality stays high across all 15 tested setups.
  • Peak throughput: 122 words a second at 96k × 4 (or 16k × 4).
  • The card runs this AI very well.