How well does an RTX 4090 run Qwen3.5-27B?

[Visual summary: 44 tokens/s, 0.26 s to first token, 19 of 20 coding tasks correct, 18.4 GB of 24 GB used, plus a 5×3 matrix showing 128k context with 4 parallel chats. Source: caveman version.]

I asked one question: how well does an RTX 4090 run Qwen3.5-27B?

44 tokens a second, which is about 30 English words. A quarter of a second wait. 19 of 20 coding problems solved. 18.4 GB of memory used out of 24.

The card runs this AI very well.

The Hybrid Free-Lunch Rule

Most AIs slow down hard when you give them more to read at once. This one does not.

I tested 5 reading-window sizes (16k up to 128k tokens) and 3 numbers of chats at the same time (1, 2, 4). All 15 setups ran. The biggest setup — 128k window with 4 parallel chats — fit on the 24 GB card with room to spare. Speed stayed the same.

This is rare. Most models charge a heavy memory tax for long context. Qwen3.5-27B uses a hybrid design: only 16 of its 64 layers are attention layers that keep a memory of past tokens (the KV cache). The other 48 are Mamba-style layers that need no such memory. So growing the window grows a small part of the model's memory use, not the whole thing.

That’s the rule: on a hybrid model, long context is almost free.
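You can check the size of the tax yourself. A back-of-envelope sketch, using only the single-chat column of the memory table below:

# Cache cost per extra token: (21.7 GB at 128k) minus (17.9 GB at 16k),
# spread over the extra 131072 - 16384 = 114,688 tokens of context.
echo "scale=1; (21.7 - 17.9) * 1024 * 1024 / (131072 - 16384)" | bc
# about 34.7 KB of cache per token. Tiny, because only 16 of 64 layers keep one.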

What the words mean

  • The AI: Qwen3.5-27B. A free, open AI that is good at writing code and answering questions. The “27B” means it has 27 billion knobs that can be tuned.
  • The card: RTX 4090. A computer chip with 24 GB of fast memory. Made for video games. Also great at AI math.
  • The software: llama.cpp, free and open. The numbers in this article come from its built-in llama-server. Other software (vLLM, Ollama, LM Studio) gives different numbers.
  • Quantisation: Q4_K_M. A way of squashing the AI from huge to about 16.5 GB so it fits on the card. You lose tiny bits of accuracy. You gain speed.
  • Context window: how much the AI can read at once. 32k tokens is roughly 22,000 English words. 128k is roughly 90,000 words, about a full novel.
  • Parallel chats: more than one person talking to the AI at the same time. Like one teacher helping two students at once.
  • Tokens per second: how fast the AI types. 44 tokens a second is about 30 English words a second.
  • TTFT: the wait before the first letter shows up. 0.26 seconds is fast. (The sketch after this list shows a quick way to measure it.)
  • pass@1: the AI writes code, the code runs, the test passes. 19 of 20 = 95% correct.
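A minimal sketch for measuring the wait, assuming the server from the configs section below is already running on port 10003. curl's time_starttransfer is the time to the first streamed byte, which arrives with the first token:

curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s\n' \
  http://localhost:10003/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen35", "stream": true, "messages": [{"role": "user", "content": "Say hi."}]}'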

The basic numbers

What                           Result
Speed (single chat)            44 tokens/sec
Wait before first letter       0.26 s
Correct answers (out of 20)    19
Card memory used               18.4 GB out of 24
AI file size on disk           16.5 GB

The 20 questions came from a public coding test called HumanEval. Each question gives the AI a coding problem. The AI writes a function. Tests run against it. Passing means the function works on the first try.
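To make pass@1 concrete, here is a made-up task in the same style (not one of the real 20). The AI writes the function body, the test runs, and passing on the first generated attempt scores the point:

# Illustrative HumanEval-style check. Task and test are invented for this article.
python3 - <<'EOF'
def running_max(nums):
    # the AI writes this body: return the rolling maximum of the list
    out, best = [], float('-inf')
    for n in nums:
        best = max(best, n)
        out.append(best)
    return out

# the test: one assert, pass or fail on the first try
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
print("pass")
EOF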

The full matrix — context × parallel chats

I pushed both knobs to their limits. 5 context sizes × 3 parallelism settings = 15 setups. Every one ran. None crashed.
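In flag form, the grid is just two loops. A sketch that prints the 15 combinations (16k through 128k, in tokens):

for ctx in 16384 32768 65536 98304 131072; do
  for par in 1 2 4; do
    echo "--ctx-size $ctx --parallel $par"
  done
done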

Tokens per second (all chats added together):

ctx ↓ \ chats →    1 chat    2 chats    4 chats
16k                44        74         111
32k                44        73         96
64k                44        74         100
96k                44        74         118
128k               44        73         118

Memory used out of 24 GB:

ctx ↓ \ chats →    1 chat     2 chats    4 chats
16k                17.9 GB    18.1 GB    18.4 GB
32k                18.4 GB    18.6 GB    18.9 GB
64k                19.5 GB    19.7 GB    20.0 GB
96k                20.6 GB    20.8 GB    21.1 GB
128k               21.7 GB    21.9 GB    22.2 GB

Three things that surprise people

1. Eight times more context costs almost nothing

Going from 16k to 128k is eight times more reading. The cost: about 4 GB extra memory and zero speed loss. Most AIs would crash or slow to a crawl.

2. The card runs out of math, not memory

At 128k with 4 chats the card uses 22.2 GB. It still has 1.8 GB free. That free space is not the bottleneck. The math part of the card hits its top speed at 4 chats. Adding more chats would split the same pie thinner without baking a bigger pie.
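The throughput table makes the pie concrete: total speed grows from 44 to 118 as chats go from 1 to 4, which is only 2.7 times the pie for 4 times the eaters. Each chat's slice shrinks:

# Per-chat slice at 128k, straight from the throughput table above:
echo "scale=1; 118 / 4" | bc   # 29.5 tokens/sec per chat, down from 44 with 1 chat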

3. Quality stays flat across all 15 setups

pass@1 ranges from 0.90 to 1.00 across the matrix. There is no slow drop as the window grows. With only 20 questions, one flipped answer moves the score by a full 0.05, so the whole observed range is two answers' worth of sampling noise (the AI rolls dice when picking each token, so reruns differ slightly). The model does not “forget” facts in long contexts the way most models do.

What to do

  1. One person, normal use: start with 32k context and 1 chat. Cheap, fast, easy.
  2. Long files or long agent loops: 64k or 96k single chat. Same speed, more room.
  3. Whole codebase or long research: 128k single chat. Speed unchanged.
  4. Multi-agent fan-out or batch jobs: 96k × 4 chats. Peaks at 118 tokens a second.
  5. Always run with --flash-attn on and the q8_0 cache. Free memory wins.
  6. Don’t run more than 4 chats. The card hits its math ceiling there.

Configs you can copy

llama-server \
  --model ~/models/qwen35-distilled/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled.i1-Q4_K_M.gguf \
  --mmproj ~/models/qwen35-distilled/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled.mmproj-f16.gguf \
  --alias qwen35 \
  --host 0.0.0.0 --port 10003 \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --parallel 1 \
  --jinja \
  --reasoning on --reasoning-format none \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --repeat-penalty 1.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --metrics

Big-context: change --ctx-size 32768 to --ctx-size 131072. Multi-agent: also add --parallel 4.
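Once the server is up, two built-in endpoints make a quick sanity check (the second exists because of the --metrics flag above):

# Is the model loaded and answering?
curl -s http://localhost:10003/health
# Prometheus-style counters: tokens predicted, requests processing, and so on.
curl -s http://localhost:10003/metrics | head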

Maxim

On a hybrid AI, long context is almost free. The card holds the whole map.

Recap

  • An RTX 4090 runs Qwen3.5-27B at 44 tokens a second (about 30 English words).
  • 0.26 seconds before the first letter.
  • 19 of 20 coding questions correct.
  • 18.4 GB used at 32k. 22.2 GB used at 128k × 4 chats.
  • Hybrid Mamba/Attention design makes long context cheap.
  • Quality stays flat across all 15 tested setups.
  • Peak throughput: 118 tokens a second at 96k × 4 chats.
  • The card runs this AI very well.