Why Model Architecture Determines Your VRAM Budget More Than Parameter Count
2025-07-10
Elian Freyermuth
qwen2.5:7b: 3,252 MB of peak VRAM. llama3.1:8b: 7,303 MB. Nominally 7B versus 8B parameters, yet more than double the memory. The instinct is to assume a quantization artifact or measurement error. It's neither: both ran Q4_K_M through the same inference stack. The difference is real, and it's architectural.
Parameter count is the number everyone reaches for first when figuring out what will fit on their GPU. It's in the model name, it's the headline figure on every model card. And it's a genuinely bad proxy for VRAM usage once you understand where VRAM actually goes.
Where VRAM Actually Goes¶
VRAM usage during inference breaks down into two distinct buckets: model weights and the KV cache. Most mental models for "will this model fit?" stop at weights, which is why they get it wrong.
Model weights are static. They're loaded once from disk into VRAM and stay there for the lifetime of the inference session. With Q4_K_M quantization, you're storing weights at approximately 4.5 bits per parameter on average — roughly 0.56 bytes per parameter.
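As a rough sketch of that arithmetic, the weight footprint can be estimated from parameter count and average bits per parameter. The function name and the round 7-billion parameter count are illustrative assumptions, not benchmark figures, and a real GGUF file also carries metadata and keeps some tensors at higher precision.

```python
def estimate_weight_bytes(num_params: float, bits_per_param: float = 4.5) -> float:
    """Approximate weight storage in bytes for a given average bit width."""
    return num_params * bits_per_param / 8


# Hypothetical round figure: a model with exactly 7 billion parameters.
weights = estimate_weight_bytes(7e9)
print(f"{weights / 1e9:.2f} GB")  # 3.94 GB
```

The 4.5 bits default mirrors the Q4_K_M average quoted above; swap in a different value to compare other quantization levels.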
Qwen and Llama differ in more than parameter count. Qwen 2.5 7B has 28 layers and a hidden size of 3,584. Llama 3.1 8B has 32 layers and a hidden size of 4,096. Those structural differences make Qwen's weights lighter. Even so, the weight delta alone doesn't explain the full 2.25× VRAM gap (7,303 MB / 3,252 MB). A share of the rest lives in the second bucket, and it is the share that grows with context.
The KV cache is dynamic. It grows with every token you process, and it persists across the full context window while the model is loaded. It exists because of how transformer attention works: to compute attention for each new token, the model needs the key and value projections of every previous token in the context. Rather than recompute them from scratch on every forward pass, the inference engine caches them. Faster, but it costs VRAM — and that cost scales directly with context length.
The KV Cache Under Multi-Head Attention¶
In standard Multi-Head Attention (MHA), every attention head maintains its own separate key and value matrices. The size of the KV cache for a single token at a single layer is:
per_token_per_layer = 2 × num_kv_heads × head_dim × bytes_per_element
The 2 is for K and V separately. head_dim is hidden_size / num_attention_heads. In llama.cpp with GGUF models, the KV cache is stored in FP16 by default, regardless of weight quantization, so bytes_per_element is 2.
Scale that across the full context and all layers:
KV cache total = context_length × num_layers × 2 × num_kv_heads × head_dim × 2
For llama3.1:8b at 8,192 tokens:
8192 × 32 layers × 2 × 8 KV heads × 128 head_dim × 2 bytes
= 8192 × 32 × 4096
= 1,073,741,824 bytes
≈ 1,024 MB
Per token, per layer, that's 2 × 8 × 128 × 2 = 4,096 bytes — 4 KB. Multiplied across 32 layers and 8,192 context tokens, it adds up fast.
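The same calculation can be written as a small function. This is a minimal sketch of the formula above, with a hypothetical function name; it counts only the K and V tensors and ignores any other runtime buffers.

```python
def kv_cache_bytes(context_length: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_element: int = 2) -> int:
    """KV cache size in bytes: K and V for every token at every layer."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_element
    return context_length * num_layers * per_token_per_layer


# llama3.1:8b values used in the worked example above.
llama_cache = kv_cache_bytes(context_length=8192, num_layers=32,
                             num_kv_heads=8, head_dim=128)
print(llama_cache)           # 1073741824
print(llama_cache // 2**20)  # 1024 (MB counted as 2**20 bytes)
```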
The key point is that this scales linearly with context length. Double the context, double the KV cache. At 4,096 tokens you're at ~512 MB. At 16,384 tokens you'd be at ~2,048 MB — just for the cache. On a 12GB GPU that's not a rounding error anymore.
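Because the relationship is linear, it can also be inverted: given a VRAM allowance for the cache, how many tokens fit? The sketch below assumes an FP16 cache and uses a hypothetical 2 GiB budget chosen only for illustration.

```python
def max_context_tokens(kv_budget_bytes: int, num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int = 2) -> int:
    """Largest context length whose KV cache fits in the given byte budget."""
    bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * bytes_per_element
    return kv_budget_bytes // bytes_per_token


# Hypothetical budget: 2 GiB of VRAM set aside for the cache.
budget = 2 * 2**30
print(max_context_tokens(budget, num_layers=32, num_kv_heads=8, head_dim=128))  # 16384
```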
Under vanilla MHA, num_kv_heads equals num_attention_heads: every query head gets its own K and V. That was the standard for a long time, in GPT-2 and the original LLaMA among others. It works fine at short contexts. At long contexts it becomes expensive.
What GQA Changes¶
Grouped-Query Attention (GQA) was introduced to attack exactly this problem. The idea is simple: instead of one KV head per query head, you have multiple query heads share a single KV head. The query heads stay the same — you don't lose expressive power on the attention side. You just stop maintaining redundant KV state for heads that can share it.
The ratio between query heads and KV heads is the lever. Here's what it looks like across the three models I benchmarked:
| Model | Q heads | KV heads | Ratio | Attention type |
|---|---|---|---|---|
| qwen2.5:7b | 28 | 4 | 7:1 | GQA (aggressive) |
| llama3.1:8b | 32 | 8 | 4:1 | GQA (moderate) |
| mistral:7b | 32 | 8 | 4:1 | GQA (moderate) |
Llama 3.1 and Mistral both use GQA at a 4:1 ratio. To make that concrete: a pure MHA version of llama would have num_kv_heads = num_attention_heads = 32 — one KV head per query head. GQA at 8 KV heads is already a 4× reduction in KV cache size from that baseline. Qwen 2.5 goes further: 7 query heads per KV head, a 7× reduction from the MHA baseline.
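The reduction factor is just the ratio of query heads to KV heads. A small sketch, assuming the comparison is against an MHA model with the same layer count and head_dim; the function name is made up for this example.

```python
def kv_reduction_vs_mha(num_attention_heads: int, num_kv_heads: int) -> float:
    """How many times smaller the KV cache is than MHA with the same query heads."""
    if num_attention_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV head groups")
    return num_attention_heads / num_kv_heads


for name, q_heads, kv_heads in [("qwen2.5:7b", 28, 4),
                                ("llama3.1:8b", 32, 8),
                                ("mistral:7b", 32, 8)]:
    print(f"{name}: {kv_reduction_vs_mha(q_heads, kv_heads):.0f}x smaller than MHA")
```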
Now run the same KV cache calculation for qwen2.5:7b at 8,192 tokens:
8192 × 28 layers × 2 × 4 KV heads × 128 head_dim × 2 bytes
= 8192 × 28 × 2048
= 469,762,048 bytes
≈ 448 MB
Llama: ~1,024 MB. Qwen: ~448 MB. That's a roughly 2.3× difference from the KV cache alone, at 8,192 tokens.
And qwen has an additional structural advantage: 28 layers instead of 32, and a smaller hidden size (3,584 vs 4,096). That makes the weights lighter too. The VRAM gap compounds across both buckets — qwen is smaller on weights and dramatically smaller on KV cache.
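The measured gap in the benchmark, 3,252 MB vs 7,303 MB (about 2.25×), points the same way as these numbers. In absolute terms the KV cache difference at 8,192 tokens is under 600 MB of a roughly 4,050 MB gap, so lighter weights and runtime allocations have to account for the rest. None of it is mysterious. It's the predictable result of architectural choices that are documented in the model configs.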
The Benchmark as Evidence¶
The KV cache math is clean on paper. The benchmark numbers confirm it in practice.
qwen2.5:7b held 26.00 tok/s at 8,192 tokens versus 28.92 tok/s at 512 tokens, a 10% drop across the full context range. llama3.1:8b dropped by a similar 10% (27.96 to 25.26 tok/s). Both degrade slowly and predictably. The M40 used for these runs has no FlashAttention, so attention runs without that kernel-level optimization, but at 8,192 tokens with a small KV cache, the cost stays manageable.
deepseek-r1:7b is the contrast case. It dropped 41% — from 28.17 tok/s at 512 tokens to 16.72 tok/s at 8,192. What makes this interesting is that the Ollama deepseek-r1:7b tag pulls the DeepSeek-R1-Distill-Qwen-7B model — a distillation of R1's reasoning capability onto the Qwen2 architecture. Same config as qwen2.5:7b: 28 query heads, 4 KV heads, 28 layers. Identical GQA ratio.
So the architecture is the same. The KV cache per token is the same. The difference is usage pattern. DeepSeek-R1 generates extended chain-of-thought reasoning traces internally — <think>...</think> blocks that can run for hundreds of tokens before the model produces any visible output. Those traces consume context aggressively. By 8,192 tokens, the effective context is full of thinking, the KV cache is at its limit, and without FlashAttention there's nothing to soften the quadratic attention cost. The cliff isn't from having a larger KV cache — it's from filling a normal-sized KV cache much faster than a standard chat model would.
That distinction matters. GQA reduces the cost per token. It doesn't change how many tokens the model needs to do its job.
What's happening in these three curves isn't random variance. It's KV cache pressure, quantified. A cache that fills slowly → gradual degradation at long context. A cache that fills quickly + no attention optimization → sharp cliff.
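The percentage drops quoted in this section can be reproduced from the raw throughput figures. A minimal sketch; the helper name is arbitrary and the numbers are the tok/s values reported above.

```python
def percent_drop(short_ctx_tps: float, long_ctx_tps: float) -> float:
    """Throughput lost between the short and long context runs, in percent."""
    return (short_ctx_tps - long_ctx_tps) / short_ctx_tps * 100


# (tok/s at 512 tokens, tok/s at 8,192 tokens)
results = {
    "qwen2.5:7b": (28.92, 26.00),
    "llama3.1:8b": (27.96, 25.26),
    "deepseek-r1:7b": (28.17, 16.72),
}
for model, (tps_512, tps_8192) in results.items():
    print(f"{model}: {percent_drop(tps_512, tps_8192):.1f}% slower at 8,192 tokens")
```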
The VRAM numbers track the same story. qwen2.5:7b at 3,252 MB leaves substantial headroom on a 12GB GPU even after accounting for the KV cache at longer contexts. llama3.1:8b at 7,303 MB is already at 63% capacity at the context lengths I tested — push to 16K or 32K and you're in trouble on a 12GB card.
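To get a feel for what longer contexts would cost, the measured peak can be shifted by the change in KV cache size alone. This is a back-of-the-envelope projection, not a measurement. It assumes the 7,303 MB peak corresponds to the 8,192-token run, that MB means 2^20 bytes, and that everything other than the KV cache stays constant (in practice compute buffers grow too).

```python
MIB = 2**20


def kv_cache_mib(context_length: int, num_layers: int = 32,
                 num_kv_heads: int = 8, head_dim: int = 128) -> float:
    """FP16 KV cache size in MiB; defaults are the llama3.1:8b values."""
    return context_length * num_layers * 2 * num_kv_heads * head_dim * 2 / MIB


def projected_peak_mib(measured_peak_mib: float, measured_ctx: int,
                       target_ctx: int) -> float:
    """Shift a measured peak by the change in KV cache size only (an assumption)."""
    return measured_peak_mib + kv_cache_mib(target_ctx) - kv_cache_mib(measured_ctx)


for ctx in (16384, 32768):
    peak = projected_peak_mib(7303, 8192, ctx)
    print(f"{ctx} tokens: ~{peak:.0f} MB projected")
```

Under those assumptions the projection lands at roughly 8,327 MB for 16K and 10,375 MB for 32K, before any growth in other buffers.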
What This Actually Means for Model Selection¶
"Will this model fit on my GPU?" is the wrong first question. Or rather, it's incomplete without two follow-ups: at what context length, and what's the KV head count?
Context length is the multiplier. At short contexts (512 or 1,024 tokens), weight size dominates and parameter count is a reasonable proxy. A 7B model at Q4_K_M is roughly 4 GB of weights plus maybe 50–100 MB of KV cache. Fine. At long contexts (8K, 16K, 32K), the KV cache becomes a significant second term. For a model with 8 KV heads and 32 layers at 16,384 tokens, you're looking at ~2 GB of KV cache on top of the weights. On a 12GB GPU that is a large slice of the budget once runtime overhead is counted.
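One way to see the second term take over is to compute the KV cache's share of weights plus cache at different context lengths. The sketch below takes the "roughly 4 GB" weight figure as 4,096 MB, assumes an 8-KV-head, 32-layer model with an FP16 cache, and leaves runtime overhead out entirely.

```python
def kv_share(weights_mib: float, context_length: int, num_layers: int = 32,
             num_kv_heads: int = 8, head_dim: int = 128) -> float:
    """Fraction of (weights + KV cache) taken up by the KV cache."""
    kv_mib = context_length * num_layers * 2 * num_kv_heads * head_dim * 2 / 2**20
    return kv_mib / (weights_mib + kv_mib)


for ctx in (512, 8192, 16384, 32768):
    print(f"{ctx:>6} tokens: KV cache is {kv_share(4096, ctx):.0%} of weights + cache")
```

At 512 tokens the cache is a rounding error; by 32,768 tokens it matches the weights.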
KV head count is what actually controls that second term. It's in config.json for every model on HuggingFace — the field is num_key_value_heads. Takes ten seconds to check.
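Reading the field takes a few lines once the config is on disk or in memory. The sketch below parses a trimmed, hand-written JSON string rather than a real download. The fallback to num_attention_heads when the field is missing is an assumption that such configs describe plain MHA.

```python
import json


def kv_heads_from_config(config_text: str) -> int:
    """Return the KV head count from a config.json payload."""
    config = json.loads(config_text)
    # Assumption: a config that omits num_key_value_heads uses plain MHA,
    # so the KV head count equals the query head count.
    return config.get("num_key_value_heads", config["num_attention_heads"])


# Trimmed example with the qwen2.5:7b values listed in the appendix;
# a real config.json has many more fields.
sample = '{"num_attention_heads": 28, "num_key_value_heads": 4, "num_hidden_layers": 28}'
print(kv_heads_from_config(sample))  # 4
```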
A rough framework for 12GB VRAM at Q4_K_M:
| Context | KV heads | Approximate KV cache | Fits comfortably? |
|---|---|---|---|
| 8K | 4 (qwen-style) | ~450 MB | Yes |
| 8K | 8 (llama/mistral-style) | ~1,024 MB | Yes, but tighter |
| 16K | 4 | ~900 MB | Yes |
| 16K | 8 | ~2,048 MB | Tight for 7-8B models |
| 32K | 8 | ~4,096 MB | Dangerous for 7-8B models |
These are approximations assuming 28–32 layers and head_dim of 128 — typical for 7-8B models. Actual VRAM also includes runtime overhead, buffers, and compute graphs. The absolute numbers will vary; the relative comparisons hold.
One more thing worth checking: not every model uses GQA. The original LLaMA used MHA. Some fine-tunes of older base models inherit that. Multi-Query Attention (MQA) is the extreme end — a single KV head shared across all queries — used in some older efficient models. GQA sits in between and is now the standard for most new releases. If you're running something from 2022 or early 2023, don't assume.
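The three variants can be told apart from the two head counts alone. A minimal sketch with a hypothetical helper name:

```python
def attention_type(num_attention_heads: int, num_kv_heads: int) -> str:
    """Label the attention variant implied by the query and KV head counts."""
    if num_attention_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    if num_kv_heads == num_attention_heads:
        return "MHA"  # one KV head per query head
    if num_kv_heads == 1:
        return "MQA"  # a single KV head shared by every query head
    return "GQA"      # several query heads per KV head


print(attention_type(32, 32))  # MHA
print(attention_type(32, 8))   # GQA
print(attention_type(32, 1))   # MQA
```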
The Numbers Don't Lie¶
Parameter count became the default shorthand because it's easy and it's on the label. "7B model" communicates something real — roughly what task complexity the model can handle, roughly what quality to expect. As a capability signal it's fine.
As a VRAM signal it's incomplete at best and misleading at worst. The M40 benchmark put two models with essentially the same parameter count side by side and showed a 2.25× VRAM gap. The part of that gap that grows with context is the KV cache, and the KV cache is controlled by num_key_value_heads. That field is sitting in every model's config.json, one API call away.
The Qwen 2.5 team made a deliberate choice when they went with 4 KV heads at 28 query heads. At long contexts on constrained hardware, that choice wins — smaller KV cache, more headroom, better sustained throughput. Llama and Mistral at 8 KV heads carry more KV state and pay for it at longer contexts. That tradeoff might be worth it for quality reasons in some settings, but if you're running on a 12GB GPU and care about context length, qwen's architecture is the better fit. The math is unambiguous.
Appendix: KV Cache Reference Calculations¶
All calculations assume an FP16 KV cache (the llama.cpp default) and the head counts from the official model configs. Weight quantization (Q4_K_M here) does not change the cache size. MB means 2^20 bytes throughout.
Formula¶
KV cache (bytes) = context_length × num_layers × 2 × num_kv_heads × head_dim × 2
Where head_dim = hidden_size / num_attention_heads.
Model configs used¶
| Field | qwen2.5:7b | llama3.1:8b | mistral:7b |
|---|---|---|---|
| num_attention_heads | 28 | 32 | 32 |
| num_key_value_heads | 4 | 8 | 8 |
| hidden_size | 3,584 | 4,096 | 4,096 |
| num_hidden_layers | 28 | 32 | 32 |
| head_dim | 128 | 128 | 128 |
Sources: Qwen2.5 and Mistral configs from HuggingFace model repos. Llama 3.1 8B hyperparameters from the Llama 3 technical report.
KV cache at selected context lengths¶
| Model | 2,048 tokens | 4,096 tokens | 8,192 tokens | 16,384 tokens |
|---|---|---|---|---|
| qwen2.5:7b | 112 MB | 224 MB | 448 MB | 896 MB |
| llama3.1:8b | 256 MB | 512 MB | 1,024 MB | 2,048 MB |
| mistral:7b | 256 MB | 512 MB | 1,024 MB | 2,048 MB |
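For anyone who wants to regenerate the table or add a model, here is a short sketch that applies the formula to the configs listed above. It prints exact values in MB counted as 2^20 bytes; the dictionary layout and function name are choices made for this example.

```python
MIB = 2**20

# (num_hidden_layers, num_key_value_heads, head_dim) from the config table above.
configs = {
    "qwen2.5:7b": (28, 4, 128),
    "llama3.1:8b": (32, 8, 128),
    "mistral:7b": (32, 8, 128),
}
contexts = [2048, 4096, 8192, 16384]


def kv_cache_mib(context_length: int, num_layers: int,
                 num_kv_heads: int, head_dim: int) -> int:
    """FP16 KV cache size in whole MiB (2 tensors, 2 bytes per element)."""
    return context_length * num_layers * 2 * num_kv_heads * head_dim * 2 // MIB


table = {name: [kv_cache_mib(ctx, *cfg) for ctx in contexts]
         for name, cfg in configs.items()}
for name, row in table.items():
    print(name, " | ".join(f"{mib:,} MB" for mib in row))
```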