Speculative decoding in llama.cpp — MTP vs the others

llama.cpp's speculative decoding framework (--spec-type X) supports three modes. PR #22673 added MTP as the third. The framework is the same in every case: propose N candidate tokens, verify them in one parallel forward pass of the target, accept the longest correct prefix. The mode only determines how the proposal is made.
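
The shared propose/verify/accept loop can be sketched in a few lines. This is a minimal illustration assuming greedy sampling, not llama.cpp's actual code: `verify_draft` and `toy_target` are hypothetical names, and the per-position calls to `target_next_token` stand in for what a real implementation computes in one batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens to append after one speculative step (greedy case)."""
    accepted: List[int] = []
    for proposed in draft:
        # What the target itself would emit at this position.
        expected = target_next_token(list(context) + accepted)
        if proposed != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)
    # Whole draft accepted: the verify pass also yields one extra token.
    accepted.append(target_next_token(list(context) + accepted))
    return accepted


def toy_target(tokens: Sequence[int]) -> int:
    """Hypothetical stand-in for the target model: previous token + 1."""
    return tokens[-1] + 1


full = verify_draft([1, 2, 3], [4, 5], toy_target)     # both drafts match
partial = verify_draft([1, 2, 3], [4, 9], toy_target)  # second draft is wrong
```

Whatever the draft source, every step emits at least one token that the target would have produced anyway, so output quality does not depend on draft quality; only speed does.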

Mode	Flag	Draft source	Extra files
Classic draft model	`--spec-type draft` + `--model-draft`	A second, smaller GGUF	Yes
N-gram cache	`--spec-type ngram-cache` (or `ngram-mod`)	N-gram lookup over prompt + recent output	No
MTP	`--spec-type draft-mtp`	An MTP head that shares weights with the target	Embedded in the main GGUF
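
As a rough illustration of how the three modes differ at launch time, the helper below assembles an argument list. It is a sketch under assumptions: the `--spec-type`, `--model-draft` and `--spec-draft-n-max` flags are copied from this note rather than checked against a particular build, the assumption that `--spec-draft-n-max` applies to every mode is mine, and `build_args` plus the file names are hypothetical.

```python
from typing import List, Optional


def build_args(
    model: str,
    spec_type: Optional[str] = None,
    draft_model: Optional[str] = None,
    draft_n_max: Optional[int] = None,
) -> List[str]:
    """Assemble a llama-server argument list for one speculative mode."""
    args = ["llama-server", "-m", model]
    if spec_type is None:
        return args  # plain base decoding, no speculation
    args += ["--spec-type", spec_type]
    if spec_type == "draft":
        # Only the classic mode needs a second GGUF.
        if draft_model is None:
            raise ValueError("--spec-type draft needs a draft model")
        args += ["--model-draft", draft_model]
    if draft_n_max is not None:
        args += ["--spec-draft-n-max", str(draft_n_max)]
    return args


# Hypothetical file names, for illustration only.
classic = build_args("target.gguf", "draft", draft_model="small.gguf", draft_n_max=2)
ngram = build_args("target.gguf", "ngram-cache", draft_n_max=2)
mtp = build_args("target-mtp.gguf", "draft-mtp", draft_n_max=2)
```

Only the classic mode takes a second file; the MTP mode differs from the base command by the flag alone, provided the GGUF already contains the head tensors.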

Classic draft model

Pick a smaller model from the same family (e.g. Qwen 0.5B as draft for a Qwen 32B target). Each speculative step runs the draft model autoregressively, one forward pass per drafted token, to produce up to --spec-draft-n-max tokens. The target then verifies them in a single pass.

Trade-off: high-quality drafts if a good small sibling exists, but you pay draft-compute and VRAM for a second model. Works on any target — no special training needed.

N-gram cache

No model at all. The cache stores n-grams seen in the prompt and in previously generated tokens. When the tail of the recent context matches a cached n-gram prefix, its suffix is proposed as the draft. Each proposal is effectively free (a hash lookup, on the order of microseconds).

Trade-off: near-zero proposal cost, but only useful when the output literally repeats token sequences from the prompt: RAG extraction, paraphrasing, structured JSON with predictable keys, code that echoes identifiers from imports. Falls apart on synthesis ("who is the narrator?" → ~32% acceptance in our 128K test, where decode ran 17% slower than base). Needs no model support.
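
The lookup itself can be sketched as follows. This is a simplified illustration of the idea, not the data structure llama.cpp uses: `build_ngram_cache` and `propose` are hypothetical names, and the cache is rebuilt on every call for brevity where a real implementation would update it incrementally.

```python
from typing import Dict, List, Sequence, Tuple


def build_ngram_cache(tokens: Sequence[int], n: int) -> Dict[Tuple[int, ...], int]:
    """Map each n-token key to the index just after its latest occurrence."""
    cache: Dict[Tuple[int, ...], int] = {}
    # Stop before the final n tokens so the context tail never matches itself.
    for i in range(len(tokens) - n):
        cache[tuple(tokens[i:i + n])] = i + n
    return cache


def propose(tokens: Sequence[int], n: int, max_draft: int) -> List[int]:
    """Draft up to max_draft tokens by replaying what followed the last n tokens before."""
    if len(tokens) < n:
        return []
    cache = build_ngram_cache(tokens, n)
    start = cache.get(tuple(tokens[-n:]))  # a single hash lookup
    if start is None:
        return []  # nothing to propose: this step is ordinary decoding
    return list(tokens[start:start + max_draft])


repeating = propose([10, 11, 12, 13, 10, 11], n=2, max_draft=2)
novel = propose([1, 2, 3, 4], n=2, max_draft=2)
```

In the first call the tail `(10, 11)` was seen earlier, so the draft replays `[12, 13]`; in the second the tail is new and no draft is produced. That asymmetry is the workload sensitivity described above.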

MTP — multi-token prediction

The target model is trained with extra projection layers (nextn_predict_layers) that, given the hidden state at position T, predict positions T+1, T+2, … directly. One main-model forward pass produces both the next real token and the speculative draft for the next N positions in the same call. No separate model, no separate lookup. The MTP head reuses the target's hidden state, vocabulary, and most weights — it just adds a thin prediction head.

Trade-off: highest-quality drafts (the head was trained alongside the target on the same data) and tiny additional VRAM (~500 MB compute buffer in our test), but requires the model to have been trained with MTP heads. That's why Unsloth ships a separate -MTP-GGUF repo for Qwen3.6 — the head tensors are baked into the file.

How they differ operationally

Property	Classic draft	N-gram	MTP
Extra weights loaded	Whole draft model (1–7 GB)	None	Just the MTP head (~hundreds of MB)
Per-proposal compute	Full draft forward pass	Hash lookup (free)	Few extra matmuls on the main forward
Acceptance rate	Depends on draft quality	~30–90% (workload-dependent)	~75–90% when target is MTP-trained
Hidden cost	VRAM	None	D2H/H2D pre-norm embedding transfer per ubatch; slows prefill at long context (−22% in our 128K test)
`nParallel > 1`	Works	Works	Silently disabled
`--cache-reuse`	Works	Works	Silently disabled
Setup	Two files	None	One file, must be MTP-trained
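
To see why the acceptance-rate row matters so much, here is a back-of-the-envelope model of tokens emitted per verify pass. It assumes every drafted token is accepted independently with the same probability, which is a simplification, and the acceptance figure a given build reports may be defined differently (for example accepted divided by drafted). `expected_tokens_per_step` is a hypothetical helper.

```python
def expected_tokens_per_step(acceptance: float, n_draft: int) -> float:
    """Expected tokens emitted per verify pass: 1 + a + a^2 + ... + a^n.

    The first k drafts are all accepted with probability a^k, and the target
    always contributes one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(acceptance ** k for k in range(n_draft + 1))


# Acceptance values taken from the ranges quoted in this note, with n = 2.
for a in (0.32, 0.76, 0.90):
    print(f"acceptance={a:.2f}  tokens/step={expected_tokens_per_step(a, 2):.2f}")
```

With two drafted tokens this gives roughly 1.42, 2.34 and 2.71 tokens per pass. The model ignores the cost of producing the draft and of verifying the extra positions, so these are upper bounds on the decode speedup, not predictions; at low acceptance the verification overhead can outweigh the gain.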

Why MTP feels new even though "speculative decoding" already existed

The head is trained, not bolted on. Classic drafts use a separately trained model; MTP heads are trained jointly with the target on the same tokens. This drives the >70% acceptance rates we saw — far above what an arbitrary small draft would manage.
It's free in setup. Until MTP, "use speculative decoding" meant either picking a draft model carefully (and managing two GGUF files) or accepting ngram-cache's workload sensitivity. With MTP-trained models, you flip one flag — no second model, near-best-case acceptance.

The cost is that the model must be MTP-aware: Qwen3.6 27B and 35B-A3B currently, with more architectures landing as upstream nextn_predict_layers plumbing matures.

MTP and ngram fail in opposite directions

One result from our 128K test illustrates this:

	Base	MTP n=2	ngram-cache n=2
Prefill tok/s	3,840	2,978 (−22%)	3,886 (+1%)
Decode tok/s	90	95 (+5%)	75 (−17%)
Acceptance	—	76%	32%

MTP costs you prefill (the D2H roundtrip). ngram is free on prefill but its acceptance collapses when the workload isn't repetitive. Pick by workload:

Structured output, generation-heavy → MTP
Prompt-echoing output (extraction, RAG, paraphrase) → ngram-cache
Balanced or unknown → MTP if you have an MTP-trained model, plain base otherwise

One sentence

MTP is "speculative decoding using a draft head that ships inside the target model's GGUF and was trained alongside it," whereas classic draft-model speculative decoding needs a separate small sibling, and ngram-cache uses no model at all but only helps on repetitive output. The verify step is identical across all three.