llama.cpp's speculative decoding framework (--spec-type X) supports three modes.
PR #22673 added MTP as the third. The framework is the same in every case:
propose N candidate tokens, verify them in one parallel forward pass of the
target, accept the longest correct prefix. The mode only determines
how the proposal is made.
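The shared verify step can be sketched in a few lines of Python. This is a minimal illustration, not llama.cpp's code: it assumes greedy decoding (a drafted token is "correct" only if it equals the target's own pick at that position), and the function name `accept_longest_prefix` is hypothetical.

```python
from typing import List, Sequence


def accept_longest_prefix(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Return the tokens emitted by one speculative step (greedy assumption).

    draft:  the N proposed tokens.
    target: the N + 1 tokens the target picks during the single parallel
            verification pass; target[i] is its choice given the context
            followed by draft[:i].
    """
    accepted: List[int] = []
    for proposed, chosen in zip(draft, target):
        if proposed != chosen:
            break  # first mismatch ends the accepted prefix
        accepted.append(proposed)
    # The target's own token at the first mismatch (or after a fully
    # accepted draft) is always valid, so every step emits >= 1 token.
    accepted.append(target[len(accepted)])
    return accepted


if __name__ == "__main__":
    # Two of three drafted tokens match; the third is replaced by the target's pick.
    print(accept_longest_prefix([5, 6, 7], [5, 6, 9, 1]))
```

Whatever the mode, only the origin of `draft` changes; the comparison against the target's output stays the same.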
| Mode | Flag | Draft source | Extra files |
|---|---|---|---|
| Classic draft model | --spec-type draft + --model-draft | A second, smaller GGUF | Yes |
| N-gram cache | --spec-type ngram-cache (or ngram-mod) | N-gram lookup over prompt + recent output | No |
| MTP | --spec-type draft-mtp | An MTP head that shares weights with the target | Embedded in the main GGUF |
Classic draft model: pick a smaller model from the same family (e.g. Qwen 0.5B as draft for a Qwen 32B target).
Each speculative step runs the draft model autoregressively (one full draft
forward pass per proposed token) to produce up to --spec-draft-n-max tokens. The target then verifies them.
Trade-off: high-quality drafts if a good small sibling exists, but you pay draft-compute and VRAM for a second model. Works on any target — no special training needed.
N-gram cache: no model at all. The cache stores n-grams seen in the prompt and in previously generated tokens. When the tail of the recent context matches a cached n-gram prefix, it proposes the stored suffix. Each proposal is effectively free (a hash lookup, microseconds).
Trade-off: zero overhead, but only useful when the output literally repeats prompt vocabulary — RAG extraction, paraphrasing, structured JSON with predictable keys, code that echoes identifiers from imports. Falls apart on synthesis ("who is the narrator?" → ~32% acceptance in our 128K test). Needs no model support.
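The lookup idea can be illustrated with a toy cache in Python. This is a sketch under stated assumptions, not llama.cpp's data structure: the class `NgramCache`, its fixed key length `n`, and the "most recent occurrence wins" policy are all hypothetical simplifications.

```python
from typing import Dict, List, Sequence, Tuple


class NgramCache:
    """Toy n-gram cache: maps an n-token prefix to the tokens that followed it."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.table: Dict[Tuple[int, ...], List[int]] = {}

    def update(self, tokens: Sequence[int], max_suffix: int = 8) -> None:
        """Index every n-gram in `tokens` together with its continuation."""
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            # Later occurrences overwrite earlier ones (most recent wins).
            self.table[key] = list(tokens[i + self.n:i + self.n + max_suffix])

    def propose(self, context: Sequence[int], n_max: int) -> List[int]:
        """Propose up to n_max tokens if the context tail is a known prefix."""
        if len(context) < self.n:
            return []
        key = tuple(context[-self.n:])
        return self.table.get(key, [])[:n_max]


if __name__ == "__main__":
    cache = NgramCache(n=2)
    cache.update([1, 2, 3, 4, 5])         # e.g. the tokenized prompt
    print(cache.propose([9, 1, 2], 2))    # tail (1, 2) was seen: suffix proposed
    print(cache.propose([7, 8], 2))       # unseen tail: nothing to propose
```

When the tail has never been seen, the proposal is empty and the step degrades to ordinary decoding, which is why acceptance depends so heavily on how much the output repeats the prompt.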
MTP (multi-token prediction): the target model is trained with extra projection layers
(nextn_predict_layers) that, given the hidden state at position T, predict
positions T+1, T+2, … directly. One main-model forward pass produces
both the next real token and the speculative draft for the next N
positions in the same call. No separate model, no separate lookup. The MTP
head reuses the target's hidden state, vocabulary, and most weights — it just
adds a thin prediction head.
Trade-off: highest-quality drafts (the head was trained alongside the target
on the same data) and tiny additional VRAM (~500 MB compute buffer in our test),
but requires the model to have been trained with MTP heads. That's why
Unsloth ships a separate -MTP-GGUF repo for Qwen3.6 — the head tensors are
baked into the file.
| Property | Classic draft | N-gram | MTP |
|---|---|---|---|
| Extra weights loaded | Whole draft model (1–7 GB) | None | Just the MTP head (~hundreds of MB) |
| Per-proposal compute | Full draft forward pass | Hash lookup (free) | Few extra matmuls on the main forward |
| Acceptance rate | Depends on draft quality | ~30–90% (workload-dependent) | ~75–90% when target is MTP-trained |
| Hidden cost | VRAM | None | D2H/H2D pre-norm embedding transfer per ubatch — kills prefill at long context |
| nParallel > 1 | Works | Works | Silently disabled |
| --cache-reuse | Works | Works | Silently disabled |
| Setup | Two files | None | One file, must be MTP-trained |
MTP's main constraint is that the model must be MTP-aware: currently Qwen3.6 27B and 35B-A3B,
with more architectures landing as the upstream nextn_predict_layers plumbing matures.
One result from our 128K test illustrates the prefill and acceptance trade-offs:
| | Base | MTP n=2 | ngram-cache n=2 |
|---|---|---|---|
| Prefill tok/s | 3,840 | 2,978 (−22%) | 3,886 (+1%) |
| Decode tok/s | 90 | 95 (+5%) | 75 (−17%) |
| Acceptance | — | 76% | 32% |
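MTP costs you prefill (the D2H/H2D round trip per ubatch). ngram-cache is free on prefill, but its acceptance collapses when the workload isn't repetitive, and at 32% acceptance it decodes slower than running without speculation. Pick by workload: MTP for synthesis-heavy generation on an MTP-trained model, ngram-cache when the output largely repeats the prompt.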
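One way to make that choice concrete is a back-of-the-envelope break-even calculation. The Python sketch below uses only the throughput figures from the 128K table. It assumes a prompt of 128,000 tokens (a placeholder, since the exact prompt length of the test is not given here) and constant prefill and decode rates; the function names are illustrative.

```python
from typing import Optional


def total_seconds(prompt_tokens: float, output_tokens: float,
                  prefill_tps: float, decode_tps: float) -> float:
    """Wall-clock estimate: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def break_even_output_tokens(prompt_tokens: float,
                             base_prefill: float, base_decode: float,
                             spec_prefill: float, spec_decode: float) -> Optional[float]:
    """Output length at which a speculative config catches up with the baseline.

    Returns None when the speculative config does not decode faster than
    the baseline, since it can then never recover a prefill penalty.
    """
    extra_prefill = prompt_tokens / spec_prefill - prompt_tokens / base_prefill
    saved_per_token = 1.0 / base_decode - 1.0 / spec_decode
    if saved_per_token <= 0:
        return None
    return max(0.0, extra_prefill / saved_per_token)


if __name__ == "__main__":
    PROMPT = 128_000  # assumed prompt length, not a measured value
    mtp = break_even_output_tokens(PROMPT, 3840, 90, 2978, 95)
    ngram = break_even_output_tokens(PROMPT, 3840, 90, 3886, 75)
    print(f"MTP break-even: {mtp:.0f} output tokens")
    print(f"ngram-cache break-even: {ngram}")
```

Under these assumptions MTP needs on the order of 16,500 generated tokens to recover its prefill penalty, so for short answers over a very long prompt the baseline finishes first. With the measured 32% acceptance, ngram-cache decodes slower than the baseline and the function returns None.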
In short, MTP is "speculative decoding using a draft head that ships inside the target model's GGUF and was trained alongside it," whereas classic draft-model speculative decoding needs a separate small sibling, and ngram-cache uses no model at all but only works on repetitive output. The verify step is identical across all three.