Speculative decoding in llama.cpp — MTP vs the others

llama.cpp's speculative decoding framework (--spec-type X) supports three modes. PR #22673 added MTP as the third. The framework is the same in every case: propose N candidate tokens, verify them in one parallel forward pass of the target, and accept the longest prefix that matches what the target itself would have produced. The mode only determines how the proposal is made.
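The shared verify step can be sketched in a few lines. This is a minimal illustration, assuming greedy sampling; the function name and the shape of target_choices are hypothetical and not taken from llama.cpp's code.

```python
from typing import List


def verify_greedy(draft: List[int], target_choices: List[int]) -> List[int]:
    """Return the tokens emitted by one verify step (greedy sampling assumed).

    draft: the N proposed tokens.
    target_choices: N + 1 tokens from the single parallel target pass, where
        target_choices[i] is the target's own pick given the context plus
        draft[:i]. The last entry is the pick after the full draft.
    """
    emitted = []
    for proposed, chosen in zip(draft, target_choices):
        if proposed != chosen:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(chosen)
            return emitted
        emitted.append(proposed)
    # Whole draft matched: the same pass also yields one extra token.
    emitted.append(target_choices[len(draft)])
    return emitted


# Draft of 3 tokens, the target disagrees at the third position.
print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))
```

Under this assumption every verify step emits at least one token, so a bad draft wastes compute but never changes the output.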

| Mode | Flag | Draft source | Extra files |
| --- | --- | --- | --- |
| Classic draft model | --spec-type draft + --model-draft | A second, smaller GGUF | Yes |
| N-gram cache | --spec-type ngram-cache (or ngram-mod) | N-gram lookup over prompt + recent output | No |
| MTP | --spec-type draft-mtp | An MTP head that shares weights with the target | Embedded in the main GGUF |

Classic draft model

Pick a smaller model from the same family (e.g. Qwen 0.5B as draft for a Qwen 32B target). Each speculative step runs the draft model autoregressively, one full draft forward pass per proposed token, to produce up to --spec-draft-n-max tokens. Then the target verifies them.
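As a rough sketch of the proposal side, the loop below calls a draft model once per proposed token. The draft_next callable is a hypothetical stand-in for a forward pass of the small model, and the toy model exists only to make the example runnable.

```python
from typing import Callable, List


def propose_draft(
    draft_next: Callable[[List[int]], int],
    context: List[int],
    n_max: int,
) -> List[int]:
    """Propose up to n_max tokens by running the draft model autoregressively."""
    draft: List[int] = []
    for _ in range(n_max):
        # One draft forward pass per proposed token.
        token = draft_next(context + draft)
        draft.append(token)
    return draft


def toy_draft(tokens: List[int]) -> int:
    """Toy stand-in for a draft model: predicts last token + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


print(propose_draft(toy_draft, [1, 2, 3], n_max=3))
```

The cost grows linearly with n_max, which is why the draft model has to be much cheaper than the target for this mode to pay off.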

Trade-off: high-quality drafts if a good small sibling exists, but you pay draft-compute and VRAM for a second model. Works on any target — no special training needed.

N-gram cache

No model at all. Cache stores n-grams seen in the prompt and previously generated tokens. When the recent context tail matches a cached n-gram prefix, propose its suffix. Effectively free per proposal (a hash lookup, microseconds).
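A deliberately simplified sketch of the idea follows. It assumes a single fixed n-gram size and a last-seen-wins table; the class and method names are hypothetical, and llama.cpp's actual cache is more elaborate.

```python
from typing import Dict, List, Tuple


class NgramCache:
    """Maps each n-token window to the token that most recently followed it."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.table: Dict[Tuple[int, ...], int] = {}

    def update(self, tokens: List[int]) -> None:
        """Record every n-gram in tokens together with its following token."""
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i : i + self.n])
            self.table[key] = tokens[i + self.n]

    def propose(self, context: List[int], n_max: int) -> List[int]:
        """Extend the context tail for as long as the cache keeps matching."""
        draft: List[int] = []
        window = list(context[-self.n :])
        while len(draft) < n_max:
            nxt = self.table.get(tuple(window))
            if nxt is None:
                break  # no cached continuation: propose nothing further
            draft.append(nxt)
            window = window[1:] + [nxt]
        return draft


cache = NgramCache(n=2)
cache.update([1, 2, 3, 4, 5])          # e.g. the prompt tokens
print(cache.propose([9, 1, 2], n_max=3))
```

When the tail of the context has never been seen, propose returns an empty draft and decoding falls back to ordinary one-token steps.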

Trade-off: zero overhead, but only useful when the output literally repeats prompt vocabulary — RAG extraction, paraphrasing, structured JSON with predictable keys, code that echoes identifiers from imports. Falls apart on synthesis ("who is the narrator?" → ~32% acceptance in our 128K test). Needs no model support.

MTP — multi-token prediction

The target model is trained with extra projection layers (nextn_predict_layers) that, given the hidden state at position T, predict positions T+1, T+2, … directly. One main-model forward pass produces both the next real token and the speculative draft for the next N positions in the same call. No separate model, no separate lookup. The MTP head reuses the target's hidden state, vocabulary, and most weights — it just adds a thin prediction head.
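A toy illustration of "one hidden state, several prediction heads" is shown below. It assumes each MTP head is a single independent linear projection, which is a simplification for illustration only; real MTP modules are more involved, and every name and weight here is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(weights: Matrix, hidden: List[float]) -> List[float]:
    """Multiply a (vocab x dim) matrix by a hidden-state vector."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def mtp_forward(
    hidden: List[float], lm_head: Matrix, mtp_heads: List[Matrix]
) -> Tuple[int, List[int]]:
    """From one hidden state, return the next real token and a draft.

    lm_head predicts the next real token; mtp_heads[k] predicts the token
    k + 1 positions after it, reusing the same hidden state.
    """
    next_token = argmax(matvec(lm_head, hidden))
    draft = [argmax(matvec(head, hidden)) for head in mtp_heads]
    return next_token, draft


hidden = [0.2, 0.9]                                  # toy hidden state, dim 2
lm_head = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]        # toy vocabulary of 3
mtp_heads = [
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0], [0.0, 1.0]],
]
print(mtp_forward(hidden, lm_head, mtp_heads))
```

The point of the sketch is the data flow: the expensive part (computing the hidden state) happens once, and the draft comes from a few extra projections on top of it.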

Trade-off: highest-quality drafts (the head was trained alongside the target on the same data) and tiny additional VRAM (~500 MB compute buffer in our test), but requires the model to have been trained with MTP heads. That's why Unsloth ships a separate -MTP-GGUF repo for Qwen3.6 — the head tensors are baked into the file.

How they differ operationally

| Property | Classic draft | N-gram | MTP |
| --- | --- | --- | --- |
| Extra weights loaded | Whole draft model (1–7 GB) | None | Just the MTP head (~hundreds of MB) |
| Per-proposal compute | Full draft forward pass | Hash lookup (free) | Few extra matmuls on the main forward |
| Acceptance rate | Depends on draft quality | ~30–90% (workload-dependent) | ~75–90% when target is MTP-trained |
| Hidden cost | VRAM | None | D2H/H2D pre-norm embedding transfer per ubatch; kills prefill at long context |
| nParallel > 1 | Works | Works | Silently disabled |
| --cache-reuse | Works | Works | Silently disabled |
| Setup | Two files | None | One file, must be MTP-trained |

Why MTP feels new even though "speculative decoding" already existed

  1. The head is trained, not bolted on. Classic drafts use a separately trained model; MTP heads are trained jointly with the target on the same tokens. This drives the >70% acceptance rates we saw — far above what an arbitrary small draft would manage.
  2. It's free in setup. Until MTP, "use speculative decoding" meant either picking a draft model carefully (and managing two GGUF files) or accepting ngram-cache's workload sensitivity. With MTP-trained models, you flip one flag — no second model, near-best-case acceptance.

The cost is that the model must be MTP-aware: Qwen3.6 27B and 35B-A3B currently, with more architectures landing as upstream nextn_predict_layers plumbing matures.

MTP and ngram fail in opposite directions

One result from our 128K test illustrates this:

| | Base | MTP n=2 | ngram-cache n=2 |
| --- | --- | --- | --- |
| Prefill tok/s | 3,840 | 2,978 (−22%) | 3,886 (+1%) |
| Decode tok/s | 90 | 95 (+5%) | 75 (−17%) |
| Acceptance | n/a | 76% | 32% |

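MTP costs you prefill (the D2H/H2D round trip per ubatch). ngram-cache is free on prefill, but its acceptance collapses when the workload isn't repetitive, and decode then ends up slower than with no speculation at all. Pick by workload: MTP for decode-heavy generation on an MTP-trained target, ngram-cache when the output largely echoes the prompt.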
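To put the acceptance figures in perspective, here is a back-of-the-envelope estimate of tokens emitted per verify pass. It assumes each draft token is accepted independently with probability alpha, the usual simplification in the speculative decoding literature; it ignores all overhead, and the acceptance metric reported above may be defined differently.

```python
def expected_tokens_per_step(alpha: float, n: int) -> float:
    """Expected tokens emitted per verify pass with n draft tokens.

    Assumes independent per-token acceptance probability alpha:
    1 + alpha + alpha^2 + ... + alpha^n.
    """
    if alpha >= 1.0:
        return float(n + 1)
    return (1.0 - alpha ** (n + 1)) / (1.0 - alpha)


for name, alpha in [("MTP", 0.76), ("ngram-cache", 0.32)]:
    print(f"{name}: {expected_tokens_per_step(alpha, 2):.2f} tokens per verify pass")
```

Under these assumptions the idealized figures sit well above the measured decode changes (+5% and −17%), which suggests that per-step overhead, not acceptance alone, decides whether a mode pays off.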

One sentence

MTP is "speculative decoding using a draft head that ships inside the target model's GGUF and was trained alongside it," whereas classic draft-model speculative needs a separate small sibling and ngram-cache uses no model at all but only works on repetitive output. The verify step is identical across all three.