thc1006/qwen3.6-speculative-decoding-rtx3090
First public benchmark of llama.cpp speculative decoding on Qwen3.6-35B-A3B with a single RTX 3090 (after the merge of PR #19493, 2026-04-19). Covers 19 configurations across ngram-cache, ngram-mod, and classic draft-model decoding with a vocab-matched Qwen3.5-0.8B. Finding: no variant achieves a net speedup on Ampere with the A3B MoE model. Raw JSON and plots are included for full reproducibility.
GitHub repository with 28 stars and 1 fork.
Language: Python
Topics: ampere, benchmark, cuda, ggml, inference-benchmark, llama-cpp, local-llm, mixture-of-experts, moe, qwen