Indras-Mirror/llama.cpp-turboq-mtp
Fused TBQ4 Flash Attention + MTP (multi-token prediction) + Shared Tensors for llama.cpp: 82+ tok/s with a lossless 4.25 bpv (bits per value) KV cache at 200K context on an RTX 4090
GitHub repository with 78 stars and 5 forks.
Language: C++
Topics: cuda, flash-attention, fwht, kv-cache, llama-cpp, mtp, multi-token-prediction, quantization, qwen, rtx-4090