qflen/nsa-from-scratch
From-scratch reimplementation of DeepSeek's Native Sparse Attention (arXiv:2502.11089) in Triton and CUDA (Hopper WGMMA). 7.4x faster than FlashAttention-3 at 64k context. Includes a five-model training fleet, a perplexity sweep, LongBench v2 results, and a MoBA comparison.
GitHub repository with 6 stars and 0 forks.
Language: Python
Topics: attention-mechanism, cuda, deepseek, flash-attention, gpu-kernels, hopper, llm, long-context, native-sparse-attention, nsa