lavawolfiee/mini-flash-attention
Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch
GitHub repository with 21 stars and 1 fork.
Language: CUDA
Topics: attention, cuda, cute, cutlass, flash-attention, flashattention, gpu-kernels, llm, pytorch-extension, tensor-cores