kekzl/imp

High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell (RTX 5090/5080/5070 Ti, RTX PRO 6000; sm_120). Native NVFP4/GGUF, 270 tok/s decode on Qwen3-Coder-30B MoE. Written entirely by Claude Code.

GitHub repository with 18 stars and 2 forks.

Language: Cuda

Topics: blackwell, cpp, cuda, cuda-graphs, gated-deltanet, gguf, inference, inference-engine, llm, mixture-of-experts


Latest metric snapshot

2026-06-05: 18 stars and 2 forks.

Trending in Cuda

  1. alibaba/rtp-llm

    RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

    GitHub repository with 1,179 stars and 204 forks.

    Trending score: 1.09; stars gained: +9; forks gained: +0.

    Language: Cuda

    Topics: gpt, inference, llama, llm, llm-serving, llmops

  2. lavawolfiee/mini-flash-attention

    Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

    GitHub repository with 21 stars and 1 fork.

    Trending score: 1.02; stars gained: +9; forks gained: +1.

    Language: Cuda

    Topics: attention, cuda, cute, cutlass, flash-attention, flashattention

  3. NVIDIA/CUDALibrarySamples

    CUDA Library Samples

    GitHub repository with 2,424 stars and 459 forks.

    Trending score: 0.79; stars gained: +5; forks gained: +1.

    Language: Cuda

    Topics: cufft, curand, cusolver, cusparse, nvjpeg, cudss

  4. brucefan1983/GPUMD

    Graphics Processing Units Molecular Dynamics

    GitHub repository with 782 stars and 186 forks.

    Trending score: 0.69; stars gained: +4; forks gained: +2.

    Language: Cuda

    Topics: molecular-dynamics-simulation, heat-transport, cuda, molecular-dynamics, gpumd, phonon

  5. NVIDIA/nvbench

    CUDA Kernel Benchmarking Library

    GitHub repository with 870 stars and 109 forks.

    Trending score: 0.50; stars gained: +1; forks gained: +0.

    Language: Cuda

    Topics: benchmark, kernel-benchmark, cuda-kernels, cuda, performance, nvidia

  6. rapidsai/cugraph

    cuGraph - RAPIDS Graph Analytics Library

    GitHub repository with 2,189 stars and 357 forks.

    Trending score: 0.49; stars gained: +2; forks gained: +0.

    Language: Cuda

    Topics: rapids, nvidia, gpu, cuda, graph, graph-algorithms

Trending topic: blackwell

  1. vllm-project/vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    GitHub repository with 81,978 stars and 17,668 forks.

    Trending score: 3.75; stars gained: +79; forks gained: +46.

    Language: Python

    Topics: amd, blackwell, cuda, deepseek, deepseek-v3, gpt

  2. openlake-project/openlake

    High performance object store for fast LLM Inference and GPU Training. Feed your GPUs at blazing fast speeds.

    GitHub repository with 913 stars and 55 forks.

    Trending score: 3.11; stars gained: +60; forks gained: +0.

    Language: Rust

    Topics: blackwell, gpt, gpu, high-performance, llm, llm-training

  3. lightseekorg/tokenspeed

    TokenSpeed is a speed-of-light LLM inference engine.

    GitHub repository with 1,366 stars and 141 forks.

    Trending score: 1.86; stars gained: +6; forks gained: +2.

    Language: Python

    Topics: blackwell, deepseek, gpt-oss, kimi, lightseek, llm

  4. sgl-project/sglang

    SGLang is a high-performance serving framework for large language models and multimodal models.

    GitHub repository with 28,888 stars and 6,343 forks.

    Trending score: 1.72; stars gained: -55 (net loss); forks gained: +18.

    Language: Python

    Topics: attention, blackwell, cuda, deepseek, diffusion, glm

  5. NVIDIA/TensorRT-LLM

    TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT LLM also contains components to create Python and C++ runtimes that orchestrate the inference execution in a performant way.

    GitHub repository with 13,807 stars and 2,440 forks.

    Trending score: 1.18; stars gained: +16; forks gained: +7.

    Language: Python

    Topics: blackwell, cuda, llm-serving, moe, pytorch

  6. NVIDIA/cudnn-frontend

    cuDNN Frontend is NVIDIA's modern, open-source entry point to the cuDNN library and a growing collection of high-performance open-source kernels.

    GitHub repository with 840 stars and 177 forks.

    Trending score: 0.69; stars gained: +4; forks gained: +2.

    Language: Python

    Topics: attention, blackwell, cuda, cuda-kernels, cuda-toolkit, deep-learning