FZJ-JSC/tutorial-multi-gpu

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

GitHub repository with 361 stars and 76 forks.

Language: Cuda

Topics: cuda, exascale-computing, gpu, hpc, isc22, isc23, isc24, isc25, mpi, multi-gpu


Latest metric snapshot

2026-06-05: 361 stars and 76 forks.

Similar repositories

  1. lavawolfiee/mini-flash-attention

    Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

    GitHub repository with 21 stars and 1 fork.

    Trending score: 1.02; stars gained: +9; forks gained: +1.

    Language: Cuda

    Topics: attention, cuda, cute, cutlass, flash-attention, flashattention

  2. NVIDIA/CUDALibrarySamples

    CUDA Library Samples

    GitHub repository with 2,424 stars and 459 forks.

    Trending score: 0.79; stars gained: +5; forks gained: +1.

    Language: Cuda

    Topics: cufft, curand, cusolver, cusparse, nvjpeg, cudss

  3. brucefan1983/GPUMD

    Graphics Processing Units Molecular Dynamics

    GitHub repository with 782 stars and 186 forks.

    Trending score: 0.69; stars gained: +4; forks gained: +2.

    Language: Cuda

    Topics: cuda, gpu, gpumd, heat-transport, high-performance-computing, machine-learning

  4. NVIDIA/nvbench

    CUDA Kernel Benchmarking Library

    GitHub repository with 868 stars and 109 forks.

    Trending score: 0.50; stars gained: +1; forks gained: +0.

    Language: Cuda

    Topics: benchmark, kernel-benchmark, cuda-kernels, cuda, performance, nvidia

  5. rapidsai/cugraph

    cuGraph - RAPIDS Graph Analytics Library

    GitHub repository with 2,189 stars and 357 forks.

    Trending score: 0.49; stars gained: +2; forks gained: +0.

    Language: Cuda

    Topics: rapids, nvidia, gpu, cuda, graph, graph-algorithms

  6. supranational/sppark

    Zero-knowledge template library

    GitHub repository with 219 stars and 97 forks.

    Trending score: 0.18; stars gained: +0; forks gained: +1.

    Language: Cuda

    Topics: cuda, bls12-377, bls12-381, pasta-curves, zero-knowledge, zero-knowledge-proofs

Trending in Cuda

  1. alibaba/rtp-llm

    RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

    GitHub repository with 1,179 stars and 204 forks.

    Trending score: 1.09; stars gained: +9; forks gained: +0.

    Language: Cuda

    Topics: gpt, inference, llama, llm, llm-serving, llmops

  2. lavawolfiee/mini-flash-attention

    Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

    GitHub repository with 21 stars and 1 fork.

    Trending score: 1.02; stars gained: +9; forks gained: +1.

    Language: Cuda

    Topics: attention, cuda, cute, cutlass, flash-attention, flashattention

  3. NVIDIA/CUDALibrarySamples

    CUDA Library Samples

    GitHub repository with 2,424 stars and 459 forks.

    Trending score: 0.79; stars gained: +5; forks gained: +1.

    Language: Cuda

    Topics: cufft, curand, cusolver, cusparse, nvjpeg, cudss

  4. uccl-project/mKernel

    mKernel: fast multi-node, multi-GPU fused kernels

    GitHub repository with 216 stars and 20 forks.

    Trending score: 0.76; stars gained: +5; forks gained: +1.

    Language: Cuda

  5. brucefan1983/GPUMD

    Graphics Processing Units Molecular Dynamics

    GitHub repository with 782 stars and 186 forks.

    Trending score: 0.69; stars gained: +4; forks gained: +2.

    Language: Cuda

    Topics: cuda, gpu, gpumd, heat-transport, high-performance-computing, machine-learning

  6. mirage-project/mirage

    Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

    GitHub repository with 2,290 stars and 214 forks.

    Trending score: 0.60; stars gained: +3; forks gained: -1.

    Language: Cuda

Trending topic: cuda

  1. vllm-project/vllm

    A high-throughput and memory-efficient inference and serving engine for LLMs

    GitHub repository with 81,949 stars and 17,658 forks.

    Trending score: 3.75; stars gained: +79; forks gained: +46.

    Language: Python

    Topics: amd, blackwell, cuda, deepseek, deepseek-v3, gpt

  2. gpustack/gpustack

    A GPU cluster manager that configures and orchestrates inference engines like vLLM and SGLang for high-performance AI model deployment.

    GitHub repository with 5,102 stars and 541 forks.

    Trending score: 2.51; stars gained: +11; forks gained: +1.

    Language: Python

    Topics: ascend, cuda, deepseek, distributed-inference, genai, high-performance-inference

  3. Luce-Org/lucebox-hub

    Fast LLM speculative inference server for consumer hardware.

    GitHub repository with 2,330 stars and 217 forks.

    Trending score: 2.31; stars gained: +17; forks gained: +3.

    Language: C++

    Topics: kernel, llama-cpp, local-ai, nvidia-cuda, qwen, rtx3090

  4. LMCache/LMCache

    LMCache: Supercharge Your LLM with the Fastest KV Cache Layer

    GitHub repository with 8,417 stars and 1,246 forks.

    Trending score: 2.17; stars gained: +11; forks gained: +6.

    Language: Python

    Topics: amd, cuda, fast, inference, kv-cache, llm

  5. shader-slang/slang

    Making it easier to work with shaders

    GitHub repository with 5,348 stars and 451 forks.

    Trending score: 2.08; stars gained: +4; forks gained: +2.

    Language: C++

    Topics: shaders, hlsl, glsl, d3d12, vulkan, cuda

  6. tenstorrent/tt-metal

    TT-NN operator library, and TT-Metalium low level kernel programming model.

    GitHub repository with 1,494 stars and 480 forks.

    Trending score: 1.82; stars gained: +7; forks gained: +5.

    Language: C++

    Topics: accelerator, ai, cuda, deepseek, gpu, img-gen