Mattral/Composed-Mixture-of-Experts-Engine
moe-engine is a research-grade infrastructure layer for training large Mixture-of-Experts language models at hyperscale. It is designed around one core constraint: at 10K+ GPUs, node failures are continuous rather than exceptional. The system must keep training alive end-to-end: routing correctly, checkpointing durably, and resuming without operator intervention.
GitHub repository with 10 stars and 8 forks.
Language: Python
Topics: distributed-training, fault-tolerance, llm-training, machine-learning, mixture-of-experts, moe, production-infrastructure, pytorch, sparse-training, triton
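
The "checkpointing durably, and resuming without operator intervention" requirement in the description can be illustrated with a minimal sketch. This is not taken from the moe-engine codebase. The function names (`save_checkpoint`, `load_latest_checkpoint`, `train`), the `step_*.json` file layout, and the use of JSON instead of real model and optimizer state are all assumptions made for illustration. The sketch shows one common pattern: write to a temporary file, fsync, atomically rename, and on startup resume from the newest checkpoint that parses cleanly.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Hypothetical helper: write `state` atomically (temp file, fsync, rename)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"step": step, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before renaming
        # os.replace is atomic on the same filesystem, so a crash leaves
        # either the old set of checkpoints or a complete new file.
        os.replace(tmp_name, final_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path


def load_latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Hypothetical helper: return the newest readable checkpoint, or None."""
    if not ckpt_dir.is_dir():
        return None
    # Zero-padded step numbers make lexicographic order match numeric order.
    for path in sorted(ckpt_dir.glob("step_*.json"), reverse=True):
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # skip a corrupt file and fall back to an older one
    return None


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> dict:
    """Toy loop that resumes from the latest checkpoint without any flags."""
    ckpt = load_latest_checkpoint(ckpt_dir)
    step = ckpt["step"] if ckpt else 0
    state = ckpt["state"] if ckpt else {"loss_sum": 0.0}
    while step < total_steps:
        step += 1
        state["loss_sum"] += 1.0 / step  # stand-in for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return {"step": step, "state": state}
```

Calling `train` again on the same directory after a crash continues from the last saved step, which is the "no operator intervention" property in miniature. A production system would likely also fsync the containing directory, shard state across ranks, and store tensors rather than JSON; those parts are omitted here.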