BaizeAI/kcover
🧯 Kubernetes coverage for fault awareness and recovery. Works with any LLMOps, MLOps, or AI workload.
GitHub repository with 35 stars and 3 forks.
Language: Go
Topics: kubeflow, kubernetes, kubernetes-controller, llm, llmops, mlops, nvidia-gpu, pytorchjob, tfjob, xid-error