huggingface/lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

GitHub repository with 2,451 stars and 489 forks.

Language: Python

Topics: evaluation, evaluation-framework, evaluation-metrics, huggingface


24h trend summary

Trending score 0.78, freshness score 0.18, stars gained +4, forks gained +3.

Latest metric snapshot

2026-06-15: 2,451 stars and 489 forks.

Similar repositories

  1. comet-ml/opik

    Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

    GitHub repository with 19,653 stars and 1,522 forks.

    Trending score: 3.44; stars gained: +58; forks gained: +4.

    Language: Python

    Topics: evaluation, hacktoberfest, hacktoberfest2025, langchain, llama-index, llm

  2. mlflow/mlflow

    The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

    GitHub repository with 26,532 stars and 5,845 forks.

    Trending score: 2.94; stars gained: +20; forks gained: +9.

    Language: Python

    Topics: agentops, agents, ai, ai-governance, apache-spark, evaluation

  3. open-compass/VLMEvalKit

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks

    GitHub repository with 4,220 stars and 722 forks.

    Trending score: 2.06; stars gained: +6; forks gained: +3.

    Language: Python

    Topics: chatgpt, claude, clip, computer-vision, evaluation, gemini

  4. NVIDIA-NeMo/Gym

    Evaluate and improve models and agents using environments

    GitHub repository with 982 stars and 179 forks.

    Trending score: 1.69; stars gained: +4; forks gained: +1.

    Language: Python

    Topics: reinforcement-learning, reinforcement-learning-environments, rl-environment, rl-training, gym, agents

  5. lihouwenbin/ai-redteam-recursive-self-improvement

    Domain-neutral AI red-team framework for recursive self-improvement governance

    GitHub repository with 44 stars and 2 forks.

    Trending score: 1.39; stars gained: +2; forks gained: +1.

    Language: Python

    Topics: agentic-ai, ai-safety, evaluation, governance, python, recursive-self-improvement

  6. TIGER-AI-Lab/ClawBench

    Open-source benchmark for browser AI agents on daily tasks.

    GitHub repository with 393 stars and 22 forks.

    Trending score: 1.37; stars gained: +2; forks gained: +0.

    Language: Python

    Topics: ai-agents, benchmark, browser-automation, browser-use, dataset, evaluation

Trending in Python

  1. chopratejas/headroom

    Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server.

    GitHub repository with 27,902 stars and 1,891 forks.

    Trending score: 6.49; stars gained: +2,776; forks gained: +250.

    Language: Python

    Topics: agent, ai, anthropic, claude-code, compression, context-engineering

  2. harry0703/MoneyPrinterTurbo

    Generate high-definition short videos with one click using large AI models.

    GitHub repository with 88,031 stars and 12,625 forks.

    Trending score: 6.02; stars gained: +1,097; forks gained: +218.

    Language: Python

    Topics: ai, automation, chatgpt, moviepy, python, shortvideo

  3. pewdiepie-archdaemon/odysseus

    Self-hosted AI workspace.

    GitHub repository with 71,392 stars and 9,098 forks.

    Trending score: 5.98; stars gained: +834; forks gained: +140.

    Language: Python

  4. NousResearch/hermes-agent

    The agent that grows with you

    GitHub repository with 194,052 stars and 33,977 forks.

    Trending score: 5.92; stars gained: +753; forks gained: +209.

    Language: Python

    Topics: ai, ai-agent, ai-agents, anthropic, chatgpt, claude

  5. NVIDIA/SkillSpector

    Security scanner for AI agent skills. Detect vulnerabilities, malicious patterns, and security risks.

    GitHub repository with 5,654 stars and 427 forks.

    Trending score: 5.61; stars gained: +874; forks gained: +76.

    Language: Python

  6. rohitg00/ai-engineering-from-scratch

    Learn it. Build it. Ship it for others.

    GitHub repository with 32,676 stars and 5,366 forks.

    Trending score: 5.59; stars gained: +762; forks gained: +135.

    Language: Python

    Topics: agents, ai, ai-agents, ai-engineering, computer-vision, course

Trending topic: evaluation

  1. langfuse/langfuse

    🪢 Open source AI engineering platform: LLM evals, observability, metrics, prompt management, playground, datasets. Integrates with OpenTelemetry, LangChain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

    GitHub repository with 29,108 stars and 3,015 forks.

    Trending score: 3.83; stars gained: +75; forks gained: +8.

    Language: TypeScript

    Topics: analytics, autogen, evaluation, langchain, large-language-models, llama-index

  2. comet-ml/opik

    Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

    GitHub repository with 19,653 stars and 1,522 forks.

    Trending score: 3.44; stars gained: +58; forks gained: +4.

    Language: Python

    Topics: evaluation, hacktoberfest, hacktoberfest2025, langchain, llama-index, llm

  3. promptfoo/promptfoo

    Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

    GitHub repository with 22,220 stars and 1,979 forks.

    Trending score: 3.36; stars gained: +39; forks gained: +11.

    Language: TypeScript

    Topics: ci, ci-cd, cicd, evaluation, evaluation-framework, llm

  4. Tencent/WeKnora

    Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

    GitHub repository with 16,292 stars and 2,106 forks.

    Trending score: 3.22; stars gained: +32; forks gained: +8.

    Language: Go

    Topics: agent, agentic, ai, chatbot, embeddings, evaluation

  5. mlflow/mlflow

    The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

    GitHub repository with 26,532 stars and 5,845 forks.

    Trending score: 2.94; stars gained: +20; forks gained: +9.

    Language: Python

    Topics: agentops, agents, ai, ai-governance, apache-spark, evaluation

  6. trpc-group/trpc-agent-go

    A Go framework for building production agent systems with graph workflows, tools, memory, A2A, AG-UI, MCP, evaluation, and observability.

    GitHub repository with 1,352 stars and 165 forks.

    Trending score: 2.59; stars gained: +13; forks gained: +2.

    Language: Go

    Topics: a2a, a2a-protocol, ag-ui, agent, agent-framework, ai